Publications
2024
- Foundation Models for Music: A Survey. Yinghao Ma, Anders Øland, Anton Ragni, and 40 more authors. arXiv, 2024
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations remain underexplored in FM development. Emphasis is then placed on the lack of versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation, and medical applications. By comprehensively examining model pre-training paradigms, architectural choices, tokenisation, fine-tuning methodologies, and controllability, we highlight important topics that warrant deeper exploration, such as instruction tuning and in-context learning, scaling laws and emergent abilities, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of the datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends for FMs in music, aiming to shape the trajectory of human-AI collaboration in the music realm.
- Timbral brightness perception investigated through multimodal interference. Charalampos Saitis and Zachary Wallmark. Attention, Perception, & Psychophysics, 2024
Brightness is among the most studied aspects of timbre perception. Psychoacoustically, sounds described as “bright” vs “dark” typically exhibit a high vs low frequency emphasis in the spectrum. However, relatively little is known about the neurocognitive mechanisms that facilitate these metaphors we listen with. Do they originate in universal magnitude representations common to more than one sensory modality? Triangulating three different interaction paradigms, we investigated using speeded classification whether intramodal, crossmodal, and amodal interference occurs when timbral brightness, as modeled by the centroid of the spectral envelope, and pitch height / visual brightness / numerical value processing are semantically congruent and incongruent. In four online experiments varying in priming strategy, onset timing, and response deadline, 189 total participants were presented a baseline stimulus (a pitch, grey square, or numeral) then asked to quickly identify a target stimulus that is higher/lower, brighter/darker, or greater/less than the baseline after being primed with a bright or dark synthetic harmonic tone. Results suggest that timbral brightness modulates the perception of pitch and possibly visual brightness, but not numerical value. Semantically incongruent pitch height-timbral brightness shifts produced significantly slower reaction time (RT) and higher error compared to congruent pairs. In the visual task, incongruent pairings of grey squares and tones elicited slower RTs than congruent pairings (in two experiments). No interference was observed in the number comparison task. These findings shed light on the embodied and multimodal nature of experiencing timbre.
- Relating timbre perception to musical craft practice: an empirical ethnographic approach. Charalampos Saitis, Bleiz Macsen Del Sette, Jordie Shier, and 1 more author. Triennial Conference of the European Society for the Cognitive Sciences of Music, 2024
In crafting musical expression, the digital instrument maker is required to manipulate digital, and increasingly AI, technology as an additional medium. This raises interesting but unexplored questions about the role and practice of timbre in the development and adoption of sound technologies and their surrounding sonic cultures and, conversely, their imprint on the perceptual experience of timbre. Previous empirical research studied how the latter relates to the creative practice of sound synthesis. Here we adopt an ethnographic approach to explore the relationship between timbre and broader creative and technological practices of digital lutherie. We aim to better understand how makers think about and engage with timbre, what current practices and technologies of instrument design enable timbre exploration during the creative craft process, and how this knowledge can expand and diversify our understanding of how timbre is perceived, represented, and generated. Reflexive thematic analysis is applied to structured interviews with 20 (minimum target) instrument makers from commercial, research, independent, and artistic backgrounds. Here both ‘instrument’ and ‘maker’ are broadly construed, including composers and performers who build bespoke instruments as well as live coders. Interviews were conducted remotely and lasted around 50 minutes. Preliminary findings suggest that the entanglement of timbre and musical craft practice takes several forms, including interactions with aesthetic values and acoustics of material, which can be described as occupying places across a space encompassing many different notions (subspaces) of timbre entangled with a wide range of epistemic instruments and sonic practices. Rather than being a limited scientific (and musical) idea rooted in the psychoacoustic “timbre space” model, timbre emerges in a dynamic relay between technology and creation. Our study thus presents an empirical ethnographic understanding of timbre from the maker’s perspective, informing future development of tools to assist timbre exploration in musical craft practice.
- Timbre Tools: Ethnographic perspectives on timbre and sonic cultures in hackathon designs. Charalampos Saitis, Bleiz Macsen Del Sette, Jordie Shier, and 5 more authors. International Audio Mostly Conference, 2024
Timbre is a nuanced yet abstractly defined concept. Its inherently subjective qualities make it challenging to design and work with. In this paper, we propose to explore the conceptualisation and negotiation of timbre within the creative practice of timbre tool makers. To this end, we hosted a hackathon event and performed an ethnographic study to explore how participants engaged with the notion of timbre and how their conception of timbre was shaped through social interactions and technological encounters. We present individual descriptions of the design process of each team and reflect across our data to identify commonalities in the ways that timbre is understood and informed by sound technologies and their surrounding sonic cultures, e.g., by relating concepts of timbre to metaphors. We further current understanding by offering novel interdisciplinary and multimodal insights into understandings of timbre.
- Automatic detection of moral values in music lyrics. Vjosa Preniqi, Iacopo Ghinassi, Julia Ive, and 2 more authors. International Society for Music Information Retrieval Conference, 2024
Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues. The possibility to extract morality rapidly from lyrics enables a deeper understanding of our music-listening behaviours. Building on the Moral Foundations Theory (MFT), we tasked a set of transformer-based language models (BERT) fine-tuned on 2,721 synthetic lyrics generated by a large language model (GPT-4) to detect moral values in 200 real music lyrics annotated by two experts. We evaluate their predictive capabilities against a series of baselines including out-of-domain (BERT fine-tuned on MFT-annotated social media texts) and zero-shot (GPT-4) classification. The proposed models yielded the best accuracy across experiments, with an average F1 weighted score of 0.8. This performance is, on average, 5% higher than out-of-domain and zero-shot models. When examining precision in binary classification, the proposed models perform on average 12% higher than the baselines. Our approach contributes to annotation-free and effective lyrics morality learning, and provides useful insights into the knowledge distillation of LLMs regarding moral expression in music, and the potential impact of these technologies on the creative industries and musical culture.
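As a rough illustration of the kind of pipeline described above, the sketch below fine-tunes a BERT classifier for multi-label moral-foundation detection with Hugging Face Transformers; the model name, label set, and toy lyrics are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: fine-tuning a BERT classifier for multi-label moral value
# detection in lyrics (Hugging Face Transformers + PyTorch). Label set and
# data are illustrative placeholders, not the paper's exact setup.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MORAL_FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "purity"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(MORAL_FOUNDATIONS),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

# Toy training examples: (lyric snippet, multi-hot moral labels).
train_data = [
    ("we stand together and protect our own", [1.0, 0.0, 1.0, 0.0, 0.0]),
    ("break the rules, answer to no one", [0.0, 0.0, 0.0, 1.0, 0.0]),
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=256, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)        # float labels for BCE
    return enc

loader = DataLoader(train_data, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        out = model(**batch)                    # loss computed internally
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```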
- Composer style-specific symbolic music generation using vector quantized discrete diffusion models. Jincheng Zhang, György Fazekas, and Charalampos Saitis. IEEE International Workshop on Machine Learning for Signal Processing, 2024
Emerging Denoising Diffusion Probabilistic Models (DDPM) have become increasingly utilised because of the promising results they have achieved in diverse generative tasks with continuous data, such as image and sound synthesis. Nonetheless, the success of diffusion models has not been fully extended to discrete symbolic music. We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models for the generation of symbolic music with desired composer styles. The trained VQ-VAE can represent symbolic music as a sequence of indexes that correspond to specific entries in a learned codebook. Subsequently, a discrete diffusion model is used to model the VQ-VAE’s discrete latent space. The diffusion model is trained to generate intermediate music sequences consisting of codebook indexes, which are then decoded to symbolic music using the VQ-VAE’s decoder. The evaluation results demonstrate that our model can generate symbolic music with target composer styles that meet the given conditions with a high accuracy of 72.36%. Our code is available at [URL will be provided here].
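The sketch below illustrates the vector-quantisation step at the core of a VQ-VAE as described above: each encoder output is snapped to its nearest codebook entry, and the resulting index sequence is what a discrete diffusion model would be trained on. Codebook size and latent dimensions are assumed for illustration.

```python
# Minimal sketch of VQ-VAE vector quantisation: map encoder outputs to their
# nearest codebook entries and return the index sequence that a discrete
# diffusion model would operate on. Sizes are illustrative.
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """z_e: (batch, seq_len, dim) encoder outputs; codebook: (K, dim)."""
    # Euclidean distance between every latent vector and every codebook entry.
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.shape[0], -1, -1))
    indices = dists.argmin(dim=-1)               # (batch, seq_len) discrete codes
    z_q = codebook[indices]                      # (batch, seq_len, dim) quantised latents
    # Straight-through estimator so gradients flow back to the encoder.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

codebook = torch.randn(512, 64)                  # K = 512 entries, 64-dim
z_e = torch.randn(2, 128, 64, requires_grad=True)   # two sequences of 128 latents
z_q, codes = quantize(z_e, codebook)
print(codes.shape)                               # torch.Size([2, 128])
```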
- Von A bis U: Die Vokalität von Instrumentalklangfarben (From A to U: The vocality of instrumental timbres). Christoph Reuter, Charalampos Saitis, Isabella Czedik-Eysenberg, and 1 more author. 40. Jahrestagung der Deutschen Gesellschaft für Musikpsychologie, 2024
- A multimodal understanding of the role of sound and music in gendered toy marketing. Luca Marinelli, Petra Lucht, and Charalampos Saitis. PsyArXiv, 2024
Literature in music theory and psychology shows that, even in isolation, musical sounds can reliably encode gender-loaded messages. In fact, musical material can be imbued with many ideological dimensions and gender is just one of them. Nonetheless, studies of the gendering of music within multimodal communicative events are sparse and lack an encompassing theoretical framework. The present study attempts to address this literature gap by means of a critical quantitative analysis of music in gendered toy marketing, which integrated a content analytical approach with multimodal affective and music-focused perceptual responses. Ratings were collected on a set of 606 commercials spanning a ten-year time frame, and strong gender polarisation was observed in nearly all of the collected variables. Gendered music styles in toy commercials were found to exhibit synergistic design choices, as music in masculine-targeted adverts was substantially more abrasive (louder, more inharmonious, and more distorted) than that in feminine-targeted ones. Toy advertising music thus appeared to be deliberately and consistently in line with traditional gender norms. In addition, music perceptual scales and voice-related content analytical variables were found to explain the heavily polarised affective ratings quite well. This study presents an empirical understanding of the gendering of music as constructed within multimodal discourse, reiterating the importance of the sociocultural underpinnings of music cognition. We provide a public repository with all code and data necessary to reproduce the results of this study at github.com/marinelliluca/music-role-gender-marketing.
- Real-time timbre remapping with differentiable DSP. Jordie Shier, Charalampos Saitis, Andrew Robertson, and 1 more author. International Conference on New Interfaces for Musical Expression, 2024
Timbre is a primary mode of expression in diverse musical contexts. However, prevalent audio-driven synthesis methods predominantly rely on pitch and loudness envelopes, effectively flattening timbral expression from the input. Our approach draws on the concept of timbre analogies and investigates how timbral expression from an input signal can be mapped onto controls for a synthesizer. Leveraging differentiable digital signal processing, our method facilitates direct optimization of synthesizer parameters through a novel feature difference loss. This loss function, designed to learn relative timbral differences between musical events, prioritizes the subtleties of graded timbre modulations within phrases, allowing for meaningful translations in a timbre space. Using snare drum performances as a case study, where timbral expression is central, we demonstrate real-time timbre remapping from acoustic snare drums to a differentiable synthesizer modeled after the Roland TR-808.
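A toy sketch of a feature-difference loss in the spirit of the abstract: instead of matching absolute feature values, the synthesiser output is optimised so that event-to-event changes in a timbral feature (here, spectral centroid) track those of the input performance. This is a simplified stand-in, not the authors' implementation.

```python
# Toy sketch of a feature-difference loss: match *relative* changes in a
# timbral feature (spectral centroid) between successive events, rather than
# absolute values. Simplified stand-in for illustration only.
import torch

def spectral_centroid(frames: torch.Tensor, sample_rate: float = 44100.0):
    """frames: (n_events, n_samples) -> spectral centroid in Hz per event."""
    spectrum = torch.abs(torch.fft.rfft(frames, dim=-1))
    freqs = torch.fft.rfftfreq(frames.shape[-1], d=1.0 / sample_rate)
    return (spectrum * freqs).sum(dim=-1) / (spectrum.sum(dim=-1) + 1e-8)

def feature_difference_loss(input_events, synth_events):
    """Compare event-to-event centroid differences of input and synth output."""
    c_in = spectral_centroid(input_events)
    c_out = spectral_centroid(synth_events)
    return torch.nn.functional.l1_loss(torch.diff(c_out), torch.diff(c_in))

# Example: three drum hits from the player vs. three synthesiser renderings.
input_events = torch.randn(3, 2048)
synth_events = torch.randn(3, 2048, requires_grad=True)
loss = feature_difference_loss(input_events, synth_events)
loss.backward()   # gradients flow to the (differentiable) synthesiser output
```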
- Building sketch-to-sound mapping with unsupervised feature extraction and interactive machine learning. Shuoyang Zheng, Bleiz M. Del Sette, Charalampos Saitis, and 2 more authors. International Conference on New Interfaces for Musical Expression, 2024
In this paper, we explore the interactive construction and exploration of mappings between visual sketches and musical controls. Interactive Machine Learning (IML) allows creators to construct mappings with personalised training examples. However, when it comes to high-dimensional data such as sketches, dimensionality reduction techniques are required to extract features for the IML model. We propose using unsupervised machine learning to encode sketches into lower-dimensional latent representations, which are then used as the source for the IML model to construct sketch-to-sound mappings. We build a proof-of-concept prototype and demonstrate it using two compositions. We reflect on the composing processes to discuss the controllability and explorability in mappings built by this approach and how they contribute to the musical expression.
- Deep learning-based audio representations for the analysis and visualisation of electronic dance music DJ mixes. Alexander Williams, Haokun Tian, Stefan Lattner, and 2 more authors. AES International Symposium on AI and the Musician, 2024
Electronic dance music (EDM), produced using computers and electronic instruments, is a collection of musical subgenres that emphasise timbre and rhythm over melody and harmony. It is usually presented through the medium of DJing, where tracks are curated and mixed sequentially to offer unique listening and dancing experiences. However, while key and tempo are available as annotations, DJs still rely on audition rather than metadata to examine and select tracks with complementary audio content. In this work, we investigate the use of deep learning-based representations (Complex Autoencoder and OpenL3) for analysing and visualising audio content on a corpus of DJ mixes with approximate transition timestamps and compare them with signal processing-based representations (joint time-frequency scattering transform and mel-frequency cepstral coefficients). Representations are computed once per second and visualised with UMAP dimensionality reduction. We propose heuristics, based on patterns observed in the visualisations and on time-sensitive Euclidean distances in the representation space, to compute DJ transition lengths, transition smoothness, and inter-song, song-to-song, and full-mix audio content consistency from the audio representations and rough DJ transition timestamps. Our method enables the visualisation of variations within music tracks, facilitating the analysis of DJ mixes and individual EDM tracks. This approach supports musicians in making informed creative decisions based on such visualisations. We share our code, dataset annotations, computed audio representations, and trained CAE model. We encourage researchers and music enthusiasts alike to analyse their own music using our tools: github.com/alexjameswilliams/EDMAudioRepresentations.
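A minimal sketch of the representation-and-projection step described above, assuming the openl3, umap-learn, soundfile, and matplotlib packages; the file path and embedding parameters are placeholders.

```python
# Minimal sketch: one-embedding-per-second OpenL3 features projected to 2-D
# with UMAP for visualising a DJ mix. File path and parameters are placeholders.
import openl3
import soundfile as sf
import umap
import matplotlib.pyplot as plt

audio, sr = sf.read("dj_mix.wav")

# One 512-dimensional music embedding per second of audio.
embeddings, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="music", embedding_size=512, hop_size=1.0
)

# Non-linear projection to two dimensions for visualisation.
projection = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(projection[:, 0], projection[:, 1], c=timestamps, s=5, cmap="viridis")
plt.colorbar(label="time (s)")
plt.title("DJ mix trajectory in OpenL3/UMAP space")
plt.show()
```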
- Timbral effects of col legno tratto techniques on bowed cello sounds. Montserrat Pàmies-Vilà and Charalampos Saitis. 186th Meeting of the Acoustical Society of America/Acoustics Week, 2024
There are several playing techniques for bowed-string instruments that make use of the wooden stick of the bow. The stick is quite often used to strike the strings gently (col legno battuto) and less commonly to bow on them (col legno tratto). Col legno has existed since the 17th century, and it is often used in modern compositions. When the stick is drawn across the string (tratto), the contact between the scrubbing stick and the string introduces noise. The player may choose to combine both hair and stick, depending on the desired sound. To evaluate the timbral effects of col legno tratto on the cello sound, the current study compares three variations across ordinary and contemporary bowing techniques: using only the hair, using both hair and stick, and using only the stick. Motion capture and audio-video recordings with expert cello players show how the bow tilt varies greatly between the three cases. Suitable audio descriptors of timbre are evaluated, which may help to correlate the observed playing parameters and sound properties with the semantic attributes used by experts to describe the timbre of these techniques.
- Giving instruments a voice: Are there vowel-like qualities in the timbres of musical instruments? Christoph Reuter, Charalampos Saitis, Isabella Czedik-Eysenberg, and 1 more author. 50. Deutsche Jahrestagung für Akustik, 2024
Scholars have long explored similarities between musical instrument sounds and vowel qualities of human voice sounds. From a psychoacoustic standpoint, however, this relationship remains poorly understood. Here, we seek to address whether musical instruments truly exhibit vowel-like qualities, whether specific instruments, registers, and dynamic levels stand out, and what the acoustical correlates of this relation might be. In an online experiment, German native speakers listen to the sounds of oboe, clarinet, flute, bassoon, trumpet, trombone, French horn, tuba, violin, viola, cello, and double bass in three registers and two dynamic levels. Their task is to assign the following vowels and umlauts (in German pronunciation) to instrument sounds: a, å, e, i, o, u, ä, ö, and ü. Furthermore, participants rate the strength of vowel similarity. Preliminary analyses (of 43 participants) suggest that although vowel similarity is rated approximately equally high, vowel associations do not seem to be equally consistent for different instruments. Particularly strong associations are observed for bassoon and tuba with the vowel o, and for oboe and violin with the vowel i. Audio features will be used to model vowel similarity.
- Explainable modeling of gender-targeting practices in toy advertising sound and music. Luca Marinelli and Charalampos Saitis. 1st Workshop on Explainable Machine Learning for Speech and Audio, 49th IEEE International Conference on Acoustics, Speech and Signal Processing, 2024
This study examines gender coding in sound and music, in a context where music plays a supportive role to other modalities, such as in toy advertising. We trained a series of binary XGBoost classifiers on handcrafted features extracted from the soundtracks and then performed SAGE and SHAP analyses to identify key audio features in predicting the gender target of the ads. Our analysis reveals that timbral dimensions play a prominent role and that commercials aimed at girls tend to be more harmonious and rhythmical, with a broader and smoother spectrum, while those targeting boys are characterised by higher loudness, spectral entropy, and roughness. Mixed audience commercials instead appear to be as rhythmical as girls-only ads, although slower, but show intermediate characteristics in terms of harmonicity and roughness. This study highlights the importance of music in shaping societal norms and the need for greater accountability in its use in marketing and other industries. We provide a public repository containing all code and data used in this study.
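A minimal sketch of the classify-then-explain pattern described above: a binary XGBoost classifier on handcrafted audio features followed by a SHAP tree explanation. Feature names and data are placeholders, not the study's dataset.

```python
# Minimal sketch: binary XGBoost classifier on handcrafted audio features,
# explained with SHAP. Feature names and data are illustrative placeholders.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
feature_names = ["loudness", "spectral_entropy", "roughness",
                 "harmonicity", "tempo", "spectral_centroid"]
X = rng.normal(size=(200, len(feature_names)))   # one row per soundtrack
y = rng.integers(0, 2, size=200)                 # 0 = feminine, 1 = masculine target

model = XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, y)

# SHAP values attribute each prediction to individual audio features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
mean_abs = np.abs(shap_values).mean(axis=0)
for name, importance in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name:18s} {importance:.3f}")
```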
- A review of differentiable digital signal processing for music and speech synthesis. Ben Hayes, Jordie Shier, György Fazekas, and 2 more authors. Frontiers in Signal Processing, 2024
The term differentiable digital signal processing describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming. Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research.
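A toy end-to-end example of the core idea: a digital signal processor (here, a one-pole lowpass filter with a learnable coefficient) written in a differentiable framework, so that a loss on its audio output can be backpropagated to its parameter.

```python
# Toy example of differentiable DSP: a one-pole lowpass filter whose
# coefficient is learned by backpropagating an audio-domain loss through
# the filter recursion. Illustrative only.
import torch

def one_pole_lowpass(x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """y[n] = (1 - a) * x[n] + a * y[n - 1], with 0 < a < 1."""
    prev = torch.zeros(())
    out = []
    for n in range(x.shape[0]):
        prev = (1 - a) * x[n] + a * prev
        out.append(prev)
    return torch.stack(out)

torch.manual_seed(0)
x = torch.randn(1024)
with torch.no_grad():
    target = one_pole_lowpass(x, torch.tensor(0.9))   # the "recording" to match

raw = torch.tensor(0.0, requires_grad=True)           # unconstrained parameter
optimizer = torch.optim.Adam([raw], lr=0.05)

for step in range(200):
    a = torch.sigmoid(raw)                             # keep coefficient in (0, 1)
    loss = torch.mean((one_pole_lowpass(x, a) - target) ** 2)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"recovered coefficient: {torch.sigmoid(raw).item():.3f}")  # should approach 0.9
```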
2023
- A body-centred perspective to chronic pain self-management using generative sonification. Bleiz Macsen Del Sette and Charalampos Saitis. Annual Workshop of the Music and Human-Computer Interaction Networks (CHIME), 2023
- Beat and Downbeat Tracking with Generative Embeddings. Haokun Tian, Kun Liu, and Magdalena Fuentes. Late Breaking Demo of the 24th International Society for Music Information Retrieval Conference, 2023
It is standard practice to use spectrograms as input features for discriminative MIR tasks. However, recent research showed using representations produced by Jukebox (a music language model) led to better model performance. This was tested on music tagging, genre classification, key detection, emotion recognition, and music transcription. In this paper, we test it on beat and downbeat tracking. Specifically, we compare compressed Jukebox embeddings with spectrograms as input to a model that jointly predicts beat, downbeat, and tempo. Experiments show that the two inputs bring comparable results for beat tracking, while using Jukebox embeddings leads to significant improvements for downbeat tracking.
- Soundscapes of morality: Linking music preferences and moral values through lyrics and audio. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. PLOS ONE, 2023
Music is a fundamental element in every culture, serving as a universal means of expressing our emotions, feelings, and beliefs. This work investigates the link between our moral values and musical choices through lyrics and audio analyses. We align the psychometric scores of 1,480 participants to acoustics and lyrics features obtained from the top 5 songs of their preferred music artists from Facebook Page Likes. We employ a variety of lyric text processing techniques, including lexicon-based approaches and BERT-based embeddings, to identify each song’s narrative, moral valence, attitude, and emotions. In addition, we extract both low- and high-level audio features to comprehend the encoded information in participants’ musical choices and improve the moral inferences. We propose a Machine Learning approach and assess the predictive power of lyrical and acoustic features separately and in a multimodal framework for predicting moral values. Results indicate that lyrics and audio features from the artists people like inform us about their morality. Though the most predictive features vary per moral value, the models that utilised a combination of lyrics and audio characteristics were the most successful in predicting moral values, outperforming the models that only used basic features such as user demographics, the popularity of the artists, and the number of likes per user. Audio features boosted the accuracy in the prediction of empathy and equality compared to textual features, while the opposite happened for hierarchy and tradition, where higher prediction scores were driven by lyrical features. This demonstrates the importance of both lyrics and audio features in capturing moral values. The insights gained from our study have a broad range of potential uses, including customising the music experience to meet individual needs, music rehabilitation, or even effective communication campaign crafting.
- Modelling Moral Traits with Music Listening Preferences and Demographics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. Music in the AI Era. CMMR 2021. Springer Lecture Notes in Computer Science, vol 13770, 2023
Music has always been an integral part of our everyday lives, through which we express feelings, emotions, and concepts. Here, we explore the association between music genres, demographics, and moral values, employing data from an ad-hoc online survey and the Music Learning Histories Dataset. To further characterise the music preferences of the participants, the generalist/specialist (GS) score was employed. We exploit both classification and regression approaches to assess the predictive power of music preferences for inferring demographic attributes as well as the moral values of the participants. Our findings indicate that moral values are hard to predict from music listening behaviours alone (.62 average AUROC), while adding basic sociodemographic information raises the prediction score by 4% on average (.66 average AUROC), with the Purity foundation consistently achieving the highest accuracy. Similar results are obtained from the regression analysis. Finally, we provide insights into the most predictive music behaviours associated with each moral value, which can inform a wide range of applications from rehabilitation practices to communication campaign design.
- Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions. Jincheng Zhang, György Fazekas, and Charalampos Saitis. arXiv, 2023
Diffusion models have shown promising results for a wide range of generative tasks with continuous data, such as image and audio synthesis. However, little progress has been made in using diffusion models to generate discrete symbolic music, because this new class of generative models is not well suited to discrete data and its iterative sampling process is computationally expensive. In this work, we propose a diffusion model combined with a Generative Adversarial Network, aiming to (i) alleviate one of the remaining challenges in algorithmic music generation, namely the control of generation towards a target emotion, and (ii) mitigate the slow sampling drawback of diffusion models applied to symbolic music generation. We first used a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels and then used those to train a diffusion model. Our results demonstrate that our diffusion model can be successfully controlled to generate symbolic music with a desired emotion. Our model achieves several orders of magnitude improvement in computational cost, requiring merely four time steps to denoise, whereas current state-of-the-art diffusion models for symbolic music generation require steps on the order of thousands.
- Sound of Care: Towards a Co-Operative AI Digital Pain Companion to Support People with Chronic Primary Pain. Bleiz Macsen Del Sette, Dawn Carnes, and Charalampos Saitis. Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing, 2023
This work investigates the role of sound and technology in the everyday lives of people with chronic primary pain. Our primary goal was to inform the first participatory design workshop of Sound of Care, a new eHealth system for pain self-management. We used an ethical stakeholder analysis to inform a round of exploratory interviews, run with 8 participants including people with chronic primary pain, carers, and healthcare workers. We found that sound and technology serve as important but often unstructured tools, helping with distraction, mood regulation, and sleep. The experience of pain and musical preferences are highly personal, and communicating or understanding pain can be challenging, even among family members. To address the gaps in current chronic pain self-management care, we propose the use of a sound-based, AI-driven system, a Digital Pain Companion, which uses sonification to create a shared decision-making space, enhancing agency over treatment in a co-operative care environment.
- Gender-Coded Sound: Analysing the Gendering of Music in Toy Commercials via Multi-Task Learning. Luca Marinelli, György Fazekas, and Charalampos Saitis. 24th International Society for Music Information Retrieval Conference, 2023
Music can convey ideological stances, and gender is just one of them. Evidence from musicology and psychology research shows that gender-loaded messages can be reliably encoded and decoded via musical sounds. However, much of this evidence comes from examining music in isolation, while studies of the gendering of music within multimodal communicative events are sparse. In this paper, we outline a method to automatically analyse how music in TV advertising aimed at children may be deliberately used to reinforce traditional gender roles. Our dataset of 606 commercials included music-focused mid-level perceptual features, multimodal aesthetic emotions, and content analytical items. Despite its limited size, and because of the extreme gender polarisation inherent in toy advertisements, we obtained noteworthy results by leveraging multi-task transfer learning on our densely annotated dataset. The models were trained to categorise commercials based on their intended target audience, specifically distinguishing between masculine, feminine, and mixed audiences. Additionally, to provide explainability for the classification in gender targets, the models were jointly trained to perform regressions on emotion ratings across six scales, and on mid-level musical perceptual attributes across twelve scales. Standing in the context of MIR, computational social studies and critical analysis, this study may benefit not only music scholars but also advertisers, policymakers, and broadcasters.
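A minimal sketch of a multi-task setup of the kind described above: a shared encoder feeding one classification head (target audience) and two regression heads (emotion and mid-level perceptual ratings), trained with a joint loss. Layer sizes and data are illustrative placeholders.

```python
# Minimal sketch of multi-task learning: a shared encoder with one
# classification head (target audience) and two regression heads (emotion
# and mid-level perceptual ratings), trained with a joint loss.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, n_features=128, n_classes=3, n_emotions=6, n_midlevel=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()
        )
        self.audience_head = nn.Linear(64, n_classes)   # masculine / feminine / mixed
        self.emotion_head = nn.Linear(64, n_emotions)   # 6 aesthetic emotion scales
        self.midlevel_head = nn.Linear(64, n_midlevel)  # 12 perceptual scales

    def forward(self, x):
        h = self.encoder(x)
        return self.audience_head(h), self.emotion_head(h), self.midlevel_head(h)

model = MultiTaskModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Toy batch: 8 commercials with audio features and annotations.
x = torch.randn(8, 128)
audience = torch.randint(0, 3, (8,))
emotions = torch.rand(8, 6)
midlevel = torch.rand(8, 12)

logits, emo_pred, mid_pred = model(x)
loss = ce(logits, audience) + mse(emo_pred, emotions) + mse(mid_pred, midlevel)
loss.backward()
optimizer.step()
```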
- Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis. Jordie Shier, Franco Caspe, Andrew Robertson, and 3 more authors. 10th Convention of the European Acoustics Association, 2023
Differentiable digital signal processing (DDSP) techniques, including methods for audio synthesis, have gained attention in recent years and lend themselves to interpretability in the parameter space. However, current differentiable synthesis methods have not explicitly sought to model the transient portion of signals, which is important for percussive sounds. In this work, we present a unified synthesis framework aiming to address transient generation and percussive synthesis within a DDSP framework. To this end, we propose a model for percussive synthesis that builds on sinusoidal modeling synthesis and incorporates a modulated temporal convolutional network for transient generation. We use a modified sinusoidal peak picking algorithm to generate time-varying non-harmonic sinusoids and pair it with differentiable noise and transient encoders that are jointly trained to reconstruct drumset sounds. We compute a set of reconstruction metrics using a large dataset of acoustic and electronic percussion samples that show that our method leads to improved onset signal reconstruction for membranophone percussion instruments.
- The Responsibility Problem in Neural Networks with Unordered Targets. Ben Hayes, Charalampos Saitis, and György Fazekas. 11th International Conference on Learning Representations, Tiny Papers, 2023
We discuss the discontinuities that arise when mapping unordered objects to neural network outputs of fixed permutation, referred to as the responsibility problem. Prior work has proved the existence of the issue by identifying a single discontinuity. Here, we show that discontinuities under such models are uncountably infinite, motivating further research into neural networks for unordered data.
- Interactive Neural Resonators. Rodrigo Diaz, Charalampos Saitis, and Mark Sandler. International Conference on New Interfaces for Musical Expression, 2023
In this work, we propose a method for the controllable synthesis of real-time contact sounds using neural resonators. Previous works have used physically inspired statistical methods and physical modelling for object materials and excitation signals. Our method incorporates differentiable second-order resonators and estimates their coefficients using a neural network that is conditioned on physical parameters. This allows for interactive dynamic control and the generation of novel sounds in an intuitive manner. We demonstrate the practical implementation of our method and explore its potential creative applications.
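A toy sketch of a single differentiable second-order resonator: its coefficients are derived from a centre frequency and decay (parameters a conditioning network could predict), and the recursion is written in PyTorch so that gradients flow through it. This illustrates the general idea, not the authors' implementation.

```python
# Toy sketch of a differentiable second-order resonator: coefficients are
# derived from a centre frequency and decay, and the recursion is written in
# PyTorch so gradients can flow through it (e.g. from a conditioning network).
import math
import torch

def resonator(excitation: torch.Tensor, freq_hz: torch.Tensor,
              decay: torch.Tensor, sample_rate: float = 44100.0):
    """Two-pole resonator: y[n] = x[n] + 2 r cos(w) y[n-1] - r^2 y[n-2]."""
    w = 2 * math.pi * freq_hz / sample_rate
    r = torch.exp(-decay / sample_rate)        # pole radius from decay rate
    a1, a2 = 2 * r * torch.cos(w), -r ** 2
    y1 = torch.zeros(())
    y2 = torch.zeros(())
    out = []
    for n in range(excitation.shape[0]):
        y0 = excitation[n] + a1 * y1 + a2 * y2
        out.append(y0)
        y1, y2 = y0, y1
    return torch.stack(out)

# Impulse excitation through a 440 Hz mode with learnable frequency and decay.
excitation = torch.zeros(4096)
excitation[0] = 1.0
freq = torch.tensor(440.0, requires_grad=True)
decay = torch.tensor(30.0, requires_grad=True)
audio = resonator(excitation, freq, decay)
audio.sum().backward()                          # gradients reach freq and decay
```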
- Gender differences in Moral Valence, Sentiment, and Narratives of Song Lyrics Over Time. Vjosa Preniqi, Kyriaki Kalimeri, Andreas Kaltenbrunner, and 1 more author. 9th International Conference on Computational Social Science, 2023
- Analysing the Gendering of Music in Toy Commercials via Mid-level Perceptual Features. Luca Marinelli and Charalampos Saitis. 17th International Conference on Music Perception and Cognition, 2023
- Exploring the Role of Audio and Lyrics in Explaining Moral Worldviews. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 17th International Conference on Music Perception and Cognition, 2023
- Evolution of Moral Valence in Lyrics Over Time. Vjosa Preniqi, Kyriaki Kalimeri, Andreas Kaltenbrunner, and 1 more author. 17th International Conference on Music Perception and Cognition, 2023
- When ChatGPT Talks Timbre. Charalampos Saitis and Kai Siedenburg. 3rd International Conference on Timbre, 2023
- The language of sounds unheard: Exploring musical timbre semantics of large language models. Kai Siedenburg and Charalampos Saitis. arXiv, 2023
Semantic dimensions of sound have been playing a central role in understanding the nature of auditory sensory experience as well as the broader relation between perception, language, and meaning. Accordingly, and given the recent proliferation of large language models (LLMs), here we asked whether such models exhibit an organisation of perceptual semantics similar to those observed in humans. Specifically, we prompted ChatGPT, a chatbot based on a state-of-the-art LLM, to rate musical instrument sounds on a set of 20 semantic scales. We elicited multiple responses in separate chats, analogous to having multiple human raters. ChatGPT generated semantic profiles that only partially correlated with human ratings, yet showed robust agreement along well-known psychophysical dimensions of musical sounds such as brightness (bright-dark) and pitch height (deep-high). Exploratory factor analysis suggested the same dimensionality but different spatial configuration of a latent factor space between the chatbot and human ratings. Unexpectedly, the chatbot showed degrees of internal variability that were comparable in magnitude to that of human ratings. Our work highlights the potential of LLMs to capture salient dimensions of human sensory experience.
- Sinusoidal Frequency Estimation by Gradient Descent. Ben Hayes, Charalampos Saitis, and György Fazekas. 48th IEEE International Conference on Acoustics, Speech and Signal Processing, 2023
Sinusoidal parameter estimation is a fundamental task in applications from spectral analysis to time-series forecasting. Estimating the sinusoidal frequency parameter by gradient descent is, however, often impossible as the error function is non-convex and densely populated with local minima. The growing family of differentiable signal processing methods has therefore been unable to tune the frequency of oscillatory components, preventing their use in a broad range of applications. This work presents a technique for joint sinusoidal frequency and amplitude estimation using the Wirtinger derivatives of a complex exponential surrogate and any first order gradient-based optimiser, enabling end-to-end training of neural network controllers for unconstrained sinusoidal models.
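A minimal sketch of the underlying idea: parameterise an oscillator by a single complex number z, synthesise the signal as Re(z^n), and let a first-order optimiser adjust z directly (PyTorch supplies the complex, Wirtinger-style gradients); the frequency estimate is then the angle of z. A simplified illustration, not the paper's full method.

```python
# Minimal sketch of frequency estimation with a complex exponential surrogate:
# the oscillator is a single complex parameter z, the signal is Re(z^n), and a
# gradient-based optimiser adjusts z directly. Simplified illustration only.
import math
import torch

sample_rate = 16000.0
n = torch.arange(256, dtype=torch.float32)

true_freq = 1234.0                                    # Hz
target = torch.cos(2 * math.pi * true_freq / sample_rate * n)

# Initialise z inside the unit circle at an arbitrary angular frequency.
z = torch.polar(torch.tensor(0.95), torch.tensor(0.3)).requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=1e-2)

for step in range(2000):
    pred = torch.exp(n * torch.log(z)).real           # Re(z^n), a damped sinusoid
    loss = torch.mean((pred - target) ** 2)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

est_freq = torch.angle(z).item() * sample_rate / (2 * math.pi)
print(f"estimated frequency: {est_freq:.1f} Hz (true: {true_freq} Hz)")
```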
- Rigid-Body Sound Synthesis with Differentiable Modal Resonators. Rodrigo Diaz, Ben Hayes, Charalampos Saitis, and 2 more authors. 48th IEEE International Conference on Acoustics, Speech and Signal Processing, 2023
Physical models of rigid bodies are used for sound synthesis in applications from virtual environments to music production. Traditional methods, such as modal synthesis, often rely on computationally expensive numerical solvers, while recent deep learning approaches are limited by post-processing of their results. In this work, we present a novel end-to-end framework for training a deep neural network to generate modal resonators for a given 2D shape and material using a bank of differentiable IIR filters. We demonstrate our method on a dataset of synthetic objects but train our model using an audio-domain objective, paving the way for physically-informed synthesisers to be learned directly from recordings of real-world objects.
- Timbre semantic associations vary both between and within instruments: An empirical study incorporating register and pitch height. Lindsey Reymore, Jason Noble, Charalampos Saitis, and 2 more authors. Music Perception, 2023
The main objective of this study is to understand how timbre semantic associations — for example, a sound’s timbre perceived as bright, rough, or hollow — vary with register and pitch height across instruments. In this experiment, 540 online participants rated single, sustained notes from eight Western orchestral instruments (flute, oboe, bass clarinet, trumpet, trombone, violin, cello, and vibraphone) across three registers (low, medium, and high) on 20 semantic scales derived from Reymore and Huron (2020). The 24 two-second stimuli, equalized in loudness, were produced using the Vienna Symphonic Library. Exploratory modeling examined relationships between mean ratings of each semantic dimension and instrument, register, and participant musician identity (‘‘musician’’ vs. ‘‘nonmusician’’). For most semantic descriptors, both register and instrument were significant predictors, though the amount of variance explained differed (marginal R^2). Terms that had the strongest positive relationships with register include shrill/harsh/noisy, sparkling/brilliant/bright, ringing/long decay, and percussive. Terms with the strongest negative relationships with register include deep/thick/heavy, raspy/grainy/gravelly, hollow, and woody. Post hoc modeling using only pitch height and only register to predict mean semantic rating suggests that pitch height may explain more variance than does register. Results help clarify the influence of both instrument and relative register (and pitch height) on common timbre semantic associations.
- Proceedings of the 3rd International Conference on Timbre. Eds: Marcelo Caetano, Zachary Wallmark, Asterios Zacharakis, and 2 more editors. The School of Music Studies, Aristotle University of Thessaloniki, 2023
2022
- Real-time timbre mapping for synthesized percussive performance. Jordie Shier. DMRN+17: Digital Music Research Network One-Day Workshop, 2022
- More Than Words: Linking Music Preferences and Moral Values Through Lyrics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 23rd International Society for Music Information Retrieval Conference, 2022
This study explores the association between music preferences and moral values by applying text analysis techniques to lyrics. Harvesting data from a Facebook-hosted application, we align psychometric scores of 1,386 users to lyrics from the top 5 songs of their preferred music artists as emerged from Facebook Page Likes. We extract a set of lyrical features related to each song’s overarching narrative, moral valence, sentiment, and emotion. A machine learning framework was designed to exploit regression approaches and evaluate the predictive power of lyrical features for inferring moral values. Results suggest that lyrics from top songs of artists people like inform their morality. Virtues of hierarchy and tradition achieve higher prediction scores (between .20 and .30) than values of empathy and equality (between .08 and .11), while basic demographic variables only account for a small part in the models’ explainability. This shows the importance of music listening behaviours, as assessed via lyrical preferences, alone in capturing moral values. We discuss the technological and musicological implications and possible future improvements.
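A minimal sketch of the regression framework described above: cross-validated ridge regression predicting a moral-value score from lyrical features. Feature names and data are random placeholders, not the study's pipeline.

```python
# Minimal sketch of the regression setup: predict a moral-value score from
# lyrical features with cross-validated ridge regression. Feature names and
# data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["moral_valence", "sentiment", "narrative_score",
                 "emotion_joy", "emotion_anger", "age", "gender"]
X = rng.normal(size=(1386, len(feature_names)))   # one row per participant
y = rng.normal(size=1386)                         # e.g. a "tradition" score

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean cross-validated R^2: {r2_scores.mean():.2f}")
```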
- timbre.fun: A gamified interactive system for crowdsourcing a timbre semantic vocabulary. Ben Hayes, Charalampos Saitis, and György Fazekas. 24th International Congress on Acoustics, 2022
We present timbre.fun, a web-based gamified interactive system where users create sounds in response to semantic prompts (e.g., bright, rough) through exploring a two-dimensional control space that maps nonlinearly to the parameters of a simple hybrid wavetable and amplitude-modulation synthesizer. The current version features 25 semantic adjectives mined from a popular synthesis forum. As well as creating sounds, users can explore heatmaps generated from others’ responses, and fit a classifier (k-nearest neighbors) in-browser. timbre.fun is based on recent work, including by the authors, which studied timbre semantic associations through prompted synthesis paradigms. The interactive is embedded in a digital exhibition on sensory variation and interaction (seeingmusic.app) which debuted at the 2021 Edinburgh Science Festival, where it was visited by 197 users from 21 countries over 16 days. As it continues running online, a further 596 visitors from 35 countries have engaged. To date 579 sounds have been created and tagged, which will facilitate parallel research in timbre semantics and neural audio synthesis. Future work will include further gamifying the data collection pipeline, including leveling-up to unlock new words and synthesizers, and a full open-source release.
- Seeing Music: Leveraging citizen science and gamification to study cross-sensory associations. Charalampos Saitis, Christine Cuskley, and Sebastian Löbbers. 20th International Multisensory Research Forum, 2022
Our recent research has shown that people lack knowledge about how the senses interact and are unaware of many common forms of sensory and perceptual variation. We present Seeing Music, a digital interactive exhibition and audiovisual game that translates high-level scientific understanding of sensory variation and cross-modality into knowledge for the public. Using a narrative-driven gamified approach, players are tasked with communicating human music to an extraterrestrial intelligence through visual shape, color and texture using two-dimensional selector panels. Music snippets (12–24 s long) are played continuously in a loop, taken from three custom instrumental compositions designed to vary systematically in terms of timbre, melody, and rhythm. Players can “level-up” to unlock new visual features and musical snippets, and explore and evaluate collaborative visualizations made by others. Outside the game, a series of interactive slideshows help visitors learn more about sensory experience, sensory diversity, and how our senses make us human. The exhibition debuted at the 2021 Edinburgh Science Festival, where it was visited by 197 users coming from 21 countries (134 visitors from the UK) over 16 days. As it continues running online, a further 596 visitors from 35 countries (164 from the UK) have engaged. To date, 169 players of Seeing Music have produced more than 42,500 audiovisual mapping datapoints for scientific research purposes. Preliminary analysis suggests that music with less high-frequency energy was mapped to less complex and rounder shapes, bluer and less bright hues, and less dense textures. These trends confirm auditory-visual correspondences previously reported in more controlled laboratory studies, while also offering new insight into how different auditory-visual associations interact with each other. Future work includes improving user motivation and interaction, refining data collection, a full open-source release, and adding new games and informational material about research on the senses.
- Exploring the Dimensionality of the Affective Space Elicited by Gendered Toy Commercials. Luca Marinelli and Charalampos Saitis. 9th European Conference on Media, Communication & Film, 2022
As evidenced by a large body of literature, the gender-stereotyped nature of toy adverts has been widely scrutinised. However, little work has been done in examining the affective impact of these commercials on the audience. It has been proven that repeated exposure to gender-stereotyped messages has the capacity to influence behaviours, beliefs and attitudes. In particular, media can influence emotion socialization, and gender differences in emotion expression might emerge (Scherr 2018). In this study, we investigated whether commercials elicit emotions at different intensities with respect to the gender of their target audience. Furthermore, we evaluated whether such emotions follow distinct underlying latent structures. A total of 1081 ratings of 10 unipolar aesthetic emotion scales were collected for 135 commercials (45 for each masculine, feminine, and mixed target audience) from 80 UK nationals (35 F, 45 M) aged 18 to 76. The main reason for collecting our ratings from adults was that, already by age 11, children exhibit adult-like emotion recognition capabilities (Hunter 2011). Seven scales showed significant differences between commercials for distinct audiences; with five, in particular, revealing a strong polarization (happiness, amusement, beauty, calm, and anger). In addition, parallel analysis showed that a minimum of three factors are needed to explain the ratings for masculine and mixed targeted commercials, while only two are needed for the feminine ones, thereby indicating that the latter elicit emotions following a simpler underlying structure. Both results reflect larger issues in toy marketing, where gender essentialism is still dominant, and prompt further discussion and research.
- Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks. Russell Sammut Bonnici, Martin Benning, and Charalampos Saitis. International Joint Conference on Neural Networks, 2022
This work investigates the application of deep learning to timbre transfer. The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio. It is applied to the Flickr 8k Audio dataset for transferring vocal timbre between speakers and to the URMP dataset for transferring musical timbre between instruments. Variations of the approach were trained, and performance was compared using the SSIM (Structural Similarity Index) and FAD (Fréchet Audio Distance) metrics. A many-to-many approach was found to outperform a one-to-one approach in terms of reconstructive capability, while the one-to-one approach showed better results in terms of adversarial translation. Adopting a basic rather than a bottleneck residual block design is more suitable for enriching content information in the latent space, and whether the cyclic loss follows a variational or vanilla autoencoder formulation has no significant impact on the reconstructive and adversarial translation aspects of the model.
- Disembodied Timbres: A Study on Semantically Prompted FM Synthesis. Ben Hayes, Charalampos Saitis, and György Fazekas. Journal of the Audio Engineering Society, 2022
Disembodied electronic sounds constitute a large part of the modern auditory lexicon, but research into timbre perception has focused mostly on the tones of conventional acoustic musical instruments. It is unclear whether insights from these studies generalise to electronic sounds, nor is it obvious how these relate to the creation of such sounds. In this work, we present an experiment on the semantic associations of sounds produced by FM synthesis with the aim of identifying whether existing models of timbre semantics are appropriate for such sounds. We applied a novel experimental paradigm in which experienced sound designers responded to semantic prompts by programming a synthesiser, and provided semantic ratings on the sounds they created. Exploratory factor analysis revealed a five-dimensional semantic space. The first two factors mapped well to the concepts of luminance, texture, and mass. The remaining three factors did not have clear parallels, but correlation analysis with acoustic descriptors suggested an acoustical relationship to luminance and texture. Our results suggest that further enquiry into the timbres of disembodied electronic sounds, their synthesis, and their semantic associations would be worthwhile, and that this could benefit research into auditory perception and cognition, as well as synthesis control and audio engineering.
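A minimal sketch of an exploratory factor analysis over semantic ratings, here with varimax rotation via scikit-learn; the rating matrix and scale names are placeholders, not the study's data.

```python
# Minimal sketch: exploratory factor analysis of semantic ratings with
# varimax rotation (scikit-learn). Ratings and scale names are placeholders.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
scales = ["bright", "rough", "thick", "metallic", "hollow", "warm", "sharp"]
ratings = rng.uniform(1, 7, size=(300, len(scales)))   # sounds x semantic scales

fa = FactorAnalysis(n_components=5, rotation="varimax")
fa.fit(ratings)

# Loadings: how strongly each semantic scale is associated with each factor.
for scale, loadings in zip(scales, fa.components_.T):
    print(scale, np.round(loadings, 2))
```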
- Deep Embeddings for Robust User-Based Amateur Vocal Percussion Transcription. Alejandro Delgado, Emir Demirel, Vinod Subramanian, and 2 more authors. 19th Sound and Music Computing Conference, 2022
Vocal Percussion Transcription (VPT) is concerned with the automatic detection and classification of vocal percussion sound events, allowing music creators and producers, among others, to sketch drum lines on the fly. VPT classifiers usually learn best from small user-specific datasets, which usually restricts modelling to small input feature sets to avoid overfitting. This study explores several deep supervised learning strategies to obtain informative feature sets for amateur VPT classification. We evaluated their performance on regular VPT classification tasks and compared them with several baseline approaches, including feature selection methods and a state-of-the-art speech recognition engine. The proposed learning models were supervised with several label sets containing information from four different levels of abstraction: instrument-level, syllable-level, phoneme-level, and boxeme-level. Results suggest that convolutional neural networks supervised with syllable-level annotations produced the most informative embeddings for VPT systems, which can be used as input representations for fitting classifiers. Finally, we used back-propagation-based saliency maps to investigate the importance of different spectrogram regions for feature learning.
- Auditory brightness perception investigated by unimodal and crossmodal interference. Charalampos Saitis, Zachary Wallmark, and Annie Liu. Biennial Meeting of the Society for Music Perception and Cognition, 2022
Brightness is among the most studied aspects of timbre perception. Psychoacoustically, sounds described as ”bright” vs ”dark” typically exhibit a high vs low frequency emphasis in the spectrum. However, relatively little is known about the neurocognitive mechanisms that facilitate these “metaphors we listen with.” Do they originate in universal mental representations common to more than one sensory modality? Triangulating three different interaction paradigms, we investigated using speeded identification whether unimodal and crossmodal interference occurs when timbral brightness, as modelled by the centroid of the spectral envelope, and 1) pitch height, 2) visual brightness, 3) numerical value processing are semantically incongruent. In three online pilot tasks, 58 participants were presented a baseline stimulus (a pitch, gray square, or numeral) then asked to quickly identify a target stimulus that is higher/lower, brighter/darker, or greater/less than the baseline, respectively, after being primed with a bright or dark synthetic harmonic tone. Additionally, in the pitch and visual tasks, a deceptive same-target condition was included. Results suggest that timbral brightness modulates the perception of pitch and visual brightness, but not numerical value. Semantically incongruent pitch height-timbral brightness shifts produced significantly slower choice reaction time and higher error compared to congruent pairs; timbral brightness also had a strong biasing effect in the same-target condition (i.e., people heard the same pitch as higher when the target tone was timbrally brighter than the baseline, and vice versa with darker tones). In the visual task, incongruent pairings of gray squares and tones elicited slower choice reaction times than congruent pairings. No interference was observed in the number comparison task. We are currently following up on these results with a larger online replication sample, and an fMRI study to investigate the relevant neural mechanisms. Our findings shed light on the multisensory nature of experiencing timbre.
- Proceedings of the 11th International Workshop on Haptic and Audio Interaction Design. Eds: Charalampos Saitis, Ildar Farkhatdinov, and Stefano Papetti. Springer Lecture Notes in Computer Science 13417, 2022
2021
- Multimodal Classification of Stressful Environments in Visually Impaired Mobility Using EEG and Peripheral Biosignals. Charalampos Saitis and Kyriaki Kalimeri. IEEE Transactions on Affective Computing, 2021
In this study, we aim to better understand the cognitive-emotional experience of visually impaired people when navigating in unfamiliar urban environments, both outdoor and indoor. We propose a multimodal framework based on random forest classifiers, which predict the actual environment among predefined generic classes of urban settings, inferring on real-time, non-invasive, ambulatory monitoring of brain and peripheral biosignals. Model performance reached 93% for the outdoor and 87% for the indoor environments (expressed in weighted AUROC), demonstrating the potential of the approach. Estimating the density distributions of the most predictive biomarkers, we present a series of geographic and temporal visualizations depicting the environmental contexts in which the most intense affective and cognitive reactions take place. A linear mixed model analysis revealed significant differences between categories of vision impairment, but not between normal and impaired vision. Despite the limited size of our cohort, these findings pave the way to emotionally intelligent mobility-enhancing systems, capable of implicit adaptation not only to changing environments but also to shifts in the affective state of the user in relation to different environmental and situational factors.
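A minimal sketch of the classify-and-evaluate step: a random forest predicting environment class from biosignal features, scored with a weighted AUROC as reported above. Features and labels are random placeholders.

```python
# Minimal sketch: random forest classification of environment classes from
# biosignal features, scored with weighted AUROC. Data are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))         # EEG + peripheral biosignal features
y = rng.integers(0, 4, size=600)       # four generic environment classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)
auroc = roc_auc_score(y_te, proba, multi_class="ovr", average="weighted")
print(f"weighted AUROC: {auroc:.2f}")
```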
- Modelling Moral Traits with Music Listening Preferences and Demographics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 15th International Symposium on Computer Music Multidisciplinary Research, 2021
Music has always been an integral part of our everyday lives, through which we express feelings, emotions, and concepts. Here, we explore the association between music genres, demographics, and moral values, employing data from an ad-hoc online survey and the Music Learning Histories Dataset. To further characterise the music preferences of the participants, the generalist/specialist (GS) score was employed. We exploit both classification and regression approaches to assess the predictive power of music preferences for inferring demographic attributes as well as the moral values of the participants. Our findings indicate that moral values are hard to predict from music listening behaviours alone (.62 average AUROC), while adding basic sociodemographic information raises the prediction score by 4% on average (.66 average AUROC), with the Purity foundation consistently achieving the highest accuracy. Similar results are obtained from the regression analysis. Finally, we provide insights into the most predictive music behaviours associated with each moral value, which can inform a wide range of applications from rehabilitation practices to communication campaign design.
- Development of a Web Application for the Education, Assessment, and Study of Timbre Perception. Charalampos Saitis. Society for Education, Music, and Psychology Research Conference, 2021
Timbre is defined as any auditory property other than pitch, duration, and loudness that allows two sounds to be distinguished. The Timbre Explorer (TE) is a synthesiser interface designed to demonstrate timbral dimensions of sound. This project aimed to develop and evaluate a web version of the TE that attempts to train its users and test their understanding of timbre as they go through a series of gamified tasks. A pilot study with 16 participants helped to identify shortcomings ahead of a full-sized study that will evaluate the performance of the TE as an educational aid and musical assessment tool.
- We are what we listen to: How moral values reflect on musical preferencesVjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis7th International Conference on Computational Social Science, 2021
- Neural Waveshaping SynthesisBen Hayes, Charalampos Saitis, and György Fazekas22nd International Society for Music Information Retrieval Conference, 2021
We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis which operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.
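The core idea can be sketched in a few lines (a toy illustration, not the published NEWT code): a small per-sample MLP with periodic activations acts as a learnable waveshaper, with affine scale and shift applied to its input and output signals; layer sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyWaveshaper(nn.Module):
    """Illustrative learnable waveshaper: an MLP applied independently to every
    sample of a waveform, with sinusoidal activations and affine in/out scaling."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 1)
        # Affine parameters on the input and output signals (scale and shift)
        self.in_scale = nn.Parameter(torch.ones(1))
        self.in_shift = nn.Parameter(torch.zeros(1))
        self.out_scale = nn.Parameter(torch.ones(1))
        self.out_shift = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples) exciter signal, e.g. a sinusoid at the target F0
        h = (x * self.in_scale + self.in_shift).unsqueeze(-1)   # (batch, samples, 1)
        h = torch.sin(self.fc1(h))
        h = torch.sin(self.fc2(h))
        y = self.fc3(h).squeeze(-1)
        return y * self.out_scale + self.out_shift

waveshaper = TinyWaveshaper()
exciter = torch.sin(2 * torch.pi * 220.0 * torch.arange(16000) / 16000).unsqueeze(0)
shaped = waveshaper(exciter)   # same shape as the input waveform
print(shaped.shape)
```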
- Perceptual and semantic scaling of FM synthesis timbres: Common dimensions and the role of expertiseBen Hayes, Charalampos Saitis, and György Fazekas16th International Conference on Music Perception and Cognition, 2021
Electronic sound has a rich history, yet timbre research has typically focused on the sounds of physical instruments, while synthesised sound is often relegated to functional roles like recreating acoustic timbres. Studying the perception of synthesised sound can broaden our conception of timbre and improve musical synthesis tools. We aimed to identify the perceptually salient acoustic attributes of sounds produced by frequency modulation synthesis. We also aimed to test Zacharakis et al.'s luminance-texture-mass timbre semantic model [Music Perception, 31, 339–358 (2014)] in this domain. Finally, we aimed to identify effects of prior music or synthesis experience on these results. Our results suggest that discrimination of abstract electronic timbres may rely on attributes distinct from those used with acoustic timbres. Further, the most salient attributes vary with expertise. However, the use of semantic descriptors is similar to that of acoustic instruments, and is consistent across expertise levels.
- NASH: the Neural Audio Synthesis HackathonBen Hayes, Cyrus Vahidi, and Charalampos SaitisDMRN+16: Digital Music Research Network One-Day Workshop, 2021
The field of neural audio synthesis aims to produce audio using neural networks. A recent surge in its popularity has led to several high profile works achieving impressive feats of speech and music synthesis. The development of broadly accessible neural audio synthesis tools, conversely, has been limited, and creative applications of these technologies are mostly undertaken by those with technical know-how. Research has focused largely on tasks such as realistic speech and musical instrument synthesis, whereas investigations into high-level control, esoteric sound design capabilities, and interpretability have received less attention. To encourage innovative work addressing these gaps, C4DM’s Special Interest Group on Neural Audio Synthesis (SIGNAS) propose to host our first Neural Audio Synthesis Hackathon: a two day event, with results to be presented in a session at DMRN+16.
- Acoustic Representations for Perceptual Timbre SimilarityCyrus Vahidi, Ben Hayes, Charalampos Saitis, and 1 more authorDMRN+16: Digital Music Research Network One-Day Workshop, 2021
In this work, we outline initial steps towards modelling perceptual timbre dissimilarity. We use stimuli from 17 distinct subjective timbre studies and compute pairwise distances in the spaces of MFCCs, joint time-frequency scattering coefficients and Open-L3 embeddings. We analyze agreement of distances in these spaces with human dissimilarity ratings and highlight challenges of this task.
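As a concrete, simplified instance of this kind of analysis (not the authors' code; the synthetic stimuli and human ratings below are placeholders), MFCC-based pairwise distances can be compared to dissimilarity ratings via rank correlation.

```python
import numpy as np
import librosa
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
# Placeholder stimuli: three synthetic tones with different spectra (stand-ins for the study's sounds)
stimuli = [np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 3 * f0 * t) for f0 in (220, 440, 660)]

feats = []
for y in stimuli:
    mfcc = librosa.feature.mfcc(y=y.astype(np.float32), sr=sr, n_mfcc=20)
    feats.append(mfcc.mean(axis=1))        # summarise each sound by its mean MFCC vector

model_dist = pdist(np.stack(feats), metric="euclidean")   # condensed pairwise distances

# Placeholder human dissimilarity ratings in the same pair order (1-2, 1-3, 2-3)
human_dissim = np.array([0.3, 0.8, 0.6])
rho, p = spearmanr(model_dist, human_dissim)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```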
- Variational Auto Encoding and Cycle-Consistent Adversarial Networks for Timbre TransferRussell Sammut Bonnici, Martin Benning, and Charalampos SaitisDMRN+16: Digital Music Research Network One-Day Workshop, 2021
The combination of Variational Autoencoders (VAE) with Generative Adversarial Networks (GAN) motivates meaningful representations of audio in the context of timbre transfer. This was applied to different datasets for transferring vocal timbre between speakers and musical timbre between instruments. Variations of the approach were trained and generalised performance was compared using the Structural Similarity Index and Fréchet Audio Distance. Many-to-many style transfer was found to improve reconstructive performance over one-to-one style transfer.
- A Modulation Front-End for Music Audio TaggingCyrus Vahidi, Charalampos Saitis, and György FazekasInternational Joint Conference on Neural Networks, 2021
Convolutional Neural Networks have been extensively explored in the task of automatic music tagging. The problem can be approached by using either engineered time-frequency features or raw audio as input. Modulation filter bank representations that have been actively researched as a basis for timbre perception have the potential to facilitate the extraction of perceptually salient features. We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block. The structure is effectively analogous to a modulation filter bank, where the FIR filter center frequencies are learned in a data-driven manner. The expectation is that a perceptually motivated filter bank can provide a useful representation for identifying music features. Our experimental results provide a fully visualisable and interpretable front-end temporal modulation decomposition of raw audio. We evaluate the performance of our model against the state-of-the-art of music tagging on the MagnaTagATune dataset. We analyse the impact on performance for particular tags when time-frequency bands are subsampled by the modulation filters at a progressively reduced rate. We demonstrate that modulation filtering provides promising results for music tagging and feature representation, without using extensive musical domain knowledge in the design of this frontend.
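A rough sketch of such a temporal modulation block (not ModNet or SincModNet themselves): a 1-D convolution along time applied per mel band, whose kernels play the role of data-driven FIR modulation filters; the channel counts and kernel length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModulationFrontEnd(nn.Module):
    """Illustrative front-end: per-band temporal convolution over a mel spectrogram,
    acting like a bank of learned FIR modulation filters."""
    def __init__(self, n_mels: int = 64, n_mod_filters: int = 8, kernel_size: int = 63):
        super().__init__()
        # groups=n_mels gives each mel band its own set of modulation filters;
        # padding keeps the frame count unchanged
        self.mod_filters = nn.Conv1d(n_mels, n_mels * n_mod_filters,
                                     kernel_size, padding=kernel_size // 2,
                                     groups=n_mels, bias=False)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, n_mels * n_mod_filters, frames)
        return self.mod_filters(mel)

frontend = ModulationFrontEnd()
mel = torch.randn(2, 64, 256)          # placeholder log-mel spectrogram batch
print(frontend(mel).shape)             # torch.Size([2, 512, 256])
```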
- Phoneme Mappings for Online Vocal Percussion TranscriptionAlejandro Delgado, Charalampos Saitis, and Mark Sandler151st Audio Engineering Society Convention, 2021, Honourable Mention for Outstanding Paper
Vocal Percussion Transcription (VPT) aims at detecting vocal percussion sound events in a beatboxing performance and classifying them into the correct drum instrument class (kick, snare, or hi-hat). To do this in an online (real-time) setting, however, algorithms are forced to classify these events within just a few milliseconds after they are detected. The purpose of this study was to investigate which phoneme-to-instrument mappings are the most robust for online transcription purposes. We used three different evaluation criteria to base our decision upon: frequency of use of phonemes among different performers, spectral similarity to reference drum sounds, and classification separability. With these criteria applied, the recommended mappings would potentially feel natural for performers to articulate while enabling the classification algorithms to achieve the best performance possible. Given the final results, we provided a detailed discussion on which phonemes to choose given different contexts and applications.
- Learning Models for Query by Vocal Percussion: A Comparative StudyAlejandro Delgado, SKoT McDonald, Ning Xu, and 2 more authors46th International Computer Music Conference, 2021
The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Thus, the automatic retrieval of drum sounds using vocal percussion can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. Here we explore different strategies to perform this type of query, making use of both traditional machine learning algorithms and recent deep learning techniques. The main hyperparameters from the models involved are carefully selected by feeding performance metrics to a grid search algorithm. We also look into several audio data augmentation techniques, which can potentially regularise deep learning models and improve generalisation. We compare the final performances in terms of effectiveness (classification accuracy), efficiency (computational speed), stability (performance consistency), and interpretability (decision patterns), and discuss the relevance of these results when it comes to the design of successful query-by-vocal-percussion systems.
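A minimal sketch of hyperparameter selection by grid search, as described above (the classifier, feature matrix, labels, and parameter grid are placeholders, not those used in the study).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # placeholder timbre features of vocal imitations
y = rng.integers(0, 5, size=200)        # placeholder drum-sound classes

param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```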
- The Timbre Explorer: A Synthesizer Interface for Educational Purposes and Perceptual StudiesJoshua Ryan Lam, and Charalampos SaitisInternational Conference on New Interfaces for Musical Expression, 2021
When two sounds are played at the same loudness, pitch, and duration, what sets them apart are their timbres. This study documents the design and implementation of the Timbre Explorer, a synthesizer interface based on efforts to dimensionalize this perceptual concept. The resulting prototype controls four perceptually salient dimensions of timbre in real-time: attack time, brightness, spectral flux, and spectral density. A graphical user interface supports user understanding with live visualizations of the effects of each dimension. The applications of this interface are three-fold; further perceptual timbre studies, usage as a practical shortcut for synthesizers, and educating users about the frequency domain, sound synthesis, and the concept of timbre. The project has since been expanded to a standalone version independent of a computer and a purely online web-audio version.
2020
- How we talk about sound: Semantic dimensions of abstract timbresBen Hayes, and Charalampos SaitisSound Instruments and Sonic Cultures: An Interdisciplinary Conference, 2020, National Science & Media Museum
Synthesisers, in their many forms, enable the realisation of almost any conceivable sound. Their fine-grained control and broad timbral palette call for a descriptive lexicon to enable their verbal differentiation and discussion. While acoustic instruments of the western classical lineage are the subject of an extensive body of enquiry into the perceptual attributes and semantic associations of the sounds they produce, abstract electronic sounds have been comparatively understudied in this regard. In particular, the diverse vocabulary used to describe such classical acoustic instruments can be summarised with three conceptual metaphors—such musical tones have luminance, texture, and mass—but this has yet to be explicitly confirmed for the kinds of electronic sounds that pervade many modern sonic cultures. In this work, we present an experimental paradigm for studying the semantic associations of synthesised sounds, wherein a group of experienced music producers and sound designers interacted with a web-based synthesiser in response to descriptive prompts, and provided comparative semantic ratings on the sounds they created. The words used for semantic ratings were selected by mining a text corpus from the popular modular synthesis forum Muff Wiggler, and analysing the frequency of adjectives in contexts pertaining to timbre. The ratings provided by participants were subject to statistical analysis. From 27 initial adjectives, two underlying semantic factors were revealed: terms including aggressive, hard, and complex associated with the first, and dark and warm with the second. These factors differ from those found for classical acoustic sounds, implying a relationship between the qualia of a sonic experience and the language employed to talk about it. Such insight has implications for how sound is conceptualised, understood, and received within sonic cultures—in particular, those predicated on electronic or abstract sound—and applications in developing novel control schemes for synthesis methods.
- Analysing and countering bodily interference in vibrotactile devices introduced by human interaction and physiologyMaximilian Weber, and Charalampos Saitis12th EuroHaptics Conference, 2020
- Timbre semantics through the lens of crossmodal correspondences: A new way of asking old questionsCharalampos Saitis, Stefan Weinzierl, Katharina Kriegstein, and 2 more authorsAcoustical Science and Technology, 2020
This position paper argues that a systematic study of the behavioral and neural mechanisms of crossmodal correspondences between timbral dimensions of sound and perceptual dimensions of other sensory modalities, such as brightness, roughness, or sweetness, can offer a new way of addressing old questions about the perceptual and neurocognitive mechanisms of auditory semantics. At the same time, timbre and the crossmodal metaphors that dominate its conceptualization can provide a test case for better understanding the neural basis of crossmodal correspondences and human semantic processing in general.
- What do people know about sensation and perception? Understanding perceptions of sensory experienceChristine Cuskley, and Charalampos SaitisPsyArXiv, 2020
Academic disciplines spanning cognitive science, art, and music have made strides in understanding how humans sense and experience the world. We now have a better scientific understanding of how human sensation and perception function both in the brain and in interaction than ever before. However, there is little research on how this high-level scientific understanding is translated into knowledge for the public more widely. We present descriptive results from a simple survey and compare how public understanding and perception of sensory experience line up with scientific understanding. Results show that even in a sample with fairly high educational attainment, many respondents were unaware of fairly common forms of sensory variation. In line with the well-documented underrepresentation of sign languages within linguistics, respondents tended to underestimate the number of sign languages in the world. We outline how our results represent gaps in public understanding of sensory variation, and argue that filling these gaps can form an important early intervention, acting as a basic foundation for improving acceptance, inclusivity, and accessibility for cognitively diverse populations.
- Timbre in Binaural Listening: A Comparison of Timbre Descriptors in Anechoic and HRTF Filtered Orchestral SoundsGeorgios Marentakis, and Charalampos SaitisForum Acusticum, 2020
The psychoacoustic investigation of timbre traditionally relies on audio descriptors extracted from anechoic or semi-anechoic recordings of musical instrument sounds, which are presented to listeners in diotic fashion. As a result, the extent to which spectral modifications due to the outer ear interact with timbre perception is not fully understood. As a first step towards investigating this research question, we examine here whether timbre descriptors calculated using HRTF filtered instrumental sounds deviate across ears and from values obtained from the same sounds without HRTF filtering for different listeners. The sound set comprised isolated notes played at the same fundamental frequency and dynamic from a database of anechoic recordings of modern orchestral instruments and some of their classical and baroque precursors. These were convolved with anechoic high spatial resolution HRTFs of human listeners. We present results and discuss implications for research on timbre perception and cognition.
- Perceptual Similarities in Neural Timbre EmbeddingsBen Hayes, Luke Brosnahan, Charalampos Saitis, and 1 more authorDMRN+15: Digital Music Research Network One-Day Workshop, 2020
Many neural audio synthesis models learn a representational space which can be used for control or exploration of the sounds generated. It is unclear what relationship exists between this space and human perception of these sounds. In this work, we compute configurational similarity metrics between an embedding space learned by a neural audio synthesis model and conventional perceptual and semantic timbre spaces. These spaces are computed using abstract synthesised sounds. We find significant similarities between these spaces, suggesting a shared organisational influence.
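One standard configurational similarity measure is Procrustes analysis; the abstract does not name the exact metric used, so the following is only an indicative sketch with placeholder 2-D configurations of the same set of sounds.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
embedding_space = rng.normal(size=(20, 2))     # placeholder: 20 sounds in a learned 2-D space
perceptual_space = rng.normal(size=(20, 2))    # placeholder: same sounds in a perceptual MDS space

# procrustes optimally translates, scales, and rotates one configuration onto the other;
# the disparity (sum of squared differences) is low when the spaces are similarly organised
_, _, disparity = procrustes(perceptual_space, embedding_space)
print(f"Procrustes disparity: {disparity:.3f}")
```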
- There’s More to Timbre than Musical Instruments: Semantic Dimensions of FM SoundsBen Hayes, and Charalampos Saitis2nd International Conference on Timbre, 2020
Much previous research into timbre semantics (such as when an oboe is described as “hollow”) has focused on sounds produced by acoustic instruments, particularly those associated with western tonal music (Saitis & Weinzierl, 2019). Many synthesisers are capable of producing sounds outside the timbral range of physical instruments, but which are still discriminable by their timbre. Research into the perception of such sounds, therefore, may help elucidate further the mechanisms underpinning our experience of timbre in the broader sense. In this paper, we present a novel paradigm on the application of semantic descriptors to sounds produced by experienced sound designers using an FM synthesiser with a full set of controls.
- Evidence for Timbre Space Robustness to an Uncontrolled Online Stimulus PresentationAsterios Zacharakis, Ben Hayes, Charalampos Saitis, and 1 more author2nd International Conference on Timbre, 2020
Research on timbre perception is typically conducted under controlled laboratory conditions where every effort is made to keep stimulus presentation conditions fixed (McAdams, 2019). This conforms with the ANSI (1973) definition of timbre, which suggests that in order to judge the timbre differences between a pair of sounds, the remaining perceptual attributes (i.e., pitch, duration and loudness) should remain unchanged. Therefore, especially in pairwise dissimilarity studies, particular care is taken to ensure that loudness is not used by participants as a criterion for judgements, by equalising it across experimental stimuli. On the other hand, conducting online experiments is an increasingly favoured practice in the music perception and cognition field, as targeting relevant communities can potentially provide a large number of suitable participants with relatively little time investment on the part of the experimenters (e.g., Woods et al., 2015). However, the strict requirements for stimulus preparation and presentation have prevented timbre studies from moving to online experimentation. Beyond the obvious difficulty of imposing equal loudness in online experiments, different playback equipment chains (DACs, pre-amplifiers, headphones) will almost inevitably ‘colour’ the sonic outcome in different ways. Despite these limitations, in a time of social distancing it would be of major importance to be able to lift some of the physical requirements in order to carry on conducting behavioural research on timbre perception. Therefore, this study investigates the extent to which an uncontrolled online replication of a past laboratory-conducted pairwise dissimilarity task distorts the findings.
- Spectral and Temporal Timbral Cues of Vocal ImitationsAlejandro Delgado, Charalampos Saitis, and Mark Sandler2nd International Conference on Timbre, 2020
The imitation of non-vocal sounds using the human voice is a resource we sometimes rely on when communicating sound concepts to other people. Query by Vocal Percussion (QVP) is a subfield in Music Information Retrieval (MIR) that explores techniques to query percussive sounds using vocal imitations as input, usually plosive consonant sounds. The goal of this work was to investigate timbral relationships between real drum sounds and their vocal imitations. We believe these insights could shed light on how to select timbre descriptors for extraction when designing offline and online QVP systems. In particular, we studied a dataset composed of 30 acoustic and electronic drum sound recordings and vocal imitations of each sound performed by 14 musicians. Our approach was to study the correlation of audio content descriptors of timbre extracted from the drum samples with the same descriptors taken from vocal imitations. Three timbral descriptors were selected: the Log Attack Time (LAT), the Spectral Centroid (SC), and the Derivative After Maximum of the sound envelope (DAM). LAT and SC have been shown to represent salient dimensions of timbre across different types of sounds including percussion. In this sense, one intriguing question would be to what extent listeners can communicate these salient timbral cues in vocal imitations. The third descriptor, DAM, was selected for its role in describing the sound’s tail, which we considered to be a relevant part of percussive utterances.
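A simplified sketch of extracting two of these descriptors (SC and LAT; DAM is omitted for brevity) and correlating them between drum sounds and imitations; the synthetic sounds and the rough LAT estimate below are illustrative assumptions, not the study's exact formulations.

```python
import numpy as np
import librosa
from scipy.stats import pearsonr

sr = 22050
t = np.linspace(0, 0.5, int(0.5 * sr), endpoint=False)

def synth_hit(f0, attack, decay):
    """Placeholder percussive sound: enveloped noisy tone (stands in for a recording)."""
    env = np.minimum(t / attack, 1.0) * np.exp(-t / decay)
    return (np.sin(2 * np.pi * f0 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)) * env

def spectral_centroid(y):
    return float(librosa.feature.spectral_centroid(y=y.astype(np.float32), sr=sr).mean())

def log_attack_time(y, lo=0.2, hi=0.9):
    # Rough LAT estimate: log time for the amplitude envelope to rise from 20% to 90% of its maximum
    env = np.abs(y)
    peak = env.max()
    t_lo = np.argmax(env >= lo * peak) / sr
    t_hi = np.argmax(env >= hi * peak) / sr
    return float(np.log10(max(t_hi - t_lo, 1e-4)))

drums = [synth_hit(60, 0.005, 0.10), synth_hit(200, 0.002, 0.08), synth_hit(4000, 0.001, 0.03)]
imitations = [synth_hit(80, 0.010, 0.12), synth_hit(250, 0.004, 0.09), synth_hit(3500, 0.002, 0.04)]

for name, fn in [("SC", spectral_centroid), ("LAT", log_attack_time)]:
    r, p = pearsonr([fn(d) for d in drums], [fn(v) for v in imitations])
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")
```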
- Timbre Space Representation of a Subtractive SynthesizerCyrus Vahidi, György Fazekas, Charalampos Saitis, and 1 more author2nd International Conference on Timbre, 2020
In this study, we produce a geometrically scaled perceptual timbre space from dissimilarity ratings of subtractive synthesized sounds and correlate the resulting dimensions with a set of acoustic descriptors. We curate a set of 15 sounds, produced by a synthesis model that uses varying source waveforms, frequency modulation (FM) and a lowpass filter with an enveloped cutoff frequency. Pairwise dissimilarity ratings were collected within an online browser-based experiment. We hypothesized that a varied waveform input source and enveloped filter would act as the main vehicles for timbral variation, providing novel acoustic correlates for the perception of synthesized timbres.
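A sketch of the general analysis chain implied above, with placeholder dissimilarities and descriptor values (the study's actual scaling procedure may differ): multidimensional scaling of a dissimilarity matrix followed by correlation of the resulting dimensions with an acoustic descriptor.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 15                                             # 15 synthesised sounds
d = rng.random((n, n))
d = (d + d.T) / 2                                  # placeholder symmetric dissimilarity matrix
np.fill_diagonal(d, 0)

space = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(d)

spectral_centroid = rng.random(n)                  # placeholder acoustic descriptor per sound
for dim in range(space.shape[1]):
    rho, p = spearmanr(space[:, dim], spectral_centroid)
    print(f"dimension {dim + 1}: rho = {rho:.2f} (p = {p:.3f})")
```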
- Verbal description of musical brightnessChristos Drouzas, and Charalampos Saitis2nd International Conference on Timbre, 2020
Amongst the most common descriptive expressions of timbre used by musicians, music engineers, audio researchers as well as everyday listeners are words related to the notion of brightness (e.g., bright, dark, dull, brilliant, shining). From a psychoacoustic perspective, brightness ratings of instrumental timbres as well as music excerpts systematically correlate with the centre of gravity of the spectral envelope and thus brightness as a semantic descriptor of musical sound has come to denote a prevalence of high-frequency over low-frequency energy. However, relatively little is known about the higher-level cognitive processes underpinning musical brightness ratings. Psycholinguistic investigations of verbal descriptions of timbre suggest a more complex, polysemic picture (Saitis & Weinzierl 2019) that warrants further research. To better understand how musical brightness is conceptualised by listeners, here we analysed free verbal descriptions collected along brightness ratings of short music snippets (involving 69 listeners) and brightness ratings of orchestral instrument notes (involving 68 listeners). Such knowledge can help delineate the intrinsic structure of brightness as a perceptual attribute of musical sounds, and has broad implications and applications in orchestration, audio engineering, and music psychology.
- Brightness perception for musical instrument sounds: Relation to timbre dissimilarity and source-cause categoriesCharalampos Saitis, and Kai SiedenburgThe Journal of the Acoustical Society of America, 2020
Timbre dissimilarity of orchestral sounds is well-known to be multidimensional, with attack time and spectral centroid representing its two most robust acoustical correlates. The centroid dimension is traditionally considered as reflecting timbral brightness. However, the question of whether multiple continuous acoustical and/or categorical cues influence brightness perception has not been addressed comprehensively. A triangulation approach was used to examine the dimensionality of timbral brightness, its robustness across different psychoacoustical contexts, and relation to perception of the sounds’ source-cause. Listeners compared 14 acoustic instrument sounds in three distinct tasks that collected general dissimilarity, brightness dissimilarity, and direct multi-stimulus brightness ratings. Results confirmed that brightness is a robust unitary auditory dimension, with direct ratings recovering the centroid dimension of general dissimilarity. When a two-dimensional space of brightness dissimilarity was considered, its second dimension correlated with the attack-time dimension of general dissimilarity, which was interpreted as reflecting a potential infiltration of the latter into brightness dissimilarity. Dissimilarity data were further modeled using partial least-squares regression with audio descriptors as predictors. Adding predictors derived from instrument family and the type of resonator and excitation did not improve the model fit, indicating that brightness perception is underpinned primarily by acoustical rather than source-cause cues.
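A minimal sketch of modelling dissimilarity data with partial least-squares regression on audio descriptors (placeholder data and descriptor count; not the authors' exact pipeline).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Each row is a pair of sounds; columns are differences in audio descriptors for that pair
X = rng.normal(size=(91, 6))          # 91 pairs of 14 sounds, 6 placeholder descriptors
y = rng.random(91)                    # placeholder brightness dissimilarity ratings

pls = PLSRegression(n_components=2)
r2 = cross_val_score(pls, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.2f}")
```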
- Towards a framework for ubiquitous audio-tactile designMaximilian Weber, and Charalampos Saitis10th International Workshop on Haptic and Audio Interaction Design, 2020
To enable a transition towards rich vibrotactile feedback in applications and media content, a complete end-to-end system — from the design of the tactile experience all the way to the tactile stimulus reproduction — needs to be considered. Currently, most applications are at best limited to dull vibration patterns due to limited hard- and software implementations, while the design of ubiquitous platform-agnostic tactile stimuli remains challenging due to a lack of standardized protocols and tools for tactile design, storage, transport, and reproduction. This work proposes a conceptual framework, utilizing audio assets as a starting point for the design of vibrotactile stimuli, including ideas for a parametric tactile data model, and outlines challenges for a platform-agnostic stimuli reproduction. Finally, the benefits and shortcomings of a commercial and wide-spread vibrotactile API are investigated as an example for the current state of a complete end-to-end framework.
- Musical dynamics classification with CNN and modulation spectraLuca Marinelli, Athanasios Lykartsis, Stefan Weinzierl, and 1 more author17th Sound and Music Computing Conference, 2020
To investigate variations in the timbre space with regard to musical dynamics, convolutional neural networks (CNNs) were trained on modulation power spectra (MPS), mel-scaled and ERB-scaled spectrograms of single notes of sustained instruments played at two dynamic extremes (pp and ff). The samples, drawn from an extensive dataset of several timbre families, were RMS normalized in order to eliminate loudness information and force the network to focus on timbre attributes of musical dynamics that are shared across different instrument families. The proposed CNN architecture obtained competitive results in three classification tasks with all three input representations. To compare the different input representations, the test sets in the three experiments were partitioned so as to either promote or avoid selection bias. When selection bias was avoided, models trained on MPS were outperformed by those trained on time-frequency representations; conversely, those trained on MPS achieved the best results when selection bias was promoted. Low temporal modulations emerged in class-specific MPS saliency maps as markers of musical dynamics. This led to the implementation of an MPS-based scalar descriptor of timbre that largely outperformed the chosen baseline (44.8% error reduction).
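Two of the preprocessing steps mentioned above can be sketched as follows: RMS normalisation of the waveform, and a modulation power spectrum computed as the 2-D Fourier transform of a log-mel spectrogram (a common construction; the placeholder signal and parameters are not those of the paper).

```python
import numpy as np
import librosa

def rms_normalise(y, target_rms=0.1):
    # Scale the waveform so its RMS matches a fixed target, removing loudness cues
    return y * (target_rms / (np.sqrt(np.mean(y ** 2)) + 1e-12))

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
# Placeholder sustained note: a 440 Hz tone with slow amplitude modulation
y = rms_normalise(np.sin(2 * np.pi * 440 * t) * (1 + 0.3 * np.sin(2 * np.pi * 4 * t)))

mel = librosa.feature.melspectrogram(y=y.astype(np.float32), sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)

# Modulation power spectrum: magnitude of the 2-D Fourier transform over (frequency, time)
mps = np.abs(np.fft.fftshift(np.fft.fft2(log_mel)))
print(mps.shape)   # (spectral modulation bins, temporal modulation bins)
```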
- Proceedings of the 2nd International Conference on TimbreEds: Asterios Zacharakis, Charalampos Saitis, and Kai SiedenburgThe School of Music Studies, Aristotle University of Thessaloniki, 2020
2019
- Modulation Spectra for Musical Dynamics Perception and RetrievalLuca Marinelli, Athanasios Lykartsis, and Charalampos SaitisDMRN+14: Digital Music Research Network One-Day Workshop, 2019
- The role of attack transients in timbral brightness perceptionCharalampos Saitis, Kai Siedenburg, Paul Schuladen, and 1 more author23rd International Congress on Acoustics, 2019
http://pub.dega-akustik.de/ICA2019/data/articles/000813.pdf
- Revisiting timbral brightness perceptionCharalampos Saitis, Kai Siedenburg, and Christoph ReuterBiennial Meeting of the Society for Music Perception and Cognition, 2019
Brightness has been long shown to play a major role in timbre perception but relatively little is known about the specific acoustic and cognitive factors that affect brightness ratings of musical instrument sounds. Previous work indicated that sound source categories influence general timbre dissimilarity ratings. To examine whether source categories also exert an effect on brightness ratings of timbre, we collected brightness dissimilarity ratings of 14 orchestral instrument tones from 40 musically experienced listeners and the data were modeled using a partial least-squares regression model that takes audio descriptors of timbre as regressors. It was found that adding predictors derived from sound source categories did not improve the model fit, indicating that timbral brightness is informed mainly by continuously varying properties of the acoustic signal. A multidimensional scaling analysis suggested at least two salient cues: spectral energy distribution and attack time and/or asynchrony in the rise of harmonics. This finding seems to challenge the typical approach of seeking acoustical correlates of brightness in the spectral envelope of the steady-state portion of sounds. To further investigate these aspects in timbral brightness perception, a new group of 40 musically experienced listeners will perform MUSHRA-like brightness ratings of an expanded set of 24 orchestral instrument notes. The goal is to obtain a perceptual scaling of the attribute across a larger set of sounds to help delineate the acoustic ingredients of this important aspect of timbre perception. Preliminary results indicate that between sounds with very close spectral centroid values but different attack times, those with faster attacks tend to be perceived as brighter. Overall, these experiments help clarify the relation between two salient dimensions of timbre: onset and spectral energy distribution.
- There’s more to timbre than musical instruments: a meta-analysis of timbre semantics in singing voice quality perceptionCharalampos Saitis, and Johanna DevaneyBiennial Meeting of the Society for Music Perception and Cognition, 2019
Imagine listening to the famous soprano Maria Callas (1923–1977) singing the aria “Vissi d’arte” from Puccini’s Tosca. How would you describe the quality of her voice? When describing the timbre of musical sounds, listeners use descriptions such as bright, heavy, round, and rough, among others. In 1890, Stumpf theorized that this diverse vocabulary can be summarized, on the basis of semantic proximities, by three pairs of opposites: dark–bright, soft–rough, and full–empty. Empirical findings across many semantic differential studies from the late 1950s until today have generally confirmed that these are the salient dimensions of timbre semantics. However, most prior work has considered only orchestral instruments, with relatively little attention given to sung tones. At the same time, research on the perception of singing voice quality has primarily focused on verbal attributes associated with phonation type, voice classification, vocal register, vowel intelligibility, and vibrato. Descriptions like pressed, soprano, falsetto, hoarse, or wobble, albeit in themselves a type of timbre semantics, are essentially sound source identifiers acting as semantic descriptors. It remains an open question as to whether the timbral attributes of sung tones, that is verbal attributes that bear no source associations, can be described adequately on the basis of the bright-rough-full semantic space. We present a meta-analysis of previous research on verbal attributes of singing voice timbre that covers not only pedagogical texts but also work from music cognition, psychoacoustics, music information retrieval, musicology, and ethnomusicology. The meta-analysis lays the groundwork for a semantic differential study of sung sounds, providing a more appropriate lexicon on which to draw than simply using verbal scales from related work on instrumental timbre. The meta-analysis will be complemented by a psycholinguistic analysis of free verbalizations provided by singing teachers in a listening test and an acoustic analysis of the tested stimuli.
- Spectrotemporal modulation timbre cues in musical dynamicsCharalampos Saitis, Luca Marinelli, Athanasios Lykartsis, and 1 more authorBiennial Meeting of the Society for Music Perception and Cognition, 2019
Timbre is often described as a complex set of sound features that are not accounted for by pitch, loudness, duration, spatial location, and the acoustic environment. Musical dynamics refers to the perceived or intended loudness of a played note, instructed in music notation as piano or forte (soft or loud) with different dynamic gradations between and beyond. Recent research has shown that even if no loudness cues are available, listeners can still quite reliably identify the intended dynamic strength of a performed sound by relying on timbral features. More recently, acoustical analyses across an extensive set of anechoic recordings of orchestral instrument notes played at pianissimo (pp) and fortissimo (ff) showed that attack slope, spectral skewness, and spectral flatness together explained 72% of the variance in dynamic strength across all instruments, and 89% with an instrument-specific model. Here, we further investigate the role of timbre in musical dynamics, focusing specifically on the contribution of spectral and temporal modulations. Loudness-normalized modulation power spectra (MPS) were used as input representation for a convolutional neural network (CNN). Through visualization of the pp and ff saliency maps of the CNN it was possible to identify discriminant regions of the MPS and define a novel task-specific scalar audio descriptor. A linear discriminant analysis with 10-fold cross-validation using this new MPS-based descriptor on the entire dataset performed better than using the two spectral descriptors (27% error rate reduction). Overall, audio descriptors based on different regions of the MPS could serve as sound representation for machine listening applications, as well as to better delineate the acoustic ingredients of different aspects of auditory perception.
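The evaluation step described at the end can be sketched as linear discriminant analysis with 10-fold cross-validation on a single scalar descriptor (placeholder descriptor values, not the actual dataset).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# One scalar MPS-based descriptor per note, with pp/ff labels (placeholder values)
descriptor = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])
dynamics = np.array([0] * 500 + [1] * 500)        # 0 = pianissimo, 1 = fortissimo

lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, descriptor.reshape(-1, 1), dynamics, cv=10)
print(f"mean 10-fold accuracy: {acc.mean():.2f} (error rate {1 - acc.mean():.2f})")
```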
- Beyond the semantic differential: Timbre semantics as crossmodal correspondencesCharalampos Saitis14th International Symposium on Computer Music Multidisciplinary Research, 2019
This position paper argues that a systematic study of crossmodal correspondences between timbral dimensions of sound and perceptual dimensions of other sensory modalities (e.g., brightness, fullness, roughness, sweetness) can offer a new way of addressing old questions about the perceptual and cognitive mechanisms of timbre semantics, while the latter can provide a test case for better understanding crossmodal correspondences and human semantic processing in general. Furthermore, a systematic investigation of auditory-nonauditory crossmodal correspondences necessitates auditory stimuli that can be intuitively controlled along intrinsic continuous dimensions of timbre, and the collection of behavioural data from appropriate tasks that extend beyond the semantic differential paradigm.
- Sounds like melted chocolate: how musicians conceptualize violin sound richnessCharalampos Saitis, Claudia Fritz, and Gary ScavoneInternational Symposium on Musical Acoustics, 2019
Results from a previous study on the perceptual evaluation of violins that involved playing-based semantic ratings showed that preference for a violin was strongly associated with its perceived sound richness. However, both preference and richness ratings varied widely between individual violinists, likely because musicians conceptualize the same attribute in different ways. To better understand how richness is conceptualized by violinists and how it contributes to the perceived quality of a violin, we analyzed free verbal descriptions collected during a carefully controlled playing task (involving 16 violinists) and in an online survey where no sound examples or other contextual information was present (involving 34 violinists). The analysis was based on a psycholinguistic method, whereby semantic categories are inferred from the verbal data itself through syntactic context and linguistic markers. The main sensory property related to violin sound richness was expressed through words such as full, complex, and dense versus thin and small, referring to the perceived number of partials present in the sound. Another sensory property was expressed through words such as warm, velvety, and smooth versus strident, harsh, and tinny, alluding to spectral energy distribution cues. Haptic cues were also implicated in the conceptualization of violin sound richness.
- The Semantics of TimbreCharalampos Saitis, and Stefan WeinzierlTimbre: Acoustics, Perception, and Cognition, 2019
Because humans lack a sensory vocabulary for auditory experiences, timbral qualities of sounds are often conceptualized and communicated through readily available sensory attributes from different modalities (e.g., bright, warm, sweet) but also through the use of onomatopoeic attributes (e.g., ringing, buzzing, shrill) or nonsensory attributes relating to abstract constructs (e.g., rich, complex, harsh). The analysis of the linguistic description of timbre, or timbre semantics, can be considered as one way to study its perceptual representation empirically. In the most commonly adopted approach, timbre is considered as a set of verbally defined perceptual attributes that represent the dimensions of a semantic timbre space. Previous studies have identified three salient semantic dimensions for timbre along with related acoustic properties. Comparisons with similarity-based multidimensional models confirm the strong link between perceiving timbre and talking about it. Still, the cognitive and neural mechanisms of timbre semantics remain largely unknown and underexplored, especially when one looks beyond the case of acoustic musical instruments.
- The present, past, and future of timbre researchKai Siedenburg, Charalampos Saitis, and Stephen McAdamsTimbre: Acoustics, Perception, and Cognition, 2019
Timbre is a foundational aspect of hearing. The remarkable ability of humans to recognize sound sources and events (e.g., glass breaking, a friend’s voice, a tone from a piano) stems primarily from a capacity to perceive and process differences in the timbre of sounds. Roughly defined, timbre is thought of as any property other than pitch, duration, and loudness that allows two sounds to be distinguished. Current research unfolds along three main fronts: (1) principal perceptual and cognitive processes; (2) the role of timbre in human voice perception, perception through cochlear implants, music perception, sound quality, and sound design; and (3) computational acoustic modeling. Along these three scientific fronts, significant breakthroughs have been achieved during the decade prior to the production of this volume. Bringing together leading experts from around the world, this volume provides a joint forum for novel insights and the first comprehensive modern account of research topics and methods on the perception, cognition, and acoustic modeling of timbre. This chapter provides background information and a roadmap for the volume.
- Audio Content Descriptors of TimbreMarcelo Caetano, Charalampos Saitis, and Kai SiedenburgTimbre: Acoustics, Perception, and Cognition, 2019
This chapter introduces acoustic modeling of timbre with the audio descriptors commonly used in music, speech, and environmental sound studies. These descriptors derive from different representations of sound, ranging from the waveform to sophisticated time-frequency transforms. Each representation is more appropriate for a specific aspect of sound description that is dependent on the information captured. Auditory models of both temporal and spectral information can be related to aspects of timbre perception, whereas the excitation-filter model of sound production provides links to the acoustics of sound production. A brief review of the most common representations of audio signals used to extract audio descriptors related to timbre is followed by a discussion of the audio descriptor extraction process using those representations. This chapter covers traditional temporal and spectral descriptors, including harmonic description, time-varying descriptors, and techniques for descriptor selection and descriptor decomposition. The discussion is focused on conceptual aspects of the acoustic modeling of timbre and the relationship between the descriptors and timbre perception, semantics, and cognition, including illustrative examples. The applications covered in this chapter range from timbre psychoacoustics and multimedia descriptions to computer-aided orchestration and sound morphing. Finally, the chapter concludes with speculation on the role of deep learning in the future of timbre description and on the challenges of audio content descriptors of timbre.
- Timbre: Acoustics, Perception, and CognitionEds: Kai Siedenburg, Charalampos Saitis, Stephen McAdams, and 2 more editorsSpringer Handbook of Auditory Research 69, 2019