Publications
2024
- Foundation Models for Music: A Survey. Yinghao Ma, Anders Øland, Anton Ragni, and 40 more authors. arXiv, 2024
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations remain underexplored in FM development. Emphasis is then placed on the lack of versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation, and medical applications. By comprehensively examining model pre-training paradigms, architectural choices, tokenisation, fine-tuning methodologies, and controllability, we highlight important topics that warrant deeper exploration, such as instruction tuning and in-context learning, scaling laws and emergent abilities, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of the datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends for FMs in music, aiming to shape the trajectory of human-AI collaboration in the music realm.
- Timbral brightness perception investigated through multimodal interference. Charalampos Saitis and Zachary Wallmark. Attention, Perception, & Psychophysics, 2024
Brightness is among the most studied aspects of timbre perception. Psychoacoustically, sounds described as “bright” vs “dark” typically exhibit a high vs low frequency emphasis in the spectrum. However, relatively little is known about the neurocognitive mechanisms that facilitate these metaphors we listen with. Do they originate in universal magnitude representations common to more than one sensory modality? Triangulating three different interaction paradigms, we investigated using speeded classification whether intramodal, crossmodal, and amodal interference occurs when timbral brightness, as modeled by the centroid of the spectral envelope, and pitch height / visual brightness / numerical value processing are semantically congruent and incongruent. In four online experiments varying in priming strategy, onset timing, and response deadline, 189 total participants were presented a baseline stimulus (a pitch, grey square, or numeral) then asked to quickly identify a target stimulus that is higher/lower, brighter/darker, or greater/less than the baseline after being primed with a bright or dark synthetic harmonic tone. Results suggest that timbral brightness modulates the perception of pitch and possibly visual brightness, but not numerical value. Semantically incongruent pitch height-timbral brightness shifts produced significantly slower reaction time (RT) and higher error compared to congruent pairs. In the visual task, incongruent pairings of grey squares and tones elicited slower RTs than congruent pairings (in two experiments). No interference was observed in the number comparison task. These findings shed light on the embodied and multimodal nature of experiencing timbre.
- Relating timbre perception to musical craft practice: an empirical ethnographic approach. Charalampos Saitis, Bleiz Macsen Del Sette, Jordie Shier, and 1 more author. Triennial Conference of the European Society for the Cognitive Sciences of Music, 2024
In crafting musical expression, the digital instrument maker is required to manipulate digital, and increasingly AI, technology as an additional medium. This raises interesting but unexplored questions about the role and practice of timbre in the development and adoption of sound technologies and their surrounding sonic cultures and, conversely, their imprint on the perceptual experience of timbre. Previous empirical research studied how the latter relates to the creative practice of sound synthesis. Here we adopt an ethnographic approach to explore the relationship between timbre and broader creative and technological practices of digital lutherie. We aim to better understand how makers think about and engage with timbre, what current practices and technologies of instrument design enable timbre exploration during the creative craft process, and how this knowledge can expand and diversify our understanding of how timbre is perceived, represented, and generated. Reflexive thematic analysis is applied to structured interviews with 20 (minimum target) instrument makers from commercial, research, independent, and artistic backgrounds. Here both ‘instrument’ and ‘maker’ are broadly construed, including composers and performers who build bespoke instruments as well as live coders. Interviews were conducted remotely and lasted around 50 minutes. Preliminary findings suggest that the entanglement of timbre and musical craft practice takes several forms, including interactions with aesthetic values and acoustics of material, which can be described as occupying places across a space encompassing many different notions (subspaces) of timbre entangled with a wide range of epistemic instruments and sonic practices. Rather than being a limited scientific (and musical) idea rooted in the psychoacoustic “timbre space” model, timbre emerges in a dynamic relay between technology and creation. Our study thus presents an empirical ethnographic understanding of timbre from the maker’s perspective, informing future development of tools to assist timbre exploration in musical craft practice.
- Timbre Tools: Ethnographic perspectives on timbre and sonic cultures in hackathon designs. Charalampos Saitis, Bleiz Macsen Del Sette, Jordie Shier, and 5 more authors. International Audio Mostly Conference, 2024
Timbre is a nuanced yet abstractly defined concept. Its inherently subjective qualities make it challenging to design and work with. In this paper, we propose to explore the conceptualisation and negotiation of timbre within the creative practice of timbre tool makers. To this end, we hosted a hackathon event and performed an ethnographic study to explore how participants engaged with the notion of timbre and how their conception of timbre was shaped through social interactions and technological encounters. We present individual descriptions of the design process of each team and reflect across our data to identify commonalities in the ways that timbre is understood and informed by sound technologies and their surrounding sonic cultures, e.g., by relating concepts of timbre to metaphors. We further current understanding by offering novel interdisciplinary and multimodal insights into understandings of timbre.
- Automatic detection of moral values in music lyrics. Vjosa Preniqi, Iacopo Ghinassi, Julia Ive, and 2 more authors. International Society for Music Information Retrieval Conference, 2024
Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues. The possibility to extract morality rapidly from lyrics enables a deeper understanding of our music-listening behaviours. Building on the Moral Foundations Theory (MFT), we tasked a set of transformer-based language models (BERT) fine-tuned on 2,721 synthetic lyrics generated by a large language model (GPT-4) to detect moral values in 200 real music lyrics annotated by two experts. We evaluate their predictive capabilities against a series of baselines including out-of-domain (BERT fine-tuned on MFT-annotated social media texts) and zero-shot (GPT-4) classification. The proposed models yielded the best accuracy across experiments, with an average F1 weighted score of 0.8. This performance is, on average, 5% higher than out-of-domain and zero-shot models. When examining precision in binary classification, the proposed models perform on average 12% higher than the baselines. Our approach contributes to annotation-free and effective lyrics morality learning, and provides useful insights into the knowledge distillation of LLMs regarding moral expression in music, and the potential impact of these technologies on the creative industries and musical culture.
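As a rough illustration of the kind of pipeline described above, the sketch below fine-tunes a BERT classifier for multi-label moral-foundation detection with Hugging Face Transformers; the model name, label set, and toy lyrics are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: fine-tuning a BERT classifier for multi-label moral value
# detection in lyrics (Hugging Face Transformers + PyTorch). Label set and
# data are illustrative placeholders, not the paper's exact setup.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MORAL_FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "purity"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(MORAL_FOUNDATIONS),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

# Toy training examples: (lyric snippet, multi-hot moral labels).
train_data = [
    ("we stand together and protect our own", [1.0, 0.0, 1.0, 0.0, 0.0]),
    ("break the rules, answer to no one", [0.0, 0.0, 0.0, 1.0, 0.0]),
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=256, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)        # float labels for BCE
    return enc

loader = DataLoader(train_data, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        out = model(**batch)                    # loss computed internally
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```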
- Composer style-specific symbolic music generation using vector quantized discrete diffusion models. Jincheng Zhang, György Fazekas, and Charalampos Saitis. IEEE International Workshop on Machine Learning for Signal Processing, 2024
Emerging Denoising Diffusion Probabilistic Models (DDPM) have become increasingly utilised because of the promising results they have achieved in diverse generative tasks with continuous data, such as image and sound synthesis. Nonetheless, the success of diffusion models has not been fully extended to discrete symbolic music. We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models for the generation of symbolic music with desired composer styles. The trained VQ-VAE can represent symbolic music as a sequence of indexes that correspond to specific entries in a learned codebook. Subsequently, a discrete diffusion model is used to model the VQ-VAE’s discrete latent space. The diffusion model is trained to generate intermediate music sequences consisting of codebook indexes, which are then decoded to symbolic music using the VQ-VAE’s decoder. The evaluation results demonstrate that our model can generate symbolic music with target composer styles that meet the given conditions with a high accuracy of 72.36%. Our code is available at [URL will be provided here].
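The sketch below illustrates the vector-quantisation step at the core of a VQ-VAE as described above: each encoder output is snapped to its nearest codebook entry, and the resulting index sequence is what a discrete diffusion model would be trained on. Codebook size and latent dimensions are assumed for illustration.

```python
# Minimal sketch of VQ-VAE vector quantisation: map encoder outputs to their
# nearest codebook entries and return the index sequence that a discrete
# diffusion model would operate on. Sizes are illustrative.
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """z_e: (batch, seq_len, dim) encoder outputs; codebook: (K, dim)."""
    # Euclidean distance between every latent vector and every codebook entry.
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.shape[0], -1, -1))
    indices = dists.argmin(dim=-1)               # (batch, seq_len) discrete codes
    z_q = codebook[indices]                      # (batch, seq_len, dim) quantised latents
    # Straight-through estimator so gradients flow back to the encoder.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

codebook = torch.randn(512, 64)                  # K = 512 entries, 64-dim
z_e = torch.randn(2, 128, 64, requires_grad=True)   # two sequences of 128 latents
z_q, codes = quantize(z_e, codebook)
print(codes.shape)                               # torch.Size([2, 128])
```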
- Von A bis U: Die Vokalität von Instrumentalklangfarben (From A to U: The vocality of instrumental timbres). Christoph Reuter, Charalampos Saitis, Isabella Czedik-Eysenberg, and 1 more author. 40. Jahrestagung der Deutschen Gesellschaft für Musikpsychologie, 2024
- A multimodal understanding of the role of sound and music in gendered toy marketing. Luca Marinelli, Petra Lucht, and Charalampos Saitis. PsyArXiv, 2024
Literature in music theory and psychology shows that, even in isolation, musical sounds can reliably encode gender-loaded messages. In fact, musical material can be imbued with many ideological dimensions and gender is just one of them. Nonetheless, studies of the gendering of music within multimodal communicative events are sparse and lack an encompassing theoretical framework. The present study attempts to address this literature gap by means of a critical quantitative analysis of music in gendered toy marketing, which integrated a content analytical approach with multimodal affective and music-focused perceptual responses. Ratings were collected on a set of 606 commercials spanning a ten-year time frame, and strong gender polarisation was observed in nearly all of the collected variables. Gendered music styles in toy commercials were found to exhibit synergistic design choices, as music in masculine-targeted adverts was substantially more abrasive (louder, more inharmonious, and more distorted) than that in feminine-targeted ones. Toy advertising music thus appeared to be deliberately and consistently in line with traditional gender norms. In addition, music perceptual scales and voice-related content analytical variables were found to explain the heavily polarised affective ratings quite well. This study presents an empirical understanding of the gendering of music as constructed within multimodal discourse, reiterating the importance of the sociocultural underpinnings of music cognition. We provide a public repository with all code and data necessary to reproduce the results of this study at github.com/marinelliluca/music-role-gender-marketing.
- Real-time timbre remapping with differentiable DSP. Jordie Shier, Charalampos Saitis, Andrew Robertson, and 1 more author. International Conference on New Interfaces for Musical Expression, 2024
Timbre is a primary mode of expression in diverse musical contexts. However, prevalent audio-driven synthesis methods predominantly rely on pitch and loudness envelopes, effectively flattening timbral expression from the input. Our approach draws on the concept of timbre analogies and investigates how timbral expression from an input signal can be mapped onto controls for a synthesizer. Leveraging differentiable digital signal processing, our method facilitates direct optimization of synthesizer parameters through a novel feature difference loss. This loss function, designed to learn relative timbral differences between musical events, prioritizes the subtleties of graded timbre modulations within phrases, allowing for meaningful translations in a timbre space. Using snare drum performances as a case study, where timbral expression is central, we demonstrate real-time timbre remapping from acoustic snare drums to a differentiable synthesizer modeled after the Roland TR-808.
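A toy sketch of a feature-difference loss in the spirit of the abstract: instead of matching absolute feature values, the synthesiser output is optimised so that event-to-event changes in a timbral feature (here, spectral centroid) track those of the input performance. This is a simplified stand-in, not the authors' implementation.

```python
# Toy sketch of a feature-difference loss: match *relative* changes in a
# timbral feature (spectral centroid) between successive events, rather than
# absolute values. Simplified stand-in for illustration only.
import torch

def spectral_centroid(frames: torch.Tensor, sample_rate: float = 44100.0):
    """frames: (n_events, n_samples) -> spectral centroid in Hz per event."""
    spectrum = torch.abs(torch.fft.rfft(frames, dim=-1))
    freqs = torch.fft.rfftfreq(frames.shape[-1], d=1.0 / sample_rate)
    return (spectrum * freqs).sum(dim=-1) / (spectrum.sum(dim=-1) + 1e-8)

def feature_difference_loss(input_events, synth_events):
    """Compare event-to-event centroid differences of input and synth output."""
    c_in = spectral_centroid(input_events)
    c_out = spectral_centroid(synth_events)
    return torch.nn.functional.l1_loss(torch.diff(c_out), torch.diff(c_in))

# Example: three drum hits from the player vs. three synthesiser renderings.
input_events = torch.randn(3, 2048)
synth_events = torch.randn(3, 2048, requires_grad=True)
loss = feature_difference_loss(input_events, synth_events)
loss.backward()   # gradients flow to the (differentiable) synthesiser output
```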
- Building sketch-to-sound mapping with unsupervised feature extraction and interactive machine learning. Shuoyang Zheng, Bleiz M. Del Sette, Charalampos Saitis, and 2 more authors. International Conference on New Interfaces for Musical Expression, 2024
In this paper, we explore the interactive construction and exploration of mappings between visual sketches and musical controls. Interactive Machine Learning (IML) allows creators to construct mappings with personalised training examples. However, when it comes to high-dimensional data such as sketches, dimensionality reduction techniques are required to extract features for the IML model. We propose using unsupervised machine learning to encode sketches into lower-dimensional latent representations, which are then used as the source for the IML model to construct sketch-to-sound mappings. We build a proof-of-concept prototype and demonstrate it using two compositions. We reflect on the composing processes to discuss the controllability and explorability in mappings built by this approach and how they contribute to the musical expression.
- Deep learning-based audio representations for the analysis and visualisation of electronic dance music DJ mixes. Alexander Williams, Haokun Tian, Stefan Lattner, and 2 more authors. AES International Symposium on AI and the Musician, 2024
Electronic dance music (EDM), produced using computers and electronic instruments, is a collection of musical subgenres that emphasise timbre and rhythm over melody and harmony. It is usually presented through the medium of DJing, where tracks are curated and mixed sequentially to offer unique listening and dancing experiences. However, while key and tempo are available as annotations, DJs still rely on audition rather than metadata to examine and select tracks with complementary audio content. In this work, we investigate the use of deep learning-based representations (Complex Autoencoder and OpenL3) for analysing and visualising audio content on a corpus of DJ mixes with approximate transition timestamps and compare them with signal processing-based representations (joint time-frequency scattering transform and mel-frequency cepstral coefficients). Representations are computed once per second and visualised with UMAP dimensionality reduction. We propose heuristics, based on patterns observed in the visualisations and on time-sensitive Euclidean distances in the representation space, to compute DJ transition lengths, transition smoothness, and inter-song, song-to-song, and full-mix audio content consistency from the audio representations and rough DJ transition timestamps. Our method enables the visualisation of variations within music tracks, facilitating the analysis of DJ mixes and individual EDM tracks. This approach supports musicians in making informed creative decisions based on such visualisations. We share our code, dataset annotations, computed audio representations, and trained CAE model. We encourage researchers and music enthusiasts alike to analyse their own music using our tools: github.com/alexjameswilliams/EDMAudioRepresentations.
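A minimal sketch of the representation-and-projection step described above, assuming the openl3, umap-learn, soundfile, and matplotlib packages; the file path and embedding parameters are placeholders.

```python
# Minimal sketch: one-embedding-per-second OpenL3 features projected to 2-D
# with UMAP for visualising a DJ mix. File path and parameters are placeholders.
import openl3
import soundfile as sf
import umap
import matplotlib.pyplot as plt

audio, sr = sf.read("dj_mix.wav")

# One 512-dimensional music embedding per second of audio.
embeddings, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="music", embedding_size=512, hop_size=1.0
)

# Non-linear projection to two dimensions for visualisation.
projection = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(projection[:, 0], projection[:, 1], c=timestamps, s=5, cmap="viridis")
plt.colorbar(label="time (s)")
plt.title("DJ mix trajectory in OpenL3/UMAP space")
plt.show()
```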
- Timbral effects of col legno tratto techniques on bowed cello sounds. Montserrat Pàmies-Vilà and Charalampos Saitis. 186th Meeting of the Acoustical Society of America/Acoustics Week, 2024
There are several playing techniques for bowed-string instruments that make use of the wooden stick of the bow. The stick is quite often used to strike the strings gently (col legno battuto) and less commonly to bow on them (col legno tratto). Col legno has existed since the 17th century, and it is often used in modern compositions. When the stick is drawn across the string (tratto), the contact between the scrubbing stick and the string introduces noise. The player may choose to combine both hair and stick, depending on the desired sound. To evaluate the timbral effects of col legno tratto on the cello sound, the current study compares three variations across ordinary and contemporary bowing techniques: using only the hair, using both hair and stick, and using only the stick. Motion capture and audio-video recordings with expert cello players show how the bow tilt varies greatly between the three cases. Suitable audio descriptors of timbre are evaluated, which may help to correlate the observed playing parameters and sound properties with the semantic attributes used by experts to describe the timbre of these techniques.
- Giving instruments a voice: Are there vowel-like qualities in the timbres of musical instruments? Christoph Reuter, Charalampos Saitis, Isabella Czedik-Eysenberg, and 1 more author. 50. Deutsche Jahrestagung für Akustik, 2024
Scholars have long explored similarities between musical instrument sounds and vowel qualities of human voice sounds. From a psychoacoustic standpoint, however, this relationship remains poorly understood. Here, we seek to address whether musical instruments truly exhibit vowel-like qualities, whether specific instruments, registers, and dynamic levels stand out, and what the acoustical correlates of this relation might be. In an online experiment, German native speakers listen to the sounds of oboe, clarinet, flute, bassoon, trumpet, trombone, French horn, tuba, violin, viola, cello, and double bass in three registers and two dynamic levels. Their task is to assign the following vowels and umlauts (in German pronunciation) to instrument sounds: a, å, e, i, o, u, ä, ö, and ü. Furthermore, participants rate the strength of vowel similarity. Preliminary analyses (of 43 participants) suggest that although vowel similarity is rated approximately equally high, vowel associations do not seem to be equally consistent for different instruments. Particularly strong associations are observed for bassoon and tuba with the vowel o, and for oboe and violin with the vowel i. Audio features will be used to model vowel similarity.
- Explainable modeling of gender-targeting practices in toy advertising sound and music. Luca Marinelli and Charalampos Saitis. 1st Workshop on Explainable Machine Learning for Speech and Audio, 49th IEEE International Conference on Acoustics, Speech and Signal Processing, 2024
This study examines gender coding in sound and music, in a context where music plays a supportive role to other modalities, such as in toy advertising. We trained a series of binary XGBoost classifiers on handcrafted features extracted from the soundtracks and then performed SAGE and SHAP analyses to identify key audio features in predicting the gender target of the ads. Our analysis reveals that timbral dimensions play a prominent role and that commercials aimed at girls tend to be more harmonious and rhythmical, with a broader and smoother spectrum, while those targeting boys are characterised by higher loudness, spectral entropy, and roughness. Mixed audience commercials instead appear to be as rhythmical as girls-only ads, although slower, but show intermediate characteristics in terms of harmonicity and roughness. This study highlights the importance of music in shaping societal norms and the need for greater accountability in its use in marketing and other industries. We provide a public repository containing all code and data used in this study.
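A minimal sketch of the classify-then-explain pattern described above: a binary XGBoost classifier on handcrafted audio features followed by a SHAP tree explanation. Feature names and data are placeholders, not the study's dataset.

```python
# Minimal sketch: binary XGBoost classifier on handcrafted audio features,
# explained with SHAP. Feature names and data are illustrative placeholders.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
feature_names = ["loudness", "spectral_entropy", "roughness",
                 "harmonicity", "tempo", "spectral_centroid"]
X = rng.normal(size=(200, len(feature_names)))   # one row per soundtrack
y = rng.integers(0, 2, size=200)                 # 0 = feminine, 1 = masculine target

model = XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, y)

# SHAP values attribute each prediction to individual audio features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
mean_abs = np.abs(shap_values).mean(axis=0)
for name, importance in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name:18s} {importance:.3f}")
```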
- A review of differentiable digital signal processing for music and speech synthesis. Ben Hayes, Jordie Shier, György Fazekas, and 2 more authors. Frontiers in Signal Processing, 2024
The term differentiable digital signal processing describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming. Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research.
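A toy end-to-end example of the core idea: a digital signal processor (here, a one-pole lowpass filter with a learnable coefficient) written in a differentiable framework, so that a loss on its audio output can be backpropagated to its parameter.

```python
# Toy example of differentiable DSP: a one-pole lowpass filter whose
# coefficient is learned by backpropagating an audio-domain loss through
# the filter recursion. Illustrative only.
import torch

def one_pole_lowpass(x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """y[n] = (1 - a) * x[n] + a * y[n - 1], with 0 < a < 1."""
    prev = torch.zeros(())
    out = []
    for n in range(x.shape[0]):
        prev = (1 - a) * x[n] + a * prev
        out.append(prev)
    return torch.stack(out)

torch.manual_seed(0)
x = torch.randn(1024)
with torch.no_grad():
    target = one_pole_lowpass(x, torch.tensor(0.9))   # the "recording" to match

raw = torch.tensor(0.0, requires_grad=True)           # unconstrained parameter
optimizer = torch.optim.Adam([raw], lr=0.05)

for step in range(200):
    a = torch.sigmoid(raw)                             # keep coefficient in (0, 1)
    loss = torch.mean((one_pole_lowpass(x, a) - target) ** 2)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"recovered coefficient: {torch.sigmoid(raw).item():.3f}")  # should approach 0.9
```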
2023
- A body-centred perspective to chronic pain self-management using generative sonification. Bleiz Macsen Del Sette and Charalampos Saitis. Annual Workshop of the Music and Human-Computer Interaction Networks (CHIME), 2023
- Beat and Downbeat Tracking with Generative Embeddings. Haokun Tian, Kun Liu, and Magdalena Fuentes. Late Breaking Demo of the 24th International Society for Music Information Retrieval Conference, 2023
It is standard practice to use spectrograms as input features for discriminative MIR tasks. However, recent research showed using representations produced by Jukebox (a music language model) led to better model performance. This was tested on music tagging, genre classification, key detection, emotion recognition, and music transcription. In this paper, we test it on beat and downbeat tracking. Specifically, we compare compressed Jukebox embeddings with spectrograms as input to a model that jointly predicts beat, downbeat, and tempo. Experiments show that the two inputs bring comparable results for beat tracking, while using Jukebox embeddings leads to significant improvements for downbeat tracking.
- Soundscapes of morality: Linking music preferences and moral values through lyrics and audio. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. PLOS ONE, 2023
Music is a fundamental element in every culture, serving as a universal means of expressing our emotions, feelings, and beliefs. This work investigates the link between our moral values and musical choices through lyrics and audio analyses. We align the psychometric scores of 1,480 participants to acoustics and lyrics features obtained from the top 5 songs of their preferred music artists from Facebook Page Likes. We employ a variety of lyric text processing techniques, including lexicon-based approaches and BERT-based embeddings, to identify each song’s narrative, moral valence, attitude, and emotions. In addition, we extract both low- and high-level audio features to comprehend the encoded information in participants’ musical choices and improve the moral inferences. We propose a Machine Learning approach and assess the predictive power of lyrical and acoustic features separately and in a multimodal framework for predicting moral values. Results indicate that lyrics and audio features from the artists people like inform us about their morality. Though the most predictive features vary per moral value, the models that utilised a combination of lyrics and audio characteristics were the most successful in predicting moral values, outperforming the models that only used basic features such as user demographics, the popularity of the artists, and the number of likes per user. Audio features boosted the accuracy in the prediction of empathy and equality compared to textual features, while the opposite happened for hierarchy and tradition, where higher prediction scores were driven by lyrical features. This demonstrates the importance of both lyrics and audio features in capturing moral values. The insights gained from our study have a broad range of potential uses, including customising the music experience to meet individual needs, music rehabilitation, or even effective communication campaign crafting.
- Modelling Moral Traits with Music Listening Preferences and Demographics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. Music in the AI Era. CMMR 2021. Springer Lecture Notes in Computer Science, vol 13770, 2023
Music has always been an integral part of our everyday lives, through which we express feelings, emotions, and concepts. Here, we explore the association between music genres, demographics, and moral values, employing data from an ad-hoc online survey and the Music Learning Histories Dataset. To further characterise the music preferences of the participants, the generalist/specialist (GS) score was employed. We exploit both classification and regression approaches to assess the predictive power of music preferences for inferring demographic attributes as well as the moral values of the participants. Our findings indicate that moral values are hard to predict from music listening behaviours alone (.62 average AUROC), while adding basic sociodemographic information raises the prediction score by 4% on average (.66 average AUROC), with the Purity foundation consistently achieving the highest accuracy. Similar results are obtained from the regression analysis. Finally, we provide insights into the most predictive music behaviours associated with each moral value, which can inform a wide range of applications from rehabilitation practices to communication campaign design.
- Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions. Jincheng Zhang, György Fazekas, and Charalampos Saitis. arXiv, 2023
Diffusion models have shown promising results for a wide range of generative tasks with continuous data, such as image and audio synthesis. However, little progress has been made in using diffusion models to generate discrete symbolic music, because this new class of generative models is not well suited to discrete data and its iterative sampling process is computationally expensive. In this work, we propose a diffusion model combined with a Generative Adversarial Network, aiming to (i) alleviate one of the remaining challenges in algorithmic music generation, namely the control of generation towards a target emotion, and (ii) mitigate the slow sampling drawback of diffusion models applied to symbolic music generation. We first used a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels and then used those to train a diffusion model. Our results demonstrate that our diffusion model can be successfully controlled to generate symbolic music with a desired emotion. Our model achieves several orders of magnitude improvement in computational cost, requiring merely four time steps to denoise, whereas current state-of-the-art diffusion models for symbolic music generation require steps on the order of thousands.
- Sound of Care: Towards a Co-Operative AI Digital Pain Companion to Support People with Chronic Primary Pain. Bleiz Macsen Del Sette, Dawn Carnes, and Charalampos Saitis. Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing, 2023
This work investigates the role of sound and technology in the everyday lives of people with chronic primary pain. Our primary goal was to inform the first participatory design workshop of Sound of Care, a new eHealth system for pain self-management. We used an ethical stakeholder analysis to inform a round of exploratory interviews, run with 8 participants including people with chronic primary pain, carers, and healthcare workers. We found that sound and technology serve as important but often unstructured tools, helping with distraction, mood regulation, and sleep. The experience of pain and musical preferences are highly personal, and communicating or understanding pain can be challenging, even among family members. To address the gaps in current chronic pain self-management care, we propose the use of a sound-based, AI-driven system, a Digital Pain Companion, which uses sonification to create a shared decision-making space, enhancing agency over treatment in a co-operative care environment.
- Gender-Coded Sound: Analysing the Gendering of Music in Toy Commercials via Multi-Task Learning. Luca Marinelli, György Fazekas, and Charalampos Saitis. 24th International Society for Music Information Retrieval Conference, 2023
Music can convey ideological stances, and gender is just one of them. Evidence from musicology and psychology research shows that gender-loaded messages can be reliably encoded and decoded via musical sounds. However, much of this evidence comes from examining music in isolation, while studies of the gendering of music within multimodal communicative events are sparse. In this paper, we outline a method to automatically analyse how music in TV advertising aimed at children may be deliberately used to reinforce traditional gender roles. Our dataset of 606 commercials included music-focused mid-level perceptual features, multimodal aesthetic emotions, and content analytical items. Despite its limited size, and because of the extreme gender polarisation inherent in toy advertisements, we obtained noteworthy results by leveraging multi-task transfer learning on our densely annotated dataset. The models were trained to categorise commercials based on their intended target audience, specifically distinguishing between masculine, feminine, and mixed audiences. Additionally, to provide explainability for the classification in gender targets, the models were jointly trained to perform regressions on emotion ratings across six scales, and on mid-level musical perceptual attributes across twelve scales. Standing in the context of MIR, computational social studies and critical analysis, this study may benefit not only music scholars but also advertisers, policymakers, and broadcasters.
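A minimal sketch of a multi-task setup of the kind described above: a shared encoder feeding one classification head (target audience) and two regression heads (emotion and mid-level perceptual ratings), trained with a joint loss. Layer sizes and data are illustrative placeholders.

```python
# Minimal sketch of multi-task learning: a shared encoder with one
# classification head (target audience) and two regression heads (emotion
# and mid-level perceptual ratings), trained with a joint loss.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, n_features=128, n_classes=3, n_emotions=6, n_midlevel=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()
        )
        self.audience_head = nn.Linear(64, n_classes)   # masculine / feminine / mixed
        self.emotion_head = nn.Linear(64, n_emotions)   # 6 aesthetic emotion scales
        self.midlevel_head = nn.Linear(64, n_midlevel)  # 12 perceptual scales

    def forward(self, x):
        h = self.encoder(x)
        return self.audience_head(h), self.emotion_head(h), self.midlevel_head(h)

model = MultiTaskModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Toy batch: 8 commercials with audio features and annotations.
x = torch.randn(8, 128)
audience = torch.randint(0, 3, (8,))
emotions = torch.rand(8, 6)
midlevel = torch.rand(8, 12)

logits, emo_pred, mid_pred = model(x)
loss = ce(logits, audience) + mse(emo_pred, emotions) + mse(mid_pred, midlevel)
loss.backward()
optimizer.step()
```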
- Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis. Jordie Shier, Franco Caspe, Andrew Robertson, and 3 more authors. 10th Convention of the European Acoustics Association, 2023
Differentiable digital signal processing (DDSP) techniques, including methods for audio synthesis, have gained attention in recent years and lend themselves to interpretability in the parameter space. However, current differentiable synthesis methods have not explicitly sought to model the transient portion of signals, which is important for percussive sounds. In this work, we present a unified synthesis framework aiming to address transient generation and percussive synthesis within a DDSP framework. To this end, we propose a model for percussive synthesis that builds on sinusoidal modeling synthesis and incorporates a modulated temporal convolutional network for transient generation. We use a modified sinusoidal peak picking algorithm to generate time-varying non-harmonic sinusoids and pair it with differentiable noise and transient encoders that are jointly trained to reconstruct drumset sounds. We compute a set of reconstruction metrics using a large dataset of acoustic and electronic percussion samples that show that our method leads to improved onset signal reconstruction for membranophone percussion instruments.
- The Responsibility Problem in Neural Networks with Unordered Targets. Ben Hayes, Charalampos Saitis, and György Fazekas. 11th International Conference on Learning Representations, Tiny Papers, 2023
We discuss the discontinuities that arise when mapping unordered objects to neural network outputs of fixed permutation, referred to as the responsibility problem. Prior work has proved the existence of the issue by identifying a single discontinuity. Here, we show that discontinuities under such models are uncountably infinite, motivating further research into neural networks for unordered data.
- Interactive Neural Resonators. Rodrigo Diaz, Charalampos Saitis, and Mark Sandler. International Conference on New Interfaces for Musical Expression, 2023
In this work, we propose a method for the controllable synthesis of real-time contact sounds using neural resonators. Previous works have used physically inspired statistical methods and physical modelling for object materials and excitation signals. Our method incorporates differentiable second-order resonators and estimates their coefficients using a neural network that is conditioned on physical parameters. This allows for interactive dynamic control and the generation of novel sounds in an intuitive manner. We demonstrate the practical implementation of our method and explore its potential creative applications.
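A toy sketch of a single differentiable second-order resonator: its coefficients are derived from a centre frequency and decay (parameters a conditioning network could predict), and the recursion is written in PyTorch so that gradients flow through it. This illustrates the general idea, not the authors' implementation.

```python
# Toy sketch of a differentiable second-order resonator: coefficients are
# derived from a centre frequency and decay, and the recursion is written in
# PyTorch so gradients can flow through it (e.g. from a conditioning network).
import math
import torch

def resonator(excitation: torch.Tensor, freq_hz: torch.Tensor,
              decay: torch.Tensor, sample_rate: float = 44100.0):
    """Two-pole resonator: y[n] = x[n] + 2 r cos(w) y[n-1] - r^2 y[n-2]."""
    w = 2 * math.pi * freq_hz / sample_rate
    r = torch.exp(-decay / sample_rate)        # pole radius from decay rate
    a1, a2 = 2 * r * torch.cos(w), -r ** 2
    y1 = torch.zeros(())
    y2 = torch.zeros(())
    out = []
    for n in range(excitation.shape[0]):
        y0 = excitation[n] + a1 * y1 + a2 * y2
        out.append(y0)
        y1, y2 = y0, y1
    return torch.stack(out)

# Impulse excitation through a 440 Hz mode with learnable frequency and decay.
excitation = torch.zeros(4096)
excitation[0] = 1.0
freq = torch.tensor(440.0, requires_grad=True)
decay = torch.tensor(30.0, requires_grad=True)
audio = resonator(excitation, freq, decay)
audio.sum().backward()                          # gradients reach freq and decay
```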
- Gender differences in Moral Valence, Sentiment, and Narratives of Song Lyrics Over Time. Vjosa Preniqi, Kyriaki Kalimeri, Andreas Kaltenbrunner, and 1 more author. 9th International Conference on Computational Social Science, 2023
- Analysing the Gendering of Music in Toy Commercials via Mid-level Perceptual Features. Luca Marinelli and Charalampos Saitis. 17th International Conference on Music Perception and Cognition, 2023
- Exploring the Role of Audio and Lyrics in Explaining Moral Worldviews. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 17th International Conference on Music Perception and Cognition, 2023
- Evolution of Moral Valence in Lyrics Over Time. Vjosa Preniqi, Kyriaki Kalimeri, Andreas Kaltenbrunner, and 1 more author. 17th International Conference on Music Perception and Cognition, 2023
- When ChatGPT Talks Timbre. Charalampos Saitis and Kai Siedenburg. 3rd International Conference on Timbre, 2023
- The language of sounds unheard: Exploring musical timbre semantics of large language models. Kai Siedenburg and Charalampos Saitis. arXiv, 2023
Semantic dimensions of sound have been playing a central role in understanding the nature of auditory sensory experience as well as the broader relation between perception, language, and meaning. Accordingly, and given the recent proliferation of large language models (LLMs), here we asked whether such models exhibit an organisation of perceptual semantics similar to those observed in humans. Specifically, we prompted ChatGPT, a chatbot based on a state-of-the-art LLM, to rate musical instrument sounds on a set of 20 semantic scales. We elicited multiple responses in separate chats, analogous to having multiple human raters. ChatGPT generated semantic profiles that only partially correlated with human ratings, yet showed robust agreement along well-known psychophysical dimensions of musical sounds such as brightness (bright-dark) and pitch height (deep-high). Exploratory factor analysis suggested the same dimensionality but different spatial configuration of a latent factor space between the chatbot and human ratings. Unexpectedly, the chatbot showed degrees of internal variability that were comparable in magnitude to that of human ratings. Our work highlights the potential of LLMs to capture salient dimensions of human sensory experience.
- Sinusoidal Frequency Estimation by Gradient Descent. Ben Hayes, Charalampos Saitis, and György Fazekas. 48th IEEE International Conference on Acoustics, Speech and Signal Processing, 2023
Sinusoidal parameter estimation is a fundamental task in applications from spectral analysis to time-series forecasting. Estimating the sinusoidal frequency parameter by gradient descent is, however, often impossible as the error function is non-convex and densely populated with local minima. The growing family of differentiable signal processing methods has therefore been unable to tune the frequency of oscillatory components, preventing their use in a broad range of applications. This work presents a technique for joint sinusoidal frequency and amplitude estimation using the Wirtinger derivatives of a complex exponential surrogate and any first order gradient-based optimiser, enabling end-to-end training of neural network controllers for unconstrained sinusoidal models.
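A minimal sketch of the underlying idea: parameterise an oscillator by a single complex number z, synthesise the signal as Re(z^n), and let a first-order optimiser adjust z directly (PyTorch supplies the complex, Wirtinger-style gradients); the frequency estimate is then the angle of z. A simplified illustration, not the paper's full method.

```python
# Minimal sketch of frequency estimation with a complex exponential surrogate:
# the oscillator is a single complex parameter z, the signal is Re(z^n), and a
# gradient-based optimiser adjusts z directly. Simplified illustration only.
import math
import torch

sample_rate = 16000.0
n = torch.arange(256, dtype=torch.float32)

true_freq = 1234.0                                    # Hz
target = torch.cos(2 * math.pi * true_freq / sample_rate * n)

# Initialise z inside the unit circle at an arbitrary angular frequency.
z = torch.polar(torch.tensor(0.95), torch.tensor(0.3)).requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=1e-2)

for step in range(2000):
    pred = torch.exp(n * torch.log(z)).real           # Re(z^n), a damped sinusoid
    loss = torch.mean((pred - target) ** 2)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

est_freq = torch.angle(z).item() * sample_rate / (2 * math.pi)
print(f"estimated frequency: {est_freq:.1f} Hz (true: {true_freq} Hz)")
```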
- Rigid-Body Sound Synthesis with Differentiable Modal Resonators. Rodrigo Diaz, Ben Hayes, Charalampos Saitis, and 2 more authors. 48th IEEE International Conference on Acoustics, Speech and Signal Processing, 2023
Physical models of rigid bodies are used for sound synthesis in applications from virtual environments to music production. Traditional methods, such as modal synthesis, often rely on computationally expensive numerical solvers, while recent deep learning approaches are limited by post-processing of their results. In this work, we present a novel end-to-end framework for training a deep neural network to generate modal resonators for a given 2D shape and material using a bank of differentiable IIR filters. We demonstrate our method on a dataset of synthetic objects but train our model using an audio-domain objective, paving the way for physically-informed synthesisers to be learned directly from recordings of real-world objects.
- Timbre semantic associations vary both between and within instruments: An empirical study incorporating register and pitch height. Lindsey Reymore, Jason Noble, Charalampos Saitis, and 2 more authors. Music Perception, 2023
The main objective of this study is to understand how timbre semantic associations — for example, a sound’s timbre perceived as bright, rough, or hollow — vary with register and pitch height across instruments. In this experiment, 540 online participants rated single, sustained notes from eight Western orchestral instruments (flute, oboe, bass clarinet, trumpet, trombone, violin, cello, and vibraphone) across three registers (low, medium, and high) on 20 semantic scales derived from Reymore and Huron (2020). The 24 two-second stimuli, equalized in loudness, were produced using the Vienna Symphonic Library. Exploratory modeling examined relationships between mean ratings of each semantic dimension and instrument, register, and participant musician identity (‘‘musician’’ vs. ‘‘nonmusician’’). For most semantic descriptors, both register and instrument were significant predictors, though the amount of variance explained differed (marginal R^2). Terms that had the strongest positive relationships with register include shrill/harsh/noisy, sparkling/brilliant/bright, ringing/long decay, and percussive. Terms with the strongest negative relationships with register include deep/thick/heavy, raspy/grainy/gravelly, hollow, and woody. Post hoc modeling using only pitch height and only register to predict mean semantic rating suggests that pitch height may explain more variance than does register. Results help clarify the influence of both instrument and relative register (and pitch height) on common timbre semantic associations.
- Proceedings of the 3rd International Conference on Timbre. Eds: Marcelo Caetano, Zachary Wallmark, Asterios Zacharakis, and 2 more editors. The School of Music Studies, Aristotle University of Thessaloniki, 2023
2022
- Real-time timbre mapping for synthesized percussive performance. Jordie Shier. DMRN+17: Digital Music Research Network One-Day Workshop, 2022
- More Than Words: Linking Music Preferences and Moral Values Through Lyrics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 23rd International Society for Music Information Retrieval Conference, 2022
This study explores the association between music preferences and moral values by applying text analysis techniques to lyrics. Harvesting data from a Facebook-hosted application, we align psychometric scores of 1,386 users to lyrics from the top 5 songs of their preferred music artists as emerged from Facebook Page Likes. We extract a set of lyrical features related to each song’s overarching narrative, moral valence, sentiment, and emotion. A machine learning framework was designed to exploit regression approaches and evaluate the predictive power of lyrical features for inferring moral values. Results suggest that lyrics from top songs of artists people like inform their morality. Virtues of hierarchy and tradition achieve higher prediction scores (between .20 and .30) than values of empathy and equality (between .08 and .11), while basic demographic variables only account for a small part in the models’ explainability. This shows the importance of music listening behaviours, as assessed via lyrical preferences, alone in capturing moral values. We discuss the technological and musicological implications and possible future improvements.
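A minimal sketch of the regression framework described above: cross-validated ridge regression predicting a moral-value score from lyrical features. Feature names and data are random placeholders, not the study's pipeline.

```python
# Minimal sketch of the regression setup: predict a moral-value score from
# lyrical features with cross-validated ridge regression. Feature names and
# data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["moral_valence", "sentiment", "narrative_score",
                 "emotion_joy", "emotion_anger", "age", "gender"]
X = rng.normal(size=(1386, len(feature_names)))   # one row per participant
y = rng.normal(size=1386)                         # e.g. a "tradition" score

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean cross-validated R^2: {r2_scores.mean():.2f}")
```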
- timbre.fun: A gamified interactive system for crowdsourcing a timbre semantic vocabulary. Ben Hayes, Charalampos Saitis, and György Fazekas. 24th International Congress on Acoustics, 2022
We present timbre.fun, a web-based gamified interactive system where users create sounds in response to semantic prompts (e.g., bright, rough) through exploring a two-dimensional control space that maps nonlinearly to the parameters of a simple hybrid wavetable and amplitude-modulation synthesizer. The current version features 25 semantic adjectives mined from a popular synthesis forum. As well as creating sounds, users can explore heatmaps generated from others’ responses, and fit a classifier (k-nearest neighbors) in-browser. timbre.fun is based on recent work, including by the authors, which studied timbre semantic associations through prompted synthesis paradigms. The interactive is embedded in a digital exhibition on sensory variation and interaction (seeingmusic.app) which debuted at the 2021 Edinburgh Science Festival, where it was visited by 197 users from 21 countries over 16 days. As it continues running online, a further 596 visitors from 35 countries have engaged. To date 579 sounds have been created and tagged, which will facilitate parallel research in timbre semantics and neural audio synthesis. Future work will include further gamifying the data collection pipeline, including leveling-up to unlock new words and synthesizers, and a full open-source release.
- Seeing Music: Leveraging citizen science and gamification to study cross-sensory associations. Charalampos Saitis, Christine Cuskley, and Sebastian Löbbers. 20th International Multisensory Research Forum, 2022
Our recent research has shown that people lack knowledge about how the senses interact and are unaware of many common forms of sensory and perceptual variation. We present Seeing Music, a digital interactive exhibition and audiovisual game that translates high-level scientific understanding of sensory variation and cross-modality into knowledge for the public. Using a narrative-driven gamified approach, players are tasked with communicating human music to an extraterrestrial intelligence through visual shape, color and texture using two-dimensional selector panels. Music snippets (12–24 s long) are played continuously in a loop, taken from three custom instrumental compositions designed to vary systematically in terms of timbre, melody, and rhythm. Players can “level-up” to unlock new visual features and musical snippets, and explore and evaluate collaborative visualizations made by others. Outside the game, a series of interactive slideshows help visitors learn more about sensory experience, sensory diversity, and how our senses make us human. The exhibition debuted at the 2021 Edinburgh Science Festival, where it was visited by 197 users coming from 21 countries (134 visitors from the UK) over 16 days. As it continues running online, a further 596 visitors from 35 countries (164 from the UK) have engaged. To date, 169 players of Seeing Music have produced more than 42,500 audiovisual mapping datapoints for scientific research purposes. Preliminary analysis suggests that music with less high-frequency energy was mapped to less complex and rounder shapes, bluer and less bright hues, and less dense textures. These trends confirm auditory-visual correspondences previously reported in more controlled laboratory studies, while also offering new insight into how different auditory-visual associations interact with each other. Future work includes improving user motivation and interaction, refining data collection, a full open-source release, and adding new games and informational material about research on the senses.
- Exploring the Dimensionality of the Affective Space Elicited by Gendered Toy Commercials. Luca Marinelli and Charalampos Saitis. 9th European Conference on Media, Communication & Film, 2022
As evidenced by a large body of literature, the gender-stereotyped nature of toy adverts has been widely scrutinised. However, little work has been done in examining the affective impact of these commercials on the audience. It has been proven that repeated exposure to gender-stereotyped messages has the capacity to influence behaviours, beliefs and attitudes. In particular, media can influence emotion socialization, and gender differences in emotion expression might emerge (Scherr 2018). In this study, we investigated whether commercials elicit emotions at different intensities with respect to the gender of their target audience. Furthermore, we evaluated whether such emotions follow distinct underlying latent structures. A total of 1081 ratings of 10 unipolar aesthetic emotion scales were collected for 135 commercials (45 for each masculine, feminine, and mixed target audience) from 80 UK nationals (35 F, 45 M) aged 18 to 76. The main reason for collecting our ratings from adults was that, already by age 11, children exhibit adult-like emotion recognition capabilities (Hunter 2011). Seven scales showed significant differences between commercials for distinct audiences; with five, in particular, revealing a strong polarization (happiness, amusement, beauty, calm, and anger). In addition, parallel analysis showed that a minimum of three factors are needed to explain the ratings for masculine and mixed targeted commercials, while only two are needed for the feminine ones, thereby indicating that the latter elicit emotions following a simpler underlying structure. Both results reflect larger issues in toy marketing, where gender essentialism is still dominant, and prompt further discussion and research.
- Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks. Russell Sammut Bonnici, Martin Benning, and Charalampos Saitis. International Joint Conference on Neural Networks, 2022
This work investigates the application of deep learning to timbre transfer. The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio. It is applied to the Flickr 8k Audio dataset for transferring vocal timbre between speakers and to the URMP dataset for transferring musical timbre between instruments. Variations of the approach were trained, and performance was compared using the SSIM (Structural Similarity Index) and FAD (Fréchet Audio Distance) metrics. A many-to-many approach was found to outperform a one-to-one approach in terms of reconstructive capability, while the one-to-one approach showed better results in terms of adversarial translation. Adopting a basic rather than a bottleneck residual block design is more suitable for enriching content information in the latent space, and whether the cyclic loss follows a variational or vanilla autoencoder formulation has no significant impact on the reconstructive and adversarial translation aspects of the model.
- Disembodied Timbres: A Study on Semantically Prompted FM Synthesis. Ben Hayes, Charalampos Saitis, and György Fazekas. Journal of the Audio Engineering Society, 2022
Disembodied electronic sounds constitute a large part of the modern auditory lexicon, but research into timbre perception has focused mostly on the tones of conventional acoustic musical instruments. It is unclear whether insights from these studies generalise to electronic sounds, nor is it obvious how these relate to the creation of such sounds. In this work, we present an experiment on the semantic associations of sounds produced by FM synthesis with the aim of identifying whether existing models of timbre semantics are appropriate for such sounds. We applied a novel experimental paradigm in which experienced sound designers responded to semantic prompts by programming a synthesiser, and provided semantic ratings on the sounds they created. Exploratory factor analysis revealed a five-dimensional semantic space. The first two factors mapped well to the concepts of luminance, texture, and mass. The remaining three factors did not have clear parallels, but correlation analysis with acoustic descriptors suggested an acoustical relationship to luminance and texture. Our results suggest that further enquiry into the timbres of disembodied electronic sounds, their synthesis, and their semantic associations would be worthwhile, and that this could benefit research into auditory perception and cognition, as well as synthesis control and audio engineering.
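A minimal sketch of an exploratory factor analysis over semantic ratings, here with varimax rotation via scikit-learn; the rating matrix and scale names are placeholders, not the study's data.

```python
# Minimal sketch: exploratory factor analysis of semantic ratings with
# varimax rotation (scikit-learn). Ratings and scale names are placeholders.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
scales = ["bright", "rough", "thick", "metallic", "hollow", "warm", "sharp"]
ratings = rng.uniform(1, 7, size=(300, len(scales)))   # sounds x semantic scales

fa = FactorAnalysis(n_components=5, rotation="varimax")
fa.fit(ratings)

# Loadings: how strongly each semantic scale is associated with each factor.
for scale, loadings in zip(scales, fa.components_.T):
    print(scale, np.round(loadings, 2))
```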
- Deep Embeddings for Robust User-Based Amateur Vocal Percussion Transcription. Alejandro Delgado, Emir Demirel, Vinod Subramanian, and 2 more authors. 19th Sound and Music Computing Conference, 2022
Vocal Percussion Transcription (VPT) is concerned with the automatic detection and classification of vocal percussion sound events, allowing music creators and producers, among others, to sketch drum lines on the fly. VPT classifiers usually learn best from small user-specific datasets, which usually restricts modelling to small input feature sets to avoid overfitting. This study explores several deep supervised learning strategies to obtain informative feature sets for amateur VPT classification. We evaluated their performance on regular VPT classification tasks and compared them with several baseline approaches, including feature selection methods and a state-of-the-art speech recognition engine. The proposed learning models were supervised with several label sets containing information from four different levels of abstraction: instrument-level, syllable-level, phoneme-level, and boxeme-level. Results suggest that convolutional neural networks supervised with syllable-level annotations produced the most informative embeddings for VPT systems, which can be used as input representations for fitting classifiers. Finally, we used back-propagation-based saliency maps to investigate the importance of different spectrogram regions for feature learning.
- Auditory brightness perception investigated by unimodal and crossmodal interference. Charalampos Saitis, Zachary Wallmark, and Annie Liu. Biennial Meeting of the Society for Music Perception and Cognition, 2022
Brightness is among the most studied aspects of timbre perception. Psychoacoustically, sounds described as ”bright” vs ”dark” typically exhibit a high vs low frequency emphasis in the spectrum. However, relatively little is known about the neurocognitive mechanisms that facilitate these “metaphors we listen with.” Do they originate in universal mental representations common to more than one sensory modality? Triangulating three different interaction paradigms, we investigated using speeded identification whether unimodal and crossmodal interference occurs when timbral brightness, as modelled by the centroid of the spectral envelope, and 1) pitch height, 2) visual brightness, 3) numerical value processing are semantically incongruent. In three online pilot tasks, 58 participants were presented a baseline stimulus (a pitch, gray square, or numeral) then asked to quickly identify a target stimulus that is higher/lower, brighter/darker, or greater/less than the baseline, respectively, after being primed with a bright or dark synthetic harmonic tone. Additionally, in the pitch and visual tasks, a deceptive same-target condition was included. Results suggest that timbral brightness modulates the perception of pitch and visual brightness, but not numerical value. Semantically incongruent pitch height-timbral brightness shifts produced significantly slower choice reaction time and higher error compared to congruent pairs; timbral brightness also had a strong biasing effect in the same-target condition (i.e., people heard the same pitch as higher when the target tone was timbrally brighter than the baseline, and vice versa with darker tones). In the visual task, incongruent pairings of gray squares and tones elicited slower choice reaction times than congruent pairings. No interference was observed in the number comparison task. We are currently following up on these results with a larger online replication sample, and an fMRI study to investigate the relevant neural mechanisms. Our findings shed light on the multisensory nature of experiencing timbre.
- Proceedings of the 11th International Workshop on Haptic and Audio Interaction Design. Eds: Charalampos Saitis, Ildar Farkhatdinov, and Stefano Papetti. Springer Lecture Notes in Computer Science 13417, 2022
2021
- Multimodal Classification of Stressful Environments in Visually Impaired Mobility Using EEG and Peripheral Biosignals. Charalampos Saitis and Kyriaki Kalimeri. IEEE Transactions on Affective Computing, 2021
In this study, we aim to better understand the cognitive-emotional experience of visually impaired people when navigating in unfamiliar urban environments, both outdoor and indoor. We propose a multimodal framework based on random forest classifiers, which predict the actual environment among predefined generic classes of urban settings, inferring on real-time, non-invasive, ambulatory monitoring of brain and peripheral biosignals. Model performance reached 93% for the outdoor and 87% for the indoor environments (expressed in weighted AUROC), demonstrating the potential of the approach. Estimating the density distributions of the most predictive biomarkers, we present a series of geographic and temporal visualizations depicting the environmental contexts in which the most intense affective and cognitive reactions take place. A linear mixed model analysis revealed significant differences between categories of vision impairment, but not between normal and impaired vision. Despite the limited size of our cohort, these findings pave the way to emotionally intelligent mobility-enhancing systems, capable of implicit adaptation not only to changing environments but also to shifts in the affective state of the user in relation to different environmental and situational factors.
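A minimal sketch of the classify-and-evaluate step: a random forest predicting environment class from biosignal features, scored with a weighted AUROC as reported above. Features and labels are random placeholders.

```python
# Minimal sketch: random forest classification of environment classes from
# biosignal features, scored with weighted AUROC. Data are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))         # EEG + peripheral biosignal features
y = rng.integers(0, 4, size=600)       # four generic environment classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)
auroc = roc_auc_score(y_te, proba, multi_class="ovr", average="weighted")
print(f"weighted AUROC: {auroc:.2f}")
```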
- Modelling Moral Traits with Music Listening Preferences and Demographics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 15th International Symposium on Computer Music Multidisciplinary Research, 2021
Music has always been an integral part of our everyday lives, through which we express feelings, emotions, and concepts. Here, we explore the association between music genres, demographics, and moral values, employing data from an ad-hoc online survey and the Music Learning Histories Dataset. To further characterise the music preferences of the participants, the generalist/specialist (GS) score was employed. We exploit both classification and regression approaches to assess the predictive power of music preferences for inferring demographic attributes as well as the moral values of the participants. Our findings indicate that moral values are hard to predict from music listening behaviours alone (.62 average AUROC), while adding basic sociodemographic information raises the prediction score by 4% on average (.66 average AUROC), with the Purity foundation consistently achieving the highest accuracy. Similar results are obtained from the regression analysis. Finally, we provide insights into the most predictive music behaviours associated with each moral value, which can inform a wide range of applications from rehabilitation practices to communication campaign design.
- Development of a Web Application for the Education, Assessment, and Study of Timbre Perception. Charalampos Saitis. Society for Education, Music, and Psychology Research Conference, 2021
Timbre is defined as any auditory property other than pitch, duration, and loudness that allows two sounds to be distinguished. The Timbre Explorer (TE) is a synthesiser interface designed to demonstrate timbral dimensions of sound. This project aimed to develop and evaluate a web version of the TE that attempts to train its users and test their understanding of timbre as they go through a series of gamified tasks. A pilot study with 16 participants helped to identify shortcomings ahead of a full-sized study that will evaluate the performance of the TE as an educational aid and musical assessment tool.
- We are what we listen to: How moral values reflect on musical preferencesVjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis7th International Conference on Computational Social Science, 2021
- Neural Waveshaping SynthesisBen Hayes, Charalampos Saitis, and György Fazekas22nd International Society for Music Information Retrieval Conference, 2021
We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis which operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.
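The core idea can be sketched in a few lines (a toy illustration, not the published NEWT code): a small per-sample MLP with periodic activations acts as a learnable waveshaper, with affine scale and shift applied to its input and output signals; layer sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyWaveshaper(nn.Module):
    """Illustrative learnable waveshaper: an MLP applied independently to every
    sample of a waveform, with sinusoidal activations and affine in/out scaling."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 1)
        # Affine parameters on the input and output signals (scale and shift)
        self.in_scale = nn.Parameter(torch.ones(1))
        self.in_shift = nn.Parameter(torch.zeros(1))
        self.out_scale = nn.Parameter(torch.ones(1))
        self.out_shift = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples) exciter signal, e.g. a sinusoid at the target F0
        h = (x * self.in_scale + self.in_shift).unsqueeze(-1)   # (batch, samples, 1)
        h = torch.sin(self.fc1(h))
        h = torch.sin(self.fc2(h))
        y = self.fc3(h).squeeze(-1)
        return y * self.out_scale + self.out_shift

waveshaper = TinyWaveshaper()
exciter = torch.sin(2 * torch.pi * 220.0 * torch.arange(16000) / 16000).unsqueeze(0)
shaped = waveshaper(exciter)   # same shape as the input waveform
print(shaped.shape)
```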
- Perceptual and semantic scaling of FM synthesis timbres: Common dimensions and the role of expertiseBen Hayes, Charalampos Saitis, and György Fazekas16th International Conference on Music Perception and Cognition, 2021
Electronic sound has a rich history, yet timbre research has typically focused on the sounds of physical instruments, while synthesised sound is often relegated to functional roles like recreating acoustic timbres. Studying the perception of synthesised sound can broaden our conception of timbre and improve musical synthesis tools. We aimed to identify the perceptually salient acoustic attributes of sounds produced by frequency modulation synthesis. We also aimed to test Zacharakis et al.'s luminance-texture-mass timbre semantic model [Music Perception, 31, 339–358 (2014)] in this domain. Finally, we aimed to identify effects of prior music or synthesis experience on these results. Our results suggest that discrimination of abstract electronic timbres may rely on attributes distinct from those used with acoustic timbres. Further, the most salient attributes vary with expertise. However, the use of semantic descriptors is similar to that of acoustic instruments, and is consistent across expertise levels.
- NASH: the Neural Audio Synthesis HackathonBen Hayes, Cyrus Vahidi, and Charalampos SaitisDMRN+16: Digital Music Research Network One-Day Workshop, 2021
The field of neural audio synthesis aims to produce audio using neural networks. A recent surge in its popularity has led to several high profile works achieving impressive feats of speech and music synthesis. The development of broadly accessible neural audio synthesis tools, conversely, has been limited, and creative applications of these technologies are mostly undertaken by those with technical know-how. Research has focused largely on tasks such as realistic speech and musical instrument synthesis, whereas investigations into high-level control, esoteric sound design capabilities, and interpretability have received less attention. To encourage innovative work addressing these gaps, C4DM’s Special Interest Group on Neural Audio Synthesis (SIGNAS) propose to host our first Neural Audio Synthesis Hackathon: a two day event, with results to be presented in a session at DMRN+16.
- Acoustic Representations for Perceptual Timbre SimilarityCyrus Vahidi, Ben Hayes, Charalampos Saitis, and 1 more authorDMRN+16: Digital Music Research Network One-Day Workshop, 2021
In this work, we outline initial steps towards modelling perceptual timbre dissimilarity. We use stimuli from 17 distinct subjective timbre studies and compute pairwise distances in the spaces of MFCCs, joint time-frequency scattering coefficients and Open-L3 embeddings. We analyze agreement of distances in these spaces with human dissimilarity ratings and highlight challenges of this task.
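As a concrete, simplified instance of this kind of analysis (not the authors' code; the synthetic stimuli and human ratings below are placeholders), MFCC-based pairwise distances can be compared to dissimilarity ratings via rank correlation.

```python
import numpy as np
import librosa
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
# Placeholder stimuli: three synthetic tones with different spectra (stand-ins for the study's sounds)
stimuli = [np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 3 * f0 * t) for f0 in (220, 440, 660)]

feats = []
for y in stimuli:
    mfcc = librosa.feature.mfcc(y=y.astype(np.float32), sr=sr, n_mfcc=20)
    feats.append(mfcc.mean(axis=1))        # summarise each sound by its mean MFCC vector

model_dist = pdist(np.stack(feats), metric="euclidean")   # condensed pairwise distances

# Placeholder human dissimilarity ratings in the same pair order (1-2, 1-3, 2-3)
human_dissim = np.array([0.3, 0.8, 0.6])
rho, p = spearmanr(model_dist, human_dissim)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```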
- Variational Auto Encoding and Cycle-Consistent Adversarial Networks for Timbre TransferRussell Sammut Bonnici, Martin Benning, and Charalampos SaitisDMRN+16: Digital Music Research Network One-Day Workshop, 2021
The combination of Variational Autoencoders (VAE) with Generative Adversarial Networks (GAN) motivates meaningful representations of audio in the context of timbre transfer. This was applied to different datasets for transferring vocal timbre between speakers and musical timbre between instruments. Variations of the approach were trained and generalised performance was compared using the Structural Similarity Index and Fréchet Audio Distance. Many-to-many style transfer was found to improve reconstructive performance over one-to-one style transfer.
- A Modulation Front-End for Music Audio TaggingCyrus Vahidi, Charalampos Saitis, and György FazekasInternational Joint Conference on Neural Networks, 2021
Convolutional Neural Networks have been extensively explored in the task of automatic music tagging. The problem can be approached by using either engineered time-frequency features or raw audio as input. Modulation filter bank representations that have been actively researched as a basis for timbre perception have the potential to facilitate the extraction of perceptually salient features. We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block. The structure is effectively analogous to a modulation filter bank, where the FIR filter center frequencies are learned in a data-driven manner. The expectation is that a perceptually motivated filter bank can provide a useful representation for identifying music features. Our experimental results provide a fully visualisable and interpretable front-end temporal modulation decomposition of raw audio. We evaluate the performance of our model against the state-of-the-art of music tagging on the MagnaTagATune dataset. We analyse the impact on performance for particular tags when time-frequency bands are subsampled by the modulation filters at a progressively reduced rate. We demonstrate that modulation filtering provides promising results for music tagging and feature representation, without using extensive musical domain knowledge in the design of this frontend.
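A rough sketch of such a temporal modulation block (not ModNet or SincModNet themselves): a 1-D convolution along time applied per mel band, whose kernels play the role of data-driven FIR modulation filters; the channel counts and kernel length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModulationFrontEnd(nn.Module):
    """Illustrative front-end: per-band temporal convolution over a mel spectrogram,
    acting like a bank of learned FIR modulation filters."""
    def __init__(self, n_mels: int = 64, n_mod_filters: int = 8, kernel_size: int = 63):
        super().__init__()
        # groups=n_mels gives each mel band its own set of modulation filters;
        # padding keeps the frame count unchanged
        self.mod_filters = nn.Conv1d(n_mels, n_mels * n_mod_filters,
                                     kernel_size, padding=kernel_size // 2,
                                     groups=n_mels, bias=False)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, n_mels * n_mod_filters, frames)
        return self.mod_filters(mel)

frontend = ModulationFrontEnd()
mel = torch.randn(2, 64, 256)          # placeholder log-mel spectrogram batch
print(frontend(mel).shape)             # torch.Size([2, 512, 256])
```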
- Phoneme Mappings for Online Vocal Percussion TranscriptionAlejandro Delgado, Charalampos Saitis, and Mark Sandler151st Audio Engineering Society Convention, 2021, Honourable Mention for Outstanding Paper
Vocal Percussion Transcription (VPT) aims at detecting vocal percussion sound events in a beatboxing performance and classifying them into the correct drum instrument class (kick, snare, or hi-hat). To do this in an online (real-time) setting, however, algorithms are forced to classify these events within just a few milliseconds after they are detected. The purpose of this study was to investigate which phoneme-to-instrument mappings are the most robust for online transcription purposes. We used three different evaluation criteria to base our decision upon: frequency of use of phonemes among different performers, spectral similarity to reference drum sounds, and classification separability. With these criteria applied, the recommended mappings would potentially feel natural for performers to articulate while enabling the classification algorithms to achieve the best performance possible. Given the final results, we provided a detailed discussion on which phonemes to choose given different contexts and applications.
- Learning Models for Query by Vocal Percussion: A Comparative StudyAlejandro Delgado, SKoT McDonald, Ning Xu, and 2 more authors46th International Computer Music Conference, 2021
The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Thus, the automatic retrieval of drum sounds using vocal percussion can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. Here we explore different strategies to perform this type of query, making use of both traditional machine learning algorithms and recent deep learning techniques. The main hyperparameters from the models involved are carefully selected by feeding performance metrics to a grid search algorithm. We also look into several audio data augmentation techniques, which can potentially regularise deep learning models and improve generalisation. We compare the final performances in terms of effectiveness (classification accuracy), efficiency (computational speed), stability (performance consistency), and interpretability (decision patterns), and discuss the relevance of these results when it comes to the design of successful query-by-vocal-percussion systems.
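A minimal sketch of hyperparameter selection by grid search, as described above (the classifier, feature matrix, labels, and parameter grid are placeholders, not those used in the study).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # placeholder timbre features of vocal imitations
y = rng.integers(0, 5, size=200)        # placeholder drum-sound classes

param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```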
- The Timbre Explorer: A Synthesizer Interface for Educational Purposes and Perceptual StudiesJoshua Ryan Lam, and Charalampos SaitisInternational Conference on New Interfaces for Musical Expression, 2021
When two sounds are played at the same loudness, pitch, and duration, what sets them apart are their timbres. This study documents the design and implementation of the Timbre Explorer, a synthesizer interface based on efforts to dimensionalize this perceptual concept. The resulting prototype controls four perceptually salient dimensions of timbre in real-time: attack time, brightness, spectral flux, and spectral density. A graphical user interface supports user understanding with live visualizations of the effects of each dimension. The applications of this interface are three-fold; further perceptual timbre studies, usage as a practical shortcut for synthesizers, and educating users about the frequency domain, sound synthesis, and the concept of timbre. The project has since been expanded to a standalone version independent of a computer and a purely online web-audio version.
2020
- How we talk about sound: Semantic dimensions of abstract timbresBen Hayes, and Charalampos SaitisSound Instruments and Sonic Cultures: An Interdisciplinary Conference, 2020, National Science & Media Museum
Synthesisers, in their many forms, enable the realisation of almost any conceivable sound. Their fine-grained control and broad timbral palette call for a descriptive lexicon to enable their verbal differentiation and discussion. While acoustic instruments of the western classical lineage are the subject of an extensive body of enquiry into the perceptual attributes and semantic associations of the sounds they produce, abstract electronic sounds have been comparatively understudied in this regard. In particular, the diverse vocabulary used to describe such classical acoustic instruments can be summarised with three conceptual metaphors—such musical tones have luminance, texture, and mass—but this has yet to be explicitly confirmed for the kinds of electronic sounds that pervade many modern sonic cultures. In this work, we present an experimental paradigm for studying the semantic associations of synthesised sounds, wherein a group of experienced music producers and sound designers interacted with a web-based synthesiser in response to descriptive prompts, and provided comparative semantic ratings on the sounds they created. The words used for semantic ratings were selected by mining a text corpus from the popular modular synthesis forum Muff Wiggler, and analysing the frequency of adjectives in contexts pertaining to timbre. The ratings provided by participants were subject to statistical analysis. From 27 initial adjectives, two underlying semantic factors were revealed: terms including aggressive, hard, and complex associated with the first, and dark and warm with the second. These factors differ from those found for classical acoustic sounds, implying a relationship between the qualia of a sonic experience and the language employed to talk about it. Such insight has implications for how sound is conceptualised, understood, and received within sonic cultures—in particular, those predicated on electronic or abstract sound—and applications in developing novel control schemes for synthesis methods.
- Analysing and countering bodily interference in vibrotactile devices introduced by human interaction and physiologyMaximilian Weber, and Charalampos Saitis12th EuroHaptics Conference, 2020
- Timbre semantics through the lens of crossmodal correspondences: A new way of asking old questionsCharalampos Saitis, Stefan Weinzierl, Katharina Kriegstein, and 2 more authorsAcoustical Science and Technology, 2020
This position paper argues that a systematic study of the behavioral and neural mechanisms of crossmodal correspondences between timbral dimensions of sound and perceptual dimensions of other sensory modalities, such as brightness, roughness, or sweetness, can offer a new way of addressing old questions about the perceptual and neurocognitive mechanisms of auditory semantics. At the same time, timbre and the crossmodal metaphors that dominate its conceptualization can provide a test case for better understanding the neural basis of crossmodal correspondences and human semantic processing in general.
- What do people know about sensation and perception? Understanding perceptions of sensory experienceChristine Cuskley, and Charalampos SaitisPsyArXiv, 2020
Academic disciplines spanning cognitive science, art, and music have made strides in understanding how humans sense and experience the world. We now have a better scientific understanding of how human sensation and perception function both in the brain and in interaction than ever before. However, there is little research on how this high-level scientific understanding is translated into knowledge for the public more widely. We present descriptive results from a simple survey and compare how public understanding and perception of sensory experience line up with scientific understanding. Results show that even in a sample with fairly high educational attainment, many respondents were unaware of fairly common forms of sensory variation. In line with the well-documented underrepresentation of sign languages within linguistics, respondents tended to underestimate the number of sign languages in the world. We outline how our results represent gaps in public understanding of sensory variation, and argue that filling these gaps can form an important early intervention, acting as a basic foundation for improving acceptance, inclusivity, and accessibility for cognitively diverse populations.
- Timbre in Binaural Listening: A Comparison of Timbre Descriptors in Anechoic and HRTF Filtered Orchestral SoundsGeorgios Marentakis, and Charalampos SaitisForum Acusticum, 2020
The psychoacoustic investigation of timbre traditionally relies on audio descriptors extracted from anechoic or semi-anechoic recordings of musical instrument sounds, which are presented to listeners in diotic fashion. As a result, the extent to which spectral modifications due to the outer ear interact with timbre perception is not fully understood. As a first step towards investigating this research question, we examine here whether timbre descriptors calculated using HRTF filtered instrumental sounds deviate across ears and from values obtained from the same sounds without HRTF filtering for different listeners. The sound set comprised isolated notes played at the same fundamental frequency and dynamic from a database of anechoic recordings of modern orchestral instruments and some of their classical and baroque precursors. These were convolved with anechoic high spatial resolution HRTFs of human listeners. We present results and discuss implications for research on timbre perception and cognition.
- Perceptual Similarities in Neural Timbre EmbeddingsBen Hayes, Luke Brosnahan, Charalampos Saitis, and 1 more authorDMRN+15: Digital Music Research Network One-Day Workshop, 2020
Many neural audio synthesis models learn a representational space which can be used for control or exploration of the sounds generated. It is unclear what relationship exists between this space and human perception of these sounds. In this work, we compute configurational similarity metrics between an embedding space learned by a neural audio synthesis model and conventional perceptual and semantic timbre spaces. These spaces are computed using abstract synthesised sounds. We find significant similarities between these spaces, suggesting a shared organisational influence.
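One standard configurational similarity measure is Procrustes analysis; the abstract does not name the exact metric used, so the following is only an indicative sketch with placeholder 2-D configurations of the same set of sounds.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
embedding_space = rng.normal(size=(20, 2))     # placeholder: 20 sounds in a learned 2-D space
perceptual_space = rng.normal(size=(20, 2))    # placeholder: same sounds in a perceptual MDS space

# procrustes optimally translates, scales, and rotates one configuration onto the other;
# the disparity (sum of squared differences) is low when the spaces are similarly organised
_, _, disparity = procrustes(perceptual_space, embedding_space)
print(f"Procrustes disparity: {disparity:.3f}")
```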
- There’s More to Timbre than Musical Instruments: Semantic Dimensions of FM SoundsBen Hayes, and Charalampos Saitis2nd International Conference on Timbre, 2020
Much previous research into timbre semantics (such as when an oboe is described as “hollow”) has focused on sounds produced by acoustic instruments, particularly those associated with western tonal music (Saitis & Weinzierl, 2019). Many synthesisers are capable of producing sounds outside the timbral range of physical instruments, but which are still discriminable by their timbre. Research into the perception of such sounds, therefore, may help elucidate further the mechanisms underpinning our experience of timbre in the broader sense. In this paper, we present a novel paradigm on the application of semantic descriptors to sounds produced by experienced sound designers using an FM synthesiser with a full set of controls.
- Evidence for Timbre Space Robustness to an Uncontrolled Online Stimulus PresentationAsterios Zacharakis, Ben Hayes, Charalampos Saitis, and 1 more author2nd International Conference on Timbre, 2020
Research on timbre perception is typically conducted under controlled laboratory conditions where every effort is made to keep stimulus presentation conditions fixed (McAdams, 2019). This conforms with the ANSI (1973) definition of timbre, which suggests that in order to judge the timbre differences between a pair of sounds, the remaining perceptual attributes (i.e., pitch, duration and loudness) should remain unchanged. Therefore, especially in pairwise dissimilarity studies, particular care is taken to ensure that loudness is not used by participants as a criterion for judgements, by equalising it across experimental stimuli. On the other hand, conducting online experiments is an increasingly favoured practice in the music perception and cognition field, as targeting relevant communities can potentially provide a large number of suitable participants with relatively little time investment on the part of the experimenters (e.g., Woods et al., 2015). However, the strict requirements for stimulus preparation and presentation have prevented timbre studies from moving to online experimentation. Beyond the obvious difficulty of imposing equal loudness in online experiments, different playback equipment chains (DACs, pre-amplifiers, headphones) will almost inevitably ‘colour’ the sonic outcome in different ways. Despite these limitations, in a time of social distancing it would be of major importance to be able to lift some of the physical requirements in order to carry on conducting behavioural research on timbre perception. Therefore, this study investigates the extent to which an uncontrolled online replication of a past laboratory-conducted pairwise dissimilarity task distorts the findings.
- Spectral and Temporal Timbral Cues of Vocal ImitationsAlejandro Delgado, Charalampos Saitis, and Mark Sandler2nd International Conference on Timbre, 2020
The imitation of non-vocal sounds using the human voice is a resource we sometimes rely on when communicating sound concepts to other people. Query by Vocal Percussion (QVP) is a subfield in Music Information Retrieval (MIR) that explores techniques to query percussive sounds using vocal imitations as input, usually plosive consonant sounds. The goal of this work was to investigate timbral relationships between real drum sounds and their vocal imitations. We believe these insights could shed light on how to select timbre descriptors for extraction when designing offline and online QVP systems. In particular, we studied a dataset composed of 30 acoustic and electronic drum sound recordings and vocal imitations of each sound performed by 14 musicians. Our approach was to study the correlation of audio content descriptors of timbre extracted from the drum samples with the same descriptors taken from vocal imitations. Three timbral descriptors were selected: the Log Attack Time (LAT), the Spectral Centroid (SC), and the Derivative After Maximum of the sound envelope (DAM). LAT and SC have been shown to represent salient dimensions of timbre across different types of sounds including percussion. In this sense, one intriguing question would be to what extent listeners can communicate these salient timbral cues in vocal imitations. The third descriptor, DAM, was selected for its role in describing the sound’s tail, which we considered to be a relevant part of percussive utterances.
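A simplified sketch of extracting two of these descriptors (SC and LAT; DAM is omitted for brevity) and correlating them between drum sounds and imitations; the synthetic sounds and the rough LAT estimate below are illustrative assumptions, not the study's exact formulations.

```python
import numpy as np
import librosa
from scipy.stats import pearsonr

sr = 22050
t = np.linspace(0, 0.5, int(0.5 * sr), endpoint=False)

def synth_hit(f0, attack, decay):
    """Placeholder percussive sound: enveloped noisy tone (stands in for a recording)."""
    env = np.minimum(t / attack, 1.0) * np.exp(-t / decay)
    return (np.sin(2 * np.pi * f0 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)) * env

def spectral_centroid(y):
    return float(librosa.feature.spectral_centroid(y=y.astype(np.float32), sr=sr).mean())

def log_attack_time(y, lo=0.2, hi=0.9):
    # Rough LAT estimate: log time for the amplitude envelope to rise from 20% to 90% of its maximum
    env = np.abs(y)
    peak = env.max()
    t_lo = np.argmax(env >= lo * peak) / sr
    t_hi = np.argmax(env >= hi * peak) / sr
    return float(np.log10(max(t_hi - t_lo, 1e-4)))

drums = [synth_hit(60, 0.005, 0.10), synth_hit(200, 0.002, 0.08), synth_hit(4000, 0.001, 0.03)]
imitations = [synth_hit(80, 0.010, 0.12), synth_hit(250, 0.004, 0.09), synth_hit(3500, 0.002, 0.04)]

for name, fn in [("SC", spectral_centroid), ("LAT", log_attack_time)]:
    r, p = pearsonr([fn(d) for d in drums], [fn(v) for v in imitations])
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")
```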
- Timbre Space Representation of a Subtractive SynthesizerCyrus Vahidi, György Fazekas, Charalampos Saitis, and 1 more author2nd International Conference on Timbre, 2020
In this study, we produce a geometrically scaled perceptual timbre space from dissimilarity ratings of subtractive synthesized sounds and correlate the resulting dimensions with a set of acoustic descriptors. We curate a set of 15 sounds, produced by a synthesis model that uses varying source waveforms, frequency modulation (FM) and a lowpass filter with an enveloped cutoff frequency. Pairwise dissimilarity ratings were collected within an online browser-based experiment. We hypothesized that a varied waveform input source and enveloped filter would act as the main vehicles for timbral variation, providing novel acoustic correlates for the perception of synthesized timbres.
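A sketch of the general analysis chain implied above, with placeholder dissimilarities and descriptor values (the study's actual scaling procedure may differ): multidimensional scaling of a dissimilarity matrix followed by correlation of the resulting dimensions with an acoustic descriptor.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 15                                             # 15 synthesised sounds
d = rng.random((n, n))
d = (d + d.T) / 2                                  # placeholder symmetric dissimilarity matrix
np.fill_diagonal(d, 0)

space = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(d)

spectral_centroid = rng.random(n)                  # placeholder acoustic descriptor per sound
for dim in range(space.shape[1]):
    rho, p = spearmanr(space[:, dim], spectral_centroid)
    print(f"dimension {dim + 1}: rho = {rho:.2f} (p = {p:.3f})")
```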
- Verbal description of musical brightnessChristos Drouzas, and Charalampos Saitis2nd International Conference on Timbre, 2020
Amongst the most common descriptive expressions of timbre used by musicians, music engineers, audio researchers as well as everyday listeners are words related to the notion of brightness (e.g., bright, dark, dull, brilliant, shining). From a psychoacoustic perspective, brightness ratings of instrumental timbres as well as music excerpts systematically correlate with the centre of gravity of the spectral envelope and thus brightness as a semantic descriptor of musical sound has come to denote a prevalence of high-frequency over low-frequency energy. However, relatively little is known about the higher-level cognitive processes underpinning musical brightness ratings. Psycholinguistic investigations of verbal descriptions of timbre suggest a more complex, polysemic picture (Saitis & Weinzierl 2019) that warrants further research. To better understand how musical brightness is conceptualised by listeners, here we analysed free verbal descriptions collected along brightness ratings of short music snippets (involving 69 listeners) and brightness ratings of orchestral instrument notes (involving 68 listeners). Such knowledge can help delineate the intrinsic structure of brightness as a perceptual attribute of musical sounds, and has broad implications and applications in orchestration, audio engineering, and music psychology.
- Brightness perception for musical instrument sounds: Relation to timbre dissimilarity and source-cause categoriesCharalampos Saitis, and Kai SiedenburgThe Journal of the Acoustical Society of America, 2020
Timbre dissimilarity of orchestral sounds is well-known to be multidimensional, with attack time and spectral centroid representing its two most robust acoustical correlates. The centroid dimension is traditionally considered as reflecting timbral brightness. However, the question of whether multiple continuous acoustical and/or categorical cues influence brightness perception has not been addressed comprehensively. A triangulation approach was used to examine the dimensionality of timbral brightness, its robustness across different psychoacoustical contexts, and relation to perception of the sounds’ source-cause. Listeners compared 14 acoustic instrument sounds in three distinct tasks that collected general dissimilarity, brightness dissimilarity, and direct multi-stimulus brightness ratings. Results confirmed that brightness is a robust unitary auditory dimension, with direct ratings recovering the centroid dimension of general dissimilarity. When a two-dimensional space of brightness dissimilarity was considered, its second dimension correlated with the attack-time dimension of general dissimilarity, which was interpreted as reflecting a potential infiltration of the latter into brightness dissimilarity. Dissimilarity data were further modeled using partial least-squares regression with audio descriptors as predictors. Adding predictors derived from instrument family and the type of resonator and excitation did not improve the model fit, indicating that brightness perception is underpinned primarily by acoustical rather than source-cause cues.
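A minimal sketch of modelling dissimilarity data with partial least-squares regression on audio descriptors (placeholder data and descriptor count; not the authors' exact pipeline).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Each row is a pair of sounds; columns are differences in audio descriptors for that pair
X = rng.normal(size=(91, 6))          # 91 pairs of 14 sounds, 6 placeholder descriptors
y = rng.random(91)                    # placeholder brightness dissimilarity ratings

pls = PLSRegression(n_components=2)
r2 = cross_val_score(pls, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.2f}")
```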
- Towards a framework for ubiquitous audio-tactile designMaximilian Weber, and Charalampos Saitis10th International Workshop on Haptic and Audio Interaction Design, 2020
To enable a transition towards rich vibrotactile feedback in applications and media content, a complete end-to-end system — from the design of the tactile experience all the way to the tactile stimulus reproduction — needs to be considered. Currently, most applications are at best limited to dull vibration patterns due to limited hard- and software implementations, while the design of ubiquitous platform-agnostic tactile stimuli remains challenging due to a lack of standardized protocols and tools for tactile design, storage, transport, and reproduction. This work proposes a conceptual framework, utilizing audio assets as a starting point for the design of vibrotactile stimuli, including ideas for a parametric tactile data model, and outlines challenges for a platform-agnostic stimuli reproduction. Finally, the benefits and shortcomings of a commercial and wide-spread vibrotactile API are investigated as an example for the current state of a complete end-to-end framework.
- Musical dynamics classification with CNN and modulation spectraLuca Marinelli, Athanasios Lykartsis, Stefan Weinzierl, and 1 more author17th Sound and Music Computing Conference, 2020
To investigate variations in the timbre space with regard to musical dynamics, convolutional neural networks (CNNs) were trained on modulation power spectra (MPS), mel-scaled and ERB-scaled spectrograms of single notes of sustained instruments played at two dynamic extremes (pp and ff). The samples, drawn from an extensive dataset of several timbre families, were RMS normalized in order to eliminate loudness information and force the network to focus on timbre attributes of musical dynamics that are shared across different instrument families. The proposed CNN architecture obtained competitive results in three classification tasks with all three input representations. To compare the different input representations, the test sets in the three experiments were partitioned so as to either promote or avoid selection bias. When selection bias was avoided, models trained on MPS were outperformed by those trained on time-frequency representations; conversely, those trained on MPS achieved the best results when selection bias was promoted. Low temporal modulations emerged in class-specific MPS saliency maps as markers of musical dynamics. This led to the implementation of an MPS-based scalar descriptor of timbre that largely outperformed the chosen baseline (44.8% error reduction).
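Two of the preprocessing steps mentioned above can be sketched as follows: RMS normalisation of the waveform, and a modulation power spectrum computed as the 2-D Fourier transform of a log-mel spectrogram (a common construction; the placeholder signal and parameters are not those of the paper).

```python
import numpy as np
import librosa

def rms_normalise(y, target_rms=0.1):
    # Scale the waveform so its RMS matches a fixed target, removing loudness cues
    return y * (target_rms / (np.sqrt(np.mean(y ** 2)) + 1e-12))

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
# Placeholder sustained note: a 440 Hz tone with slow amplitude modulation
y = rms_normalise(np.sin(2 * np.pi * 440 * t) * (1 + 0.3 * np.sin(2 * np.pi * 4 * t)))

mel = librosa.feature.melspectrogram(y=y.astype(np.float32), sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)

# Modulation power spectrum: magnitude of the 2-D Fourier transform over (frequency, time)
mps = np.abs(np.fft.fftshift(np.fft.fft2(log_mel)))
print(mps.shape)   # (spectral modulation bins, temporal modulation bins)
```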
- Proceedings of the 2nd International Conference on TimbreEds: Asterios Zacharakis, Charalampos Saitis, and Kai SiedenburgThe School of Music Studies, Aristotle University of Thessaloniki, 2020
2019
- Modulation Spectra for Musical Dynamics Perception and RetrievalLuca Marinelli, Athanasios Lykartsis, and Charalampos SaitisDMRN+14: Digital Music Research Network One-Day Workshop, 2019
- The role of attack transients in timbral brightness perceptionCharalampos Saitis, Kai Siedenburg, Paul Schuladen, and 1 more author23rd International Congress on Acoustics, 2019
http://pub.dega-akustik.de/ICA2019/data/articles/000813.pdf
- Revisiting timbral brightness perceptionCharalampos Saitis, Kai Siedenburg, and Christoph ReuterBiennial Meeting of the Society for Music Perception and Cognition, 2019
Brightness has been long shown to play a major role in timbre perception but relatively little is known about the specific acoustic and cognitive factors that affect brightness ratings of musical instrument sounds. Previous work indicated that sound source categories influence general timbre dissimilarity ratings. To examine whether source categories also exert an effect on brightness ratings of timbre, we collected brightness dissimilarity ratings of 14 orchestral instrument tones from 40 musically experienced listeners and the data were modeled using a partial least-squares regression model that takes audio descriptors of timbre as regressors. It was found that adding predictors derived from sound source categories did not improve the model fit, indicating that timbral brightness is informed mainly by continuously varying properties of the acoustic signal. A multidimensional scaling analysis suggested at least two salient cues: spectral energy distribution and attack time and/or asynchrony in the rise of harmonics. This finding seems to challenge the typical approach of seeking acoustical correlates of brightness in the spectral envelope of the steady-state portion of sounds. To further investigate these aspects in timbral brightness perception, a new group of 40 musically experienced listeners will perform MUSHRA-like brightness ratings of an expanded set of 24 orchestral instrument notes. The goal is to obtain a perceptual scaling of the attribute across a larger set of sounds to help delineate the acoustic ingredients of this important aspect of timbre perception. Preliminary results indicate that between sounds with very close spectral centroid values but different attack times, those with faster attacks tend to be perceived as brighter. Overall, these experiments help clarify the relation between two salient dimensions of timbre: onset and spectral energy distribution.
- There’s more to timbre than musical instruments: a meta-analysis of timbre semantics in singing voice quality perceptionCharalampos Saitis, and Johanna DevaneyBiennial Meeting of the Society for Music Perception and Cognition, 2019
Imagine listening to the famous soprano Maria Callas (1923–1977) singing the aria “Vissi d’arte” from Puccini’s Tosca. How would you describe the quality of her voice? When describing the timbre of musical sounds, listeners use descriptions such as bright, heavy, round, and rough, among others. In 1890, Stumpf theorized that this diverse vocabulary can be summarized, on the basis of semantic proximities, by three pairs of opposites: dark–bright, soft–rough, and full–empty. Empirical findings across many semantic differential studies from the late 1950s until today have generally confirmed that these are the salient dimensions of timbre semantics. However, most prior work has considered only orchestral instruments, with relatively little attention given to sung tones. At the same time, research on the perception of singing voice quality has primarily focused on verbal attributes associated with phonation type, voice classification, vocal register, vowel intelligibility, and vibrato. Descriptions like pressed, soprano, falsetto, hoarse, or wobble, albeit in themselves a type of timbre semantics, are essentially sound source identifiers acting as semantic descriptors. It remains an open question as to whether the timbral attributes of sung tones, that is verbal attributes that bear no source associations, can be described adequately on the basis of the bright-rough-full semantic space. We present a meta-analysis of previous research on verbal attributes of singing voice timbre that covers not only pedagogical texts but also work from music cognition, psychoacoustics, music information retrieval, musicology, and ethnomusicology. The meta-analysis lays the groundwork for a semantic differential study of sung sounds, providing a more appropriate lexicon on which to draw than simply using verbal scales from related work on instrumental timbre. The meta-analysis will be complemented by a psycholinguistic analysis of free verbalizations provided by singing teachers in a listening test and an acoustic analysis of the tested stimuli.
- Spectrotemporal modulation timbre cues in musical dynamicsCharalampos Saitis, Luca Marinelli, Athanasios Lykartsis, and 1 more authorBiennial Meeting of the Society for Music Perception and Cognition, 2019
Timbre is often described as a complex set of sound features that are not accounted for by pitch, loudness, duration, spatial location, and the acoustic environment. Musical dynamics refers to the perceived or intended loudness of a played note, instructed in music notation as piano or forte (soft or loud) with different dynamic gradations between and beyond. Recent research has shown that even if no loudness cues are available, listeners can still quite reliably identify the intended dynamic strength of a performed sound by relying on timbral features. More recently, acoustical analyses across an extensive set of anechoic recordings of orchestral instrument notes played at pianissimo (pp) and fortissimo (ff) showed that attack slope, spectral skewness, and spectral flatness together explained 72% of the variance in dynamic strength across all instruments, and 89% with an instrument-specific model. Here, we further investigate the role of timbre in musical dynamics, focusing specifically on the contribution of spectral and temporal modulations. Loudness-normalized modulation power spectra (MPS) were used as input representation for a convolutional neural network (CNN). Through visualization of the pp and ff saliency maps of the CNN it was possible to identify discriminant regions of the MPS and define a novel task-specific scalar audio descriptor. A linear discriminant analysis with 10-fold cross-validation using this new MPS-based descriptor on the entire dataset performed better than using the two spectral descriptors (27% error rate reduction). Overall, audio descriptors based on different regions of the MPS could serve as sound representation for machine listening applications, as well as to better delineate the acoustic ingredients of different aspects of auditory perception.
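The evaluation step described at the end can be sketched as linear discriminant analysis with 10-fold cross-validation on a single scalar descriptor (placeholder descriptor values, not the actual dataset).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# One scalar MPS-based descriptor per note, with pp/ff labels (placeholder values)
descriptor = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])
dynamics = np.array([0] * 500 + [1] * 500)        # 0 = pianissimo, 1 = fortissimo

lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, descriptor.reshape(-1, 1), dynamics, cv=10)
print(f"mean 10-fold accuracy: {acc.mean():.2f} (error rate {1 - acc.mean():.2f})")
```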
- Beyond the semantic differential: Timbre semantics as crossmodal correspondencesCharalampos Saitis14th International Symposium on Computer Music Multidisciplinary Research, 2019
This position paper argues that a systematic study of crossmodal correspondences between timbral dimensions of sound and perceptual dimensions of other sensory modalities (e.g., brightness, fullness, roughness, sweetness) can offer a new way of addressing old questions about the perceptual and cognitive mechanisms of timbre semantics, while the latter can provide a test case for better understanding crossmodal correspondences and human semantic processing in general. Furthermore, a systematic investigation of auditory-nonauditory crossmodal correspondences necessitates auditory stimuli that can be intuitively controlled along intrinsic continuous dimensions of timbre, and the collection of behavioural data from appropriate tasks that extend beyond the semantic differential paradigm.
- Sounds like melted chocolate: how musicians conceptualize violin sound richnessCharalampos Saitis, Claudia Fritz, and Gary ScavoneInternational Symposium on Musical Acoustics, 2019
Results from a previous study on the perceptual evaluation of violins that involved playing-based semantic ratings showed that preference for a violin was strongly associated with its perceived sound richness. However, both preference and richness ratings varied widely between individual violinists, likely because musicians conceptualize the same attribute in different ways. To better understand how richness is conceptualized by violinists and how it contributes to the perceived quality of a violin, we analyzed free verbal descriptions collected during a carefully controlled playing task (involving 16 violinists) and in an online survey where no sound examples or other contextual information was present (involving 34 violinists). The analysis was based on a psycholinguistic method, whereby semantic categories are inferred from the verbal data itself through syntactic context and linguistic markers. The main sensory property related to violin sound richness was expressed through words such as full, complex, and dense versus thin and small, referring to the perceived number of partials present in the sound. Another sensory property was expressed through words such as warm, velvety, and smooth versus strident, harsh, and tinny, alluding to spectral energy distribution cues. Haptic cues were also implicated in the conceptualization of violin sound richness.
- The Semantics of TimbreCharalampos Saitis, and Stefan WeinzierlTimbre: Acoustics, Perception, and Cognition, 2019
Because humans lack a sensory vocabulary for auditory experiences, timbral qualities of sounds are often conceptualized and communicated through readily available sensory attributes from different modalities (e.g., bright, warm, sweet) but also through the use of onomatopoeic attributes (e.g., ringing, buzzing, shrill) or nonsensory attributes relating to abstract constructs (e.g., rich, complex, harsh). The analysis of the linguistic description of timbre, or timbre semantics, can be considered as one way to study its perceptual representation empirically. In the most commonly adopted approach, timbre is considered as a set of verbally defined perceptual attributes that represent the dimensions of a semantic timbre space. Previous studies have identified three salient semantic dimensions for timbre along with related acoustic properties. Comparisons with similarity-based multidimensional models confirm the strong link between perceiving timbre and talking about it. Still, the cognitive and neural mechanisms of timbre semantics remain largely unknown and underexplored, especially when one looks beyond the case of acoustic musical instruments.
- The present, past, and future of timbre researchKai Siedenburg, Charalampos Saitis, and Stephen McAdamsTimbre: Acoustics, Perception, and Cognition, 2019
Timbre is a foundational aspect of hearing. The remarkable ability of humans to recognize sound sources and events (e.g., glass breaking, a friend’s voice, a tone from a piano) stems primarily from a capacity to perceive and process differences in the timbre of sounds. Roughly defined, timbre is thought of as any property other than pitch, duration, and loudness that allows two sounds to be distinguished. Current research unfolds along three main fronts: (1) principal perceptual and cognitive processes; (2) the role of timbre in human voice perception, perception through cochlear implants, music perception, sound quality, and sound design; and (3) computational acoustic modeling. Along these three scientific fronts, significant breakthroughs have been achieved during the decade prior to the production of this volume. Bringing together leading experts from around the world, this volume provides a joint forum for novel insights and the first comprehensive modern account of research topics and methods on the perception, cognition, and acoustic modeling of timbre. This chapter provides background information and a roadmap for the volume.
- Audio Content Descriptors of TimbreMarcelo Caetano, Charalampos Saitis, and Kai SiedenburgTimbre: Acoustics, Perception, and Cognition, 2019
This chapter introduces acoustic modeling of timbre with the audio descriptors commonly used in music, speech, and environmental sound studies. These descriptors derive from different representations of sound, ranging from the waveform to sophisticated time-frequency transforms. Each representation is more appropriate for a specific aspect of sound description that is dependent on the information captured. Auditory models of both temporal and spectral information can be related to aspects of timbre perception, whereas the excitation-filter model of sound production provides links to the acoustics of sound production. A brief review of the most common representations of audio signals used to extract audio descriptors related to timbre is followed by a discussion of the audio descriptor extraction process using those representations. This chapter covers traditional temporal and spectral descriptors, including harmonic description, time-varying descriptors, and techniques for descriptor selection and descriptor decomposition. The discussion is focused on conceptual aspects of the acoustic modeling of timbre and the relationship between the descriptors and timbre perception, semantics, and cognition, including illustrative examples. The applications covered in this chapter range from timbre psychoacoustics and multimedia descriptions to computer-aided orchestration and sound morphing. Finally, the chapter concludes with speculation on the role of deep learning in the future of timbre description and on the challenges of audio content descriptors of timbre.
- Timbre: Acoustics, Perception, and CognitionEds: Kai Siedenburg, Charalampos Saitis, Stephen McAdams, and 2 more editorsSpringer Handbook of Auditory Research 69, 2019