Publications
- [Journal] Timbre semantic associations vary both between and within instruments: An empirical study incorporating register and pitch height. Lindsey Reymore, Jason Noble, Charalampos Saitis, and 2 more authors. Music Perception, in press.
2022
- [ISMIR] More Than Words: Linking Music Preferences and Moral Values Through Lyrics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 23rd International Society for Music Information Retrieval Conference, 2022.
This study explores the association between music preferences and moral values by applying text analysis techniques to lyrics. Harvesting data from a Facebook-hosted application, we align psychometric scores of 1,386 users to lyrics from the top 5 songs of their preferred music artists as inferred from Facebook Page Likes. We extract a set of lyrical features related to each song’s overarching narrative, moral valence, sentiment, and emotion. A machine learning framework was designed to exploit regression approaches and evaluate the predictive power of lyrical features for inferring moral values. Results suggest that lyrics from top songs of artists people like inform their morality. Virtues of hierarchy and tradition achieve higher prediction scores (between .20 and .30) than values of empathy and equality (between .08 and .11), while basic demographic variables account for only a small part of the models’ explainability. This underlines the capacity of music listening behaviours alone, as assessed via lyrical preferences, to capture moral values. We discuss the technological and musicological implications and possible future improvements.
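For illustration, the regression setup described above can be approximated in a few lines of scikit-learn. This is a minimal sketch, not the authors' pipeline: the feature matrix, target, and model choice are hypothetical stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: one row per user, columns are aggregated lyrical
# features (narrative, moral valence, sentiment, emotion scores).
X = rng.normal(size=(1386, 20))   # lyrical features
y = rng.normal(size=1386)         # e.g., a "Purity" foundation score

# Evaluate how well lyrical features predict one moral foundation,
# reporting R^2 across cross-validation folds.
model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean R^2: {scores.mean():.2f}")
```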
- [ICA] timbre.fun: A gamified interactive system for crowdsourcing a timbre semantic vocabulary. Ben Hayes, Charalampos Saitis, and György Fazekas. 24th International Congress on Acoustics, 2022.
We present timbre.fun, a web-based gamified interactive system where users create sounds in response to semantic prompts (e.g., bright, rough) through exploring a two-dimensional control space that maps nonlinearly to the parameters of a simple hybrid wavetable and amplitude-modulation synthesizer. The current version features 25 semantic adjectives mined from a popular synthesis forum. As well as creating sounds, users can explore heatmaps generated from others’ responses, and fit a classifier (k-nearest neighbors) in-browser. timbre.fun is based on recent work, including by the authors, which studied timbre semantic associations through prompted synthesis paradigms. The interactive is embedded in a digital exhibition on sensory variation and interaction (seeingmusic.app) which debuted at the 2021 Edinburgh Science Festival, where it was visited by 197 users from 21 countries over 16 days. As it continues running online, a further 596 visitors from 35 countries have engaged. To date 579 sounds have been created and tagged, which will facilitate parallel research in timbre semantics and neural audio synthesis. Future work will include further gamifying the data collection pipeline, including leveling-up to unlock new words and synthesizers, and a full open-source release.
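For readers curious about the in-browser classifier, the equivalent offline step is a standard k-nearest-neighbours fit over tagged control-space positions. A minimal sketch with hypothetical data (the live system trains in JavaScript in the browser):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Hypothetical crowdsourced data: (x, y) positions in the 2D synth
# control space, each tagged with the semantic prompt it answered.
positions = rng.uniform(0, 1, size=(579, 2))
labels = rng.choice(["bright", "rough", "warm"], size=579)

# Fit k-NN so that any new control-space position can be mapped back
# to its most likely semantic region.
knn = KNeighborsClassifier(n_neighbors=5).fit(positions, labels)
print(knn.predict([[0.8, 0.2]]))
```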
- [IMRF] Seeing Music: Leveraging citizen science and gamification to study cross-sensory associations. Charalampos Saitis, Christine Cuskley, and Sebastian Löbbers. 20th International Multisensory Research Forum, 2022.
Our recent research has shown that people lack knowledge about how the senses interact and are unaware of many common forms of sensory and perceptual variation. We present Seeing Music, a digital interactive exhibition and audiovisual game that translates high-level scientific understanding of sensory variation and cross-modality into knowledge for the public. Using a narrative-driven gamified approach, players are tasked with communicating human music to an extraterrestrial intelligence through visual shape, color and texture using two-dimensional selector panels. Music snippets (12–24 s long) are played continuously in a loop, taken from three custom instrumental compositions designed to vary systematically in terms of timbre, melody, and rhythm. Players can “level-up” to unlock new visual features and musical snippets, and explore and evaluate collaborative visualizations made by others. Outside the game, a series of interactive slideshows help visitors learn more about sensory experience, sensory diversity, and how our senses make us human. The exhibition debuted at the 2021 Edinburgh Science Festival, where it was visited by 197 users coming from 21 countries (134 visitors from the UK) over 16 days. As it continues running online, a further 596 visitors from 35 countries (164 from the UK) have engaged. To date, 169 players of Seeing Music have produced more than 42,500 audiovisual mapping datapoints for scientific research purposes. Preliminary analysis suggests that music with less high-frequency energy was mapped to less complex and rounder shapes, bluer and less bright hues, and less dense textures. These trends confirm auditory-visual correspondences previously reported in more controlled laboratory studies, while also offering new insight into how different auditory-visual associations interact with each other. Future work includes improving user motivation and interaction, refining data collection, a full open-source release, and adding new games and informational material about research on the senses.
- [EuroMedia] Exploring the Dimensionality of the Affective Space Elicited by Gendered Toy Commercials. Luca Marinelli and Charalampos Saitis. 9th European Conference on Media, Communication & Film, 2022.
As evidenced by a large body of literature, the gender-stereotyped nature of toy adverts has been widely scrutinised. However, little work has been done in examining the affective impact of these commercials on the audience. It has been shown that repeated exposure to gender-stereotyped messages has the capacity to influence behaviours, beliefs and attitudes. In particular, media can influence emotion socialization, and gender differences in emotion expression might emerge (Scherr 2018). In this study, we investigated whether commercials elicit emotions at different intensities with respect to the gender of their target audience. Furthermore, we evaluated whether such emotions follow distinct underlying latent structures. A total of 1081 ratings on 10 unipolar aesthetic emotion scales were collected for 135 commercials (45 each for masculine, feminine, and mixed target audiences) from 80 UK nationals (35 F, 45 M) aged 18 to 76. The main reason for collecting our ratings from adults was that, already by age 11, children exhibit adult-like emotion recognition capabilities (Hunter 2011). Seven scales showed significant differences between commercials for distinct audiences, with five in particular revealing a strong polarization (happiness, amusement, beauty, calm, and anger). In addition, parallel analysis showed that a minimum of three factors are needed to explain the ratings for masculine and mixed targeted commercials, while only two are needed for the feminine ones, thereby indicating that the latter elicit emotions following a simpler underlying structure. Both results reflect larger issues in toy marketing, where gender essentialism is still dominant, and prompt further discussion and research.
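Parallel analysis, used above to decide how many factors underlie the ratings, retains factors whose observed eigenvalues exceed those of comparable random data. A compact sketch (the ratings matrix here is random placeholder data):

```python
import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Return the number of factors whose observed eigenvalues exceed
    the 95th percentile of eigenvalues from random data of equal shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        rand = rng.normal(size=(n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(rand, rowvar=False)))[::-1]
    threshold = np.percentile(sims, 95, axis=0)
    return int(np.sum(obs > threshold))

# Placeholder ratings: 1081 ratings x 10 aesthetic emotion scales.
ratings = np.random.default_rng(2).normal(size=(1081, 10))
print(parallel_analysis(ratings))
```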
- [IJCNN] Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks. Russell Sammut Bonnici, Martin Benning, and Charalampos Saitis. International Joint Conference on Neural Networks, 2022.
This work investigates the application of deep learning to timbre transfer. The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio. It is applied to the Flickr 8k Audio dataset for transferring vocal timbre between speakers and to the URMP dataset for transferring musical timbre between instruments. Variations of the adopted approach were trained, and performance was compared using the metrics SSIM (Structural Similarity Index) and FAD (Fréchet Audio Distance). It was found that a many-to-many approach supersedes a one-to-one approach in terms of reconstructive capabilities, while one-to-one showed better results in terms of adversarial translation. The adoption of a basic over a bottleneck residual block design is more suitable for enriching content information about a latent space, and the decision on whether cyclic loss takes on a variational autoencoder or vanilla autoencoder approach does not have a significant impact on the reconstructive and adversarial translation aspects of the model.
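As an aside on the evaluation metrics: SSIM is an image-similarity measure, so applying it to audio amounts to comparing time-frequency representations. A hedged sketch of one plausible setup (the paper's exact configuration may differ):

```python
import librosa
import numpy as np
from skimage.metrics import structural_similarity

def spectrogram_ssim(wav_a, wav_b, sr=16000):
    """Compare two waveforms via SSIM on their log-mel spectrograms."""
    mel_a = librosa.power_to_db(librosa.feature.melspectrogram(y=wav_a, sr=sr))
    mel_b = librosa.power_to_db(librosa.feature.melspectrogram(y=wav_b, sr=sr))
    rng = max(mel_a.max() - mel_a.min(), mel_b.max() - mel_b.min())
    return structural_similarity(mel_a, mel_b, data_range=rng)

# Toy signals: two sine tones an octave apart.
a = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
b = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
print(spectrogram_ssim(a, b))
```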
- [Journal] Disembodied Timbres: A Study on Semantically Prompted FM Synthesis. Ben Hayes, Charalampos Saitis, and György Fazekas. Journal of the Audio Engineering Society, 2022.
Disembodied electronic sounds constitute a large part of the modern auditory lexicon, but research into timbre perception has focused mostly on the tones of conventional acoustic musical instruments. It is unclear whether insights from these studies generalise to electronic sounds, nor is it obvious how these relate to the creation of such sounds. In this work, we present an experiment on the semantic associations of sounds produced by FM synthesis with the aim of identifying whether existing models of timbre semantics are appropriate for such sounds. We applied a novel experimental paradigm in which experienced sound designers responded to semantic prompts by programming a synthesiser, and provided semantic ratings on the sounds they created. Exploratory factor analysis revealed a five-dimensional semantic space. The first two factors mapped well to the concepts of luminance, texture, and mass. The remaining three factors did not have clear parallels, but correlation analysis with acoustic descriptors suggested an acoustical relationship to luminance and texture. Our results suggest that further enquiry into the timbres of disembodied electronic sounds, their synthesis, and their semantic associations would be worthwhile, and that this could benefit research into auditory perception and cognition, as well as synthesis control and audio engineering.
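The factor-analysis step can be illustrated with scikit-learn; this is a simplified sketch with placeholder ratings, not the study's EFA procedure (which involved its own factor retention criteria and rotation choices):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Placeholder semantic ratings: sounds x adjective scales.
rng = np.random.default_rng(3)
ratings = rng.normal(size=(120, 27))

fa = FactorAnalysis(n_components=5, rotation="varimax", random_state=0)
factor_scores = fa.fit_transform(ratings)

# Loadings show which adjectives define each factor (e.g., a
# "luminance" factor loading on bright/dark-type scales).
print(fa.components_.shape)  # (5 factors, 27 scales)
```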
- [SMC] Deep Embeddings for Robust User-Based Amateur Vocal Percussion Transcription. Alejandro Delgado, Emir Demirel, Vinod Subramanian, and 2 more authors. 19th Sound and Music Computing Conference, 2022.
Vocal Percussion Transcription (VPT) is concerned with the automatic detection and classification of vocal percussion sound events, allowing music creators and producers, among others, to sketch drum lines on the fly. VPT classifiers usually learn best from small user-specific datasets, which usually restricts modelling to small input feature sets to avoid overfitting. This study explores several deep supervised learning strategies to obtain informative feature sets for amateur VPT classification. We evaluated their performance on regular VPT classification tasks and compared them with several baseline approaches, including feature selection methods and a state-of-the-art speech recognition engine. The proposed learning models were supervised with several label sets containing information from four different levels of abstraction: instrument-level, syllable-level, phoneme-level, and boxeme-level. Results suggest that convolutional neural networks supervised with syllable-level annotations produced the most informative embeddings for VPT systems, which can be used as input representations for fitting classifiers. Finally, we used back-propagation-based saliency maps to investigate the importance of different spectrogram regions for feature learning.
- [SMPC] Auditory brightness perception investigated by unimodal and crossmodal interference. Charalampos Saitis, Zachary Wallmark, and Annie Liu. Biennial Meeting of the Society for Music Perception and Cognition, 2022.
Brightness is among the most studied aspects of timbre perception. Psychoacoustically, sounds described as “bright” vs “dark” typically exhibit a high vs low frequency emphasis in the spectrum. However, relatively little is known about the neurocognitive mechanisms that facilitate these “metaphors we listen with.” Do they originate in universal mental representations common to more than one sensory modality? Triangulating three different interaction paradigms, we used speeded identification to investigate whether unimodal and crossmodal interference occurs when timbral brightness, as modelled by the centroid of the spectral envelope, is semantically incongruent with 1) pitch height, 2) visual brightness, or 3) numerical value processing. In three online pilot tasks, 58 participants were presented with a baseline stimulus (a pitch, gray square, or numeral) then asked to quickly identify a target stimulus that is higher/lower, brighter/darker, or greater/less than the baseline, respectively, after being primed with a bright or dark synthetic harmonic tone. Additionally, in the pitch and visual tasks, a deceptive same-target condition was included. Results suggest that timbral brightness modulates the perception of pitch and visual brightness, but not numerical value. Semantically incongruent pitch height-timbral brightness shifts produced significantly slower choice reaction time and higher error compared to congruent pairs; timbral brightness also had a strong biasing effect in the same-target condition (i.e., people heard the same pitch as higher when the target tone was timbrally brighter than the baseline, and vice versa with darker tones). In the visual task, incongruent pairings of gray squares and tones elicited slower choice reaction times than congruent pairings. No interference was observed in the number comparison task. We are currently following up on these results with a larger online replication sample, and an fMRI study to investigate the relevant neural mechanisms. Our findings shed light on the multisensory nature of experiencing timbre.
- [HAID] Proceedings of the 11th International Workshop on Haptic and Audio Interaction Design. Eds: Charalampos Saitis, Ildar Farkhatdinov, and Stefano Papetti. Springer Lecture Notes in Computer Science 13417, 2022.
2021
- [Journal] Multimodal Classification of Stressful Environments in Visually Impaired Mobility Using EEG and Peripheral Biosignals. Charalampos Saitis and Kyriaki Kalimeri. IEEE Transactions on Affective Computing, 2021.
In this study, we aim to better understand the cognitive-emotional experience of visually impaired people when navigating in unfamiliar urban environments, both outdoor and indoor. We propose a multimodal framework based on random forest classifiers, which predict the actual environment among predefined generic classes of urban settings, inferring on real-time, non-invasive, ambulatory monitoring of brain and peripheral biosignals. Model performance reached 93% for the outdoor and 87% for the indoor environments (expressed in weighted AUROC), demonstrating the potential of the approach. Estimating the density distributions of the most predictive biomarkers, we present a series of geographic and temporal visualizations depicting the environmental contexts in which the most intense affective and cognitive reactions take place. A linear mixed model analysis revealed significant differences between categories of vision impairment, but not between normal and impaired vision. Despite the limited size of our cohort, these findings pave the way to emotionally intelligent mobility-enhancing systems, capable of implicit adaptation not only to changing environments but also to shifts in the affective state of the user in relation to different environmental and situational factors.
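A minimal sketch of this kind of classification pipeline, with placeholder features standing in for the EEG and peripheral biosignal descriptors, and the weighted AUROC metric reported above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Hypothetical windowed biosignal features (e.g., EEG band powers,
# EDA, heart rate), each window labelled with an environment class.
X = rng.normal(size=(2000, 24))
y = rng.integers(0, 4, size=2000)   # e.g., 4 outdoor environment classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Weighted one-vs-rest AUROC, the figure of merit used in the paper.
proba = clf.predict_proba(X_te)
print(roc_auc_score(y_te, proba, multi_class="ovr", average="weighted"))
```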
- [CMMR] Modelling Moral Traits with Music Listening Preferences and Demographics. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 15th International Symposium on Computer Music Multidisciplinary Research, 2021.
Music has always been an integral part of our everyday lives through which we express feelings, emotions, and concepts. Here, we explore the association between music genres, demographics and moral values employing data from an ad-hoc online survey and the Music Learning Histories Dataset. To further characterise the music preferences of the participants, the generalist/specialist (GS) score was employed. We exploit both classification and regression approaches to assess the predictive power of music preferences for the prediction of demographic attributes as well as the moral values of the participants. Our findings point out that moral values are hard to predict (.62 AUROC_avg) solely from music listening behaviours, while if basic sociodemographic information is provided the prediction score rises by 4% on average (.66 AUROC_avg), with the Purity foundation steadily achieving the highest accuracy scores. Similar results are obtained from the regression analysis. Finally, we provide insights into the most predictive music listening behaviours associated with each moral value, which can inform a wide range of applications from rehabilitation practices to communication campaign design.
- [SEMPRE] Development of a Web Application for the Education, Assessment, and Study of Timbre Perception. Charalampos Saitis. Society for Education, Music, and Psychology Research Conference, 2021.
Timbre is defined as any auditory property other than pitch, duration, and loudness that allows two sounds to be distinguished. The Timbre Explorer (TE) is a synthesiser interface designed to demonstrate timbral dimensions of sound. This project aimed to develop and evaluate a web version of the TE that attempts to train its users and test their understanding of timbre as they go through a series of gamified tasks. A pilot study with 16 participants helped to identify shortcomings ahead of a full-sized study that will evaluate the performance of the TE as an educational aid and musical assessment tool.
- [IC2S2] We are what we listen to: How moral values reflect on musical preferences. Vjosa Preniqi, Kyriaki Kalimeri, and Charalampos Saitis. 7th International Conference on Computational Social Science, 2021.
- [ISMIR] Neural Waveshaping Synthesis. Ben Hayes, Charalampos Saitis, and György Fazekas. 22nd International Society for Music Information Retrieval Conference, 2021.
We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis which operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.
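The central mechanism, a small MLP with periodic activations acting as a learnable waveshaper, can be sketched in PyTorch. This toy version (not the released NEWT implementation) shows an exciter signal shaped samplewise through sine-activated layers with affine input/output transforms:

```python
import torch
from torch import nn

class Sine(nn.Module):
    def forward(self, x):
        return torch.sin(x)

class ToyNEWT(nn.Module):
    """Learnable waveshaper: a small sine-activated MLP applied to
    every sample, with affine transforms on its input and output."""
    def __init__(self, hidden=64):
        super().__init__()
        self.in_gain = nn.Parameter(torch.ones(1))
        self.in_bias = nn.Parameter(torch.zeros(1))
        self.shaper = nn.Sequential(
            nn.Linear(1, hidden), Sine(),
            nn.Linear(hidden, hidden), Sine(),
            nn.Linear(hidden, 1),
        )
        self.out_gain = nn.Parameter(torch.ones(1))

    def forward(self, x):  # x: (batch, samples)
        z = self.in_gain * x + self.in_bias
        y = self.shaper(z.unsqueeze(-1)).squeeze(-1)
        return self.out_gain * y

# Shape a sinusoidal exciter through the learnable transfer function.
t = torch.arange(16000, dtype=torch.float32) / 16000
exciter = torch.sin(2 * torch.pi * 220 * t)
print(ToyNEWT()(exciter.unsqueeze(0)).shape)  # torch.Size([1, 16000])
```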
- [ICMPC] Perceptual and semantic scaling of FM synthesis timbres: Common dimensions and the role of expertise. Ben Hayes, Charalampos Saitis, and György Fazekas. 16th International Conference on Music Perception and Cognition, 2021.
Electronic sound has a rich history, yet timbre research has typically focused on the sounds of physical instruments, while synthesised sound is often relegated to functional roles like recreating acoustic timbres. Studying the perception of synthesised sound can broaden our conception of timbre and improve musical synthesis tools. We aimed to identify the perceptually salient acoustic attributes of sounds produced by frequency modulation synthesis. We also aimed to test Zacharakis et al.’s luminance-texture-mass timbre semantic model [Music Perception, 31, 339–358 (2014)] in this domain. Finally, we aimed to identify effects of prior music or synthesis experience on these results. Our results suggest that discrimination of abstract electronic timbres may rely on attributes distinct from those used with acoustic timbres. Further, the most salient attributes vary with expertise. However, the use of semantic descriptors is similar to that of acoustic instruments, and is consistent across expertise levels.
- [DMRN] NASH: the Neural Audio Synthesis Hackathon. Ben Hayes, Cyrus Vahidi, and Charalampos Saitis. DMRN+16: Digital Music Research Network One-Day Workshop, 2021.
The field of neural audio synthesis aims to produce audio using neural networks. A recent surge in its popularity has led to several high-profile works achieving impressive feats of speech and music synthesis. The development of broadly accessible neural audio synthesis tools, conversely, has been limited, and creative applications of these technologies are mostly undertaken by those with technical know-how. Research has focused largely on tasks such as realistic speech and musical instrument synthesis, whereas investigations into high-level control, esoteric sound design capabilities, and interpretability have received less attention. To encourage innovative work addressing these gaps, C4DM’s Special Interest Group on Neural Audio Synthesis (SIGNAS) proposes to host our first Neural Audio Synthesis Hackathon: a two-day event, with results to be presented in a session at DMRN+16.
- [DMRN] Acoustic Representations for Perceptual Timbre Similarity. Cyrus Vahidi, Ben Hayes, Charalampos Saitis, and 1 more author. DMRN+16: Digital Music Research Network One-Day Workshop, 2021.
In this work, we outline initial steps towards modelling perceptual timbre dissimilarity. We use stimuli from 17 distinct subjective timbre studies and compute pairwise distances in the spaces of MFCCs, joint time-frequency scattering coefficients and Open-L3 embeddings. We analyze agreement of distances in these spaces with human dissimilarity ratings and highlight challenges of this task.
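One leg of this comparison, MFCC distances against human ratings, might look as follows (hypothetical file names; the study also used joint time-frequency scattering coefficients and Open-L3 embeddings):

```python
import itertools
import librosa
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import spearmanr

def mfcc_vector(path, sr=22050):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

# Hypothetical stimuli and human pairwise dissimilarity ratings,
# ordered to match itertools.combinations over the stimuli.
paths = [f"stimulus_{i:02d}.wav" for i in range(12)]
human = np.loadtxt("dissimilarity_ratings.txt")  # hypothetical file

feats = [mfcc_vector(p) for p in paths]
model = [euclidean(feats[i], feats[j])
         for i, j in itertools.combinations(range(len(paths)), 2)]

# Rank agreement between model distances and human ratings.
print(spearmanr(model, human))
```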
- [DMRN] Variational Auto Encoding and Cycle-Consistent Adversarial Networks for Timbre Transfer. Russell Sammut Bonnici, Martin Benning, and Charalampos Saitis. DMRN+16: Digital Music Research Network One-Day Workshop, 2021.
Combining Variational Autoencoders (VAE) with Generative Adversarial Networks (GAN) yields meaningful representations of audio in the context of timbre transfer. This approach was applied to different datasets for transferring vocal timbre between speakers and musical timbre between instruments. Variations of the approach were trained and generalised performance was compared using the Structural Similarity Index and Fréchet Audio Distance. Many-to-many style transfer was found to improve reconstructive performance over one-to-one style transfer.
- [IJCNN] A Modulation Front-End for Music Audio Tagging. Cyrus Vahidi, Charalampos Saitis, and György Fazekas. International Joint Conference on Neural Networks, 2021.
Convolutional Neural Networks have been extensively explored in the task of automatic music tagging. The problem can be approached by using either engineered time-frequency features or raw audio as input. Modulation filter bank representations that have been actively researched as a basis for timbre perception have the potential to facilitate the extraction of perceptually salient features. We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block. The structure is effectively analogous to a modulation filter bank, where the FIR filter center frequencies are learned in a data-driven manner. The expectation is that a perceptually motivated filter bank can provide a useful representation for identifying music features. Our experimental results provide a fully visualisable and interpretable front-end temporal modulation decomposition of raw audio. We evaluate the performance of our model against the state-of-the-art of music tagging on the MagnaTagATune dataset. We analyse the impact on performance for particular tags when time-frequency bands are subsampled by the modulation filters at a progressively reduced rate. We demonstrate that modulation filtering provides promising results for music tagging and feature representation, without using extensive musical domain knowledge in the design of this front-end.
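The modulation block can be pictured as a bank of FIR filters learned along the time axis of a spectrogram. A toy analogue in PyTorch, not the paper's ModNet/SincModNet code:

```python
import torch
from torch import nn

class ToyModulationFrontEnd(nn.Module):
    """Temporal FIR filtering of a spectrogram along the time axis,
    loosely analogous to a learned modulation filter bank."""
    def __init__(self, n_bands=64, n_filters=8, kernel=63):
        super().__init__()
        # Grouped conv: each frequency band gets its own set of
        # n_filters learned temporal modulation filters.
        self.mod = nn.Conv1d(n_bands, n_bands * n_filters, kernel,
                             padding=kernel // 2, groups=n_bands)

    def forward(self, spec):   # spec: (batch, bands, frames)
        return self.mod(spec)  # (batch, bands * n_filters, frames)

spec = torch.randn(4, 64, 256)  # a batch of mel spectrograms
print(ToyModulationFrontEnd()(spec).shape)  # torch.Size([4, 512, 256])
```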
- [AES] Phoneme Mappings for Online Vocal Percussion Transcription. Alejandro Delgado, Charalampos Saitis, and Mark Sandler. 151st Audio Engineering Society Convention, 2021. Honourable Mention for Outstanding Paper.
Vocal Percussion Transcription (VPT) aims at detecting vocal percussion sound events in a beatboxing performance and classifying them into the correct drum instrument class (kick, snare, or hi-hat). To do this in an online (real-time) setting, however, algorithms are forced to classify these events within just a few milliseconds after they are detected. The purpose of this study was to investigate which phoneme-to-instrument mappings are the most robust for online transcription purposes. We used three different evaluation criteria to base our decision upon: frequency of use of phonemes among different performers, spectral similarity to reference drum sounds, and classification separability. With these criteria applied, the recommended mappings would potentially feel natural for performers to articulate while enabling the classification algorithms to achieve the best performance possible. Given the final results, we provided a detailed discussion on which phonemes to choose given different contexts and applications.
- [ICMC] Learning Models for Query by Vocal Percussion: A Comparative Study. Alejandro Delgado, SKoT McDonald, Ning Xu, and 2 more authors. 46th International Computer Music Conference, 2021.
The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Thus, the automatic retrieval of drum sounds using vocal percussion can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. Here we explore different strategies to perform this type of query, making use of both traditional machine learning algorithms and recent deep learning techniques. The main hyperparameters from the models involved are carefully selected by feeding performance metrics to a grid search algorithm. We also look into several audio data augmentation techniques, which can potentially regularise deep learning models and improve generalisation. We compare the final performances in terms of effectiveness (classification accuracy), efficiency (computational speed), stability (performance consistency), and interpretability (decision patterns), and discuss the relevance of these results when it comes to the design of successful query-by-vocal-percussion systems.
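The hyperparameter selection described above follows a standard grid-search pattern; a minimal sketch with a placeholder feature matrix and grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)

# Hypothetical vocal-percussion features and drum-class labels.
X = rng.normal(size=(300, 40))
y = rng.choice(["kick", "snare", "hihat"], size=300)

# Feed performance metrics to a grid search over model hyperparameters.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7],
                "weights": ["uniform", "distance"]},
    cv=5, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```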
- [NIME] The Timbre Explorer: A Synthesizer Interface for Educational Purposes and Perceptual Studies. Joshua Ryan Lam and Charalampos Saitis. International Conference on New Interfaces for Musical Expression, 2021.
When two sounds are played at the same loudness, pitch, and duration, what sets them apart are their timbres. This study documents the design and implementation of the Timbre Explorer, a synthesizer interface based on efforts to dimensionalize this perceptual concept. The resulting prototype controls four perceptually salient dimensions of timbre in real-time: attack time, brightness, spectral flux, and spectral density. A graphical user interface supports user understanding with live visualizations of the effects of each dimension. The applications of this interface are three-fold; further perceptual timbre studies, usage as a practical shortcut for synthesizers, and educating users about the frequency domain, sound synthesis, and the concept of timbre. The project has since been expanded to a standalone version independent of a computer and a purely online web-audio version.
2020
- [Talk] How we talk about sound: Semantic dimensions of abstract timbres. Ben Hayes and Charalampos Saitis. Sound Instruments and Sonic Cultures: An Interdisciplinary Conference, National Science & Media Museum, 2020.
Synthesisers, in their many forms, enable the realisation of almost any conceivable sound. Their fine-grained control and broad timbral palette call for a descriptive lexicon to enable their verbal differentiation and discussion. While acoustic instruments of the western classical lineage are the subject of an extensive body of enquiry into the perceptual attributes and semantic associations of the sounds they produce, abstract electronic sounds have been comparatively understudied in this regard. In particular, the diverse vocabulary used to describe such classical acoustic instruments can be summarised with three conceptual metaphors—such musical tones have luminance, texture, and mass—but this has yet to be explicitly confirmed for the kinds of electronic sounds that pervade many modern sonic cultures. In this work, we present an experimental paradigm for studying the semantic associations of synthesised sounds, wherein a group of experienced music producers and sound designers interacted with a web-based synthesiser in response to descriptive prompts, and provided comparative semantic ratings on the sounds they created. The words used for semantic ratings were selected by mining a text corpus from the popular modular synthesis forum Muff Wiggler, and analysing the frequency of adjectives in contexts pertaining to timbre. The ratings provided by participants were subject to statistical analysis. From 27 initial adjectives, two underlying semantic factors were revealed: terms including aggressive, hard, and complex associated with the first, and dark and warm with the second. These factors differ from those found for classical acoustic sounds, implying a relationship between the qualia of a sonic experience and the language employed to talk about it. Such insight has implications for how sound is conceptualised, understood, and received within sonic cultures—in particular, those predicated on electronic or abstract sound—and applications in developing novel control schemes for synthesis methods.
- [EuroHaptics] Analysing and countering bodily interference in vibrotactile devices introduced by human interaction and physiology. Maximilian Weber and Charalampos Saitis. 12th EuroHaptics Conference, 2020.
- [Journal] Timbre semantics through the lens of crossmodal correspondences: A new way of asking old questions. Charalampos Saitis, Stefan Weinzierl, Katharina Kriegstein, and 2 more authors. Acoustical Science and Technology, 2020.
This position paper argues that a systematic study of the behavioral and neural mechanisms of crossmodal correspondences between timbral dimensions of sound and perceptual dimensions of other sensory modalities, such as brightness, roughness, or sweetness, can offer a new way of addressing old questions about the perceptual and neurocognitive mechanisms of auditory semantics. At the same time, timbre and the crossmodal metaphors that dominate its conceptualization can provide a test case for better understanding the neural basis of crossmodal correspondences and human semantic processing in general.
- [Preprint] What do people know about sensation and perception? Understanding perceptions of sensory experience. Christine Cuskley and Charalampos Saitis. PsyArXiv preprint, 2020.
Academic disciplines spanning cognitive science, art, and music have made strides in understanding how humans sense and experience the world. We now have a better scientific understanding of how human sensation and perception function, both in the brain and in interaction, than ever before. However, there is little research on how this high-level scientific understanding is translated into knowledge for the public more widely. We present descriptive results from a simple survey and compare how public understanding and perception of sensory experience line up with scientific understanding. Results show that even in a sample with fairly high educational attainment, many respondents were unaware of fairly common forms of sensory variation. In line with the well-documented under-representation of sign languages within linguistics, respondents tended to under-estimate the number of sign languages in the world. We outline how our results represent gaps in public understanding of sensory variation, and argue that filling these gaps can form an important early intervention, acting as a basic foundation for improving acceptance, inclusivity, and accessibility for cognitively diverse populations.
- [F. Acust.] Timbre in Binaural Listening: A Comparison of Timbre Descriptors in Anechoic and HRTF Filtered Orchestral Sounds. Georgios Marentakis and Charalampos Saitis. Forum Acusticum, 2020.
The psychoacoustic investigation of timbre traditionally relies on audio descriptors extracted from anechoic or semi-anechoic recordings of musical instrument sounds, which are presented to listeners in diotic fashion. As a result, the extent to which spectral modifications due to the outer ear interact with timbre perception is not fully understood. As a first step towards investigating this research question, we examine here whether timbre descriptors calculated using HRTF filtered instrumental sounds deviate across ears and from values obtained from the same sounds without HRTF filtering for different listeners. The sound set comprised isolated notes played at the same fundamental frequency and dynamic from a database of anechoic recordings of modern orchestral instruments and some of their classical and baroque precursors. These were convolved with anechoic high spatial resolution HRTFs of human listeners. We present results and discuss implications for research on timbre perception and cognition.
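The HRTF-filtering step reduces to per-ear convolution of the anechoic note with a head-related impulse response, after which descriptors can be compared across ears. A sketch with hypothetical file names:

```python
import librosa
from scipy.signal import fftconvolve

# Hypothetical inputs: an anechoic instrument note and a 2-channel HRIR.
note, sr = librosa.load("anechoic_note.wav", sr=None, mono=True)
hrir, _ = librosa.load("hrir_left_right.wav", sr=sr, mono=False)

left = fftconvolve(note, hrir[0])
right = fftconvolve(note, hrir[1])

# Compare a timbre descriptor across ears and against the dry signal.
for name, sig in [("dry", note), ("left", left), ("right", right)]:
    sc = librosa.feature.spectral_centroid(y=sig, sr=sr).mean()
    print(f"{name}: mean spectral centroid = {sc:.1f} Hz")
```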
- [DMRN] Perceptual Similarities in Neural Timbre Embeddings. Ben Hayes, Luke Brosnahan, Charalampos Saitis, and 1 more author. DMRN+15: Digital Music Research Network One-Day Workshop, 2020.
Many neural audio synthesis models learn a representational space which can be used for control or exploration of the sounds generated. It is unclear what relationship exists between this space and human perception of these sounds. In this work, we compute configurational similarity metrics between an embedding space learned by a neural audio synthesis model and conventional perceptual and semantic timbre spaces. These spaces are computed using abstract synthesised sounds. We find significant similarities between these spaces, suggesting a shared organisational influence.
- [Timbre] There’s More to Timbre than Musical Instruments: Semantic Dimensions of FM Sounds. Ben Hayes and Charalampos Saitis. 2nd International Conference on Timbre, 2020.
Much previous research into timbre semantics (such as when an oboe is described as “hollow”) has focused on sounds produced by acoustic instruments, particularly those associated with western tonal music (Saitis & Weinzierl, 2019). Many synthesisers are capable of producing sounds outside the timbral range of physical instruments, but which are still discriminable by their timbre. Research into the perception of such sounds, therefore, may help elucidate further the mechanisms underpinning our experience of timbre in the broader sense. In this paper, we present a novel paradigm on the application of semantic descriptors to sounds produced by experienced sound designers using an FM synthesiser with a full set of controls.
- [Timbre] Evidence for Timbre Space Robustness to an Uncontrolled Online Stimulus Presentation. Asterios Zacharakis, Ben Hayes, Charalampos Saitis, and 1 more author. 2nd International Conference on Timbre, 2020.
Research on timbre perception is typically conducted under controlled laboratory conditions where every effort is made to keep stimulus presentation conditions fixed (McAdams, 2019). This conforms with the ANSI (1973) definition of timbre, which suggests that in order to judge the timbre differences between a pair of sounds the remaining perceptual attributes (i.e., pitch, duration and loudness) should remain unchanged. Therefore, especially in pairwise dissimilarity studies, particular care is taken to ensure that loudness is not used by participants as a criterion for judgements by equalising it across experimental stimuli. On the other hand, conducting online experiments is an increasingly favoured practice in the music perception and cognition field, as targeting relevant communities can potentially provide a large number of suitable participants with relatively little time investment on the side of the experimenters (e.g., Woods et al., 2015). However, the strict requirements for stimulus preparation and presentation have prevented timbre studies from adopting online experimentation. Beyond the obvious difficulty of imposing equal loudness in online experiments, the different playback equipment chains (DACs, pre-amplifiers, headphones) will almost inevitably ‘colour’ the sonic outcome in different ways. Despite these limitations, in a social distancing time like this, it would be of major importance to be able to lift some of the physical requirements in order to carry on conducting behavioural research on timbre perception. Therefore, this study aims to investigate the extent to which an uncontrolled online replication of a past laboratory-conducted pairwise dissimilarity task will distort the findings.
- [Timbre] Spectral and Temporal Timbral Cues of Vocal Imitations. Alejandro Delgado, Charalampos Saitis, and Mark Sandler. 2nd International Conference on Timbre, 2020.
The imitation of non-vocal sounds using the human voice is a resource we sometimes rely on when communicating sound concepts to other people. Query by Vocal Percussion (QVP) is a subfield in Music Information Retrieval (MIR) that explores techniques to query percussive sounds using vocal imitations as input, usually plosive consonant sounds. The goal of this work was to investigate timbral relationships between real drum sounds and their vocal imitations. We believe these insights could shed light on how to select timbre descriptors for extraction when designing offline and online QVP systems. In particular, we studied a dataset composed of 30 acoustic and electronic drum sound recordings and vocal imitations of each sound performed by 14 musicians. Our approach was to study the correlation of audio content descriptors of timbre extracted from the drum samples with the same descriptors taken from vocal imitations. Three timbral descriptors were selected: the Log Attack Time (LAT), the Spectral Centroid (SC), and the Derivative After Maximum of the sound envelope (DAM). LAT and SC have been shown to represent salient dimensions of timbre across different types of sounds including percussion. In this sense, one intriguing question would be to what extent listeners can communicate these salient timbral cues in vocal imitations. The third descriptor, DAM, was selected for its role in describing the sound’s tail, which we considered to be a relevant part of percussive utterances.
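The descriptor correlation analysis can be sketched for the spectral centroid case (hypothetical file names; LAT and DAM are computed from the amplitude envelope in the same spirit):

```python
import librosa
from scipy.stats import pearsonr

def mean_spectral_centroid(path):
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.spectral_centroid(y=y, sr=sr).mean()

# Hypothetical paired recordings: 30 drum sounds and one performer's
# vocal imitation of each.
drums = [f"drum_{i:02d}.wav" for i in range(30)]
imitations = [f"imitation_{i:02d}.wav" for i in range(30)]

sc_drums = [mean_spectral_centroid(p) for p in drums]
sc_vocal = [mean_spectral_centroid(p) for p in imitations]

# If imitators communicate brightness, the centroids should correlate.
r, p = pearsonr(sc_drums, sc_vocal)
print(f"r = {r:.2f}, p = {p:.3f}")
```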
- [Timbre] Timbre Space Representation of a Subtractive Synthesizer. Cyrus Vahidi, György Fazekas, Charalampos Saitis, and 1 more author. 2nd International Conference on Timbre, 2020.
In this study, we produce a geometrically scaled perceptual timbre space from dissimilarity ratings of subtractive synthesized sounds and correlate the resulting dimensions with a set of acoustic descriptors. We curate a set of 15 sounds, produced by a synthesis model that uses varying source waveforms, frequency modulation (FM) and a lowpass filter with an enveloped cutoff frequency. Pairwise dissimilarity ratings were collected within an online browser-based experiment. We hypothesized that a varied waveform input source and enveloped filter would act as the main vehicles for timbral variation, providing novel acoustic correlates for the perception of synthesized timbres.
- [Timbre] Verbal description of musical brightness. Christos Drouzas and Charalampos Saitis. 2nd International Conference on Timbre, 2020.
Amongst the most common descriptive expressions of timbre used by musicians, music engineers, audio researchers as well as everyday listeners are words related to the notion of brightness (e.g., bright, dark, dull, brilliant, shining). From a psychoacoustic perspective, brightness ratings of instrumental timbres as well as music excerpts systematically correlate with the centre of gravity of the spectral envelope and thus brightness as a semantic descriptor of musical sound has come to denote a prevalence of high-frequency over low-frequency energy. However, relatively little is known about the higher-level cognitive processes underpinning musical brightness ratings. Psycholinguistic investigations of verbal descriptions of timbre suggest a more complex, polysemic picture (Saitis & Weinzierl 2019) that warrants further research. To better understand how musical brightness is conceptualised by listeners, here we analysed free verbal descriptions collected along brightness ratings of short music snippets (involving 69 listeners) and brightness ratings of orchestral instrument notes (involving 68 listeners). Such knowledge can help delineate the intrinsic structure of brightness as a perceptual attribute of musical sounds, and has broad implications and applications in orchestration, audio engineering, and music psychology.
- [Journal] Brightness perception for musical instrument sounds: Relation to timbre dissimilarity and source-cause categories. Charalampos Saitis and Kai Siedenburg. The Journal of the Acoustical Society of America, 2020.
Timbre dissimilarity of orchestral sounds is well-known to be multidimensional, with attack time and spectral centroid representing its two most robust acoustical correlates. The centroid dimension is traditionally considered as reflecting timbral brightness. However, the question of whether multiple continuous acoustical and/or categorical cues influence brightness perception has not been addressed comprehensively. A triangulation approach was used to examine the dimensionality of timbral brightness, its robustness across different psychoacoustical contexts, and relation to perception of the sounds’ source-cause. Listeners compared 14 acoustic instrument sounds in three distinct tasks that collected general dissimilarity, brightness dissimilarity, and direct multi-stimulus brightness ratings. Results confirmed that brightness is a robust unitary auditory dimension, with direct ratings recovering the centroid dimension of general dissimilarity. When a two-dimensional space of brightness dissimilarity was considered, its second dimension correlated with the attack-time dimension of general dissimilarity, which was interpreted as reflecting a potential infiltration of the latter into brightness dissimilarity. Dissimilarity data were further modeled using partial least-squares regression with audio descriptors as predictors. Adding predictors derived from instrument family and the type of resonator and excitation did not improve the model fit, indicating that brightness perception is underpinned primarily by acoustical rather than source-cause cues.
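The partial least-squares modelling step maps onto scikit-learn directly; a minimal sketch with placeholder matrices standing in for the descriptor predictors and dissimilarity targets:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)

# Placeholder data: one row per sound pair, audio-descriptor
# differences as predictors, brightness dissimilarity as target.
X = rng.normal(size=(91, 10))   # 14 sounds -> 91 unordered pairs
y = rng.normal(size=91)

# Cross-validated R^2 of the PLS model; in the paper, adding
# source-cause predictors to X did not improve this fit.
pls = PLSRegression(n_components=3)
print(cross_val_score(pls, X, y, cv=5, scoring="r2").mean())
```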
- [HAID] Towards a framework for ubiquitous audio-tactile design. Maximilian Weber and Charalampos Saitis. 10th International Workshop on Haptic and Audio Interaction Design, 2020.
To enable a transition towards rich vibrotactile feedback in applications and media content, a complete end-to-end system — from the design of the tactile experience all the way to the tactile stimulus reproduction — needs to be considered. Currently, most applications are at best limited to dull vibration patterns due to limited hard- and software implementations, while the design of ubiquitous platform-agnostic tactile stimuli remains challenging due to a lack of standardized protocols and tools for tactile design, storage, transport, and reproduction. This work proposes a conceptual framework, utilizing audio assets as a starting point for the design of vibrotactile stimuli, including ideas for a parametric tactile data model, and outlines challenges for a platform-agnostic stimuli reproduction. Finally, the benefits and shortcomings of a commercial and wide-spread vibrotactile API are investigated as an example for the current state of a complete end-to-end framework.
- [SMC] Musical dynamics classification with CNN and modulation spectra. Luca Marinelli, Athanasios Lykartsis, Stefan Weinzierl, and 1 more author. 17th Sound and Music Computing Conference, 2020.
To investigate variations in the timbre space with regard to musical dynamics, convolutional neural networks (CNNs) were trained on modulation power spectra (MPS), mel-scaled and ERB-scaled spectrograms of single notes of sustained instruments played at two dynamics extremes (pp and ff). The samples, from an extensive dataset of several timbre families, were RMS-normalized in order to eliminate the loudness information and force the network to focus on timbre attributes of musical dynamics that are shared across different instrument families. The proposed CNN architecture obtained competitive results in three classification tasks with all three input representations. In order to compare the different input representations, the test sets in the three experiments were partitioned so as to promote or avoid selection bias. When selection bias was avoided, models trained on MPS were outperformed by those trained on time-frequency representations; conversely, those trained on MPS achieved the best results when selection bias was promoted. Low-temporal modulations emerged in class-specific MPS saliency maps as markers of musical dynamics. This led to the implementation of an MPS-based scalar descriptor of timbre that largely outperformed the chosen baseline (44.8% error reduction).
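A modulation power spectrum of the kind used here is commonly constructed as the squared magnitude of a 2D Fourier transform of a log spectrogram; a sketch of that construction (the paper's exact parameters may differ):

```python
import librosa
import numpy as np

def modulation_power_spectrum(y, sr):
    """2D-FFT construction of the MPS: spectral modulation axis
    (from the frequency axis) x temporal modulation axis (from time)."""
    logmel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    return np.abs(np.fft.fftshift(np.fft.fft2(logmel))) ** 2

# Bundled librosa example (downloads on first use).
y, sr = librosa.load(librosa.example("trumpet"))
print(modulation_power_spectrum(y, sr).shape)
```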
- [Timbre] Proceedings of the 2nd International Conference on Timbre. Eds: Asterios Zacharakis, Charalampos Saitis, and Kai Siedenburg. The School of Music Studies, Aristotle University of Thessaloniki, 2020.
2019
- [DMRN] Modulation Spectra for Musical Dynamics Perception and Retrieval. Luca Marinelli, Athanasios Lykartsis, and Charalampos Saitis. DMRN+14: Digital Music Research Network One-Day Workshop, 2019.
- [ICA] The role of attack transients in timbral brightness perception. Charalampos Saitis, Kai Siedenburg, Paul Schuladen, and 1 more author. 23rd International Congress on Acoustics, 2019.
http://pub.dega-akustik.de/ICA2019/data/articles/000813.pdf
- [SMPC] Revisiting timbral brightness perception. Charalampos Saitis, Kai Siedenburg, and Christoph Reuter. Biennial Meeting of the Society for Music Perception and Cognition, 2019.
Brightness has long been shown to play a major role in timbre perception but relatively little is known about the specific acoustic and cognitive factors that affect brightness ratings of musical instrument sounds. Previous work indicated that sound source categories influence general timbre dissimilarity ratings. To examine whether source categories also exert an effect on brightness ratings of timbre, we collected brightness dissimilarity ratings of 14 orchestral instrument tones from 40 musically experienced listeners and modeled the data using a partial least-squares regression with audio descriptors of timbre as regressors. It was found that adding predictors derived from sound source categories did not improve the model fit, indicating that timbral brightness is informed mainly by continuously varying properties of the acoustic signal. A multidimensional scaling analysis suggested at least two salient cues: spectral energy distribution and attack time and/or asynchrony in the rise of harmonics. This finding seems to challenge the typical approach of seeking acoustical correlates of brightness in the spectral envelope of the steady-state portion of sounds. To further investigate these aspects in timbral brightness perception, a new group of 40 musically experienced listeners will perform MUSHRA-like brightness ratings of an expanded set of 24 orchestral instrument notes. The goal is to obtain a perceptual scaling of the attribute across a larger set of sounds to help delineate the acoustic ingredients of this important aspect of timbre perception. Preliminary results indicate that between sounds with very close spectral centroid values but different attack times, those with faster attacks tend to be perceived as brighter. Overall, these experiments help clarify the relation between two salient dimensions of timbre: onset and spectral energy distribution.
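The multidimensional scaling step can be sketched with scikit-learn, using a placeholder dissimilarity matrix in place of the averaged listener ratings:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(7)

# Placeholder mean brightness-dissimilarity matrix for 14 sounds:
# symmetric with a zero diagonal.
d = rng.uniform(0.1, 1.0, size=(14, 14))
dissim = (d + d.T) / 2
np.fill_diagonal(dissim, 0.0)

# Embed the sounds in two dimensions; the recovered axes can then be
# correlated with descriptors such as spectral centroid and attack time.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
print(coords.shape)  # (14, 2)
```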
- [SMPC] There’s more to timbre than musical instruments: a meta-analysis of timbre semantics in singing voice quality perception. Charalampos Saitis and Johanna Devaney. Biennial Meeting of the Society for Music Perception and Cognition, 2019.
Imagine listening to the famous soprano Maria Callas (1923–1977) singing the aria “Vissi d’arte” from Puccini’s Tosca. How would you describe the quality of her voice? When describing the timbre of musical sounds, listeners use descriptions such as bright, heavy, round, and rough, among others. In 1890, Stumpf theorized that this diverse vocabulary can be summarized, on the basis of semantic proximities, by three pairs of opposites: dark–bright, soft–rough, and full–empty. Empirical findings across many semantic differential studies from the late 1950s until today have generally confirmed that these are the salient dimensions of timbre semantics. However, most prior work has considered only orchestral instruments, with relatively little attention given to sung tones. At the same time, research on the perception of singing voice quality has primarily focused on verbal attributes associated with phonation type, voice classification, vocal register, vowel intelligibility, and vibrato. Descriptions like pressed, soprano, falsetto, hoarse, or wobble, albeit in themselves a type of timbre semantics, are essentially sound source identifiers acting as semantic descriptors. It remains an open question as to whether the timbral attributes of sung tones, that is verbal attributes that bear no source associations, can be described adequately on the basis of the bright-rough-full semantic space. We present a meta-analysis of previous research on verbal attributes of singing voice timbre that covers not only pedagogical texts but also work from music cognition, psychoacoustics, music information retrieval, musicology, and ethnomusicology. The meta-analysis lays the groundwork for a semantic differential study of sung sounds, providing a more appropriate lexicon on which to draw than simply using verbal scales from related work on instrumental timbre. The meta-analysis will be complemented by a psycholinguistic analysis of free verbalizations provided by singing teachers in a listening test and an acoustic analysis of the tested stimuli.
- [SMPC] Spectrotemporal modulation timbre cues in musical dynamics. Charalampos Saitis, Luca Marinelli, Athanasios Lykartsis, and 1 more author. Biennial Meeting of the Society for Music Perception and Cognition, 2019.
Timbre is often described as a complex set of sound features that are not accounted for by pitch, loudness, duration, spatial location, and the acoustic environment. Musical dynamics refers to the perceived or intended loudness of a played note, instructed in music notation as piano or forte (soft or loud) with different dynamic gradations between and beyond. Recent research has shown that even if no loudness cues are available, listeners can still quite reliably identify the intended dynamic strength of a performed sound by relying on timbral features. More recently, acoustical analyses across an extensive set of anechoic recordings of orchestral instrument notes played at pianissimo (pp) and fortissimo (ff) showed that attack slope, spectral skewness, and spectral flatness together explained 72% of the variance in dynamic strength across all instruments, and 89% with an instrument-specific model. Here, we further investigate the role of timbre in musical dynamics, focusing specifically on the contribution of spectral and temporal modulations. Loudness-normalized modulation power spectra (MPS) were used as input representation for a convolutional neural network (CNN). Through visualization of the pp and ff saliency maps of the CNN it was possible to identify discriminant regions of the MPS and define a novel task-specific scalar audio descriptor. A linear discriminant analysis with 10-fold cross-validation using this new MPS-based descriptor on the entire dataset performed better than using the two spectral descriptors (27% error rate reduction). Overall, audio descriptors based on different regions of the MPS could serve as sound representation for machine listening applications, as well as to better delineate the acoustic ingredients of different aspects of auditory perception.
- [CMMR] Beyond the semantic differential: Timbre semantics as crossmodal correspondences. Charalampos Saitis. 14th International Symposium on Computer Music Multidisciplinary Research, 2019.
This position paper argues that a systematic study of crossmodal correspondences between timbral dimensions of sound and perceptual dimensions of other sensory modalities (e.g., brightness, fullness, roughness, sweetness) can offer a new way of addressing old questions about the perceptual and cognitive mechanisms of timbre semantics, while the latter can provide a test case for better understanding crossmodal correspondences and human semantic processing in general. Furthermore, a systematic investigation of auditory-nonauditory crossmodal correspondences necessitates auditory stimuli that can be intuitively controlled along intrinsic continuous dimensions of timbre, and the collection of behavioural data from appropriate tasks that extend beyond the semantic differential paradigm.
- [ISMA] Sounds like melted chocolate: how musicians conceptualize violin sound richness. Charalampos Saitis, Claudia Fritz, and Gary Scavone. International Symposium on Musical Acoustics, 2019.
Results from a previous study on the perceptual evaluation of violins that involved playing-based semantic ratings showed that preference for a violin was strongly associated with its perceived sound richness. However, both preference and richness ratings varied widely between individual violinists, likely because musicians conceptualize the same attribute in different ways. To better understand how richness is conceptualized by violinists and how it contributes to the perceived quality of a violin, we analyzed free verbal descriptions collected during a carefully controlled playing task (involving 16 violinists) and in an online survey where no sound examples or other contextual information was present (involving 34 violinists). The analysis was based on a psycholinguistic method, whereby semantic categories are inferred from the verbal data itself through syntactic context and linguistic markers. The main sensory property related to violin sound richness was expressed through words such as full, complex, and dense versus thin and small, referring to the perceived number of partials present in the sound. Another sensory property was expressed through words such as warm, velvety, and smooth versus strident, harsh, and tinny, alluding to spectral energy distribution cues. Haptic cues were also implicated in the conceptualization of violin sound richness.
- [Chapter] The Semantics of Timbre. Charalampos Saitis and Stefan Weinzierl. In Timbre: Acoustics, Perception, and Cognition, 2019.
Because humans lack a sensory vocabulary for auditory experiences, timbral qualities of sounds are often conceptualized and communicated through readily available sensory attributes from different modalities (e.g., bright, warm, sweet) but also through the use of onomatopoeic attributes (e.g., ringing, buzzing, shrill) or nonsensory attributes relating to abstract constructs (e.g., rich, complex, harsh). The analysis of the linguistic description of timbre, or timbre semantics, can be considered as one way to study its perceptual representation empirically. In the most commonly adopted approach, timbre is considered as a set of verbally defined perceptual attributes that represent the dimensions of a semantic timbre space. Previous studies have identified three salient semantic dimensions for timbre along with related acoustic properties. Comparisons with similarity-based multidimensional models confirm the strong link between perceiving timbre and talking about it. Still, the cognitive and neural mechanisms of timbre semantics remain largely unknown and underexplored, especially when one looks beyond the case of acoustic musical instruments.
- [Chapter] The present, past, and future of timbre research. Kai Siedenburg, Charalampos Saitis, and Stephen McAdams. In Timbre: Acoustics, Perception, and Cognition, 2019.
Timbre is a foundational aspect of hearing. The remarkable ability of humans to recognize sound sources and events (e.g., glass breaking, a friend’s voice, a tone from a piano) stems primarily from a capacity to perceive and process differences in the timbre of sounds. Roughly defined, timbre is thought of as any property other than pitch, duration, and loudness that allows two sounds to be distinguished. Current research unfolds along three main fronts: (1) principal perceptual and cognitive processes; (2) the role of timbre in human voice perception, perception through cochlear implants, music perception, sound quality, and sound design; and (3) computational acoustic modeling. Along these three scientific fronts, significant breakthroughs have been achieved during the decade prior to the production of this volume. Bringing together leading experts from around the world, this volume provides a joint forum for novel insights and the first comprehensive modern account of research topics and methods on the perception, cognition, and acoustic modeling of timbre. This chapter provides background information and a roadmap for the volume.
- [Chapter] Audio Content Descriptors of Timbre. Marcelo Caetano, Charalampos Saitis, and Kai Siedenburg. In Timbre: Acoustics, Perception, and Cognition, 2019.
This chapter introduces acoustic modeling of timbre with the audio descriptors commonly used in music, speech, and environmental sound studies. These descriptors derive from different representations of sound, ranging from the waveform to sophisticated time-frequency transforms. Each representation is more appropriate for a specific aspect of sound description that is dependent on the information captured. Auditory models of both temporal and spectral information can be related to aspects of timbre perception, whereas the excitation-filter model of sound production provides links to the acoustics of sound production. A brief review of the most common representations of audio signals used to extract audio descriptors related to timbre is followed by a discussion of the audio descriptor extraction process using those representations. This chapter covers traditional temporal and spectral descriptors, including harmonic description, time-varying descriptors, and techniques for descriptor selection and descriptor decomposition. The discussion is focused on conceptual aspects of the acoustic modeling of timbre and the relationship between the descriptors and timbre perception, semantics, and cognition, including illustrative examples. The applications covered in this chapter range from timbre psychoacoustics and multimedia descriptions to computer-aided orchestration and sound morphing. Finally, the chapter concludes with speculation on the role of deep learning in the future of timbre description and on the challenges of audio content descriptors of timbre.
- [Book] Timbre: Acoustics, Perception, and Cognition. Eds: Kai Siedenburg, Charalampos Saitis, Stephen McAdams, and 2 more editors. Springer Handbook of Auditory Research 69, 2019.