Introduction

Understanding how musical features shape the perception of emotion is a core area of research in music psychology (Gabrielsson & Lindström, 2010). Perceived emotion, defined as the emotions that listeners recognize as being expressed by music, is central to how music communicates meaning (Feng et al., 2003; Gabrielsson, 2001; Schubert, 2013). In contrast, felt emotion refers to the subjective emotional experience elicited in listeners (Gabrielsson, 2001), which can vary significantly depending on individual factors such as mood, personality, and context (Juslin & Laukka, 2004; Xu, Wen et al., 2021). Although these two dimensions are interconnected, perceived emotion focuses on the communicative intent of the music itself, making it particularly relevant for understanding how musical elements, such as melody, harmony, rhythm, and especially timbre, contribute to emotional expression (Korsmit et al., 2024; Xu, Sun et al., 2021). By shaping listeners’ interpretations of emotion, these elements enable music to convey nuanced emotional content independent of listeners’ personal reactions.

While most research has predominantly focused on Western music, there is a growing recognition of the need to explore this relationship across different musical traditions (Hu & Yang, 2017; Jacoby et al., 2020; Trehub et al., 2015). Chinese traditional instrumental music, characterized by its distinctive pentatonic system and unique timbral qualities (Nan & Guan, 2023), offers a rich context for such exploration. This study seeks to deepen our understanding of how timbral features contribute to perceived emotion in Chinese traditional music, broadening the global perspective on music’s emotional expressiveness.

Chinese traditional music has a long-standing history that sets it apart from Western music, both in terms of instruments and aesthetics (Rao, 2002; Wu et al., 2024). Instruments such as the Erhu, Pipa, and Sheng are crafted from natural materials, giving them distinctive tonal qualities that significantly impact the emotional experience of the music (Hao, 2023). The pentatonic system, which forms the basis of much Chinese traditional music, creates a spacious and open sound that contrasts with the heptatonic system of Western music (Zhang et al., 2022). These differences extend beyond the structural level and are deeply intertwined with cultural philosophies, such as Confucianism and Taoism, which emphasize harmony and balance (Hao, 2023). The unique cultural and aesthetic context of Chinese traditional music thus shapes both emotional expression and emotional perception, offering an opportunity to explore how timbral features may evoke emotions differently than in Western music.

In the broader study of musical emotion, timbre has been recognized as a critical factor influencing listeners’ affective responses (Korsmit et al., 2024), although its importance is sometimes overlooked (Filipic et al., 2010). Research highlights that timbre plays a fundamental role in conveying musical emotion (McAdams, 2019; Schutz et al., 2008). However, findings about the emotional associations of different timbres are often inconsistent. For example, timbres have been linked to both anger and fear as well as positive emotions (Grimaud & Eerola, 2022; Xu, Wen et al., 2021). Hence, how do these relationships manifest in Chinese traditional instrumental music? This study seeks to investigate this question in depth. Moreover, recent advancements in computational analysis have enabled more precise explorations of how specific audio features relate to perceived emotion (Panda et al., 2020). Timbre, often characterized by parameters such as brightness, harmonicity, and spectral features (Peeters et al., 2011; Korsmit et al., 2024), plays a crucial role in shaping emotional perception. For instance, a higher spectral centroid is associated with brighter, more energetic emotions, whereas lower centroids evoke darker, more subdued feelings (Peeters et al., 2011). By employing advanced computational techniques, this study will investigate how timbral features in Chinese traditional music influence perceived emotion.

An additional important aspect of studying musical emotion is the selection of an appropriate emotion model. While many studies have applied general human emotion models, such as Ekman’s discrete emotion model (Ekman, 1992) or Russell’s dimensional model (Russell, 1980), these frameworks may not fully capture the specific nuances of music-related emotions (Korsmit et al., 2023). To address these limitations, music-specific models, like Zentner et al.’s (2008) nine-factor model and the three-dimensional model proposed by Greenberg et al. (2016), have been developed. In the context of Chinese traditional music, Shi (2015) proposed a seven-factor model of musical emotions, encompassing anger, sadness, happiness, peacefulness, transcendence, gentleness, and solemnness. The development of this discrete emotion model followed a methodological framework similar to that of Zentner et al. (2008), involving three key steps: compiling music-related emotion terms, conducting exploratory factor analysis to identify the underlying emotional dimensions, and employing confirmatory factor analysis to validate the structure (Shi, 2015). Many of the factors identified in Shi’s model align with well-established dimensions of musical emotions, underscoring cross-cultural commonalities in emotional experiences. For instance, anger, happiness, and sadness are basic emotions extensively studied in the field of affective science (Laukka et al., 2013), while gentleness and peacefulness are often used to describe neutral emotional states (Zentner et al., 2008). Notably, solemnness and transcendence stand out as prominent aesthetic emotions, reflecting the deeper, often spiritual dimensions of musical experience (Akkermans et al., 2018; Zentner et al., 2008). This study will adopt Shi’s (2015) seven-factor model to explore the relationship between timbral features and the perceived emotions in Chinese traditional music.

In sum, we conducted an exploratory study to investigate the associations between affective timbres and the perceived emotions of Chinese traditional music. By integrating audio feature extraction techniques and machine learning (ML) methods, this study aimed to address the following three questions: (a) Can timbral features that have been shown to predict different emotions in Western music also predict perceived emotion in Chinese traditional music? (b) Based on the results of computational modeling, which timbral features are most effective in predicting perceived emotion in Chinese traditional music? (c) How do these findings compare to those from studies on Western music, highlighting similarities and differences? Answering these questions will deepen our understanding of the emotional expressiveness of Chinese traditional music and contribute to a broader cross-cultural perspective in music research.

Methods

Dataset

This study employed music excerpts and corresponding emotion annotations from the Chinese Traditional Instrumental Music (CTIM) dataset (Wu et al., 2024), which was specifically designed to comprehensively represent the diversity and emotional depth of traditional Chinese instrumental music. While earlier datasets (e.g., Li et al., 2012; Xu, Yun et al., 2022) included some Chinese instrumental pieces, they often lacked genre-specific focus and sufficient coverage of traditional repertoire. In contrast, the CTIM dataset was curated through a rigorous process led by an expert panel comprising seasoned musicians and psychology graduate students. This panel selected 145 ensemble performances featuring traditional bayin instruments, spanning a historical timeline from the Qin dynasty (221 BCE–206 BCE) to the 20th century. The selection emphasized emotional richness and stylistic diversity.

To ensure consistency and scientific utility, each musical piece was edited into one to four 10 s excerpts (Wu et al., 2024). These excerpts were carefully segmented at phrase boundaries containing core melodies, with particular attention to preserving emotional continuity and minimizing variations in musical elements such as rhythm, timbre, and dynamics. The final dataset was processed uniformly: all excerpts were sampled at 44 kHz, encoded at a bit-rate of 192 kbps, and standardized in sound intensity. For the current study, all 273 excerpts from the CTIM dataset were used as stimuli, each lasting 10 s.

Given that this research retrospectively used publicly available data, the Research Ethics Committee confirmed that ethical approval was not required. All data collection procedures and analytical methods adhered strictly to relevant ethical guidelines and standards.

Music emotion annotations

Emotion annotations were provided by 168 Chinese participants (Wu et al., 2024), with each excerpt being rated by 56 individuals. Wu et al. (2024) employed an adjective-based rating system on a 7-point Likert scale to evaluate emotions. Participants were instructed to assess the intensity of each discrete emotion (anger, gentleness, happiness, peacefulness, sadness, solemnness, and transcendence; Shi, 2015), from 1 (“nonexistent”) to 7 (“extremely intense”). For the dimensional model, valence ranged from 1 (“extremely negative”) to 7 (“extremely positive”), and arousal from 1 (“not at all aroused”) to 7 (“extremely aroused”). Further details on the annotation process are available in Wu et al. (2024).

Timbre feature extraction

Timbre features were computed using the Timbre Toolbox (Kazazis et al., 2021), developed from the work of Peeters et al. (2011). In line with Korsmit et al. (2024), timbre features were derived from the short-term fast-Fourier transform in the spectral domain (including spectral centroid, spectral spread, spectral skewness, spectral kurtosis, spectral flatness, spectral crest, spectral slope, spectral decrease, spectral roll off, spectral variation, and spectral flux), the harmonic partials (including fundamental frequency, harmonic spectral deviation, Tristimulus 1, Tristimulus 2, Tristimulus 3, harmonic odd to even ratio, inharmonicity, harmonic energy, noise energy, noisiness, harmonic to noise energy, and partials to noise energy), and the temporal energy envelope (including attack time, log attack time, attack slope, decrease slope, temporal centroid, effective duration, frequency of energy modulation, and amplitude of energy modulation). Time-varying descriptors were summarized using the median and interquartile range (IQR) across each 10 s excerpt. A total of 54 descriptors, as outlined by Korsmit et al. (2024), were used for predicting emotional perception. The specific details of these features are provided in the Supplementary Table S1.
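The median/IQR summary of a time-varying descriptor can be illustrated with a minimal Python sketch. It does not reproduce the Timbre Toolbox itself: the frame-wise spectral centroid function, the frame length, the hop size, and the synthetic test tone are all illustrative assumptions, chosen only to show how one frame-level descriptor collapses into the two summary values used as predictors.

```python
import numpy as np

def frame_spectral_centroid(signal, sr, frame_len=2048, hop=512):
    """Frame-wise spectral centroid (Hz) from a Hann-windowed magnitude spectrum.

    Illustrative stand-in for one time-varying descriptor; frame_len and hop
    are assumptions, not the Timbre Toolbox defaults.
    """
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroids = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        total = mag.sum()
        if total > 0:  # skip silent frames to avoid division by zero
            centroids.append((freqs * mag).sum() / total)
    return np.array(centroids)

def summarize(series):
    """Collapse a time-varying descriptor into (median, interquartile range)."""
    q1, med, q3 = np.percentile(series, [25, 50, 75])
    return med, q3 - q1

# Synthetic 1 s, 440 Hz tone used in place of a real 10 s excerpt (assumption).
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

centroid_med, centroid_iqr = summarize(frame_spectral_centroid(tone, sr))
```

For a steady tone the median sits near the tone frequency and the IQR is close to zero; an excerpt whose brightness changes over time would yield a larger IQR.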

Analytical approach

To predict musical emotions, both linear and nonlinear regression techniques were applied (e.g., Korsmit et al., 2024; Wen et al., 2022), following previous findings that suggest nonlinear relationships might better capture the interaction between timbre and emotion (McAdams & Goodchild, 2017; Xu, Wen et al., 2021). Following the method of Korsmit et al. (2024), Lasso regression was first used for variable selection, after which standard linear regression was applied to predict different emotion ratings.
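The two-step linear procedure (Lasso for selection, then an ordinary least-squares refit) can be sketched as follows. This is a hedged illustration using scikit-learn on synthetic data, not the original analysis code: the random data, the choice of a cross-validated penalty (`LassoCV`), and the standard adjusted R² formula are assumptions introduced here.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 273 excerpts x 54 descriptors. The shapes follow
# the text; the values are random and are NOT the CTIM data.
n_excerpts, n_features = 273, 54
X = rng.normal(size=(n_excerpts, n_features))
# Hypothetical ground truth: only descriptors 0 and 5 drive the rating.
y = 2.0 * X[:, 0] - 1.5 * X[:, 5] + rng.normal(scale=0.5, size=n_excerpts)

# Standardize so the Lasso penalty treats all descriptors on the same scale.
X_std = StandardScaler().fit_transform(X)

# Step 1: Lasso with a cross-validated penalty (an assumption); descriptors
# with nonzero coefficients are retained.
lasso = LassoCV(cv=10, random_state=0).fit(X_std, y)
selected = np.flatnonzero(lasso.coef_)

# Step 2: ordinary least-squares refit on the retained descriptors only.
ols = LinearRegression().fit(X_std[:, selected], y)
r2 = ols.score(X_std[:, selected], y)

# Adjusted R^2 penalizes the number of retained predictors p.
p = len(selected)
adj_r2 = 1.0 - (1.0 - r2) * (n_excerpts - 1) / (n_excerpts - p - 1)
```

Refitting without the penalty removes the shrinkage that Lasso applies to the retained coefficients, which is why the selection and estimation steps are kept separate.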

For the nonlinear approach, random forest regression (RFR) was used to assess the contribution of timbre descriptors. Each timbre feature served as input, while emotion ratings were treated as the output (ground truth) for building separate RFR models for each emotion (Xu et al., 2024). A grid search was performed to fine-tune the model’s parameters, and tenfold cross-validation was implemented to validate the model’s generalizability. Model performance was assessed using the R² statistic, while Gini importance (Strobl, Malley, & Tutz, 2009) was utilized to rank the importance of variables in predicting emotions. Gini importance efficiently identifies key features by measuring their contribution to impurity reduction at each decision tree split (Archer & Kimes, 2008), making it a suitable choice for exploratory analysis. However, it is not without limitations, such as potential biases toward features with higher variability or more categories, and it provides limited interpretive value regarding feature relationships. To address this, correlation analysis was incorporated to complement the rankings and enhance the interpretability of the model results.
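A minimal sketch of this nonlinear pipeline, assuming scikit-learn and synthetic data, is given below. The parameter grid, the random seed, and the simulated target are illustrative assumptions; the source does not report the actual grid. In scikit-learn, `feature_importances_` is the impurity-based importance (variance reduction for regression trees), which its documentation also calls Gini importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for 273 excerpts x 54 descriptors (NOT the CTIM data).
X = rng.normal(size=(273, 54))
# Hypothetical nonlinear target driven by descriptors 0 and 1 only.
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=273)

# Small illustrative grid (an assumption, not the grid used in the study).
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

# Tenfold cross-validation scored with R^2.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=cv,
    scoring="r2",
)
search.fit(X, y)

# Impurity-based importances of the refit best model; they sum to 1, so each
# value can be read as a share of the total importance.
best_rf = search.best_estimator_
importances = best_rf.feature_importances_
top5 = np.argsort(importances)[::-1][:5]
```

Because the importances are normalized, statements such as "a feature contributes x% to the model" correspond to `importances[i] * 100` in this sketch.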

In addition, to visualize the similarity among timbral correlates of different emotion dimensions, we applied Principal Component Analysis (PCA). Specifically, we first computed the correlation coefficients between each of the 54 timbral features and each emotion dimension, yielding a 54-dimensional timbral vector for every emotion. These vectors were then submitted to PCA, and the first two principal components were plotted to provide a two-dimensional visualization of how timbral profiles of different emotions cluster or diverge. This analysis was used solely for visualization and interpretation purposes, without being part of the main inferential analyses.
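The profile-based PCA described above can be sketched with NumPy alone. The data are synthetic, and the use of Pearson correlation and of an SVD-based PCA on column-centered profiles are assumptions, since the text does not specify the correlation type or the PCA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_excerpts, n_features = 273, 54
emotions = ["valence", "arousal", "anger", "gentleness", "happiness",
            "peacefulness", "sadness", "solemnness", "transcendence"]

# Synthetic descriptors and mean ratings (NOT the CTIM data).
X = rng.normal(size=(n_excerpts, n_features))
ratings = rng.normal(size=(n_excerpts, len(emotions)))

# Pearson correlation of every descriptor with every emotion: z-score both
# matrices (population standard deviation) and average the products.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
Rz = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)
profiles = (Rz.T @ Xz) / n_excerpts  # one 54-dimensional vector per emotion

# PCA via SVD of the column-centered profile matrix.
centered = profiles - profiles.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = U[:, :2] * S[:2]            # 2-D coordinates, one row per emotion
explained = S ** 2 / np.sum(S ** 2)  # share of variance per component
```

Each row of `coords` can then be plotted and labeled with the matching entry of `emotions`; emotions with similar timbral correlation profiles land close together.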

Results

Linear regressions

To explore the linear relationships between timbral features and various emotions, we first applied Lasso regression for feature selection, followed by standard linear regression. The full results of the feature selection and standard linear regression for each emotion are presented in Supplementary Tables S3–S11. Table 1 summarizes the predictive performance of the linear regression models for each emotion category, along with the top five features with the highest absolute standardized regression coefficients.

Table 1 Results of linear regressions.

We observed that for the dimensional model (valence and arousal), the selected timbral features predicted arousal more effectively, with an adjusted R² of 0.751. Partials Noise Energy and the IQR of Inharmonicity were significant negative predictors of perceived arousal, whereas Fundamental Frequency and Spectral Variation positively predicted arousal. In contrast, the model for valence had a lower predictive performance (adjusted R² = 0.492). Interestingly, both the median and IQR of Inharmonicity played key roles in predicting valence and arousal, indirectly supporting previous research that found a strong positive correlation between these two dimensions (Chen et al., 2015).

For the discrete emotion models, transcendence had the highest predictive accuracy (adjusted R² = 0.642), followed by happiness (adjusted R² = 0.530), peacefulness (adjusted R² = 0.505), anger (adjusted R² = 0.406), sadness (adjusted R² = 0.404), gentleness (adjusted R² = 0.307), and solemnness (adjusted R² = 0.302). We also observed that certain timbral features were consistently important predictors across different emotions. For example, the median of Spectral Spread positively predicted gentleness and happiness, while negatively predicting anger, solemnness, and transcendence. Additionally, the median of Tristimulus 3 positively predicted anger and solemnness but negatively predicted gentleness. Notably, noisiness was strongly negatively correlated with peacefulness (β = -0.975) and positively correlated with anger (β = -0.561). These findings provide valuable insights into the linear relationships between various timbral features and perceived emotions in Chinese traditional instrumental music.

Machine learning analysis

We then used RFR to explore the nonlinear relationship between timbre features and various emotions. Figures 1a–i present the 12 most important timbre features for each emotion recognition model, with the complete feature importance results available in Supplementary Table S12. The RFR model successfully captured the nonlinear associations between timbre characteristics and perceived emotions. For example, in the RFR model for valence, the median of Noisiness emerged as the most crucial predictor of valence. This feature, however, was excluded in the linear models using Lasso regression. In the case of arousal, the median of Partials Noise Energy was identified as the most significant feature, contributing 29.11% to the model’s total predictive power as measured by Gini importance, a finding consistent with linear regression results.

Fig. 1: Feature importance of different RFR models.

Figures 1a-i illustrate the distribution of feature importance across the predictive models for valence, arousal, anger, gentleness, happiness, peacefulness, sadness, solemnness, and transcendence. The boxplots in each figure are arranged in order of their mean values, with only the top 12 features displayed for clarity. The trends for the remaining features are approximately similar. The “×” symbol represents the mean value. The complete results of feature importance are provided in Supplemental Materials Table S12.

For discrete emotion models, the median of Partials Noise Energy was also the most important feature for recognizing anger, contributing 9.49% to the model’s total predictive power. This was followed by the median of Noisiness (7.94%) and the IQR of Noisiness (7.81%), indicating that noise-related timbre features play a key role in predicting anger. Similarly, the median of Noisiness was the most important predictor for sadness (explaining 7.12%) and happiness (12.81%). The negative relationship between Noisiness and sadness reveals that Chinese traditional music tends to use less noise when conveying sadness. Conversely, the positive relationship between Noisiness and happiness suggests that more noise elements are incorporated into music to express happiness.

For gentleness, the most significant predictor in the RFR model was Effective Duration, contributing 6.74% to the model’s total predictive power. The positive association between Effective Duration and gentleness reflects a tendency in Chinese traditional music to use longer perceived sounds when expressing gentle emotions. Regarding peacefulness, both the median and IQR of Spectral Variation played a crucial role, contributing 16.63 and 13.03% to the model’s total predictive power, respectively. This aligns with findings from Western music, where peaceful compositions often exhibit less spectral variation. A similar pattern was observed in the RFR model for transcendence, where the IQR of Spectral Variation contributed 32.80% to the model’s total predictive power, suggesting that transcendence is often associated with reduced spectral variation.

Finally, for solemnness, the IQR of Spectral Variation and Noisiness contributed 8.54 and 6.82% to the model’s total predictive power, respectively. These results indicate that lower variability in both spectral variation and noisiness is linked to a heightened perception of solemnness in Chinese traditional music. In other words, more consistent spectral properties and reduced noisiness variability may contribute to a more solemn emotional tone in the music.

Comparison of top timbre features in linear and nonlinear models

To better understand the differences in results between the linear regression (LR) and RFR models and clarify the influence of timbre features on different perceived emotions, Table 2 presents a comparison of the most important features identified by the two models. As shown in Table 2, for a few emotions (such as arousal and peacefulness), the key features identified by both models were similar. For instance, in the prediction model for peacefulness, both LR and RFR highlighted Spectral Variation IQR, Partials to Noise Energy MED, and Noisiness MED as important features.

Table 2 Top five timbral features in different music emotion recognition models.

However, for most emotions, RFR captured key timbre features that differed significantly from those identified by LR. For example, in the prediction of solemnness, only one feature—Spectral Variation IQR—was shared among the top five features in both models. Notably, Effective Duration, which was among the top features in RFR, had a standardized regression coefficient of just -0.038 in LR, indicating minimal significance in the linear model. Similarly, in the prediction of sadness, RFR identified features such as Noisiness MED and Noisiness IQR as highly predictive, complementing the results of LR. These findings suggest that combining linear and nonlinear regression models provides a more comprehensive understanding of the complex relationships between timbre features and perceived emotions than relying solely on linear regression. Additional details on the results from both models can be found in Supplementary Tables S3–S12.

Discussion

The primary goal of this study was to investigate how musical timbre influences the perception of emotion in Chinese traditional instrumental music. To achieve this, we employed timbre feature extraction techniques alongside computational modeling to explore the relationships between various timbral features and perceived emotions. Figure 2 highlights key timbral features associated with different emotions (see the Analytical approach section for more details), including valence, arousal, anger, sadness, happiness, peacefulness, transcendence, gentleness, and solemnness. This analysis revealed several patterns similar to those found in Western music but also identified unique forms of emotional expression within the context of Chinese traditional music.

Fig. 2: The associations between timbre features and perceived emotions in Chinese traditional music.

The more similar the emotional timbre features are, the closer the vectors appear in the figure.

One of the most intriguing findings is that Chinese traditional music conveys happiness and positive emotions through increased noise energy, inharmonicity, and spectral variability. One possible interpretation of our observations is that the prominence of percussive instruments (e.g., gongs and drums) in Chinese traditional music may contribute to a lively atmosphere of Re Nao (热闹), a concept emphasizing communal celebration and shared joy. The “roughness” in timbre, associated with inharmonicity and spectral variability, might reflect the energetic and vibrant social dynamics typical of Chinese festivals. Future studies could directly test these cultural interpretations, for instance by asking listeners to rate celebratory feelings beyond general happiness, or by experimentally manipulating timbral features to examine whether they elicit Re Nao-related responses.

By contrast, studies of Western music have often highlighted the role of harmonic consonance and melodic structure in conveying positive emotions (Webster & Weir, 2005). These are not timbral features per se, but there is also evidence that timbre plays a role in Western contexts. For instance, brightness, spectral centroid, and attack time have been associated with joy or positive affect in Western classical and popular music (Eerola, Ferrer, & Alluri, 2012; Eerola, Friberg, & Bresin, 2013). This suggests that timbre contributes to emotional expression across cultures, although the specific features emphasized may differ. In addition, recent cross-cultural studies suggest that Western music may rely more on harmonic consonance and pitch-based cues to convey joy, whereas Chinese music emphasizes timbral cues such as loudness and spectral variability (Wang, Wang, & Xie, 2022). This contrast could stem from differences in instrumentation, performance practices, or aesthetic preferences, such as the prominence of percussive timbres in Chinese ensembles versus the harmonic resources emphasized in Western traditions. At the same time, cultural concepts like Re Nao, which value communal energy, may also provide a useful lens for interpreting these findings, though such connections remain speculative and require further empirical validation.

In contrast to happiness, Chinese traditional music expresses sadness through reduced noise energy and lower inharmonicity, portraying a more subdued and restrained form of sorrow. The restrained expression of sadness in Chinese music might reflect a more introspective and inward-focused emotional style, aligning with cultural values that emphasize emotional balance and social harmony (Reilly, 2017). These musical features—reduced inharmonicity and smoother timbre—may represent a form of acceptance or reflection rather than overt grief, consistent with collectivist values that prioritize emotional restraint (Ip et al., 2021) and maintain harmony within the group (Chiu & Kosinski, 1994). Interestingly, this subdued and introspective mode of expressing sadness is also found in certain Western musical traditions (Juslin & Laukka, 2004; Juslin & Sloboda, 2011), where slower tempos, softer dynamics, and smoother timbres are often employed to convey sadness in a more understated manner.

For solemnness, we found that solemn music exhibited a narrower range of spectral variability, reduced noise energy, and shorter durations, which points to a more focused and controlled sonic texture. These acoustic features likely contribute to a perception of solemnity by limiting excessive variation and complexity, aligning with the expectation of emotional restraint typically associated with solemn contexts. Recognizing that cultural factors significantly influence the perception of complex emotions (Matsumoto & Hwang, 2012) like solemnness, we further examined solemn excerpts from the CTIM database. These selections shared similarities with music used in Chinese Buddhist ceremonies (Zhang et al., 2016), suggesting a potential cultural connection. It is plausible that solemn music in this context draws upon traditional religious soundscapes, where simplicity and clarity in sound are integral. However, this resemblance does not necessarily confirm a direct relationship and should be investigated in the future.

Similar cultural phenomena are also evident in the expression of transcendence. Our findings reveal that the musical expression of transcendence in Chinese traditional music is closely tied to natural sounds, characterized by less spectral variability overall and a narrower range of that variability. The recurring theme of stable timbral patterns associated with transcendence might reflect a philosophical resonance with Daoist ideas of harmony between humans and nature (Lun, 2012; Verellen, 1995). In Daoist thought, transcendence is not an escape from reality but a state of attunement with nature’s rhythms and cycles. The use of stable and consistent spectral properties in transcendent music may symbolize a sense of unity and balance, reflecting the Daoist ideal of “Wu Wei” (non-action) and an effortless existence in accordance with the natural order (Loy, 1985; Slingerland, 2000).

In summary, this study deeply explores the relationship between timbre features and perceived emotions in Chinese traditional music, and discusses these findings from a Chinese cultural perspective. However, the study has several limitations. First, the CTIM dataset (Wu et al., 2024) used in this study is unbalanced, with some emotions (such as happiness) represented by more excerpts than others (such as anger and gentleness). This imbalance may influence the machine learning results and limit their generalizability (Kaur et al., 2019). Furthermore, the relatively small number of music pieces for certain emotions (i.e., gentleness, solemnness, and transcendence) raises questions about the reliability of the findings for those specific categories. Future studies should aim to construct more balanced datasets and incorporate larger sample sizes (Krawczyk, 2016) to ensure robust and reliable modeling of emotional associations in music. Second, as all participants in the CTIM database (Wu et al., 2024) were Chinese, this study cannot comprehensively address cross-cultural differences in the perception of emotions in Chinese traditional music. While our findings suggest that certain timbral features are associated with specific emotions, these associations may be shaped by cultural factors, such as collectivist values (Hu, 2024). For instance, timbral features linked to happiness in this study may not evoke the same emotional responses in listeners from cultures with more individualist values (Wang et al., 2022). Future research should include participants from diverse cultural backgrounds to explore whether such associations are consistent across cultures or culturally specific. Comparative studies examining how listeners from different cultures interpret timbre and emotion in Chinese traditional music would provide stronger evidence to validate or challenge these claims.

Third, the connections drawn between timbral features and Chinese cultural concepts, such as Daoist harmony or Buddhist ideas, are speculative and not empirically tested. These associations were inferred based on theoretical considerations rather than direct evidence from the data or participant feedback. Future research should employ empirical methodologies, such as balanced experimental designs and participant ratings of cultural concepts (Cowen et al., 2020), to rigorously validate these claims. Such approaches would provide a more robust foundation for understanding the interplay between timbre and cultural interpretations in traditional Chinese music. Fourth, although machine learning methods like RFR provide powerful predictive capabilities, their interpretability remains a key limitation (Krishnan, 2020; Murdoch et al., 2019). Feature importance metrics, including Gini importance, can be influenced by feature correlations and may not fully capture causal relationships. This highlights the importance of combining such models with interpretable methods, as done in this study, to provide a more comprehensive understanding of the relationships between timbre features and perceived emotions. Finally, the present study does not fully account for the historical evolution of Chinese traditional instrumental music or the different contexts in which it is performed and experienced. The emotional expression in music can be influenced by its historical background (Xu et al., 2023), regional styles (Argstatter, 2016), and performance settings (Rocke et al., 2022), all of which have evolved over time. A deeper exploration of these contextual factors is needed in future research.

Conclusion

In conclusion, this study provides important insights into how timbral features influence emotional perception in Chinese traditional instrumental music, revealing both shared patterns and unique cultural expressions. The findings show that happiness is conveyed through increased noise energy, inharmonicity, and spectral variability, which may reflect collectivist values of communal celebration and shared joy. Sadness, on the other hand, is expressed with reduced noise energy and smoother timbre, possibly aligning with cultural ideals of emotional restraint and social harmony. Solemnness is characterized by a narrower range of spectral variability, reduced noise energy, and shorter durations, potentially suggesting a controlled and focused sonic texture that resembles features of traditional religious soundscapes. The portrayal of transcendence, through minimal spectral variability, might resonate with Daoist philosophy, emphasizing harmony with nature and balance between humans and the natural world. These findings highlight the potential ways in which Chinese traditional music expresses emotion, shaped by cultural and philosophical influences that warrant further exploration.