Introduction

Vocal fatigue is a common symptom in individuals seeking vocal health treatment. It is also a prevalent complaint in populations without voice disorders and can be an early sign of vocal health risks seen in occupational voice users, particularly teachers. One systematic review summarized vocal fatigue prevalence in teachers to be between 42% and 92%1. This wide range of prevalence has been reported as being caused by inherent differences in measurements of fatigue based on either state or trait fatigue. Trait fatigue has been defined as the “average amount of perceived fatigue over a period of time” and state fatigue as the “change in perception of fatigue during an ongoing activity”2. The Vocal Fatigue Index (VFI) is a validated instrument in measuring trait fatigue3. However, the quantification of state fatigue has been hindered by both the wide range of research protocols, which are not comparable, and an array of metrics that have resulted in both mixed and sometimes contradictory results. These differences may mask underlying individual variations, which could potentially be used to identify subgroups with distinct response patterns or risk profiles.

The research protocols range from in situ voice observations to laboratory-induced vocal fatigue. One approach for quantifying state vocal fatigue has been to monitor occupational voice users within their actual work environment and attempt to detect change over time. This has been attempted in schoolteachers4,5, call-center workers6, singers7,8, and radio broadcasters9. These studies illustrate the variety of devices and techniques used to track vocal use in ecologically valid (although less controlled) environments. On the other end of the spectrum are laboratory environments with prescriptive tasks designed to induce vocal fatigue, i.e., a vocal loading task (VLT). While the specifics of a VLT can vary widely across studies, a VLT typically includes a prolonged speaking task or elevated vocal effort. Fujiki and Sivasankar10 reported that for VLTs the most common duration was two hours, with the shortest and longest durations being 15 min and 3.75 h, respectively. The most common type of loading task was prolonged, loud reading. Previous work has typically used two approaches to elicit elevated loudness: either with background noise11,12 or a loudness target13,14. While a variety of VLT-influencing metrics have been reported (e.g., direct sensation of fatigue or discomfort, auditory perception of vocal effort, voice acoustic parameters), the results have been inconclusive, with the only consistent measure related to assumed state vocal fatigue being perceived vocal effort15. Throughout VLT studies, acoustic measures have been used to track changes in voice production associated with prolonged speaking or other vocal demands. Common measures used to assess vocal fatigue in this manner include fundamental frequency16, speech level6, and cepstral peak prominence17. Unfortunately, these results vary and are inconclusive, illustrated by reports of increases, decreases, or no change associated with vocal fatigue18 and considerable inter- and intra-subject variability5.

These inconsistent results could be related to three problems with the implementation of VLTs. In general, (1) there is not a consistent definition and framework for studying state vocal fatigue, (2) the VLT studies vary widely in design and are not comparable, and finally, (3) the assumption that fatigue is induced by the VLT may not be appropriate due to potential individual differences in the biophysiological response to vocal demands and fatigue.

To address the first concern, a proposed consensus definition and framework for vocal fatigue and its related terms was introduced by Hunter and colleagues2. Here vocal fatigue is defined as “the perceived measurable symptom that influences vocal task performance and is individual specific; it is a multifaceted concept integrating self-perceived vocal symptoms and/or physiologic deficit,” which may be a result of high “vocal demand response,” high “vocal effort,” or “neuromuscular deficit.” This definition supports previous work in the use of measurements of vocal effort and vocal performance. Importantly, it also states that vocal fatigue is “individual specific,” a concept that has not been commonly considered.

In addition to a consistent framework, a consistent protocol for VLTs is needed. Direct comparisons of VLTs across different studies are essential for a comprehensive understanding. Additionally, a VLT designed with scalability would allow for a much larger sample of participants, which is critical for the detection of individualistic vocal demand responses. Previous work by Hunter et al.19 has discussed this need and proposed a VLT protocol. This protocol allows for broad adoption and comparability through a design philosophy of modularity for flexibility and scalability that leverages computation tools for data acquisition, segmentation, and signal processing.

While a proposed framework and protocol exist for measuring state vocal fatigue, addressing the inherent heterogeneous response from participants remains an unresolved challenge. Nanjundeswaran and Shembel20 have proposed a conceptual framework that highlights the need for a better understanding of individual differences in vocal demand responses related to vocal fatigue. Recent studies by Shembel and colleagues provide direct evidence for this heterogeneity in vocal demand responses. Their work examining the effects of vocal loading on various voice parameters demonstrated significant variability in how individuals with and without voice disorders respond to similar vocal demands21,22.

Given this documented variability, a critical next step is to categorize these different response patterns. One potential approach to measuring the heterogeneous response to vocal fatigue is to detect and classify individuals into subtypes of vocal demand responses. Based on previous VLT research and the framework from Hunter et al.2, changes in vocal performance and/or perceived vocal effort may demonstrate vocal demand responses, which implicate vocal fatigue. For the purposes of this study, vocal fatigue is operationalized as measurable changes in either or both of two key dimensions: (1) self-perceived vocal effort as quantified by the Borg CR-100 scale ratings before, during, and after the VLT; and (2) objective changes in vocal performance parameters encompassing the subjective qualities of pitch, loudness, and voice quality, quantified by speaking fundamental frequency (F0), speech level (SL), and smoothed cepstral peak prominence (CPPS), respectively. Since these two types of demand responses (vocal performance and perceived vocal effort) are not necessarily related, each demand response will be independently classified. The combination of these classifications provides a basis for subgroups of individuals with homogeneous responses to the VLT for the study of vocal fatigue.

The purpose of this paper is to use a highly structured VLT protocol to classify individuals who respond with changes in either vocal effort or vocal performance as a result of prolonged loud speaking with background noise. We hypothesize that a combination of measured changes in vocal performance and perceived vocal effort will classify VLT participants with and without vocal demand responses. Additionally, individuals who exhibit vocal demand responses may be implicated in having vocal fatigue. Confirmation of this hypothesis would pave the way for personalized interventions that could improve patient health outcomes.

Methods

Participants

A total of 37 participants qualified and consented to participate. A target sample size of 40 participants was initially devised based on previous VLT studies10, which would accommodate potential four-class subtyping (yielding approximately 10 participants per subgroup) while also providing sufficient statistical power for aggregate analysis. The participant group consisted of 19 participants who identified their gender as women and 18 participants who identified their gender as men. The average age of the participants was 20.1 years (SD: 1.4), with 29 participants identifying as White, 5 as Black or African American, 2 as Hispanic or Latino, and 1 as Asian. The participants were enrolled at Michigan State University and received course credit as compensation for their participation. Michigan State University’s Human Research Protection Programs Human Subject Review Board provided human research participation oversight and approval for all experimental protocols (STUDY00004125, LEGACY16-689). Written, informed consent was obtained from all participants, and all methods followed relevant guidelines and regulations. To be included in the study, participants had to be between the ages of 18 and 49 and be native speakers of American English. Participants were excluded prior to participation if they self-reported current or past speech, voice, or hearing problems, reported currently smoking, or had a significant self-reported vocal handicap as determined by a VHI-1023 score exceeding 20. The participant group had a mean VHI-10 score of 8.4 (range 0–16; SD: 4.4), indicating mild to moderate self-perceived vocal handicap. It should be noted that while self-report measures and questionnaires were used for screening, direct laryngeal visualization was not performed as part of the study protocol. Additionally, the participants were tested for normal hearing through pure-tone stimulation (air conduction) of at least 20 dB HL in both ears at 500 Hz, 1 kHz, 2 kHz, and 4 kHz.

Instrumentation

Participants’ speech was recorded using a head-mounted omnidirectional microphone (B3, Countryman Associates, Menlo Park, CA) placed 5 cm from their mouth. The microphone signal was pre-amplified (HV-3D, Millennia Media, Diamond Springs, CA) and digitized (ADI-8 DS, RME Audio, Haimhausen, Germany) before being recorded using a digital audio workstation (REAPER, Cockos, Rosendale, NY) at a sampling rate of 44.1 kHz with 16-bit resolution. A reference sound level meter (IEC 60651 Type 2) was positioned 50 cm from the speaker’s mouth, and its reference microphone was calibrated to 94 dB SPL (relative to 20 µPa) using the two-step calibration procedure for head-mounted microphones found in Švec et al.24. PsychoPy (v3.0.225) was used to present the stimuli and collect the user’s vocal effort ratings. The schematic for instrumentation is shown in Fig. 1.

Fig. 1 Schematic for experimental procedure.

Procedure

After informed consent and the hearing screening, participants completed a series of tutorials to be introduced to the rating scales and speech stimuli used during the experiment. The tutorials provided instruction on how to use the computer interface and gave examples and practice of the map description task (see below). These tutorials were pre-scripted with simultaneous text and audio instructions to ensure that all participants received identical instructions and to optimize participant economy2,15. Then the participants completed a vocal loading task (VLT), which consisted of describing routes on maps in background noise for up to 30 min. The participants were asked to describe the routes accurately and in a manner that their instructions would be understood by someone needing to create the route. This context was included to provide communicative intent in the task. The background noise was multi-talker speech babble (six female and six male North American speakers26) that gradually increased in intensity from 45 dBA to 75 dBA over 30 s at a rate of 10 dB every 10 s. The maximal level of noise persisted throughout the task until voluntary termination or completion of six 5-min intervals (30 min). Before and after the VLT, participants read aloud the first paragraph of the Rainbow Passage27 and were instructed to read with comfortable pitch and loudness. During the VLT, the participants rated their perceived vocal effort using the Borg CR-100 (see Fig. 2)28,29. These ratings were measured before, after, and every five minutes of the VLT for a total of eight measurements throughout the task. The Borg CR-100 scale was both anecdotally and experientially anchored following the procedure in Hunter et al.29.
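The noise ramp and rating schedule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ramp is treated as linear in dB (the text says only that the level rose gradually at 10 dB every 10 s), and the function name `babble_level_dba` and the label format are hypothetical, chosen to match the PRE, VL05 ... VL30, POST labels used in the Results.

```python
def babble_level_dba(t_seconds, start_dba=45.0, max_dba=75.0, rate_db_per_s=1.0):
    """Babble level t seconds after noise onset.

    Assumes a linear ramp of 10 dB per 10 s (1 dB/s) from 45 dBA,
    held at 75 dBA once the maximum is reached.
    """
    if t_seconds < 0:
        raise ValueError("time must be non-negative")
    return min(max_dba, start_dba + rate_db_per_s * t_seconds)


# Vocal effort rating schedule: once before the task, after each
# 5-min interval of the 30-min task, and once after the task.
rating_labels = ["PRE"] + [f"VL{m:02d}" for m in range(5, 35, 5)] + ["POST"]

if __name__ == "__main__":
    for t in (0, 10, 20, 30, 60):
        print(t, babble_level_dba(t))
    print(rating_labels)
```

Under these assumptions the ramp reaches its ceiling at 30 s, and the schedule contains the eight rating points described in the procedure.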

Fig. 2 Borg CR-100 scale adapted for vocal effort rating.

Acoustic measurement

Speech samples were processed to remove non-voicing segments30,31. From each voice-only-concatenated speech segment, five acoustic parameters were computed: mean speaking fundamental frequency (F0), standard deviation of speaking fundamental frequency (F0sd), speech level (SL), standard deviation of speech level (SLsd), and smoothed cepstral peak prominence (CPPS). These parameters were selected to reflect basic vocal performance parameters including pitch, pitch variability, loudness, loudness variability, and voice quality.

F0, F0sd, and CPPS were computed using Praat (v.6.1.0932). Settings for F0 computation in Praat were: F0 range for male-pitched voices, 65 Hz to 350 Hz; F0 range for female-pitched voices, 150 to 800 Hz. Additionally, F0 and F0sd were converted from Hertz to semitones (ST) with the average F0 of the pre-VLT Rainbow Passage as the reference. The mean and standard deviations of speech level were computed from a distribution of speech level measurements from a moving window of 20 ms with 50% overlap.
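As a rough illustration of the two computations described above, the sketch below converts F0 from Hertz to semitones relative to a reference and derives the mean and standard deviation of speech level from 20 ms frames with 50% overlap. It is not the study's processing code. The function names are hypothetical, the samples are assumed to be calibrated in pascals (so that levels are in dB re 20 µPa), and summarizing the frame levels in the dB domain is an assumption, since the text does not state the averaging domain.

```python
import math
import statistics


def hz_to_semitones(f0_hz, reference_hz):
    """Express f0_hz in semitones relative to reference_hz (12 ST per octave)."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def frame_levels_db(samples, sample_rate, window_s=0.020, overlap=0.5, ref_pa=20e-6):
    """RMS level (dB re ref_pa) of each analysis frame.

    Frames are window_s long and advance by window_s * (1 - overlap).
    Frames with zero energy are skipped to avoid log(0).
    """
    win = int(round(window_s * sample_rate))
    hop = max(1, int(round(win * (1.0 - overlap))))
    levels = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        mean_square = sum(x * x for x in frame) / win
        if mean_square > 0.0:
            levels.append(10.0 * math.log10(mean_square / (ref_pa * ref_pa)))
    return levels


# Hypothetical illustration values, not study data.
reference_hz = 200.0  # stands in for the mean F0 of the pre-VLT Rainbow Passage
f0_st = hz_to_semitones(210.0, reference_hz)

# One second of a 100 Hz tone with 1 Pa peak amplitude, sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
levels = frame_levels_db(tone, 8000)
sl_mean = statistics.mean(levels)   # plays the role of SL
sl_sd = statistics.stdev(levels)    # plays the role of SLsd
```

Using the participant's own pre-task mean F0 as the reference means that 0 ST always corresponds to that participant's baseline, which makes shifts comparable across lower- and higher-pitched voices.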

Statistical analysis

SPSS (v. 26.0, IBM, Armonk, NY) was used for statistical analysis. Normality, independence, and equal variance assumptions were checked. If these assumptions were met, one-way analysis of variance (ANOVA) tests with an alpha level of 0.05 with Bonferroni multiple comparison adjustments were used to compare the sample means for self-reported vocal effort level ratings (VER) and the five acoustic parameters (F0, F0sd, SL, SLsd, CPPS) across each time point of the VLT. Pair-wise comparisons of each time point (pre, post, and the six 5-min increments during the loading task) were done using post hoc Tukey HSD tests. Welch’s ANOVA and Tamhane’s T2 post hoc tests were used if equal variance could not be assumed.

Two dimensions of participant grouping related to vocal demand responses were used, one for vocal effort response and the other for vocal performance response. To classify participants based on vocal effort response, ten proposed features were derived from the vocal effort ratings (see Table 1). Participants were clustered into two groups using iterative k-means. The feature set was reduced to a minimal set of significant features based on feature importance and statistical significance within the models. The two resulting groups were labeled as “high vocal demand response” and “low vocal demand response” based on assumptions about the relationship between changes in vocal effort and vocal fatigue during vocal loading.

Table 1 Vocal effort response features used for clustering.

Participants’ vocal performance responses were categorized into groups using a general linear model (GLM) with an alpha level of 0.05 fitted for each participant, with time (pre- and post-vocal loading task) as the dependent variable and the five acoustic parameters as covariates. Participants were then grouped based on whether they exhibited a significant model and at least one significant change in acoustic feature within the model, with the “voice change” group indicating significant change(s) in vocal performance and the “no voice change” group indicating no significant changes.

After clustering based on both the vocal effort response and vocal performance response, four groups were created by intersecting these groups. The vocal performance of the groups was compared before and after vocal loading using the same statistical procedure as for the aggregate group.

Results

The extracted parameters VER, F0, SLsd, and CPPS met the assumptions for normality and independence; however, equal variance could not be assumed. F0sd and SL met all three assumptions. Table 2 summarizes the mean and standard deviation estimates for each measure across the time points (pre-VLT, 5-min increments during the VLT, and post-VLT) for all participants.

Between PRE and POST, only VER and F0 differed significantly. There was a significant increase in VER of 22.7 from pre-VLT to post-VLT (p = 0.001). There was a significant but small increase in F0 of 0.83 ST from pre-VLT to post-VLT (p < 0.0001). For VER, F0, F0sd, SL, and SLsd, there were significant increases between PRE and the measurements during the VLT (p < 0.05 for each test; see Table 2 for magnitude of change). For F0, F0sd, SL, and SLsd, there were significant decreases between 30 min of VLT and POST (p < 0.05 for each test; see Table 2 for magnitude of change). VER did not decrease to pre-VLT values. There were no significant changes in CPPS. Additionally, there were no significant differences for any measure between the time increments of the VLT.

Table 2 Summary of mean (standard deviation) for aggregate vocal effort ratings (VER) and vocal performance measures, including mean speaking fundamental frequency (F0), standard deviation of the speaking fundamental frequency (F0sd), speech level (SL), standard deviation of speech level (SLsd), and smoothed cepstral peak prominence (CPPS).

Vocal demand response clustering

The data were clustered based on two significant features, the noise demand response (NDR) and the temporal demand response (TDR). NDR is the difference between the vocal effort rating after five minutes of vocal loading \((VER_5)\) and the vocal effort rating prior to the loading task \((VER_0)\), while TDR is the difference between the vocal effort rating after thirty minutes of vocal loading \((VER_{30})\) and the vocal effort rating after five minutes of vocal loading \((VER_5)\). NDR quantifies the effect of noise on vocal effort as expected by the Lombard effect14, while TDR quantifies the effect of time speaking within noise across the VLT. Both features were found to be statistically significant (NDR: p = 0.003; TDR: p < 0.001) in the k-means clustering analysis. It is important to note that while statistical significance was achieved for these features, they were chosen to maximize the differences across the cases in the clusters and are used only for descriptive purposes. Independent samples t-tests showed that the means of the two clusters for NDR (p < 0.001) and TDR (p = 0.001) were significantly different. Additionally, these two features were also found to be uncorrelated (r = 0.08). Cluster 1 consisted of 14 participants and had a center of NDR of 26.4 and TDR of 32.9, while Cluster 2 consisted of 23 participants and had a center of NDR of 12.2 and TDR of 0.8. Cluster 1 was relabeled as high vocal effort response (HVER) and Cluster 2 was relabeled as low vocal effort response (LVER) based on their respective features. Table 3 summarizes the count and centers of the two clusters, while Fig. 3 shows the data separated by cluster, including the cluster centers.
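The two features and the two-cluster grouping can be illustrated with a small sketch. The feature definitions follow the text (NDR = VER_5 − VER_0, TDR = VER_30 − VER_5). The clustering shown here is a plain Lloyd-style k-means written for illustration. It is an assumption about what "iterative k-means" involves and does not reproduce the SPSS procedure or its initialization. All ratings below are made up and are not study data, and the function names are hypothetical.

```python
import math


def demand_features(ver_0, ver_5, ver_30):
    """Return (NDR, TDR) from ratings before, at 5 min, and at 30 min."""
    ndr = ver_5 - ver_0    # noise demand response
    tdr = ver_30 - ver_5   # temporal demand response
    return (ndr, tdr)


def kmeans(points, init_centers, max_iter=100):
    """Plain k-means: assign to nearest center, recompute centers, repeat."""
    centers = [tuple(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, new_labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_labels == labels and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers


# Made-up (VER_0, VER_5, VER_30) ratings for four hypothetical participants.
ratings = [(10, 35, 70), (20, 45, 75), (15, 25, 27), (18, 32, 30)]
points = [demand_features(*r) for r in ratings]

# Seed the two centers with two of the observed points (an arbitrary choice).
labels, centers = kmeans(points, init_centers=[points[0], points[2]])
```

In this toy example the first cluster collects participants whose effort keeps rising with time in noise (large TDR), which mirrors how the high and low vocal effort response clusters are separated in the text.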

Table 3 Count and centers of the two vocal effort response clusters: high vocal effort response (HVER) and low vocal effort response (LVER) for the features noise demand response (NDR) and temporal demand response (TDR).

Fig. 3 Scatter plot of clustered data with cluster centers. Square markers are the low vocal effort response (LVER) subgroup, triangle markers are the high vocal effort response (HVER) subgroup, and the cross markers are the cluster centers.

The analyses for VER over the duration of the VLT were repeated with the two groups (see Fig. 4). HVER showed a significant (p < 0.001) main effect of VER across the VLT, whereas LVER did not exhibit any significant effect of VER. For HVER, there was a considerable increase in VER from PRE to VL05 (26.4, p < 0.001), from VL05 to VL30 (32.9, p < 0.001), and between PRE and POST (46.4, p = 0.001). There were no significant differences in VER between HVER and LVER at PRE, VL05, or VL10. However, for the other time points, HVER had significantly higher VER than LVER (see Table 4).

Fig. 4 Line graph of vocal effort ratings (VER) over time, separated by cluster: high vocal effort response (HVER) and low vocal effort response (LVER), with error bars (standard deviation).

Table 4 Summary of mean (standard deviation) of vocal effort ratings for the high vocal effort response (HVER) and low vocal effort response (LVER) clusters.

Acoustic voice change clustering

Out of the total number of participants, 16 were found to have significant general linear models (GLM) with a p-value less than 0.05 for the model and at least one acoustic covariate when comparing the PRE-POST differences of the five vocal performance measures. The remaining 21 participants who did not have significant models were grouped as the no voice change group (NC), while the 16 participants with significant models were placed in the voice change group (VC). The VC group had an average goodness-of-fit coefficient of 0.93 (SD = 0.27). As the GLMs for the NC group were not significant, no goodness-of-fit coefficients are reported.

Vocal demand response and acoustic voice change subtyping

Four subgroups were formed by intersecting the clusters based on vocal effort response and acoustic voice change: low vocal effort response and no voice change (LVER-NC), low vocal effort response and voice changes (LVER-VC), high vocal effort response and no voice change (HVER-NC), and high vocal effort response and voice changes (HVER-VC). LVER-NC had 15 participants (10 males and 5 females), LVER-VC had 8 participants (2 males and 6 females), HVER-NC had 6 participants (3 males and 3 females), and HVER-VC had 8 participants (3 males and 5 females). A summary of the groups is presented in Table 5.
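The cross-classification can be checked with a short sketch that combines the two labels into a subgroup name and recovers the marginal totals from the subgroup counts reported above. The helper names are hypothetical. The counts are the ones given in the text.

```python
def subtype(effort_label, voice_label):
    """Combine an effort cluster label and a voice change label."""
    return f"{effort_label}-{voice_label}"


# Subgroup counts as reported in the text, keyed by (effort, voice change).
reported = {
    ("LVER", "NC"): 15,
    ("LVER", "VC"): 8,
    ("HVER", "NC"): 6,
    ("HVER", "VC"): 8,
}


def marginal(counts, axis, label):
    """Sum counts over one dimension (axis 0 = effort, axis 1 = voice change)."""
    return sum(n for key, n in counts.items() if key[axis] == label)


if __name__ == "__main__":
    for (effort, voice), n in reported.items():
        print(subtype(effort, voice), n)
    print("LVER", marginal(reported, 0, "LVER"), "HVER", marginal(reported, 0, "HVER"))
    print("NC", marginal(reported, 1, "NC"), "VC", marginal(reported, 1, "VC"))
    print("total", sum(reported.values()))
```

The marginals agree with the earlier sections: 23 LVER and 14 HVER participants from the effort clustering, 21 NC and 16 VC participants from the acoustic grouping, and 37 participants in total.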

Table 5 Summary of number of participants in the four cross sectional groups following vocal effort response and acoustic voice response clustering.

Following the combined clustering, the analysis of pre-post differences in vocal performance measures was repeated for each subgroup. These differences with additional statistical details are summarized in Table 6. Mean fundamental frequency (F0) significantly increased for all groups except LVER-VC. For the HVER-VC group (n = 8), all five acoustic parameters of vocal performance were significantly changed from pre to post VLT. Specifically, F0 increased by 0.78 ST (p < 0.001), F0sd increased by 0.42 ST (p = 0.022), SL increased by 1.48 dB (p = 0.022), SLsd increased by 0.28 dB (p = 0.004), and CPPS decreased by 0.54 dB (p = 0.045). No other statistically significant relationships were observed.

Table 6 Summary of mean pre-post differences in vocal effort rating (VER) and in vocal performance measures: mean speaking fundamental frequency (F0), standard deviation of speaking fundamental frequency (F0sd), speech level (SL), standard deviation of speech level (SLsd), and smoothed cepstral peak prominence (CPPS).

Discussion

This study aimed to identify vocal fatigue through the classification of individuals based on their response to the vocal demands of prolonged speaking with elevated background noise. This step is important in better understanding individual differences and group classification, which can lead to better interventions for vocal fatigue. For this research, vocal fatigue was operationally defined through two dimensions: (1) self-perceived vocal effort (Borg CR-100 scale) and (2) objective changes in vocal performance measured through speaking fundamental frequency (F0), speech level (SL), and smoothed cepstral peak prominence (CPPS). A combination of unsupervised machine learning and null-hypothesis testing was used to subtype participants. The main hypothesis was that this approach would help identify participants who exhibit changes in vocal effort or vocal performance, as well as those who experience state vocal fatigue related to the tested communication demands (prolonged speaking and elevated loudness). This hypothesis is supported by the distinct differences in vocal demand responses across the four classified responder subgroups.

Prior to classification, there were few detectable changes because of the VLT. The most notable changes in vocal effort and performance occur between the time before the vocal loading task (PRE) and after 5 min of the task (VL05), with significant increases in VER, F0, F0sd, SL, and SLsd. These changes are consistent with the Lombard effect where an increase in background noise results in a change in voicing to accommodate the noise33,34. Interestingly, no significant changes were observed throughout the duration of the VLT, suggesting that the voicing pattern remained constant until the background noise was removed. While VER trended upward throughout the VLT, it was not significant until clustering was performed. The PRE-VLT VER levels (17.1; between slight and moderate vocal effort) were higher than previously reported vocal effort ratings using the Borg CR-10 scale in conversational speech (1.4; between very slight and slight vocal effort35), but still fell within the same range as baseline vocal effort ratings measured with the Borg CR-100 scale in a laboratory setting (24; between slight and moderate vocal effort14). The PRE-POST increase in VER was expected and consistent with previous VLT studies, but the increase in F0 was not consistently observed in previous studies and may be due to a vocal warm-up effect. Studies have demonstrated a similar warm-up effect in college students, where there was a change in voice quality throughout the day6, and in schoolteachers throughout their workday36. More changes in vocal production were expected between PRE and POST, but before subtyping, the changes in VER and F0 did not implicate vocal fatigue.

Vocal demand response subgroups

The clustering analysis of VER revealed two distinct groups with significantly different responses to the vocal demand. These groups were characterized by their noise demand response and temporal demand response, which relate to individual responses to the background noise demand and prolonged speaking demand presented during the study.

While the VER clustering provided valuable information, the second stage of acoustic clustering offered additional insight. Three of the four groups exhibited changes in F0 similar to those of the aggregate subject pool. Notably, the LVER-VC group, which had low vocal demand responses but significant voice changes, did not show any significant acoustic voice changes as a group. Upon closer inspection, it was discovered that individual variation was high in this group and the direction of voice change was inconsistent, resulting in an aggregate of no change. Conversely, the HVER-VC group, which had high vocal demand responses and significant voice changes, exhibited similar changes in all acoustic measures between PRE and POST, resulting in statistically significant results. These findings indicate a measurable individual component in voice change and vocal fatigue resulting from vocal demand26. These results also shed light on the conflicting findings from previous attempts to measure vocal fatigue associated with vocal loading. Moreover, they offer additional empirical support for the conceptual framework proposed by Nanjundeswaran and Shembel20, which highlights the heterogeneous nature of vocal fatigue.

This heterogeneity is empirically supported by findings from related vocal loading task studies, which demonstrated variable responses across different clinical populations and voice parameters21,37. These investigations observed that self-perceptual measures of vocal effort and discomfort consistently showed significant changes after vocal loading in both typical voice users and those with primary muscle tension dysphonia (pMTD), while objective measures such as supraglottic compression, acoustic parameters, and extrinsic laryngeal muscle tension varied considerably between and within groups. Specifically, quantitative measures of laryngeal configuration and acoustic measures like cepstral peak prominence (CPP) showed complex relationships with perceived effort rather than consistent group-level changes after vocal loading38. These findings parallel the current study’s observation that only when individuals are classified by their specific responses to vocal demands do meaningful patterns of vocal fatigue emerge, supporting the need for subtyping approaches rather than relying solely on group averages39. This has implications for tailoring vocal health interventions to individual profiles rather than general trends.

Building on these insights, although the present study did not specifically investigate explanatory factors for individual classification, the framework suggests that differences in individuals’ baseline vocal fitness and their perception of vocal demands could explain the formation of vocal demand response groups20,39. Future studies should directly assess baseline vocal fitness potentially through physiological, aerodynamic, and self-assessment measures to determine if pre-existing vocal capabilities predict an individual’s vocal demand response pattern and susceptibility to vocal fatigue.

Vocal fatigue symptoms

Secondary analyses were performed to investigate potential associations between vocal fatigue symptoms and the vocal demand response subtypes. Prior to the experiment, participants completed the Vocal Fatigue Index (VFI)3, allowing for comparison between self-reported vocal fatigue symptoms and the identified response subtypes. The VFI subscales (tiredness of voice, physical discomfort, improvement with rest) were compared across the clusters and response dimensions using one-way ANOVA tests (alpha = 0.05) with Bonferroni multiple comparison corrections. Notably, participants in the voice change group (LVER-VC and HVER-VC combined) demonstrated significantly higher scores (p = 0.01) on the second VFI component (physical discomfort; VFI-2) compared to those in the no voice change group (LVER-NC and HVER-NC combined). The voice change group had a mean VFI-2 score of 3.64 (SD = 2.56), while the no voice change group had a mean score of 1.52 (SD = 2.04). No other relationships were found to be statistically significant. These findings suggest the potential utility of the VFI-2 subscale as a screening tool for identifying individuals at risk for vocal misuse under vocal demands such as background noise or prolonged speaking. Prior research supports this application, showing meaningful correlations between the physical discomfort subscale and both physiological measures (pulmonary function)40 and environmental factors (classroom size)41 in occupational voice users. The relationship between self-reported physical discomfort and objective vocal changes may offer clinicians an efficient means to identify patients who would benefit most from targeted intervention strategies.

Potential clinical implications

The LVER-VC group holds promise for clinical interest because these individuals may experience vocal fatigue without perceiving changes in vocal effort. Consequently, they might not recognize the need for proper vocal rest during periods of fatigue, unlike the HVER-VC group. This aligns with the theory proposed by Whitling et al.26, who observed a subset of participants exhibiting remarkable endurance in VLTs. The authors suggested that this group of individuals with heightened endurance may share characteristics with patients seen in voice clinics, implying that repetitive overuse of the voice without adequate regulation could pose a risk factor for voice disorders. Moreover, this group is composed primarily of female participants (female-to-male ratio of 6:2), a demographic with a higher risk of voice problems42,43.

Building upon these observations about the LVER-VC group, vocal loading research on patients with primary muscle tension dysphonia (pMTD) contributes to a framework for understanding the disconnect between objective vocal changes and subjective perception21,22,37. The poor correlations between physical measures (extrinsic laryngeal muscle tension, supraglottic compression) and perceived vocal effort suggest that afferent (sensory) mechanisms may be more critical in symptom manifestation than motor function. For the LVER-VC group, this sensory processing difference may delay appropriate compensatory behaviors. This extends beyond simple endurance to suggest that sensory awareness training could be a valuable therapeutic approach, potentially preventing progression from subclinical voice changes to voice disorders through improved proprioceptive monitoring.

Another potential clinical application is using the VLT classification as an objective marker of how vocal responses change over time. Intervention goals related to reducing excessive vocal effort or adverse vocal demand responses could be evaluated by classifying responses to the VLT. Additionally, because this classification approach focuses on individual performance, it could offer a personalized way to detect the positive impacts of therapeutic intervention.
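As a rough illustration of tracking classification over time, the sketch below compares a participant's subtype label before and after an intervention. It assumes only that each label combines an effort dimension (LVER or HVER) with a voice dimension (VC or NC), as in the group names used here. The `Subtype` class and `describe_change` function are hypothetical names, and the sketch reports which dimension changed without judging whether the change is an improvement.

```python
from dataclasses import dataclass

EFFORT_LABELS = ("LVER", "HVER")
VOICE_LABELS = ("VC", "NC")


@dataclass(frozen=True)
class Subtype:
    """One responder subtype, split into its two dimensions."""
    effort: str
    voice: str

    @classmethod
    def parse(cls, label):
        """Parse a label such as 'LVER-VC'; raise ValueError if malformed."""
        effort, voice = label.split("-")
        if effort not in EFFORT_LABELS or voice not in VOICE_LABELS:
            raise ValueError(f"unknown subtype label: {label!r}")
        return cls(effort, voice)


def describe_change(pre_label, post_label):
    """Report which dimensions differ between two VLT classifications."""
    pre, post = Subtype.parse(pre_label), Subtype.parse(post_label)
    return {
        "transition": f"{pre_label} -> {post_label}",
        "effort_changed": pre.effort != post.effort,
        "voice_changed": pre.voice != post.voice,
    }


# Hypothetical participant classified before and after an intervention.
change = describe_change("HVER-VC", "HVER-NC")
```

Keeping the two dimensions separate makes it possible to see, for example, that a participant's voice response changed while the effort rating pattern did not.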

Limitations and opportunities

As with all studies, this work has limitations as well as opportunities for future work. One limitation is the restricted sample of college-age adults, which may limit generalization of the findings to other age groups or populations that may respond differently to the study parameters. Additionally, dividing the sample into four distinct subgroups reduces the statistical power of the study, despite its having more participants than many other vocal loading studies10. Nonetheless, the ability to observe significant differences within these smaller groups is noteworthy.

Expanding the subject pool in both size and diversity would enhance the study’s validity. To facilitate this, the study was designed and executed using the free PsychoPy platform, which allows identical instructions and protocols to be used at any location with the necessary hardware (e.g., microphones and speakers). The presentation program incorporated automated segmentation protocols, allowing for rapid data processing and substantially lowering computation costs. Deploying similar VLT designs will enable comparable research and a practical increase in sample size.

Additionally, psychological and physical measurements of the participants should be collected to investigate possible correlations between vocal demand responses and individual attributes, such as personality and vocal experience. Identifying these traits could reveal potential risk factors for vocal fatigue, leading to a better understanding of vocal fatigue and laying the groundwork for reducing its prevalence and impact. Given previous research showing connections between voice and psychophysical measurements, future studies should incorporate a battery of such measures to support vocal health research.

Conclusions

This study shows that inconsistencies in vocal loading task (VLT) studies on state vocal fatigue can be reduced using a multi-faceted approach. This includes a consistent framework and definition for vocal fatigue, a modular and comparable VLT protocol, and a computational method for classifying vocal demand responders. While the first two suggestions have been proposed in previous work, it is worth reiterating that a consistent framework and definition, together with modular and comparable protocols, are essential for advancing the field. Novel to this report is the approach of vocal demand responder subtyping. This approach quantifies state vocal fatigue through measurable changes in both perceived effort and vocal acoustic parameters, enabling individual classification of fatigue responses. Identifying these responders is important for developing personalized therapeutic approaches and understanding the underlying mechanisms of vocal fatigue, and it also provides an example of how a precision medicine approach can be implemented in other voice assessment situations.