Abstract
Problems understanding speech-in-noise (SIN) are commonly associated with peripheral hearing loss. However, pure tone audiometry (PTA) alone fails to fully explain SIN ability. This is because SIN perception is based on complex interactions between peripheral hearing, central auditory processing (CAP), and other cognitive abilities. We assessed the interactions between these factors and age using a multivariate approach that allows the modelling of directional effects on theoretical constructs: structural equation modelling. We created a model to explain SIN using latent constructs for sound segregation, auditory (working) memory, and SIN perception, as well as PTA, age, and measures of non-verbal reasoning. In a sample of 207 participants aged 18–81 years, age was the biggest determinant of SIN ability, followed by auditory memory. PTA did not contribute to SIN directly, although it modified sound segregation ability, which covaried with auditory memory. A second model, using a CAP latent structure formed by measures of sound segregation, auditory memory, and temporal processing, revealed CAP to be the largest determinant of SIN, ahead of age. Furthermore, we demonstrated that the impacts of PTA and non-verbal reasoning on SIN are mediated by their influence on CAP. Our results highlight the importance of central auditory processing in speech-in-noise perception.
Introduction
Approximately 1 in 4 people, over 1.3 billion worldwide, have hearing loss, and those of advanced age are disproportionately affected1,2. With an ageing population, hearing loss poses a growing societal problem. Older adults suffering from hearing loss not only face a world without sound, but also experience impaired communication, social isolation, and increased risk of depression3,4 and dementia5,6,7,8,9.
Older adults who struggle to hear in everyday life situations often complain about their inability to understand speech, especially in adverse conditions, such as with competing speakers in the background, environmental noise, or quiet or degraded speech10. Speech-in-noise (SIN) perception, colloquially termed the “cocktail party problem”, has been a target of hearing research for several decades. Although a person’s overall hearing ability, measured by pure tone audiometry (PTA), is a major factor determining speech-in-noise perception, it fails to fully explain it11,12,13. Additionally, hearing aids cannot fully resolve SIN difficulties, as demonstrated by the dissatisfaction of some hearing aid users in noisy situations and in large group settings14,15.
One reason why PTA cannot fully explain speech-in-noise ability, and why interventions that only target sound levels (i.e., hearing aids) cannot fully restore it, is that SIN perception is inherently more complex than a process solely reliant on peripheral auditory mechanisms; it also requires central sound processing, involving the brainstem and cortex16,17. When speech embedded in noise reaches a person’s cochlea, the target speech still needs to be isolated from the background and remembered until meaning can be extracted. This complex process demands both sound segregation ability and working memory capacity.
Sound segregation is primarily a bottom-up auditory process driven by acoustic cues such as pitch, temporal structure, and spatial location, allowing the brain to separate target sounds from competing background noise18. To resemble speech in noise without linguistic information, we developed the Stochastic Figure-Ground (SFG) paradigm, which emulates the need to extract a meaningful signal from a perceptually similar background19,20. SFG comprises pure-tone components that repeat over time to form a figure (target), while random tones varying over time form the ground (masker). Previous research has linked SFG performance to speech-in-noise ability and demonstrated the involvement of higher-level brain structures beyond the auditory cortex in its processing19,21,22,23,24. Further, SFG can elicit neural entrainment similar to that of speech25,26.
Auditory working memory (AWM) capacity has been repeatedly linked to speech-in-noise perception [for reviews, see 27,28]. From first principles, understanding speech in adverse conditions, when the perceptual signal is not clear, such as in speech-in-noise paradigms, requires a level of “post-processing” of the input sound to decode it, for example, to support postdiction: the retrospective reconstruction of misheard words. Thus, it is not surprising that SIN relies partly on our ability to maintain and manipulate sounds in mind. For example, working memory is associated with noise-vocoded speech perception29 and with the processing of speech from competing sources30 – both adverse listening conditions similar to SIN. Working memory may even be the mechanism behind better SIN in musicians12,31. It has been proposed32 that working memory is especially important when the periphery starts to degrade (e.g., due to age) and the sound input lacks fidelity. In this view, working memory acts to compensate for inaccurate perception. Nevertheless, short-term auditory memory must be, axiomatically, involved in sentence-level speech comprehension; a sentence being processed must be retained in mind until it is completed (or can be accurately predicted) before it can be understood.
Speech-in-noise perception relies on other cognitive abilities beyond central auditory processing, for example, processing speed, inhibitory control, and crystallized intelligence27,28. It is generally understood that as hearing deteriorates, speech perception requires greater input from cognition33. This may extend to speech-in-noise perception, where the sensory input is also “deteriorated”. For example, nonverbal reasoning is associated with the recognition of degraded speech in both cochlear implant users and normal-hearing adults34. Additionally, word recognition in noise is based on less sensory evidence and relies more on preparatory (cognitive) processes35.
Age is undoubtedly linked to SIN ability, as peripheral hearing, central auditory processing, and cognition all decline with age [for a review, see 36]. Age is also known to degrade temporal processing37, another central mechanism implicated in speech-in-noise perception11. In older adults with and without hearing loss, cognitive function appears highly influential in SIN perception, although cognitive function is tightly related to age itself38. The inter-relationships amongst age, hearing, central auditory processing, and cognition complicate the interpretation of these associations. Specifically, if age leads to greater hearing loss and greater cognitive decline, as well as worse auditory processing, how can we disentangle the relationships between these variables and speech-in-noise perception? Additionally, the discrete contributions of working memory and sound segregation to speech-in-noise perception remain to be elucidated.
Here, we have taken a multivariate approach that allows us to study theoretical constructs and their complex causal relationships: structural equation modelling (SEM). SEM is a statistical method in which latent constructs can be defined using measured variables or indicators, and directional links modelled based on empirical hypotheses. By creating separate latent variables for central auditory processes, as well as for general intelligence as measured by a matrices test, their discrete contributions to speech-in-noise perception for both words and sentences can be assessed. Furthermore, by adding age, PTA, and the causal link between them, a more accurate and detailed view of the effects of age and hearing on SIN can be obtained. We first test a structural equation model where sound segregation and auditory (working) memory are separated into two latent variables. The sound segregation construct is measured using SFG tasks, while the auditory memory construct is measured using the backwards digit span test and tests of precision for delay-matching sounds based on frequency and amplitude modulation rate31. Considering that sound segregation and auditory memory are part of an overarching theoretical construct, i.e., central auditory processing (CAP), and both rely on similar brain architecture, including primary and non-primary auditory cortex22,39,40, a second structural equation model is created. In this model, sound segregation and auditory memory, in addition to temporal processing, measured with a between-channels gap detection task, are joined in a latent construct representing CAP.
Materials and methods
Participants
Data from 222 participants (148 females) aged between 18 and 81 years were collected. Inclusion criteria were native English-speaker status, the absence of any hearing complaints (such as self-perceived or diagnosed hearing loss, the use of hearing aids, or tinnitus), no history of neuropsychological disorders, and no current use of neurotropic medication. A total of 15 datasets were removed from all analyses due to dyslexia diagnosis (2), inability to perform the sentence-in-noise task (2), tinnitus (1), non-native English-speaker status (1), an incomplete word-in-noise test (1), and duplicated data (8); for participants who were tested twice (i.e., duplicated), only the earliest session was used for analysis. Thus, data from 207 participants (138 female) were used for analysis. Participants’ ages ranged from 18 to 81 years (mean: 49.13; median: 51.08; standard deviation [SD]: 16.00). Data from two participants were incomplete, and these participants were therefore not included in the SEM analysis. The study was approved by Newcastle University’s Ethics Committee (Reference numbers: 10356/2018 and 46225/2023), and written informed consent was obtained from all participants before the start of the study. The study was performed in accordance with the Declaration of Helsinki (World Medical Association, 2024).
Materials
Speech-in-noise
Speech-in-noise tasks for both words and sentences were included. The word-in-noise (WiN) test consisted of the Iowa Test of Consonant Perception – British version (ITCP-B)41. The ITCP-B is a phonetically balanced, single-word, closed-set computer task. The test consists of 120 consonant-vowel-consonant words spoken by either a male or female speaker amongst 8-talker babble noise. Participants hear the target word (e.g., “moon”), which begins one second after babble onset, and are then presented with a self-paced 4-alternative forced-choice screen with phonetically similar words (or minimal pairs) (e.g., “moon-boon-dune-noon”). Participants make their selection using the number keys 1–4 and are then presented with feedback (“Correct-Incorrect”) for 0.6 s. A new trial starts one second afterwards while a fixation cross is shown on the screen. Words are always presented at a −2 dB signal-to-noise ratio (SNR). The babble was formed by 4 female and 4 male British speakers, and the recording lasted 15 s. A list of pre-defined starting points for the babble was created spanning every 0.1 s starting at 0, and permuted without replacement for each participant. Half of the words were spoken by a male speaker and the other half by a female speaker, with the assignment randomised per participant. Participants were given a break after every 40 trials. Performance was calculated as the proportion of correct answers.
The sentence-in-noise (SiN) test consisted of the British version of the Oldenburg sentences21. This is a closed-set test where sentences follow the structure <name-verb-number-adjective-noun> (e.g., “William sees four white houses”) and are spoken by a male speaker masked by 16-talker babble noise. A 5 × 10 matrix of all possible word combinations was presented to participants, who responded using a mouse. Answers were considered correct only if all selected words were correct. This was an adaptive paradigm with a 1-up 1-down staircase. The starting SNR was 10 dB, which changed in steps of 3 dB, reducing to 2 dB after the first reversal, and further reducing to 1 and 0.5 dB after 4 and 6 reversals, respectively. The babble was presented for 3.3 s, and sentence onset was 0.25 s post-babble. The masker consisted of a recording of 21.49 s, and its starting point was fully randomised per trial from 0 to 18 s. Sentence presentation was also fully randomised per trial. The task ended after 12 reversals, and the performance threshold was calculated as the median SNR of the last 6 reversals.
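As an illustration, the staircase rule and threshold computation described above can be sketched in R (our reconstruction, not the authors’ task code; function and variable names are hypothetical):

```r
# Sketch of the SiN 1-up 1-down staircase: step size is 3 dB, then 2 dB
# after the first reversal, then 1 dB and 0.5 dB after 4 and 6 reversals.
next_snr <- function(snr, correct, n_reversals) {
  step <- if (n_reversals >= 6) 0.5
          else if (n_reversals >= 4) 1
          else if (n_reversals >= 1) 2
          else 3
  if (correct) snr - step else snr + step  # correct answer -> harder (lower SNR)
}

# Threshold after 12 reversals: median SNR over the last 6 reversals
sin_threshold <- function(reversal_snrs) median(tail(reversal_snrs, 6))
```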
SFG
Two SFG tasks were included. The SFG-Gap discrimination task was adapted from Holmes and Griffiths21. SFG stimuli are formed by tone chords lasting 50 ms and can be divided into two components: the “Figure” (target) and the “Ground” (masker). The ground was composed of between 5 and 15 tone elements per chord, for a total of 70 chords (3.5 s). Tones were selected from a frequency space between 179.73 and 7246.29 Hz on a logarithmic scale. The figure was composed of 3 tone elements per chord, which repeated over time for a total of 42 chords (2.1 s), and it started between chords 16–20. A set of 144 figures was created in advance and presented to each participant in randomised order; when necessary, a new iteration of this set was presented. Two SFG stimuli were presented per trial with an inter-stimulus interval (ISI) of 400 ms. One of the stimuli had a 6-chord-long gap in the figure, constrained to start between chords 11–32 of the figure. Participants were required to indicate which stimulus had a gap in the figure. The task followed a 1-up 1-down adaptive procedure in which the target-to-masker ratio (TMR) was varied. The starting TMR was 10 dB, and TMR changed in steps of 4 dB, which were reduced to 2 dB after the first reversal, and further reduced to 1 dB after 4 reversals. The task ended after 10 reversals. Participants were familiarised with the stimuli at the beginning of each task by introducing the concepts of “figure” and “ground”, and by allowing a practice run of 6 trials at the starting TMR.
The SFG-Figure discrimination task followed the same trial structure and adaptive procedure as the SFG-Gap task, but consisted of stimuli of 2 s duration. One of the stimuli was ground only, while the other included a figure spanning 6 chords, which again could start at chords 16–20, and the participants’ task was to select which stimulus contained the figure. Due to the adaptive nature of the paradigm, and to avoid changes in overall power between the two stimuli, the ground-only stimulus included a “dummy” figure of the same duration, composed of random elements not already contained within the ground, which changed in TMR in the same fashion. More specifically, three tone elements were added to the 6 adjacent chords representing the figure, with the frequencies of these tone elements selected randomly for each of the 6 chords; the current trial TMR was then applied to this “dummy” figure. This prevented successful task completion based on differences in overall power between stimuli with and without a figure. Thresholds for both tasks were calculated as the median TMR of the last 6 reversals.
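To make the stimulus structure concrete, a minimal reconstruction of a single SFG stimulus with the SFG-Gap parameters might look as follows (a sketch, not the authors’ JavaScript implementation; the 44.1 kHz sampling rate and the 129-step frequency grid are assumptions, and TMR scaling and onset ramps are omitted):

```r
# Illustrative SFG stimulus: 70 chords of 50 ms, ground of 5-15 random tones
# per chord, and a 3-tone figure repeating for 42 chords from chord 16-20.
fs    <- 44100                          # assumed sampling rate
n     <- round(0.05 * fs)               # samples per 50 ms chord
freqs <- exp(seq(log(179.73), log(7246.29), length.out = 129))  # assumed grid
t     <- (0:(n - 1)) / fs

make_chord <- function(f) rowSums(sapply(f, function(x) sin(2 * pi * x * t)))

fig_freqs <- sample(freqs, 3)           # figure tones, fixed across chords
fig_start <- sample(16:20, 1)           # figure onset chord

stim <- unlist(lapply(1:70, function(k) {                # 70 chords = 3.5 s
  chord <- make_chord(sample(freqs, sample(5:15, 1)))    # random ground tones
  if (k >= fig_start && k < fig_start + 42)              # 42-chord figure
    chord <- chord + make_chord(fig_freqs)
  chord
}))
stim <- stim / max(abs(stim))           # normalise amplitude
```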
Auditory working memory
An auditory memory task was used to calculate memory precision for frequency (Freq) and amplitude modulation (AM) rate31,42. Participants heard either a one-second pure tone or AM-modulated white noise. After a delay of 2 s, the target sound had to be matched by clicking on an unlabelled, continuous, horizontal visual scale representing the frequency (440–880 Hz) or AM rate (5–20 Hz) space. Frequencies were selected from a uniform distribution, and a sinusoidal function was used to apply the amplitude modulation. Each mouse click played the currently selected sound; this could be repeated without time limit, after which participants pressed the ‘Enter’ key to confirm. The stimulus type alternated trial-by-trial for a total of 32 trials. A break was given after 16 trials. Four practice trials, 2 for each stimulus type, were presented at the beginning of the task. A precision score was obtained for frequency and AM performance as the inverse of the standard deviation of the errors, calculated by fitting a Gaussian function.
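One plausible reading of this precision score is sketched below (assuming errors are normalised to the response-scale range; the Gaussian fit uses `MASS::fitdistr`, and all names are hypothetical):

```r
library(MASS)

# Precision as the inverse SD of a Gaussian fitted to the matching errors.
precision <- function(response, target, lo, hi) {
  err <- (response - target) / (hi - lo)  # normalise errors to the scale range
  fit <- fitdistr(err, "normal")          # Gaussian fit to the error distribution
  unname(1 / fit$estimate["sd"])
}

# e.g., frequency trials: precision(resp_hz, target_hz, 440, 880)
```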
A measure of phonological working memory was also included: the digit span (DS) test from the WMS-III (Wechsler Memory Scale – Third Edition; The Psychological Corporation). Participants were required to repeat sequences of digits of increasing length, either in the order heard (DS Forward) or in reverse order (DS Backward). The total score represents the number of sequences repeated accurately.
Between channels gap discrimination
A between-channels gap (B-C Gap) discrimination task was adapted from Phillips, Taylor, Hall, Carr and Mossop43. This test was designed to be a ‘central’ gap-detection task, requiring the recognition of a gap between frequencies that are represented separately in the ascending auditory pathway. Two narrow-band noises with a bandwidth of 0.25 octaves and a 0.5 ms ramp were separated by a silent interval. The first sound was centred at 4 kHz and lasted 10 ms, while the second sound was centred at 1 kHz and lasted 300 ms. The gap duration started at 200 ms and changed following a 1-up 2-down staircase, starting with a step size of 20 ms, followed by 15 (after 3 reversals), 10 (after 6 reversals), 5 (after 8 reversals), 2 (after 10 reversals), and 1 (after 12 reversals) ms. The task ended after a total of 19 reversals or after reaching 125 trials, whichever happened first. Participants were presented with two stimulus pairs separated by 600 ms, one containing a gap as described above and one without (1 ms gap). Participants pressed (self-paced) the number keys ‘1’ or ‘2’ depending on whether the gap was in the first or second position. Feedback was shown on the screen (‘Correct!’ or ‘Wrong!’) for 500 ms. A new trial started after a 1 s inter-trial interval. During the task, if the gap duration reached 1 ms, any answer was considered wrong and the gap duration increased. Before the beginning of the task, and after a familiarisation run introducing the target stimuli, 12 practice trials were presented in which the gap duration started at 230 ms and changed adaptively in a 1-up 1-down pattern by 5 ms. The performance threshold was calculated as the median gap duration over the last 6 reversals.
General (fluid) intelligence
A matrix test to measure general or fluid intelligence was created using the matrix reasoning item bank [MaRs-IB; 44]. Matrices were all taken from set 1. Participants familiarised themselves with the task in 4 practice trials using the first matrices from the set (numbers 1–4). The test included a total of 26 matrices: 25 in sequential order starting with item 6 (numbers 6–30), plus matrix number 47, which is of greater difficulty, to avoid ceiling effects. Participants had 30 s to respond to each matrix, and a countdown timer appeared for the last 5 s. A total score (0–26) representing the number of correct matrices was computed per participant.
Other measures
Measures for musicality (Goldsmith Musical Sophistication Index; Gold-MSI), premorbid intelligence and literacy (Wechsler Test of Adult Reading; WTAR), and self-reported SiN ability (Spatial Speech Questionnaire, SSQ) were also taken. These measures are not included in the current analyses and thus are not described further.
Procedure
After participants arrived at the lab and provided informed consent, PTA thresholds were measured for frequencies 0.25–8 kHz in a soundproof room using air conduction only, with an Interacoustics AD226 diagnostic audiometer. The computer tasks were then performed in the same soundproof room in the following order: SIN tests (words, then sentences), SFG-Figure discrimination, auditory memory (Freq + AM), SFG-Gap discrimination, Matrices, Gold-MSI, and Gap detection. Paper tests were then completed in another room in the following order: DS Forward, DS Backward, WTAR, and SSQ. The testing session usually lasted 2 h, and participants received compensation for their time. Most computer tasks were coded in JavaScript and run using Chrome, except WiN and Gap detection, which were coded in Matlab R2017a. All stimuli were presented between 65 and 73 dB SPL (sound pressure level), depending on the task, but at the same level across participants.
Analysis
Before the SEM analyses, data were linearly transformed to reduce differences in variance amongst variables and so that higher values reflected better performance. Thus, the scores of SiN, both SFG tasks, and B-C Gap were inverted (sign-flipped). SiN was further multiplied by 10, while B-C Gap was multiplied by 0.1. WiN scores were converted from proportions to percentages. Lastly, the precision scores for AM and Freq were multiplied by 50 and 10, respectively.
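In code, these transformations amount to the following (a sketch assuming a data frame `d` with hypothetical column names):

```r
# Pre-SEM rescaling as described above; column names are hypothetical.
d$SiN  <- -d$SiN  * 10    # invert (lower SNR threshold = better) and rescale
d$SFGg <- -d$SFGg         # invert SFG-Gap TMR threshold
d$SFGf <- -d$SFGf         # invert SFG-Figure TMR threshold
d$Gap  <- -d$Gap  * 0.1   # invert and rescale between-channels gap threshold
d$WiN  <-  d$WiN  * 100   # proportion correct -> percentage
d$AM   <-  d$AM   * 50    # rescale AM precision
d$Freq <-  d$Freq * 10    # rescale frequency precision
```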
Structural equation models (SEM) were built using the lavaan package (version 0.6-17) in R (version 4.2.1). Models were estimated using maximum likelihood with a nonnormality correction based on the Satorra-Bentler scaled test statistic45,46. The α level used for significance testing of path coefficients was set at 5%. The models were evaluated against a set of criteria using several goodness-of-fit measures: the Bentler comparative fit index (CFI), the Tucker-Lewis index (TLI), the root-mean-square error of approximation (RMSEA), and the standardised root mean squared residual (SRMR)47,48. Only robust versions of these indices are reported in this study45,46. Bootstrapped 95% confidence intervals were created for each model fit measure using 1000 repetitions with the ‘Bollen-Stine’ method49 as implemented in lavaan.
The choice of scaling variables for each latent construct was based on theory alone. Because previous research showed good predictability of SIN from SFG-Gap21, it was selected as the scaling variable of the SFG latent construct. For auditory working memory, recent findings implicate memory for amplitude modulation rates as one of the greatest factors determining speech-in-noise perception42, so AM was used as the scaling variable for the AWM construct. Because word-in-noise perception forms the basis of sentence-in-noise perception, WiN was used as the scaling variable for SIN. For the Central Auditory Processing (CAP) latent variable, the working memory component (AM) was used as the scaling variable, as the contribution of working memory is the most replicated finding in speech-in-noise research28. Because lavaan uses the fixed-marker technique, scaling variables are neither estimated nor tested for significance; their paths are displayed as significant for visualisation purposes only.
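For concreteness, Model 1’s specification and estimation might be written as follows in lavaan (a sketch using the hypothetical column names above, with the path set inferred from the coefficients reported in the Results; this is not the authors’ published script):

```r
library(lavaan)

model1 <- '
  # measurement model; the first indicator of each factor is the fixed marker
  SFG =~ SFGg + SFGf            # sound segregation
  AWM =~ AM + Freq + DSb        # auditory (working) memory
  SIN =~ WiN + SiN              # speech-in-noise

  # structural model (paths inferred from the reported results)
  PTA ~ Age
  MTX ~ Age
  SFG ~ Age + PTA + MTX
  AWM ~ Age + PTA + MTX
  SIN ~ Age + PTA + MTX + SFG + AWM
  SFG ~~ AWM                    # residual covariance
'

fit <- sem(model1, data = d, estimator = "MLM")  # Satorra-Bentler scaled statistic
fitMeasures(fit, c("cfi.robust", "tli.robust", "rmsea.robust", "srmr"))

# Bollen-Stine bootstrap of the fit measures (1000 repetitions)
boot <- bootstrapLavaan(fit, R = 1000, type = "bollen.stine", FUN = fitMeasures)
```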
Results
Hearing thresholds
Participants’ average thresholds over all frequencies (0.25–8 kHz) and both ears were 11.969 (± 9.245) dB hearing level (HL). No participant had more than mild hearing loss (i.e., thresholds averaged over all frequencies across both ears were < 40 dB HL). Individual and average thresholds are plotted in Fig. 1.
Demographic data and performance
Demographic data and the average performance on all tasks are shown in Table 1. A Spearman correlation matrix was computed using Holm-Bonferroni correction for multiple comparisons. Most variables were correlated with one another, although age did not correlate with DS Backward. DS Backward also showed no correlation with PTA or with either SFG task. Lastly, between-channels gap detection did not correlate with SFG-Figure discrimination.
Sound segregation and auditory (working) memory as separate contributors to speech-in-noise
A structural equation model was constructed with separate latent variables representing sound segregation (‘SFG’) and auditory working memory (‘AWM’). The contributions of age, PTA, and overall intelligence (‘MTX’) were also included in the model. This model (Model 1), including path coefficients and model fit indices, can be seen in Fig. 2.
Model 1 (Fig. 2) had a good model fit, as demonstrated by the CFI (0.997, 95% CI: 0.978–1) and RMSEA (0.021, 95% CI: 0–0.060). The model explained 79.8% of the variance in speech-in-noise (R2: 0.830, Adjusted R2: 0.798). Based on path coefficients (β), the factor that had the greatest effect on speech-in-noise was age (β: −0.49, p < 0.001), followed by auditory working memory (β: 0.29, p < 0.01). Further, age significantly affected hearing (PTA; β: 0.74, p < 0.001), sound segregation (SFG; β: −0.32, p < 0.001), and general intelligence (MTX; β: −0.38, p < 0.001). Although intelligence contributed to SFG (β: 0.2, p < 0.01) and AWM (β: 0.51, p < 0.001), it did not have a direct effect on SIN (β: −0.03, p > 0.05). Similarly, PTA and SFG did not significantly predict SIN (PTA: β: −0.15, p > 0.05; SFG: β: 0.19, p > 0.05). The contribution of hearing (PTA) to SIN ability was also not mediated by its effect on working memory (β: −0.15, p > 0.05); hearing (PTA) affected only sound segregation (SFG; β: −0.32, p < 0.001). The shared variance between working memory and sound segregation was large and statistically significant (β: 0.43, p < 0.001).
Model 1. Path coefficients are shown within the arrows representing the paths. Latent variables are represented with elliptic shapes, while indicators and observed variables are denoted with rectangles. The exogenous variable is plotted in a diamond shape. Model fit indices are shown in the bottom-left corner with bootstrapped 95% confidence intervals in square brackets. * = p < 0.05, ** = p < 0.01, *** = p < 0.001.
The effects of central auditory processing on speech-in-noise
An alternative model (Model 2; Fig. 3) was created in which sound segregation and auditory memory were joined under a general process of CAP, with one indicator each. SFG-Gap was used to index sound segregation ability, while precision for amplitude modulation (AM) rate was used to index auditory memory. A third indicator was used to represent temporal processing: between-channels gap discrimination.
By combining the three types of indicators into one latent construct – CAP – this factor made a significant contribution to SIN, with a path coefficient of 0.50 (p < 0.001), greater than that of age (β: −0.44, p < 0.001). The relationships between PTA and SIN (direct path β: −0.09, p > 0.05) and between MTX and SIN (direct path β: −0.02, p > 0.05) were mediated through their effects on CAP (PTA→CAP, β: −0.34, p < 0.001; MTX→CAP, β: 0.37, p < 0.001). Age had a causal effect on CAP (β: −0.26, p < 0.05), PTA (β: 0.74, p < 0.001), and MTX (β: −0.38, p < 0.001).
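In lavaan, this kind of mediation claim can be made explicit by labelling paths and defining indirect effects as products of labelled coefficients (a sketch with the Model 2 structure inferred from the text and Fig. 3; labels and column names are hypothetical):

```r
model2 <- '
  CAP =~ AM + SFGg + Gap        # AM is the fixed-marker scaling variable
  SIN =~ WiN + SiN

  PTA ~ Age
  MTX ~ Age
  CAP ~ a1*PTA + a2*MTX + Age
  SIN ~ b*CAP + PTA + MTX + Age

  # indirect effects of hearing and reasoning on SIN via CAP
  ind_PTA := a1 * b
  ind_MTX := a2 * b
'
fit2 <- sem(model2, data = d, estimator = "MLM")
summary(fit2, fit.measures = TRUE, standardized = TRUE)
```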
Model fit, although slightly poorer than that of Model 1, was good, as exemplified by the CFI (0.991, 95% CI: 0.980–1) and RMSEA (0.046, 95% CI: 0–0.069). This model explained 81.1% of the variance in speech-in-noise (R2: 0.830, Adjusted R2: 0.811), slightly more than the previous model despite using fewer variables.
Model 2. A central auditory processing (CAP) latent variable was created to include measures of sound segregation (SFG-Gap), auditory working memory (AM), and temporal processing (between-channels gap detection). Path coefficients are shown within the arrows representing each path. Latent variables are represented with elliptic shapes, while indicators and observed variables are denoted with rectangles. The exogenous variable is plotted in a diamond shape. Model fit indices are shown in the bottom-left corner with bootstrapped 95% confidence intervals in square brackets. * = p < 0.05, ** = p < 0.01, *** = p < 0.001.
All path coefficients for Models 1 and 2 between latent variables, including observed variables, can be seen in Table 2.
To assess multicollinearity, which could weaken confidence in the coefficient estimates, we built a linear regression model. First, the sentence and word scores were standardized and summed to create the dependent variable, and all other measured variables used to construct both SEMs (Model 1 and Model 2) were defined as predictors. Tolerance and variance inflation factors (VIF) were then calculated using the package “olsrr” in R. VIF values > 10 are considered to indicate potential collinearity, which could undermine the interpretation of the models50. VIF values were greatest for Age and PTA, but both were far below the multicollinearity threshold (2.50 and 2.39, respectively).
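This check amounts to the following (a sketch with the hypothetical column names used above; `ols_vif_tol()` is the olsrr function reporting tolerance and VIF per predictor):

```r
library(olsrr)

# Composite outcome: standardized word + sentence scores
d$SINcomp <- as.numeric(scale(d$WiN) + scale(d$SiN))

# All measured variables entering either SEM are used as predictors
lm_fit <- lm(SINcomp ~ Age + PTA + MTX + SFGg + SFGf + AM + Freq + DSb + Gap,
             data = d)

ols_vif_tol(lm_fit)  # VIF > 10 would flag problematic collinearity
```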
Discussion
In the current study, we constructed a structural equation model (SEM) of speech-in-noise (SIN) perception. SEM is a complex multivariate approach that allows for the exploration of causal relationships between theoretical constructs. We included the observed/exogenous variables age, hearing thresholds (PTA), and general intelligence (MTX), and created latent constructs for sound segregation (SFG) and auditory working memory (AWM). By modelling the effects of age on hearing thresholds and the influence of PTA on central auditory processing, we found that hearing did not have a direct impact on speech in noise. This contrasts with previous research that highlighted the importance of hearing thresholds over age in speech-in-noise perception42,51. One possible reason for this discrepancy is that the majority of people in our sample had normal or near-normal hearing, with average hearing thresholds not exceeding 40 dB. Additionally, when modelling complex relationships between auditory-cognitive factors and hearing in normal-hearing people, other research has demonstrated little or no effect of audiometric thresholds12. Nevertheless, the inclusion of young adults with no hearing loss, and the fact that our sample has only mild hearing loss that is highly age-related, prevent us from drawing strong conclusions about the absence of a direct relationship between PTA and SIN. We are unfortunately unable to assess whether extended high-frequency (> 8 kHz) hearing level would be a better predictor than standard PTA, as suggested by others52,53,54, although this finding is not always replicated55. Similarly, hearing was only measured using air conduction, so we cannot isolate sensorineural from conductive or mixed hearing loss.
We found that age was the biggest contributor to speech-in-noise perception. The unique variance explained by age was not captured by any of our auditory-cognitive variables. Previous research has highlighted the importance of age [e.g., 56], yet the non-specific impacts of ageing preclude a single interpretation. Although the influence of age on some level of cognitive decline was seen in its significant relationship to nonverbal intelligence (as measured by MTX), MTX did not show any direct impact on speech-in-noise thresholds. Given the significant direct path between age and SIN, part of the effect of age on speech-in-noise difficulties must rely on mechanisms not explained by our model. One possible factor may be age-related reduced inhibition, as research has demonstrated that, for people with normal hearing, deficits in speech-in-noise perception are mediated by impaired inhibitory processes57. Research has found that older adults over-represent sound signals in the cortex, and this extends to the neural processing of “unattended” speech58; thus, it is plausible that impaired inhibition of “noise” signals hinders speech-in-noise perception. Another factor is likely the age-related slowing of processing speed, which has been associated with SIN perception28,59 and has previously been identified as a contributor to SIN decline over time60. Nevertheless, age has a multifactorial effect that includes deterioration of overall cognitive abilities61, temporal processing37, frequency selectivity62, and binaural processing63, all of which could impact speech-in-noise perception.
Auditory working memory had a discrete effect on SIN not explained by age or hearing thresholds. As mentioned earlier, the relationship between auditory memory and speech-in-noise is the most replicated finding in SIN research27,28. Although previous research has found weak or no relationships between working memory and SIN in young adults with normal hearing32, our model demonstrated an association between auditory memory and SIN independent of age and hearing levels, as neither contributed significantly to the constructed working memory latent variable.
Sound segregation did not contribute to SIN directly, but shared a significant amount of variance with auditory working memory. Sound segregation partly relies on peripheral mechanisms, as PTA significantly affected SFG performance. According to our model, the variance explained by SFG is shared with, and captured by, auditory memory. Besides this shared variance, the addition of a new SFG stimulus (figure discrimination) may explain the lack of a discrete contribution to SIN by the sound segregation construct. In addition to the previously developed SFG task, in which participants had to discriminate the presence of a gap within the figure21, and which requires tracking the figure over time, we used a newly developed task that combined this design with the original version, in which the presence of a figure had to be detected. However, the original detection paradigm used a fixed SNR of 0 dB, such that the target and ground are differentiated only by temporal coherence20. In the current, newly developed task, which involved figure discrimination in an adaptive paradigm with varying SNR, the figure could be detected within the background by sound-level differences alone and did not require tracking over time.
Nonverbal reasoning greatly modulated auditory working memory. It also impacted sound segregation mechanisms, albeit less strongly. General intelligence is thought to influence all cognition-related processes; however, our model stresses that the effect of general intelligence on SIN is mediated by auditory working memory. Working memory capacity and general intelligence are highly related64, but our model considered a directional mechanism in which intelligence determines memory ability. Associations between nonverbal intelligence and degraded-speech recognition have been found before34,65, but differences in processing speed may underlie this relationship59. The matrix test used in the current study imposed a response time limit and is thus confounded by processing speed, which is also undoubtedly linked to general intelligence and working memory66. The relationship between intelligence and working memory may also be partly driven by the association between intelligence and sensory performance67, as some of the tasks used to measure working memory were based on frequency and temporal precision.
We further created a simplified SEM where sound segregation (SFG-Gap), auditory memory (AM rate), and temporal processing (between-channels gap detection) all act under one latent structure: central auditory processing (CAP). By joining sound segregation and auditory memory mechanisms, and adding temporal precision, this latent construct surpassed age as the most important predictor of speech-in-noise perception. We further revealed that the impacts of hearing and intelligence on SIN act through their influence on central auditory processing. In other words, the effects of hearing and intelligence on speech-in-noise are mediated through central auditory mechanisms: greater abstract reasoning and better hearing support improved central processing, which in turn enhances comprehension of speech in noise. Previous research using SEM has emphasised the importance of central auditory processing to speech-in-noise perception. In a sample of 120 older adults, Anderson et al.12 found that central processing was the biggest determinant of speech-in-noise perception, followed by cognitive function, which included auditory (working) memory.
Overall, our research demonstrates a critical role of central auditory processing, encompassing auditory (working) memory, sound segregation, and temporal precision, in speech-in-noise perception. Furthermore, our models emphasize the importance of non-specific age effects that go beyond declines in hearing and reasoning abilities, and show how central auditory processing mechanisms mediate the effects of hearing and cognition on speech-in-noise perception. Our results suggest that CAP should be assessed in clinical settings to obtain a comprehensive view of the reasons for SIN difficulties.
Data availability
The data and analysis script used in the current manuscript can be publicly accessed through OSF (https://osf.io/nz9v7/).
References
Vos, T. et al. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: A systematic analysis for the Global Burden of Disease Study 2015. Lancet 388 (10053), 1545–1602. https://doi.org/10.1016/S0140-6736(16)31678-6 (2016).
Akeroyd, M. A. & Munro, K. J. Population estimates of the number of adults in the UK with a hearing loss updated using 2021 and 2022 census data. Int. J. Audiol. 63 (9), 659–660. https://doi.org/10.1080/14992027.2024.2341956 (2024).
Mener, D. J., Betz, J., Genther, D. J., Chen, D. & Lin, F. R. Hearing loss and depression in older adults. J. Am. Geriatr. Soc. 61 (9), 1627–1629. https://doi.org/10.1111/jgs.12429 (2013).
Li, C. M. et al. Hearing impairment associated with depression in US adults, National Health and Nutrition Examination Survey 2005–2010. JAMA Otolaryngol. Head Neck Surg. 140 (4), 293–302. https://doi.org/10.1001/jamaoto.2014.42 (2014).
Lin, F. R. et al. Hearing loss and incident dementia. Arch. Neurol. 68 (2), 214–220. https://doi.org/10.1001/archneurol.2010.362 (2011).
Gallacher, J. et al. Auditory threshold, phonologic demand, and incident dementia. Neurology 79 (15), 1583–1590. https://doi.org/10.1212/WNL.0b013e31826e263d (2012).
Griffiths, T. D. et al. How can hearing loss cause dementia? Neuron 108 (3), 401–412. https://doi.org/10.1016/j.neuron.2020.08.003 (2020).
Lin, F. R. et al. Hearing loss and cognitive decline in older adults. JAMA Intern. Med. 173 (4), 293–299. https://doi.org/10.1001/jamainternmed.2013.1868 (2013).
Livingston, G. et al. Dementia prevention, intervention, and care: 2024 report of the Lancet standing Commission. Lancet 404 (10452), 572–628. https://doi.org/10.1016/s0140-6736(24)01296-0 (2024).
Kochkin, S. MarkeTrak VIII: Consumer satisfaction with hearing aids is slowly increasing. Hear. J. 63 (1). https://doi.org/10.1097/01.Hj.0000366912.40173.76 (2010).
Füllgrabe, C., Moore, B. C. J. & Stone, M. A. Age-group differences in speech identification despite matched audiometrically normal hearing: Contributions from auditory temporal processing and cognition. Front. Aging Neurosci. 6. https://doi.org/10.3389/fnagi.2014.00347 (2015).
Anderson, S., White-Schwoch, T., Parbery-Clark, A. & Kraus, N. A dynamic auditory-cognitive system supports speech-in-noise perception in older adults. Hear. Res. 300, 18–32. https://doi.org/10.1016/j.heares.2013.03.006 (2013).
Griffiths, T. D. Predicting speech-in-noise ability in normal and impaired hearing based on auditory cognitive measures. Front. Neurosci. 17, 1077344. https://doi.org/10.3389/fnins.2023.1077344 (2023).
Kochkin, S. MarkeTrak V: Consumer satisfaction revisited. Hear. J. 53 (1), 45–46 (2000).
Kochkin, S. MarkeTrak VII: Customer satisfaction with hearing instruments in the digital age. Hear. J. 58 (9). https://doi.org/10.1097/01.HJ.0000286545.33961.e7 (2005).
Chandrasekaran, B. & Kraus, N. The scalp-recorded brainstem response to speech: Neural origins and plasticity. Psychophysiology 47 (2), 236–246. https://doi.org/10.1111/j.1469-8986.2009.00928.x (2010).
Ding, N. & Simon, J. Z. Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad. Sci. USA 109 (29), 11854–11859. https://doi.org/10.1073/pnas.1205381109 (2012).
Carlyon, R. P. How the brain separates sounds. Trends Cogn. Sci. 8 (10), 465–471. https://doi.org/10.1016/j.tics.2004.08.008 (2004).
Teki, S., Chait, M., Kumar, S., von Kriegstein, K. & Griffiths, T. D. Brain bases for auditory stimulus-driven figure-ground segregation. J. Neurosci. 31 (1), 164–171. https://doi.org/10.1523/JNEUROSCI.3788-10.2011 (2011).
Teki, S., Chait, M., Kumar, S., Shamma, S. & Griffiths, T. D. Segregation of complex acoustic scenes based on temporal coherence. Elife 2, e00699. https://doi.org/10.7554/eLife.00699 (2013).
Holmes, E. & Griffiths, T. D. Normal’ hearing thresholds and fundamental auditory grouping processes predict difficulties with speech-in-noise perception. Sci. Rep. 9 (1), 16771. https://doi.org/10.1038/s41598-019-53353-5 (2019).
Teki, S. et al. Neural correlates of auditory figure-ground segregation based on temporal coherence. Cereb. Cortex. 26 (9), 3669–3680. https://doi.org/10.1093/cercor/bhw173 (2016).
Holmes, E., Zeidman, P., Friston, K. J. & Griffiths, T. D. Difficulties with speech-in-noise perception related to fundamental grouping processes in auditory cortex. Cereb. Cortex. 31 (3), 1582–1596. https://doi.org/10.1093/cercor/bhaa311 (2021).
Guo, X. et al. Predicting speech-in-noise ability with static and dynamic auditory figure-ground analysis using structural equation modelling. Proc. Biol. Sci. 292 (2042), 20242503. https://doi.org/10.1098/rspb.2024.2503 (2025).
O’Sullivan, J. A., Shamma, S. A. & Lalor, E. C. Evidence for neural computations of temporal coherence in an auditory scene and their enhancement during active listening. J. Neurosci. 35 (18), 7256–7263. https://doi.org/10.1523/JNEUROSCI.4973-14.2015 (2015).
Guo, X. et al. Neural entrainment to pitch changes of auditory targets in noise. NeuroImage. 314, 121270. https://doi.org/10.1016/j.neuroimage.2025.121270 (2025).
Akeroyd, M. A. Are individual differences in speech reception related to individual differences in cognitive ability? A survey of twenty experimental studies with normal and hearing-impaired adults. Int. J. Audiol. 47 (Suppl 2), S53–71. https://doi.org/10.1080/14992020802301142 (2008).
Dryden, A., Allen, H. A., Henshaw, H. & Heinrich, A. The association between cognitive performance and speech-in-noise perception for adult listeners: A systematic literature review and meta-analysis. Trends Hear. 21, 2331216517744675. https://doi.org/10.1177/2331216517744675 (2017).
Rosemann, S. et al. The contribution of cognitive factors to individual differences in understanding noise-vocoded speech in young and older adults. Front. Hum. Neurosci. 11, 294. https://doi.org/10.3389/fnhum.2017.00294 (2017).
James, P. J., Krishnan, S. & Aydelott, J. Working memory predicts semantic comprehension in dichotic listening in older adults. Cognition 133 (1), 32–42. https://doi.org/10.1016/j.cognition.2014.05.014 (2014).
Lad, M., Billig, A. J., Kumar, S. & Griffiths, T. D. A specific relationship between musical sophistication and auditory working memory. Sci. Rep. 12 (1), 3517. https://doi.org/10.1038/s41598-022-07568-8 (2022).
Füllgrabe, C. & Rosen, S. On the (Un)importance of working memory in speech-in-noise processing for listeners with normal hearing thresholds. Front. Psychol. 7, 1268. https://doi.org/10.3389/fpsyg.2016.01268 (2016).
Wayne, R. V. & Johnsrude, I. S. A review of causal mechanisms underlying the link between age-related hearing loss and cognitive decline. Ageing Res. Rev. 23, 154–166. https://doi.org/10.1016/j.arr.2015.06.002 (2015).
Mattingly, J. K., Castellanos, I. & Moberly, A. C. Nonverbal reasoning as a contributor to sentence recognition outcomes in adults with cochlear implants. Otol. Neurotol. 39 (10), e956–e963. https://doi.org/10.1097/MAO.0000000000001998 (2018).
Vaden, K. I. Jr., Teubner-Rhodes, S., Ahlstrom, J. B., Dubno, J. R. & Eckert, M. A. Evidence for cortical adjustments to perceptual decision criteria during word recognition in noise. Neuroimage 253, 119042. https://doi.org/10.1016/j.neuroimage.2022.119042 (2022).
Windle, R., Dillon, H. & Heinrich, A. A review of auditory processing and cognitive change during normal ageing, and the implications for setting hearing aids for older adults. Front. Neurol. 14, 1122420. https://doi.org/10.3389/fneur.2023.1122420 (2023).
Anderson, S. & Karawani, H. Objective evidence of temporal processing deficits in older adults. Hear. Res. 397, 108053. https://doi.org/10.1016/j.heares.2020.108053 (2020).
Marsja, E., Stenbäck, V., Moradi, S., Danielsson, H. & Rönnberg, J. Is having hearing loss fundamentally different? Multigroup structural equation modeling of the effect of cognitive functioning on speech identification. Ear Hear. 43 (5), 1437–1446. https://doi.org/10.1097/aud.0000000000001196 (2022).
Kumar, S. et al. Oscillatory correlates of auditory working memory examined with human electrocorticography. Neuropsychologia 150, 107691. https://doi.org/10.1016/j.neuropsychologia.2020.107691 (2021).
Kumar, S. et al. A brain system for auditory working memory. J. Neurosci. 36 (16), 4492. https://doi.org/10.1523/JNEUROSCI.4341-14.2016 (2016).
Guo, X. et al. British version of the Iowa test of consonant perception. JASA Express Lett. 4 (12). https://doi.org/10.1121/10.0034738 (2024).
Lad, M., Taylor, J. P. & Griffiths, T. D. The contribution of short-term memory for sound features to speech-in-noise perception and cognition. Hear. Res. 451, 109081. https://doi.org/10.1016/j.heares.2024.109081 (2024).
Phillips, D. P., Taylor, T. L., Hall, S. E., Carr, M. M. & Mossop, J. E. Detection of silent intervals between noises activating different perceptual channels: Some properties of central auditory gap detection. J. Acoust. Soc. Am. 101 (6), 3694–3705. https://doi.org/10.1121/1.419376 (1997).
Chierchia, G. et al. The matrix reasoning item bank (MaRs-IB): Novel, open-access abstract reasoning items for adolescents and adults. R Soc. Open. Sci. 6 (10), 190232. https://doi.org/10.1098/rsos.190232 (2019).
Brosseau-Liard, P. E. & Savalei, V. Adjusting incremental fit indices for nonnormality. Multivar. Behav. Res. 49 (5), 460–470. https://doi.org/10.1080/00273171.2014.933697 (2014).
Brosseau-Liard, P. E., Savalei, V. & Li, L. An investigation of the sample performance of two nonnormality corrections for RMSEA. Multivar. Behav. Res. 47 (6), 904–930. https://doi.org/10.1080/00273171.2012.715252 (2012).
Hu, L. & Bentler, P. M. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct. Eq. Model. Multidis. J. 6 (1), 1–55. https://doi.org/10.1080/10705519909540118 (1999).
Kline, R. B. Principles and Practice of Structural Equation Modeling 3rd. edn (Guilford Press, 2011).
Bollen, K. A. & Stine, R. A. Bootstrapping Goodness-of-Fit measures in structural equation models. Sociol. Methods Res. 21 (2), 205–229. https://doi.org/10.1177/0049124192021002004 (1992).
Chatterjee, S. & Hadi, A. S. Analysis of collinear data. In Regression Analysis by Example (eds Chatterjee, S. & Hadi, A. S.) 221–258. https://doi.org/10.1002/0470055464.ch9 (2006).
Billings, C. J. & Madsen, B. M. A perspective on brain-behavior relationships and effects of age and hearing using speech-in-noise stimuli. Hear. Res. 369, 90–102. https://doi.org/10.1016/j.heares.2018.03.024 (2018).
Yeend, I., Beach, E. F. & Sharma, M. Working memory and extended high-frequency hearing in adults: Diagnostic predictors of speech-in-noise perception. Ear Hear. 40 (3), 458–467. https://doi.org/10.1097/aud.0000000000000640 (2019).
Çolak, H. et al. Subcortical auditory processing and speech perception in noise among individuals with and without extended high-frequency hearing loss. J. Speech Lang. Hear. Res. 67 (1), 221–231. https://doi.org/10.1044/2023_JSLHR-23-00023 (2024).
Motlagh Zadeh, L. et al. Extended high-frequency hearing enhances speech perception in noise. Proc. Natl. Acad. Sci. USA 116 (47), 23753–23759. https://doi.org/10.1073/pnas.1903315116 (2019).
Smith, S. B. et al. Investigating peripheral sources of speech-in-noise variability in listeners with normal audiograms. Hear. Res. 371, 66–74. https://doi.org/10.1016/j.heares.2018.11.008 (2019).
Besser, J., Festen, J. M., Goverts, S. T., Kramer, S. E. & Pichora-Fuller, M. K. Speech-in-speech listening on the LiSN-S test by older adults with good audiograms depends on cognition and hearing acuity at high frequencies. Ear Hear. 36 (1), 24–41. https://doi.org/10.1097/aud.0000000000000096 (2015).
Gomez-Alvarez, M., Johannesen, P. T., Coelho-de-Sousa, S. L., Klump, G. M. & Lopez-Poveda, E. A. The relative contribution of cochlear synaptopathy and reduced inhibition to age-related hearing impairment for people with normal audiograms. Trends Hear. 27. https://doi.org/10.1177/23312165231213191 (2023).
Presacco, A., Simon, J. Z. & Anderson, S. Evidence of degraded representation of speech in noise, in the aging midbrain and cortex. J. Neurophysiol. 116 (5), 2346–2355. https://doi.org/10.1152/jn.00372.2016 (2016).
Moberly, A. C., Mattingly, J. K. & Castellanos, I. How does nonverbal reasoning affect sentence recognition in adults with cochlear implants and normal-hearing peers? Audiol. Neurootol. 24 (3), 127–138. https://doi.org/10.1159/000500699 (2019).
Pronk, M. et al. Decline in older persons’ ability to recognize speech in noise: The influence of demographic, health-related, environmental, and cognitive factors. Ear Hear. 34 (6). https://doi.org/10.1097/AUD.0b013e3182994eee (2013).
Harada, C. N., Love, M. C. N., & Triebel, K. L. Normal cognitive aging. Clin. Geriatr. Med. 29 (4), 737–752. https://doi.org/10.1016/j.cger.2013.07.002 (2013).
Regev, J., Zaar, J., Relaño-Iborra, H. & Dau, T. Age-related reduction of amplitude modulation frequency selectivity. J. Acoust. Soc. Am. 153 (4), 2298. https://doi.org/10.1121/10.0017835 (2023).
Moore, B. C. J. Effects of age and hearing loss on the processing of auditory temporal fine structure. In Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing. https://doi.org/10.1007/978-3-319-25474-6_1 (Springer International Publishing, 2016).
Conway, A. R. A., Kane, M. J. & Engle, R. W. Working memory capacity and its relation to general intelligence. Trends Cogn. Sci. 7 (12), 547–552. https://doi.org/10.1016/j.tics.2003.10.005 (2003).
Moberly, A. C. et al. How does aging affect recognition of spectrally degraded speech? Laryngoscope 128 (Suppl. 5). https://doi.org/10.1002/lary.27457 (2018).
Fry, A. F. & Hale, S. Relationships among processing speed, working memory, and fluid intelligence in children. Biol. Psychol. 54 (1), 1–34. https://doi.org/10.1016/S0301-0511(00)00051-X (2000).
Troche, S. J. & Rammsayer, T. H. Temporal and non-temporal sensory discrimination and their predictions of capacity- and speed-related aspects of psychometric intelligence. Pers. Indiv. Differ. 47 (1), 52–57. https://doi.org/10.1016/j.paid.2009.02.001 (2009).
Author information
Contributions
E.B., X.G. and T.G. conceptualized and designed the study. T.G. received the funding. E.B. and M.L. programmed the computer tasks. E.B., H.Ç., and X.G. collected the data. E.B., H.Ç., X.G. and S.R. analysed the data. E.B. prepared all figures. E.B. wrote the original draft. T.G., H.Ç., and X.G. revised the manuscript. All authors reviewed and approved the submitted manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.