Introduction

Listening to a lecture, whether in a traditional classroom setting or via online platforms, requires individuals to focus their attention on the teacher for a long period of time1,2,3. However, many factors—external and internal—can make sustained attention to a lecture difficult4,5. For example, the presence of background noise may distract listeners from the lecture or make hearing more difficult6,7,8,9, and features related to the lecture content or delivery (e.g., the speaker’s charisma or presentation style) or to the listeners themselves (e.g., interest in the topic) may prompt mind-wandering or off-task behavior10,11. Despite ample behavioral evidence (and decades-long introspective insight) that attention can vary and fluctuate throughout a long lecture, our current understanding of the neurophysiological manifestations of these fluctuations, and the factors influencing them, is extremely limited.

The disruptive impact of external noise on speech-processing has been studied extensively. It is well established that the presence of background noise can reduce speech intelligibility12,13,14,15 and requires investing more effort due to the increased perceptual and cognitive load14,16,17. Consequently, this can lead to elevated stress levels18,19 and reduced neural encoding of the speech17,20,21. That said, the disruptive effects of noise are not uniform, but vary as a function of sound-level, temporal structure, contextuality and noise-type22,23,24,25, and moderate noise can sometimes even improve performance, in line with the notion of an inverted U-shape relationship between arousal and performance26,27,28,29,30 and the moderate brain arousal model31,32,33. Therefore, the assumption that adding noise is necessarily detrimental to speech processing, and consequently impairs attention to speech, is not straightforward.

Importantly, insights gained from traditional speech-in-noise studies might not generalize fully to more ecological contexts, such as listening to a full lecture34,35,36. One reason for this is the prevalent use of short, highly edited sentences as stimuli that lack the context and semantic continuity of a lecture37. Real-life speech, and particularly a lecture, is continuous and contextual, with a structured narrative and built-in redundancies; these factors are designed to engage listeners and hold their attention over time, and can also potentially mitigate the masking effects of noise. However, as anyone who has ever attended a frontal lecture can attest, it is precisely the continuous nature of lectures that renders their quality non-uniform over time. In stark contrast to the well-controlled stimuli of experimental designs (including highly edited and rehearsed audiobooks30,38 or TED-style talks), the speech in live lectures is generated by the instructor ‘on the fly’ and therefore can contain frequent disfluencies, repetitions, ill-formed sentences and rambling39,40. In addition, the content itself can vary in its clarity, novelty and interest to the listener, and the speaker can vary in eloquence, style and charisma. These variations in quality over time impact listeners’ level of engagement and sustained attention to speech, with more engaging and interesting content leading to higher levels of attentional focus and reduced processing load1, and reduced interest associated with mind-wandering and off-task behaviors41,42,43. The current study broadens the scope of investigation of natural speech processing to assess the independent effects of external background noise and of non-uniform levels of interest.
We focus specifically on “situational interest”, a term used in educational and psychological research to describe fluctuations in interest that are common to most people and can be attributed to variations in features of the lecture itself such as clarity, novelty, relevance, speaker charisma, use of humor etc10,44,45,46,47. This is distinct from “individual interest” which reflects personal preference for specific topics or other contextual influences that contribute to individual differences, but is beyond the scope of the current study48,49,50,51.

Our aim was to examine how external noise and fluctuations in situational interest-levels influence speech processing under naturalistic conditions that require sustained attention, and to link these effects to changes in neurophysiological metrics associated with arousal, attention and speech processing. To this end, we measured neural activity (electroencephalogram; EEG) and physiological responses (skin-conductance) while individuals watched an unedited video recording of a real-life frontal-lecture. We chose a lecture that was aimed at a general audience, on a relatively unknown topic (Ageism). The lecture was rated as “relatively interesting” in a pre-screening stage and contained substantial minute-by-minute variability in listeners’ reported interest ratings (details below). Group-level interest ratings were used to split the lecture into portions with high vs. low levels of situational interest. In addition, we manipulated the presence and type of background noise by playing two types of construction sounds (intermittent and continuous) in two-thirds of the trials9. We tested how these two factors, external noise and situational interest, affected specific physiological and neural metrics associated with arousal, attention and speech processing. These included: (a) Neural speech-tracking, which reflects neural encoding of the speech itself and is known to be reduced under conditions of decreased attention and/or poor speech intelligibility52,53; (b) Neural alpha-band oscillations, which are often increased under conditions of low-attention and mind-wandering54; (c) Neural beta-oscillations, which are associated with cognitive processing and tend to increase under heightened task demands55,56,57; and (d) Skin-conductance, which reflects level of arousal and is often affected by stress and cognitive or perceptual effort55,58.
Based on previous studies and the hypothesized functional role of these neurophysiological metrics, we expected to find that during quiet portions of the lecture (without background noise) and portions reported as highly interesting, neural tracking of the speech would be improved, we would find lower alpha power and higher beta power, and modulation of skin conductance, a metric linked to levels of arousal, relative to portions that contained additional noise and/or that were rated as less interesting.

Results

Behavioral data

We analyzed the ratings of situational interest for the different lecture-segments. First, we compared the reported levels of interest across segments between these data and an online screening study. Overall interest ratings were slightly higher in the current study (M = 4.73, SD = 0.34) than in the online study (M = 4.28, SD = 0.41) [t(61) = 15.32, p < 0.001, Cohen’s d = 1.946; Fig. 1A], which may indicate overall higher engagement when performing the task in a lab vs. alone and online. Despite this ‘baseline shift’, we found that the variations in interest-level ratings across the different lecture-segments were highly correlated between the two studies [Pearson’s r = 0.827, p < 0.001, Fisher’s z = 1.178; Fig. 1B]. This supports the utility of interest-ratings as objective metrics of variation in situational interest across the entire lecture, despite potential differences in individual interest and subjective experience (Song et al. 2021). Accuracy on answering comprehension questions about the lecture content was also generally good (M = 87% correct, SD = 13.22), and was comparable to the online study [M = 86.2% correct, SD = 15.23; comparison between lab and online studies: t(29) = 0.468, p = 0.643, Cohen’s d = 0.08].
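The effect sizes reported above can be checked with simple arithmetic. The sketch below assumes (this is not stated in the text) that the t-test is paired across n = df + 1 observations and that Cohen's d is the paired-samples variant, d_z = t/√n; Fisher's z is the inverse hyperbolic tangent of r. The function names are illustrative only.

```python
import math

def cohens_dz_from_t(t: float, n: int) -> float:
    """Paired-samples effect size d_z recovered from a t statistic (assumed convention)."""
    return t / math.sqrt(n)

def fisher_z(r: float) -> float:
    """Fisher z-transform of a Pearson correlation coefficient."""
    return math.atanh(r)

# Values reported in the text; n = df + 1 assumes a paired test.
d_interest = cohens_dz_from_t(15.32, 61 + 1)
z_interest = fisher_z(0.827)
print(f"d_z = {d_interest:.3f}, Fisher z = {z_interest:.3f}")
```

Under these assumptions both quantities land close to the reported values; small differences in the third decimal are expected because the inputs are themselves rounded.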

Fig. 1: Behavioral results.

A Average interest values for each lecture segment in the online screening study (n = 37; gray line) and the EEG study (n = 32; black line). Error bars represent the standard error of the mean (SEM). B Pearson correlation between average interest ratings in the EEG and online studies across lecture segments (r = 0.83, p < 0.001). C, D Box plots showing the distribution of C accuracy scores (percentage of correct answers) and D interest ratings (1–7) across noise conditions. E Box plot showing the distribution of accuracy scores for lecture segments rated as high vs. low in interest. In all box plots, red lines indicate the median, boxes represent the interquartile range (IQR), whiskers extend to 1.5×IQR, and individual dots represent participants’ data points.

Behavioral analysis of accuracy on comprehension questions and interest level ratings showed no significant effect of noise condition [repeated measures ANOVA: Accuracy - F(2,62) = 2.098, p = 0.13; Interest levels - F(2,62) = 1.149, p = 0.324; Fig. 1C, D], suggesting that the addition of noise did not impact performance or interest-ratings.

However, when comparing accuracy on comprehension questions for segments rated as high vs. low interest (median split), we found a marginally significant effect suggesting improved comprehension for lecture segments rated as more interesting [M = 89.76%, SD = 11.07 vs. M = 83.94%, SD = 15.61, respectively; t(28) = 1.856, p = 0.074, Cohen’s d = 0.345; Fig. 1E].
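The median split and paired comparison described above can be sketched as follows. This is a minimal illustration on hypothetical data, not the authors' analysis code; the tie-handling rule (segments rated exactly at the median go to the low group) and all names are assumptions.

```python
import math
from statistics import mean, median, stdev

def median_split(segment_ratings):
    """Return (high, low) lists of segment indices, split at the median rating.

    Assumption: segments rated exactly at the median are assigned to 'low'.
    """
    cutoff = median(segment_ratings)
    high = [i for i, r in enumerate(segment_ratings) if r > cutoff]
    low = [i for i, r in enumerate(segment_ratings) if r <= cutoff]
    return high, low

def paired_t(x, y):
    """Paired t statistic, degrees of freedom and d_z for matched samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1, mean(diffs) / sd

# Hypothetical group-level interest ratings for six lecture segments.
ratings = [5.2, 4.1, 4.9, 3.8, 5.5, 4.4]
high_idx, low_idx = median_split(ratings)

# Hypothetical per-participant accuracy (%) on high- vs. low-interest segments.
acc_high = [92.0, 88.0, 95.0, 85.0, 90.0]
acc_low = [85.0, 86.0, 88.0, 80.0, 91.0]
t_stat, df, d_z = paired_t(acc_high, acc_low)
```

Note that for a paired design t = d_z·√n, which is consistent with the t and d values reported above if d is taken to be d_z.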

Galvanic Skin Response (GSR) data

We extracted two main metrics from the GSR data: (1) the mean phasic activity, representing short-term fluctuations in skin conductance, and (2) the mean tonic activity, reflecting the slower, sustained level of arousal.

Repeated-measures ANOVA comparing these metrics between the three noise conditions revealed significant differences between them in the mean tonic activity [F(2,62) = 20.923, p < 0.001; Table 1, Fig. 2A]. Follow-up pairwise analyses of differences in tonic GSR levels revealed higher responses in the noise conditions (average between intermittent and continuous conditions) relative to the quiet condition [t(31) = 3.15; p < 0.005, Cohen’s d = 0.55], and higher response in the intermittent vs. continuous noise conditions [t(31) = 5.552; p < 0.001, Cohen’s d = 0.98] with a large effect size in both. In contrast, the mean phasic activity did not differ significantly between conditions [F(2,62) = 1.182, p = 0.314; Table 1, Fig. 2B].

Fig. 2: GSR results (n = 32) across noise conditions.

Box plots show the distribution of A mean tonic activity and B mean phasic activity, both in microsiemens. The phasic average is shown as a representative phasic measure, as all phasic metrics showed similar patterns. In both panels, red lines indicate the median, boxes represent the IQR, whiskers extend to 1.5×IQR, and individual dots represent participants’ data points.

We next tested whether the GSR metrics were modulated by the level of interest across lecture segments. Both main metrics showed significantly higher activity during low- compared to high-interest segments [mean tonic activity: t(31) = 4.326, p < 0.001, Cohen’s d = 0.76; mean phasic activity: t(31) = 3.566, p = 0.001, Cohen’s d = 0.63; Table 1, Fig. 3].

Fig. 3: GSR results (n = 32) by level of interest.

Box plots show the distribution of A mean tonic activity and B mean phasic activity (x10) in microsiemens, for lecture segments rated as high vs. low in interest. The mean phasic activity is shown as a representative phasic measure, as all phasic metrics showed similar effects. In both panels, red lines indicate the median, boxes represent the IQR, whiskers extend to 1.5×IQR, and individual dots represent participants’ data points.

Neural data: Speech tracking analysis

We estimated Temporal response functions (TRFs) to the speech stimulus separately for the three conditions (quiet, continuous noise, intermittent noise). All conditions showed clusters of electrodes with significant predictive power relative to a null distribution, indicating reliable speech tracking. Notably, the number of significant electrodes varied across conditions, with the fewest in the intermittent-noise condition (six electrodes; cluster corrected) and the most in the continuous-noise condition (27 electrodes; cluster corrected).
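A TRF can be thought of as a regularized linear filter that maps a stimulus feature onto the EEG. The following is a minimal single-channel sketch using ridge regression on a time-lagged design matrix. It is not the authors' pipeline: the stimulus feature, lag range, ridge parameter and the use of a held-out correlation as "predictive power" are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def lagged_design(stimulus, lags):
    """Stack time-shifted copies of a 1-D stimulus; column j is delayed by lags[j] samples."""
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[:n - lag]
        else:
            X[:n + lag, j] = stimulus[-lag:]
    return X

def fit_trf(stimulus, eeg, lags, ridge=1.0):
    """Ridge-regression estimate of the filter weights (one weight per lag)."""
    X = lagged_design(stimulus, lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(len(lags)), X.T @ eeg)

def predictive_power(stimulus, eeg, trf, lags):
    """Pearson correlation between the TRF-predicted and the recorded signal."""
    predicted = lagged_design(stimulus, lags) @ trf
    return float(np.corrcoef(predicted, eeg)[0, 1])

# Synthetic demonstration (all values hypothetical).
rng = np.random.default_rng(0)
lags = list(range(5))
true_kernel = np.array([0.0, 0.5, 1.0, 0.3, 0.0])
stimulus = rng.standard_normal(4000)
eeg = lagged_design(stimulus, lags) @ true_kernel + 0.1 * rng.standard_normal(4000)

half = 2000
trf = fit_trf(stimulus[:half], eeg[:half], lags)                   # train on first half
r_test = predictive_power(stimulus[half:], eeg[half:], trf, lags)  # evaluate on second half
```

In practice the observed predictive power would be compared against a null distribution (for example, from mismatched stimulus and EEG segments), as described in the text; that step is omitted here.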

To assess the effect of noise on speech tracking, we compared the TRFs in the quiet condition to the average TRFs across both noise conditions [Fig. 4A], and also compared the TRFs in the two noise conditions to each other [continuous vs. intermittent; Fig. 4B]. All TRFs showed two prominent positive peaks, at approximately 200 and 350 ms, which were maximal in mid-central electrodes. These peaks were modulated by noise in the following way [Fig. 4A–C]: The early peak (~200 ms) was significantly larger in the quiet condition vs. noise [p < 0.002, cluster-corrected], and larger for intermittent vs. continuous noise [p < 0.03, cluster-corrected]. The later peak (~350 ms) was larger in the noise vs. quiet conditions [p < 0.004, cluster-corrected] and was larger for continuous vs. intermittent noise [p < 0.002, cluster-corrected]. A complementary decoding analysis, which estimates the overall accuracy of reconstructing a speech stimulus from the recorded EEG, also showed significant modulation by noise condition (Fig. 4D). A repeated-measures ANOVA revealed a significant main effect of noise condition on decoding accuracy [F(2,62) = 6.846, p = 0.002; Table 1], and post hoc pairwise comparisons (Holm-corrected for multiple comparisons) indicated significantly better decoding performance in the continuous noise condition compared to both the quiet condition [t(31) = 3.020, p = 0.007, Cohen’s d = 0.37] and the intermittent noise condition [t(31) = 3.362, p = 0.004, Cohen’s d = 0.41].
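The Holm procedure mentioned above controls the family-wise error rate across pairwise comparisons using a step-down adjustment. A minimal sketch follows; the uncorrected p-values are hypothetical, since the text reports only the corrected ones, and the function name is illustrative.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: adjusted values never decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical uncorrected p-values for three pairwise comparisons.
raw_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(raw_p)
```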

Fig. 4: Neural speech tracking response (n = 32) across noise conditions.

A, B TRFs for A quiet (red) vs. noise (purple) conditions, and B continuous (green) vs. intermittent (blue) noise conditions, averaged across electrodes showing significant speech tracking. Shaded gray areas indicate time windows with significant differences between conditions. C Topographical maps showing clusters of electrodes with significant differences in TRF amplitudes (p < 0.05, corrected; white circles) during the relevant time windows for the noise vs. quiet and continuous vs. intermittent comparisons. D Box plot showing the distribution of speech reconstruction accuracy (r values) of the decoding model across conditions. Red lines indicate the median, boxes represent the IQR, whiskers extend to 1.5×IQR, and individual dots represent participants’ data points.

To assess whether neural tracking of the speech was modulated by situational level of interest, we estimated TRFs separately for segments rated as high- and low-interest. A similar number of electrodes showed significant speech-tracking responses in both conditions, relative to a null distribution (33 and 34 electrodes, respectively). The TRF amplitudes were significantly modulated by level of interest as follows: Two prominent TRF peaks, an early negative peak (~160 ms) and a late positive peak (~350 ms), were larger in the high vs. low interest condition, while the intermediate positive peak (~200 ms) was larger in the low interest condition [all p’s < 0.02, cluster-corrected; Fig. 5A, B]. The complementary decoding analysis also showed that the speech stimulus could be reconstructed more accurately in the high vs. low interest condition [t(31) = 3.17, p = 0.003, Cohen’s d = 0.56; Table 1, Fig. 5C].

Fig. 5: Neural speech tracking response (n = 32) by level of interest.

A TRFs for high (black) vs. low (gray) interest segments, averaged across electrodes showing significant speech tracking. The shaded gray areas indicate time windows with a significant difference between conditions (p < 0.05, corrected). B Topographical map showing electrodes with significant differences in TRF amplitudes (p < 0.05, corrected; white circles) during the significant time windows. C Box plot showing the distribution of speech reconstruction accuracy (r values) of the decoding model for high vs. low interest conditions. Red lines indicate the median, boxes represent the IQR, whiskers extend to 1.5×IQR, and individual dots represent participants’ data points.

Neural Data: Power spectral density (PSD)

Spectral analysis of the EEG focused on frequency bands with observed peaks in the PSD, which indicate periodic oscillations: alpha power and beta power (Fig. 6A). The averaged alpha-power peak between 7–13 Hz and beta-power peak between 16–22 Hz were calculated across participants within predefined clusters of electrodes, marked with white circles in Fig. 6B. For each participant, we determined the alpha-power and beta-power peaks as the frequency with the highest average amplitude within the respective frequency range, across segments in the different noise conditions (quiet, continuous, intermittent) and at the different levels of interest (high or low).
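The peak-picking step described above can be sketched as a search for the maximum of the spectrum within each band. The example below uses the 7–13 Hz and 16–22 Hz ranges given in the text, but the spectrum itself is synthetic (a 1/f background plus two Gaussian bumps), and the function name and frequency resolution are assumptions.

```python
import math

def band_peak(freqs, power, low, high):
    """Return (peak_frequency, peak_power) within the inclusive band [low, high] Hz."""
    in_band = [(p, f) for f, p in zip(freqs, power) if low <= f <= high]
    if not in_band:
        raise ValueError("no frequency bins fall inside the requested band")
    peak_power, peak_freq = max(in_band)
    return peak_freq, peak_power

# Synthetic spectrum from 2 to 30 Hz in 0.5 Hz steps (hypothetical values):
# a 1/f background with bumps centred at 10 Hz (alpha) and 19 Hz (beta).
freqs = [2 + 0.5 * k for k in range(57)]
power = [
    1.0 / f
    + 1.0 * math.exp(-((f - 10.0) ** 2) / 2.0)
    + 0.4 * math.exp(-((f - 19.0) ** 2) / 2.0)
    for f in freqs
]

alpha_freq, alpha_power = band_peak(freqs, power, 7.0, 13.0)
beta_freq, beta_power = band_peak(freqs, power, 16.0, 22.0)
```

With real data, `power` would be the participant's PSD averaged over the relevant electrode cluster and segments before the peak is taken.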

Fig. 6: Spectral analysis (n = 32).

A Full spectrum (2–30 Hz) averaged across all participants. Shaded areas around waveforms represent the SEM. B Topographical distribution of the averaged alpha-power peak and the averaged beta-power peak, with the clusters of central-parietal and frontal-central electrodes marked with white circles, respectively. C Box plots showing the distribution of maximum mean alpha power (top) and beta power (bottom) across high and low interest conditions. Red lines indicate the median, boxes represent the IQR, whiskers extend to 1.5×IQR, and individual dots represent participants’ data points.

Repeated-measures ANOVA revealed no main effect of noise condition in either band [Alpha-power: F(2,31) = 1.166, p = 0.318; Beta-power: F(2,31) = 1, p = 0.373; Table 1]. However, power in both frequency bands was significantly modulated by situational interest, with higher alpha-power and lower beta-power in the low vs. high interest conditions [Alpha-power: t(31) = -2.29, p = 0.029, Cohen’s d = 0.40; Beta-power: t(31) = 2.39, p = 0.023, Cohen’s d = 0.42; Table 1, Fig. 6C].

Table 1. Means (M) and standard deviations (SD) of physiological and neural measures across interest and noise conditions

Discussion

Here we studied how behavioral and neurophysiological metrics associated with arousal, attention and speech processing vary while watching a frontal lecture, as a function of its varying levels of situational interest and the presence of external noise. Several neurophysiological responses were modulated by interest level, with portions of the lecture rated as less interesting associated with poorer neural speech tracking, higher skin-conductance levels, higher alpha-power, lower beta-power and a trend towards poorer performance on answering comprehension questions. Interestingly, level of interest had a more substantial effect on neurophysiological responses than background noise, which did not show significant effects on alpha- or beta-oscillations or on behavioral performance, and had limited (and somewhat inconsistent) effects on skin-conductance and on neural speech-tracking. This was somewhat surprising, given the vast literature on the disruptive effect of background noise on speech processing, albeit under less ecological circumstances. These results highlight the importance of considering content-related factors when studying natural speech processing, which in this case had a more prominent effect on neural encoding of speech and listeners’ neurophysiological state than external noise.

That people have an easier time paying attention when they are interested is one of the cornerstones of modern pedagogy. Educators have long observed that students who find a topic engaging exhibit greater focus, motivation, and improved learning outcomes44. This principle has been repeatedly demonstrated in behavioral studies, showing that interest enhances memory, attention, and cognitive processing10,43,48. For instance, studies on reading comprehension show that individuals remember and understand texts better when they find them interesting45. Similarly, when students perceive a lecture or task as engaging, they persist longer and maintain focus, even when it is cognitively demanding59. Here we relied on group-level ratings of interest to capture variations in situational interest across segments of the presented lecture, which likely stem from features of the lecture itself, such as presentation style, clarity, and content delivery. Ratings were highly consistent both within and across two independent cohorts (online screening and lab study), indicating that they reliably capture common fluctuations in interest levels, above and beyond variance between individuals due to personal preferences or context44,51. This approach was further reinforced by content-analysis of the lecture, as lecture-segments that were rated as “highly interesting” often included concrete examples, clear visuals, or relatable content that invited reflection or emotional engagement. In contrast, lower-rated segments tended to be repetitive or abstract, lacking novelty or practical application. Our choice to study situational interest is motivated by its affordance for drawing generalizable group-level conclusions about the impact of interest on speech processing, and its potential implications for teaching styles and pedagogical design.
However, we acknowledge that this approach limits inferences on variability in interest levels across individuals, efforts that would require larger datasets and optimized designs for individual-level analyses.

Despite the extensive behavioral literature on interest, and its crucial role in human communication and learning, to date, little is known about its neurophysiological underpinnings. The scarcity of data is partially due to the conceptual difficulties in defining “interest”44,60, and partially due to the operational challenge of quantifying levels of interest and its predominant reliance on self-reported measures61,62,63. Nonetheless, some attempts have been made to advance our understanding of neural correlates of interest, particularly in learning contexts. In some real-life classroom studies, variations in the spectral profile of students’ EEG signal (e.g., in the delta, alpha and gamma band) were linked with different levels of interest during a lesson64,65, and curiosity-driven learning has been associated with modulation of dopaminergic reward circuits66,67. While these studies provide a proof-of-concept for the impact of interest on so-called “brain states”, they fall short of providing a comprehensive understanding of how interest impacts neural processing. In more controlled studies, fluctuations in situational interest are assessed through analysis of changes in inter-subject correlation of neural activity over time during the presentation of natural stimuli such as movies or audiobooks. In these studies, segments with higher inter-subject correlation values are associated with better ‘joint’ attention and engagement with stimulus material, which is also predictive of improved memory68,69,70,71,72,73. In an fMRI study, Song et al. 2021 linked fluctuations in engagement to activity in the default mode network and dorsal attention network, which also predicted memory encoding, reinforcing the connection between situational interest and attention.

The current data provides a more detailed mechanistic account for how situational interest may affect neural processing during learning. Interest modulated a multi-faceted neurophysiological “profile”—comprised of neural speech tracking, the periodic EEG spectrum and skin conductance—in a manner consistent with modulations observed more generally for attention to speech. Specifically, for segments rated as less interesting, there was reduced neural speech tracking of the lecturer’s speech, a pattern commonly found for impaired speech encoding74,75,76,77 and for speech that is not attended52,78,79,80,81. This result is seen both in the decoding analysis, which reflects overall poorer reconstruction accuracy of the speech in the low-interest vs. high-interest condition, as well as in the magnitude of the TRF response, particularly in the early time windows associated with more sensory levels of processing38,53,82,83. This effect directly links the construct of “interest” to the way the speech-content is processed in the brain, which may underlie the well-documented behavioral consequences of reduced interest on learning outcomes10,43,44,48.

Alongside the direct impact on speech processing, interest level was also associated with changes in global brain dynamics, as captured by the periodic EEG spectrum. Segments with low levels of interest were associated with higher alpha-power and with lower beta-power relative to high-interest segments. This pattern is consistent with the hypothesized role of these ongoing oscillations in attention and cognitive effort. Alpha oscillations are the most dominant feature of the EEG signal, and enhanced alpha is often associated with reduced attention to external stimuli and increased mind-wandering84,85,86,87,88,89,90. We must note, however, that alpha-oscillations are not monolithic, and in some cases alpha oscillations have been shown to support active attention by suppressing distracting input and maintaining selective focus, particularly in noisy or effortful listening conditions54,91. Nonetheless, the current finding of increased alpha-power in low-interest segments is in line with many studies, particularly in the neuroeducation domain, pointing to it as a signature for reduced engagement with presented content and learning activities92,93,94,95,96. The enhanced alpha-power in low-interest segments was accompanied by reduced beta-power, which is often noted as involved in processes such as cognitive control, predictive processing, and reward mechanisms56,97,98,99,100, following the notion that interest and motivation can shape auditory attention in a top-down manner101,102. In the context of speech comprehension, beta oscillations have been associated with the integration of auditory input into meaningful linguistic content, supporting active engagement with speech and improving comprehension56,100,103,104,105. These findings align well with the current results, where ostensibly segments of the lecture that were more interesting evoked these beta-related processes more extensively relative to low-interest segments. 
They are also in line with observations that beta activity declines when attention wanes or when listeners adopt a more passive, bottom-up processing mode57,106 as well as when listeners do not expect the content to be particularly meaningful or rewarding.

The effect of interest on neural activity was accompanied by physiological effects, as indicated by changes to phasic and tonic skin-conductance levels. Skin conductance is a well-established measure of autonomic arousal controlled by the sympathetic nervous system107,108. There is a strong established relationship between arousal and level of engagement, with higher levels of arousal generally associated with higher levels of engagement and better performance109,110, although this relationship is not linear and sometimes follows an inverted U-shape whereby hyper-arousal actually accompanies poorer performance27,111. Initially, we had expected to find higher levels of arousal in the high-interest conditions, which could reflect higher engagement with the lecture. However, the current results show the opposite effect, with higher skin-conductance found during low-interest segments. One possible interpretation for this pattern is that here, skin-conductance levels reflect the investment of listening effort rather than arousal per se55,112. In other words, when listeners find the content boring, but are still required to pay attention, they may exert additional effort to stay focused, leading to heightened physiological arousal despite reduced intrinsic motivation113,114,115. This compensatory perspective could, in part, explain the relatively small behavioral effect found here, with a marginally significant trend towards reduced performance in the low vs. high interest segments.

Taken together, the current findings demonstrate that momentary fluctuations in situational interest are accompanied by measurable changes in specific neural processes that are crucial for speech processing. The overall similarity between the profile of neurophysiological effects observed here and in more traditional studies of attention suggests that the constructs of ‘interest’ and ‘attention’ are highly intertwined and share common underlying mechanisms. This highlights the central importance of including listener-based factors such as interest in models of speech processing, particularly in realistic ecological contexts, acknowledging that the human brain exercises active selection of the information it chooses to process, based on its relevance and reward to the listener. Nonetheless, several unresolved issues from this study require additional follow-up research. These include the modest behavioral effect of interest level relative to the observed neurophysiological effect, as well as clarifying whether arousal reflects increased interest or serves to mitigate and compensate for the effects of boredom.

The second factor tested here was the effect of background construction noise on listeners during a realistic lecture, as part of our attempt to advance the ecological validity of speech-in-noise studies12,14,116,117,118,119. Our choice to compare continuous vs. intermittent sounds (continuous drilling vs. intermittent air-hammers) was motivated by competing hypotheses regarding the role of temporal structure in speech-in-noise processing, as discussed at length by Levy et al.9. One possibility is that continuous noise is more disruptive to listening, due to its constant level of acoustic masking, whereas for intermittent noise it is possible to ‘listen in the gaps’ and recover the masked speech information9,120,121,122,123,124,125. Alternatively, the ‘habituation hypothesis’ posits that the monotonic nature of continuous noise may render it more prone to habituation, making it less disruptive126, whereas the frequent onsets and offsets of intermittent noise may trigger repeated phase-resets of cortical and arousal responses127,128, ultimately reducing cortical adaptation and leading to greater disruption of speech processing. In a previous study, where we studied the effect of these background noises during learning in a virtual reality classroom, we found that intermittent noise was more disruptive of performance, reduced the neural speech tracking of the teacher, and was accompanied by an increase in skin-conductance reflecting heightened arousal, relative to continuous noise9. Those findings were taken as supporting the ‘habituation hypothesis’, suggesting that despite the more substantial acoustic masking of continuous noise, it is less disruptive to speech processing, potentially due to habituation over time129,130. The current results are broadly in line with those findings, as here too we found that neural speech tracking was reduced and skin-conductance was elevated for intermittent vs. continuous construction noise, consistent with heightened arousal or increased listening effort in this condition112,113.

Interestingly, in both the current study and our previous work, the presence of background noise did not affect ongoing oscillatory activity in the alpha or beta ranges. This is in contrast to interest level, which here showed a clear modulatory effect on these neural dynamics, as discussed above. Given the hypothesized functional roles of alpha- and beta-oscillations in attention and cognitive processing, this replicated null-result is noteworthy, as it suggests that the impact of a noisy background on listeners’ speech processing may be qualitatively different from that of reduced attention or engagement with the lecture. As discussed by Levy et al.9, mixed results have been reported regarding the modulation of oscillatory activity in speech-in-noise studies, particularly in the alpha-band, with some reporting increased activity in noisy conditions91,131, some finding decreased activity132,133, and some reporting effects that vary as a function of noise-type or performance97,134. Overall, we can conclude that although alpha and beta oscillations can play important roles in perception and attention, they do not constitute monolithic neural markers of specific cognitive processes but rather capture global changes in neural dynamics that can arise from the interaction between stimulus features (e.g. noise), internal goals and cognitive demands87,98,135,136,137.

More broadly, when comparing the effects of noise and interest-level on the neurophysiological profile of responses, the current results show internal consistency in that both noise and low interest level were associated with reduced neural tracking of the lecture and with increased arousal. At the same time, the magnitude of effects was larger for interest levels and, as discussed here, also included modulation of alpha- and beta-oscillations. This pattern suggests that internal factors, such as top-down attention or interest in the content, may ultimately play a more central role than external acoustic interference in the way that listeners process and comprehend speech. We put forth this hypothesis for testing in future studies.

To summarize, the current study contributes to endeavors to enhance the ecological validity of speech processing research and identify the neurophysiological underpinnings of fluctuations in attention and engagement with a continuous narrative over time. By demonstrating that fluctuations in situational interest-levels over time are intrinsic to naturalistic speech, and comparing the effects of interest-level and external noise on neurophysiological measures associated with arousal, attention and speech processing, this work advances current thinking about how environmental and internal factors influence neural processing in real life. Results converge with previous research to emphasize the multi-faceted neurophysiological ‘profile’ of responses that are modulated by listener engagement and by the presence of background noise, indicating that no single metric is sufficient for capturing the mechanistic underpinnings of real-life speech processing. This work focuses on a specific commonplace example that requires sustained attention to speech: watching a video of an educational lecture. While we acknowledge that this context lacks many aspects of live learning environments, such as interactive instruction and social interactions138,139, it still captures a form of learning that has become a staple of modern education, particularly since the COVID-19 pandemic. This work, together with previous studies using video or virtual reality-based learning9,96,140,141,142,143,144, lays the foundation for future studies investigating neurophysiological features of attention and speech processing in live learning contexts, an exciting emerging research field95,140,145,146. Another limitation of the current results is that the neurophysiological metrics used here (e.g., speech tracking, spectral power and skin-conductance) lack sufficient sensitivity for quantifying fluctuations on a moment-by-moment basis, and instead require averaging data across large portions of the experiment. This is due to their relatively poor signal-to-noise ratio and non-specific nature147. Our hope is that future advances in signal processing techniques will improve and validate the single-trial reliability of these metrics, which would provide much needed insight into the temporal dynamics of attention to speech and the nature of its fluctuations over time.

Methods

Participants

Data were collected from 32 adult volunteers (20 female, 12 male), ranging in age between 19 and 28 (M = 23.54, SD = ± 1.97). Sample size was determined based on results from a similar previous study in our lab9, where we found that a sample of at least N = 28 is required to detect within-group effect sizes of Cohen’s d = 0.55 with a two-sided α = 0.05, and power of 0.8. All participants were fluent Hebrew speakers with self-reported normal hearing and no history of psychiatric or neurological disorders. The study was approved by the Institutional Review Board of Bar Ilan University (approval # ISU202106003), and participants gave their written informed consent prior to the experiment. Participants were either paid or received course credit for participating in the experiment.
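As a rough check on the sample-size figure above, the N required for a two-sided paired (within-participant) t-test can be approximated from normal quantiles with a small-sample correction. The sketch below is illustrative only: the function name `approx_paired_n` is hypothetical, and the approximation used (normal approximation plus a z²/2 correction) is an assumption, not necessarily the procedure used for the original power calculation.

```python
import math
from statistics import NormalDist


def approx_paired_n(d, alpha=0.05, power=0.80):
    """Approximate N for a two-sided paired t-test.

    Uses the normal approximation ((z_a + z_b) / d)^2 plus a
    z_a^2 / 2 small-sample correction. This is an assumed shortcut,
    not an exact noncentral-t calculation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = ((z_alpha + z_beta) / d) ** 2 + z_alpha ** 2 / 2
    return math.ceil(n)


# Effect size, alpha and power are the values stated in the text.
print(approx_paired_n(0.55, alpha=0.05, power=0.80))
```

Under these assumptions the sketch returns 28, in agreement with the minimum sample size stated above.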

Stimuli

The stimuli consisted of a 35-minute video recording of a public lecture, given by Prof. Liat Ayalon on the topic of Ageism. As shown in Fig. 7, the video recording included the lecturer herself as well as the slides accompanying the talk. Prof. Ayalon gave her approval to use these materials for research purposes.

Fig. 7: Illustration of the experimental procedure.

Lecture segments were randomly assigned to the quiet, intermittent or continuous conditions. Interest-level ratings were given after each segment (scale 1–7) and three comprehension questions about the content of each segment were asked after every three segments.

For the noise stimuli, we used recordings from a real-life construction site (recorded using a mobile phone; iPhone 12). The continuous noise was a 1-minute-long recording of drilling, and the intermittent noise was a 1-minute-long recording of air hammers (see Fig. 7). The continuous and intermittent noise stimuli were equated in loudness to each other.

Online stimulus screening study

The lecture materials were pre-screened in an online study conducted using the webservice Qualtrics (Provo, UT; https://www.qualtrics.com). Thirty-seven adult volunteers (29 female, eight male), ranging in age between 20 and 29 (M = 22.54, SD = ± 1.76), participated in the screening study. Participants watched the lecture on their phone or computer in a quiet environment. As in the main experiment, the lecture was presented continuously, but was split into 63 segments ranging from 23 to 40 s each (M = 32.6, SD = ± 4.2). After each segment participants were asked to rate their level of interest in the segment, on a scale from 1 to 7 (“How interesting was this segment?”; 1- not at all, 7- extremely)63,148. To ensure that participants were indeed paying attention to the lecture, after every 20–23 segments, participants were asked to answer ten multiple-choice questions about the recent content of the lecture. Participants who achieved less than 70% correct were excluded from analysis of the screening study. No noises were added to the lecture segments in the screening study, so its interest ratings served as a baseline for testing whether adding noise in the main experiment affected ratings of interest. These ratings were also used to ensure that the lecture-segments allocated to different noise conditions in the main experiment did not vary significantly in their interest-level ratings.

Main Experiment

The experiment was programmed and presented using OpenSesame (version 3.3.14; https://osdoc.cogsci.nl)149. Participants were seated on a comfortable chair in a sound-attenuated booth and were instructed to keep as still as possible and blink and breathe naturally. The video of the lecture was presented on a computer monitor in front of the participants, and the lecture audio was presented through a loudspeaker placed behind the monitor.

The lecture was presented continuously, but was split into 63 segments ranging from 23 to 40 seconds each (M = 32.6, SD = ± 4.2). The varied lengths were necessary to ensure that the segments did not cut-off the lecture mid-sentence or mid-thought. This duration provided participants with enough time to engage with the content, while also being suitable for our analyses (e.g., TRF), and allowing for a sufficient number of data points to assess behavioral and self-reported measures of interest throughout the experiment.

Each segment was randomly assigned to one of three conditions: 1) quiet (21 trials); 2) continuous noise (21 trials); 3) intermittent noise (20 trials). In the two noise conditions, continuous/intermittent noise was presented alongside the lecture, at a loudness level of 0.2 (-16 dB) relative to the lecture. The allocation of segments to the different noise conditions was kept constant across participants. We verified that the segments assigned to different noise-conditions did not differ on average in their level of interest, based on results from the online screening study [mean interest levels in the screening study for the segments later assigned to each condition: quiet: 4.25, continuous: 4.3, intermittent: 4.31; F(2,19) = 0.044, p = 0.957].

After each segment, participants were asked to rate their level of interest on a scale from 1 to 7 (“How interesting was this segment?”; 1- not at all, 7- extremely). In addition, after every 3 segments, participants were asked to answer three multiple-choice comprehension questions regarding the content of the last three segments they heard (one question per segment; see Fig. 7), to ensure that they were paying attention and to assess their level of understanding/memory of the lecture content. Participants received feedback regarding the correctness of their answers. Participants indicated via button press when they were ready to continue to the next trial. A training trial (quiet condition) was performed at the beginning of the experiment to familiarize participants with the task, and this trial was excluded from data analysis.

EEG and GSR data recordings

Electroencephalography (EEG) was recorded using a 64-channel Active-Two system (BioSemi B.V., Amsterdam, Netherlands; sampling rate: 1024 Hz) with Ag-AgCl electrodes, placed according to the 10–20 system. Two external electrodes were placed on the mastoids and served as reference channels. Electrooculographic signals were simultaneously measured by 3 additional electrodes, located above the right eye and on the external side of both eyes. Galvanic Skin Response (GSR), which captures changes in the electrical properties of the skin due to changes in sweat levels and is considered an index of autonomic nervous system responses, was measured using 2 passive Nihon Kohden electrodes placed on the fingertips of the index and middle fingers of participants’ nondominant hand. The signal was recorded through the BioSemi system amplifier and was synchronized to the sampling rate of the EEG.

Behavioral data analysis

Behavioral data consisted of accuracy on the comprehension questions asked about each segment and subjective rating of interest. These values were averaged across lecture segments, separately for each noise-condition (quiet, continuous, and intermittent), and for each participant. A one-way repeated-measures ANOVA was performed using JASP (version: 0.17.3; JASP Team, 2025; https://jasp-stats.org/), to test whether interest-ratings and/or accuracy on the comprehension questions differed significantly across noise-conditions.

In addition, we identified the median interest-level value across all lecture-segments and used it to classify each lecture-segment as high-interest vs. low-interest. We then performed a paired t-test to evaluate whether comprehension-question accuracy differed between segments with high vs. low interest-level ratings.
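The median split and paired comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the split is shown for a single participant's ratings, and assigning ratings equal to the median to the low-interest class is an assumption that the text does not specify.

```python
import math
from statistics import mean, median, stdev


def median_split_accuracy(ratings, correct):
    """Mean accuracy for segments rated above vs. at-or-below the median.

    ratings: interest ratings (1-7), one per segment.
    correct: 1/0 comprehension outcomes for the same segments.
    Ties with the median go to the 'low' class (an assumption).
    """
    cutoff = median(ratings)
    high = [c for r, c in zip(ratings, correct) if r > cutoff]
    low = [c for r, c in zip(ratings, correct) if r <= cutoff]
    return mean(high), mean(low)


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

In use, `median_split_accuracy` would be applied once per participant, and the resulting high- and low-interest accuracies across participants passed to `paired_t`.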

GSR data analysis

The GSR data were analyzed using the Ledalab MATLAB toolbox150 (version: Ledalab V3.4.9; http://www.ledalab.de) as well as custom-written scripts. The data were downsampled to 16 Hz, as per the toolbox recommendation. The raw data were manually inspected for distinguishable artifacts, which were corrected using the built-in linear interpolation. We performed a continuous decomposition analysis (CDA) on the entire GSR signal, and then segmented it into trials. The CDA estimates and separates the continuous phasic and tonic activity using a standard deconvolution. Initially, four metrics were extracted for each trial: (1) the mean tonic activity across the entire trial; (2) the mean phasic response; (3) the number of phasic skin conductance responses (nSCR), defined as transient changes in the phasic response that exceed a threshold of 0.01 microsiemens (μS); and (4) the sum of SCR amplitudes, estimated as the area under the curve of the phasic response around SCR peaks (SCR-amp). Since the nSCR and SCR-amp measures showed highly similar patterns to the mean phasic activity and did not provide additional insights, only the mean phasic and mean tonic activity were used in subsequent analyses. For each participant, these metrics were averaged across trials within each condition. We used a one-way repeated-measures ANOVA to test whether any of these measures were significantly affected by the different noise conditions or by the level of interest (high vs. low, classified as described above). To further examine the effects of noise, we conducted planned a priori paired t-tests, following the same logic as in our previous study9: (1) comparing the two noise conditions (continuous and intermittent) vs. the quiet condition, to evaluate the overall effect of noise, and (2) comparing between the two noise conditions (continuous vs. intermittent).

EEG Preprocessing

EEG preprocessing and analysis were performed using FieldTrip (version: 20220729; https://www.fieldtriptoolbox.org)151, a MATLAB-based toolbox, as well as custom-written scripts. Raw data was re-referenced to the linked left and right mastoids and bandpass filtered between 0.5 and 40 Hz using a zero-phase, two-pass Butterworth IIR filter to reduce artifacts with extreme high-frequency activity or low-frequency activity/drifts. The filtered data was visually inspected and gross artifacts exceeding ±50 μV (that were not eye-movements) were removed manually. Entire trials containing such artifacts were excluded from further analysis, with 1–4 trials rejected per participant (M = 0.875, SD = 1.21). Independent component analysis (ICA) was performed using the ft_componentanalysis function (method = ‘runica’) to identify and remove components associated with horizontal or vertical eye-movements as well as heartbeats (based on visual inspection; 4–10 components were removed), with the inclusion of electrooculographic channels to improve the algorithm’s identification of eye-movement artifacts. Any remaining noisy electrodes, likely due to bad or loose connectivity, were replaced with the weighted average of their neighbors using an interpolation procedure (either on the entire data set or on a per-trial basis, as needed), with up to two electrodes interpolated per participant.

Neural speech tracking analysis

The clean data was segmented into trials, and the first 420 ms of each trial were removed to avoid onset effects. To estimate the neural response to the speaker in the different noise scenarios we performed speech tracking analysis, using both an encoding and a decoding approach. We estimated linear temporal response functions (TRFs) using the mTRF MATLAB toolbox152. A TRF constitutes a linear transfer function describing the relationship between a particular feature of the stimulus (S) and the neural response (R) recorded when hearing it.

The S used here was the speech-envelope stimulus presented in each trial, which was extracted using an equally spaced filterbank between 100 and 10,000 Hz based on Liberman’s cochlear frequency map153. The narrowband filtered signals were summed across bands after taking the absolute value of the Hilbert transform for each one, resulting in a broadband envelope signal. The R used here was the continuous EEG data, after ICA for correcting eye-movements, and bandpass filtered between 0.8 and 20 Hz using a zero-phase, two-pass Butterworth IIR filter. S and R were aligned in time and were downsampled to 100 Hz for computational efficiency. Encoding and decoding models were run and optimized separately for each noise-condition (quiet, continuous and intermittent). Encoding TRFs were calculated over time lags ranging from −150 (pre-stimulus) to 1000 ms, and the decoding analysis used time lags of 0 to 400 ms.
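The envelope-extraction steps described above (band-limit the audio, take the magnitude of the analytic signal in each band, sum across bands) can be sketched with NumPy alone. This is a rough illustration under stated assumptions: the band edges are log-spaced rather than taken from Liberman’s cochlear frequency map, band-limiting is done by masking FFT bins rather than with the filterbank used in the actual analysis, and the function names and the number of bands are hypothetical.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal via the FFT (the signal plus i times its Hilbert transform)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)


def broadband_envelope(audio, fs, n_bands=8, f_lo=100.0, f_hi=10000.0):
    """Sum of narrowband Hilbert envelopes between f_lo and f_hi.

    Log-spaced band edges and FFT-bin masking are assumptions of this
    sketch, standing in for the cochlear-map filterbank in the text.
    """
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / fs)
    spectrum = np.fft.rfft(audio)
    envelope = np.zeros(len(audio))
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        narrowband = np.fft.irfft(spectrum * in_band, n=len(audio))
        envelope += np.abs(analytic_signal(narrowband))
    return envelope
```

The resulting envelope would then be downsampled to the common 100 Hz rate before being paired with the EEG.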

A leave-one-out cross-validation protocol was used to assess the TRF predictive power. In each iteration, 61 trials were selected to train the model (train set), which was then used to predict either the neural response at each electrode (encoding) or the speech envelope (decoding) in the left-out trial (test set). The goodness of fit (predictive power) of the encoding model was determined by calculating the Pearson correlation between the predicted and actual neural response at each sensor. Similarly, the goodness of fit of the decoding model was determined by calculating the Pearson correlation between the predicted and actual speech envelope. To prevent overfitting of the model, a ridge parameter (λ) was chosen as part of the cross-validation process, based on predictive power. This parameter significantly influences the shape and amplitude of the TRF and therefore, rather than choosing a different λ for each participant (which would limit group-level analyses, especially for the encoding approach), a common λ value was selected for all participants. Specifically, we tested a range of λ values (from 10^-3 to 10^6) and selected the λ that yielded the highest average predictive power across participants, electrodes and conditions. In this dataset, the optimal ridge parameter was λ = 1000 for both the encoding and decoding models.
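The logic of the leave-one-out procedure can be illustrated with a minimal single-channel encoding sketch. This is not the mTRF toolbox implementation (which differs in details such as the regularization matrix and scaling); it is plain ridge regression on a lagged stimulus matrix, and all function names are hypothetical.

```python
import numpy as np


def lag_matrix(stim, lags):
    """Design matrix whose columns are time-shifted copies of the stimulus.

    A positive lag (in samples) means the response follows the stimulus.
    """
    X = np.zeros((len(stim), len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:len(stim) - lag]
        else:
            X[:lag, j] = stim[-lag:]
    return X


def fit_ridge(X, y, lam):
    """Ridge solution w = (X'X + lam * I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def loo_predictive_power(stims, resps, lags, lam):
    """Mean Pearson r between predicted and actual response in left-out trials."""
    scores = []
    for test in range(len(stims)):
        train = [i for i in range(len(stims)) if i != test]
        X = np.vstack([lag_matrix(stims[i], lags) for i in train])
        y = np.concatenate([resps[i] for i in train])
        weights = fit_ridge(X, y, lam)
        predicted = lag_matrix(stims[test], lags) @ weights
        scores.append(np.corrcoef(predicted, resps[test])[0, 1])
    return float(np.mean(scores))


def select_lambda(stims, resps, lags, lambdas):
    """Return the ridge parameter with the highest mean predictive power."""
    powers = [loo_predictive_power(stims, resps, lags, lam) for lam in lambdas]
    return lambdas[int(np.argmax(powers))]
```

A grid such as `10.0 ** np.arange(-3, 7)` spans the λ range stated above. To obtain a common λ, the predictive power for each candidate would additionally be averaged across participants, electrodes and conditions before taking the maximum.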

To determine which subset of sensors showed a significant speech tracking response (encoding approach) we used a permutation test, where we shuffled the pairing between acoustic envelope (S) and neural data responses (R) across trials such that speech-envelopes presented in one trial were paired with the neural response recorded in a different trial. This procedure was repeated 100 times and an encoding model was estimated for each permutation. We obtained a “max-chance predictive power” null-distribution by selecting the maximum r-value from the grand average across participants for each permutation. EEG channels with predictive power values within the top 5% of the null distribution were deemed to exhibit a significant speech tracking response. All subsequent TRF and predictive power analyses were limited to this subset of electrodes, which ensured that comparisons between conditions were conducted only on electrodes where speech-tracking estimates are interpretable and meaningful.
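The max-statistic thresholding described above can be sketched as follows, assuming the grand-average predictive power has already been computed for the real data and for each permutation. Function names and array layouts are hypothetical, and treating "top 5%" as exceeding the 95th percentile of the null distribution is an interpretation made for this sketch.

```python
import numpy as np


def mismatched_order(n_trials, rng):
    """Permutation of trial indices in which no trial keeps its own position."""
    while True:
        perm = rng.permutation(n_trials)
        if not np.any(perm == np.arange(n_trials)):
            return perm


def significant_channels(observed_r, null_r, alpha=0.05):
    """Channels whose predictive power exceeds the max-statistic threshold.

    observed_r: (n_channels,) grand-average predictive power, real pairing.
    null_r: (n_permutations, n_channels) grand-average predictive power
        obtained with mismatched stimulus-response pairings.
    """
    max_null = null_r.max(axis=1)                 # one maximum r per permutation
    threshold = np.quantile(max_null, 1 - alpha)  # 95th percentile when alpha = 0.05
    return np.flatnonzero(observed_r > threshold), float(threshold)
```

Taking the maximum across channels within each permutation controls for the number of channels tested, since every channel is compared against the same, most conservative, chance level.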

Next, we tested for differences in the speech tracking response (TRF) and its predictive power across the conditions. To assess whether speech tracking was affected by the presence of any type of noise, we compared responses in the quiet condition vs. noise condition (average across the two noise conditions - continuous and intermittent). To test whether the TRFs were affected by the specific type of noise, we further compared TRFs in the continuous vs. intermittent condition. In addition, we estimated TRFs separately for the speech-segments rated as ‘high’ and ‘low’ interest and tested whether speech tracking was affected by subjective interest ratings. In all analyses we performed paired t-tests at each electrode and each time point for TRF comparisons and corrected for multiple comparisons using spatio-temporal clustering.

EEG Spectral analysis

The second type of analysis performed on the EEG data was spectral analysis, which focused on the two frequency bands in which peaks observed in the power spectral density (PSD) indicated periodic oscillations: Alpha (7–13 Hz) and Beta (16–22 Hz). The range for each band was chosen according to the window surrounding the peaks observed in the PSD. This analysis was performed on the clean EEG data, across segments (same segmentation as used for the speech-tracking analysis). We calculated the PSD of individual segments using the multitaper fast Fourier transform method with Hanning tapers (method ‘mtmfft’ in the FieldTrip toolbox). The PSDs were averaged across segments for each participant, separately for each electrode, across noise conditions (quiet, continuous, intermittent) and across level of interest (high or low). We then used the Fitting Oscillations and One-Over-F algorithm (FOOOF154) to decompose the PSD into periodic (oscillatory) and aperiodic components. The periodic portion of the PSD was used to extract power-estimations for the specific frequency bands.

For each participant, we identified the frequency with the largest amplitude within the alpha range (7–13 Hz) and beta range (16–22 Hz), the only two bands where clear periodic activity was observed, and averaged the response across a cluster of electrodes (21 for alpha, 29 for beta) showing the strongest activity in the grand-averaged power topography, as determined by visual inspection across all participants and conditions. The average power in each band was compared across conditions (quiet, continuous, intermittent) using 1-way ANOVA with repeated measures, and across level of interest (high or low) using paired t-test.
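The band-peak extraction described above can be sketched as follows, assuming the periodic component of the PSD is already available as an electrodes-by-frequencies array. The function names are hypothetical, and averaging across the electrode cluster before locating the peak is an assumption about the order of operations.

```python
import numpy as np


def band_peak(freqs, power, band):
    """Return (peak frequency, peak power) within an inclusive frequency band."""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    idx = int(np.argmax(power[mask]))
    return float(freqs[mask][idx]), float(power[mask][idx])


def cluster_band_power(freqs, periodic_psd, cluster, band):
    """Peak power in `band` after averaging the spectrum over a cluster.

    periodic_psd: (n_electrodes, n_freqs) periodic component of the PSD.
    cluster: indices of the electrodes to average (hypothetical selection).
    """
    mean_spectrum = periodic_psd[cluster].mean(axis=0)
    return band_peak(freqs, mean_spectrum, band)
```

For one participant and condition, `cluster_band_power(freqs, psd, alpha_cluster, (7, 13))` would yield the alpha peak frequency and its power, and the same call with `(16, 22)` and the beta cluster would yield the beta estimate.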