Introduction

Research on the history of psychology has often relied on interviews with witnesses (e.g., Dutt & Grabe, 2014; Nyman, 2010). For instance, an oral history approach has been used in work on the history of feminist psychology (cf. Johnston & Johnson, 2008; Ruck, 2015) and developmental psychology (cf. Cameron & Hagen, 2005; Johnson & Johnston, 2015). Given past and developing research practice, it seems important to scrutinize an often neglected aspect of this material: the potential impact of voice properties. While interviews are often transcribed before being analyzed, aspects of the spoken word might still have an impact on the analysis of the material, as, for instance, Gibson (2017) indicated regarding the Milgram Experiment. Often, the person providing the transcript also participates in the analyses of the written text (e.g., Nyman, 2010; Corcoran et al., 2019). The voice of a speaker can carry cues regarding their metacognitive status (being sure/unsure), their (lack of) sympathy for a subject or person that is the object of the elaboration, and many other aspects that might not be apparent in the transcription (e.g., Nicolai et al., 2010). Analyzing text with vs. without hearing the voice might therefore influence the judgments and categorizations of the researcher. The invention and widespread use of emoticons in digital text messages suggests that additional information about emotions helps to interpret the semantic part of a message and to prevent misunderstandings. Meanwhile, speech emotion recognition systems exist that process and classify speech signals to recognize implicit emotions (e.g., Akcay & Oguz, 2020).

In order to explore the extent to which experiencing the voice of a speaker alters the categorization of content relevant to the history of psychology, a subject should be used where witnesses and persons analyzing their statements are both likely to feel strongly involved. We chose to confront participants with interview material from witnesses being interviewed on the reconstruction of psychology in Germany after World War II. The audio material on which this study is based had been collected between 2000 and 2003 as part of the project Psychology in Reconstruction by Prof. Dr. Helmut E. Lück and Dr. Hermann Feuerhelm, funded by the German Research Foundation. Interviews aimed to explore the extent to which networks and content from Nazi-era psychology in Germany were re-established after World War II (continuity) and the extent to which a new orientation and cohort of academics could be established (new beginning).

In our prior work (author, 2020), the transcribed form of interview statements was presented to test persons who evaluated them with regard to whether they expressed continuity vs. a new beginning. In the 2020 study, however, only the transcribed quotes from contemporary witnesses were used. In the current study, we experimentally varied whether statements were provided as text plus audio vs. as text only. It is conceivable that linguistic features, emotions or emphasis could lead to a different evaluation of the interview statements. Work on the potential surplus of spoken over written language dates back at least to the Organon model by Bühler (1934), underlining that physical sound is not identical to the linguistic sign. Someone can say more than is relevant to the specific situation (i.e., the receiver abstracts the essential meaning from the perceived sound waves). In contrast, the opposite is possible as well, as the receiver might independently add additional information that has not been explicitly communicated (Bühler, 1934). The added benefit of spoken language has also been discussed while using automated speech as a test case (Shankweiler & Fowler, 2015). Research has long been underway on the properties of language that are not contained in text form. Besides the semantic content, additional information such as the workload and the psychological and physiological stress of a person can be inferred from spoken language on the basis of acoustic variations (e.g., Ruiz et al., 1990).

Aspects of spoken language proved to be meaningful indicators of the human emotional state, such as spectral and spectral-temporal characteristics of fast and slow speech components, as well as temporal qualities and intensity of speech. Based on these properties, it is possible to recognize emotions and their manifestations and even to distinguish between emotional and physical stress (Simonov & Frolov, 1977). It seems that a person’s predominant emotions can be heard in their voice, but sexual orientation, for example, cannot (Sulpizio et al., 2020). Suires, Tognettis and Durands (2020) studied qualities that distinguish female from male voices: the fundamental frequency, modulation, overtone to noise ratio (a proxy for vocal breathing) and jitter (a proxy for vocal roughness).

Seminal work has targeted possible differences in content extraction from written vs. spoken text. Kintsch and Kozminsky (1977) suggested that the comprehension processes of reading and listening possess a common core. Their participants either read three stories or listened to tape recordings of them and then wrote a summary. The comparison of the results showed only slight differences between the listening and reading conditions. The only difference was that after listening to the story, more idiosyncratic details were reproduced, while the actual content of the summaries was remarkably similar. Following up on this, it has been investigated whether there are also differences in the depth of processing depending on the presentation of auditory vs. written material and whether this also leads to differences in mental representations (Kim & Petscher, 2016; Kim et al., 2019; Kürschner et al., 2006). The monistic position states that processing and representation are the same in reading and listening. It is therefore assumed that the same mental lexicon is used and that the same syntactic processes are at work, leading to comparable mental representations (Gilbert et al., 2018; Kürschner et al., 2006). This is justified by the fact that hearing is a valid predictor for learning to read (Kürschner & Schnotz, 2008). In contrast, the dualistic position claims that there are differences between hearing and reading at the lower and higher levels of cognitive processing (Kürschner et al., 2006). Specific memory processes during hearing and reading are assumed (Kürschner & Schnotz, 2008).

Overall, voice and spoken language contain some information that is absent in writing, some of which can potentially influence judgments about the text content.

Purpose of the present study

The literature suggests that multimodal presentation (text + audio) can differ from text-only presentation in many aspects. Yet, it has not been tested whether (and how) this leads to different research outcomes when working with eyewitness material on questions relevant to the history of psychology. In the current work, we explored the potential impact of experiencing the voice in addition to text. First, we checked whether different statements from eyewitnesses would differ in their mean ratings concerning the extent to which the statements signaled continuity vs. a new beginning. After establishing that statements could be consistently categorized, we explored whether mean ratings as well as variability in ratings would be affected by whether voice was made available in addition to text.

Method

Research design

The experiment was programmed using lab.js (Henninger et al. 2020), a free experiment creation tool for online experiments. The study design was a within-subjects design with four different balancing conditions (see Table 1). This ensured that each participant rated each interview excerpt only once, avoiding repeated exposure to the same material and potential carryover effects. Yet, each participant rated half of the material based on the text form and half based on the text-plus-spoken form. Across participants, each interview excerpt was evaluated in either of the two variants equally often.

Table 1 Modalities and order of the statements in the different test variants.

The participants were randomly assigned to one of the four different test variants. 26 statements were used and can be found in the digital appendix on the platform OSF.io in the original language and in English translation. All participants were presented with 13 statements as audio recordings with the transcription and 13 statements in text form for evaluation. Within the sections of the conditions, the citations were presented in random order.
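The counterbalancing logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lab.js implementation used in the study: the mapping of the variants A to D onto "which set is presented with audio" and "which modality block comes first" is hypothetical (Table 1 defines the actual layout), and the names `VARIANTS` and `build_trials` are invented for this sketch.

```python
import random

SET_1 = list(range(1, 14))   # statements 1 to 13
SET_2 = list(range(14, 27))  # statements 14 to 26

# Hypothetical layout of the four variants (an assumption, see Table 1 for
# the real one): (set presented with audio, modality block presented first).
VARIANTS = {
    "A": ("set1", "text+audio"),
    "B": ("set1", "text"),
    "C": ("set2", "text+audio"),
    "D": ("set2", "text"),
}


def build_trials(variant, rng):
    """Return a list of (statement_id, modality) pairs for one participant."""
    audio_set, first_block = VARIANTS[variant]
    audio_ids = SET_1 if audio_set == "set1" else SET_2
    text_ids = SET_2 if audio_set == "set1" else SET_1
    # Statements are shuffled individually within each modality block.
    blocks = {
        "text+audio": rng.sample(audio_ids, len(audio_ids)),
        "text": rng.sample(text_ids, len(text_ids)),
    }
    second_block = "text" if first_block == "text+audio" else "text+audio"
    return [(sid, modality)
            for modality in (first_block, second_block)
            for sid in blocks[modality]]


rng = random.Random(2020)             # fixed seed only for reproducibility
variant = rng.choice(sorted(VARIANTS))  # random assignment to a variant
trials = build_trials(variant, rng)
print(variant, trials[:3])
```

Under this layout, every participant sees each of the 26 statements exactly once, 13 of them with audio, and across variants each statement appears in both modalities.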

The independent variable was modality, i.e., whether only the transcription was presented or the sound recording was presented together with the transcription. The rating of the eyewitness’s statement was the dependent variable. Participants answered the question of whether a presented quote expressed continuity in psychology or a new beginning by rating it on a five-point Likert scale ranging from 1 (continuity) to 5 (new beginning).

Sample

For the procedure of the current study, a positive vote was obtained from the institutional review board of the Faculty of Psychology, and participants provided informed consent. A total of 54 volunteers took part in the study. The participants were tested individually as part of BSc.-thesis projects (see Acknowledgment). Due to incomplete data, two participants were excluded from the analysis. The final sample consisted of 52 participants, of whom 26 were female. Post-hoc power analyses with G*Power (Faul et al. 2009) showed that with this approach, we could reach a power of 0.89 to detect a difference of d = 0.4 in a two-tailed comparison (within-subjects t test) with alpha = 0.05. The age range of the subjects was 21 years to 75 years (M = 42.52, SD = 14.76). There were 12 people in the balancing condition A. In variant B, there were 11 persons, 18 persons in condition C and 11 in condition D (see Table 1).

The test materials

The study was based on interviews with contemporary witnesses from the (anonymized). The interviews used originate from two research projects of the author and (anonymized) funded by the German Research Foundation, carried out in 2000 and 2003 (Bettenhausen, 2020). The eyewitnesses had participated after obtaining informed consent and being informed of the procedure, data storage and usage. Since the interviews were available as sound recordings, the selected excerpts were transcribed. Annotations were not used in the transcription to make reading easier for the participants.

Relevant passages from the interviews were selected, which cover the post-war period from 1945 to 1950 and contain a reference to continuity or new beginnings in German psychology. The quotes were selected based on referring to continuity or new beginnings in terms of content (rather than simply containing general statements) and based on referring to the time span from 1945 to 1950. In a previous study (author, 2020), it was possible to identify contemporary witnesses whose statements expressed more of a continuity or a new beginning in psychology. Quotes from these witnesses were selected with priority for this study. The quotes used in the current study are documented online: https://osf.io/emybp/.

Since the selected interview passages address specific historical or psychological aspects that could make them difficult to understand, short explanatory texts were composed and appended to the original statements. These set the quote into context by explaining terms such as “psychotechnology” or “denazification” and gave details of individual personalities. Care was taken not to include too many explanations, as this could have overwhelmed the participants or led to floor or ceiling effects in the categorizations. Importantly, the explanations were added identically in both experimental conditions (written text vs. written text plus voice). Examples of excerpts are “(…) in the first days after the collapse, people came and were clearly often Americans and were clearly instructed to look through the libraries for Nazi literature and eliminate this literature.” or “That was really a time of awakening, because you were really, if you like, with this inadequate training, this one-sided training, with this short training, you actually had a tremendous need to catch up, because the wave was gradually sweeping over from abroad.” The audio files of the quotes were 4 to 35 seconds long.

The test procedure

After welcoming the participants, explanations were given regarding data protection guidelines and anonymization. Then, the participants read the instructions of the study explaining the purpose of the investigation, the procedure and giving an example of the evaluation, as well as the approximate duration of about 30 minutes. The experimenter ensured that the sound level was set adequately so that the audio material could be heard and took care that all participants understood the instructions. As described above, we counterbalanced the experimental conditions and their order for the different statements. Within a condition, the statements were presented in an individually randomized order.

At the beginning of each trial, a new written statement was presented on the screen. Written context information was presented together with the statement. In the text-plus-voice condition, the audio was played while the written statement was on the screen. If necessary, it was possible for the subject to listen to the statement again. There was no time limit for submitting the rating. Rather, the subject could independently move to the next statement by turning in the rating.

Independent ratings of emotionality and arousal

Based on feedback to an earlier version of this manuscript we had four independent raters judge on a rating scale (1 = not at all; 7 = very strong) (a) the level of emotionality of the voice of the interviewed person and (b) the level of arousal of the interviewed person for each of the audio files in order to explore one possible basis of differences between the voice-plus-text and the text-only variant.

Results

Figure 1 shows the profile of the average ratings for the text vs. the text-plus-voice condition across the statements. For Set 1 (Statements 1 to 13), there was a Pearson correlation of r = 0.991 between the mean values of the experimental conditions (text + voice vs. text-only). For Set 2, the profile correlation was very high as well (r = 0.976). These values imply that the profiles were highly similar for the two experimental conditions. In both conditions, participants were capable of differentiating among the statements and did so in a consistent manner.
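As an illustration of how such a profile correlation can be obtained, the following minimal sketch first averages the ratings per statement within each condition and then computes the Pearson correlation between the two vectors of statement means. The ratings shown are invented toy values, not the study data, and the function names are chosen only for this example.

```python
from math import sqrt


def statement_means(ratings):
    """Average the ratings per statement; `ratings` maps id -> list of ratings."""
    return [sum(r) / len(r) for _, r in sorted(ratings.items())]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy data (hypothetical): three statements, a few ratings per condition.
text_only = {1: [1, 2, 2], 2: [3, 3, 4], 3: [5, 4, 5]}
text_plus_voice = {1: [1, 1, 2], 2: [3, 4, 3], 3: [5, 5, 5]}

r = pearson_r(statement_means(text_only), statement_means(text_plus_voice))
print(round(r, 3))
```

A value close to 1 indicates that statements rated toward "new beginning" in one condition were also rated that way in the other, independent of any overall shift in the mean.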

Fig. 1 Mean rating per quote for the text-plus-voice condition and the text condition. Error bars depict the standard error of the mean.

Furthermore, Fig. 1 (and also Fig. 2a) suggests that there was hardly any mean difference between the text (M = 3.04, SD = 0.47) and the text-plus-voice condition (M = 3.01, SD = 0.51; t(51) = 0.92, p = 0.362, dav = 0.061, for the paired t test). Thus, adding the voice to the text did not systematically bias the ratings overall.
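For readers who want to retrace this comparison, here is a minimal sketch of a paired t test on per-participant means and of the effect size dav. It assumes that dav denotes the mean difference divided by the average of the two condition standard deviations; this definition is not spelled out in the text, but it reproduces the reported value from the reported summary statistics. The per-participant values in the t test example are invented.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def d_av(mean_x, sd_x, mean_y, sd_y):
    """Mean difference standardized by the average SD (assumed definition)."""
    return (mean_x - mean_y) / ((sd_x + sd_y) / 2)


# Effect size from the summary statistics reported in the text.
print(round(d_av(3.04, 0.47, 3.01, 0.51), 3))  # 0.061

# Toy per-participant means (hypothetical) to show the t test itself.
text = [3.0, 4.0, 5.0, 4.0]
voice = [2.0, 4.0, 3.0, 3.0]
t, df = paired_t(text, voice)
print(round(t, 3), df)
```

The p value would additionally require the t distribution with the given degrees of freedom, which is available in standard statistics packages.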

Fig. 2 Comparing mean ratings (a), variability within person (b) and proportion of extreme ratings (c) for the text-plus-voice condition and the text condition. Error bars depict the 95% confidence interval of the mean of the paired t test.

Ratings averaged across participants might hide differences in variability. Conceivably, adding voice to the text might lead to more extreme ratings (in either direction). Indeed, the average within-subjects standard deviation (i.e., across quotes) in the text-plus-voice condition (M = 1.33, SD = 0.48) was higher than in the text condition (M = 1.22, SD = 0.48; t(51) = 2.77, p = 0.008, dav = 0.229, for the paired t test; Fig. 2b). Thus, participants differentiated more strongly among the statements in the text-plus-voice condition. Further explorations suggest that this higher variability was specifically driven by a higher proportion of extreme ratings. On average, M = 44.53% (SD = 25.14%) of the ratings in the text-plus-voice condition fell into either the lowest (1) or the highest (5) category of the scale, while this proportion was 8.88 percentage points lower in the text condition (M = 35.65%, SD = 26.6%; t(51) = 2.73, p = 0.009, paired t test, dav = 0.343; Fig. 2c).
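The two per-participant indices used here can be sketched as follows. The example assumes the sample standard deviation (denominator n - 1) for the within-person variability, which the text does not specify, and it uses invented ratings for a single hypothetical participant.

```python
from statistics import stdev


def within_person_sd(ratings):
    """Variability of one participant's ratings across quotes (sample SD)."""
    return stdev(ratings)


def proportion_extreme(ratings, low=1, high=5):
    """Share of ratings that fall on either end point of the scale."""
    return sum(r in (low, high) for r in ratings) / len(ratings)


# Toy ratings (hypothetical) of one participant in each condition.
voice_ratings = [1, 5, 3, 3, 5]
text_ratings = [2, 4, 3, 3, 5]

for label, ratings in (("text+voice", voice_ratings), ("text", text_ratings)):
    print(label,
          round(within_person_sd(ratings), 2),
          proportion_extreme(ratings))
```

Computing both indices per participant and condition yields paired values that can then be compared across conditions with a paired t test.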

To explore potential bases for the differences between the text vs. the text-plus-voice ratings, we analyzed the judgments of the four independent raters concerning emotionality and arousal in the voices of the interviewed persons. Both aspects were on average rated as rather low (M = 3.12 and M = 2.95, respectively; mid-point of the scale = 4). The agreement amongst the four raters was substantial (Cronbach's Alpha across the items = 0.69 and 0.73, respectively), so we averaged across the four raters. Yet, checking the correlation between, on the one hand, the emotionality rating and, on the other hand, the judgments concerning new beginning vs. continuity for text and voice, for text only, or for the difference between the media conditions did not reveal a significant correlation. The same was true when using the arousal rating instead (ps > 0.05).
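As a rough illustration of the agreement index, Cronbach's Alpha can be computed by treating each rater as an "item" and each audio file as a case. The sketch below uses the standard formula, alpha = k / (k - 1) * (1 - sum of rater variances / variance of the summed scores); the ratings are invented toy values, and whether the original analysis was set up exactly this way is an assumption.

```python
from statistics import variance


def cronbach_alpha(scores_by_rater):
    """Cronbach's Alpha; each inner list holds one rater's scores per file."""
    k = len(scores_by_rater)
    sum_of_rater_variances = sum(variance(r) for r in scores_by_rater)
    # Sum the scores of all raters for each audio file.
    totals = [sum(file_scores) for file_scores in zip(*scores_by_rater)]
    return k / (k - 1) * (1 - sum_of_rater_variances / variance(totals))


# Toy emotionality ratings (hypothetical): four raters, five audio files.
ratings = [
    [2, 3, 4, 2, 5],
    [3, 3, 5, 2, 4],
    [2, 4, 4, 1, 5],
    [3, 2, 5, 3, 4],
]
print(round(cronbach_alpha(ratings), 2))
```

Once agreement is judged sufficient, averaging the raters' scores per audio file gives one emotionality (or arousal) value per statement that can be correlated with the content ratings.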

Discussion

Working with witness statements in research on the history of psychology might involve emotional topics for which the presence vs. absence of voice might tip the balance between different interpretations of historical developments. Using statements from interviewed witnesses on the issue of continuity vs. new beginning in rebuilding psychology in Germany after World War II as a test case, we tested whether adding voice to transcribed statements might affect content judgments. Underlining the validity of our procedure, we found that participants, despite not being markedly knowledgeable about the history of psychology, were consistent in rating different citations (see Fig. 1). While the overall average judgment across raters processing the statements was not affected by whether voice was combined with the transcribed text, participants’ ratings were more polarized in the condition in which the voice recording was played with the text presentation. Potentially, people strive for coherence and produce more extreme judgments by selectively attending to the aspects and interpretations of the statement that fit the emotions transmitted and judgments inferred from the audio (cf. Engel et al., 2020).

These results are relevant for methodological aspects of research. While in the current study, polarized ratings averaged out given the high number of raters, the current results suggest that adding voice to transcripts can lead to a more extreme average rating when using a small number of raters. The finding that average ratings are similar while voice seems to make single ratings more extreme suggests a nuanced interpretation concerning the debate (Gilbert et al., 2018; Kürschner et al., 2006), whether reading and listening are similar (monistic position) or cognitive processes differ across modalities (dualistic position). On the one hand, it is plausible that processes are qualitatively the same, yet adding voice implies adding arousal and/or noise. On the other hand, adding voice might lead to text processing that differs qualitatively.

While the current study suggests that adding voice to transcripts can lead to polarized evaluation of the content, further work is needed to understand which aspects of voice are relevant for such an effect. Aspects such as irony might be very hard to deduce from text alone, while they are effectively communicated by paraverbal auditory cues (Aguert, 2022). Furthermore, it is currently not clear whether the effects of paraverbal auditory cues would be larger if raters were confronted with even shorter quotes (cf. Wang et al. 2021). On the one hand, presenting raters with an audio statement of, for instance, only two seconds might increase reliance on paraverbal cues, as time would not allow for substantial amounts of text to be conveyed. Yet, on the other hand, in long interview snippets, participants might derive an overall impression and attentional filters from paraverbal auditory cues. Hence, paraverbal auditory cues might be influential in very short as well as in rather long sections of material.

While rated emotionality and rated arousal of the voice did not seem to account for the effect, further studies might systematically test for larger arrays of voice properties. Additionally, it might be fruitful to test the impact of added voice on content ratings in domains with a clear ground truth. While ratings on continuity vs. new beginning in post-war German psychology were consistent across raters, the benchmark for what should be considered as a correct answer might differ across domains relevant to this issue. A further issue for future studies concerns the different directions of influence between written and spoken text when working with interviews. The current results and the evidence that spoken text can contain more information than written text (e.g., Nicolai et al., 2010) suggest that adding voice can alter the interpretation of written text. Yet, visually presented text might also influence the reception of spoken text. In three studies, Moreno and Mayer (2002) investigated whether and under which conditions the addition of written text can improve the understanding of a spoken scientific multimedia explanation. The subjects received an explanation of the process of lightning formation in one of two formats: auditory only (non-redundant) or auditory and visual (redundant). The subjects understood the explanation best when the words were presented not only auditorily but also visually, provided that there was no other simultaneous visual material. The overall pattern of the results can be explained by a dual-processing model of working memory. Further studies might investigate how presenting the written text can support tasks in which spoken text has to be analyzed when working with interview material.