Introduction

Generative artificial intelligence (GenAI) within ChatGPT launched in 2022, and an explosion in research applications followed. In a PubMed search for “generative AI” in Title or Abstract fields, just 7 articles appeared prior to 2023, compared with 552 in July 2024 and 825 as of July 2025. GenAI can process large amounts of information or data and generate new information – whether as text, images, or audio – that potentially offers rapid and substantial advances in understanding and using data. The goal of using GenAI is to “provide output that is indistinguishable from that of a human”, or if possible to exceed it1. Thus, it is no surprise that its availability has been seized upon by the research community. A systematic review and meta-analysis found that GenAI in health-related research was used most frequently for diagnostic and screening processes (e.g., improving accuracy) in relation to diabetes, radiology, cardiology, and gastrointestinal medicine2, and its performance relative to physicians has also been assessed in meta-analysis, showing variable performance by physician expertise and specialty3.

Large language models (LLMs), as implemented within GPT-4o4, are advanced natural language processing (NLP) systems trained on huge amounts of text. These models use deep learning, particularly transformer architectures5, to perform tasks such as writing text, translating languages, summarizing information, and analyzing sentiment. By recognizing patterns in language, LLMs can generate meaningful and natural responses, making them useful for chatbots, content creation, and customer service. In short, LLMs make it easier for humans to communicate with technology more naturally. The application of GenAI incorporating LLMs raises potential utilities for undertaking analysis in qualitative research. The aim of qualitative research is a naturalistic enquiry to understand perceptions, feelings, thoughts, or behaviors as experienced by study participants themselves, without attempting to broadly generalize to a wider context as occurs with quantitative research. Rigor and credibility of a qualitative study are ascertained by its trustworthiness – with underlying concepts of confirmability, reflexivity, and transferability. This requires reflection and transparency in order to negate or acknowledge any bias, including that of the analysts’ own thoughts and actions6. Of the different methods of analysis, thematic analysis appears particularly appropriate for the use of AI, as it offers a systematic and robust framework for analyzing qualitative data, which makes it especially useful in applied health research7.

Currently, thematic analysis is a lengthy and labor-intensive activity. The phases of thematic analysis involve familiarization with the data (transcription, reading, re-reading); systematically generating initial codes; searching for or deducing themes and subthemes by collating codes; reviewing themes; defining and naming themes; and reviewing transcripts to ensure codes and themes are appropriate to the context and represent the range of participant views. The narrative is then produced, with illustrative quotes added to support the text7. GenAI may conduct such thematic coding with time efficiency and without specific biases8, but may lack nuanced understanding of the data, the ability to make sense of the complex and often irrational thoughts or behaviors of humans, and the capacity for reflexivity on the processes ensuring rigor. Studies have emerged evaluating the use of GenAI in qualitative research. For example, one study compared an LLM’s inductive thematic analysis of public health-related content from social media posts to human thematic analysis, and found that the LLM identified several themes consistent with those identified by humans, and that additional themes were relevant and reasonable9. Other analyses have been limited to medical records or policy documents8, or lack formal comparison to human analysis10. However, these qualitative analyses involve relatively short or simple texts compared to qualitative research studies, such as focus group discussions, which involve simultaneous and interactive conversation among multiple people. As summarized by Lee et al., several studies have applied GenAI to various stages of thematic analysis in qualitative research with varying results, identifying challenges such as prompt-dependency, hallucination (i.e., “made up” information), repetition of output, word and syntax errors, and the requirement for human assistance to refine codes and themes11.

GPT-4o offers significant advances over prior versions, including expanded coherence, the ability to generate richer explanations, greater capacity for ambiguous queries and multiple interpretations, a broader knowledge base, and improved recognition of user intent12. Given these expanded utilities, and building on prior studies, we applied an LLM to a qualitative study and evaluated how well it conducted thematic analysis: (1) compared to a formal human analysis, and (2) through a replicable framework to assess the quality of the GenAI output. Our test case makes use of a qualitative study that assessed the impact of the COVID-19 pandemic on the sexual and reproductive health of adolescent girls and young women (AGYW) in rural western Kenya13. The main objective of the present study was to compare the thematic analysis conducted by GenAI using an LLM to that conducted by humans, with regard to major themes identified, selection of supportive quotes, and quality of quotes; and secondarily to explore quantitative and qualitative sentiment analysis conducted by the GenAI, as a novel potential adjunct to traditional qualitative thematic analysis. The process and results of this study present a replicable framework to evaluate the potential complementarity, augmentation, and errors or bias that may arise when using GenAI to support qualitative study.

Methods

This study was approved by the institutional review boards of Maseno University Ethics Review Committee (MUERC, MSU/DRPI/MUERC/01021/21), the University of Illinois at Chicago (UIC, #2022-0220), and the Liverpool School of Tropical Medicine (LSTM, #21–087, favourable ethical opinion). Written informed consent was obtained from all participants. All research was performed in accordance with relevant guidelines/regulations. Per OpenAI’s safety protocols, they actively monitor for potential misuse and safeguard user data, declaring that user input is not accessible to or shared with anyone else, including the OpenAI team14. Complete annotated code is available in Supplementary File 1.

Study design and participants

Data for this analysis came from the Cups and Community Health (CaCHe) study15,16, which involved a sub-set of participants within the Cups or Cash for Girls (CCG) trial (ClinicalTrials.gov NCT03051789). The CCG trial has been described in detail17. Briefly, CCG was a cluster-randomized controlled trial in which secondary schools in western Kenya were randomized into 4 arms (1:1:1:1): (1) provision of menstrual cups with training on safe cup use and care; (2) conditional cash transfer (CCT) based on ≥ 80% school attendance in the previous term; (3) menstrual cup and CCT; and (4) control (usual practice). For the CaCHe study, we enrolled approximately 20% of girls in the cup-only and control arms of the CCG trial. After enrollment, CaCHe participants were followed every 6 months, with the 72-month study visit completed in July 2024. The 24-month study visit window (May through June 2020) coincided with the COVID-19 pandemic and was missed. At the 48-month study visit (April 5 – July 1, 2022), following observed increases in STIs and pregnancy18,19, we initiated a qualitative study to understand the impact of the COVID-19 pandemic on the sexual and reproductive health of AGYW in the study.

Methods of qualitative study and thematic analysis

We conducted facilitator-led focus group discussions (FGDs), chosen for their ability to elicit rich data on complex and nuanced topics. In this approach, a structured interview guide was used, with open-ended questions allowing flexibility in discussion flow and the ability to explore participants’ meanings and interpretations. The structured interview guides and human-led thematic analysis methods used in this study are published13. Briefly, semi-structured FGD guides with overlapping themes were applied for AGYW and males. FGDs lasted 1.5–2 h, and were carried out in English, Swahili, Dholuo, or a mixture, with the Kenyan moderator (EA) being fluent in all three. FGDs were audio recorded with participant permission and transcribed verbatim, with Dholuo and Swahili translated to English. The moderator (EA) compared all transcripts against the original audio for accuracy. For the translated transcripts, EA conducted a quality check on 10% of the content to verify consistency and correctness. General patterns that emerged were identified and noted as initial themes; line-by-line narratives were ascribed detailed codes, which were assigned to the themes and built into a framework. The framework was edited dynamically until all transcripts were coded and a series of subthemes and themes identified. Further analysis was undertaken to compare opinions, behaviors, and language across and within the AGYW and community men transcripts. EA and SY coded and analyzed the data independently, and then compared results, discussing and resolving with SDM and LM where any disagreement occurred.

GenAI: thematic analysis

We used GPT-4o (November 2023 version) to conduct thematic analyses (i.e., identification of key attributes) separately for transcripts from AGYW and community males. While platforms such as Google’s Gemini and Meta AI exist, ChatGPT remains the most widely utilized tool for research20,21,22,23. Although Gemini Pro has been tested in a limited number of studies, its use has primarily been for comparison against ChatGPT24,25,26. No research papers have reported utilizing Meta AI for similar purposes.

We interfaced with GPT-4o through Google Colaboratory27. We chose to access GPT-4o via the OpenAI API in a Google Colaboratory notebook rather than the ChatGPT web interface for data safety. In this environment, all data exchange occurs over our university’s encrypted pipeline: credentials and user inputs never pass through a third-party GUI. Using Colab allowed us to script and version control every step of the analysis (from prompt templates to post processing), which greatly enhances reproducibility. The API driven approach gave us programmatic control over batching, rate limits, and logging, enabling a consistent, automated workflow that would be difficult to achieve via the interactive ChatGPT app.
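To illustrate what such an API-driven, loggable workflow can look like (our full annotated code is in Supplementary File 1), the sketch below shows a minimal pattern for calling GPT-4o from a notebook using the standard OpenAI Python client; the environment-variable name, helper function, and log file are illustrative assumptions rather than the exact implementation.

# Minimal sketch (illustrative assumptions): authenticate and call GPT-4o from a
# notebook, keeping the API key out of the notebook and logging each exchange.
import datetime
import json
import os

from openai import OpenAI  # OpenAI Python SDK (v1.x)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # key held in an environment variable

def ask_gpt4o(prompt_text: str, log_path: str = "gpt4o_log.jsonl") -> str:
    """Send one prompt to GPT-4o and append the prompt/response pair to a log file."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_text}],
    )
    answer = response.choices[0].message.content
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "prompt": prompt_text,
            "response": answer,
        }) + "\n")
    return answer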

Our primary task involved prompting the model to generate themes when given the transcripts. There were 6 transcripts from 54 girls and 5 transcripts from 53 men. Each transcript consisted of the moderator’s questions and participants’ answers. We only used the English translation of participants’ responses. We pre-processed the transcripts in Python before providing the texts as input to GPT-4o. We were able to input all the data at once, separately for AGYW and men. The final input size was 41,280 words for AGYW and 59,985 words for men. After pre-processing and inputting the transcripts, we constructed a standardized prompt for this task. The prompt used in this study was structured in plain text and consisted of three main sections (Supplemental Table 1). The first section contains all the FGD questions (verbatim) directed towards male or female participants. This was done so that GPT-4o would generate themes parallel to those being sought by humans. No pre-existing themes were provided to GPT-4o. After GPT-4o generated an exhaustive set of themes, we posed specific questions to it, in which we provided two primary themes along with their sub-themes. Lastly, we included detailed instructions for the AI to generate outputs containing key words, descriptions, and relevant supporting quotes. The specific wording of the task prompts was based on best practices in AI prompt engineering28: being specific and clear in requesting the desired format (e.g., using key words, asking for direct and exact quotes from the original text), including specific constraints (e.g., provide 3 supportive quotes), and iteratively refining prompts.
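The sketch below illustrates how a three-section prompt of this kind might be assembled in Python; the file names, wording, and helper function are hypothetical simplifications, and the verbatim prompt used in the study is given in Supplemental Table 1.

# Illustrative sketch only: assemble a plain-text prompt from (1) the verbatim FGD
# questions, (2) task instructions, and (3) the pre-processed transcripts.
from pathlib import Path

def build_theme_prompt(questions_file: str, transcript_files: list[str]) -> str:
    """Combine FGD questions, instructions, and transcripts into one prompt string."""
    fgd_questions = Path(questions_file).read_text(encoding="utf-8")
    transcripts = "\n\n".join(
        Path(f).read_text(encoding="utf-8") for f in transcript_files
    )
    instructions = (
        "Identify all themes present in the focus group transcripts below. "
        "For each theme, provide key words, a short description, and 3 direct, "
        "exact quotes copied verbatim from the transcripts."
    )
    return f"FGD QUESTIONS:\n{fgd_questions}\n\n{instructions}\n\nTRANSCRIPTS:\n{transcripts}"

# Example usage (hypothetical file names):
# prompt = build_theme_prompt("agyw_fgd_questions.txt", ["agyw_fgd_1.txt", "agyw_fgd_2.txt"])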

To standardize our LLM Application Programming Interface (API) calls, all runs were set to a temperature of 0.7 and a maximum of 4000 output tokens. We experimented with different values of the “temperature” parameter, which can range from 0 to 2. In OpenAI’s models, temperature controls the randomness of the AI’s output29,30,31. A temperature of 0 results in highly deterministic and repetitive responses, while a temperature of 2 encourages more creative and diverse outputs. In our case, the model did not function properly at temperatures of 1.5 and above (i.e., providing overly fanciful and nonsensical results), but worked at 1.4 and below. We tested a range of values from 0.7 to 1.2 (1.0 is the default value). While the AI provided some well-structured and coherent answers at higher temperatures, the quotes it generated were often altered from the original text, making them difficult to trace back to the source. After reviewing the results, we fixed the temperature at 0.7 to ensure that AI-selected quotes came from the original text and maintained a precise connection with the accompanying descriptions.
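As a sketch of how these decoding settings enter the API call (reusing the illustrative ask_gpt4o pattern above), the request can pass the temperature and token cap explicitly; the parameter values match those reported here, while the piloting loop is purely illustrative.

# Illustrative sketch: the same chat-completion call with the decoding parameters
# fixed in this study (temperature 0.7, maximum of 4000 output tokens).
def ask_gpt4o_fixed(prompt_text: str, temperature: float = 0.7) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=temperature,  # 0 = near-deterministic; 2 = most random
        max_tokens=4000,          # cap on output length
    )
    return response.choices[0].message.content

# During piloting, candidate temperatures could be compared along these lines:
# for t in (0.7, 0.8, 1.0, 1.2):
#     print(t, ask_gpt4o_fixed(theme_prompt, temperature=t)[:200])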

We repeated the thematic analysis, recording the subthemes with each iteration (Supplemental Table 2), until no new themes appeared. We quantified the stability of the subthemes across iterations using BERTScore F132,33,34,35, which measures the similarity between two texts based on their semantic meaning and ranges from 0 to 1. We obtained BERTScore F1 values in the range of 0.89–0.99, indicating a high degree of consistency and implying that the AI produced stable outputs over repeated trials. Once an exhaustive list of subthemes was obtained for AGYW and for community males, GenAI was asked to provide an overview of each theme and identify three supporting quotes for each of the themes.
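For readers unfamiliar with the metric, the sketch below shows how BERTScore F1 between the sub-theme lists of two iterations can be computed with the open-source bert-score package; the example strings are invented for illustration.

# Minimal sketch: semantic similarity between sub-theme lists from two iterations.
# Requires the bert-score package (pip install bert-score).
from bert_score import score

run_1 = [
    "Economic hardship driving transactional relationships",
    "School closures disrupting daily routines",
]
run_2 = [
    "Poverty and economic insecurity increasing transactional sex",
    "Loss of schooling structure during the pandemic",
]

P, R, F1 = score(run_1, run_2, lang="en")  # per-pair precision, recall, F1 tensors
print(round(F1.mean().item(), 2))          # values close to 1 indicate stable output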

Evaluation of thematic analysis

The exhaustive list of subthemes was taken as the final product from the GenAI and was evaluated by two investigators (EA, SY), who had previously coded the transcripts manually. They independently reviewed this product from the AI using a rubric (Supplemental Table 3). We developed the rubric for this study with attention to the consolidated criteria for reporting qualitative research (COREQ)36, recommendations for standardized reporting of qualitative research37, and the Critical Appraisal Skills Programme (CASP) checklist for reporting qualitative studies38. These standards emphasize that high-quality qualitative research should demonstrate critical reflection, with key criteria of: credibility (plausible and trustworthy findings that align between theory, the research question, data collection, and results); confirmability (a clear link between the data and the findings, such as through use of quotes); and reflexivity (self-reflection regarding subjectivity and influence on the research). Given that GenAI had no role in the design of the qualitative study, hypothesis generation, or development and implementation of the interview guides, we evaluated its work primarily on confirmability. Therefore, the three metrics we selected for evaluation were: (1) how well the themes were described (not at all, partially, completely); (2) for each supportive quote selected, whether it was consistent with and supportive of the theme (no, yes); and (3) for each supportive quote selected, the extent to which it was consistent with and supportive of the theme (not at all, partially, completely). Evaluators were asked to provide comments where themes or supportive quotes were not completely explained or consistent. Interrater reliability of the evaluators is reported as percent agreement, Cohen’s kappa, and Gwet’s AC1.
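A brief sketch of the interrater-reliability calculations is shown below using invented yes/no ratings: percent agreement is computed directly and Cohen’s kappa via scikit-learn; Gwet’s AC1 is available from dedicated packages (e.g., irrCAC) and is not reproduced here.

# Sketch with invented ratings: reliability of the yes/no quote-consistency judgments.
from sklearn.metrics import cohen_kappa_score

rater_1 = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # 1 = quote rated consistent with the theme
rater_2 = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

percent_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"% agreement = {percent_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# Gwet's AC1 (less sensitive to category prevalence) can be obtained from packages
# such as irrCAC.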

Exploratory sentiment analysis

Analysis conduct

Sentiment analysis in text analysis involves identifying and categorizing the emotional tone or sentiment expressed in a text, typically as positive, negative, or neutral. Several lexicon-based tools exist, such as VADER39, NRC40, and TextBlob41. Lexicon-based approaches rely on a pre-defined dictionary of words labeled with sentiment (positive, negative, or neutral). Sentiment is derived by identifying words in the text that match the lexicon and assigning a score based on the lexicon entries. GPT-4o, in contrast, is a transformer-based language model trained on massive datasets that can generate human-like text; it uses deep neural network analysis to infer sentiment from each sentence’s overall structure, context, and the relationships between words. Based on this capability, we tasked it to conduct sentiment analysis of our transcripts.
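To make the contrast with lexicon-based scoring concrete, the sketch below scores an invented sentence with VADER via the vaderSentiment package; it assigns valence from a word list without modeling the wider conversational context.

# Sketch of a lexicon-based approach (contrast with the LLM-based analysis).
# Requires the vaderSentiment package (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
example = "The lockdown made everything harder, but we kept hoping things would improve."
print(analyzer.polarity_scores(example))
# Returns neg/neu/pos proportions and a compound score in [-1, 1] summarizing valence.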

We provided the GPT-4o API with a predefined set of sentiment/emotion categories for the analysis. For sentiment analysis, we first applied word tokenization to the transcripts, breaking each participant’s response into a vector of tokens (i.e., numerical representations of words); GPT-4o then mapped this vector to patterns learned from its training data to predict sentiment. After obtaining sentiments for each participant response, we requested the percentage distribution of sentiments. Additionally, we instructed the AI to extract and highlight relevant keywords and provide supporting quotes from the text.
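A hedged sketch of how such a request to GPT-4o might be phrased is given below; the category list matches the one described in this section, but the wording and helper function are illustrative rather than the exact prompt used.

# Illustrative sketch: ask GPT-4o to classify each response into predefined sentiment
# categories and to return the distribution, keywords, and supporting quotes.
SENTIMENT_CATEGORIES = ["Very negative", "Negative", "Neutral", "Positive", "Very positive"]

def build_sentiment_prompt(responses: list[str]) -> str:
    """Assemble a sentiment-classification prompt over a list of participant responses."""
    listed = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return (
        "Classify each participant response below into exactly one of these categories: "
        f"{', '.join(SENTIMENT_CATEGORIES)}. Then report the percentage distribution of "
        "categories, key indicator words for each category, and 3 supporting quotes "
        "copied verbatim from the responses.\n\n"
        f"RESPONSES:\n{listed}"
    )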

Two different types of sentiment analysis were conducted: (1) evaluation of transcript tokens categorized as “Very negative”, “Negative”, “Neutral”, “Positive”, and “Very positive”, following the VADER lexicon. VADER (Valence Aware Dictionary and sEntiment Reasoner) uses a list of words to determine positive or negative valence; originally designed for shorter pieces of text, it is commonly used for sentiment analysis of social media42. (2) Evaluation of more detailed feelings or emotions, including “fear”, “anger”, “trust”, “surprise”, “joy”, “disgust”, “sadness”, “positive”, and “negative”, adapted from the Circumplex Model of subjective feelings43. The circumplex encompasses numerous emotional states, but there is evidence that certain states are more reliably recognized and assessed, such as sadness and anger, representing both the arousal and deactivation quadrants of negative valence, or surprise and joy for positive valence44. Because we did not know a priori the specific sentiments most likely to be expressed in the transcripts, we chose a range of negative sentiments covering both arousal (fear, anger, disgust) and deactivation (sadness) of negative valence, a neutral-valence sentiment (surprise), and positive-valence sentiments (trust, joy), and included “positive” and “negative” as “catchalls” for other states.

Initially we attempted sentiment analysis within sub-themes, but due to excessive repetition of supporting quotes within and across sub-themes we abandoned this exercise. Sentiment analysis was then conducted separately for each of the two objectives, which also produced repetitious results. Therefore, as the depth of data appeared insufficient for these approaches, we proceeded with an overall sentiment analysis, stratified by female and male inputs. Within each analysis, GPT-4o was asked to provide three supportive quotes for each sentiment. BERT F1 scores comparing different sets of transcripts ranged from 0.88 to 0.92, indicating a high level of agreement and reliability between the models. After reliability analyses, we proceeded to evaluate the sentiment analysis based on three randomly selected analyses each for female and male transcripts.

Human evaluation of sentiment analysis

Each of the six sentiment analysis results (3 AGYW FGDs, 3 male FGDs) was reviewed independently by the two reviewers (EA, SY). Reviewers were asked to rate how much they agreed or disagreed with the key indicator words chosen for the sentiment analysis description on a score ranging from 1 to 5: “Completely Disagree”, “Somewhat Disagree”, “Neither Agree nor Disagree”, “Somewhat Agree”, “Completely Agree”. We used a different scale than for the thematic analysis given the more subjective nature of sentiment analysis. For each of the three supporting quotes provided for each sentiment, reviewers were asked to determine whether the quote was consistent with or supportive of the sentiment (yes/no), and for each quote, to rate how strongly supportive the quote was (0–2): “Not at all”, “Somewhat”, “Very supportive”. Lastly, stratified by sentiment, we report the percentage of quotes that were consistent with sentiments and a mean score for strength of support.

Bias analysis

GenAI was asked separately for male and female transcripts, with three repetitions: “As an AI, what potential biases might you have in conducting this analysis?” We grouped biases into selection biases (representativeness of the study sample/transcripts, representativeness of quotes) and information biases (how quotes were interpreted in relation to generation of themes).

Results

Among 54 AGYW participating in the FGDs, mean age was 20.9 years, few (11%) were employed, with 44% currently in school (Table 1). Among 53 community men participating in the FGDs, mean age was 27.1 years, nearly half (47%) were married, and all were employed.

Table 1 Distribution of FGD participant characteristics.

Summary of human manually coded themes and GenAI coded themes

GPT-4o identified 13 themes from AGYW transcripts, taking 7 repetitions to reach exhaustion (i.e., no new themes emerging), and 11 themes from community male transcripts, taking 10 repetitions to reach exhaustion. The final list of GenAI themes is presented alongside human-derived themes in Table 2; results from each repetition are in Supplemental Table 2. Some themes may be considered closely related, but we decided not to apply too much amalgamation, since we could not have a “back and forth” discussion with GenAI as human investigators would have done. For example, when we suggested that two themes overlapped and asked whether they could be merged into one (i.e., posed as a question), GenAI always capitulated (e.g., “yes you’re correct”; “I see your point”), rather than providing an explanation for why they were initially identified as separate themes.

Table 2 Summary of human manually coded themes and GPT 4o coded sub-themes.

On the surface, human-derived and GenAI-derived themes were somewhat different; however, this stemmed from organization: in our analysis, we grouped sub-themes by themes that were similar for AGYW and male stakeholders. For example, one of our major themes was the impact of the COVID-19 pandemic on sexual behaviors (increased frequency and number of partners), with drivers of increased sexual activity as a second theme, including a sub-theme of poverty and economic insecurity; in our manuscript, discussion around this encompassed economic dependency and the transactional, exploitative nature of these relationships13. Conversely, GenAI noted the increased pressure for sexual relationships and exploitative relationships as a sub-theme of the impacts of the pandemic on men’s attitudes and relationships with girls or schoolgirls. Increases in pregnancy (as a result of increased sexual behaviors) were also noted in our analysis as a theme, whereas GenAI indicated increased pregnancy rates as a sub-theme of the impact of the pandemic on schoolgirls. Similarly, increased domestic responsibilities were noted in our analyses as a result of school closures, but were not identified as a sub-theme by GenAI. Overall, we did not find disagreement with the sub-themes raised by GenAI, but did not consider some to rise to the level of a theme, identifying them more as context or explanation for a theme or sub-theme.

Results of GenAI thematic analysis and human evaluation

Across all AGYW and male thematic analysis repetitions, there was 100% agreement between the two raters that the themes were completely described and explained, except in one set for males. However, as shown in Fig. 1 (Panel A), performance was low and variable with regard to the selection of quotes that were consistent with themes. Both evaluators classified quotes from the male thematic analysis as less frequently consistent with the thematic description, ranging from 33 to 79% of quotes, compared with 64–87% of quotes from AGYW. GenAI also performed poorly when assessed on the quality of the selected quotes, with AGYW quotes supporting themes 36–67% of the time and male quotes 18–55% of the time. Additionally, several hallucinations were noted (Supplemental Table 4). In some instances, a single word or phrase was changed; in more extensive hallucinations, there were examples of two quotes being combined, and/or substantial modification (see examples, Fig. 2).

Fig. 1
figure 1

Evaluation of GPT-4o thematic and sentiment analyses by two independent evaluators. (A) Evaluation of thematic analysis. For transcripts from adolescent girls and young women (AGYW), GPT-4o generated 13 themes and selected 3 quotes per theme, resulting in 39 evaluation points for each results set. For transcripts from community males, GPT-4o generated 11 themes and selected 3 quotes per theme, resulting in 33 evaluation points per results set. The bars represent the proportion (x-axis) of GPT-4o-selected quotes that evaluators rated as (1) consistent with/supportive of the theme for each of the 3 results sets (light grey, medium grey, dark grey), and (2) completely consistent with/supportive of the theme. (B) Evaluation of sentiment analysis. For transcripts from AGYW and community males, GPT-4o selected 3 quotes for each of the 5 VADER-based sentiments (very negative to very positive), resulting in 15 evaluation points per results set. For the circumplex-based emotional sentiments (Anger, Disgust, Fear, Negative, Sadness, Joy, Positive, Surprise, Trust), GPT-4o selected 3 quotes per sentiment, resulting in 27 evaluation points per results set.

Fig. 2
figure 2

Sample hallucinations involving modification of original transcript.

Results of sentiment analysis and human evaluation

The distribution of sentiments identified by GPT-4o for AGYW and community male transcripts is shown in Fig. 3. Sentiment analysis of AGYW transcripts revealed that 50% of sentiments were very negative (20%) or negative (30%), compared to 25% of sentiments classified as very positive (10%) or positive (15%). Men’s sentiments were also more commonly classified as negative than positive. The more emotive sentiments were similarly distributed for AGYW and males, though AGYW transcripts were somewhat more frequently classified as expressing “sadness” (15%) compared to men’s transcripts (5%).

Fig. 3
figure 3

GenAI sentiment analysis from the qualitative study about COVID-19 impacts on AGYW sexual and reproductive health. Sentiment analysis 1 reports the frequency distribution of sentiments ranging from “Very Negative” to “Very Positive” for (A) adolescent girls and young women (AGYW), and (B) community males. In sentiment analysis 2, panels represent the frequency distribution of the specific emotions listed in the key for (C) AGYW and (D) community males.

Human evaluation of GenAI sentiment analysis

Evaluators were largely in agreement that the keywords or descriptions used to define the sentiments (Supplemental Table 5) were consistent (Fig. 1, Panel B). For sentiments ranging from “very negative” to “very positive”, evaluators mostly agreed that the quotes selected were consistent with the ascribed sentiment for AGYW transcripts (74–100%), and generally indicated that quotes were strongly supportive (87–100%, with one set of quotes the exception at 53%). Overall, evaluators determined that GPT-4o performed poorly at selecting quotes that were consistent with (53–73%) and strongly supportive (33–67%) of the sentiments ascribed to male transcripts. When examining more complex emotions, evaluators determined that GPT-4o described the sentiments well, but there was variable and lower performance in the selection of supportive quotes, and in the selection of strongly supportive quotes. For example, in one repetition of a male sentiment analysis, the evaluators agreed with the descriptive basis provided by GenAI for the “Positive” classification: sentiments that express optimism, hope, or positive outcomes. However, one of the three supportive quotes provided was “The sponsorship we talked about earlier, he will provide while expecting something in return and that makes it not a better one”, which was deemed not supportive of positive sentiment. As another example, GenAI described “surprise” as sentiments that express astonishment, shock, or unexpected outcomes, with which evaluators agreed. Yet a quote deemed non-supportive of this sentiment was “I was in school”. In contrast, a quote evaluators found highly supportive of “surprise” was “For me when corona came it really shocked me, it took me around four months to believe that it was with us”.

Examining GPT-4o performance stratified by sentiment (Table 3) showed fairly consistent performance for AGYW, in that most quotes were consistent with and strongly supportive of the specified sentiments, except for poor performance in selecting quotes related to “Disgust”. For male transcripts, GPT-4o appeared to perform acceptably (mean score ≥ 5, and ≥ 80% of selected quotes being strongly supportive) only for “Fear” and “Negative” sentiments.

Table 3 Mean scores of evaluation of GPT-4o GenAI selection of consistent and strongly supportive quotes by specific sentiment.

1The consistency of quotes was rated as yes (1 point) or no (0 points). The range for consistency scores is 0 (across two raters, none of the 3 provided quotes were consistent with the sentiment) to 6 (across two raters, each of the 3 provided quotes were consistent with the sentiment).

2The strength of quotes was rated as not at all (0), somewhat (1), or strongly supportive (2). The range for strength-of-support scores is 0 (across two raters, none of the 3 provided quotes were strongly supportive of the sentiment) to 12 (across two raters, each of the 3 provided quotes were strongly supportive of the sentiment).

A higher mean score indicates greater agreement that quotes are consistent or strongly supportive of the specified sentiment.

Rubric performance: interrater reliability of evaluators

Agreement between raters was higher for evaluations related to AGYW than males, and higher for thematic analysis than sentiment analysis (Supplemental Table 6). Agreement was excellent for how well themes were explained; substantial (AGYW) or fair (males) for consistent quote selection; and fair for how strongly supportive quotes were. Agreement was fair-moderate for male sentiment analysis based on circumplex, and was substantial-excellent for AGYW circumplex-based sentiment analysis. There was a similar pattern of agreement for the VADER-based sentiment analyses.

Identifying biases

Several biases were termed differently (e.g., Language and Context Bias vs. Language and Interpretation), but had similar explanations and were grouped together (Table 4). Regarding selection bias related to representativeness, GenAI relayed that results may not be generalizable and may not include other important perspectives on how the COVID-19 pandemic affected schoolgirls, in one instance highlighting the potentially important input of parents and teachers. GenAI identified numerous information biases, primarily related to the underlying training data (leading to confirmation bias and potential exclusion of contradictory or more nuanced findings) and to its lack of cultural understanding (social dynamics and linguistics), in one instance highlighting its Western-biased training data.

Table 4 Potential biases identified in reflexivity analyses of transcripts from adolescent girls and young women and from community males.

Discussion

We conducted a formal comparative analysis of GenAI versus human thematic coding of a qualitative focus group discussion study exploring the impacts of the COVID-19 pandemic on the sexual and reproductive health of AGYW, with AGYW and male participants in Kenya. We found that GenAI did an adequate job of thematic analysis. However, selected quotes were unreliable, even after initial feedback, and were subject to hallucination. We often disagreed with the GenAI’s performance of sentiment analysis, with general dissent regarding supportive quote selection. In both thematic and sentiment analyses, GenAI’s performance was rated more poorly for transcripts generated by male participants. We examine these findings with critical and contextual analysis and recommendations below.

GenAI was tasked with undertaking an inductive approach to thematic analysis in order to compare the output to our human analysis. As stated, it performed adequately, identifying themes and subthemes similar to those we identified. Reassuringly, GenAI did not identify themes that did not emerge from our analysis; rather, the discrepancies between human evaluation and GenAI resulted from differences in the level of importance assigned to them, i.e., as main theme, subtheme, or explanatory theme. The themes and subthemes we identified, and which formed the comparator for this study, were a product of repeated and time-consuming analysis followed by constant refining to form a coherent and logical framework and narrative that was concise enough for publication.

GenAI’s sub-optimal performance in quote selection

GenAI performed satisfactorily in identifying the key themes from the transcripts, but performed poorly in identifying and selecting appropriate quotes. This suggests that while GenAI can perform well when tasked with amalgamating this type of data, at a granular level of text analysis it lacks the capacity for careful scrutiny, and errors become more obvious. While we could have provided this feedback from the evaluators to the GenAI to attempt to improve its performance, we refrained from doing so because the themes, quotes, and feedback are specific to our study and we were seeking to assess whether, and in what aspects, the GenAI could conduct qualitative data analysis on par with human researchers. Given the amount of person-time involved in evaluating the GenAI output and providing feedback, it would not be useful or helpful to iteratively do this until the GenAI met our standard, because the “learning” would not transfer to another study: a qualitative study of a different topic, or of varying behavioral or ethical complexity, may provide different analytic goals or results. However, had we done this, one approach would have been to give the GenAI examples of quotes that human researchers judged to be consistent with and strongly supportive of each theme. While this may enhance the LLM if the topic is related, it would be unlikely to help with a different context or question.

The hallucinations were cause for concern that further analysis would produce untrustworthy results. The use of language is central to qualitative research; researchers look not just at the words used, but sentence structure, ordering, emphasis, hesitations, repetitions, tone, connotation and denotation amongst others, which give insight to the meaning, context and the overall message conveyed by the participant. Sociocultural significance is also conveyed by the language used45. Any errors in transcripts can alter the meaning and interpretation of results, similar to an incorrect number reported in quantitative research. Taking the example in Fig. 2, the GenAI hallucinations (i.e., quote modifications) change the context by suggesting that men were under pressure to have sex, rather than the AGYW. While this did not affect the overall themes generated, were we not already familiar with the transcripts and findings from prior analysis, this may have been highlighted as an unusual study finding, or could have led to an unnecessary line of inquiry.

Why male transcripts might have performed more poorly

We hypothesize that GenAI performed worse in selecting quotes from the males’ transcripts that were consistent with themes and sentiments because of linguistic differences between the male and AGYW transcripts. We noted that the AGYW were more precise, concise, and direct in their responses to our questions and probes. In contrast, the speech patterns in the male transcripts were lengthier, with responses to the moderator’s questioning often indirect. Men’s vocabulary frequently included euphemisms and local vernacular, and their use of grammar was less accurate. As a result, some of their quotes were more difficult to comprehend, increasing the chance of misinterpretation. These linguistic differences may also contribute to the markedly poorer performance in the sentiment evaluation of the male supporting quotes.

Limitations

This was a secondary analysis of data generated for understanding contextual characteristics of participants and stakeholders during a trial, and the FGDs were not specifically designed for the purpose of examining the quality of AI technology. We do not know if our processes generalize to another setting, topic, or method of qualitative analysis. However, we used replicable and standardized approaches to (1) conduct the qualitative study13, (2) conduct the LLM analyses, and (3) apply a literature-based rubric to evaluate the AI. Interrater reliability was variable. This may reflect that our rubric criteria were not sufficiently observable or measurable. Conversely, measures of reliability, especially Cohen’s kappa, are dependent on the structure of the data (number of subjects, categories, and prevalence). While Gwet’s AC1 was also presented and addresses some of these concerns, limitations from data structure remain, as evidenced in some instances by high percent agreement and moderate reliability measures. Therefore, while the rubric we developed provided criteria and replicability for the evaluation, broader application of the rubric is necessary to further refine its utility.

In this study, the LLM was not context-specific and relied on pre-existing training data to conduct the analysis. As the model may not have been sufficiently trained on topics or contexts inclusive of western Kenya, COVID-19, or sexual and reproductive health, its analysis might lack relevance or accuracy; this limitation was acknowledged by the GenAI in its assessment of its biases. The selection of lower quality quotes likely occurred because the LLM relied on surface-level patterns and lacked contextual understanding. Although GenAI was provided English translations, there are nuances of idiomatic and cultural references that we believe posed a challenge to the GenAI and were not resolved. This highlights the importance of building AIs that have been trained on multiple languages and cultures. Hallucinations have been documented in other studies of GenAI thematic analyses11, and present a substantial threat to its validity and trustworthiness. Statistical approaches to detect hallucinations are emerging46; while these can alert researchers to risk, Xu et al. demonstrate that LLM hallucination is inevitable and unavoidable47. The results of the sentiment analysis may have been limited by our choice of circumplex sentiments. This type of analysis should be repeated using the full range of valence and arousal, to determine its utility in more varied and larger datasets.

Customizing GenAI models for specific research contexts and requirements may improve contextual understanding and reduce hallucinations. For example, specialized training datasets that reflect diverse and contextually rich qualitative scenarios could be used to augment the GenAI model through a retrieval-augmented generation (RAG) approach48. This would make it more reliable in producing relevant, context-sensitive outputs. Developing mechanisms for human cross-checking would increase reliability. Regarding new capabilities for qualitative research, development should focus on GenAI’s ability to compare and contrast between and within sources, take into account background characteristics of participants, and look for typical and atypical patterns. GenAI will also need to develop the inherently human trait of being ‘curious’ with the data – spotting or investigating something that it hasn’t been instructed to do, or identifying something that emerges from the data that hadn’t been thought about beforehand.

Conclusions

Based on our study, GenAI implemented in GPT-4o was unable to provide a thematic analysis indistinguishable from a human analysis of focus group study transcripts, due to a combination of errors and a low level of sophistication. We suggest that it can currently be used as an aid to the human analyst in identifying themes, keywords, and basic narrative, and potentially as a check for human error or bias. GenAI may also be useful for rapid appraisal, quickly generating themes and subthemes from which to hone and refine the direction for a further round of data collection. However, until it can eliminate hallucinations, provide better contextual understanding of quotes, undertake deeper scrutiny of data, and demonstrate a greater range of reflexivity, it is not reliable or sophisticated enough to undertake a rigorous thematic analysis equal in quality to that of experienced qualitative researchers.