Fig. 1: Workflow. Data Engineering.
From: Using a fine-tuned large language model for symptom-based depression evaluation

(A) MADRS clinical interviews were videotaped, and the audio tracks were extracted. (B) Automatic speaker diarization was performed with pyannote to segment the audio files into hypothesized segments for the individual speakers. (C) The individual segments were then transcribed using Whisper-large-v3. Transcripts were proofread, and items and scores were manually assigned. (D) To account for the unbalanced score distribution, additional synthetic interviews were generated that cover the underrepresented scores across the nine items. LLM Training: (E) Real patient transcript data and synthetic data were merged, tokenized, and used to fine-tune the pre-trained BERT-base-german model. A flexible evaluation metric was included that counts predictions within ±1 score point of the true label as correct, reflecting the practical application of MADRS scoring in the clinical setting. LLM Evaluation: (F) The model was evaluated for each item individually using accuracy and mean absolute error (MAE), and confusion matrices were generated to illustrate strict and flexible predictions. Created in https://BioRender.com.
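
The strict and flexible metrics named in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `tolerance` parameter, and the example scores are assumptions made for illustration only. It treats a prediction as correct under the flexible metric when it lies within ±1 score point of the true label, and computes MAE as the mean absolute difference between predicted and true scores.

```python
from typing import Sequence


def _check(y_true: Sequence[int], y_pred: Sequence[int]) -> None:
    # Both sequences must be non-empty and aligned one-to-one.
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")


def strict_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true score exactly."""
    _check(y_true, y_pred)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def flexible_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int], tolerance: int = 1
) -> float:
    """Fraction of predictions within `tolerance` points of the true score.

    tolerance=1 mirrors the ±1 deviation described in the caption
    (the parameter name itself is an assumption of this sketch).
    """
    _check(y_true, y_pred)
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)


def mean_absolute_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean absolute difference between predicted and true scores."""
    _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical scores for a single item; not data from the study.
    true_scores = [0, 2, 4, 6, 3]
    predicted_scores = [0, 3, 2, 6, 4]
    print("strict accuracy:  ", strict_accuracy(true_scores, predicted_scores))
    print("flexible accuracy:", flexible_accuracy(true_scores, predicted_scores))
    print("MAE:              ", mean_absolute_error(true_scores, predicted_scores))
```

In this sketch each metric would be computed separately per item, matching the item-wise evaluation described in panel F; how the authors aggregated or reported the values is not stated in the caption.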