Fig. 5: Evaluation of output reports generated by GPT-4o- and Gemma3-driven DisasTeller.

For each task, five independent runs were evaluated by both EvaluatorGPT and human assessors. Grey points represent individual evaluation scores (0–10 scale). Light and dark yellow bars correspond to the mean automated (EvaluatorGPT) and human scores for GPT-4o-based DisasTeller, respectively; light and dark green bars represent the corresponding mean scores for Gemma3-based DisasTeller. Error bars indicate standard deviations across the five runs. Source data are provided as a Source Data file.