Extended Data Fig. 9: Token source-related biases. | Nature

From: Learning the natural history of human disease with generative transformers

Non-random missingness may cause biases in predictions even when sources are not explicitly provided to the model. a. Disease embedding UMAP for a Delphi model with explicit token sources (e.g. “Common cold (self-reported)” and “Common cold (hospital records)” are separate tokens), tokens coloured by ICD-10 chapters. b. Same as a, coloured by token source. c. Same as a, but for the standard Delphi-2M model. Only tokens with more than 75% of all entries from one source are shown. d. Same as c, coloured by primary token source. e. SHAP value matrix (similar to Fig. 4c), with tokens grouped by chapter and primary source.
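The selection rule in panels c and d (keep a token only if more than 75% of its entries come from one source, and label it by that source) can be illustrated with a minimal sketch. This is not the authors' code: the function name `primary_sources`, the `(token, source)` pair format, and the toy counts are all hypothetical and chosen only to show the arithmetic.

```python
from collections import Counter, defaultdict


def primary_sources(entries, threshold=0.75):
    """Return {token: dominant source} for tokens whose dominant source
    accounts for strictly more than `threshold` of that token's entries.

    `entries` is an iterable of (token, source) pairs (a hypothetical format).
    """
    counts = defaultdict(Counter)
    for token, source in entries:
        counts[token][source] += 1

    kept = {}
    for token, by_source in counts.items():
        source, n = by_source.most_common(1)[0]  # most frequent source
        if n / sum(by_source.values()) > threshold:
            kept[token] = source
    return kept


# Toy data, invented for illustration only.
entries = (
    [("Common cold", "self-reported")] * 9
    + [("Common cold", "hospital records")] * 1
    + [("Asthma", "self-reported")] * 5
    + [("Asthma", "hospital records")] * 5
)
kept = primary_sources(entries)
```

In this toy example "Common cold" is kept with primary source "self-reported" (9 of 10 entries), while "Asthma" is dropped because neither source exceeds the 75% share.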
