Fig. 2: Statistic results on MMedC. | Nature Communications

From: Towards building multilingual language model for medicine

a The distribution of languages included in MMedC around the world (this map is for demonstration only and has nothing to do with politics). The map shows that our collected corpora cover most major countries worldwide. b The token distribution for each language. The bar plot shows the detailed token counts for the different languages. c The contributions of the four sources to the six languages in MMedC. The Sankey diagram shows how the four data sources considered, i.e., filtering content, medical textbooks, medical websites and small-scale corpus, contribute to the different languages. Source data are provided as a Source Data file.
