Fig. 4: Unsupervised clustering of free text responses is consistent with closed-ended answers and provides fine-grained description of infection circumstances. | Nature Communications

From: Extracting circumstances of Covid-19 transmission from free text with large language models

This figure shows the application of an unsupervised method (topic modeling) to determine the main circumstances of infection from free text responses only, without relying on closed-ended question answers, and visualizes its consistency with the predefined infection contexts from closed-ended question answers. a–d The plots show UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) representations of the embeddings computed by CamemBERT. Each of the small dots corresponds to a distinct text response, i.e. a distinct individual. Disks of larger size represent groups of dots belonging to the same label (color) in close proximity (distance < 0.2), with disk size indicating the number of points as per the legend. Proximity between dots or disks means that the corresponding embeddings are close to each other, indicating semantic similarity of the corresponding text responses. a UMAP for the entire dataset of n = 79,444 responses (but restricted to a region occupying 93.3% of the responses; see Supplementary Fig. 7a for a larger view). Three groups automatically defined by BERTopic are shown: (i) outliers (gray dots; 44% of responses); (ii) a set of 9 clusters with names such as “mask, covid”, “hands, gel”, “test, negative” (colored dots except gray and blue dots), representing 17% of responses, which did not appear to contain specific circumstances of infection but rather aggregated all responses reporting generic aspects such as a lack of social distancing, test results or health state; and (iii) all other responses (blue dots, 39% of responses). b UMAP showing only the latter group of n = 31,036 responses, which BERTopic partitioned into 23 distinct clusters (shown in blue under the name ‘Clusters’ in panel a). Here, each color corresponds to a distinct cluster, and the legend gives the number of responses in each cluster. 
Each cluster is automatically named (labelled) using its two most salient words, as determined by TF-IDF. “Nursing home” is our manual translation of “ehpad”, a term that was not translated by DeepL and which stands for “établissements d’hébergement pour personnes âgées dépendantes” (residential facilities for dependent elderly people). (*) The cluster “birthday, concert” is outside the displayed region; see Supplementary Fig. 7b. c Same as (b), except that the dots are colored according to the context of infection selected by the same individual among the seven predefined categories in the closed-ended question (Work, Family, Friends, Sports, Cultural, Religious and Other). d Same as (c), but with random shuffling of the seven context categories and without the cluster names. Source data are provided as a Source Data file.
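
The naming of clusters by their two most salient words can be illustrated with a minimal sketch. It assumes a simplified class-level TF-IDF in which all responses of a cluster are pooled into one bag of words and each word is scored by its within-cluster frequency times the log inverse of the number of clusters containing it. The function name `name_clusters`, the whitespace tokenization and the toy responses are hypothetical choices made for illustration; BERTopic's own class-based TF-IDF weighting differs in detail, and this is not the authors' code.

```python
import math
from collections import Counter


def name_clusters(docs_by_cluster, top_k=2):
    """Return {cluster_id: "word1, word2"} from a simplified class-level TF-IDF.

    Assumption: whitespace tokenization and plain tf * log(N / df) scoring,
    which only approximates the weighting used by BERTopic.
    """
    # Pool all responses of a cluster into a single bag of words.
    counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in docs_by_cluster.items()
    }
    n_clusters = len(counts)
    # Document frequency: number of clusters in which each word occurs.
    df = Counter(word for bag in counts.values() for word in bag)

    names = {}
    for cid, bag in counts.items():
        total = sum(bag.values())
        scores = {
            word: (n / total) * math.log(n_clusters / df[word])
            for word, n in bag.items()
        }
        # Highest score first; alphabetical order breaks ties deterministically.
        top = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
        names[cid] = ", ".join(top)
    return names


# Toy responses invented for illustration; they are not from the study data.
toy = {
    0: ["meal with family", "family meal at christmas", "meal family"],
    1: ["colleague at office", "office colleague tested positive", "colleague office"],
}
print(name_clusters(toy))
```

On this toy input the two clusters are named "family, meal" and "colleague, office". The word "at" occurs in both clusters, so its score is zero and it is never selected, which is the behavior that makes TF-IDF names more specific than raw word counts.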
