Fig. 1: Etymology analysis pipeline.

The pipeline of language data collection, preprocessing, and analysis for etymology content is outlined here. Participants engaged in a recorded Zoom interview, which was then transcribed by the HIPAA-compliant transcription service TranscribeMe!. From there, participant speech was isolated, converted to lower case, and lemmatized using the Stanza NLP package for Python. To calculate etymology proportions, word lemmas were searched on Etymonline.com, and the names of origin languages were pulled from the etymology description on the returned webpage. Separately, word lemmas were searched in a database derived from Wiktionary.com, and word origins with the relation types “inherited from”, “derived from”, etc. were retrieved.