Table 4. Text properties and entropy of medical concept metadata records.

Concept	Average Entropy (bits, +/− standard deviation)	Character Count	Word Count	Segment Count
Keywords	2.17 +/− 2.04	127,932	8,326	6,636
Geographic Locations	0.35 +/− 1.01	6,085	901	358
Life Style	0.55 +/− 1.35	29,244	4,862	521
Family History	1.15 +/− 1.83	138,162	21,342	1,717
Social History	0.23 +/− 0.90	12,310	2,022	249
Medical/Surgical History	3.02 +/− 1.84	804,975	119,816	8,783
Signs and Symptoms	3.96 +/− 0.94	1,460,450	218,276	16,467
Comorbidities	0.96 +/− 1.63	33,978	3,918	1,329
Diagnostic Techniques and Procedures	3.98 +/− 0.87	1,369,668	195,000	15,936
Diagnosis	3.85 +/− 0.66	206,418	24,432	4,718
Laboratory Values	2.80 +/− 2.12	990,769	146,240	5,238
Pathology	2.32 +/− 2.11	853,084	121,009	2,865
Pharmacological Therapy	2.74 +/− 1.99	422,402	60,270	3,863
Interventional Therapy	2.60 +/− 1.94	399,831	57,967	4,909
Patient Outcome Assessment	3.07 +/− 1.77	440,602	66,786	4,526

For each medical concept used in the metadata extraction process, we computed the average character-level Shannon entropy across all text values in the concept, along with its standard deviation. Because text length can contribute to estimates of complexity, we also report counts of characters (excluding delimiters and spaces), words, and segments (i.e., phrases between delimiters) for each concept across the MACCR set. Values of “NA” are treated as having an entropy of zero and do not contribute to character, word, or segment counts.
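
The quantities described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions the source does not state: the segment delimiter is taken to be a semicolon, entropy is computed over every character of a value (spaces and delimiters included), words are whitespace-separated tokens, and the population standard deviation is used. The function names `char_entropy` and `concept_stats` are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def char_entropy(text: str) -> float:
    """Character-level Shannon entropy of `text`, in bits per character.

    "NA" and empty values are assigned an entropy of zero, as in the table note.
    """
    if not text or text == "NA":
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = sum over characters of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def concept_stats(values, delimiter=";"):
    """Summarize one concept: entropy mean/SD plus character, word, segment counts.

    `delimiter` is an assumption; the source does not name the delimiter used.
    """
    if not values:
        raise ValueError("values must contain at least one record")
    entropies = [char_entropy(v) for v in values]
    chars = words = segments = 0
    for v in values:
        if v == "NA":
            continue  # "NA" contributes nothing to the counts
        parts = [p.strip() for p in v.split(delimiter) if p.strip()]
        segments += len(parts)
        words += sum(len(p.split()) for p in parts)
        # Characters exclude delimiters and all whitespace
        chars += sum(len("".join(p.split())) for p in parts)
    return {
        "mean_entropy": mean(entropies),
        "sd_entropy": pstdev(entropies),  # assumption: population SD
        "chars": chars,
        "words": words,
        "segments": segments,
    }


stats = concept_stats(["fever; dry cough", "NA"])
```

Under these assumptions, the two example records yield 2 segments, 3 words, and 13 characters, and the "NA" record lowers the mean entropy because it is counted as zero. This is consistent with the large standard deviations relative to the mean for sparsely populated concepts such as Social History.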
