Table 4 Text properties and entropy of medical concept metadata records.

From: A reference set of curated biomedical data and metadata from clinical case reports

| Concept | Average Entropy (bits, +/− standard deviation) | Character Count | Word Count | Segment Count |
| --- | --- | --- | --- | --- |
| Keywords | 2.17 +/− 2.04 | 127,932 | 8,326 | 6,636 |
| Geographic Locations | 0.35 +/− 1.01 | 6,085 | 901 | 358 |
| Life Style | 0.55 +/− 1.35 | 29,244 | 4,862 | 521 |
| Family History | 1.15 +/− 1.83 | 138,162 | 21,342 | 1,717 |
| Social History | 0.23 +/− 0.90 | 12,310 | 2,022 | 249 |
| Medical/Surgical History | 3.02 +/− 1.84 | 804,975 | 119,816 | 8,783 |
| Signs and Symptoms | 3.96 +/− 0.94 | 1,460,450 | 218,276 | 16,467 |
| Comorbidities | 0.96 +/− 1.63 | 33,978 | 3,918 | 1,329 |
| Diagnostic Techniques and Procedures | 3.98 +/− 0.87 | 1,369,668 | 195,000 | 15,936 |
| Diagnosis | 3.85 +/− 0.66 | 206,418 | 24,432 | 4,718 |
| Laboratory Values | 2.80 +/− 2.12 | 990,769 | 146,240 | 5,238 |
| Pathology | 2.32 +/− 2.11 | 853,084 | 121,009 | 2,865 |
| Pharmacological Therapy | 2.74 +/− 1.99 | 422,402 | 60,270 | 3,863 |
| Interventional Therapy | 2.60 +/− 1.94 | 399,831 | 57,967 | 4,909 |
| Patient Outcome Assessment | 3.07 +/− 1.77 | 440,602 | 66,786 | 4,526 |

  1. For each medical concept used in the metadata extraction process, we determined the average character-level entropy (Shannon entropy) across all text values in the concept, along with its standard deviation. Because text length can contribute to estimates of complexity, we also include counts of characters (not including delimiters or spaces), words, and segments (i.e., phrases between delimiters) for each concept across the MACCR set. Values of “NA” are considered to have an entropy of zero and do not contribute to character, word, or segment counts.
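
The quantities described in the note above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code, and it rests on several assumptions the source does not confirm: the segment delimiter is taken to be `;`, entropy is computed over each raw text value (spaces and delimiters included), and the standard deviation is the population form. The function names `shannon_entropy` and `concept_stats` are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev

DELIMITER = ";"  # assumption: the source does not name the delimiter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of one text value, in bits."""
    if not text or text == "NA":
        return 0.0  # "NA" values are treated as zero entropy
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the distinct characters in the value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def concept_stats(values, delimiter=DELIMITER):
    """Summarize all text values recorded for one concept."""
    # Every value, including "NA", contributes to the entropy average.
    entropies = [shannon_entropy(v) for v in values]
    # "NA" values are excluded from the character, word, and segment counts.
    present = [v for v in values if v != "NA"]
    segments = [s.strip() for v in present for s in v.split(delimiter) if s.strip()]
    words = [w for s in segments for w in s.split()]
    return {
        "mean_entropy": mean(entropies),
        "sd_entropy": pstdev(entropies),  # assumption: population SD
        "characters": sum(len(w) for w in words),  # no spaces or delimiters
        "words": len(words),
        "segments": len(segments),
    }


if __name__ == "__main__":
    print(concept_stats(["fever; dry cough", "NA"]))
```

Because "NA" values enter the average as zeros, a concept that is rarely filled in will show a low mean entropy with a comparatively large standard deviation, which is consistent with rows such as Social History and Geographic Locations.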