Fig. 2: The relative abundance of annotated named entity classes in our corpus.

As is typically the case with human languages, semantic classes are represented unevenly in free texts, following a heavy-tail (Zipf’s) distribution. (a) In biomedical corpora, unsurprisingly, named entities associated with genes and proteins are the most prevalent (15%), followed by processes (9%), medical findings (8.8%), and chemicals (6.7%). At the low-frequency end of the named entity spectrum, we find journal names, units, citations, and languages. (b) Events connecting two or more entities are also approximately Zipf-law distributed. Event frequencies closely track those of the corresponding named entity classes. For example, the most frequent event, bind, is associated with the most frequent named entity class, GeneOrProtein.

We fitted the rank-ordered frequency distribution of annotated named entities with a discrete generalized beta distribution (DGBD). The result showed a significant deviation from Zipf’s law [33]: the tail of the observed distribution was not heavy enough to match Zipf’s distribution, most likely because of the relatively small number of classes in our ontology [34]. In other words, we expect that the frequencies of semantic classes in a very large corpus, annotated with classes from a hypothetical perfect named entity ontology, would follow a Zipfian (discrete Pareto) distribution.

Our action annotations have moved beyond interactions between proteins and genes (e.g., bind, inhibit, phosphorylate, encode) into interactions involving genetic variants and environmental factors (e.g., associated with, occur in presence of, trigger, lack).

Ambiguity levels varied broadly across the named entities captured in our corpus. For example, in the class AnatomicalPart, almost all entities (99.3%) are annotated at the most specific levels, with the majority belonging to BodyPart, CellularComponent, and Cell.
In contrast, the general (most vague) concept, Chemical, is the most frequently annotated within its cluster, although more specific subclasses, such as Protein, NucleicAcid, and Drug, are also well represented in the corpus. In the Process concept cluster, about a third of all concept instances are annotated at the more general Process level, and the rest are specific concepts, such as MedicalProcedure, MolecularProcess, ResearchActivity, and BiologicalProcess.

In addition to these major clusters of concepts, several individual concepts are well represented in the corpus. For example, MedicalFinding represents 7.3% of all entities. Other well-represented concepts include Duration, IntellectualProduct, Measurement, Organism, PersonGroup, PublishedSourceOfInformation, and Quantity. In total, about 70.4% of all entities are annotated at the most specific ontology level.

There are five concepts in the NERO ontology that allow the semantic flexibility needed to avoid arbitrary concept assignment. Entities annotated as AminoAcidOrPeptide, QuantityOrMeasurement, PublicationOrCitation, MedicalProcedureOrDevice, and GeneOrProtein account for 17.8% of all entities. Less than a quarter (23%) of entities representing either genes or proteins are cleanly annotated with class Gene or class Protein; the remainder are annotated with class GeneOrProtein.

After the action bind, actions indicating entities’ attributes are the next most frequent. Other biological relationships are also well represented in this annotation, such as inhibit, activate, mediate, interact, contain, and regulate. The top 30 action categories account for 64.4% of all annotated actions, with the top ten accounting for 52.2%. Interestingly, negations of actions were also quite abundant in our annotated corpus. For example, do not bind was the sixth most frequent normalized action.
Other well-represented negations of actions include do not affect and do not inhibit (see Supplementary Figs. 1–3).
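
To illustrate the rank-frequency fit described above, the following is a minimal sketch of fitting a DGBD of the commonly used form f(r) = A (N + 1 - r)^b / r^a, where r is the rank and N the number of classes. It assumes an ordinary least-squares fit in log space; the fitting procedure actually used for this corpus is not specified here, and the function names `fit_dgbd` and `solve_linear` are hypothetical.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination
    with partial pivoting. Illustrative helper, not a library routine."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_dgbd(frequencies):
    """Fit f(r) = A * (N + 1 - r)**b / r**a to rank-ordered frequencies.

    Assumption: least squares on log f = log A - a*log r + b*log(N + 1 - r).
    Returns the tuple (A, a, b).
    """
    freqs = sorted(frequencies, reverse=True)  # rank 1 = most frequent class
    n = len(freqs)
    if n < 3 or freqs[-1] <= 0:
        raise ValueError("need at least three strictly positive frequencies")
    # Design matrix columns: intercept, -log(rank), log(N + 1 - rank).
    rows = [[1.0, -math.log(r), math.log(n + 1 - r)] for r in range(1, n + 1)]
    targets = [math.log(f) for f in freqs]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(3)]
    log_scale, a, b = solve_linear(xtx, xty)
    return math.exp(log_scale), a, b
```

Under this parameterization, purely Zipfian input (frequencies proportional to 1/r) yields a close to 1 and b close to 0, whereas a fitted b clearly above zero corresponds to a tail that falls off faster than a pure power law, which is the kind of deviation reported above.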