Main

Phylogenetic placement of deep-time (>1 million years ago (Ma)) fossils has typically relied on morphological observations, because the recovery of sufficiently extensive genetic evidence has not been proven to be possible before the Pleistocene12. Although ancient DNA (aDNA) sequences are often a valuable source of data for inferring phylogenies and population dynamics in the Middle-Late Pleistocene13,14,15, the oldest authentic aDNA from macrofossils has been extracted from Arctic-situated specimens dated to no more than 1.2 Ma (ref. 16). By contrast, palaeoproteomic data have been recovered from Middle–Late Miocene, Pliocene and Early Pleistocene fossils, even in localities that are warm, humid and/or at low latitudes5,6,17. Although protein sequences from the Early Pleistocene have been used successfully to infer the phylogenetic placement of various fossil mammals2,3,4, the precise limit of proteomic survival has not been systematically characterized yet, because it depends on a complex interplay of time, temperature and environmental factors driving chemical breakdown mechanisms18,19. At present, the oldest confirmed palaeoproteomic data successfully used to infer subordinal taxonomic relationships derive from bone collagen of camelids from the 3.7-million-year-old Fyles Leaf Bed site of Canada’s High Arctic1,20. Beyond this time frame and latitude, only peptide sequences too short to be genetically informative5,6 and the products of advanced diagenesis21,22,23,24,25 have been reported.

Rhinocerotidae is a family that includes only five extant species, but a wide diversity of fossil members7,26. Where and when the radiation of this group occurred remains debated27. For most of the past two decades, the group was defined by a deep ‘basal split’ between two clades, Rhinocerotinae and Elasmotheriinae, before episodes of rhinocerotid diversification in the Late Eocene8,9,28,29. This paradigm contrasts with earlier hypotheses of a close relationship between two extinct rhinocerotids that survived into the Late Pleistocene: the Siberian unicorn (Elasmotheriinae, Elasmotherium sibiricum) and the woolly rhinoceros (Rhinocerotinae, Coelodonta antiquitatis)30. Recently, the sequenced genomes of Coelodonta and Elasmotherium31 were used to confirm morphology-based hypotheses that they have distinct phylogenetic affinities8, but also allowed for the recognition of a split between these two groups during the Late Eocene (36 Ma). This suggests that the deep-divergence hypothesis based on the morphological analysis of fossils is not supported by molecular evidence. However, the lack of available genetic sequence data from other early-diverging rhinocerotid lineages (for example, Aceratheriinae32) makes it difficult to assess the timing of the Rhinocerotinae–Elasmotheriinae split in relation to other radiations that occurred in the group. For these reasons, the ancient radiations of the group remain obscured.

To investigate the timing of Rhinocerotidae divergence and the potential for evolutionarily informative protein sequences to persist in deep time, we targeted dental enamel deriving from the Haughton Crater (75° N, Nunavut) in Canada’s High Arctic (Fig. 1). The Haughton Crater is an impact structure whose stratigraphy includes post-impact fossiliferous lacustrine sediments dated to 21–24 Ma (ref. 33). Fossils from these sediments are found in a polar landscape, at present characterized by permafrost. Compared with similarly aged material from lower latitudes, this setting creates a temperature regime favourable for biomolecular preservation, sparing these fossils from the harshest effects of diagenesis, and potentially paralleling those mechanisms underlying the remarkable soft-tissue preservation of Konservat-Lagerstätten34. To maximize the likelihood of proteomic recovery, we focused on dental enamel, following recent successful extractions of pre-Pliocene peptides from highly biomineralized tissues5,6. The prismatic enamel of placental mammals, in which tightly packed enamel prisms typically extend from the dentine–enamel junction to the tooth surface, presents a suitable scaffold for protecting biomolecules35.

Fig. 1: The high-latitude Haughton Crater on Devon Island has produced a highly endemic vertebrate fauna.

a, Location of Devon Island in the circumpolar North (map data from International Bathymetric Chart of the Arctic Ocean, accessed via ArcGIS on 8 May 2024). b, Anterolingual view of specimen CMNFV59632 after destructive palaeoproteomic analysis. c, Location of Haughton Crater (75° 22′ N, 89° 40′ W) on Devon Island. Scale bar, 1 cm (b). Panel a adapted from ref. 48, Springer Nature Limited, under a Creative Commons Licence CC BY 4.0. Basemap from Natural Earth (https://www.naturalearthdata.com/).

The digestion-free palaeoproteomic workflow2,3 was applied to dental enamel36 from an Early Miocene rhinocerotid (Epiaceratherium sp.) specimen (CMNFV59632) (Fig. 1b and Supplementary Information) from the Haughton Formation (21.8 Ma). It allowed for the recovery of an enamel proteome comprising 1,163 confident peptide–spectrum matches (PSMs), representing at least seven proteins (AHSG, ALB, AMBN, AMELX, AMTN, ENAM and MMP20) and spanning at least 251 amino acids (Fig. 2a and Extended Data Fig. 1). At present, the enamel proteome of CMNFV59632 represents both the oldest mammalian skeletal proteome reported, confirming the predicted deep-time persistence of ancient mammalian proteins from high latitudes3,5, and the first biomolecular characterization of the extinct genus Epiaceratherium. The survival of a relatively rich enamel proteome from such ancient deposits is indicative of the specimen’s excellent state of preservation.

Fig. 2: Proteome preservation in the enamel specimen of the Early Miocene rhinocerotid (CMNFV59632, Haughton Crater, Canada).

Preservation is compared with enamel proteomes from an Early Pleistocene (1.77 Ma) Stephanorhinus (rhino) (DM.5/157, Dmanisi, Georgia), a Middle Pleistocene (0.4 Ma) Stephanorhinus (CGG 1_023342, Fontana Ranuccio, Italy) and a medieval ovicaprine (Control, Aarhus, Denmark)2. All plots exclude contaminants and reverse hits. a, Comparison of amino acid sequence coverage for each identified protein between samples, showing that coverage decreases over time. b, Comparison of the peptide length distributions for each sample. Dashed bars represent average peptide length for each specimen, showing that older samples have shorter average lengths. c, Comparison of the modification rate (0%–100%) of selected amino acids recovered from each sample that are often modified in ancient enamel proteomes. Colours are the same as in b. Results derive from modification-specific searches as described in the Methods. ‘Arginine’ includes arginine to ornithine conversion (−42.02 Da); ‘Glutamine’ includes glutamine deamidation (+0.98 Da); ‘Asparagine’ includes asparagine deamidation (+0.98 Da); ‘Tryptophan’ includes advanced tryptophan oxidation to kynurenine (+3.99 Da), oxylactone (+13.98 Da) and tryptophandione (+29.97 Da); ‘Histidine’ includes oxidation (+15.99 Da) and dioxidation (+31.99 Da) of histidine, as well as histidine conversion to hydroxyglutamate (+7.98 Da). The overall average modification rate of these amino acids (excluding deamidation) ranges from 6.2% in the control to 72.3% in the Haughton Crater specimen, showing an increase in oxidative damage over time, especially for histidine and arginine. d, Sequence coverage plots for the three most abundant EMPs (AMBN, AMELX and ENAM), recording relative number of PSMs (coloured areas). Colours are the same as in b. Raw data used to create these figures are given in the Supplementary Information.

Protein diagenesis in closed systems such as enamel is driven by the combined effect of time and temperature. Therefore, thermal age37 can be used to assess expected and observed molecular degradation, typically by normalizing the thermal age to a mean annual temperature of 10 °C, and to predict survival into deep time at different geographic locations3,17. Rather than using global signals of climate change38 to estimate the temperature history of the sample, we extrapolated location-specific palaeotemperature values from the HadCM3 model39,40 (Extended Data Fig. 2 and Supplementary Table 1). We calculate the equivalent thermal age at 10 °C (Ma at 10 °C) for CMNFV59632, obtaining a value of 2.5 ± 2.5 Ma at 10 °C. Despite the broad confidence interval, caused by the wide seasonal temperature fluctuations extrapolated from the model for the Haughton Crater location in the High Arctic (Extended Data Fig. 3), this result is entirely consistent with previously reported protein survival over Pleistocene–Pliocene timescales in temperate climates3. Consequently, it reinforces the idea that the climatic history of Miocene high-latitude sites is compatible with protein preservation, a crucial requirement for extending molecular-based phylogenetic reconstructions into deep time.
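The idea of normalizing a temperature history to an equivalent age at 10 °C can be illustrated with a minimal Arrhenius-based sketch. This is not the calculation used here, which relies on HadCM3-derived palaeotemperatures and the cited thermal age model. The activation energy, the function names and the single-temperature history below are all assumptions chosen only for illustration.

```python
import math

GAS_CONSTANT = 8.314462618  # J mol^-1 K^-1

# ASSUMPTION: illustrative activation energy (J mol^-1); the text does not
# state the kinetic parameters behind the reported thermal age.
ACTIVATION_ENERGY = 127_000.0


def relative_rate(temp_c, ref_c=10.0, ea=ACTIVATION_ENERGY):
    """Arrhenius reaction rate at temp_c relative to the rate at ref_c."""
    temp_k = temp_c + 273.15
    ref_k = ref_c + 273.15
    return math.exp(-ea / GAS_CONSTANT * (1.0 / temp_k - 1.0 / ref_k))


def thermal_age(history, ref_c=10.0, ea=ACTIVATION_ENERGY):
    """Equivalent age at ref_c for a list of (duration_Ma, temperature_C)."""
    return sum(duration * relative_rate(temp, ref_c, ea)
               for duration, temp in history)


# HYPOTHETICAL history: 21.8 Myr spent at a constant -5 degrees C.
cold_history = [(21.8, -5.0)]
print(thermal_age(cold_history))  # far smaller than the chronological age
```

Because the rate depends exponentially on temperature, warm intervals dominate the sum, which is one reason why wide seasonal fluctuations translate into a broad uncertainty on the result.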

To better appreciate the preservation state of the Haughton Crater enamel proteome, we compared it with those of two other rhinocerotids, the Early Pleistocene Stephanorhinus from the site of Dmanisi (Georgia), dated at 1.77 Ma (ref. 2), and a Middle Pleistocene Stephanorhinus (about 0.4 Ma) from the site of Fontana Ranuccio (Italy). A medieval ovicaprid enamel control sample2 was also re-analysed to illustrate preservation differences. Although the set of proteins retrieved from the CMNFV59632 enamel specimen is similar to that of the other two Pleistocene rhinocerotids used for comparison, fewer peptides and a shorter reconstructed amino acid sequence were recovered from the Arctic specimen (Fig. 2).

As expected, diagenetic modifications are extensive in the enamel proteome of CMNFV59632 (Fig. 2b). Average peptide lengths are similar to, although slightly shorter than, those of the Dmanisi Early Pleistocene specimen (9.64 amino acids versus 10.42 amino acids, respectively), and further reduced in comparison with the Fontana Ranuccio Middle Pleistocene Stephanorhinus (10.99 amino acids), indicating a greater degree of peptide bond hydrolysis (Fig. 2b). We also observe high deamidation rates in CMNFV59632, although no more so than in the Pleistocene rhinocerotids (Extended Data Fig. 3), or other previously sequenced mammalian specimens from low-latitude sites2,3,4. Although high deamidation rates can be useful for confirming proteome authenticity, they can be highly variable in samples41,42, and can plateau relatively quickly in fossil proteomes, reducing their utility in characterizing degradation patterns in deep time (Fig. 2c). Instead, we identify a suite of informative spontaneous modifications indicative of advanced diagenesis that are observed at a higher rate in the Arctic Miocene rhinocerotid, providing support for their utility as markers of advanced diagenesis and authenticity in deep time2 (Fig. 2c). These include arginine to ornithine conversion (Fig. 2c) and advanced forms of tryptophan (Extended Data Fig. 1b) and histidine oxidation (Extended Data Fig. 1c). Intra-crystalline protein decomposition analysis further confirms the advanced degradation state of CMNFV59632. The concentrations of free amino acids (FAA) and total hydrolysable amino acids (THAA) are around half of those in the Early Pleistocene Stephanorhinus sample from Dmanisi (Extended Data Fig. 4a), and the percentage of FAA in CMNFV59632 (about 75%) is higher than in the Pleistocene Stephanorhinus from Dmanisi (about 50%) (Extended Data Fig. 5b), supporting increased peptide bond hydrolysis. Furthermore, these analyses confirm that the enamel of CMNFV59632 behaves as a closed system, because the racemization values for CMNFV59632 fall along the expected FAA versus THAA trends for both fossil enamel and experimentally heated enamel samples (300 °C for 10 min) (Extended Data Fig. 6). On a peptide level, endogeneity is supported by the similar patterns and levels of across-sequence degradation shown by sequence coverage plots for CMNFV59632 and the experimentally heated samples (Extended Data Fig. 7).

Peptide sequences recovered from CMNFV59632 also derive from sequence regions similar to those previously identified in the Dmanisi Pleistocene Stephanorhinus proteome (Fig. 2d), particularly for the three most abundant enamel matrix proteins (EMPs). ENAM and AMBN present broadly similar sequence coverage patterns in both specimens, although with fewer PSMs covering most positions in the Miocene sample. AMELX, the most abundant EMP, is instead covered by a similar number of PSMs in both the Miocene and Pleistocene samples. The depth of coverage is also similar for the most abundantly covered AMELX sequences, including those spanning the deletion observed in the leucine-rich amelogenin peptide2.

Despite a relatively limited breadth of coverage across the sequenced proteins, a high depth of coverage allows for the confident reconstruction of sequences in CMNFV59632, including positions variable in Perissodactyla. At least ten single amino acid polymorphisms (SAPs) support the placement of CMNFV59632 in Rhinocerotidae. A smaller number (two or more) of SAPs are shared between CMNFV59632 and other perissodactyls, to the exclusion of later-diverging rhinocerotids. No new variants are uncovered in CMNFV59632, because the aforementioned SAPs represent character states retained from ancestors in Perissodactyla and Mammalia more broadly. The identification of these SAPs is supported by several unique PSMs showing almost complete ion series (Fig. 3 and Supplementary Section 4).

Fig. 3: Abridged alignment and mirror plots of a phylogenetically informative SAP at AMELX-39.

Residue numbering (depicted above the alignment) follows the UniProt reference sequence F6QHS4 (F6QHS4_HORSE), corresponding to AMELX isoform 1 of Equus caballus. The upper spectrum is experimentally derived, whereas the lower one is predicted using the ‘Original mode’ of the Prosit tool, available online via the Universal Spectrum Explorer49. This spectrum is the highest scoring PSM (with Andromeda) for AMELX sequence positions spanning the most abundantly covered SAP differentiating between CMNFV59632 and all other rhinocerotids for which sequences are available. In CMNFV59632, the more ancestral variant (YIDFSYEVLTPLK), shared with horses and others, is recovered instead.

Regardless of the mechanisms behind preferential mass spectrometric and data analysis identification of specific sequence regions, biases favouring the recovery17 and identification43 of conserved peptide sequences can ultimately lead to underestimates of divergence times in taxa represented by empirically derived protein sequences. To accurately estimate the phylogenetic position of CMNFV59632 and estimate divergence times in the group, we completed a phylogenetic analysis of a suite of extinct and extant perissodactyls. In addition to the perissodactyl taxa previously used2, we incorporated whole-genome sequence data to predict enamel protein sequences from the Siberian unicorn (Elasmotherium sibiricum) and a pair of extant tapirs (Tapirus terrestris and Tapirus indicus).

The time-calibrated phylogenetic analysis of enamel protein sequences under a fossilized birth–death (FBD) model infers CMNFV59632 as the earliest diverging rhinocerotid in the analysis, with Elasmotherium sibiricum being more closely related to Rhinocerotina (crown rhinoceroses) than to CMNFV59632 (Fig. 4). This phylogenetic hypothesis has also been supported by previous total-evidence analysis36. Also, our FBD analysis resolves the Early Pleistocene Stephanorhinus from Dmanisi as a sampled ancestor of the Middle Pleistocene Stephanorhinus from Fontana Ranuccio. Divergence time estimates place the split between CMNFV59632 and all other rhinocerotids during the Middle Eocene–Oligocene (around 41–25 Ma). The divergence between Elasmotheriinae and Rhinocerotina is reconstructed to have probably occurred in the Oligocene (around 34–22 Ma), which is younger than previous molecular clock estimates31.

Fig. 4: Time-calibrated phylogeny of Rhinocerotidae enamel proteomes.

The maximum a posteriori (MAP) tree was produced using RevBayes v.1.2.1 (ref. 50) (https://revbayes.github.io/) with a FBD model. Coloured bars at nodes represent 95% height posterior density age interval estimates. Specimen CMNFV59632 represents the Early Miocene rhinocerotid from the Haughton Crater.

The Late Eocene and the Early Oligocene represent dynamic periods in the evolution of rhinocerotids, particularly in North America. After appearing in the Middle Eocene (37–34 Ma), North American rhinocerotids diversify during the Late Eocene, evolving a variety of body sizes and ecologies as several new clades arise, before rhinocerotid diversity experiences a significant drop in the Early Oligocene (34–32 Ma)44. During this time frame, other early-diverging lineages are also appearing in Asia27,45, eventually spreading as far as western Europe27. Morphologically, the Haughton Crater rhinocerotid shares closer affinities with these early-diverging lineages from Eurasia10, particularly those in the genus Epiaceratherium36. Similarly, some other vertebrates in the highly endemic fauna of the Haughton Formation have their closest relatives in Eurasia. These include the transitional pinniped Puijila darwini, sister to the Oligocene Potamotherium of Europe11, and a swan, family Anatidae, a group that is otherwise restricted to the Oligocene and Miocene of Europe10. Overall, these patterns, in conjunction with the recovered divergence times, suggest the Haughton Crater rhinocerotid represents a migrant from eastern Asia or western Europe, derived from one of the early-diverging lineages that arose in the Late Eocene or early Oligocene of East Asia.

We provide molecular evidence that this lineage falls outside Rhinocerotinae, because it diverges before the Rhinocerotinae–Elasmotheriinae split. We also reject a deep divergence (basal split) between Elasmotheriinae and Rhinocerotinae8,9,29 and find moderate support for their branching event after the divergence of Epiaceratherium. Our analysis disagrees with that in ref. 9, which noted a deep divergence for Elasmotheriinae (47.3 Ma), and an early divergence for Rhinocerotinae (almost 30.8 Ma). The later divergence times for these nodes in our analysis are despite equivalently old ages for crown Ceratomorpha (earliest Eocene). Among other timetrees, our dates are generally most consistent with those reported in ref. 31. Our recovered topologies are also broadly similar to trees derived from previous morphology-based phylogenetic analyses27,32, identifying Elasmotheriinae and Rhinocerotinae as deeply nested in Rhinocerotidae. Discrepancies between the genomic31 and proteomic trees arise probably because of different calibration points. The more ancient age of Elasmotheriinae in the analysis in ref. 31 is constrained by a high minimum bound for the Elasmotheriinae–Rhinocerotinae split (35 Ma). However, this date is based on the earliest age of Epiaceratherium naduongense and its allocation to Rhinocerotinae. Assuming monophyly of Epiaceratherium, the present proteomic evidence refutes the assignment of this genus to Rhinocerotinae, because it falls as earlier-diverging than Elasmotheriinae without such topological constraints in our phylogenetic analysis.

In sum, these findings highlight the importance of integrating palaeoproteomic sequence data into phylogenetic analyses to infer topologies and estimate divergence times. Ancient proteomic sequence data allow for robustly supported timetrees, and can serve to develop phylogenetic frameworks in deep time, particularly from specimens too old to preserve aDNA. For example, the present data allow for firm placement of the Haughton Crater rhinocerotid outside Rhinocerotina, and probably outside the Elasmotheriinae–Rhinocerotinae clade, a result that has significant implications for both morphological and molecular studies that integrate calibration times from the fossil record. In the future, fully characterizing these deep divergences in Rhinocerotidae will require protein sequence data from Aceratheriinae, a group that includes the late-surviving Pliocene Shansirhinus from high-altitude deposits in the Linxia Basin46, a region that has been shown to be amenable to ancient protein survival into the Miocene5.

Our experimental results and thermal age calculations firmly indicate that at least some high-latitude fossiliferous deposits preserve not only the tangible remains of extinct organisms, but also ancient biomolecules. Further experiments on a broader sample of fossils from this site can reveal whether this exceptional preservation is an isolated case or extends across the Haughton Formation. The latter scenario would support the notion that the Haughton Formation could potentially represent a new type of lagerstätte34—a palaeomolecular lagerstätte—showing preservation of subfossil peptides from a time range when they are otherwise not known. These findings should encourage further vertebrate palaeontological field work in the High Arctic, and other cold-temperature sites, with a goal that includes identifying taphonomic conditions favourable to such remarkable biomolecular preservation. The survival of an extended set of mammalian enamel peptides in the Early Miocene demonstrates that the research scopes of palaeoproteomics and palaeobiogeochemical analyses focusing on proteagens from the Palaeogene and beyond19,21 can finally overlap. The complementary integration25,47 of these two approaches can ultimately lead us to better define a unifying framework for understanding the degradation pathways from intact biological proteins and polypeptide chains to very short, unsequenceable, oligopeptides and isolated amino acids. More broadly, this work illustrates the power of palaeoproteomics in elucidating phylogeny and taxonomy of extinct vertebrates in deep time.

Methods

Site and specimen

Located in the Haughton impact crater (75° N, Nunavut, Canada), the Haughton Formation comprises the remnants of a large, post-impact lacustrine deposit, dated to the Early Miocene. Previous dating estimates, using fission-track and ⁴⁰Ar–³⁹Ar furnace step-heating dating, identified an age of 24–21 Ma (refs. 33,51). An Early Miocene age has also been corroborated by (U-Th)/He thermochronology52. Although older age estimates between 30 and 40 Ma have also been suggested53,54,55, there have been no age estimates younger than the Early Miocene. Therefore, we conservatively use the younger Early Miocene age estimates in our analysis and interpretation.

The highly endemic fauna of the Haughton Formation consists of several vertebrate taxa, including a transitional pinniped11, a pair of salmoniform fishes, a swan-like anatid, a small artiodactyl, a leporid rabbit, a heterosoricid shrew and a well-preserved rhinocerotid10,36. Although the megafloral assemblage is not particularly rich, the palynofloral assemblage is well-characterized, allowing for reconstruction of local climatic conditions. In the Early Miocene, the Haughton Crater lake and its surrounding environs experienced a significantly warmer annual temperature (8–12 °C) than the present day10,56.

Specimen CMNFV59632 is a nearly complete rhinocerotid skeleton, including skull and dentition, uncovered 10.8 m above the base of the formation36. Our analysis focuses on a single tooth fragment from a lower left m1 (Fig. 1b and Supplementary Information) that was already separated from the rest of its tooth row because of the fragmenting effects of cryoturbation36. The dental specimen’s rhinocerotid affinities are further supported by its size and morphology (Supplementary Information), most notably the presence of vertical Hunter–Schreger bands on its enamel, a defining feature of rhinocerotids that is found in few other mammals57. A single tusk fragment (left i2) derived from CMNFV59632 was also selected for proteomic extraction. Owing to its thin enamel, only limited peptides were recovered from this tusk fragment, and the sample is thus excluded from further analysis and discussion.

Proteomic extraction

The laboratory workflow for the CMNFV59632 teeth and the Fontana Ranuccio Stephanorhinus tooth (for comparison) generally follows that reported in refs. 2,35. Using a sterilized drill, flakes of enamel were removed from the fragmentary teeth, with care taken to avoid sampling the dentine. The CMNFV59632 tooth enamel sample, weighing 154 mg, was then ground to a fine powder and demineralized overnight using 10% high-performance liquid chromatography (HPLC)-grade trifluoroacetic acid (TFA) (Merck, Sigma-Aldrich) in high-purity liquid chromatography–mass spectrometry (LC–MS) grade water. The CMNFV59632 tusk enamel sample, weighing 90 mg, was processed in the same way. The Fontana Ranuccio (FR sd-295) enamel sample was divided into three subsamples—FR2, FR3 and FR4—weighing 202, 243 and 205 mg, respectively, which were similarly ground to a fine powder, and demineralized using 10% TFA (FR3, FR4) or 10% HCl (FR2) in high-purity LC–MS grade water. For each sample, the demineralization step was repeated a second time to ensure complete demineralization. No enzymatic digestion was performed. Subsequently, peptides were collected and desalted on C18 StageTips58 produced in-house. An extraction blank for each sample set was processed alongside the samples for every step, to control for contamination.

Mass spectrometry

StageTips were eluted with 30 µl of 40% acetonitrile (ACN) and 0.1% formic acid in high-purity LC–MS grade water, into a 96-well plate. To remove ACN and concentrate the samples, the plate was vacuum-centrifuged until approximately 3 µl of sample remained in each well. Next, samples were resuspended in 6 µl of 5% ACN, 0.1% formic acid in high-purity LC–MS grade water, and 4 µl (CMNFV59632) or 5 µl (FR sd-295 Stephanorhinus) of sample was injected.

Liquid chromatography coupled with tandem mass spectrometry was used to analyse the samples, on the basis of previously published protocols2,59. Samples were separated on a 15-cm column (75 μm inner diameter, laser pulled in-house and packed with 1.9-μm C18 beads (Dr Maisch)) on an EASY-nLC 1200 (Proxeon) connected to an Exploris 480 (CMNFV59632) or a Q-Exactive HF-X (Fontana Ranuccio Stephanorhinus) mass spectrometer (both Thermo Fisher Scientific). Buffer A contained 0.1% formic acid in MilliQ water, and the peptides were separated with increasing buffer B (80% ACN, 0.1% formic acid in MilliQ water) over a 77-min gradient: buffer B was increased from 5% to 30% in 50 min, from 30% to 45% in 10 min and from 45% to 80% in 2 min, maintained at 80% for 5 min, decreased to 5% in 5 min and finally held at 5% for 5 min. The flow rate was 250 nl min−1. An integrated column oven was used to maintain the temperature at 40 °C.
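As a consistency check on the gradient described above, the segments can be written as a small table and summed. The sketch below is illustrative only: it assumes linear ramps within each segment, and the function names are ours rather than part of any instrument software.

```python
# (start %B, end %B, duration in min) for each segment listed in the text.
GRADIENT = [
    (5, 30, 50),   # main separation ramp
    (30, 45, 10),
    (45, 80, 2),
    (80, 80, 5),   # hold at high organic
    (80, 5, 5),    # return to starting conditions
    (5, 5, 5),     # final hold
]


def total_minutes(segments):
    """Total run time as the sum of the segment durations."""
    return sum(duration for _, _, duration in segments)


def percent_b_at(minute, segments):
    """Buffer B percentage at a given time, ASSUMING linear ramps."""
    elapsed = 0
    for start, end, duration in segments:
        if minute <= elapsed + duration:
            fraction = (minute - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    raise ValueError("time is beyond the end of the gradient")


print(total_minutes(GRADIENT))     # matches the stated 77-min gradient
print(percent_b_at(25, GRADIENT))  # midway through the first ramp
```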

The two mass spectrometers were run using the same parameters except where specified, owing to changes in running software. Spray voltage was set to 2 kV, the S-lens RF (radio frequency) level was set to 40%, and the heated capillary was set to 275 °C. Full-scan mass spectra (MS1) were recorded at a resolution of 120,000 at m/z 200 over the m/z range 350–1,400. The AGC (automatic gain control) target value was set to 300% (Exploris) or 3 × 10⁶ (HF-X) with a maximum injection time of 25 ms. HCD (higher-energy collisional dissociation)-generated product ions (MS2) were recorded in data-dependent top-10 mode and recorded at a resolution of 60,000. The maximum ion injection time was 118 ms (Exploris) or 108 ms (HF-X), with an AGC target value of 200% (Exploris) or 2 × 10⁵ (HF-X). Normalized collision energy was set at 30% (Exploris) or 28% (HF-X). The isolation window was set to 1.2 m/z with a dynamic exclusion of 20 s. A wash-blank, using 5% ACN, 0.5% TFA, was run between each sample and laboratory blank to limit cross-contamination.

Database construction

The protein reference alignment given in ref. 2 was used as a starting point to construct a database for sequence reconstruction. Owing to the vast evolutionary distance between CMNFV59632 and any extant taxa (>20 Myr), a broader database was constructed to identify sequence variants that may be known in other mammals. To construct a broader database, we searched UniProt and the National Center for Biotechnology Information for each enamel protein, specifying the taxonomic grouping of ‘Theria’ to include all therian mammals. To supplement available sequences, others were manually extracted from available genomes, following the methodology reported previously60.

To investigate the relationships at the base of Rhinocerotidae, protein sequences translated from Elasmotherium sibiricum genomic data31 were generated. To obtain the corresponding amino acid sequences, we first collapsed the paired-end reads and masked the conflict bases as ‘N’ using AdapterRemoval61. We then mapped the collapsed reads against the reference genome of the white rhinoceros (GCF_000283155.1_CerSimSim1) using the BWA MEM function62, discarding the shorter split hits. Next, we removed duplicates using an in-house Perl script following ref. 31. Finally, we extracted the gene sequences according to their locations on the reference genome.

The remaining steps generally follow the workflow outlined in ref. 2. We used ANGSD63 to generate consensus sequences from BAM files corresponding to chromosomes that include genes of interest. To reduce the effects of post mortem aDNA damage, we trimmed the first and last five nucleotides from each DNA fragment. We formatted each consensus sequence as a blast nucleotide database. To recover translated protein sequences, we performed a tblastn alignment64, with the corresponding Ceratotherium simum sequences as queries. Finally, we used ProSplign to recover the spliced alignments, and ultimately, the translated protein sequences65.
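The end-trimming step can be illustrated with a minimal sketch. It is not the implementation used in this workflow, and the function name and example read are hypothetical. It only shows the operation described above: dropping the first and last five nucleotides of each fragment, where post mortem damage is concentrated.

```python
def trim_fragment_ends(fragment, n=5):
    """Remove n nucleotides from each end of a DNA fragment.

    Fragments too short to retain any bases after trimming are returned
    as an empty string (an ASSUMPTION; the text does not say how such
    fragments were handled).
    """
    if len(fragment) <= 2 * n:
        return ""
    return fragment[n:-n]


# HYPOTHETICAL read, for illustration only.
print(trim_fragment_ends("AAAAACGTACGTTTTT"))
```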

Protein identification

Thermo Fisher Scientific .raw files generated using the mass spectrometers were searched with various software using an iterative search strategy to interpret spectra, characterize modifications and ultimately, reconstruct protein sequences. For comparison, .raw files from a medieval ovicaprine (control) and an Early Pleistocene Stephanorhinus generated previously2 were also analysed. Among samples from the Fontana Ranuccio Stephanorhinus, only FR4 was analysed. Although it is possible that some inter-sample variation between CMNFV59632 and these samples is caused by analysis using different mass spectrometer models, according to benchmarking studies66,67, it is unlikely that the overall trends and interpretations would change.

We primarily used MaxQuant68 for sequence reconstruction and other downstream aspects of data analysis. We performed two initial runs: (1) a more focused run using the database we modified from ref. 2, and (2) a broad run using the ‘Theria’-wide database we constructed from publicly available sequences.

In all runs, an Andromeda score threshold of 40 and a delta score of 0 were set for both unmodified and modified peptides. Minimum and maximum peptide lengths were specified as 7 and 25, respectively. The default peptide false discovery rate was used (0.01), whereas the protein false discovery rate was increased to 1 to show possible low-abundance proteins. Error tolerances were kept at the default settings for Orbitrap MS instruments: 20 ppm for the first search, 4.5 ppm for the final search and 20 ppm for the fragment ion. ‘Unspecific’ digestion was specified. No fixed post-translational modifications were set. Several modifications were set as variable modifications in our initial runs: glutamine and asparagine deamidation (delta mass (ΔM) = +0.984016), methionine and proline oxidation (ΔM = +15.9949), N-terminal pyroglutamic acid from glutamine (ΔM = −17.026549) and glutamic acid (ΔM = −18.010565), phosphorylation of serine, threonine and tyrosine (ΔM = +79.966331), and the conversion of arginine to ornithine (ΔM = −42.021798).
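To make the variable modifications concrete, the sketch below applies the delta masses listed above to a monoisotopic peptide mass. It is an illustration rather than a description of how MaxQuant works internally. The residue masses are standard monoisotopic values, whereas the dictionary layout and function names are assumptions introduced here.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

# name -> (delta mass from the text, target residues, N-terminal only?)
VARIABLE_MODS = {
    "deamidation": (0.984016, "NQ", False),
    "oxidation": (15.9949, "MP", False),
    "pyro_glu_from_Q": (-17.026549, "Q", True),
    "pyro_glu_from_E": (-18.010565, "E", True),
    "phosphorylation": (79.966331, "STY", False),
    "arg_to_ornithine": (-42.021798, "R", False),
}


def peptide_mass(sequence, mods=()):
    """Neutral monoisotopic mass of a peptide plus the named modifications."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    for name in mods:
        delta, targets, n_term_only = VARIABLE_MODS[name]
        candidates = sequence[:1] if n_term_only else sequence
        if not any(aa in targets for aa in candidates):
            raise ValueError(f"{name} has no target residue in {sequence}")
        mass += delta
    return mass


def precursor_mz(mass, charge):
    """m/z of the protonated precursor at the given charge state."""
    return (mass + charge * PROTON) / charge


# The AMELX peptide shown in Fig. 3, unmodified, at charge 2+.
print(precursor_mz(peptide_mass("YIDFSYEVLTPLK"), 2))
```

The sketch only checks that a target residue exists somewhere in the peptide. It does not localize the modification or prevent the same site from being counted twice.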

Proteins included in the database of common contaminants provided by MaxQuant (for example, proteinaceous laboratory reagents and human skin keratins), as well as reverse sequences, were removed manually and not examined further. Proteins detected in the laboratory blank were also treated as contaminants and not considered further.

To discover new SAPs and peptide variants not included in our database, we used additional search tools. Peaks v.7.0 was used to attempt de novo sequencing, and a homology search was performed using the SPIDER algorithm69,70,71. The open search capabilities of openPFind72 and MSFragger73 were also used. When possible, the same settings were selected as in the MaxQuant runs.

With our iterative search strategy, we integrated possible sequence variants from the results of our de novo, homology and open searches into hypothetical sequences from closely related taxa, to produce artificial sequences. These artificial sequences were included in a subsequent MaxQuant search, and only incorporated into reconstructed sequences if identified and validated using MaxQuant.
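The construction of artificial sequences can be pictured with a minimal sketch, assuming each candidate variant is a single amino acid substitution given as a 1-based position, the reference residue and the proposed residue. The function names and the example sequence are hypothetical. The sketch covers only how candidate database entries might be generated; validation in a subsequent search is a separate step.

```python
def apply_substitution(reference, position, ref_residue, new_residue):
    """Return a copy of `reference` with one residue replaced (1-based position)."""
    index = position - 1
    if reference[index] != ref_residue:
        raise ValueError(
            f"expected {ref_residue} at position {position}, found {reference[index]}"
        )
    return reference[:index] + new_residue + reference[index + 1:]


def build_artificial_sequences(reference, candidate_saps):
    """Produce one artificial sequence per candidate substitution.

    `candidate_saps` is an iterable of (position, ref_residue, new_residue)
    tuples. Keys of the result follow the common 'A3S' notation.
    """
    return {
        f"{ref}{pos}{new}": apply_substitution(reference, pos, ref, new)
        for pos, ref, new in candidate_saps
    }
```

For a hypothetical reference fragment `"MKALLVLT"`, the candidate `(3, "A", "S")` yields the entry `"A3S"` with sequence `"MKSLLVLT"`, which would then be added to the database for the next search round.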

Sequence reconstruction and filtering

Before sequence reconstruction, all non-redundant PSMs were filtered using three criteria to reconstruct only those peptide sequences and amino acid residues that we can confidently assign. Sequences were accepted at two levels, resulting in two different datasets: (1) a minimally filtered dataset, and (2) a strictly filtered dataset. Filtering starts with a Basic Local Alignment Search Tool (BLAST)74 search to determine whether peptides match any contaminants beyond those included in MaxQuant by default, such as soil bacteria and fungi. At this stage, PSMs are discarded if they match any reasonable candidate contaminant, if they are also identified in the blank, or if they present poorly covered ion series. The resulting PSMs are used to reconstruct sequences for the ‘minimally filtered dataset’.

Next, ion series coverage is examined for each PSM. At this stage, peptide sequences are accepted for the strictly filtered dataset only if each amino acid residue is covered (for example, at least one y-, b- or a-ion designates the mass of that specific amino acid, plus any identified modifications) by at least two spectra, following the approach outlined previously75. Also, for both strictly and minimally filtered datasets, poorly supported spectra are removed at this stage, and proteins are only submitted for phylogenetic analysis if they are covered by at least two non-overlapping peptides. Finally, under the strict filtering criteria, BLAST is used again on any trimmed sequences, to remove any that match contaminants.
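The two strict-filtering rules described above (every residue supported by at least two spectra, and every protein covered by at least two non-overlapping peptides) can be sketched as follows. The data layout is an assumption made for illustration: each PSM is treated as one spectrum and is represented by its 0-based start position in the protein plus a per-residue list of booleans marking whether a fragment ion covers that residue. All function names are hypothetical.

```python
from collections import Counter


def residue_support(psms):
    """Count, per protein position, how many spectra cover that residue."""
    support = Counter()
    for start, covered in psms:
        for offset, is_covered in enumerate(covered):
            if is_covered:
                support[start + offset] += 1
    return support


def strictly_supported_positions(psms, min_spectra=2):
    """Positions whose residue is covered by at least `min_spectra` spectra."""
    support = residue_support(psms)
    return sorted(pos for pos, count in support.items() if count >= min_spectra)


def has_two_non_overlapping_peptides(intervals):
    """True if any two half-open (start, end) peptide intervals do not overlap.

    Two such peptides exist exactly when the earliest end is not later
    than the latest start.
    """
    if len(intervals) < 2:
        return False
    earliest_end = min(end for _, end in intervals)
    latest_start = max(start for start, _ in intervals)
    return earliest_end <= latest_start
```

Positions that fail the two-spectra rule would be trimmed from the strictly filtered sequence, and a protein whose peptides all overlap would not be passed on to phylogenetic analysis.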

Intra-crystalline protein decomposition analysis

We analysed chiral amino acids in CMNFV59632 to evaluate the overall extent of amino acid degradation in the intra-crystalline fraction of the enamel. This enabled comparison with previously analysed specimens2, with samples that had been heated experimentally to between 60 and 80 °C for up to 17,520 h, and with samples heated to 200–500 °C for up to 25 min. Enamel chips were drilled using a Dremel 4000 (kit 4000-1/45) drill with a diamond wheel point (4.4 mm (7105) by Dremel) to remove any dentine, which could be identified under a microscope (ZEISS Stemi 305, Axiocam 105 R2). Samples were processed following the methods in ref. 76. To remove excess powder, enamel chips were washed in deionized water and ethanol (analytical grade) before being powdered with an agate pestle and mortar. Powdered samples were weighed into a single plastic microcentrifuge tube and bleached (NaOCl, 12%, 50 μl mg⁻¹ of enamel) for 72 h to remove the inter-crystalline amino acids and any contamination. This bleached sample was washed five times with deionized water and then once with methanol (HPLC grade), before being left to dry overnight.

The dried bleached sample was then divided into four subsamples: two for technical replicate analysis of the FAA and two for replicate analysis of the THAA. The THAA subsamples were dissolved in HCl (7 M, 20 μl mg⁻¹, analytical grade) in a sterile 2-ml glass vial (Wheaton), purged with N₂ to reduce oxidation and heated at 110 °C for 24 h in an oven (BINDER series). The acid was then removed by centrifugal evaporation (Christ RVC2-25). THAA and FAA fractions were subjected to a biphasic separation procedure76,77 to remove inorganic phosphate from the enamel samples. HCl was added to both FAA (1 M, 25 μl mg⁻¹) and THAA (1 M, 20 μl mg⁻¹) fractions in separate 0.5-ml plastic microcentrifuge tubes (Eppendorf), and KOH (1 M, 28 μl mg⁻¹) was added into the acidified solutions, which then formed monophasic cloudy suspensions. Samples were agitated and then centrifuged (13,000 rpm for 10 min, Progen Scientific GenFuge 24D) to form a clear supernatant above a gel. The supernatant was removed and dried by vacuum centrifugation. The concentration of the intra-crystalline amino acids and their extent of racemization (d/l value) were then quantified using reverse-phase HPLC (Agilent 1100 series HPLC fitted with HyperSil C18 base deactivated silica column (5 μm, 250 × 3 mm) and fluorescence detector) following a modified method from ref. 78.

For the reverse-phase HPLC analysis, samples were rehydrated with an internal standard solution (l-homo-arginine (0.01 mM), sodium azide (1.5 mM) and HCl (0.01 M)), and run alongside standards and blanks. A ternary mobile phase system (HPLC grade ACN–methanol–sodium buffer; 21 mM sodium acetate trihydrate, sodium azide, 1.3 μM EDTA, pH adjusted to 6.00 ± 0.01 with 10% acetic acid and sodium hydroxide) was used for analysis. The d and l peaks of the following amino acids were separated: aspartic acid and asparagine; glutamic acid and glutamine; serine, alanine, valine, phenylalanine, isoleucine, leucine, threonine, arginine, tyrosine and glycine. During preparation, asparagine and glutamine undergo rapid irreversible deamidation to aspartic acid and glutamic acid, respectively79, and hence they are reported together as aspartic acid and asparagine, and glutamic acid and glutamine. One of the experimentally heated samples (300 °C for 10 min) was also analysed using liquid chromatography coupled with tandem mass spectrometry with minor changes to the protocol reported in ref. 2 (Extended Data Fig. 7 and Supplementary Information).
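As a rough illustration of how a racemization value could be derived from the chromatograms, the sketch below takes the d/l value of an amino acid as the ratio of its d and l peak areas and averages the two technical replicates of a fraction. This is an assumption made for illustration, with hypothetical function names and peak areas; the actual quantification follows the modified method cited above.

```python
from statistics import mean


def dl_value(d_area, l_area):
    """d/l value taken as the ratio of d to l peak areas (assumed definition)."""
    if l_area <= 0:
        raise ValueError("l peak area must be positive")
    return d_area / l_area


def replicate_mean_dl(replicates):
    """Average d/l value over technical replicates.

    `replicates` is an iterable of (d_area, l_area) pairs, one per replicate.
    """
    return mean(dl_value(d, l) for d, l in replicates)
```

With hypothetical peak areas of (30, 100) and (34, 100) for two replicates, the mean d/l value would be 0.32.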

Phylogenetic analysis

A time-calibrated phylogenetic tree was inferred with the Bayesian phylogenetic software RevBayes v.1.2.1 (ref. 50) (https://revbayes.github.io/) under a constant-rate FBD model80,81. The dataset consisted of enamel proteome data for 16 perissodactyl species (10 extant and 6 extinct), totalling 7 proteins and 3,446 amino acids. Phylogenetic analyses were performed with both the strictly filtered and minimally filtered sequences for CMNFV59632, to observe any topological differences between the two datasets and assess whether filtering is warranted. Because no major differences were observed, only the results from the ‘strictly filtered’ dataset are discussed. The proteome dataset was partitioned by protein. A General Time Reversible + Invariant sites (GTR + I) amino acid substitution model, in which stationary frequencies of the 20 amino acids and exchangeability rates among amino acids are free to vary and estimated from data, was applied to each partition. Preliminary unrooted phylogenetic analyses performed on each protein showed evidence for within-protein Γ-distributed rate variation only for MMP20, hence Γ-distributed rate variation was modelled only for the MMP20 partition. A relaxed clock model with uncorrelated lognormal-distributed rates was applied to allow rate variation across branches. The prior on the average clock rate was set as a log uniform distribution (min = 10⁻⁸, max = 10⁻² substitutions per lineage per million years). The prior on the clock rate standard deviation was set as an exponential distribution with mean equal to 0.587405, corresponding to one order of magnitude of clock rate variation among branches. The FBD tree model allows for placement of extinct species in a phylogenetic tree while simultaneously estimating the rates of speciation, extinction and fossilization (sampling of species in the past). The priors on speciation, extinction and fossilization parameters were set as uniform distributions bounded between 0 and 10.
The sampling probability for extant species was fixed to 0.5882353 (10/17), corresponding to the fraction of extant perissodactyl species included in the analysis, and assuming uniform sampling of extant taxa. The three species of Equidae in the analysis (Equus caballus, Equus przewalskii and Equus asinus) were constrained as outgroup to other perissodactyls (Tapiridae and Rhinocerotidae). Tip ages of fossil taxa were given a uniform prior distribution ranging from the minimum to maximum age of the deposit in which each fossil has been found. The prior on the origin age of the tree was set as a uniform distribution with minimum = 54 Ma, corresponding to the oldest fossil that can be unequivocally assigned to crown Perissodactyla (Cambaylophus vastanensis from the Early Ypresian Cambay Shale82), and maximum = 100 Ma, corresponding to the beginning of the Late Cretaceous and a very lax upper boundary on the origin of placental mammals83. Further constraints on node ages on the basis of the fossil record of perissodactyls were set to improve the precision of divergence age estimates. Each node calibration was set up as a soft-bounded uniform distribution with normally distributed tails, with 2.5% of the distribution younger than the minimum age (allowing for potential misattribution of the oldest fossil of a clade) and 2.5% of the distribution older than the maximum age. Monophyly was not enforced when setting up these node calibrations. The following age constraints have been applied to five nodes:
(1) Node = crown Perissodactyla; soft minimum = 54 Ma, with the same justification as the minimum on the origin age prior; soft maximum = 66 Ma, corresponding to the Cretaceous/Palaeogene boundary, before which no unambiguous crown placental fossils are known.
(2) Node = Rhinocerotina (crown rhinoceroses); soft minimum = 22.6 Ma, corresponding to the earliest putative appearance of a crown rhinoceros in the fossil record (Gaindatherium cf. browni from the Aquitanian upper member of the Chitarwata Formation84,85); soft maximum = 44 Ma, corresponding to the minimum age of Rhinocerotidae as supported by fossil and phylogenetic evidence31.
(3) Node = Diceroti (Ceratotherium + Diceros); soft minimum = 5.3 Ma, corresponding to the minimum age of the oldest deposits yielding Diceros bicornis fossils (Lothagam and Albertine86,87); soft maximum = 7.3 Ma, as in ref. 31.
(4) Node = Rhinoceros unicornis + Rhinoceros sondaicus; soft minimum = 1.9 Ma, corresponding to the Early Pleistocene appearance of Rhinoceros unicornis in the fossil record88,89; soft maximum = 5.3 Ma, as in ref. 31.
(5) Node = Dicerorhinus + Stephanorhinus + Coelodonta; soft minimum = 13 Ma, corresponding to Middle Miocene remains of Dicerorhinus from the Middle Siwaliks of Pakistan31,90; soft maximum = 22.6 Ma, corresponding to the oldest crown rhinoceros fossil as in the soft minimum of calibration 2.
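Two of the fixed values in the model description can be reproduced with a short calculation. The sketch below assumes one common reading of 'one order of magnitude of clock rate variation': that the central 95% interval of the lognormal branch-rate distribution spans a factor of 10, so that the log-scale standard deviation equals ln(10) divided by twice the 97.5% standard normal quantile. That reading is an assumption, not a statement from the text; the variable names are hypothetical.

```python
import math
from statistics import NormalDist

# 97.5% quantile of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Log-scale standard deviation for which the central 95% interval of a
# lognormal distribution spans a factor of 10 (assumed interpretation).
clock_sd_mean = math.log(10) / (2 * z)

# Fraction of extant perissodactyl species sampled: 10 of 17.
extant_sampling_probability = 10 / 17

print(round(clock_sd_mean, 6))                # 0.587405
print(round(extant_sampling_probability, 7))  # 0.5882353
```

Under this reading, the exponential prior on the clock rate standard deviation is centred on the amount of among-branch variation that corresponds to a tenfold spread in rates.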

The Markov chain Monte Carlo (MCMC) analysis was set up as four independent runs, each running for 50,000 iterations and sampling every 10 iterations, averaging between 262.2 and 279.2 moves per iteration. Convergence between runs was checked by visual inspection and by calculating effective sample sizes of parameter estimates in Tracer v.1.7.2 (ref. 91). A maximum a posteriori (MAP) tree was calculated to summarize the posterior distribution of trees, with 20% burn-in. In the analysis of the minimally filtered dataset, one of the four runs was discarded from the MAP tree calculation, as it converged only in the last 10% of the chain.
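The sampling arithmetic implied by these settings can be sketched as follows. The sketch assumes that each run logs one sample per 10 iterations without an extra sample for the initial state, and that the 20% burn-in is removed from each run separately before pooling. Both points are assumptions, and the function names are hypothetical.

```python
def retained_samples(iterations, thinning, burn_in_fraction):
    """Number of samples left in one run after discarding burn-in."""
    n_samples = iterations // thinning
    n_burn = round(n_samples * burn_in_fraction)
    return n_samples - n_burn


def pool_after_burn_in(runs, burn_in_fraction=0.2):
    """Drop the leading burn-in fraction of each run and concatenate the rest."""
    pooled = []
    for samples in runs:
        n_burn = round(len(samples) * burn_in_fraction)
        pooled.extend(samples[n_burn:])
    return pooled
```

Under these assumptions, each run of 50,000 iterations contributes 4,000 post-burn-in samples, giving 16,000 samples when all four runs are pooled and 12,000 when one run is discarded.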

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.