Fig. 3: Dansylation-specialized in silico spectral library (DnsBank) constructed by DeepCDM.

a Structural information for molecules was selected and extracted from a public chemical database, Distributed Structure-Searchable Toxicity (DSSTox), and the molecules were then virtually derivatized. The spectra of these virtually dansylated molecules were predicted by the dansylation-specialized spectrum prediction model (Dns-MS) to construct DnsBank. b The spectral composition of DnsBank. 17.10% of the spectra belonged to dansylated amine-containing molecules, 37.34% to dansylated hydroxyl-containing molecules, 39.39% to dansylated carboxyl-containing molecules and 6.07% to didansylated molecules. c The spectrum-to-compound (MS2C) capability of DnsBank was benchmarked against public databases and compound identification tools using the hold-out test set. Top-k accuracy is shown for annotations of the 167 query spectra in the test set, obtained by searching DnsBank, PubChem with SIRIUS 4, and 6 libraries with CFM-ID 4.0, including the Human Metabolome Database (HMDB), MassBankJP/MassBankEU, the MassBank of North America (MoNA), and the predicted libraries generated by CFM-ID 4.0 from ChEBI, DSSTox and STOFF-IDENT. DnsBank correctly annotated 28.14% of the test set at top 1, 63.47% within the top 5, 76.05% within the top 10 and 91.02% within the top 25. CFM-ID 4.0 failed to annotate any dansylated molecules in the test set, although 3 molecules were found in its libraries by manual checking. SIRIUS 4 correctly annotated 7 dansylated molecules (4.19%) within the top 5 and 8 (4.79%) within the top 25. In total, manual checking found 15 dansylated molecules from the test set in PubChem, 8 of which were annotated by SIRIUS 4. d Top-k accuracy of the 167 query spectra in the test set, matched in DnsBank, for molecules containing amine, hydroxyl and carboxyl groups. Amine: 37.25% at top 1, 74.51% within the top 5, 86.27% within the top 10 and 92.16% within the top 25. Hydroxyl: 25.68% at top 1, 60.81% within the top 5, 71.62% within the top 10 and 91.89% within the top 25. Carboxyl: 21.43% at top 1, 54.76% within the top 5, 71.43% within the top 10 and 88.10% within the top 25.
Source data are provided as a Source Data file.
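
The top-k accuracy reported in panels c and d can be illustrated with a minimal sketch. It assumes each query spectrum has been reduced to the rank of its correct compound in the candidate list, or `None` if the correct compound was not returned. The function name `top_k_accuracy` and the example ranks are hypothetical and are not taken from DeepCDM or the Source Data file. Only the final line uses figures from the caption (7 of 167 query spectra).

```python
from typing import Dict, Optional, Sequence


def top_k_accuracy(
    ranks: Sequence[Optional[int]],
    ks: Sequence[int] = (1, 5, 10, 25),
) -> Dict[int, float]:
    """Return, for each k, the percentage of queries whose correct
    compound is ranked at position k or better (ranks are 1-based).

    A rank of None means the correct compound was not returned at all;
    such queries still count in the denominator.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = len(ranks)
    return {
        k: 100.0 * sum(1 for r in ranks if r is not None and r <= k) / total
        for k in ks
    }


# Hypothetical ranks for eight query spectra (illustration only,
# not data from the study).
example_ranks = [1, 3, 7, 12, None, 1, 30, 5]
accuracy = top_k_accuracy(example_ranks)
for k, pct in accuracy.items():
    print(f"top {k}: {pct:.2f}%")

# Percentage arithmetic with counts stated in the caption:
# 7 correctly annotated molecules out of 167 query spectra.
print(f"7 of 167: {100.0 * 7 / 167:.2f}%")
```

Because unannotated queries stay in the denominator, a tool that returns no correct candidate for most spectra is penalized rather than scored only on the spectra it answered.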