Abstract
Metadata plays an essential role in the analysis and dissemination of proteomics data. It annotates sample information for output tables from library searches and displays sample information from data files in public repositories. However, integrating metadata into data analysis can be time-consuming and is not well standardized. Inconsistent metadata formats in public repositories hinder other researchers’ ability to reproduce and reuse these public datasets. Here we present the metadata integration in MaxQuant, which provides a user-friendly way to export metadata as SDRF, the standard format that maps sample properties to proteomics data files. We also implemented the annotation of output tables with the SDRF file, enabling users to perform seamless downstream data analysis with annotated output tables. These features provide a simple and standardized approach to creating and leveraging standardized metadata, thereby facilitating data analysis and improving the reusability and reproducibility of public proteomics datasets.
Similar content being viewed by others
Introduction
Mass spectrometry-based proteomics is the study of expression, interaction and modification of all proteins in living organisms using mass spectrometry1. Advances in instrumentation and data acquisition methods have made it a more powerful tool for studying active biological processes by greatly improving the resolution, sensitivity, reproducibility and throughput of the measurement2,3,4,5. After the sample measurement, library search is performed to identify and quantify the peptides and proteins6. Then output tables that store the library search results need to be annotated with the metadata before the downstream data analysis, due to the fact that the column names of output tables are created based on the data files and may not directly reflect the sample-related properties.
This annotation of output tables could be very cumbersome for studies with large cohorts and complex experimental design. It’s also more complicated to annotate studies with multiplexed and fractionated experiments, since there is no direct relation between samples and data files. For multiplexed experiments, multiple samples are included in the same data file7. For fractionated samples, multiple data files are related to the same sample8. If the metadata are in inconsistent formats across datasets within a single study, such as when reanalyzing multiple public datasets, additional effort would be required for the manual annotation of the output tables9.
In order to address inconsistent and incomplete metadata in proteomics studies, the Sample and Data Relationship Format (SDRF) was introduced as a standardized format to store the proteomics sample metadata10. The SDRF is a tab-delimited text format that describes the relationship between samples and data files in proteomics studies. It consists of sample-related metadata, data file-related metadata and the variables under study. Submitting the SDRF file together with raw files to public repositories like ProteomeXchange11 is recommended to enhance the reusability and reproducibility of public datasets. Several tools have been developed for exporting12,13 or reading SDRF14,15 to promote this standard metadata format.
However, very few proteomics datasets currently include the SDRF file in data submissions12. First, creating the SDRF file remains complicated and is not mandatory for data submission. Currently, users need to fill the SDRF file row by row, and each row corresponds to one sample and data file relationship, therefore for some experimental setups many rows require the entry of repetitive information. For example, multiplexed experiments have repetitive data file properties, and fractionated experiments have repetitive sample properties. Second, users benefit little from creating the SDRF file for their own projects because it doesn’t significantly impact the data analysis. Creating the SDRF file requires documenting lots of required data file and sample properties, but most of the data file properties will not be taken into consideration in data analysis. Furthermore, sample properties in the SDRF file currently cannot be used to directly annotate output tables.
In this work, we address the aforementioned challenges by implementing the metadata integration feature in MaxQuant, a widely used software that supports proteomics data analysis from various instruments, quantification techniques and data acquisition modes16,17,18. This feature simplifies the process of creating an SDRF file. We also implement automatic metadata annotation of output tables using Perseus or the provided scripts to further assist users with their own data analysis.
Results and discussion
Overview of metadata integration in MaxQuant
In addition to the raw files and FASTA files, metadata integration is implemented in the MaxQuant workflow (Fig. 1). The SDRF file, which is created in the process, consists of all required sample properties, which are mainly filled out by users, as well as all required data file properties, which are extracted automatically by MaxQuant from the input files and user-defined parameters.
The SDRF file could play an essential role in both data dissemination and analysis. First, it could be uploaded to public repositories alongside raw files to improve the reproducibility and reusability of the datasets. Moreover, it can be used to automatically annotate the output tables from MaxQuant for Perseus, as well as other tools using the provided scripts written in Python and R.
Export metadata as an SDRF file in MaxQuant
Since MaxQuant v2.7.0, the feature to export metadata as an SDRF file has been implemented in the “Metadata” tab (Fig. 2). In MaxQuant, all required data file properties specified by SDRF would be extracted and converted to ontology terms and accession numbers when the SDRF file is written (Table 1). After setting up the parameters for the library search, click the “Refresh” button under the “Metadata” tab, and the corresponding metadata table will appear. To reduce the confusion caused by having too many columns, the metadata table only includes basic data file information and the required sample properties that users need to fill out. An additional column named “Group” is also included, representing the study variable “factor value[group]” in SDRF. This column doesn’t have to follow ontology-based rules, users can define any project-specific treatments or phenotypes that do not belong to one of the required sample properties.
The layout of the metadata table in MaxQuant differs from the SDRF, wherein each row represents one sample and data file relationship. In the metadata table, however, each row represents one sample. Since only sample properties need to be filled in to create an SDRF file in MaxQuant, the layout of metadata table is particularly user-friendly. The exact layout is also automatically generated according to the experimental settings, thereby eliminating the need to fill in repetitive information. For multiplexed dataset, there will be expanded rows corresponding to different labels for each raw file, all labels will be assigned identical data file properties. For fractionated dataset different raw files corresponding to the same sample will be collapsed into one row, these raw files will be assigned identical sample properties. For example, in an experiment with one TMT10 set fractionated into 20 fractions, the SDRF file contains 200 rows. Users only need to fill a 10-row metadata table in MaxQuant to generate the SDRF file. When MaxQuant writes the SDRF file, information from the metadata table will be converted to the SDRF layout and combined with corresponding data file properties.
The metadata table can be filled out in MaxQuant GUI or in the template exported by MaxQuant. There’s no maximum limit to the number of rows that can be generated in the metadata table. Users can either edit multiple rows at once in MaxQuant GUI, or fill out the exported template outside of MaxQuant. In the SDRF, some sample properties are required but might not be available for users’ projects (e.g. the ancestry category for human samples). Users can leave these columns empty, and MaxQuant will have the option to automatically fill them with “not available”. Then the exported SDRF file will have no missing values and can be submitted directly to public repositories without further editing. The SDRF file will be created later as one of the output tables after the library search, or immediately by clicking the “Write SDRF” button (Supplementary Data 1, example of SDRF file generated by MaxQuant).
In summary, the auto-adjusted layout and autofill feature of the metadata table in MaxQuant simplify the process of creating the SDRF file. Automatic extraction of data file properties saves users the effort of collecting and organizing information that rarely impacts their data analysis. It is especially helpful for proteomics facilities to provide a well-annotated SDRF file to clients without the analytical expertise to annotate the data file properties. The possibility to edit multiple rows simultaneously and read the template file improves efficiency especially when handling large datasets. By default, the SDRF file becomes one of the regular tables written in every MaxQuant run. This feature will encourage the deposit of standardized SDRF metadata in public proteomics repositories, helping to improve the reproducibility and reusability of public datasets.
Annotate the MaxQuant output tables with the SDRF file for downstream data analysis
In MaxQuant, “Experiment” is the unique identifier for each sample, except for multiplexed datasets, in which each sample is represented as a combination of “Experiment” and label. In the SDRF, the source name has the same definition. Therefore, in the MaxQuant-exported SDRF file, the source name is identical to “Experiment” (combined with the label if the dataset is multiplexed). Thus, it can be used to annotate MaxQuant output tables.
Perseus is a widely used software that performs downstream omics data analysis such as pre-processing, statistical analysis and data visualization19. Since it was co-developed with MaxQuant, it is especially compatible with MaxQuant output tables. The feature that annotates MaxQuant output tables with the SDRF file has been implemented since Perseus v2.1.5. Clicking “Read SDRF” under “Annot. rows” automatically adds all information from the SDRF file to the corresponding intensity columns as annotation rows. By default, “Skip repetitive properties” is selected to only add properties that are not identical between all samples as annotation rows (Fig. 3). With added annotation rows from the SDRF file, users can continue with data filtering, normalization and statistical analysis directly with the annotated output tables. Two scrips (https://github.com/cox-labs/converters), written in Python and R, are provided to convert MaxQuant output tables and the SDRF file into an expression matrix and a sample annotation table. These two files are required by other popular tools for downstream data analysis, such as DESeq220, limma21, and edgeR22. As long as the source name is identical to “Experiment” in MaxQuant, SDRF files created by other tools can also be used to annotate output tables with Perseus or can be converted by the provided scripts.
The SDRF was initially introduced to store metadata during data dissemination, which is one of the last steps in finalizing a proteomics study. Here, we extended the application of the SDRF file to output tables annotation, which is a key step in data analysis. As a result, this output table annotation feature will motivate users to create SDRF files because it helps with data analysis. As the SDRF was adapted from the MicroArray Gene Expression Tabular (MAGE-TAB)23, which is used to encode metadata annotations for RNA-Seq data24, SDRF files could also perform as a bridge for multi-omics data integration in the future.
Methods
Software development, requirements and usage
Metadata integration is a new feature implemented in MaxQuant v2.7.0 and Perseus v2.1.5. MaxQuant is developed using.NET8 and written in C# programming language. The command-line version can be run on Windows, Linux and macOS (provided that the vendor libraries are available for accessing the raw files). The graphical user interface (GUI) is only available on Windows at the moment. It can be downloaded from https://www.maxquant.org/maxquant. Perseus is developed using.NET8 and written in C# programming language. Its GUI is on Windows only at the moment. It can be downloaded from https://www.maxquant.org/perseus/.
The video tutorials about MaxQuant and Perseus are available at https://www.youtube.com/@MaxQuantChannel. The detailed tutorial on exporting the SDRF file and using it to annotate output tables is provided as PDF in the Supplementary Information. Alternatively, the video tutorial can be viewed on YouTube at https://youtu.be/fHCPOBXXRp8.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
No new mass spectrometry proteomics data was generated in the scope of this work. The 24-fraction TMT10 dataset displayed in Figs. 2 and 3 is a subset of the dataset from the CPTAC data portal under accession number S06025 and was obtained from Proteomic Data Commons (PDC) repository using bash script from https://github.com/esacinc/PDC-Public/tree/master/tools/downloadPDCData.
Code availability
Both MaxQuant and Perseus are freeware that can be used for all purposes, including commercial use. The partially open-source code was deposited at https://github.com/JurgenCox/mqtools7, under the CC BY-NC-ND 4.0 license. The scripts for annotating the output tables were created using Python v3.12.3 and R v4.5.0, and were deposited at https://github.com/cox-labs/converters, under the CC BY-NC-ND 4.0 license.
References
Guo, T., Steen, J. A. & Mann, M. Mass-spectrometry-based proteomics: from single cells to clinical applications. Nature 638, 901–911 (2025).
Eliuk, S. & Makarov, A. Evolution of orbitrap mass spectrometry instrumentation. Annu Rev. Anal. Chem. 8, 61–80 (2015).
Meier, F., Park, M. A. & Mann, M. Trapped ion mobility spectrometry and parallel accumulation-serial fragmentation in proteomics. Mol. Cell Proteom. 20, 100138 (2021).
Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell Proteom. 11, O111 016717 (2012).
Guzman, U. H. et al. Ultra-fast label-free quantification and comprehensive proteome coverage with narrow-window data-independent acquisition. Nat. Biotechnol. 42, 1855–1866 (2024).
Domon, B. & Aebersold, R. Challenges and opportunities in proteomics data analysis. Mol. Cell Proteom. 5, 1921–1926 (2006).
Pappireddi, N., Martin, L. & Wuhr, M. A review on quantitative multiplexed proteomics. Chembiochem 20, 1210–1224 (2019).
Manadas, B., Mendes, V. M., English, J. & Dunn, M. J. Peptide fractionation in proteomics approaches. Expert Rev. Proteom. 7, 655–663 (2010).
Griss, J., Perez-Riverol, Y., Hermjakob, H. & Vizcaíno, J. A. Identifying novel biomarkers through data mining-A realistic scenario? Proteom. Clin. Appl 9, 437–443 (2015).
Dai, C. X. et al. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat. Commun. 12, 5854 (2021).
Vizcaino, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).
Claeys, T. et al. lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. Nat. Commun. 14, 6743 (2023).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
Dai, C. et al. quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data. Nat. Methods 21, 1603–1607 (2024).
Zheng, P. et al. Ibaqpy: a scalable Python package for baseline quantification in proteomics leveraging SDRF metadata. J. Proteom. 317, 105440 (2025).
Ferretti, D. et al. Isobaric labeling update in MaxQuant. J. Proteome Res. 24, 1219–1229 (2025).
Sinitcyn, P. et al. MaxDIA enables library-based and library-free data-independent acquisition proteomics. Nat. Biotechnol. 39, 1563–1573 (2021).
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).
Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731–740 (2016).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Rayner, T. F. et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinforma. 7, 489 (2006).
Fullgrabe, A. et al. Guidelines for reporting single-cell RNA-seq experiments. Nat. Biotechnol. 38, 1384–1386 (2020).
Krug, K. et al. Proteogenomic landscape of breast cancer tumorigenesis and targeted therapy. Cell 183, 1436–1456 e1431 (2020).
Acknowledgements
The authors would like to thank Barbara Steigenberger for her explanation of the metadata management. This work was supported by the German Ministry for Science and Education funding action CLINSPECT-M [FKZ 161L0214E, to J.X.].
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
W.V. developed the new features in MaxQuant and Perseus. S.U. performed the tests on the new features. W.V., S.U., D.F., J.C. and J.X. conceptualized the project and designed the new features. J.C. and J.X. supervised the project and drafted the manuscript. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Yasset Perez-Riverol and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Viegener, W., Urazbakhtin, S., Ferretti, D. et al. Facilitating analysis and dissemination of proteomics data through metadata integration in MaxQuant. Nat Commun 16, 8421 (2025). https://doi.org/10.1038/s41467-025-64089-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467-025-64089-4