Facilitating analysis and dissemination of proteomics data through metadata integration in MaxQuant

Viegener, Walter; Urazbakhtin, Shamil; Ferretti, Daniela; Cox, Jürgen; Xiao, Jinqiu

doi:10.1038/s41467-025-64089-4

Download PDF

Article
Open access
Published: 25 September 2025

Facilitating analysis and dissemination of proteomics data through metadata integration in MaxQuant

Nature Communications volume 16, Article number: 8421 (2025) Cite this article

1455 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Metadata plays an essential role in the analysis and dissemination of proteomics data. It annotates sample information for output tables from library searches and displays sample information from data files in public repositories. However, integrating metadata into data analysis can be time-consuming and is not well standardized. Inconsistent metadata formats in public repositories hinder other researchers’ ability to reproduce and reuse these public datasets. Here we present the metadata integration in MaxQuant, which provides a user-friendly way to export metadata as SDRF, the standard format that maps sample properties to proteomics data files. We also implemented the annotation of output tables with the SDRF file, enabling users to perform seamless downstream data analysis with annotated output tables. These features provide a simple and standardized approach to creating and leveraging standardized metadata, thereby facilitating data analysis and improving the reusability and reproducibility of public proteomics datasets.

lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation

Article Open access 24 October 2023

A proteomics sample metadata representation for multiomics integration and big data analysis

Article Open access 06 October 2021

Mass spectrometry-based proteomics data from thousands of HeLa control samples

Article Open access 23 January 2024

Introduction

Mass spectrometry-based proteomics is the study of expression, interaction and modification of all proteins in living organisms using mass spectrometry¹. Advances in instrumentation and data acquisition methods have made it a more powerful tool for studying active biological processes by greatly improving the resolution, sensitivity, reproducibility and throughput of the measurement^2,3,4,5. After the sample measurement, library search is performed to identify and quantify the peptides and proteins⁶. Then output tables that store the library search results need to be annotated with the metadata before the downstream data analysis, due to the fact that the column names of output tables are created based on the data files and may not directly reflect the sample-related properties.

This annotation of output tables could be very cumbersome for studies with large cohorts and complex experimental design. It’s also more complicated to annotate studies with multiplexed and fractionated experiments, since there is no direct relation between samples and data files. For multiplexed experiments, multiple samples are included in the same data file⁷. For fractionated samples, multiple data files are related to the same sample⁸. If the metadata are in inconsistent formats across datasets within a single study, such as when reanalyzing multiple public datasets, additional effort would be required for the manual annotation of the output tables⁹.

In order to address inconsistent and incomplete metadata in proteomics studies, the Sample and Data Relationship Format (SDRF) was introduced as a standardized format to store the proteomics sample metadata¹⁰. The SDRF is a tab-delimited text format that describes the relationship between samples and data files in proteomics studies. It consists of sample-related metadata, data file-related metadata and the variables under study. Submitting the SDRF file together with raw files to public repositories like ProteomeXchange¹¹ is recommended to enhance the reusability and reproducibility of public datasets. Several tools have been developed for exporting^12,13 or reading SDRF^14,15 to promote this standard metadata format.

However, very few proteomics datasets currently include the SDRF file in data submissions¹². First, creating the SDRF file remains complicated and is not mandatory for data submission. Currently, users need to fill the SDRF file row by row, and each row corresponds to one sample and data file relationship, therefore for some experimental setups many rows require the entry of repetitive information. For example, multiplexed experiments have repetitive data file properties, and fractionated experiments have repetitive sample properties. Second, users benefit little from creating the SDRF file for their own projects because it doesn’t significantly impact the data analysis. Creating the SDRF file requires documenting lots of required data file and sample properties, but most of the data file properties will not be taken into consideration in data analysis. Furthermore, sample properties in the SDRF file currently cannot be used to directly annotate output tables.

In this work, we address the aforementioned challenges by implementing the metadata integration feature in MaxQuant, a widely used software that supports proteomics data analysis from various instruments, quantification techniques and data acquisition modes^16,17,18. This feature simplifies the process of creating an SDRF file. We also implement automatic metadata annotation of output tables using Perseus or the provided scripts to further assist users with their own data analysis.

Results and discussion

Overview of metadata integration in MaxQuant

In addition to the raw files and FASTA files, metadata integration is implemented in the MaxQuant workflow (Fig. 1). The SDRF file, which is created in the process, consists of all required sample properties, which are mainly filled out by users, as well as all required data file properties, which are extracted automatically by MaxQuant from the input files and user-defined parameters.

The SDRF file could play an essential role in both data dissemination and analysis. First, it could be uploaded to public repositories alongside raw files to improve the reproducibility and reusability of the datasets. Moreover, it can be used to automatically annotate the output tables from MaxQuant for Perseus, as well as other tools using the provided scripts written in Python and R.

Export metadata as an SDRF file in MaxQuant

Since MaxQuant v2.7.0, the feature to export metadata as an SDRF file has been implemented in the “Metadata” tab (Fig. 2). In MaxQuant, all required data file properties specified by SDRF would be extracted and converted to ontology terms and accession numbers when the SDRF file is written (Table 1). After setting up the parameters for the library search, click the “Refresh” button under the “Metadata” tab, and the corresponding metadata table will appear. To reduce the confusion caused by having too many columns, the metadata table only includes basic data file information and the required sample properties that users need to fill out. An additional column named “Group” is also included, representing the study variable “factor value[group]” in SDRF. This column doesn’t have to follow ontology-based rules, users can define any project-specific treatments or phenotypes that do not belong to one of the required sample properties.

**Fig. 2: Metadata table in MaxQuant GUI.**

Table 1 Overview of the columns from the SDRF that MaxQuant fills

Full size table

The layout of the metadata table in MaxQuant differs from the SDRF, wherein each row represents one sample and data file relationship. In the metadata table, however, each row represents one sample. Since only sample properties need to be filled in to create an SDRF file in MaxQuant, the layout of metadata table is particularly user-friendly. The exact layout is also automatically generated according to the experimental settings, thereby eliminating the need to fill in repetitive information. For multiplexed dataset, there will be expanded rows corresponding to different labels for each raw file, all labels will be assigned identical data file properties. For fractionated dataset different raw files corresponding to the same sample will be collapsed into one row, these raw files will be assigned identical sample properties. For example, in an experiment with one TMT10 set fractionated into 20 fractions, the SDRF file contains 200 rows. Users only need to fill a 10-row metadata table in MaxQuant to generate the SDRF file. When MaxQuant writes the SDRF file, information from the metadata table will be converted to the SDRF layout and combined with corresponding data file properties.

The metadata table can be filled out in MaxQuant GUI or in the template exported by MaxQuant. There’s no maximum limit to the number of rows that can be generated in the metadata table. Users can either edit multiple rows at once in MaxQuant GUI, or fill out the exported template outside of MaxQuant. In the SDRF, some sample properties are required but might not be available for users’ projects (e.g. the ancestry category for human samples). Users can leave these columns empty, and MaxQuant will have the option to automatically fill them with “not available”. Then the exported SDRF file will have no missing values and can be submitted directly to public repositories without further editing. The SDRF file will be created later as one of the output tables after the library search, or immediately by clicking the “Write SDRF” button (Supplementary Data 1, example of SDRF file generated by MaxQuant).

In summary, the auto-adjusted layout and autofill feature of the metadata table in MaxQuant simplify the process of creating the SDRF file. Automatic extraction of data file properties saves users the effort of collecting and organizing information that rarely impacts their data analysis. It is especially helpful for proteomics facilities to provide a well-annotated SDRF file to clients without the analytical expertise to annotate the data file properties. The possibility to edit multiple rows simultaneously and read the template file improves efficiency especially when handling large datasets. By default, the SDRF file becomes one of the regular tables written in every MaxQuant run. This feature will encourage the deposit of standardized SDRF metadata in public proteomics repositories, helping to improve the reproducibility and reusability of public datasets.

Annotate the MaxQuant output tables with the SDRF file for downstream data analysis

In MaxQuant, “Experiment” is the unique identifier for each sample, except for multiplexed datasets, in which each sample is represented as a combination of “Experiment” and label. In the SDRF, the source name has the same definition. Therefore, in the MaxQuant-exported SDRF file, the source name is identical to “Experiment” (combined with the label if the dataset is multiplexed). Thus, it can be used to annotate MaxQuant output tables.

Perseus is a widely used software that performs downstream omics data analysis such as pre-processing, statistical analysis and data visualization¹⁹. Since it was co-developed with MaxQuant, it is especially compatible with MaxQuant output tables. The feature that annotates MaxQuant output tables with the SDRF file has been implemented since Perseus v2.1.5. Clicking “Read SDRF” under “Annot. rows” automatically adds all information from the SDRF file to the corresponding intensity columns as annotation rows. By default, “Skip repetitive properties” is selected to only add properties that are not identical between all samples as annotation rows (Fig. 3). With added annotation rows from the SDRF file, users can continue with data filtering, normalization and statistical analysis directly with the annotated output tables. Two scrips (https://github.com/cox-labs/converters), written in Python and R, are provided to convert MaxQuant output tables and the SDRF file into an expression matrix and a sample annotation table. These two files are required by other popular tools for downstream data analysis, such as DESeq2²⁰, limma²¹, and edgeR²². As long as the source name is identical to “Experiment” in MaxQuant, SDRF files created by other tools can also be used to annotate output tables with Perseus or can be converted by the provided scripts.

**Fig. 3: Annotation of the protein groups table with the SDRF file in Perseus.**

The SDRF was initially introduced to store metadata during data dissemination, which is one of the last steps in finalizing a proteomics study. Here, we extended the application of the SDRF file to output tables annotation, which is a key step in data analysis. As a result, this output table annotation feature will motivate users to create SDRF files because it helps with data analysis. As the SDRF was adapted from the MicroArray Gene Expression Tabular (MAGE-TAB)²³, which is used to encode metadata annotations for RNA-Seq data²⁴, SDRF files could also perform as a bridge for multi-omics data integration in the future.

Methods

Software development, requirements and usage

Metadata integration is a new feature implemented in MaxQuant v2.7.0 and Perseus v2.1.5. MaxQuant is developed using.NET8 and written in C# programming language. The command-line version can be run on Windows, Linux and macOS (provided that the vendor libraries are available for accessing the raw files). The graphical user interface (GUI) is only available on Windows at the moment. It can be downloaded from https://www.maxquant.org/maxquant. Perseus is developed using.NET8 and written in C# programming language. Its GUI is on Windows only at the moment. It can be downloaded from https://www.maxquant.org/perseus/.

The video tutorials about MaxQuant and Perseus are available at https://www.youtube.com/@MaxQuantChannel. The detailed tutorial on exporting the SDRF file and using it to annotate output tables is provided as PDF in the Supplementary Information. Alternatively, the video tutorial can be viewed on YouTube at https://youtu.be/fHCPOBXXRp8.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

No new mass spectrometry proteomics data was generated in the scope of this work. The 24-fraction TMT10 dataset displayed in Figs. 2 and 3 is a subset of the dataset from the CPTAC data portal under accession number S060²⁵ and was obtained from Proteomic Data Commons (PDC) repository using bash script from https://github.com/esacinc/PDC-Public/tree/master/tools/downloadPDCData.

Code availability

Both MaxQuant and Perseus are freeware that can be used for all purposes, including commercial use. The partially open-source code was deposited at https://github.com/JurgenCox/mqtools7, under the CC BY-NC-ND 4.0 license. The scripts for annotating the output tables were created using Python v3.12.3 and R v4.5.0, and were deposited at https://github.com/cox-labs/converters, under the CC BY-NC-ND 4.0 license.

References

Guo, T., Steen, J. A. & Mann, M. Mass-spectrometry-based proteomics: from single cells to clinical applications. Nature 638, 901–911 (2025).
Article ADS CAS PubMed Google Scholar
Eliuk, S. & Makarov, A. Evolution of orbitrap mass spectrometry instrumentation. Annu Rev. Anal. Chem. 8, 61–80 (2015).
Article Google Scholar
Meier, F., Park, M. A. & Mann, M. Trapped ion mobility spectrometry and parallel accumulation-serial fragmentation in proteomics. Mol. Cell Proteom. 20, 100138 (2021).
Article CAS Google Scholar
Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell Proteom. 11, O111 016717 (2012).
Article Google Scholar
Guzman, U. H. et al. Ultra-fast label-free quantification and comprehensive proteome coverage with narrow-window data-independent acquisition. Nat. Biotechnol. 42, 1855–1866 (2024).
Article CAS PubMed PubMed Central Google Scholar
Domon, B. & Aebersold, R. Challenges and opportunities in proteomics data analysis. Mol. Cell Proteom. 5, 1921–1926 (2006).
Article CAS Google Scholar
Pappireddi, N., Martin, L. & Wuhr, M. A review on quantitative multiplexed proteomics. Chembiochem 20, 1210–1224 (2019).
Article CAS PubMed PubMed Central Google Scholar
Manadas, B., Mendes, V. M., English, J. & Dunn, M. J. Peptide fractionation in proteomics approaches. Expert Rev. Proteom. 7, 655–663 (2010).
Article CAS Google Scholar
Griss, J., Perez-Riverol, Y., Hermjakob, H. & Vizcaíno, J. A. Identifying novel biomarkers through data mining-A realistic scenario? Proteom. Clin. Appl 9, 437–443 (2015).
Article CAS Google Scholar
Dai, C. X. et al. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat. Commun. 12, 5854 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Vizcaino, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).
Article CAS PubMed PubMed Central Google Scholar
Claeys, T. et al. lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. Nat. Commun. 14, 6743 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
Article CAS PubMed PubMed Central Google Scholar
Dai, C. et al. quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data. Nat. Methods 21, 1603–1607 (2024).
Article CAS PubMed PubMed Central Google Scholar
Zheng, P. et al. Ibaqpy: a scalable Python package for baseline quantification in proteomics leveraging SDRF metadata. J. Proteom. 317, 105440 (2025).
Article CAS Google Scholar
Ferretti, D. et al. Isobaric labeling update in MaxQuant. J. Proteome Res. 24, 1219–1229 (2025).
Article CAS PubMed PubMed Central Google Scholar
Sinitcyn, P. et al. MaxDIA enables library-based and library-free data-independent acquisition proteomics. Nat. Biotechnol. 39, 1563–1573 (2021).
Article CAS PubMed PubMed Central Google Scholar
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).
Article CAS PubMed Google Scholar
Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731–740 (2016).
Article CAS PubMed Google Scholar
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Article PubMed PubMed Central Google Scholar
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Article PubMed PubMed Central Google Scholar
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Article CAS PubMed Google Scholar
Rayner, T. F. et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinforma. 7, 489 (2006).
Article ADS Google Scholar
Fullgrabe, A. et al. Guidelines for reporting single-cell RNA-seq experiments. Nat. Biotechnol. 38, 1384–1386 (2020).
Article CAS PubMed PubMed Central Google Scholar
Krug, K. et al. Proteogenomic landscape of breast cancer tumorigenesis and targeted therapy. Cell 183, 1436–1456 e1431 (2020).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors would like to thank Barbara Steigenberger for her explanation of the metadata management. This work was supported by the German Ministry for Science and Education funding action CLINSPECT-M [FKZ 161L0214E, to J.X.].

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

These authors contributed equally: Walter Viegener, Shamil Urazbakhtin.

Authors and Affiliations

Computational Systems Biochemistry Research Group, Max Planck Institute of Biochemistry, Max Planck Institute of Biochemistry, Martinsried, Germany
Walter Viegener, Shamil Urazbakhtin, Daniela Ferretti, Jürgen Cox & Jinqiu Xiao

Authors

Walter Viegener
View author publications
Search author on:PubMed Google Scholar
Shamil Urazbakhtin
View author publications
Search author on:PubMed Google Scholar
Daniela Ferretti
View author publications
Search author on:PubMed Google Scholar
Jürgen Cox
View author publications
Search author on:PubMed Google Scholar
Jinqiu Xiao
View author publications
Search author on:PubMed Google Scholar

Contributions

W.V. developed the new features in MaxQuant and Perseus. S.U. performed the tests on the new features. W.V., S.U., D.F., J.C. and J.X. conceptualized the project and designed the new features. J.C. and J.X. supervised the project and drafted the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Jürgen Cox or Jinqiu Xiao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Yasset Perez-Riverol and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Data

Supplementary Data 1

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Viegener, W., Urazbakhtin, S., Ferretti, D. et al. Facilitating analysis and dissemination of proteomics data through metadata integration in MaxQuant. Nat Commun 16, 8421 (2025). https://doi.org/10.1038/s41467-025-64089-4

Download citation

Received: 20 June 2025
Accepted: 03 September 2025
Published: 25 September 2025
DOI: https://doi.org/10.1038/s41467-025-64089-4