Abstract
This data descriptor presents a curated dataset for pathogen detection and identification (Staphylococcus aureus, Pseudomonas aeruginosa, and Candida albicans) directly from whole-blood samples. The dataset was created using differential cell lysis combined with rapid extraction, digestion, and mass spectrometry-based proteomics. Our method offers a rapid diagnostic alternative to traditional culture, enabling timely management of conditions such as sepsis. The dataset has a three-tier structure: Spectral Libraries of Pathogens for identifying peptide peaks for putative biomarkers; Spiked pathogen in blood MS data for biomarker panel optimization through varied concentration samples; and Parallel Reaction Monitoring (PRM) data from sepsis patients for validating our biomarker panel, achieving 83.3% sensitivity within seven hours without microbial enrichment culture. This dataset serves as a comprehensive reference for bioinformatic tool development and biomarker panel proposals, advancing microbial detection, antimicrobial resistance, and epidemiological studies.
Background & Summary
Ground truth datasets are foundational to the advancement of proteomics, providing a benchmark for validating and refining computational tools and methodologies. They are essential for developing algorithms and software designed to analyze complex proteomic data, facilitating the accurate identification and quantification of proteins. These datasets allow researchers to test the reliability, sensitivity, and specificity of their analytical tools, ensuring the robustness and reproducibility of their findings. In a field where the complexity and diversity of biological samples can significantly impact the experimental outcome, having access to high-quality, well-characterized ground truth datasets is crucial for benchmarking the performance of data analysis strategies and for fostering innovation and improvement in proteomic research.
In the development and refinement of analytical methods and computational tools for proteomic analysis, we delineate three principal purposes for utilizing benchmark datasets: benchmarking new tools, facilitating user familiarity, and iterative improvement of tools and methods. These datasets provide a standardized framework for objective assessment, ensure confident result interpretation, and expose tools to diverse test conditions for continuous refinement. We believe this dataset is of great value to the scientific community, since it provides LC-MS/MS spectra from whole blood samples spiked with microorganisms and subjected to blood cell differential lysis. Both differential lysis and LC-MS/MS have been little explored. These techniques offer potential as tools for protein or peptide detection in bloodstream infections1,2.
There are several well-established datasets for quantitative proteomics, one of the most notable being Prosit. In that report, Gessulat and collaborators demonstrated the ability to predict peptide tandem mass spectra by deep learning. By training on a synthetic peptide library containing 550,000 tryptic peptides, the deep learning model named Prosit has not only facilitated more accurate fragment ion intensity predictions but also integrated these enhancements into the validation of proteomic analytical tools, demonstrating its capability for more precise identifications at substantially lower false discovery rates3. In another example, Shalit and collaborators created a dataset simulating protein expression differences within a complex proteome4. The process began with the cultivation of HeLa S3 cells, which were lysed and digested using trypsin. Simultaneously, a digested Escherichia coli (E. coli) sample was prepared. These two digests were then mixed in four distinct E. coli to HeLa ratios (3, 7.5, 10, and 15 ng of E. coli to 200 ng of HeLa) to mimic varying levels of protein expression changes. This mixing strategy aimed to create a complex dataset that would test the limits of MS1-based label-free quantitative proteomics, specifically using a quadrupole Orbitrap mass spectrometer for analysis. The significance of this dataset lies not only in its complex and well-characterized nature but also in its utility as a benchmarking tool for MS1-intensity-based label-free quantification methods. By providing a realistic and challenging test scenario, the dataset served as an invaluable resource for evaluating the performance of proteomic analysis tools, particularly in terms of precision, accuracy, and sensitivity. Furthermore, the dataset facilitates the exploration of quantitative protein inference methods, such as iBAQ and Hi-N, underpinning the development and optimization of new computational tools for proteomics research.
Another demonstration of the importance of ground truth datasets was shown by Zhen-Lin Chen and collaborators, who relied on simulated, synthetic, and 15N metabolically labeled data, each designed to challenge the precision and sensitivity of the pLink 2 search engine for proteome-scale identification. This approach enabled the authors to demonstrate pLink 2’s enhanced performance, showing it to be faster and more reliable than its predecessor, pLink 1, and to significantly outperform other available tools like Kojak in terms of accuracy and speed5. In a slightly different context, Milan Clasen and collaborators created a ground truth dataset for protein-protein interactions using cleavable cross-linkers and synthetic proteins. They demonstrated that their search engine, Scout, was not only significantly faster than all other cleavable cross-linking search tools but also capable of delivering more results with a true 1% False Discovery Rate (FDR). In contrast, most competing tools reported error rates significantly higher than the theoretical 1% FDR, even though they claimed to achieve this benchmark6; this could only be demonstrated by using the ground truth dataset.
In the rapidly evolving field of proteomics, the creation and availability of curated ground truth datasets play a pivotal role in advancing scientific research and clinical diagnostics. This data descriptor presents a comprehensive protocol and the resulting dataset, obtained through a methodology that combines differential cell lysis with mass spectrometry (MS) based proteomics, to target notorious pathogens such as S. aureus, P. aeruginosa, and C. albicans directly from whole-blood samples. The strategic use of differential cell lysis of whole blood samples enables the selective breakdown of host cells while preserving the integrity of pathogen cells. This approach significantly enriches pathogen-derived proteins by mitigating the dominant presence of host proteins typically encountered in blood samples. By focusing on the pathogens, differential cell lysis markedly improves the sensitivity and specificity of mass spectrometry-based proteomic analyses, enabling the identification of low-abundance pathogenic proteins crucial for accurate diagnosis. We further demonstrate the utility of this dataset by proposing a biomarker panel derived from the proteomic profiles of the targeted pathogens and confirming the presence of the peptide ion peaks in the MS data of the whole blood samples with spiked pathogens. We then verify the panel in sepsis patients. We achieved a sensitivity of 83.3% through parallel reaction monitoring (PRM) and could provide diagnostics within a seven-hour window without requiring microculture enrichment.
In brief, this dataset incorporates a three-tier approach. First, Spectral Libraries of Pathogens (libraries of spectra of microorganisms acquired by MS) offer an essential foundation for identifying spectral peaks and peptides critical for pathogen detection. Second, we include spiked pathogen data on whole blood samples. This enables the development of bioinformatic algorithms to propose biomarker panels by leveraging the spectral libraries from the first tier. Third, we use a strategy to propose our biomarker panel and resort to Parallel Reaction Monitoring (PRM) data, showcasing the efficacy of our biomarker panel through the analysis of the 9 biomarker peptides derived from our strategy alongside their heavy counterparts. This third tier is derived from real patient samples. In the future, biomarker panel tools may use this dataset to propose variant biomarker panels that are even more sensitive than ours. Moreover, this dataset shows high reproducibility and minimal outliers, affirming its value as a benchmark for the development and validation of bioinformatic tools aimed at peptide panel identification for microbial detection, antimicrobial resistance, and epidemiological studies. It provides spectral libraries of the microorganisms of interest alongside data from differential cell lysis of blood from patients with sepsis to test the biomarker panel. This dataset is particularly significant for clinical applications as it employs differential cell lysis to enrich pathogen-derived proteins directly from whole blood samples, enhancing the sensitivity and specificity of pathogen detection.
Methods
Microbiological cultivation and whole blood samples
Three clinically important sepsis-causing microorganisms in Brazil were chosen for this experimental method: Staphylococcus aureus (gram-positive bacterium), Pseudomonas aeruginosa (gram-negative bacterium), and Candida albicans (fungus). Pseudomonas aeruginosa and S. aureus were cultured in BHI agar and C. albicans in Sabouraud agar for 24 h at 37 °C (bacterium species) or at 30 °C for 48 h (C. albicans). Two clinical isolates and one reference ATCC strain were cultivated for each species. After incubation, microorganisms were stored in cryogenic tubes containing 1 mL of BHI and 30% glycerol at −80 °C. The microorganisms were obtained from the species collection of the Clinical Hospital of the Federal University of Paraná (HC-UFPR), Curitiba, Paraná, Brazil.
Patient samples
Seven whole blood samples were obtained from patients of HC-UFPR affected by sepsis: six had blood cultures that were positive for one of the target pathogens, and one was positive for S. epidermidis. Ethical committee approval was obtained under number CAAE: 03377612.5.0000.0096, and the project was registered in SisGen under number A40E081, including all the samples used in the present study. Protocols were developed in accordance with the ethical guidelines established by the local ethics committee. Before any procedure, informed consent was obtained from all participants, ensuring they were aware of the research objectives, the procedures involved, and the potential risks and benefits associated with their participation. Furthermore, strict measures were adopted to protect the privacy and anonymity of participants to preserve the confidentiality of their data. Records and clinical samples were identified by numeric codes only, without any personally identifiable information, and all data were stored in a secure database accessible only to authorized research staff. These measures aimed to ensure the safety and well-being of participants, maintaining the integrity and confidentiality of the data collected.
Protein extraction from microbiological cultivation
For all lysis protocols tested, the colonies of P. aeruginosa, S. aureus, and C. albicans were diluted in PBS to McFarland (McF) scale 5 according to the measurement on the Densimat® bioMérieux scale, which corresponds to 1.5 × 10⁹ CFU/mL. After dosage, the samples were centrifuged for 10 min at 20 °C at 1,400 × g, and the supernatant was discarded. Proteins were extracted following the Sample Preparation by Easy Extraction and Digestion (SPEED)7 protocol. In a preliminary test to standardize the protein extraction methodology, the urea filter-aided Sample Preparation (FASP)8,9 (urea 8 M and urea SDS) and triton-lysozyme-urea FASP protocols were also tested.
Differential lysis of human cells and microbial cell enrichment
A total volume of 5 mL of whole peripheral blood was collected by intravenous puncture and stored in a vacuum collection tube with EDTA (Fig. 1). Then, 1 mL was placed in a 15 mL Falcon tube and mixed with different numbers of the microorganism cells depending on the stage of the experiment. Following this, 1 mL of blood cell lysis buffer (sodium carbonate 500 mM, Triton X-100 1% pH 10.5) was added, followed by stirring for 3 min in a shaker. Neutralization was performed by adding 1 mL of 1 M Tris-HCl and stirring for 3 min. The samples were centrifuged for 15 min at 2,791 × g, the supernatant was discarded, and the pellet was resuspended in 1 mL of PBS. The samples were homogenized by tube inversion, transferred to 1.5 mL microtubes, and centrifuged for 5 min at 20 °C, 1,400 × g. The supernatant was discarded.
Protein extraction from blood samples
After differential cell lysis, the proteins were extracted following the SPEED protocol, which consists of lysis, neutralization, reduction, and alkylation stages. For lysis, 60 μL of 99% TFA was added to the pellet, homogenized with a micropipette to ensure complete lysis of the microbial cells, incubated for 5 min, and heated in a thermomixer at 70 °C for 3 min. The samples were neutralized with 600 μL of 2 M Tris-base. Then, reduction and alkylation were performed by adding 66 μL of the reduction/alkylation solution (100 mM Tris (2-carboxyethyl) phosphine and 400 mM 2-chloroacetamide) and incubation at 95 °C for 5 min. The lysis for the remaining protocols is detailed in Supplementary Material 1 - Section 5. For SPEED and other tested protocols, the protein extract was dosed using the tryptophan fluorescence method9.
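The volumes given above imply the final reagent concentrations during reduction and alkylation. The following is a minimal sketch of that arithmetic, assuming the three volumes are simply additive and the pellet volume is negligible (neither assumption is stated in the protocol); the function and variable names are illustrative only.

```python
def final_conc_mM(stock_mM: float, added_uL: float, total_uL: float) -> float:
    """Dilution of a stock solution into a larger total volume (C1*V1 = C2*V2)."""
    return stock_mM * added_uL / total_uL

# Volumes stated in the protocol (in microliters)
tfa_uL = 60.0       # 99% TFA used for lysis
tris_uL = 600.0     # 2 M Tris-base used for neutralization
redalk_uL = 66.0    # reduction/alkylation solution

# Assumption: volumes are additive and the pellet contributes no volume
total_uL = tfa_uL + tris_uL + redalk_uL

tcep_mM = final_conc_mM(100.0, redalk_uL, total_uL)  # Tris(2-carboxyethyl)phosphine
caa_mM = final_conc_mM(400.0, redalk_uL, total_uL)   # 2-chloroacetamide

print(f"total volume: {total_uL:.0f} uL")
print(f"TCEP: {tcep_mM:.2f} mM, CAA: {caa_mM:.2f} mM")
```

Under these assumptions the reduction/alkylation solution is diluted 11-fold, giving roughly 9.1 mM TCEP and 36.4 mM 2-chloroacetamide in a total of 726 μL.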
Protein digestion
The proteins were digested following the SPEED or SPEED-filter-aided protocols. In SPEED, the samples were diluted to the proportion of 1:5 (v:v) with water, trypsin was added with a 1:50 (m:m) trypsin to protein ratio, and the samples were incubated in a humid chamber for 16 to 18 hours at 37 °C. TFA was added to the samples to achieve a final concentration of 2%. For SPEED-filter-aided digestion, 40 μg of protein extract was used. Initially, 160 μL of acetone was added to the extract and incubated for 2 min at room temperature. Then, the samples were transferred to a 0.2 μm spin filter. Next, the filters were washed with 200 μL 80% acetone, 200 μL 100% acetone, and 200 μL pentane. After each wash, the filter content was homogenized with a micropipette, followed by centrifugation for 2 min at 5,000 × g, discarding the supernatant. Subsequently, the filter was air-dried for 2 min and placed in a new microtube. Following this, 40 μL trypsin 1:10 m:m (trypsin: protein) diluted in 50 mM ammonium bicarbonate (ABC) was added. Next, the filters were incubated in a thermomixer for 30 s at 800 rpm and for 15 min at 70 °C without agitation. For peptide collection, the filters were centrifuged for 2 min at 5,000 × g and washed with 40 μL of 50 mM ABC. The samples were then acidified with TFA to a final concentration of 0.2% and desalted in homemade C18 StageTips.
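The enzyme-to-protein ratios above translate directly into trypsin amounts. As a small worked sketch (the helper name is hypothetical, and the 40 μg input for the in-solution SPEED digestion is an illustrative assumption, since the text states 40 μg only for the filter-aided variant):

```python
def trypsin_mass_ug(protein_ug: float, protein_per_trypsin: float) -> float:
    """Trypsin mass for a 1:N (trypsin:protein, m:m) ratio."""
    return protein_ug / protein_per_trypsin

# Filter-aided digestion: 40 ug protein at 1:10, delivered in 40 uL of 50 mM ABC
filter_trypsin_ug = trypsin_mass_ug(40.0, 10.0)
filter_trypsin_ug_per_uL = filter_trypsin_ug / 40.0

# In-solution SPEED digestion at 1:50 (40 ug input is an assumed example value)
speed_trypsin_ug = trypsin_mass_ug(40.0, 50.0)

print(filter_trypsin_ug, filter_trypsin_ug_per_uL, speed_trypsin_ug)
```

The filter-aided variant therefore uses five times more trypsin per unit of protein, which is consistent with its much shorter incubation (15 min at 70 °C rather than 16 to 18 hours at 37 °C).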
LC-MS/MS analysis
The LC-MS/MS analysis was performed in two systems. Data-dependent acquisition (DDA) analysis was performed in an Eksigent nanoLC Ultra 1D Plus / AS-2 system (Eksigent) coupled to an LTQ Orbitrap XL (Thermo Fisher Scientific) equipped with a PST nanospray source (Phoenix S&T). Peptide separation was carried out in a capillary column (75 µm ID × 15 cm) packed with 3.0 µm C18 particles (Dr. Maisch). Mobile phase A was 0.1% formic acid and 5% DMSO in water, and mobile phase B was 0.1% formic acid and 5% DMSO in acetonitrile. Five microliters of sample (1 µg) were injected, and the peptides were eluted at 250 nL min⁻¹ in a linear gradient from 5 to 40% B for 120 min. The MS acquisition parameters were: MS1 in the Orbitrap analyzer at 60,000 resolution, with a window of 300 to 1,800 m/z; AGC of 1 × 10⁶ (MS1) and 3 × 10⁴ (MS/MS); maximum injection time of 1,000 ms for MS1 and 100 ms for MS/MS; the ten most intense ions were subjected to MS/MS in the LTQ and fragmented by CID with a normalized collision energy of 35; a dynamic exclusion list of 90 s was used; the lock mass option was enabled at 401.922718. PRM acquisition was performed in an Ultimate 3000 RSLCnano coupled to an Orbitrap Fusion Lumos (Thermo Fisher Scientific) equipped with a Pneu-Nimbus dual-column source (Phoenix S&T). Chromatographic separation was performed on a capillary column (75 µm ID × 15 cm) packed with 3.0 µm C18 (Dr. Maisch). Mobile phase A was 0.1% formic acid in water, and mobile phase B was 0.1% formic acid and 5% water in acetonitrile. The sample volume injected was 2.5 µL (200 ng), and the peptides were eluted at 250 nL min⁻¹ in a linear gradient from 5 to 40% B for 60 min.
The MS acquisition parameters were as follows: MS1 in the Orbitrap at 120,000 resolution, with a window of 350 to 1,050 m/z; AGC standard for MS1 and targeted MS/MS; maximum injection time of 50 ms for MS1 and 54 ms for targeted MS/MS; the targeted MS/MS was set for the nine selected peptides plus their heavy standards, totaling 18 targets; peptides were isolated in the quadrupole with a window of 1.6 Da; peptides were fragmented by HCD using a normalized collision energy of 30% and their fragments scanned in the orbitrap at 30,000 resolution. Each scan cycle (loop control) ran through the full target list; the internal calibration option was enabled for the MS1 scan.
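Because each scan cycle runs through the full target list, the PRM parameters above bound how much ion accumulation time one cycle can consume. The sketch below is only a back-of-the-envelope estimate: it sums the maximum injection times and ignores Orbitrap transient times, parallelization of accumulation and detection, and instrument overheads. The 30 s chromatographic peak width is an assumed example value, not a figure from this study; the measured average cycle times are those reported in the supplementary tables.

```python
N_TARGETS = 18         # nine light peptides plus nine heavy standards
MS1_MAX_IT_MS = 50.0   # maximum injection time, MS1
MS2_MAX_IT_MS = 54.0   # maximum injection time, targeted MS/MS

def max_accumulation_per_cycle_ms(n_targets: int = N_TARGETS,
                                  ms1_it: float = MS1_MAX_IT_MS,
                                  ms2_it: float = MS2_MAX_IT_MS) -> float:
    """Upper bound on ion accumulation time for one MS1 scan plus all targets."""
    return ms1_it + n_targets * ms2_it

def points_across_peak(peak_width_s: float, cycle_ms: float) -> float:
    """Approximate number of measurements per target across one LC peak."""
    return peak_width_s * 1000.0 / cycle_ms

cycle_ms = max_accumulation_per_cycle_ms()
print(f"accumulation bound per cycle: {cycle_ms:.0f} ms")
# Assumed peak width of 30 s, for illustration only
print(f"points across a 30 s peak: {points_across_peak(30.0, cycle_ms):.1f}")
```

With 18 targets the bound is about one second per cycle, which is one reason a compact panel of nine peptides plus heavy standards fits comfortably in a single PRM method.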
Peptide spectrum matching (PSM)
The search analysis was performed with PatternLab for proteomics version IV (ref. 10), which is freely available at http://www.patternlabforproteomics.org, and MaxQuant version 1.6.14 (ref. 11). Protein sequences from the organisms listed in supplementary material 1-13 were downloaded to compose a single FASTA sequence database. The sequences were downloaded from UniProt on October 31, 2019, and named DB Sepsis. This concatenated sequence file contained a total of 1,152,229 entries. A target-decoy database was generated to include a reversed version of each sequence plus those from common MS contaminants. PatternLab identifications were carried out according to its published bioinformatic protocol10,12. The search parameters considered fully and semi-tryptic peptide candidates with masses between 550 and 5,500 Da, up to two missed cleavages, an initial 40 ppm for precursor mass with a post-filtering of 10 ppm, and bins of 1.0005 m/z for MS/MS with an offset of 0.4. The modifications were carbamidomethylation of cysteine and oxidation of methionine as fixed and variable, respectively. The validity of the PSMs was assessed using SEPro13. The identifications were grouped by charge state (2+ and ≥3+) and then by tryptic status, resulting in four distinct subgroups. The XCorr14, DeltaCN, DeltaPPM, and Peaks Matched values, provided by the search results, were used to generate a Bayesian discriminator for each group. The identifications were sorted in nondecreasing order according to the discriminator score. A cutoff score was then established to accept an FDR of 2% at the peptide level, based on the number of decoys. This procedure was independently performed on each data subset, resulting in an FDR independent of charge state or tryptic status. Additionally, a minimum sequence length of six amino-acid residues and a protein score greater than 3 were imposed. Finally, identifications deviating by more than 10 ppm from the theoretical mass were discarded.
This last filter led the FDRs, now at the protein level, to be lower than 1% for all search results13. The quantification was performed using extracted ion chromatograms (XIC). In MaxQuant, oxidation (M) and acetylation (protein N-terminal) were set as variable modifications and carbamidomethyl (C) as fixed. The digestion mode was specific, with trypsin as the enzyme and a maximum of two missed cleavage sites. For runs containing heavy peptides, Arg10 or Lys8 labels were specified. The other parameters followed the default configuration of the software.
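The score-cutoff step of a target-decoy analysis can be sketched as follows. This is a simplified illustration and not SEPro's implementation: it assumes each identification already carries a single discriminator score where higher is better, estimates FDR as decoys divided by targets among accepted identifications, and does not treat tied scores specially. The function name and the toy data are hypothetical. In a procedure like the one described above, such a cutoff would be computed separately for each of the four charge-state and tryptic-status subgroups.

```python
from typing import List, Optional, Tuple

def score_cutoff_at_fdr(ids: List[Tuple[float, bool]],
                        max_fdr: float = 0.02) -> Optional[float]:
    """Return the lowest score cutoff whose accepted set has decoys/targets <= max_fdr.

    ids holds (discriminator_score, is_decoy) pairs. Returns None if no cutoff
    satisfies the requested FDR.
    """
    ranked = sorted(ids, key=lambda item: item[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the accepted set while the estimated FDR is acceptable
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff

# Toy example (hypothetical scores): seven targets and one decoy
toy = [(10.0, False), (9.0, False), (8.0, False), (7.0, False),
       (6.0, False), (5.5, True), (5.0, False), (4.0, False)]
strict = score_cutoff_at_fdr(toy, max_fdr=0.1)
lenient = score_cutoff_at_fdr(toy, max_fdr=0.2)
print(strict, lenient)
```

In the toy data the stricter threshold stops above the single decoy, while the more lenient one accepts the whole list because one decoy among seven targets stays under 20%.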
Target MS data analysis
The top 100 peptides of P. aeruginosa, S. aureus, and C. albicans presenting the highest XIC and detected in all DDA screening runs were selected to perform the target quantification using the Skyline tool (64-bit version 20.1.0.31). The evaluation of the peptides considered dotp scores (>0.9), intensity, absence of missed cleavages and cysteine, a unique charge state, and detection at low injected protein mass (see the Skyline parameters in Supplementary Material 1, Section 4). This evaluation allowed the selection of three specific peptides for each microorganism, for a total of nine. Together with the heavy peptide standards (Lys or Arg labeled with 13C and 15N), the PRM was set to monitor 18 targets.
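Each light peptide and its heavy standard form a pair of precursor m/z values in the PRM target list. The sketch below computes such a pair from standard monoisotopic residue masses, using the peptide AGENVGILLR that is named in the limit-of-detection section. The 2+ charge state is an assumption made for illustration (the monitored charge state is not given in the text), and the helper name is hypothetical. The heavy shifts correspond to fully 13C/15N-labeled C-terminal Lys (+8.0142 Da) or Arg (+10.0083 Da).

```python
# Monoisotopic residue masses in Da
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
HEAVY_SHIFT = {"K": 8.014199, "R": 10.008269}  # Lys8 and Arg10 labels

def precursor_mz(sequence: str, charge: int, heavy_c_term: bool = False) -> float:
    """Monoisotopic precursor m/z of an unmodified peptide.

    With heavy_c_term=True the C-terminal residue must be K or R.
    """
    mass = sum(MONO[aa] for aa in sequence) + WATER
    if heavy_c_term:
        mass += HEAVY_SHIFT[sequence[-1]]
    return (mass + charge * PROTON) / charge

# Assumed charge state 2+ for illustration
light_mz = precursor_mz("AGENVGILLR", 2)
heavy_mz = precursor_mz("AGENVGILLR", 2, heavy_c_term=True)
print(f"light: {light_mz:.4f}  heavy: {heavy_mz:.4f}")
```

For a doubly charged Arg-terminated peptide the heavy standard sits about 5.004 m/z above the light form, so the two precursors are isolated in separate, non-overlapping windows when a 1.6 Da isolation width is used.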
Limit of detection and quantification
The following steps were followed to construct the analytical curve of target peptides. First, a linearity study was conducted to detect the pure heavy peptides (outside of the blood matrix), plotting the peptide mass versus the MS peak area. The curves of the heavy peptides can be found in Supplementary Material 1 – Section 3. Then, a bacterial inoculum of 40 colonies was obtained from the microculture of P. aeruginosa. Dilutions were made in 1:4 of the stock inoculum, processed using SPEED, and analyzed with LC-MS/MS. The results from the Skyline tool showed a good detection of peptide fragments in all these dilutions; thus, new dilutions were made from the sample containing 0.0625 colonies. In the results of the last dilutions, it was not possible to detect the five best fragments or to obtain a dotp > 0.9 for the peptide AGENVGILLR at the 1/100 dilution. Using the McF scale/bacterial concentration ratio, we reached the value of 1,000 cells, corresponding to the minimum number of P. aeruginosa cells as the starting material for detecting the peptide AGENVGILLR. With this approximate LOD, we chose the nine points of the curve with a distribution of points above and below this LOD in addition to one point as the blank. The experimental procedure was repeated for the other target peptides. Table 1 shows the curve points for each species.
The limit of detection and quantification was then established from the analysis of these microbial suspensions by LC-MS/MS after the steps of lysis, digestion, and purification standardized as described previously.
Data Records
Availability
All the mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE15 partner repository with the dataset identifier PXD039165 (ref. 16).
Nomenclature
The nomenclature of the LC/MS dataset reflects the structured approach of our experimental design, emphasizing reproducibility and detailed analysis. The dataset encompasses experiments conducted across three different days to test reproducibility, designated as “C1” for the first day, “C2” for the second, and “C3” for the third. Moreover, the dataset is systematically categorized to represent different conditions based on the cell counts of the three analyzed microorganisms (P. aeruginosa, S. aureus, and C. albicans), with “P” followed by a number (P1 to P9) indicating one of nine pre-established cell count combinations. A distinct subset without a specific cell count, representing total blood samples (the blank), is labeled as “sgtotal,” highlighting the samples not directly linked to the predefined cell count combinations. The datasets called “patients” are those originating from real patient samples.
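The labels described above lend themselves to simple programmatic annotation of runs. The following sketch assumes that the tokens appear in file names separated by non-alphanumeric characters such as underscores; the exact file-name layout in the repository is not specified here, so the example names are hypothetical and the patterns may need adjusting against the deposited files.

```python
import re
from typing import Dict, Optional, Union

def parse_run_label(name: str) -> Dict[str, Union[Optional[str], bool]]:
    """Extract acquisition day (C1-C3), curve point (P1-P9), and sample type."""
    lowered = name.lower()
    tokens = re.split(r"[^a-z0-9]+", lowered)
    day = next((t.upper() for t in tokens if re.fullmatch(r"c[1-3]", t)), None)
    point = next((t.upper() for t in tokens if re.fullmatch(r"p[1-9]", t)), None)
    return {
        "day": day,                          # reproducibility day, if present
        "curve_point": point,                # cell count combination, if present
        "is_blank": "sgtotal" in lowered,    # total blood without spiked cells
        "is_patient": "patient" in lowered,  # real patient sample
    }

# Hypothetical file names, for illustration only
spiked = parse_run_label("C2_P7_rep1.raw")
blank = parse_run_label("C1_sgtotal_2.raw")
print(spiked)
print(blank)
```

Annotating runs this way makes it straightforward to group the analytical-curve files by day and curve point before comparing replicates.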
Mass spectrometry file descriptions
Each sample was processed in technical triplicate, which resulted in 125 raw files being analyzed. Supplementary 5 - Table 1 summarizes the mass spectrometry files and their contents, and Supplementary 5 - Table 2 reports metrics such as the number of MS1 and MS2 scans, the average cycle time, and the total chromatography time for the analytical curve files and patient samples.
Technical Validation
Quality control was conducted using RawVegetable 2.0 (ref. 17), a software tool equipped with modules designed to identify potential issues in MS runs and verify data quality for peptide identification. For the technical validation of our dataset, we utilized several metrics to assess reproducibility, including TopN density, precursor signal ratio, and Xrea score. TopN density measures the distribution of MS/MS scans per MS event throughout the chromatographic run, identifying potential over- or under-sampling that could affect consistency in peptide identification. Precursor signal ratio, the ratio of the precursor ion’s intensity to the total signal in an MS/MS scan, ensures efficient fragmentation and high-quality spectra, which are crucial for accurate peptide identification. The Xrea score measures the heterogeneity of peak intensities within a spectrum, with higher scores indicating more heterogeneous and potentially higher-quality spectra. These metrics collectively assess the quality and reproducibility of our mass spectrometry data, ensuring consistent and reliable identification results across different experimental replicates. The dataset showed excellent reproducibility among samples based on the criteria mentioned above. The RawVegetable report is available as Supplementary Material 4.
The selection of species-specific peptides that are suitable for target analysis was performed in two stages. In the first one, the LC-MS/MS DDA data obtained from both pure cultures and blood contaminated with the microorganisms were used to search for peptides that fulfill specific requirements: species-exclusive, high-intensity peptides; identification in all LC-MS runs by both MaxQuant and PatternLab V; a unique charge state; and no methionine or cysteine in the sequence. After this filtration stage, we obtained a total of 513, 376, and 111 unique peptides for P. aeruginosa, C. albicans, and S. aureus, respectively. In the second stage, the hundred most abundant peptides for each species (Supplementary Material 1- S10) were monitored by PRM independently in each pure microorganism protein digest, and the data from the best replicate were used to construct a spectral library in the Skyline tool for peptide selection. To do this, blood peptides (from proteins remaining after differential lysis) were contaminated with the microorganisms’ total peptides, independently and in different microorganism to blood peptide (m:m) proportions: 1/30, 1/100, 1/500, 1/1,000, and 1/2,000, and analyzed by PRM (one method with 100 targets for each species). Peptides were selected based on high spectral library match scores (dotp > 0.9), high intensity, and detection at lower microorganism peptide mass injection. The first-round analysis resulted in the selection of 76 peptides, 30 for C. albicans, 31 for P. aeruginosa, and 15 for S. aureus, which were then evaluated by PRM in a new round (76 targets in a single method), from which the three best peptides for each species were selected. Therefore, we selected three target peptides (potential biomarkers) for each of the three microorganisms (Table 2). Subsequently, we constructed the single-run PRM method using the resulting nine best peptides and their respective nine heavy peptides, totaling 18 peptides.
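The sequence-level and score-level criteria described in this section can be expressed as a simple filter. The sketch below is illustrative only: it covers the rules that can be checked from a sequence and a dotp value (no methionine or cysteine, no missed tryptic cleavage, dotp > 0.9) and leaves out species exclusivity, intensity, and charge-state checks, which require the search results. A missed cleavage is taken here to be an internal K or R not followed by P, which is a common convention and an assumption on our part; the dotp values in the example are hypothetical.

```python
def has_missed_cleavage(sequence: str) -> bool:
    """True if an internal K or R is not followed by P (tryptic convention)."""
    return any(
        sequence[i] in "KR" and sequence[i + 1] != "P"
        for i in range(len(sequence) - 1)
    )

def passes_filters(sequence: str, dotp: float, min_dotp: float = 0.9) -> bool:
    """Apply the sequence-level and library-match criteria to one candidate."""
    if dotp <= min_dotp:
        return False
    if set(sequence) & {"M", "C"}:
        return False  # methionine or cysteine present
    return not has_missed_cleavage(sequence)

# Hypothetical candidates as (sequence, dotp) pairs, for illustration only
candidates = [("AGENVGILLR", 0.95), ("AGENVGILLR", 0.85),
              ("AMENVGILLR", 0.95), ("AGEKVGILLR", 0.95)]
kept = [seq for seq, dotp in candidates if passes_filters(seq, dotp)]
print(kept)
```

Excluding methionine and cysteine avoids peptides whose signal can be split across oxidized or incompletely alkylated forms, which would otherwise complicate quantification against a single heavy standard.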
In our dataset, an average of 17% ± 3.5% of the peptides were identified as semi-tryptic, reflecting the enzymatic activity in the blood samples.
Usage notes
A Benchmark for rapid sample preparation aiming at emergency diagnostics
One of the primary applications of this dataset is in the development of rapid diagnostic tools tailored for emergency medicine, particularly for the timely management of sepsis—a condition where every minute counts. This dataset serves as a reference standard for the differential lysis protocol. As such, it can be used to benchmark sample preparation and data acquisition in different labs and compare the number of peptides identified with ours. Our sample preparation and mass spectrometry data generation took seven hours. By providing a direct diagnostic alternative that circumvents traditional cell culture processes, such sample preparation can significantly enhance the responsiveness of healthcare systems to sepsis, potentially reducing morbidity and mortality associated with delayed diagnosis.
Development and fine-tuning of bioinformatic approaches for biomarker panel optimization
This dataset is composed of three tiers. Tier A includes spectral libraries of pathogens for identification and biomarker discovery. Tier B consists of data from experiments with pathogens spiked into samples at various concentrations, aimed at biomarker panel optimization. Tier C features Parallel Reaction Monitoring (PRM) data for validating the effectiveness of the biomarker panel in real patient samples. The dataset opens new avenues for the development and fine-tuning of bioinformatic approaches aimed at optimizing the specificity and sensitivity of biomarker panels. By leveraging the detailed proteomic profiles of S. aureus, P. aeruginosa, and C. albicans, bioinformaticians can employ machine learning and deep learning techniques to analyze the data, identify patterns, and predict the most effective combinations of peptides for pathogen detection. This process involves training algorithms to discern subtle differences in peptide profiles that are most indicative of each pathogen, thereby enhancing the diagnostic accuracy. This is accomplished by resorting to Tier A to search for peptide ion peaks and then using Tier B to find these peaks in the whole blood sample with spiked microorganisms. Future work leveraging this dataset could lead to the creation of highly specific biomarker panels that not only improve upon the current specificity but also contribute to the personalization of diagnostics based on pathogen type and patient-specific responses, leading to more tailored and effective treatment strategies.
Enhancing educational tools for proteomics and pathogen detection
Another potential application of this dataset lies in education and training for proteomics and pathogen detection. The dataset can serve as a foundation for developing interactive educational modules and simulation tools. These resources could train students and professionals in proteomic data analysis, differential cell lysis techniques, and the application of mass spectrometry in clinical diagnostics. Through hands-on experience with actual data, learners can gain a deeper understanding of the challenges and considerations involved in pathogen detection, including the importance of selecting appropriate biomarker panels and understanding the clinical implications of their findings. We encourage reanalyzing the dataset with methods such as Random Forest and Support Vector Machines (SVM). Random Forest, an ensemble learning method, constructs multiple decision trees to improve predictive accuracy and control overfitting, making it particularly useful for handling complex datasets with numerous variables. SVM, on the other hand, is effective in high-dimensional spaces and widely used for classification tasks in proteomics, helping to identify optimal hyperplanes that separate different classes of biomarkers with maximum margin. Convolutional Neural Networks (CNNs) can also be adapted for proteomic data analysis. Traditionally used in image processing, CNNs capture complex patterns and relationships between features, in our case biomarkers, thereby enhancing the predictive power and accuracy of pathogen detection models.
Code availability
No custom code was used in this study.
Acknowledgements
This work was supported by project CNPq-PROEP/ICC n° 442338/2019-8 and Carlos Chagas Institute, Fiocruz Paraná, Paraná, Brazil. The proteomic experiments were performed at the Laboratory for Applied Science and Technology in Health and the Mass Spectrometry Facility RPT02H, Carlos Chagas Institute, Fiocruz Paraná, Paraná, Brazil. We would like to thank the Clinical Hospital of the Federal University of Paraná (HC-UFPR), Curitiba, Paraná, Brazil, for providing strains and whole-blood samples from patients. The authors are very grateful to Wagner Nagib de Souza Birbeire for assistance with artwork design and to FIOCRUZ for use of its Technological Platforms Network.
Author information
Contributions
Rezende, J.O.V. and Batista, M.: Term, Conceptualization, Validation, Methodology, Writing - Original Draft. Machado, K.C.: Validation, Methodology, Writing - Review & Editing. Bandini, T.B.: Formal analysis. Menezes, I.A.C.: Resources. Stefani, F.C.: Resources. Santos, M.D.M. and Carvalho, P.C.: Software, Writing. Brant, R.S.C.: Data Curation. Kurt, L.U.: Analysis of proteomic data. Morello, L.G.: Term, Conceptualization, Project administration. Marchini, F.K.: Term, Conceptualization, Project administration.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
de Oliveira Veloso Rezende, J., Batista, M., Machado, K.C. et al. A dataset for developing proteomic tools for pathogen detection via differential cell lysis of whole blood samples. Sci Data 11, 1105 (2024). https://doi.org/10.1038/s41597-024-03834-8



