Background & Summary

Neonatal metabolic screening has been a cornerstone of preventive medicine worldwide1 since the 1960s, enabling early detection of rare but severe metabolic disorders. If left undiagnosed, these disorders can result in significant morbidity or mortality, underscoring the need for accurate and timely screening methods2,3. The Neonatal Screening program in the Lombardy Region targets four groups of conditions: endocrine disorders (congenital hypothyroidism and congenital adrenal hyperplasia), cystic fibrosis, inherited metabolic disorders (NBS), and genetic neuromuscular diseases (such as spinal muscular atrophy). The screened diseases can be rare (affecting no more than one person in 2,000), congenital (present at birth), and often hereditary. In Italy, the full process is governed by clear rules and guidelines4, in particular the Ministry of Health Decree of 13 October 2016 (https://www.trovanorme.salute.gov.it/norme/dettaglioAtto?id=56764#articoli), which allows each region to operate independently while following the same procedure. The process begins with the collection of a few drops of blood from each baby in every hospital of the region, usually 48-72 hours after birth, in accordance with current national and international reference standards. The blood drops are then stored on barcoded cards called Guthrie cards5 (dried blood spots, DBS), and demographic and medical data are entered manually into the system. Samples and data are then sent to Buzzi Hospital in Milan for analysis using the Genetic Screening Processor6 and mass spectrometry7. Further details about Lombardy's neonatal screening analysis procedure are given in the Methods section. Poor-quality samples are flagged for recollection and labeled as BIS (if the sample is the second) or Controllo (if the sample is taken to check the previous one).
The Mayo Clinic's neonatal screening software, Collaborative Laboratory Integrated Reports (CLIR) (https://clir.mayo.edu/Home/About)8, flags samples with a high probability of being positive for one or more diseases. Doctors review these cases and may request a second sample for further analysis, repeating the process and measuring different markers depending on the diseases to be checked. Final diagnoses are based on weight-specific and disease-specific cut-off values developed by doctors.

To our knowledge, no large public datasets on newborn screening currently exist. Access to a large volume of real newborn screening data is crucial for facilitating information sharing between screening centers. It enables the comparison of biochemical marker concentration distributions across different populations and also allows the evaluation of data collection quality and diagnostic assignment processes. In particular, the group of positive cases that has been collected makes it possible to test new automatic diagnosis assignment techniques. For this reason, the proposed dataset9 can be exploited in both the medical and academic fields. Because the information it contains is real, it can be used to understand certain characteristics of the Lombardy population and the distribution of the concentration of particular biochemical markers. At the same time, its size, the technical measurement errors it contains and the large number of missing values make it an excellent dataset with which to test different data cleaning techniques10. Synthetic data are now used in many health projects11 because of privacy issues or limited data availability, and they are a good way to mitigate such problems. However, depending on the generative model used to create them, one may encounter different technical difficulties and levels of realism12.

The proposed dataset contains many of the observations that are usually fed into the model supporting the diagnostic process. The extracted observations originally covered 10 years of samples (from September 2012 to December 2022) and comprised 985,792 rows and 266 features. The majority of the columns (235) correspond to biochemical marker concentration measurements, while 17 are binary features. The remaining columns contain information from the accompanying forms, such as the city of residence, the mother's ethnicity, and the time of sample collection. To ensure data quality and address potential errors, the dataset was processed, resulting in a reduced size. As explained in the Methods section, this involved the removal of several variables and observations. Additionally, thanks to the contributions of physicians, diagnoses from the years 2016-2021 were integrated into the dataset. During the observed period, 198 positive cases were recorded. However, not every disease included in the screening had at least one positive case; these 198 cases correspond to only 29 different diseases (Table 1).

Table 1 Summary of the diseases included in the neonatal screening program of the Lombardy region.

Methods

Fig. 1 illustrates the data-collection process organized by the Lombardy Region. Dried blood spot (DBS) samples and associated data collected at hospitals across Lombardy are sent to the Neonatal Screening Laboratory of Buzzi Hospital (Milan). This section details the steps for data collection, laboratory analysis, and data preparation/cleaning.

Fig. 1 Flow chart of the data-collection process.

Ethical Consideration

Informed consent was obtained from all participating mothers through a signed document at the time of delivery. Pursuant to Italian legislation (e.g., Article 8, paragraph 2, Law No. 132/2025), and in accordance with local governance, the general informed consent permits secondary use of data stripped of direct identifiers. There is no evidence of harm associated with the DBS collection procedure. All procedures comply with applicable privacy regulations and institutional policies. The study was approved by the Territorial Ethics Committee 1 of Lombardy (CET 4–2023, 12 July 2023).

Collection

Between 48 and 72 hours after birth, a few drops of capillary blood are obtained from the newborn’s heel (typically the right foot) and applied to a Guthrie card identified by a barcode. This time window balances neonatal physiological adaptation (reducing false-positive and false-negative results) with timely detection of target conditions such as congenital hypothyroidism and inborn errors of metabolism.

At the same time, a questionnaire records perinatal and maternal information. Anamnestic data are manually entered into the hospital information system by healthcare workers at each hospital, along with the associated barcode. Upon arrival at Buzzi Hospital, the barcode is scanned, allowing the system to automatically retrieve the linked clinical/demographic information for analysis.

Shipping of Guthrie cards to Buzzi Hospital

In Lombardy, neonatal screening is centralized at Buzzi Hospital (Milan). Every day, hospitals send Guthrie cards in numbered parcels for processing. To ensure privacy, laboratory staff have defined two identifying variables that are excluded from the published dataset: one linked to the individual child and the other to the sample.

After associating the child’s name with the barcode on the corresponding card, the system automatically integrates additional information. Occasional clerical errors may occur during manual handling (e.g., loss or physical damage of cards).

DBS analysis at the “V. Buzzi” Children Hospital

The DBSs were processed using the Neobase newborn screening kit (PerkinElmer, Milan, Italy). Single 3.2 mm spots were punched using a DBS puncher (PerkinElmer, Italy) into 96-well plates. A total of 100 μL of the PerkinElmer extraction working solution was added to each well. The microplate was shaken for 45 min at 650 rpm at 45 °C and incubated for 120 min at room temperature to ensure the complete derivatization of the extracted succinylacetone. The plate was then analyzed using a Waters TQD (Waters Corporation, Sesto San Giovanni, Italy) or Quattro Micro mass spectrometer in FIA-MS/MS mode. The concentrations were calculated as the ratio of the measured analyte intensities to those of the internal standards, multiplied by the internal standard concentration and the relative response factor. Eleven amino acids were analyzed: alanine (Ala), citrulline (Cit), arginine (Arg), glycine (Gly), leucine/isoleucine (Leu/Ile/Pro-OH), methionine (Met), ornithine (Orn), phenylalanine (Phe), tyrosine (Tyr), and valine (Val). Thirty-one acylcarnitines were analyzed: C0, C2, C3, C3DC, C4, C4DC, C5, C5:1, C5DC, C5OH, C6, C6DC, C8, C8:1, C10, C10:1, C10:2, C12, C12:1, C14, C14:1, C14:2, C14OH, C16, C16:1, C16OH, C16:1, C18, C18:1, C18:2, C18OH, and C18:1OH13. The proposed dataset does not contain all of the listed analytes. Calibration procedures are also part of the analysis: each assay performed on the PerkinElmer GSP machine is calibrated using a specific calibration curve, while the mass spectrometer relies on internal standards for calibration.
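The concentration calculation described above can be sketched as follows. This is a minimal illustration, assuming the simple ratio form stated in the text; the function and parameter names are hypothetical, and the vendor software may apply additional corrections.

```python
def analyte_concentration(analyte_intensity, is_intensity, is_concentration, rrf=1.0):
    """Estimate an analyte concentration from FIA-MS/MS intensities.

    Assumed form (taken from the description in the text):
        (analyte intensity / internal standard intensity)
        * internal standard concentration * relative response factor
    All names here are illustrative, not the laboratory's own code.
    """
    if is_intensity <= 0:
        raise ValueError("internal standard intensity must be positive")
    return (analyte_intensity / is_intensity) * is_concentration * rrf


# An analyte twice as intense as a 5.0 umol/L internal standard, RRF = 1.
print(analyte_concentration(2000.0, 1000.0, 5.0))
```

The result is expressed in the same unit as the internal standard concentration.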

Analysis of results and identification of positives

Initial post-analytical evaluation is performed using the interactive web tool Collaborative Laboratory Integrated Reports (CLIR)8. This software has three main goals:

  • Replacement of traditional cutoff values with continuous adjustments for age and other covariates of reference ranges shown as seamless percentile charts. CLIR reference ranges are derived by retrospective analysis of “big data”, tens and even hundreds of thousands of data points from a growing worldwide community of collaborators.

  • Creation of cumulative, covariate-adjusted disease ranges for all informative markers for target conditions, usually clustered by specialty and/or type of markers.

  • Post-analytical interpretive tools that integrate all relevant results into a single score. Tools are applicable either to the diagnosis and/or prognosis of a condition or to the differential diagnosis between pairs of conditions (for example benign variant vs. classic disease, responsive/not responsive to treatment).

The system automatically identifies babies with a high likelihood of having one or more target conditions. Laboratory physicians review the alerts and, when necessary, a repeat DBS is obtained for re-testing.

In the subsequent analysis, diagnosis is performed manually using conventional cut-off values, adjusted for four weight categories (weight < 1500 g; 1500 g < weight < 2000 g; 2000 g < weight < 2500 g; weight > 2500 g) and three sampling times (Figs. 3-6).
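As a minimal sketch, the four weight categories can be assigned as shown below. The text uses strict inequalities and does not say how weights exactly equal to 1500, 2000 or 2500 g are classified; assigning them to the heavier category is an assumption made here, and the labels are illustrative.

```python
from bisect import bisect_right

# Category boundaries in grams, as listed in the text.
WEIGHT_EDGES_G = (1500, 2000, 2500)
# Illustrative labels; boundary values fall into the heavier category (assumption).
WEIGHT_LABELS = ("<1500 g", "1500-2000 g", "2000-2500 g", ">2500 g")


def weight_category(weight_g):
    """Return the weight category label for a birth weight in grams."""
    return WEIGHT_LABELS[bisect_right(WEIGHT_EDGES_G, weight_g)]


for w in (1200, 1500, 2499, 3200):
    print(w, weight_category(w))
```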

Fig. 2 Number of missing values.

Fig. 3 Comparison between Female and Male < 1500 g.

Fig. 4 Comparison between Female and Male 1500-2000 g.

Fig. 5 Comparison between Female and Male 2000-2500 g.

Fig. 6 Comparison between Female and Male > 2500 g.

Data preparation

Data extraction from the database was performed by the hospital's IT support team. Data preparation involved several scripts implemented in Python, primarily using the polars library.

Diagnostic information from 2016 to mid-2021 was integrated by merging a supplemental file via a unique identification code. The positive cases included 198 children diagnosed with 29 different metabolic diseases across various groups (Table 1).

Many samples contained multiple consecutive values within the same cell for several numerical variables. As the second value could not be reliably identified or interpreted, only the first value was retained, rounded to two decimal places. Composite variables derived from other fields were removed to avoid redundancy. Two additional biochemical markers, s-TSH and s-17OHP, were excluded because they were measured in only 25 and 3 healthy newborns, respectively. Features judged unreliable, inconsistent or privacy-sensitive (Allele 1, City, Ethnicity, Sampling, AnswerIX, SampleQuality, id, and SampleBarCode) were removed. In particular, the removal of id and SampleBarCode ensured full anonymization of patient data.
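The rule of keeping only the first value of a multi-valued cell can be sketched as below. The text does not specify how the consecutive values were separated, so the whitespace or semicolon delimiter is an assumption, as is the function name. The published scripts use polars; plain Python is used here only to keep the sketch self-contained.

```python
import re


def first_value(cell):
    """Keep the first numeric value of a cell, rounded to two decimals.

    Assumes values are separated by whitespace or semicolons (not stated
    in the text). Returns None for empty or non-numeric cells.
    """
    if cell is None:
        return None
    tokens = [t for t in re.split(r"[;\s]+", cell.strip()) if t]
    if not tokens:
        return None
    try:
        return round(float(tokens[0]), 2)
    except ValueError:
        return None


print(first_value("12.3412 13.1"))  # second value is discarded
print(first_value(""))              # empty cell stays missing
```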

All records were sorted by SamTimeCollected. Only the first observation for each patient ID was retained, except for affected children, for whom the diagnostic record was preserved, yielding 791,752 records.
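A minimal sketch of this selection step is given below, assuming records are dictionaries with the `id` and `SamTimeCollected` fields named in the text and ISO-formatted timestamps. The `Diagnosis` field and the rule for choosing among several diagnostic records are hypothetical, since the text does not detail them.

```python
def first_sample_per_patient(records):
    """Keep one record per patient id.

    Records are sorted by SamTimeCollected and the earliest one is kept,
    unless a later record carries a diagnosis, in which case the first
    diagnostic record replaces it. 'Diagnosis' is a hypothetical field name.
    """
    kept = {}
    for rec in sorted(records, key=lambda r: r["SamTimeCollected"]):
        pid = rec["id"]
        current = kept.get(pid)
        if current is None:
            kept[pid] = rec
        elif rec.get("Diagnosis") and not current.get("Diagnosis"):
            kept[pid] = rec
    return list(kept.values())


sample = [
    {"id": "a", "SamTimeCollected": "2019-01-03", "Diagnosis": None},
    {"id": "a", "SamTimeCollected": "2019-01-01", "Diagnosis": None},
    {"id": "b", "SamTimeCollected": "2019-02-01", "Diagnosis": None},
    {"id": "b", "SamTimeCollected": "2019-02-10", "Diagnosis": "positive"},
]
for rec in first_sample_per_patient(sample):
    print(rec["id"], rec["SamTimeCollected"])
```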

After the feature reduction, the number of variables was 100. Data types were then standardized. Since all binary variables indicate the presence or absence of a specific characteristic or condition, the high number of missing values resulted from many operators omitting negative responses and leaving the field empty. For this reason, missing values in binary variables were recoded as 0.
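The recoding of missing binary values can be illustrated with the short sketch below. It assumes a row represented as a dictionary in which missing entries are `None` or empty strings; `Premature` is a column named in the text, whereas `Twin` and `C0` are used here only as hypothetical examples of a binary and a non-binary column.

```python
def fill_missing_binary(row, binary_columns):
    """Return a copy of the row with missing binary values recoded as 0.

    Only the listed binary columns are touched; missing values in other
    columns (for example marker concentrations) are left unchanged.
    """
    out = dict(row)
    for col in binary_columns:
        if out.get(col) in (None, ""):
            out[col] = 0
    return out


row = {"Premature": None, "Twin": 1, "C0": None}
print(fill_missing_binary(row, ["Premature", "Twin"]))
```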

Handling of missing values is particularly critical, as “Ignoring or inadequately handling missing values can lead to biases and loss of statistical power”14. Despite this, we opted not to modify the large number of missing values in the dataset, even though proper handling is generally preferable to ignoring them15. This choice allows users to evaluate alternative strategies (Fig. 2)16,17,18,19,20.

Two potentially informative variables, SamTimeReceived and SamTimeCollected, were frequently incomplete or incorrectly recorded. Their time-of-day component would have been particularly valuable, but it was missing for most observations. Consequently, only the calendar date was retained and the variable names were updated accordingly.

We then computed the interval between SamTimeCollected and DateOfBirth and excluded all observations with this value lower than 1 or greater than 3 days. After this filtering step, the final number of rows was reduced to 748,716.
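The filter on age at collection can be sketched as follows, assuming both variables are available as calendar dates and that the bounds of 1 and 3 days are inclusive. Treating records with a missing date as outside the window is an assumption, and the exception for positive cases described in the Technical Validation section is not modeled here.

```python
from datetime import date


def age_at_collection_days(date_of_birth, date_collected):
    """Whole days between birth and sample collection."""
    return (date_collected - date_of_birth).days


def within_window(date_of_birth, date_collected, min_days=1, max_days=3):
    """True if the sample was collected between min_days and max_days after birth.

    Records with a missing date are treated as outside the window (assumption).
    """
    if date_of_birth is None or date_collected is None:
        return False
    age = age_at_collection_days(date_of_birth, date_collected)
    return min_days <= age <= max_days


print(within_window(date(2020, 3, 1), date(2020, 3, 3)))  # collected 2 days after birth
print(within_window(date(2020, 3, 1), date(2020, 3, 5)))  # collected 4 days after birth
```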

Outliers were retained. Although several variables exhibited extreme values (including maxima exceeding three times the 99th percentile), the definition of outliers is context-dependent. We chose to retain all values as extracted from the hospital server in order to allow users to determine their own outlier definitions and apply appropriate correction methods as needed.

Data Records

The data are available on Zenodo9 (https://doi.org/10.5281/zenodo.16411149). The dataset is stored as a comma-separated values (CSV) file, in which each record corresponds to the first sample collected from a newborn in Lombardy. It includes numeric, string, date, and binary columns.

The dataset consists of 748,716 rows and 100 columns. A complete list of feature names, with the respective descriptive statistics, is provided in Tables 2-7.

Table 2 Summary of the features of the dataset.
Table 3 Statistics of the observations with a weight lower than 1500 g (5966 babies).
Table 4 Statistics of the observations with a weight between 1500 g and 2000 g (9172 babies).
Table 5 Statistics of the observations with a weight between 2000 g and 2500 g (35527 babies).
Table 6 Statistics of the observations with a weight higher than 2500 g (698052 babies).
Table 7 Number of positive cases among infants in each weight range.

Technical Validation

Developing datasets is a process that requires significant responsibility and attention. Data quality is of fundamental importance, as it has a major impact on the reliability of extracted information. It is essential to ensure that this medical dataset conforms to the recommendations outlined by the IEEE in the paper IEEE Recommended Practice for the Quality Management of Datasets for Medical Artificial Intelligence21.

This section outlines several procedures for verifying and inspecting data quality. The code file available in the repository includes checks to ensure the correctness of certain information. In particular:

  • All babies with the variable Premature equal to 1 were checked to have a GestationalAge lower than 36, and inconsistent records were corrected where necessary.

  • To select the first sample for each baby, the dataset was sorted by the variable SamTimeCollected.

  • The time window of sample acquisition was checked by calculating the difference between the date of collection and the date of birth. Some observations were found to be far outside the 48-72 hour window of the health standard. In some cases the discrepancy is probably a typing error, while in others an unforeseen event may have caused a delay. These samples were removed from the published dataset unless they belonged to positive children.

  • Other features whose content did not match their original description were deleted.
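As an illustration of the first check in the list above, the sketch below flags records in which Premature equals 1 but GestationalAge is not lower than 36. It assumes rows represented as dictionaries; the function name is hypothetical, and how flagged records were corrected is not specified in the text.

```python
def inconsistent_premature(rows, threshold=36):
    """Return rows flagged as premature whose GestationalAge is not below the threshold.

    Rows with a missing GestationalAge are skipped because they cannot be checked.
    """
    return [
        r for r in rows
        if r.get("Premature") == 1
        and r.get("GestationalAge") is not None
        and r["GestationalAge"] >= threshold
    ]


rows = [
    {"Premature": 1, "GestationalAge": 33},
    {"Premature": 1, "GestationalAge": 39},
    {"Premature": 0, "GestationalAge": 40},
    {"Premature": 1, "GestationalAge": None},
]
print(inconsistent_premature(rows))
```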

Some laboratory precautions were also applied, in particular during the measurement of the biochemical marker concentrations, such as the periodic change of reagents and the daily sanitization of the spindle of the PerkinElmer machine.