Table 2. Summary of our data harmonization process to produce the final harmonized dataset, SNAPD.
Harmonization step | | Details | Observations affected |
---|---|---|---|
Step 0: | Pre-harmonization | Raw data | 9,217,921 |
Step 1: | Organization name | Standardized organization names in instances where there were varied spellings. | 568,644 |
Step 2: | Unique water monitoring sites | Flagged or combined coordinates and Monitoring Location Identifiers (MLIs) where possible, such that each water monitoring site was defined as the unique combination of an MLI and a coordinate pair. | 54,478 (multiple coordinates); 965,724 (multiple MLIs) |
Step 3: | Medium | If the sample was taken in any medium besides water, dropped. | 163,356 |
Step 4: | Date | If an observation was missing a date, dropped. | 1,640 |
Step 5: | Chemical form | If the chemical form of the observation could not be determined, dropped. | 1,026,757 |
Step 6: | Concentration value | If the concentration value was negative, nonsensical (e.g., text instead of a number), or missing and the observation was not indicated to be a non-detect, dropped. | 194,579 |
Step 7: | Concentration units | If concentration units were missing or if they could not be converted to mg/L, dropped. | 20,222 |
Step 8: | Detection text/codes | If the detection code/text indicated that concentration was not detected due to contamination or other quality control reasons, dropped. | 39,868 |
Step 9: | Sample fraction | If sample fraction was ambiguous or missing, dropped. | 340,239 |
Step 10: | Activity type | If the activity type indicated that the sample was part of a quality control check, dropped. | 384,273 |
Step 11: | Result type | If the result type indicated that the concentration value was estimated, dropped. | 130,054 |
Step 12: | Conversions | Converted nutrients to elemental form (as P or as N) and converted concentration units to mg/L, where possible. | all |
Step 13: | Nutrient renaming | Renamed nutrients to incorporate their sample fraction (e.g., nitrogen mixed forms unfiltered, ammonia filtered) to ensure comparability of observations. | all |
Step 14: | Detection limit approximation | If a detection limit was not provided for a non-detect observation in the raw data, approximated the detection limit (see section on Non-detects, detection codes, and detection limits). | 68,533 |
Step 15: | Non-detect handling | If an observation was indicated as non-detected, imputed concentration value using detection limits (see section on Imputing concentration for non-detects). | 1,241,315 |
| | | If a nutrient-site-year had 80% or more non-detected observations, flagged observations and left concentration as N/A. | 612,918 |
Step 16: | Outlier flagging | If a given nutrient’s concentration value was above the 99th or below the 1st percentile, flagged as a potential outlier. | 131,021 |
Step 17: | Duplicates | If there were duplicates from multiple concentrations reported for the same site, nutrient, sample fraction, detection status, and date, averaged concentration and indicated the number of observations in the daily average. Note that this also includes time duplicates (see section on Duplicates). | 3,191,771 |
| | | If there were duplicates due to differently named organizations reporting the same record, chose one organization and assigned to duplicate records. | 142,952 |
| | | If there were duplicates due to a site measuring both detected and non-detected concentrations on the same date for the same nutrient, averaged concentration and flagged that the average includes an imputed value. | 134,848 |
Step 18: | Nutrients and sample fraction combination | If nutrient sample fractions could be combined to create a more common nutrient (e.g., total phosphorus vs. particulate phosphorus), combined observations where possible (see section on Combining nutrients and sample fractions). | 352 (added as new observations) |
Step 19: | Data quality | For a given sample, if the filtered nutrient concentration was greater than or equal to the unfiltered nutrient concentration, dropped. | 100,050 |
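
To make a few of the rules in Table 2 concrete, the following is a minimal Python sketch of Steps 12, 15 (the 80% rule), 17 (daily averaging), and 19. It is an illustration under stated assumptions, not the code used to build SNAPD: all function names, dictionary keys, and record fields are hypothetical, and the conversion factors are derived from standard atomic masses rather than taken from the harmonization scripts.

```python
from collections import defaultdict
from statistics import mean

# Assumed lookup tables (hypothetical names). Element mass fractions use
# standard atomic masses: N 14.007, P 30.974, O 15.999, H 1.008.
ELEMENT_FRACTION = {
    "as N": 1.0,
    "as P": 1.0,
    "as NO3": 14.007 / (14.007 + 3 * 15.999),
    "as NH4": 14.007 / (14.007 + 4 * 1.008),
    "as PO4": 30.974 / (30.974 + 4 * 15.999),
}
# Multipliers that bring a reported unit to mg/L.
TO_MG_PER_L = {"mg/L": 1.0, "ug/L": 1e-3, "g/L": 1e3}


def to_elemental_mg_per_l(value, unit, form):
    """Step 12 sketch: express a concentration as mg/L of N or P."""
    return value * TO_MG_PER_L[unit] * ELEMENT_FRACTION[form]


def mostly_non_detect(detected_flags, threshold=0.8):
    """Step 15 sketch: True if at least 80% of the observations in one
    nutrient-site-year are non-detects (detected flag is False)."""
    if not detected_flags:
        return False
    non_detects = sum(1 for detected in detected_flags if not detected)
    return non_detects / len(detected_flags) >= threshold


def daily_averages(records):
    """Step 17 sketch: average records that share site, nutrient, sample
    fraction, detection status, and date, and keep the observation count."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["site"], rec["nutrient"], rec["fraction"],
               rec["detected"], rec["date"])
        groups[key].append(rec["mg_per_l"])
    return {key: {"mg_per_l": mean(vals), "n_obs": len(vals)}
            for key, vals in groups.items()}


def passes_fraction_check(filtered, unfiltered):
    """Step 19 sketch: a sample is kept only if the filtered concentration
    is strictly below the unfiltered concentration."""
    return filtered < unfiltered
```

For example, under these assumptions `to_elemental_mg_per_l(10, "mg/L", "as NO3")` gives roughly 2.26 mg/L as N, and a sample with equal filtered and unfiltered concentrations fails `passes_fraction_check` because the rule in Step 19 drops ties.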