Table 3 Dataset details and record counts before and after data pre-processing.
From: Advanced air quality prediction using multimodal data and dynamic modeling techniques
Pre-processing step | Before Pre-processing | After Pre-processing | Details |
---|---|---|---|
Total records | 1,200,000 | 1,150,000 | 50,000 records were removed due to duplicates or incomplete timestamps |
Missing values | 15% of the total data | 0% | Missing values were filled using linear interpolation and advanced imputation techniques |
Outliers | 20,000 | 0 | Outliers detected using IQR and z-score methods were either removed or replaced with median values |
Categorical data (e.g., region) | 10 unique values | 10 one-hot encoded vectors | Regions are converted to binary vectors using one-hot encoding |
Spatial features (e.g., AOD) | 200,000 rows with missing AOD | 200,000 rows completed | Missing satellite features were interpolated using geospatial mapping techniques |
Time-series records | 1,200,000 | 1,150,000 | Temporal inconsistencies were corrected by aligning timestamps across all data sources |
Noise reduction (PM2.5) | High variability | Smoothed trends | Noise is reduced using wavelet transforms, retaining meaningful patterns |
Augmented data | 0 | 50,000 additional samples | Time series were augmented with jittering and synthetic transformations |
Dimensionality reduction | 500 features | 250 features | PCA was applied to reduce the redundant dimensions of satellite and meteorological data |
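
Three of the steps listed in the table (linear interpolation of missing values, IQR-based outlier replacement with the median, and one-hot encoding of regions) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 1.5 × IQR fence, the edge-gap handling, and the sample values are assumptions chosen for illustration, and only the Python standard library is used.

```python
from statistics import median, quantiles


def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known
    neighbours. Gaps at either end take the nearest known value
    (an assumption; the table does not say how edges are handled)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def replace_iqr_outliers(values, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with the median.
    k = 1.5 is the conventional fence, assumed here."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    med = median(values)
    return [med if (v < low or v > high) else v for v in values]


def one_hot(labels):
    """Map each label to a binary vector with one position per category."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[label] == j else 0 for j in range(len(categories))]
               for label in labels]
    return categories, vectors


# Hypothetical PM2.5 readings with one gap and one spike.
pm25 = [10.0, None, 12.0, 11.0, 10.0, 12.0, 11.0, 500.0]
cleaned = replace_iqr_outliers(interpolate_linear(pm25))
categories, region_vectors = one_hot(["north", "south", "north"])
```

The order of operations here (fill gaps first, then handle outliers) is a design choice for the sketch; the table does not state the order used in the study.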