Table 3. Dataset details and record counts before and after data pre-processing.

From: Advanced air quality prediction using multimodal data and dynamic modeling techniques

| Pre-processing step | Before Pre-processing | After Pre-processing | Details |
| --- | --- | --- | --- |
| Total records | 1,200,000 | 1,150,000 | 50,000 records were removed due to duplicates or incomplete timestamps |
| Missing values | 15% of the total data | 0% | Missing values are addressed using linear interpolation and advanced imputation techniques |
| Outliers | 20,000 | 0 | Outliers detected using IQR and z-score methods were either removed or replaced with median values |
| Categorical data (e.g., region) | 10 unique values | 10 one-hot encoded vectors | Regions are converted to binary vectors using one-hot encoding |
| Spatial features (e.g., AOD) | 200,000 rows with missing AOD | 200,000 rows completed | Missing satellite features interpolated using geospatial mapping techniques |
| Time-series records | 1,200,000 | 1,150,000 | Temporal inconsistencies were corrected by aligning timestamps across all data sources |
| Noise reduction (PM2.5) | High variability | Smoothed trends | Noise is reduced using wavelet transforms, retaining meaningful patterns |
| Augmented data | 0 | 50,000 additional samples | Time series augmented with jittering and synthetic transformations |
| Dimensionality reduction | 500 features | 250 features | PCA was applied to reduce the redundant dimensions of satellite and meteorological data |
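
Three of the steps listed in the table (linear interpolation of missing values, IQR-based outlier replacement with the median, and one-hot encoding of regions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the IQR multiplier `k=1.5`, the nearest-value fill for gaps at the ends of a series, and the sample values are all assumptions made for illustration. The "advanced imputation" and z-score variants mentioned in the table are not shown.

```python
from statistics import median, quantiles


def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known points.

    Assumption: gaps at the start or end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def replace_iqr_outliers(values, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with the series median.

    Assumption: k=1.5 is the conventional default; the table does not state it.
    The median is computed over the full series, outliers included.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    med = median(values)
    return [med if (v < low or v > high) else v for v in values]


def one_hot(labels):
    """Return the sorted category list and one binary vector per label."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]
    return categories, vectors


# Hypothetical PM2.5 readings with one gap and one spike.
pm25 = [10.0, None, 14.0, 13.0, 12.0, 500.0, 11.0, 12.0]
cleaned = replace_iqr_outliers(interpolate_linear(pm25))

# Hypothetical region labels.
categories, vectors = one_hot(["north", "south", "north"])
```

Filling gaps before screening for outliers keeps the quartiles from being computed on a series that still contains missing entries. In practice a library such as pandas or scikit-learn would typically replace these hand-written helpers.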