Table 3 Dataset transformation: conceptual view of selected features before vs. after preprocessing.

From: A generalized three-tier hybrid model for classifying unseen (IoT devices) in smart home environments

| Feature | Before Preprocessing (Raw Dataset) | After Preprocessing (Processed Dataset) | Transformation Applied |
| --- | --- | --- | --- |
| Packet Size | Missing values (NaN) in some records | Missing values imputed with mean (e.g., 1200.0) | Imputation using SimpleImputer (Mean) |
| Flow Duration | Large heterogeneous values (e.g., 34,000 µs, 150,000 µs) | Standardized values (z-score normalization, e.g., 0.15, −1.20) | Standardization with StandardScaler |
| Protocol | Text categories: {TCP, UDP, ICMP} | Encoded as integers: {0, 1, 2} | Encoding with LabelEncoder |
| Device Class (Target Variable) | Semantic labels: {Smart Speaker, Smart Camera, Smart TV, ...} | Encoded as integers: {0 = Smart Speaker, 1 = Smart Camera, 2 = Smart TV, ...} | Encoding with LabelEncoder |
| Other Numeric Features | Raw values with varying scales (e.g., Bytes Sent, Packets/sec) | Standardized (mean = 0, std = 1) | Standardization with z-score |
| Other Categorical Features | Non-numeric labels (e.g., “Established”) | Converted to numeric codes (e.g., 0 = No, 1 = Yes) | Encoding with LabelEncoder |
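The three transformations named in the table can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the helper names (`impute_mean`, `standardize`, `label_encode`) and the toy values are assumptions made for illustration. The helpers only mirror the arithmetic that scikit-learn's `SimpleImputer(strategy="mean")`, `StandardScaler`, and `LabelEncoder` perform. Note that `LabelEncoder` assigns codes in sorted order of the class names, so the integer codes obtained in practice may differ from the illustrative mappings shown in the table.

```python
import math
from statistics import fmean, pstdev


def impute_mean(values):
    """Replace missing entries (None or NaN) with the mean of the observed ones."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    fill = fmean(observed)
    return [fill if v is None or math.isnan(v) else v for v in values]


def standardize(values):
    """Z-score each value using the population standard deviation (ddof = 0)."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # constant column: leave the centered values unscaled
    return [(v - mu) / sigma for v in values]


def label_encode(labels):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    classes = sorted(set(labels))
    index = {name: code for code, name in enumerate(classes)}
    return [index[name] for name in labels], classes


# Toy values, invented for illustration only (not taken from the dataset).
packet_size = [1000.0, float("nan"), 1400.0]
flow_duration_us = [34000.0, 150000.0, 92000.0]
protocol = ["TCP", "UDP", "ICMP", "TCP"]

packet_size_imputed = impute_mean(packet_size)
flow_duration_z = standardize(flow_duration_us)
protocol_codes, protocol_classes = label_encode(protocol)

print(packet_size_imputed)
print([round(z, 2) for z in flow_duration_z])
print(protocol_codes, protocol_classes)
```

In a real pipeline, the mean, standard deviation, and label mapping would normally be computed on the training split only and then reused on held-out data, which is what the `fit` and `transform` split in scikit-learn supports.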