Introduction

Access to clean water is a fundamental human right, essential for sustaining life and health. Waterborne diseases, primarily affecting the gastrointestinal tract, pose a major global public health challenge, arising from diverse pathogenic organisms including viruses, bacteria, and parasites (WHO, 2023). Inadequate Water, Sanitation, and Hygiene (WASH) infrastructure remains a primary contributor to the global disease burden1, with associated deficiencies causing approximately 1.5 million deaths. In low- and middle-income countries (LMICs), nearly 69% of diarrheal mortality is attributed to inadequate WASH facilities (WHO, 2023). Children under five years of age are disproportionately affected, with waterborne infections persisting as a leading cause of morbidity and mortality in this demographic. Notably, an estimated 1.4 million diarrheal deaths in 2019 could have been prevented through effective WASH interventions2,3,4,5.

In Pakistan, only 39–41% of the population has access to safely managed drinking water, while approximately 68% have access to basic sanitation services. Consequently, waterborne diseases (WBD) account for about 30–50% of all diseases and up to 40% of deaths. UNICEF reports over 53,000 annual child deaths from diarrhea linked to poor WASH systems. Water and sanitation problems also impose major economic burdens; estimates suggest annual losses amounting to PKR 343 billion (≈ USD 1.5 billion). In Khyber Pakhtunkhwa, endemic water contamination is intensified by old infrastructure, sewage leaks, and insufficient treatment systems. Nearly 80% of water samples in some areas have been found unsafe for consumption, contributing to outbreaks of cholera, typhoid, hepatitis, and bloody diarrhea6.

The burden of WBD in Pakistan emphasizes the pressing need for systemic WASH interventions (improving water quality, sanitation access, hygiene education, and infrastructure monitoring) to save lives and reduce economic and health consequences. Several epidemiological studies on waterborne diseases have been carried out in Khyber Pakhtunkhwa, focusing on the epidemiological characteristics and risk factor analysis of these diseases7,8,9. In contrast, the proposed work develops algorithms for accurate detection of waterborne disease clusters to enhance epidemiological decision-making. Spatiotemporal cluster detection is central to epidemiological research, revealing disease burden distributions and underlying health determinants. By analyzing population-level patterns across regions and time periods, epidemiologists gain insights into transmission dynamics and prioritize high-risk areas. Such analyses typically require examining ecological, sociodemographic, and infrastructural factors associated with elevated prevalence10,11. Cluster mapping delineates hotspots and informs targeted resource allocation. Identifying spatiotemporal disease clusters is pivotal for strengthening public health surveillance and guiding effective interventions. Health agencies routinely collect spatiotemporal case data to monitor disease dynamics and mitigate outbreaks. Systematic analysis of these patterns enables detection of localized incidence surges, facilitating timely resource deployment. A cluster is formally defined as a spatial or temporal domain where observed cases significantly exceed expected counts12,13. While diverse statistical techniques detect regularly shaped clusters, scan statistics have emerged as the predominant method, particularly for circular clusters14,15,16.
These methods employ a cylindrical scanning window traversing the study area: the base defines a circular or elliptical spatial zone, while the height represents the temporal dimension, capturing both persistent and emerging clusters. As the window expands from minimum to maximum radius, overlapping regions are evaluated via likelihood ratio tests comparing observed versus expected cases under spatial randomness. The window maximizing the test statistic identifies the most likely cluster, indicating significant disease incidence elevation, an approach proven effective in surveillance contexts17,18,19. Nevertheless, traditional scan statistics struggle to detect irregularly shaped clusters, especially in geographies constrained by natural boundaries (rivers, mountains) or urban landscapes, where circular or elliptical windows inadequately capture disease dispersion. Advanced methods detecting arbitrarily shaped clusters address this limitation20,21,22,23, but often rely on restrictive distributional assumptions (Poisson or Gaussian), limiting applicability to complex modern datasets. Moreover, scan-statistic algorithms inherently require strict parametric assumptions, impairing performance when these assumptions are violated, particularly for nontraditional or structurally complex data. In addition, these algorithms are designed to identify regularly shaped clusters and are less effective for irregularly shaped ones. They also require high-quality data, making them vulnerable to noise and outliers24.

The EigenSpot algorithm was developed by Fanaee-T and Gama25 as a nonparametric, eigenspace-based algorithm capable of detecting disease clusters without presuming any particular data distribution, quality, or cluster shape. Yet it can identify only a single hotspot, rendering it inadequate for uncovering multiple high-risk clusters over space and time. To overcome this, Ullah et al.26 proposed a generalized Multi-EigenSpot algorithm that, like its predecessor, relies on eigenspace techniques but substitutes expected case counts for population data as its baseline, thereby enabling the detection of several spatiotemporal clusters. Eigenspace methods are widely applied in data mining, signal processing, and information retrieval. Well-known examples include Google’s search engine, famously explained in “The $25 000 000 000 Eigenvector”27, and the BellKor team’s 2008 Netflix Prize-winning use of singular value decomposition in collaborative filtering28.

The pioneering work by Fanaee-T and Gama25 and Ullah et al.26 introduced eigenspace methods to epidemiology, marking the first application of these techniques to disease cluster identification. Although related to clustering and anomaly detection, hotspot detection (also called outbreak or event detection) is distinct: clustering partitions an entire dataset into groups, anomaly detection flags unexpected individual instances, and hotspot detection pinpoints areas of statistically significant deviation from a defined baseline.

The scan-based approaches mentioned above involve scanning the entire space, which is computationally laborious and time-consuming. The computing time for spatial scan statistics is \(O\left({N}^{3}\right)\), whereas that for space-time scan statistics is \(O\left({N}^{4}\right)\). Several recent initiatives have sought to reduce this complexity, including a more efficient spatial scan statistics approach requiring just \(O\left(\frac{1}{\epsilon }{N}^{2}{\log}_{2}N\right)\). Even under optimal conditions, however, the complexity of SaTScan has not been reduced below \(O\left({N}^{3}\right)\). This enormous processing cost makes these methods almost impossible to employ in real-world applications or on large-scale datasets. The EigenSpot and Multi-EigenSpot methods both have a time complexity of \(O\left(K{N}^{2}\right)\) or \(O\left({\left(mn\right)}^{2}\right)\)25. This is significantly faster than scanning methods, yet still presents challenges for high-dimensional data due to the super-linear growth in computational time.

This study addresses the following three-fold gap in the literature:

  1. The existing EigenSpot methods are not appropriate for detecting clusters of rare diseases.

  2. In nations like Pakistan, a zero count frequently denotes unrecorded data or a lack of data availability. However, a zero that falls between two high counts is misclassified as a disease cluster, which results in erroneous cluster identification.

  3. Lastly, the EigenSpot methods are computationally costly on large-scale spatiotemporal matrices, especially when repeated singular value decomposition (SVD) calculations are required for multiple clusters.

The novelty of this work lies in the following points:

  • To develop an efficient approach for identifying spatiotemporal clusters of rare disease with linear complexity in both spatial and temporal dimensions.

  • To present a novel method designed for identifying spatiotemporal clusters in rare diseases and characterized by its efficiency in time complexity.

  • To integrate sparse singular value decomposition (SVDs) instead of SVD to reduce false-positive cluster detection.

  • To provide a robust alternative to classical tools for finding abnormal components.

The paper is organized into the following sections:

  • Section “Methodology” presents the study area, data sources, and the proposed methodology, detailing the novel EigenSpace algorithm based on SVDs, Z-control charts, and heatmap visualization, together with the implementation steps and computational procedures of the algorithm, including matrix formation, anomaly detection, and iterative updates.

  • Section “Results and discussions” discusses the results and findings from the application of the proposed method to typhoid disease data in Khyber Pakhtunkhwa, highlighting the detected clusters.

  • Section “Performance evaluation” compares the proposed method with existing approaches and evaluates its computational efficiency.

  • Section “Discussion and conclusion” concludes the study with key insights, limitations, and future directions.

Methodology

Materials and methods

Approach

The analytical procedures were implemented in MATLAB R2017, where the proposed Novel EigenSpace Method was systematically compared against the baseline algorithms. For spatial visualization, the Pakistan administrative boundary shapefile was obtained from the Humanitarian Data Exchange (HDX)29 portal (https://data.humdata.org/dataset/cod-ab-pak)30. The Khyber Pakhtunkhwa shapefile was extracted from this dataset, and the study area as well as the cluster distribution maps (Figs. 1, 9, 10, and 11) were generated using QGIS31 Version 3.34 Firenze (QGIS.org, 2025, https://qgis.org).

Study area

This study was conducted in Khyber Pakhtunkhwa (KP), a northern province of Pakistan distinguished by varied topography, including mountainous terrain and alluvial plains. The geographic coordinates of Khyber Pakhtunkhwa are 34.9526205° N latitude and 72.331113° E longitude. Home to approximately 40.85 million people, KP exhibits considerable disparities in access to safe drinking water and sanitation. The province experiences recurrent waterborne disease outbreaks, particularly during the monsoon season, largely due to the contamination of water sources32. These health vulnerabilities are intensified by deficient infrastructure, unregulated urban expansion, and climatic fluctuations. Given these conditions, KP serves as a critical setting for examining spatial distribution and identifying spatiotemporal clusters of waterborne diseases. Figure 1 displays the study area map of Khyber Pakhtunkhwa, generated by the authors using QGIS software with administrative boundary data.

Fig. 1

Study area map of Khyber Pakhtunkhwa, generated using QGIS v3.34 Firenze (https://qgis.org) with administrative boundary data from the HDX30.

Data collection

The dataset used in this study was acquired from the Directorate General (DG) Health Services, Khyber Pakhtunkhwa, and is available at https://www.nih.org.pk/phb/weekly-bulletin. The data comprise monthly WBD records for 35 districts over the 12 months of 2024. As supplementary data, population estimates were extracted from the 2017 national census records to support demographic normalization and computation of expected disease counts in the spatiotemporal analysis.

Computational efficiency

The algorithms proposed by Fanaee-T and Gama25 and Ullah et al.26 implement the EigenSpace framework by integrating conventional SVD techniques for dimensionality reduction and matrix factorization within the spatiotemporal data structure. However, standard SVD is computationally expensive, particularly for large and sparse datasets. To address the computational and structural limitations inherent in EigenSpot and Multi-EigenSpot, the proposed method integrates an advanced variant, SVDs. Designed for efficient processing of high-dimensional and sparse matrices, SVDs significantly improves decomposition accuracy and scalability, making it particularly effective for analyzing rare disease datasets with limited cases.

The proposed algorithm integrates the following three methodological components:

  • SVDs: Employed to extract the principal left and right singular vectors (LSV and RSV) from the K and E matrices for dimensionality reduction.

  • Robust Z-Control Chart: Employed to identify abnormal components in the difference vectors.

  • Visualization: A heatmap is used to display the final Relative Risk (RR) matrix, highlighting potential cluster regions through color intensity variations.

Algorithm

Novel Multi-EigenSpot.

Novel multi-eigenspot

The proposed method targets scenarios where disease case data are aggregated across defined spatial units and temporal intervals. The population at risk and the observed case counts are structured into \(m\times n\) spatiotemporal matrices \(P\) and \(K\), where \(m\) and \(n\) represent the number of spatial regions (districts) and time points (months), respectively. Two auxiliary matrices are derived from them: the expected cases matrix \(E\) is computed from \(P\) and \(K\), and the relative risk matrix \(R\) is computed from \(K\) and \(E\). The relative risk (RR), a standard epidemiological metric, is the ratio of observed to expected counts.

To extract the dominant spatiotemporal patterns, SVD is applied to matrices \(K\) and \(E\), yielding the LSV and RSV that capture spatial and temporal patterns, respectively. Formally, the decomposition of \(K\) is expressed as \(K = UD{V}^{t}\), where \(U\) and \(V\) contain the LSV and RSV, and \(D\) is a diagonal matrix of singular values. The principal singular vectors of \(K\) are denoted as \(SK=({sk}_{1}, {sk}_{2}, \ldots , {sk}_{m})\) and \(TK=({tk}_{1}, {tk}_{2}, \ldots , {tk}_{n})\) for the spatial and temporal dimensions, respectively. Correspondingly, the principal vectors for \(E\) are \(SE=({se}_{1}, {se}_{2}, \dots , {se}_{m})\) and \(TE=({te}_{1}, {te}_{2}, \dots , {te}_{n})\). Abnormal spatiotemporal components are identified by computing the difference vectors \(DS=SK-SE\) and \(DT=TK-TE\).
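As an illustration of this decomposition step, the following Python sketch extracts the principal singular vectors by power iteration and forms the difference vectors. It is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function names, the power-iteration solver (used here in place of a library truncated-SVD routine), and the toy matrices are hypothetical. Starting from a positive vector keeps the singular vectors of a non-negative matrix non-negative, which settles the sign ambiguity of the SVD.

```python
import math

def principal_singular_vectors(M, iters=500, tol=1e-12):
    """Leading left/right singular vectors of a non-zero matrix M (power iteration)."""
    m, n = len(M), len(M[0])
    v = [1.0 / math.sqrt(n)] * n  # positive start fixes the sign convention
    u = [0.0] * m
    for _ in range(iters):
        # u <- M v, normalised to unit length
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            raise ValueError("matrix has no non-zero singular value")
        u = [x / norm_u for x in u]
        # v <- M^T u, normalised to unit length
        v_new = [sum(M[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm_v = math.sqrt(sum(x * x for x in v_new))
        v_new = [x / norm_v for x in v_new]
        change = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if change < tol:
            break
    return u, v

# Hypothetical toy data: 3 districts x 3 months, one elevated cell (district 2, month 1).
K = [[5, 5, 5], [5, 5, 5], [5, 20, 5]]
E = [[60 / 9] * 3 for _ in range(3)]  # uniform expected counts (assumed equal populations)

SK, TK = principal_singular_vectors(K)
SE, TE = principal_singular_vectors(E)
DS = [a - b for a, b in zip(SK, SE)]  # spatial difference vector, length m
DT = [a - b for a, b in zip(TK, TE)]  # temporal difference vector, length n
```

With this sign convention the elevated district and month carry the largest entries of DS and DT. A different solver may return vectors with the opposite sign, in which case the excess appears in the other tail of the difference vectors.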

Figure 2 presents a systematic workflow of the novel Multi-EigenSpot algorithm, detailing the integration of SVDs, robust Z-control charts for anomaly detection, and iterative matrix updating to identify spatiotemporal clusters with reduced computational complexity. The proposed algorithm employs a robust Z-control chart to detect joint spatiotemporal abnormal components by analysing both difference vectors, DS and DT. When abnormal components are identified simultaneously in both vectors, the observed case matrix K is updated by replacing the corresponding entries with their values from the expected case matrix E. Concurrently, the relative risk matrix R is updated by replacing the associated abnormal entries with the median value. This iterative process continues until no further abnormal components are detected in either the spatial or temporal dimension. For hotspot visualization, the final updated matrix R is used: elements of R corresponding to components never classified as abnormal are replaced with 1, so that only potential clusters stand out. A heatmap is then generated to provide a clear graphical representation and segmentation of the detected disease clusters. Figure 3 visually demonstrates the iterative cluster detection and removal mechanism of the proposed algorithm, emphasizing how joint spatiotemporal anomalies are identified and suppressed in the relative risk matrix through recursive updates.
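The control-chart step can be sketched as follows. The source does not define its robust z-score, so this example assumes the common median/MAD form, \(z=(x-\text{median})/(1.4826\cdot \text{MAD})\), with a normal reference distribution; the function name, the `tail` parameter, and the sample vector are hypothetical. The default follows the left-tailed rule stated in the text, and which tail corresponds to excess risk depends on the sign convention of the singular vectors.

```python
from statistics import NormalDist, median

def robust_z_flags(d, alpha=0.10, tail="left"):
    """Indices of entries of d whose one-sided p-value falls below alpha."""
    if tail not in ("left", "right"):
        raise ValueError("tail must be 'left' or 'right'")
    med = median(d)
    mad = median(abs(x - med) for x in d)  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged
    scale = 1.4826 * mad  # assumed constant making MAD comparable to a normal SD
    std_normal = NormalDist()
    flagged = []
    for idx, x in enumerate(d):
        z = (x - med) / scale
        p = std_normal.cdf(z) if tail == "left" else 1.0 - std_normal.cdf(z)
        if p < alpha:
            flagged.append(idx)
    return flagged

# Hypothetical difference vector with one strongly negative component.
ds = [0.0, 0.1, -0.1, 0.05, -0.05, -3.0]
abnormal_left = robust_z_flags(ds)                                # left-tailed, as in the text
abnormal_right = robust_z_flags([-x for x in ds], tail="right")   # mirrored convention
```

A joint spatiotemporal abnormality would then require at least one flagged index in DS and at least one in DT in the same iteration.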

Fig. 2

Flow chart of the proposed algorithms.

Fig. 3

Schematic illustration of the proposed algorithm.

A comprehensive step-by-step process of how these techniques are integrated within the algorithm is given below.

  1. The total observed cases matrix is denoted by \(K\), and the population at risk matrix is denoted by \(P\).

$$K=\left[\begin{array}{ccc}{k}_{11}& \cdots & {k}_{1n}\\ & \ddots & \\ {k}_{m1}& \cdots & {k}_{mn}\end{array}\right], P=\left[\begin{array}{ccc}{p}_{11}& \cdots & {p}_{1n}\\ & \ddots & \\ {p}_{m1}& \cdots & {p}_{mn}\end{array}\right]$$

where \({k}_{11}\) is the total number of disease cases in the first region (district) at the first time point (month), \({p}_{11}\) is the total population at risk in the first region at the first time point, \(m\) is the number of spatial units, and \(n\) is the number of time points.

  2. Compute the expected disease cases matrix \(E\) and the relative risk matrix \(R\) from the \(K\) and \(P\) matrices.

$$E=\left[\begin{array}{ccc}{E}_{11}& \cdots & {E}_{1n}\\ & \ddots & \\ {E}_{m1}& \cdots & {E}_{mn}\end{array}\right] \quad \text{and} \quad R=\left[\begin{array}{ccc}{R}_{11}& \cdots & {R}_{1n}\\ & \ddots & \\ {R}_{m1}& \cdots & {R}_{mn}\end{array}\right]$$

The primary objective of computing the relative risk matrix \(R\) is to enable effective visualization of disease clusters through a heatmap representation.
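The text does not state how \(E\) is obtained from \(K\) and \(P\). One common choice, assumed in the sketch below, is indirect standardization: each cell's expected count is its population multiplied by the overall incidence rate, and \(R\) is the element-wise ratio of observed to expected counts. The function name and the toy numbers are hypothetical.

```python
def expected_and_relative_risk(K, P):
    """Return (E, R) for observed cases K and population at risk P (m x n lists)."""
    total_cases = sum(sum(row) for row in K)
    total_pop = sum(sum(row) for row in P)
    rate = total_cases / total_pop  # assumed baseline: overall incidence rate
    # Expected cases: population at risk scaled by the overall rate.
    E = [[p * rate for p in row] for row in P]
    # Relative risk: observed / expected, guarding against zero expected counts.
    R = [[k / e if e > 0 else 0.0 for k, e in zip(k_row, e_row)]
         for k_row, e_row in zip(K, E)]
    return E, R

# Hypothetical 2 x 2 example: 20 cases in a population of 400 gives a rate of 0.05.
K = [[2, 4], [1, 13]]
P = [[100, 100], [100, 100]]
E, R = expected_and_relative_risk(K, P)
```

Here every cell has an expected count of 5, so the cell with 13 observed cases has a relative risk of 2.6 while the others fall below 1.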

  3. One-rank SVDs is used to obtain the principal left and right singular vectors of matrices \(K\) and \(E\). Our approach requires only the principal singular vectors corresponding to the largest singular value, as the first principal singular vector explains the majority of variance in the data. While a full-rank SVD decomposes a matrix into a combination of orthogonal vectors, one-rank SVDs captures only the most significant singular value and its corresponding singular vectors, effectively representing the matrix with a single dominant direction. For matrix \(K\), the principal left singular vector is denoted as \(SK=({sk}_{1}, {sk}_{2}, \dots , {sk}_{m})\) and the principal right singular vector as \(TK=({tk}_{1}, {tk}_{2}, \dots , {tk}_{n})\). Similarly, for matrix \(E\), the principal left singular vector is denoted as \(SE=({se}_{1}, {se}_{2}, \dots , {se}_{m})\) and the principal right singular vector as \(TE=({te}_{1}, {te}_{2}, \dots , {te}_{n})\). The elements of the principal left singular vectors correspond to the components in the spatial dimension, while the elements of the principal right singular vectors correspond to the components in the temporal dimension.

  4. Abnormal components are identified by computing the difference vectors between the corresponding singular vector pairs: the spatial difference vector DS = SK − SE and the temporal difference vector DT = TK − TE.

  5. Standardized z-score vectors are computed from the difference vectors DS and DT. A robust z-score control chart is then applied to both vectors at a significance level α = 0.10. Elements yielding left-tailed p-values less than α are considered out of control, indicating abnormal components within the spatial and temporal dimensions, respectively.

  6. If one or more abnormal components are detected simultaneously in both DS and DT, the observed case matrix K is updated by replacing the elements corresponding to the joint abnormal spatial and temporal components with their respective expected values. Likewise, the relative risk matrix R is updated by substituting the affected entries with the median value.

  7. Identify any additional abnormal components in the spatial and temporal dimensions. Repeat Steps 1–6 until no abnormal components are found in either dimension.

  8. The elements of the most recently updated matrix R that correspond to components (spatial or temporal) not classified as abnormal are set to 1.

  9. Visualize the final updated relative risk matrix R on a heatmap for clear segmentation and interpretation of the detected disease clusters.
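The update and masking operations in Steps 6 and 8 can be sketched as below. This is an illustrative reading rather than the authors' code: the helper names are hypothetical, and "the median value" is assumed here to mean the median relative risk over the cells of the detected cluster, which the text does not specify.

```python
from statistics import median

def update_matrices(K, E, R, rows, cols):
    """Step 6 sketch: neutralise a detected cluster (rows x cols) in place."""
    cells = [(i, j) for i in rows for j in cols]
    if not cells:
        return set()
    # Assumption: the replacement value is the median RR over the cluster cells.
    cluster_rr = median(R[i][j] for i, j in cells)
    for i, j in cells:
        K[i][j] = E[i][j]      # observed cases replaced by expected cases
        R[i][j] = cluster_rr   # cluster cells share one representative RR
    return set(cells)

def mask_non_clusters(R, cluster_cells):
    """Step 8 sketch: set every cell that was never flagged to RR = 1."""
    return [[R[i][j] if (i, j) in cluster_cells else 1.0
             for j in range(len(R[0]))] for i in range(len(R))]

# Hypothetical 2 x 2 example with one flagged cell (district 1, month 1).
K = [[5, 5], [5, 20]]
E = [[5.0, 5.0], [5.0, 5.0]]
R = [[0.8, 1.1], [0.9, 4.0]]
cluster_cells = update_matrices(K, E, R, rows=[1], cols=[1])
heatmap_matrix = mask_non_clusters(R, cluster_cells)
```

In a full run, the set of cluster cells would be accumulated over all iterations before masking, so that every detected cluster keeps its own relative risk in the heatmap.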

Results and discussions

Typhoid, caused by Salmonella enterica serotype Typhi, is classified as a WBD and is a major public health concern in several countries with limited resources, including Pakistan. Annually, 9 to 12 million individuals are affected by typhoid globally. In Pakistan, thousands of cases are reported annually, particularly in regions characterized by poor sanitation and restricted access to clean water33. The World Health Organization (WHO) has stated that Pakistan is at higher risk for typhoid fever, especially in Khyber Pakhtunkhwa, where environmental and infrastructure conditions increase the likelihood of outbreaks.

This study examined typhoid data in KP, identifying several spatiotemporal hotspots with significantly higher case counts. Identifying these clusters is crucial for public health, enabling the early detection of high-risk areas, which allows for timely action, resource allocation, and awareness campaigns. Spatiotemporal cluster identification provides governments with evidence-based insights to improve infrastructure, execute immunization programs, and monitor disease transmission, thereby reducing morbidity and mortality associated with typhoid. Figure 4 displays the population distribution across the 35 districts of KP, revealing demographic disparities that underpin the normalization of disease incidence rates and the calculation of expected case counts in the spatiotemporal analysis.

Fig. 4

Total district-wise population of KP.

Figs. 5 and 6 show that the highest numbers of typhoid cases were recorded in the districts of Bannu and Peshawar, and in the months of May, July, and October.

Fig. 5

District-wise total recorded typhoid disease cases of KP 2024.

Fig. 6

Monthly observed typhoid disease cases of KP 2024.

Fig. 7

Observed and expected typhoid cases of KP 2024.

The alpha threshold was set at 0.10 because, in common diseases, observed cases typically exceed expected cases in most regions. Setting the alpha threshold at 0.05 or 0.01 could limit the identification of multiple high-risk locations or lead to undetected hotspots. The results were verified by displaying the observed and expected typhoid cases for each of the 35 districts every month (January 2024 to December 2024) in the graphs illustrated in Fig. 7.
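To make the effect of the threshold concrete, the short sketch below computes the z-score cut-off implied by each candidate α for a left-tailed test. It assumes the standard normal reference distribution implied by the z-control chart; the variable names are illustrative only.

```python
from statistics import NormalDist

# Left-tailed critical values: a component is flagged when its z-score is below the cut-off.
critical = {alpha: NormalDist().inv_cdf(alpha) for alpha in (0.10, 0.05, 0.01)}

for alpha, z_crit in critical.items():
    print(f"alpha = {alpha:.2f} -> flag when z < {z_crit:.3f}")
```

The cut-offs are roughly −1.28, −1.64, and −2.33, so α = 0.10 flags components that the stricter thresholds would leave undetected.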

Fig. 8

Heatmap of typhoid disease.

The heatmap illustrates the spatiotemporal distribution of the relative risk matrix for typhoid fever over 35 districts in Khyber Pakhtunkhwa, Pakistan. Figs. 8 and 9 show that the first likely and most significant cluster was detected in the districts of Bannu and Tor Ghar in May and October, with an average relative risk (RR) of 1.767, denoted by a deep red colour. The second likely cluster was identified in the districts of Bannu, Dir Lower, Khyber, Tank, and Tor Ghar in July, with an average RR of 1.663, highlighted in red. The third cluster emerged in August in the districts of Bannu, Khyber, Tank, and Tor Ghar, with an RR of 1.587. The fourth likely cluster was detected in the districts of Bannu (August and November), Charsadda (May, October, and November), Dir Lower (May, October, and November), Khyber (August), Tank (August), and Tor Ghar (August and November), with an average RR of 1.587, denoted by a dark yellow colour. A fifth likely cluster occurred in September over several districts, including Bannu, Charsadda, Dir Lower, and Kohat, with an RR of 1.414. An RR of 1 indicates that no abnormal cases were found in the corresponding districts. Both figures show that the districts of Bannu, Charsadda, Dir Lower, Khyber, Tank, and Tor Ghar were highly affected by typhoid in various months of 2024, marking them as alarming typhoid hotspots.

Fig. 9

Detected typhoid disease clusters in Khyber Pakhtunkhwa, mapped in QGIS v3.34 Firenze (https://qgis.org) using administrative boundaries from HDX30.

Performance evaluation

Figure 10 illustrates the effectiveness of the proposed approach through a comparison with the EigenSpot and Multi-EigenSpot algorithms for spatiotemporal disease cluster identification. The map shows that the EigenSpot algorithm identifies only a single cluster, Bannu (October), while missing other broad disease clusters. The Multi-EigenSpot approach detects numerous spatiotemporal disease clusters, including Bannu (March, April, May, October, and December), Kohat (March and April), D.I. Khan (February), and Swat (June and October), but inaccurately identifies a cluster in the temporal domain: no cases were reported in April, yet a temporal cluster was found in Bannu during that month. The proposed Novel EigenSpace algorithm resolves both limitations, identifying multiple disease clusters while minimizing false positives in periods with no observed cases. The map indicates that the proposed technique identifies real clusters with higher precision, supporting its suitability for detecting significant hotspots in sparse data. Table 1 quantifies the 5–10× speed advantage of the novel algorithm over Multi-EigenSpot, achieved through SVDs truncation and vectorization (0.1–0.5 s vs. 1–3 s for 35 × 12 matrices).

Fig. 10

Comparative maps of disease clusters identified by EigenSpot, Multi-EigenSpot, and the proposed method. Maps created in QGIS v3.34 Firenze (https://qgis.org) with shapefile data from HDX30.

Table 1 Computational time of the Multi-EigenSpot algorithm vs. the Novel Multi-EigenSpot algorithm.

As shown in Table 2, EigenSpot is limited to detecting a single hotspot (Tank, July), while SaTScan identifies only a few clusters due to its circular window constraint. DBSCAN performs better, capturing multiple clusters, but its results are sensitive to parameter selection, sometimes leading to fragmented or spurious detections. In contrast, the proposed Novel Multi-EigenSpot method consistently identifies a broader set of epidemiologically reasonable clusters across districts and months, capturing both temporal recurrence and spatial irregularity. This demonstrates its superior robustness and practical applicability in real-world surveillance settings where outbreak signals are rare and irregular.

Figure 11 provides a visual comparison of spatiotemporal clusters detected by the proposed Novel Multi-EigenSpot, DBSCAN, and SaTScan. SaTScan’s results are constrained by its circular scan windows, leading to under-detection of irregular cluster shapes. DBSCAN identifies several clusters, but its outputs vary depending on parameter choices, sometimes overestimating cluster boundaries. The proposed method, however, delineates multiple realistic clusters with higher spatial precision and temporal consistency, closely matching the epidemiological distribution of typhoid in KP. This visualization reinforces the interpretability and robustness of the proposed algorithm over existing approaches.

Fig. 11

Spatiotemporal clusters detected by the proposed method, DBSCAN, and SaTScan, generated in QGIS v3.34 Firenze (https://qgis.org) using shapefile data from HDX30.

Table 2 Comparison of disease clusters identified by EigenSpot, Novel Multi-EigenSpot (proposed), SaTScan, and DBSCAN.

Figure 12 shows the comparative performance of the EigenSpot family, SaTScan, and DBSCAN across multiple evaluation metrics, including Precision, Recall, F1-score, Robustness Index, and computational efficiency. The results show that the proposed Novel Multi-EigenSpot consistently achieves the highest accuracy (Precision, Recall, and F1-score above 80%) and robustness while keeping computational time low, a difference that remains visible even on a logarithmic scale. In contrast, SaTScan and DBSCAN demonstrate relatively lower detection accuracy and robustness, coupled with higher computational costs. These results highlight the superior balance of efficiency, sensitivity, and robustness achieved by the Novel Multi-EigenSpot method.
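For reference, the accuracy metrics named above can be computed as in the following sketch. It assumes a cell-level evaluation in which each (district, month) pair counts as one detection; the source does not state the matching rule it used, and the function name and the index pairs below are hypothetical.

```python
def cluster_scores(detected, truth):
    """Cell-level precision, recall, and F1 for sets of (district, month) pairs."""
    detected, truth = set(detected), set(truth)
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 true cluster cells, 3 of them detected plus 1 false alarm.
truth = {(0, 4), (0, 9), (1, 4), (1, 9)}
detected = {(0, 4), (0, 9), (1, 4), (2, 6)}
precision, recall, f1 = cluster_scores(detected, truth)
```

In this toy case precision, recall, and F1 all equal 0.75.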

Fig. 12

Benchmarking EigenSpot Family Against SaTScan and DBSCAN: Precision, Recall, F1-score, Robustness, and Runtime.

From Table 3, it is clear that the proposed method consistently outperforms existing approaches. Unlike SaTScan, which is limited to circular or elliptical clusters, and DBSCAN, which is highly sensitive to parameter tuning, the proposed framework efficiently identifies multiple irregularly shaped clusters with minimal parameter dependence. Compared to EigenSpot and Multi-EigenSpot, it demonstrates stronger performance on sparse data, faster computation through truncated SVDs, and improved interpretability via clear heatmap visualizations.

Table 3 Comparative characteristics of baseline and eigenspace-based methods for Spatiotemporal cluster detection.

Table 4 shows that the proposed method maintains the highest robustness to missingness across all scenarios. While the performance of all methods drops as missingness increases, Novel Multi-EigenSpot consistently achieves superior F1-scores with lower variability, demonstrating resilience to both missing completely at random (MCAR) and missing not at random (MNAR) patterns. In contrast, SaTScan shows the weakest robustness, and DBSCAN and EigenSpot exhibit moderate but less stable performance. These results highlight the ability of the proposed framework to handle data imperfections such as missing values and zeros without relying on distributional assumptions.
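A robustness experiment of this kind needs a way to inject missingness into the observed case matrix. The sketch below shows one possible MCAR mechanism, in which each cell is independently replaced by zero with a fixed probability, mirroring the situation where a zero count stands for unrecorded data. The function name, the seed handling, and the toy matrix are assumptions; the source does not describe its simulation code.

```python
import random

def inject_mcar_zeros(K, rate, seed=0):
    """Return a copy of K in which each cell is zeroed independently with probability rate."""
    rng = random.Random(seed)  # fixed seed so a replicate can be reproduced
    return [[0 if rng.random() < rate else k for k in row] for row in K]

# Hypothetical 2 x 3 case matrix.
K = [[12, 7, 9], [4, 15, 6]]
K_unchanged = inject_mcar_zeros(K, rate=0.0)
K_all_missing = inject_mcar_zeros(K, rate=1.0)
K_partial = inject_mcar_zeros(K, rate=0.2, seed=42)
```

An MNAR mechanism would instead let the zeroing probability depend on the cell's own value, for example by masking high counts more often.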

Table 4 Robustness to missingness (RR = 1.6, |S| = 6, |T| = 2, irregular; mean F1 ± SD over 100 replicates).

Discussion and conclusion

The proposed Novel EigenSpace method markedly advances spatiotemporal cluster detection by addressing two critical shortcomings of the original EigenSpot and Multi-EigenSpot algorithms: the inability to identify multiple, rare-disease hotspots and prohibitive computational demands. When applied to 2024 typhoid case data from Khyber Pakhtunkhwa, our approach consistently revealed distinct clusters, both spatially across districts such as Bannu, Charsadda, Dir Lower, Khyber, Tank and Tor Ghar, and temporally during high-incidence months, where previous methods either missed secondary hotspots or generated false positives in periods with zero observed cases. By integrating truncated SVDs with a robust Z-control chart in a recursive update scheme, we realize a five-to-ten-fold acceleration in computation while preserving the sensitivity of cluster detection in inherently sparse rare disease datasets. This advancement not only optimizes the efficiency of real time epidemiological monitoring but also strengthens the robustness of public‐health conclusions derived from erratic, irregularly shaped outbreak signals.

In summary, this study introduces a computationally efficient, distribution-free algorithm that successfully detects multiple spatiotemporal clusters of rare diseases, overcoming limitations of shape assumption, data sparsity, and processing time. Although the heatmap representation of the finalized relative-risk matrix offers an intuitive means of pinpointing hotspots, the current framework does not elucidate the directional propagation or transmission dynamics that give rise to these clusters. Furthermore, by aggregating data into discrete sub‐regional and monthly slices, the method may obscure phenomena that span administrative borders or persist across overlapping temporal intervals.

Future research will focus on incorporating spatiotemporal network models to infer propagation routes, adopting rolling-window analyses for continuous-time detection, and integrating auxiliary covariates (e.g., environmental or mobility data) to contextualize outbreak drivers. By broadening the Novel EigenSpace paradigm along these lines, we aim to provide a comprehensive platform for adaptive epidemiological monitoring and precision-targeted intervention design.