An Eigenspace Method for Detecting Space-Time Disease Clusters with Unknown Population-Data

Space-time disease cluster detection assists in conducting disease surveillance and implementing control strategies. The state-of-the-art method for this kind of problem is the Space-time Scan Statistic (SaTScan), which has limitations for non-traditional/non-clinical data sources due to its parametric model assumptions, such as Poisson or Gaussian counts. To address this problem, an Eigenspace-based method called Multi-EigenSpot has recently been proposed as a nonparametric solution. However, it relies on population counts, which are not always available in the least developed countries. In addition, population counts are difficult to approximate for some surveillance data, such as emergency department visits and over-the-counter drug sales, where the catchment area for each hospital/pharmacy is undefined. We extend the population-based Multi-EigenSpot method to approximate potential disease clusters from the observed/reported disease counts only, with no need for population counts. The proposed adaptation uses an estimator of the expected disease count that does not depend on population counts. The proposed method was evaluated on a real-world dataset, and the results were compared with those of the population-based methods Multi-EigenSpot and SaTScan. The results show that the proposed adaptation is effective in approximating the important outputs of the population-based methods.


This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
With the advent of electronic medical records, syndromic data sources, and low-cost location sensors, data on disease occurrences or other health-related events are increasingly encoded with both spatial and temporal information. Based on these data, health authorities conduct surveillance to search for potential clusters of disease or other health-related events. In public health, cluster detection aims to identify those spatiotemporal regions that contain unexpected counts of disease cases or other health-related events. The detection of such potential clusters helps health officials identify their targets of interest for possible interventions. Such clusters show the over-density anomalies in the spatiotemporal space, which assist epidemiologists in finding the environmental factors responsible for a particular disease outbreak in the area.
A number of parametric methods have been developed for detecting space-time clusters in public health data. Examples are the Space-time Scan Statistic (SaTScan) [1,2], Expectation-based Scan Statistic [3,4], Flexible Space-time Scan Statistic [5,6], Space-time Permutation Scan Statistic [7,8], and EvoGridStatistic [9,10]. All these methods are based on Maximum Likelihood Estimation (MLE), which puts constraints on the distribution and quality of the data. These constraints are valid only for clinical data collected from hospitals and are not necessarily valid for non-traditional/non-clinical data sources. For example, data collected from social media [11], pharmacy sales, and school health surveys are non-traditional or non-clinical data sources for public health surveillance [12], for which the parametric model might be very restrictive, i.e., difficult to satisfy. For such data sources, MLE-based methods like SaTScan are not an ideal choice for disease cluster detection. To address this problem, the nonparametric methods EigenSpot [13] and Multi-EigenSpot [14] have recently been developed; they make no assumption about the distribution and quality of the data. However, these nonparametric methods require that the population counts be available. This is a major limitation because census population data are not available in some of the least developed countries. In addition, the population counts are difficult to approximate for some surveillance data, such as emergency department visits and over-the-counter drug sales, where the catchment area for each hospital/pharmacy is undefined. Even if the population counts are available, the catchment area population would not be a good denominator, since there can be natural geographical disparity in health-care utilization data due to disparities in disease prevalence, access to health care, and consumer behavior [15].
To address this problem, we adapt the Multi-EigenSpot algorithm so that it is applicable to disease surveillance in such a realistic scenario. Multi-EigenSpot uses a population-based estimator of the expected disease occurrences that has been frequently used in prior work [9,16]. We propose an adaptation that uses a different estimator of the expected disease occurrences, one that does not depend on the population counts. The proposed adaptation infers the expected disease counts from the observed disease counts only. The experimental evaluation on real-world data shows that the proposed adaptation is effective in approximating the significant outputs of the population-based methods.
Some nonparametric alternatives to the MLE-based scan statistics have also been proposed, such as [17][18][19]. However, these are purely spatial techniques that detect purely spatial clusters, whereas this research focuses on the space-time cluster detection problem. It is evident from the literature that the Eigenspace-based methods [13,14] are the latest nonparametric techniques in the spatiotemporal class of methods for areal-count data.

Materials and Methods
The stepwise process of the proposed approach is given below:
Step 1: Given the observed disease counts, estimate the spatiotemporal matrices of expected disease cases, E, and risk measures, R, according to Eqs. (1) and (2), respectively.
The population-based Multi-EigenSpot estimates the expected counts as
$$E_{ij} = \frac{C_{.j}\, p_{ij}}{P_{.j}}$$
where $E_{ij}$ is the expected disease count for the $i$th sub-region over the $j$th time-point; $C_{.j}$ denotes the total observed/reported cases in the whole study-area at the $j$th time-point; $P_{.j}$ is the total population count in the whole study-area at the $j$th time-point; and $p_{ij}$ is the population count in the $i$th sub-region at the $j$th time-point.
The proposed adaptation instead estimates the expected counts from the observed marginals as
$$E_{ij} = \frac{\left(\sum_{j} C_{ij}\right)\left(\sum_{i} C_{ij}\right)}{C_{..}}$$
where $E_{ij}$ is the expected disease count for the $i$th sub-region over the $j$th time-point; $C_{ij}$ is the observed/reported disease count in the $i$th sub-region at the $j$th time-point; and $C_{..}$ is the grand total of the observed/reported disease counts, calculated as in Eq. (3).
$$C_{..} = \sum_{i}\sum_{j} C_{ij} \qquad (3)$$
Step 2: Calculate the principal left and principal right singular vectors of matrices C and E using a rank-one singular value decomposition. For matrix C, the principal left singular vector is denoted by SC and the principal right singular vector by TC. Similarly, for matrix E, the principal left singular vector is denoted by SE and the principal right singular vector by TE.
Step 3: Compute the difference vector of the left singular vectors as DS = SC − SE, and that of the right singular vectors as DT = TC − TE.
Step 4: Find the abnormally high elements in each difference vector, DS and DT, by applying the Z-control chart with the significance level alpha. The abnormally high elements in the vector DS are associated with the spatial component of the cluster and those in the vector DT with the temporal component.
Step 5: If abnormally high elements are found in the spatial as well as the temporal dimension, update matrix C by replacing the elements corresponding to the out-of-control components with the respective expected cases to remove the detected cluster. Simultaneously, update matrix R by replacing the elements corresponding to the out-of-control components with their average value.
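Step 6: Repeat Steps 2 to 5 on the updated matrix C to search for the next cluster, stopping when abnormally high elements are no longer found in both dimensions.
Step 7: In the updated matrix R, replace the elements corresponding to the components that are not found to be abnormal by 1 to distinguish clearly between the normal and abnormal regions.
Step 8: Visualize the resultant matrix R as a heatmap to show multiple clusters with different colors.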
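A single pass of Steps 1 to 4 can be sketched in Python as follows. This is a minimal, dependency-free illustration and not the authors' MATLAB implementation. Several points are assumptions: the expected count is taken as the product of the row and column totals divided by the grand total (the form suggested by the definitions above), the principal singular vectors are obtained by power iteration in place of a library SVD routine, and the Z-control chart is read as a one-sided rule that flags elements whose z-score exceeds the upper limit z_(1-alpha). All function names and the 3 × 4 input matrix are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def expected_counts(C):
    """Step 1 (assumed form): E[i][j] = C_i. * C_.j / C_.. from observed counts only."""
    row_totals = [sum(row) for row in C]          # C_i. : cases per sub-region
    col_totals = [sum(col) for col in zip(*C)]    # C_.j : cases per time point
    grand_total = sum(row_totals)                 # C_.. : must be non-zero
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def _normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def principal_singular_vectors(M, iters=1000, tol=1e-12):
    """Step 2: principal left (s) and right (t) singular vectors by power iteration."""
    Mt = [list(col) for col in zip(*M)]
    t = _normalise([1.0] * len(Mt))               # positive start keeps t non-negative
    for _ in range(iters):
        t_next = _normalise(_matvec(Mt, _matvec(M, t)))
        done = max(abs(a - b) for a, b in zip(t, t_next)) < tol
        t = t_next
        if done:
            break
    return _normalise(_matvec(M, t)), t


def out_of_control(diff, alpha=0.10):
    """Step 4 (assumed rule): indices whose z-score exceeds the upper limit z_(1-alpha)."""
    limit = NormalDist().inv_cdf(1.0 - alpha)
    mu, sigma = mean(diff), stdev(diff)
    if sigma == 0:
        return []
    return [k for k, d in enumerate(diff) if (d - mu) / sigma > limit]


def detect_once(C, alpha=0.10):
    """Steps 1 to 4 for a single pass; returns flagged row and column indices."""
    E = expected_counts(C)
    sc, tc = principal_singular_vectors(C)
    se, te = principal_singular_vectors(E)
    ds = [a - b for a, b in zip(sc, se)]          # Step 3: DS = SC - SE
    dt = [a - b for a, b in zip(tc, te)]          # Step 3: DT = TC - TE
    return out_of_control(ds, alpha), out_of_control(dt, alpha)


# Hypothetical 3 x 4 matrix of observed counts (rows: sub-regions, columns: time points).
C = [[2, 3, 2, 3],
     [3, 2, 3, 9],
     [9, 8, 2, 8]]
E = expected_counts(C)
rows, cols = detect_once(C)
```

Under the assumed estimator, E is the outer product of the row and column totals scaled by the grand total, so it has rank one and its singular vectors are simply the normalised marginals. Starting the power iteration from a positive vector keeps both singular vectors non-negative for count matrices, so SC and SE share the same sign; a library SVD may return sign-flipped vectors, which would need aligning before the subtraction. On a matrix as small as this toy input, few or no elements may be flagged.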
What is novel in the proposed adaptation is the strategy used for estimating the expected disease counts. Population-based Multi-EigenSpot uses the historical temporal information for the population-at-risk, while our proposed method infers this indirectly from the geographical neighborhood. For each region and time point, we calculate the expected count of a particular disease conditional on the observed marginals. Fig. 1 shows in detail how the proposed method detects multiple clusters in a spatiotemporal space with no requirement for population counts. For instance, assume that two different hotspots exist in a 3 × 4 spatiotemporal space. The two shaded areas in matrix C (Fig. 1) are the two clusters of interest to be approximated by our proposed approach. The intersection of the third row with the first and second columns denotes the most likely hotspot, and the intersection of the second and third rows with the fourth column denotes the secondary (additional) cluster. The only input is the spatiotemporal matrix of the observed disease counts, denoted by C. Given the matrix C, the proposed method approximates these two clusters in two iterations. The most likely cluster is detected in the first iteration. The detected hotspot is then removed by replacing the observed counts with the corresponding expected counts, and the method is repeated for the secondary cluster. In the final updated matrix R, the cells containing the value M1 represent one cluster and those containing the value M2 represent the other cluster.
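The cluster-removal and labelling operations in this walkthrough can be sketched as below. It is a minimal illustration under stated assumptions: a detected cluster is taken to be every cell at the intersection of the flagged rows and flagged columns, the risk R is assumed to be the ratio of observed to expected counts, and the expected counts are assumed to come from the observed marginals. Whether E is re-estimated after each removal is not specified here, so the sketch leaves E unchanged. The counts and the names `remove_cluster` and `mask_normal_cells` are hypothetical.

```python
def remove_cluster(C, E, R, rows, cols):
    """Overwrite the cells of a detected cluster in copies of C and R."""
    C_new = [row[:] for row in C]
    R_new = [row[:] for row in R]
    cells = [(i, j) for i in rows for j in cols]
    avg_risk = sum(R[i][j] for i, j in cells) / len(cells)
    for i, j in cells:
        C_new[i][j] = E[i][j]      # observed count replaced by expected count
        R_new[i][j] = avg_risk     # cluster cells share their average risk (e.g. M1)
    return C_new, R_new, avg_risk


def mask_normal_cells(R, cluster_cells):
    """Set every cell outside the detected clusters to 1."""
    return [[R[i][j] if (i, j) in cluster_cells else 1.0
             for j in range(len(R[0]))] for i in range(len(R))]


# Hypothetical 3 x 4 observed counts; the hotspot sits in row 3, columns 1 and 2.
C = [[2.0, 3.0, 2.0, 3.0],
     [3.0, 2.0, 3.0, 9.0],
     [9.0, 8.0, 2.0, 8.0]]
row_tot = [sum(r) for r in C]
col_tot = [sum(c) for c in zip(*C)]
E = [[r * c / sum(row_tot) for c in col_tot] for r in row_tot]   # assumed estimator
R = [[C[i][j] / E[i][j] for j in range(4)] for i in range(3)]    # assumed risk ratio

# Zero-based indices: row 2 is the third sub-region, columns 0 and 1 the first two time points.
C1, R1, m1 = remove_cluster(C, E, R, rows=[2], cols=[0, 1])
heat = mask_normal_cells(R1, {(2, 0), (2, 1)})
```

Working on copies keeps the original matrices available for inspection. In a full run, `C1` would be passed to the next detection pass, and the set of cluster cells would grow with each detected cluster before the final masking.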

Experiment with the Real-World Dataset
In this section, the proposed approach is applied to the measles case data in Khyber-Pakhtunkhwa, Pakistan (Jan 2016-Dec 2016), assuming the population is unknown. This dataset has been described in detail elsewhere [14]. The proposed method was executed in MATLAB (version R2014a). Based on the spatiotemporal data on the observed measles cases, the proposed method with alpha = 0.10 produces the heatmap in Fig. 2, which shows the potential measles hotspots. The heatmap shows three potential measles clusters in Khyber-Pakhtunkhwa in the period from January 2016 to December 2016. The most likely cluster is seen in the district of Bannu for May, October, and December with an average Relative Risk (RR) = 1.677, denoted with a dark red color on the heatmap. The secondary cluster is seen in the district of Bannu for April with an average RR = 1.614, denoted by a light red color on the heatmap. The third cluster is seen in two districts (Kohat and D. I. Khan) for March and April with an average RR = 1.58, represented with a yellow color on the heatmap. These hotspot regions were also detected by Multi-EigenSpot and the Space-time Scan Statistic in the previous study on the same dataset [14], which supports the effectiveness of the proposed approach for surveillance data with unknown population-at-risk information. FATA and IDP camps suffer from a low vaccination rate due to lack of awareness [20,21].

Performance Comparison with Population-Based Methods
In this section, we compare the outputs of our proposed method with those of Multi-EigenSpot and SaTScan, which have already been applied to the same dataset [14]. The outputs of these three methods are presented in Tab. 1. Tab. 1 shows that the regions detected by our proposed method were also detected by Multi-EigenSpot and SaTScan. Our proposed method detects (Bannu, May, Oct, Dec) as the most likely cluster and (Bannu, Apr) as the secondary cluster. Notably, the most likely and secondary clusters of the proposed approach are the same as those detected by the population-based Multi-EigenSpot. Moreover, our approach detects (Kohat, D. I. Khan, Mar, Apr) as the third cluster while Multi-EigenSpot detects (Bannu, Kohat, D. I. Khan, Mar) as the third cluster, so the two have two districts and one month in common.
The outputs of the proposed approach are also included in the significant outputs of SaTScan. The Space-time Scan Statistic detects (Bannu, Apr-May) as the most likely cluster. This cluster is covered by the first two clusters of the proposed method. The secondary cluster of SaTScan (Kohat, Mar-Apr) is covered by the third cluster of our proposed method. The proposed approach detects the first three high-risk clusters, whereas with the population counts the number of detected clusters can increase to as many as 8. This suggests that if the population counts can be approximated, then Multi-EigenSpot, using this extra information, performs better than our proposed approach.
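The overlap and coverage statements above can be checked mechanically by treating each cluster as a set of (district, month) cells. The sketch below uses only the clusters quoted in the text; the helper name `cells` is hypothetical, and reading "Apr-May" and "Mar-Apr" as the two listed months is an assumption.

```python
from itertools import product


def cells(districts, months):
    """All (district, month) cells spanned by a cluster."""
    return set(product(districts, months))


# Clusters reported for the proposed method.
proposed = [
    cells(["Bannu"], ["May", "Oct", "Dec"]),
    cells(["Bannu"], ["Apr"]),
    cells(["Kohat", "D. I. Khan"], ["Mar", "Apr"]),
]

# Clusters reported for the population-based methods.
multi_eigenspot_third = cells(["Bannu", "Kohat", "D. I. Khan"], ["Mar"])
satscan_most_likely = cells(["Bannu"], ["Apr", "May"])
satscan_secondary = cells(["Kohat"], ["Mar", "Apr"])

# Cells shared by the two third clusters.
shared = proposed[2] & multi_eigenspot_third

# Is each SaTScan cluster contained in the proposed method's output?
covered_most_likely = satscan_most_likely <= (proposed[0] | proposed[1])
covered_secondary = satscan_secondary <= proposed[2]
```

Here `shared` holds the March cells of Kohat and D. I. Khan, and both coverage flags evaluate to true, which matches the comparison described in the text.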

Conclusion
We proposed the first Eigenspace-based method that allows nonparametric cluster detection in scenarios where the population counts are unavailable or difficult to approximate. Our proposed method replaces the temporal inference in methods like EigenSpot [13] and Multi-EigenSpot [14] with geographical inference, which results in a method that can be used for hotspot detection in the least developed countries where population data are not available or very expensive to obtain. The results indicate that the proposed approach can detect the significant clusters with no need for the population counts. The proposed adaptation can delineate the boundaries of a disease outbreak and has the potential to guide control efforts in many of the least developed countries where the population data are not available or difficult to access. In addition, the proposed method can be used as a nonparametric solution for cluster detection in many research fields, such as criminology [22,23], networks [24], and the environment [25], where population data are not relevant.
The proposed method does not account for spatial and temporal covariates. Without such an adjustment, it may become impractical to examine all 'unusual' events, implicitly diminishing the significance of the surveillance. Extending the proposed method to adjust for spatial and temporal covariates is recommended for future work in this area.