1. Field of the Invention
The present invention relates to a method for predicting the outcome of, and evaluating, clusters. In particular, the invention relates to a method of determining deviation and predicting future outcomes of clusters with certain attributes. In one embodiment, the present invention relates to epidemic outbreaks of disease and, more particularly, to a method for predicting the spread thereof.
2. Description of the Related Art
The emergence of Geographic Information Systems (GIS) has opened a new method for analyzing the spatial dynamics of clusters, for example, of epidemics.1 Spatial features (e.g., mountains, cities, rivers, and farms) are rarely distributed in random or regular patterns. They are usually fragmented (discontinuous). Spread of disease during an epidemic may be influenced by factors that include but go beyond topographic features (such as winds, human traffic, road density, and other spatial variables).2,3
An epidemic process may be regarded as composed of 2 spatial points (e.g., 2 animals, 2 farms, or 2 counties) connected through a line. One of these points is the infector and the other the infected. The line may have multiple forms (e.g., a road or a delivery route). By expanding this concept to that of a network (a set of nodes or points linked by multiple lines), animals located at nodes are expected to be infected during an epidemic that spreads along the lines. Hence, the issue of interest is to identify the unknown lines of an epidemic network.
Spatial connectivity depends on Euclidean (straight line) and non-Euclidean distances (e.g., connections through roads), which are factors that influence spread of disease during an epidemic.8 Euclidean distance can be estimated by measuring the distance between centroids (e.g., farm or county centroids).9 Non-Euclidean distance can be assessed by estimating total (major and minor) road density, which tends to be linearly predicted by major road density.10
Epidemic spatial connectivity may be investigated by use of classic spatial statistical techniques. They include the Moran I test (which assesses spatial autocorrelation), the Mantel test (which measures spatial-temporal autocorrelation), and their derived correlograms. The correlograms identify the distance or time lag within which spatial autocorrelations extend.11,12 The Moran test evaluates whether there is a spatial autocorrelation (e.g., whether cases are associated with sites spatially close to each other, such as in adjacent counties).13 Positive autocorrelation exists when the magnitude of cases increases as spatial proximity increases. Similarly, the Mantel statistic is used to assess spatial and temporal autocorrelation.14,15
Although local Moran and Mantel tests can quantify the contribution of each specific spatial point to the overall (spatial or temporal-spatial) autocorrelation,12 most local tests are not spatially explicit because they do not identify the line that connects an infected point to other (susceptible or subsequently infected) points. Those that are spatially explicit (e.g., the scan statistic test) are not well suited to detecting long-distance links and, therefore, fragmented clusters.16-22 Those limitations could be addressed by local tests that focus on the connecting line between points. Connectivity has been investigated from a network point of view (spatial link analysis) as conceptualized in a classic study and used in various fields.4-7 Together, assessments of spatial-temporal autocorrelation, supplemented with local tests that estimate the contribution to the overall autocorrelation provided by specific connections (spatial links between pairs of infected locations), could spatially identify geographically proximal case clusters (close-distance connections) as well as fragmented clusters (i.e., cases that are located in spatially fragmented areas and connected by long-distance links).
In accordance with the present invention, there is provided a method for identifying and evaluating the relationship between clusters in a set primarily based on the connectivity between such clusters. So in one embodiment thereof, there is provided a method of identifying clusters from a set of points selected from the group consisting of individual points and spatial points comprising:
Likewise another embodiment of the invention comprises a method of determining connectivity between a set of points selected from the group consisting of individual points and spatial points comprising:
a) selecting a geographic area; acquiring data on the spatial coordinates that characterize the selected geographic area;
b) selecting attributes to be measured for each point of the set;
c) processing the attributes of each point;
d) determining the linkage between the points based on the attributes;
e) identifying the magnitude of the attributes of any point having an attribute deviating significantly from the average point in the set as a cluster.
In yet another embodiment, the invention relates to a method for prediction of the spread of an epidemic outbreak of a disease comprising:
a) selecting a geographic area;
b) acquiring data on the spatial coordinates that characterize the selected geographic area;
c) selecting disease attributes to be measured for each point of the set;
d) processing the attributes of each point;
e) determining the linkage between the points based on the attributes;
f) determining the rate of change of the attributes over time.
These and other objects of the present invention will be clear when taken in view of the detailed specification and disclosure in conjunction with the appended figures.
A complete understanding of the present invention may be obtained by reference to the accompanying drawings, when considered in conjunction with the subsequent detailed description, in which:
The general description of the invention and how to use it are stated in the Brief Summary above. This detailed description defines the meaning of the terms used herein and specifically describes embodiments so that those skilled in the art can practice the invention. The interests in evaluating clusters noted above are addressed, and the corresponding benefits provided, by the present invention, as can readily be seen from the disclosure that follows.
As used herein, the term “points” refers to individual points or to spatial points. Examples of individual points include people, animals, sites, groups or the like having an attribute as part of a whole set. Examples of spatial points include mountains, cities, rivers, roads and farms. As used herein, “attributes” relates to attributes of the points such as road accidents, work-related accidents, opinions, social networks, natural resources, weather, computer viruses, crime, epidemics, infections, banking information, internet information and the like.
As used herein, the term “spatial coordinates” refers to any bi-dimensional coordinates including things such as distance, height and weight and the like. Distance has its broadest possible meaning. Thus, not only is the measurement of point-to-point distance included, but other abstract distances, such as years of service and the like, are also included.
As used herein, the term “connectivity” refers to the relationship of attributes between two clusters; in other words, a relationship that indicates potential causes or consequences, for example, why or how something happened, what could happen later, where or how much has happened and the like. One embodiment of this connectivity is the relationship between clusters of infected individuals and noninfected individuals and what would happen over time (i.e., how the disease could spread over time). Connectivity can also be used to determine the relative deviation between clusters. So in one embodiment, one could look at clusters of individuals and use connectivity to identify a cluster of individuals with a higher rate of disease infection, cancer or the like than other clusters of individuals.
As used herein, “geographic information system” (GIS) refers to a collection of spatial features, topographical features or a combination of the two. The GIS is collected for a specific geographic area, for example, for a whole country, a city, a county or the like. Once a particular geographic area is selected, the corresponding GIS is collected for that geographic area.
As used herein, “processing the attributes” refers to sorting, measuring, comparing, ranking the magnitude or like process to correlate the attributes of each point in the set.
As used herein, “determining the linkage” refers to determining the number of links per individual or spatial point, the index of each link per individual or spatial point, the time the attribute was reported, or combinations of these or the like.
The following embodiment of an epidemic spread further illustrates the invention and teaches one skilled in the art how the invention works, is applied and is calculated.
In one embodiment, geographically referenced epidemic data are needed to test the influence of spatial connectivity on disease dispersal during an epidemic. The 2001 epidemic of foot-and-mouth disease (FMD) in Uruguay offers an opportunity to evaluate diffusion over time and space during an epidemic. Cattle were predominantly infected in a country previously free of FMD.23-25 The minimal replication cycle of FMD virus is estimated to be 3 days.26 Studies27-29 on FMD and other diseases have indicated heterogeneous spatial spread and used the centroids of irregular polygons (i.e., counties) as units of analysis. Road networks may influence dispersal of FMD virus.24,25,30
Three objectives are met by the present invention: to determine whether infected sites are spatially or temporally autocorrelated; if sites are clustered, to measure the contribution of each spatial link to the overall spatial-temporal autocorrelation; and to use that information to generate and evaluate hypotheses on the potential for disease spread during an epidemic for specific counties.
Details of this epidemic have been reported 23-25 elsewhere. Initial cases of FMD were identified in the southwestern quadrant of Uruguay, a non-urban, cattle-raising region characterized by higher road density than the national median (
Two GIS packages a, b were used to geographically reference data and create maps. An official map of Uruguay, c including the location and area of the 276 counties, was used. On the basis of the 2000 Agricultural Census for Uruguay, 248 counties (cattle-raising regions) were selected. Of those, 163 counties contained infected animals at some time during the 11-week period that began on Apr. 23, 2001. Geographically coded data on weekly (county level) and daily (for the first 6 days only; farm level) number of cases were retrieved from public sources and processed as described elsewhere. 24, 34-37
Four steps were used to determine the intercounty centroid distance. First, the x- and y-coordinates for each county's surface were identified by accessing the x- and y-values in the shape field. Second, the center value for each polygon (centroid) was provided by use of the GIS packages. Third, a point layer was generated from the x- and y-values of the centroid for each county. Fourth, distances between all centroids were calculated by use of the GIS tools, which selected a distance larger than the largest distance between any pair of points in the territory under study.
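The four steps above can be sketched without a GIS package. The following is a minimal illustration, assuming county outlines are simple polygons in a projected coordinate system measured in kilometers; the function names and the three square "counties" are hypothetical and are not taken from the study data.

```python
import math
from itertools import combinations

def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))

def centroid_distances(centroids):
    """Euclidean distance for every unordered pair of named centroids."""
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }

# Hypothetical county outlines (km); illustrative only.
counties = {
    "A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "B": [(6, 0), (10, 0), (10, 4), (6, 4)],
    "C": [(0, 6), (4, 6), (4, 10), (0, 10)],
}
centroids = {name: polygon_centroid(poly) for name, poly in counties.items()}
distances = centroid_distances(centroids)
```

A set of n centroids yields n(n−1)/2 unordered pairs, so the 163 counties with infected animals give `math.comb(163, 2)` = 13,203 county pairs, which matches the size of the distance matrix described in the text.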
Three steps were used to generate data on road density. First, the total area of each county was determined by accessing the county value for area. Second, the national highway layer (excluding urban areas)c was intersected with the county layer to characterize and identify road segments by county. Third, the length of road segments was summarized for each county (i.e., the total length of roads was divided by total area of the county).
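The road density calculation reduces to a sum and a division per county. The sketch below assumes the intersection of the highway and county layers has already produced a list of segment lengths per county; the function name and all values are hypothetical.

```python
def road_density(segments_km_by_county, area_km2_by_county):
    """Road density (km of road per km^2 of county area) for each county."""
    density = {}
    for county, area in area_km2_by_county.items():
        # Counties with no intersecting road segment get a density of 0.
        total_length = sum(segments_km_by_county.get(county, []))
        density[county] = total_length / area
    return density

# Hypothetical segment lengths (km) and county areas (km^2).
segments = {"A": [12.0, 8.5, 3.5], "B": [6.0]}
areas = {"A": 100.0, "B": 50.0, "C": 80.0}
density = road_density(segments, areas)
```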
The GIS-generated matrix of all pairs of intercounty (centroid-to-centroid) distances (13,203 county pairs), the table containing density of county roads, and the matrix including the number of infected cattle per week and county identifier were transferred into and processed by use of technical computing software.
Spatial connectivity involved Euclidean distances (i.e., number of kilometers) between counties with infected cattle (distance between centroids) and road density (road distance divided by county area, a non-Euclidean distance measure). The Moran I coefficient was used to analyze spatial autocorrelation.13 Positive values for spatial autocorrelation indicate that sites spatially closer to each other than the mean distance have similar numbers of cases, whereas negative values for spatial autocorrelation indicate the opposite. The Moran I coefficient of autocorrelation was calculated as follows:
where n is the number of counties, i and j are counties (i and j cannot be the same county), wij is the spatial connectivity matrix, zi is the difference between the prevalence in county i and the overall mean prevalence, zj is the difference between the prevalence in county j and the overall mean prevalence, S0 is an adjustment constant, k is a county index, and zk is the difference between the county index and overall index. In addition, zi=xi−x̄, where xi is the weekly number of cases/100 farms in county i and x̄ is the mean prevalence. The value for wij is calculated by use of the following equation:
$$w_{ij} = f(d_{ij}, r_i, r_j) = (d_{ij})^{-a}\,(r_i r_j)^{b} \qquad \text{(Eq. 2)}$$
where dij is the matrix of the Euclidean distance between counties i and j (i and j cannot be the same county), ri is the road density for county i, rj is the road density for county j, the value for variable a is a measure of the degree of epidemic diffusion in relation to distance (i.e., there is greater diffusion at shorter distances),37-41 and the value for variable b is a measure of the extent of connectivity between counties (i.e., greater road density results in greater connectivity), regardless of distance. For fixed positive values of variable a, large values of variable b support local spread as well as long-distance spread because higher local road density is associated with higher interstate highway density. Values for variables a and b were estimated by maximizing the spatial autocorrelation coefficient as reported elsewhere6 as follows:
where a>0, b>0, and t is time (week of the epidemic). The value for S0 was calculated as follows:
where i and j cannot be the same county.
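The weighting and maximization described above can be sketched as follows. This is an illustration under stated assumptions, not the inventors' implementation: the Moran I and S0 expressions used here are the textbook forms, I = (n/S0) Σ wij zi zj / Σ zk², with S0 equal to the sum of all wij for i ≠ j, which match the symbols defined in the text; the maximization is a plain grid search; and the function names and the four-county data set are hypothetical.

```python
from itertools import product

def spatial_weights(dist, road, a, b):
    """w_ij = d_ij**(-a) * (r_i * r_j)**b for i != j (Eq. 2); 0 on the diagonal."""
    n = len(road)
    return [
        [0.0 if i == j else dist[i][j] ** (-a) * (road[i] * road[j]) ** b
         for j in range(n)]
        for i in range(n)
    ]

def morans_i(x, w):
    """Textbook Moran's I (assumed form): (n / S0) * sum w_ij z_i z_j / sum z_k^2."""
    n = len(x)
    mean = sum(x) / n
    z = [xi - mean for xi in x]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    s0 = sum(w[i][j] for i, j in pairs)  # adjustment constant S0
    num = sum(w[i][j] * z[i] * z[j] for i, j in pairs)
    den = sum(zk * zk for zk in z)
    return (n / s0) * num / den

def best_a_b(x, dist, road, grid):
    """Grid search (an assumed strategy) for the (a, b) pair maximizing Moran's I."""
    return max(product(grid, grid),
               key=lambda ab: morans_i(x, spatial_weights(dist, road, *ab)))

# Hypothetical example: four counties along a line, 10 km apart.
positions = [0.0, 10.0, 20.0, 30.0]
dist = [[abs(p - q) for q in positions] for p in positions]
road = [0.30, 0.25, 0.10, 0.10]       # km of road per km^2
prevalence = [10.0, 8.0, 2.0, 1.0]    # weekly cases per 100 farms
grid = [0.25, 0.5, 1.0]               # candidate values, all > 0
best = best_a_b(prevalence, dist, road, grid)
```

Restricting the grid to positive values reflects the stated constraints a>0 and b>0. In practice the search would be repeated for each week t of the epidemic.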
Interactions of space and time were analyzed by use of the Mantel coefficient Is-t.14,15 The Is-t coefficient was calculated by use of the following equation:
where yij indicates the closeness in time between infections and i and j cannot be the same county. The first moments of the Moran I and Mantel Is-t statistics are reported elsewhere.6 Observations were assumed to be random independent samples from an unknown distribution function relative to the set of all possible values of I or Is-t when the xi were randomly permuted around the county system.6 The matrix yij was defined as yij=1 when county i had values greater than the mean number of cases/100 farms (total number of susceptible farms/county) at week t and county j also had values greater than the mean number of cases/100 farms at week t−m; otherwise, yij was equal to 0. This cross-correlation at lag m measured the temporal correlation of events at time t and those at a specified preceding point (i.e., m weeks earlier).
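The construction of the matrix yij can be sketched directly from its definition. Two points in the sketch are assumptions rather than statements from the text: the mean is taken separately across counties for week t and for week t−m, and the statistic is computed as the unnormalized Mantel cross-product, the sum of wij·yij over i ≠ j. Function names and data are hypothetical.

```python
def closeness_in_time(cases_by_week, t, m):
    """y_ij = 1 if county i exceeds the week-t mean and county j the week-(t-m) mean."""
    now, before = cases_by_week[t], cases_by_week[t - m]
    mean_now = sum(now) / len(now)
    mean_before = sum(before) / len(before)
    n = len(now)
    return [
        [1 if i != j and now[i] > mean_now and before[j] > mean_before else 0
         for j in range(n)]
        for i in range(n)
    ]

def mantel_cross_product(w, y):
    """Unnormalized Mantel statistic (assumed form): sum over i != j of w_ij * y_ij."""
    n = len(w)
    return sum(w[i][j] * y[i][j] for i in range(n) for j in range(n) if i != j)

# Hypothetical weekly cases per 100 farms for four counties.
cases_by_week = {1: [5.0, 1.0, 0.0, 0.0], 2: [6.0, 4.0, 1.0, 0.0]}
# Hypothetical binary connectivity: neighbors along a line are linked.
w = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
y = closeness_in_time(cases_by_week, t=2, m=1)
statistic = mantel_cross_product(w, y)
```

Repeating the calculation for several values of m gives the points of a temporal correlogram.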
Interaction between county pairs was measured as a function of their distance from each other as described elsewhere.6 The graphic display of the global spatial autocorrelation coefficient (Moran I) plotted against the distance lag (correlogram) was determined by use of the following equation:
where g is the distance between the 2 counties, the matrix wij contains values of 1 for all the links among county pairs (i, j) located within the distance g and values of 0 for all other links not included within the Euclidean distance g, and i and j are not the same county. The temporal correlogram is the plot of Is-t as a function of the time lag m. Hence, the temporal correlogram was used to determine the extent of spatial-temporal autocorrelation for various time lags.
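A spatial correlogram of this kind can be sketched by recomputing the autocorrelation coefficient with binary weights for each distance g. The sketch reads "within the distance g" literally as dij ≤ g and uses the textbook Moran I form; both are assumptions, and the function names and data are hypothetical.

```python
def morans_i(x, w):
    """Textbook Moran's I (assumed form); None when no pair has nonzero weight."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    s0 = sum(w[i][j] for i, j in pairs)
    if s0 == 0:
        return None  # no county pair falls within this distance
    num = sum(w[i][j] * z[i] * z[j] for i, j in pairs)
    return (n / s0) * num / sum(v * v for v in z)

def correlogram(x, dist, lags):
    """Moran's I for each distance g, with w_ij = 1 when d_ij <= g and i != j."""
    n = len(x)
    result = {}
    for g in lags:
        w = [[1 if i != j and dist[i][j] <= g else 0 for j in range(n)]
             for i in range(n)]
        result[g] = morans_i(x, w)
    return result

# Hypothetical example: four counties along a line, 10 km apart.
positions = [0.0, 10.0, 20.0, 30.0]
dist = [[abs(p - q) for q in positions] for p in positions]
cases = [1.0, 1.0, 0.0, 0.0]
result = correlogram(cases, dist, lags=[5.0, 10.0, 30.0])
```

Plotting the returned values against g gives the correlogram. An analysis by distance classes (for example, 120 to 400 km) would replace the single threshold with a lower and an upper bound.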
On the basis of network analysis, relationships between nodes (i.e., counties) can be described by their links.5,7 County pairs were considered connected by a spatial link when their contribution to the global spatial autocorrelation coefficient did not equal 0. The contribution of specific spatial links was defined as the link strength (index) between counties with infected cattle (i, j) located within a distance g, as indicated by use of the following equation:
where Iij (g) is the contribution of the specific spatial link.
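One way to compute a link index is to split the global coefficient into its pairwise terms, so that the contributions of all links sum to the global value. The sketch below assumes binary weights within distance g and the textbook Moran I form; the exact expression used by the inventors is not reproduced in this text, and the function name and data are hypothetical.

```python
from itertools import combinations

def link_indices(x, dist, g):
    """Contribution of each county pair within distance g to the global Moran's I."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    linked = [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= g]
    den = sum(v * v for v in z)
    if not linked or den == 0:
        return {}
    s0 = 2 * len(linked)  # binary weights, counted in both directions
    scale = n / (s0 * den)
    # Each unordered pair appears twice in the double sum: (i, j) and (j, i).
    contributions = {(i, j): 2 * scale * z[i] * z[j] for i, j in linked}
    # A pair counts as a spatial link only when its contribution is not 0.
    return {pair: c for pair, c in contributions.items() if c != 0}

# Hypothetical example: four counties along a line, 10 km apart.
positions = [0.0, 10.0, 20.0, 30.0]
dist = [[abs(p - q) for q in positions] for p in positions]
links = link_indices([1.0, 1.0, 0.0, 0.0], dist, g=10.0)
global_i = sum(links.values())
```

Dividing each contribution by the global coefficient and multiplying by 100 expresses a link as a percentage of the overall spatial autocorrelation, provided the global coefficient is not close to 0.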
Spatial-temporal autocorrelation and link indices were calculated by use of mathematical software.d Normality (No. of farms/county and link index, which were tested by use of the Anderson-Darling test) and comparisons among medians (assessed by use of the Mann-Whitney test) were conducted by use of a statistical program.e For all tests, values of P<0.05 were considered significant.
The 2001 epidemic began in the southwest portion of Uruguay and reached a peak (county-level) farm prevalence at week 5 (Table 1). The median road density of all counties reporting infected animals during the first week was 0.24 km/km2, which differed significantly (P=0.01) from that for the remainder of the country (0.12 km/km2;
*Number of farms reporting infected animals.
Maximization of the spatial autocorrelation index was evident when variable a=0.46 and variable b=0.06 (data not shown). The Moran I null hypothesis (lack of spatial autocorrelation) was rejected. Until at least the sixth week of the epidemic, sites closer to each other (clusters) had significantly more infected cattle than sites located at the mean (or greater) distance from each other (
Analysis of spatial correlograms (conducted before and after vaccination was implemented) indicated a significant positive autocorrelation among county pairs with infected animals located within approximately 120 km from each other for weeks 1 and 2 of the outbreak and within 80 km of each other for weeks 3 through 11. A significant negative spatial autocorrelation was observed for county pairs with infected cattle located 120 to 400 km from each other only at weeks 1 and 2 of the outbreak. A second cluster, which was not significant, was evident for county pairs with infected cattle located >400 km from each other (
Analysis of infective link indices (percentage of the overall spatial autocorrelation explained by specific infective links) revealed a clear departure from normality (
Analysis of the data suggested 3 classes of counties in terms of potential disease dispersal during the epidemic. The first class included 5 counties in which infected cattle were observed within the first 3 days of the epidemic (minimal time compatible with a replication cycle of the infective agent; hence, possible primary cases;
Not all counties reporting primary cases appeared to facilitate spread of the disease during the epidemic. Four of 5 counties that had the highest link indices and connected with at least 2 other counties had 2.5 times as many cases by week 11 as 4 of 5 counties that contained cattle infected during days 1 to 3 of the epidemic. The second group of counties (counties with a high index link) reported their first infected animal on days 4 to 6 of the epidemic (a time frame compatible with a secondary infection). Combined with another high index link county that reported an infected animal on days 1 to 3, these counties had a county median of 0.073 cases/km2 by week 11, whereas the remaining counties reporting cases on days 1 to 3 (none of which were high index link counties) had significantly (P=0.02; Mann-Whitney test) fewer infected cattle (county median, 0.027 cases/km2) by week 11 (Table 3). Counties with a high index link (n=5) also had a significantly (P=0.01) higher median road density (0.26 km/km2), compared with the 271 other counties with infected cattle (0.126 km/km2).
Because observational epidemiologic analyses do not allow experimental designs, theories can only use historical data to attempt validation. However, such data may possess unknown sources of bias or lack critical variables. For example, the number of farms considered in the study reported here was based on the 2000 Agricultural Census, a data set not necessarily applicable for the study of this epidemic. Accordingly, the model described should not be perceived as an analysis of the FMD epidemic that took place in Uruguay in 2001 but, instead, as an evaluation of a spatial method that uses a hypothetical (although realistic) scenario for the epidemic. Despite that caveat, the analysis of assumptions on which spatial autocorrelation was based revealed adequate sample size (>20 county pairs/observation) and no departure from normality.29 Two measures of spatial-temporal autocorrelation (with and without consideration of denominator data) yielded similar results. Similar week-specific correlograms suggested that delayed reporting did not bias these findings. The use of Euclidean and non-Euclidean distances was justified by the fact that there was a maximized spatial autocorrelation index when variable a=0.46 and variable b=0.06.6
Significant positive (<120 km between counties with infected animals) and negative (>120 but <400 km between counties with infected animals) spatial autocorrelations were observed every week for at least the first 5 weeks (
Spatial analysis facilitated data-driven generation of hypotheses. Counties with infected cattle could be categorized as possessing greater potential for disease dispersal during the epidemic on the basis of 3 criteria (having a high index link [i.e., to be an outlier or county with a high index link], connecting with ≧2 other counties, and reporting infections before the other member of the pair). Counties reporting infections on days 1 to 3 of the outbreak (primary cases) were regarded as necessary sites, whereas those displaying higher index links (and connecting with at least 2 additional counties) were hypothesized to possess greater risk for other counties (sufficient cause of disease spread during the epidemic). Counties paired with those that had sufficient cause of disease spread were suspected to be target sites. This working hypothesis distinguished counties infected first (necessary causes, although not necessarily the cause of disease spread) from those that had a high index link (i.e., those hypothesized to seed new cases into target sites), regardless of when and where they got the infection. This conceptualization is similar to that of a model in which it was proposed that spatial features result in differing diffusion models during an epidemic.40 Although daily data on time of detection of infected animals facilitate the richest generation of hypotheses, even when such data are not available or are available but not used because of possible errors (e.g., delayed reporting and underreporting), information on link indices alone identifies county pairs that have indices much higher than the mean (outliers suspected to influence disease dispersal).
Although other factors associated with disease spread during an epidemic (i.e., markets) cannot be ruled out, spatial analysis may generate evidence of case clustering, whether there are short- or long-distance connections (or both), and whether there are changes in location of cases over time in relation to interventions. Identification of infected sites with greater epidemic risk (counties with a high index link) did not support the hypothesis that all infected cattle had equal influence on disease spread nor the theory of homogeneous mixing, which assumes that all susceptible and infected cattle are located at similar distances from each other and possess similar risk for becoming infected or for infecting others.40 This theory results in undifferentiated control policies, such as implementation of buffer rings (i.e., regional circles of fixed diameter within which the same control policy is conducted). 43 The fact that the first county with infected cattle and 3 other counties in which there were primary infections apparently failed to promote disease spread also argued against the homogeneous mixing theory.
Spatially explicit assessment of infective connectivity may be applied to evaluate control policy. For example, when only 2 time periods were considered, spatial autocorrelation analysis revealed a reduction of approximately 40 km in the mean distance between counties for the cluster (from 120 km at weeks 1 and 2 to 80 km at weeks 3 through 11), which supports the hypothesis that vaccination reduced disease spread during the epidemic. However, evaluation of week-specific correlograms did not reveal evidence of regional differences up to week 6 of the epidemic, which suggests that the 40-km reduction may reflect the end of the epidemic (when many counties did not report cases). These results may support the hypothesis that the conclusion of the epidemic was attributable to several factors, including lack of susceptible herds and a ban on animal movement that was imposed in week 1.
The approach described here was also informative, facilitating the explanation of apparent contradictions.
Although a second cluster was suggested by correlograms for sites located at >400 km between counties with infected cattle before and after vaccination was conducted, which is in agreement with the expected limited disease dispersal for infected animals located at the edge of the territory being infected, 40 the cluster at >400 km was not significant (
Cost-benefit analysis may also be generated by the approach used in the study reported here. Had a policy focusing on all counties reporting primary cases been adopted (on the basis of the theory that all cases equally contribute to disease spread during an epidemic), it may have been inefficient and insufficient. In contrast, a policy focused on high-index link counties could have been 2.5 to 3 times more beneficial than undifferentiated control policies (Table 3). Observations of significant case clustering and significant negative autocorrelation (for counties located >120 to <400 km between counties with infected cattle), noticed as early as week 2 (when vaccination had not been implemented), could have led to differentiated control measures (i.e., regionalization). 44
Infective link analysis can be interpreted by considering epidemics as processes that connect at least 2 points through a line. The local Moran test has been used 12, 45, 46 to focus on the contribution of each point to the overall (global) spatial autocorrelation. In contrast, the method described here focused on the line connecting the 2 points. Although local Moran tests assess inputs and outputs, infective connectivity emphasizes the intermediate process that takes place at some time point before the outcome is noticed. Such emphasis informs on earlier phenomena, which can be used to generate hypotheses on factors facilitating (or preventing) disease dispersal during an epidemic and possibly to identify case clustering in adjacent sites and in sites located far apart from each other. When based on data of a smaller scale (i.e., farm-level data), spatial autocorrelation and link analysis may facilitate real-time control of rapidly disseminated diseases.
Based on the above example the inventors have expanded the invention and the following information will aid in further calculations.
A procedure aimed at monitoring attribute patterns over space and/or time such that it generates non-overlapping diagnostic hypotheses. Monitoring is based on, at least:
1) the geocoded data from each spatial point (e.g., farm),
2) the inter-point (e.g., interfarm) (Euclidean) distances,
3) the date each observation was recorded,
4) the identifier corresponding to each individual (e.g., a cow), and
5) the identifier corresponding to each attribute (e.g., a bacterial strain) corresponding to each individual and date.
Based on data described above, the following indicators are then created:
1) the intrapoint or interpoint (e.g., interfarm or intrafarm) attribute ratio or INTER-P AR/INTRA-P AR (the number of individual attributes [e.g., one bacterial strain] expressed as percentage of all attributes at a given spatial point/date),
2) the attribute spatial spread or A-DISTNC (the distance assumed to be traveled by a given attribute, as calculated from the interfarm distance matrix, expressed in km or miles),
3) the attribute spread velocity or A-SPEED (distance traveled by an individual attribute/time, e.g., km/year), and
4) the product of the interfarm attribute ratio times the attribute spread velocity (INTRA-P AR times A-SPEED), or attribute geo-temporal spread index (A-GTSI), which may be expressed with and without adjustment for the average number of spatial points where a given attribute has been recorded per individual attribute/per unit of time.
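The four indicators can be sketched from records that carry the five data items listed earlier. Several choices below are assumptions made only for illustration: A-DISTNC is taken as the largest distance between points where the attribute was recorded, elapsed time runs from the first to the last record of the attribute, and A-GTSI is computed without the optional adjustment. Record fields, function names, and data are hypothetical.

```python
from datetime import date

def attribute_ratio(records, point, day, attribute):
    """Percentage of observations at one point/date that carry the attribute."""
    here = [r for r in records if r["point"] == point and r["date"] == day]
    if not here:
        return 0.0
    hits = sum(1 for r in here if r["attribute"] == attribute)
    return 100.0 * hits / len(here)

def attribute_distance(records, attribute, dist):
    """A-DISTNC (assumed): largest distance between points reporting the attribute."""
    points = sorted({r["point"] for r in records if r["attribute"] == attribute})
    return max((dist[p][q] for p in points for q in points if p < q), default=0.0)

def attribute_speed(records, attribute, dist):
    """A-SPEED: A-DISTNC divided by the years between first and last record."""
    days = [r["date"] for r in records if r["attribute"] == attribute]
    if not days:
        return 0.0
    years = (max(days) - min(days)).days / 365.25
    return attribute_distance(records, attribute, dist) / years if years > 0 else 0.0

def geo_temporal_spread_index(ratio_percent, speed):
    """A-GTSI, unadjusted: attribute ratio times attribute spread velocity."""
    return ratio_percent * speed

# Hypothetical records: two farms 30 km apart, one strain seen a year apart.
records = [
    {"point": "F1", "date": date(2020, 1, 1), "individual": "cow1", "attribute": "strainA"},
    {"point": "F1", "date": date(2020, 1, 1), "individual": "cow2", "attribute": "strainB"},
    {"point": "F2", "date": date(2021, 1, 1), "individual": "cow3", "attribute": "strainA"},
]
dist = {"F1": {"F2": 30.0}, "F2": {"F1": 30.0}}
ratio = attribute_ratio(records, "F1", date(2020, 1, 1), "strainA")
speed = attribute_speed(records, "strainA", dist)
gtsi = geo_temporal_spread_index(ratio, speed)
```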
These indicators are then used to:
A procedure aimed at detecting aggregations of individuals (clusters) that display values of some attribute greater or lower than the average of the population at large, and which may or may not possess high/low influence in the dissemination of that attribute within the population at large (i.e., a high/low degree of connectivity).
Cluster detection is meant to refer to:
1) the spatial location of the cluster (composed of, at least, 2 “points” [e.g., cities]), and
2) the magnitude of clustering.
Cluster detection is based on, at least, these 6 factors:
A procedure aimed at estimating the connectivity of a point pertaining to a network. Connectivity analysis is based on 2 (or 3) factors:
1) the number of links per “node” (“point”),
2) the link index (the “weight” or “width” of each link), and
3) (if available) the time the attribute has been reported.
Alone or combined, these factors can be used to identify and/or rank individual clusters. The number of links and the link index are defined. Alone or combined, these factors can be used to estimate the connectivity (expressed as a rank or degree) of a point in relation to the network with which that point is associated.
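The first two factors can be combined into a simple ranking, sketched below. The ordering rule (number of links first, summed link index second) is an assumption chosen for illustration; the text leaves the combination open. Function names and link values are hypothetical.

```python
def connectivity(links):
    """Per-node (link count, summed link index) from {(node_a, node_b): index}."""
    stats = {}
    for (a, b), index in links.items():
        for node in (a, b):
            count, total = stats.get(node, (0, 0.0))
            stats[node] = (count + 1, total + index)
    return stats

def rank_nodes(links):
    """Nodes ordered by link count, then by summed link index, both descending."""
    stats = connectivity(links)
    return sorted(stats, key=lambda node: stats[node], reverse=True)

# Hypothetical link indices between four points.
links = {("A", "B"): 0.30, ("A", "C"): 0.10, ("C", "D"): 0.02}
stats = connectivity(links)
ranking = rank_nodes(links)
```

When reporting times are available, a third sort key (for example, earliest report first) could be added to the tuple in the same way.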
A procedure aimed at informing decisions based on cost-benefit like analyses that uses cluster detection and/or cluster connectivity data.
The population at large, upon which more beneficial/less costly decisions are to be made, is identified by a variety of procedures, including:
Since other modifications and changes varied to fit particular operating requirements and environments will be apparent to those skilled in the art, this invention is not considered limited to the example chosen for purposes of this disclosure, and covers all changes and modifications that do not constitute departures from the true spirit and scope of this invention.
Having thus described the invention, what is desired to be protected by Letters Patent is presented in the subsequently appended claims.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US06/62457 | 12/21/2006 | WO | 00 | 6/20/2008 |
Number | Date | Country | |
---|---|---|---|
60752325 | Dec 2005 | US |