This disclosure relates generally to technology for forecasting case counts during a disease outbreak.
In managing disease outbreaks, predicting future case counts is an important tool. While nationwide case counts can be forecast reasonably well using relatively simple models applied to historical nationwide case count data, effective logistical planning at the local level requires better predictions of future case counts for a local geographic area. One recent effort to address this problem is the β-AR model described in the following paper, which is incorporated herein by reference in its entirety: Matthew Le, et al., Neural Relational Autoregression for High-Resolution COVID-19 Forecasting published by FB Data for Good, Oct. 1, 2020 (available at: https://ai.meta.com/research/publications/neural-relational-autoregression-for-high-resolution-covid-19-forecasting) (“β-AR paper”).
Polymerase Chain Reaction (PCR) tests are widely used for determining infection by a pathogen such as a specific virus or bacterium, or other pathogens such as fungi, protozoa, worms or prions. A PCR test performs thermal cycling on a biological sample. The cycling amplifies DNA corresponding to a target sequence if that sequence is present in the sample. If the target sequence can be detected by the PCR instrument prior to a given cycle (e.g., before cycle 38 of a 40 cycle assay), then the test can be considered “positive” for the corresponding person being infected by a pathogen corresponding to that sequence. However, the PCR test provides more information than simply whether a person is positive or negative. It also provides the cycle threshold (Ct), which is the PCR cycle at which the relevant sequence is first sufficiently amplified to be detected. Because the PCR process amplifies DNA approximately exponentially, the cycle at which a sequence is first detectable is, on average, inversely related to the amount of a given DNA sequence initially present in a given sample volume (decreasing roughly linearly with the logarithm of that amount). In other words, a small Ct value suggests a much higher amount of a given DNA sequence than does a high Ct value. This has been shown to correlate to viral load, i.e., the amount of virus in the infected person.
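The relationship between Ct and starting quantity can be illustrated with a minimal sketch. It assumes an idealized assay in which the number of target copies multiplies by a fixed factor each cycle; the function name `expected_ct`, the detection threshold of 1e10 copies, and the starting quantities are hypothetical values chosen only for illustration and do not come from any particular assay.

```python
import math


def expected_ct(initial_copies: float, threshold_copies: float,
                efficiency: float = 1.0) -> float:
    """Return the (fractional) cycle at which copies reach the threshold.

    Idealized assumption: copies multiply by (1 + efficiency) every cycle,
    so efficiency=1.0 means perfect doubling.
    """
    if initial_copies <= 0 or threshold_copies <= 0:
        raise ValueError("copy numbers must be positive")
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    return math.log(threshold_copies / initial_copies) / math.log(1.0 + efficiency)


# Hypothetical samples: a high viral load and a 1000-fold lower viral load.
high_load_ct = expected_ct(initial_copies=1e6, threshold_copies=1e10)
low_load_ct = expected_ct(initial_copies=1e3, threshold_copies=1e10)

# The 1000-fold difference in starting material shifts Ct by log2(1000) cycles.
ct_shift = low_load_ct - high_load_ct
```

Under these assumptions, each doubling of starting material lowers Ct by one cycle, which is why a difference of only a few cycles can reflect a large difference in viral load.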
Although PCR tests provide Ct data, typically only the binary “positive” or “negative” result data (and not the Ct data) is used for predicting incidence and epidemic trajectory. Hay et al. have shown that because the Ct data, on average, correlates with viral load, it can improve incident rate estimates and epidemic growth reproductive rate estimates. See Hay et al., “Estimating epidemiologic dynamics from cross-sectional viral load distributions”, in Science 373, eabh0635 (2021) 16 Jul. 2021, incorporated herein by reference in its entirety (“Hay paper”).
However, neither the β-AR model nor other existing models have leveraged Ct data to improve the forecasting of future case counts.
Embodiments of the present disclosure provide methods, systems, and computer program products to improve high resolution case count forecasting by generating and using features derived from Ct data from PCR tests. Specifically, Ct data is used to generate Ct features to improve machine learning model performance on case count predictions.
Further details of these embodiments are more fully disclosed herein and in Sharmin et al., “Cross-sectional Ct distributions from qPCR tests can provide an early warning signal for the spread of COVID-19 in communities,” medRxiv preprint doi: https://doi.org/10.1101/2023.01.12.23284489, posted Jan. 14, 2023, which is incorporated herein by reference in its entirety.
While the disclosure is described with reference to the above drawings, the drawings are intended to be illustrative, and other embodiments are consistent with the spirit, and within the scope, of the invention.
The various embodiments now will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific examples of practicing the embodiments. This specification may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this specification will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Among other things, this specification may be embodied as methods or devices. Accordingly, any of the various embodiments herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The following specification is, therefore, not to be taken in a limiting sense.
Instructions for implementing case count forecasting system 102 reside in computer program product 104 which is stored in storage 105 and those instructions are executable by processor 106. When processor 106 is executing the instructions of computer program product 104, the instructions, or a portion thereof, are typically loaded into working memory 109 from which the instructions are readily accessed by processor 106. In the illustrated embodiment, computer program product 104 is stored in storage 105 or another non-transitory computer readable medium (which may include being distributed across media on different devices and different locations). In alternative embodiments, the storage medium is transitory.
In one embodiment, processor 106 in fact comprises multiple processors which may comprise additional working memories (additional processors and memories not individually illustrated) including a graphics processing unit (GPU) comprising at least thousands of arithmetic logic units supporting parallel computations on a large scale. GPUs are often utilized in deep learning applications because they can perform the relevant processing tasks more efficiently than can typical general-purpose processors (CPUs). Other embodiments comprise one or more specialized processing units comprising systolic arrays and/or other hardware arrangements that support efficient parallel processing. In some embodiments, such specialized hardware works in conjunction with a CPU and/or GPU to carry out the various processing described herein. In some embodiments, such specialized hardware comprises application specific integrated circuits and the like (which may refer to a portion of an integrated circuit that is application-specific), field programmable gate arrays and the like, or combinations thereof. In some embodiments, however, a processor such as processor 106 may be implemented as one or more general purpose processors (preferably having multiple cores) without necessarily departing from the spirit and scope of the present invention.
User device 107 includes a display 108 for displaying results of processing carried out by case count forecasting system 102. Such a user device may include a mobile device such as a mobile phone, smart phone, smart watch, or tablet computer, and/or a laptop or desktop computer including a display 108. In some embodiments, alerts for impending epidemic waves in one or more communities of interest as detected by case count forecasting system 102 will be routed in real-time or near real-time to one or more users via each user's respective user device 107. Such alerts may be displayed on user device 107 via app notifications to one or more mobile applications configured to receive results of processing carried out by case count forecasting system 102. Such app notifications may be displayed automatically on display 108 of user device 107, and may alert the user via audible sounds or vibrations.
In some embodiments, alerts for impending epidemic waves as detected by case count forecasting system 102 may be sent to the user via email alerts displayed on user device 107. In other embodiments, these alerts may be displayed on a user dashboard of case count forecasting system 102 that is shown on display 108 of user device 107.
In a typical embodiment, data source computers 110 communicate with one or more of computers 103 over a computer network such as the Internet or another public or private network (not separately shown in
In one example, operation of case count forecasting system 102 proceeds as follows. Pre-processing block 201 pre-processes cycle threshold (Ct) data and other data received from data source computers 110 shown in
Feature data 202-2 includes Ct features, which are described in detail below in the context of
Feature data 202-1 includes other features. In this example, the other features include features referenced in the β-AR paper referenced in the SUMMARY section above. Specifically, the β-AR features include features obtained from the following datasets: Confirmed Cases (New York Times collected data), Facebook Data for Good (FBDG) symptom survey, FBDG Movement Range Maps, Google Community Mobility data, doctor visits (CMU COVIDcast), Testing (COVID Tracking Project), and Weather (including average, minimum, maximum temperature and rainfall per county) (from NOAA GHCN). See β-AR paper at 6.
In this example, Ct feature data 202-2 and most of the β-AR feature data 202-1 are input into RNN 204 except that the β-AR Confirmed Cases feature is input into autoregression model 205.
In alternative embodiments, additional features beyond those included in β-AR feature data 202-1 and Ct feature data 202-2 are used. For example, features related to disease variants are used in addition to β-AR features and Ct features. In one example, the time-varying prevalence value of each of one or more of the top current variants is used as an additional feature. For example, in one embodiment, five variant features can be obtained for use by selecting the top five variants from GISAID, available, for example, at: https://www.gisaid.org/epiflu-applications/influenza-genomic-epidemiology/ and the time-varying prevalence values computed from the GISAID site for each of the five selected variants can be used as features.
In the illustrated example, machine learning model 203 comprises the neural relational autoregression model (β-AR model) described in the β-AR paper. However, in alternative embodiments, other machine learning models capable of generating case count forecasts from data that includes Ct data can be used. In addition, machine learning model 203 may be updated over time, with machine learning model enhancements being considered and incorporated into machine learning model 203.
Step 301 uses Ct data 320 to generate features 341, 342, 343, and 344 by determining, respectively, the mean, smoothed mean, skewness, and smoothed skewness of the vectors of Ct values. Specifically, respective sets of features 341-344 are computed for each respective date (e.g., each calendar day) that samples corresponding to respective Ct values were collected. And this is done for each of one or more geographic areas for which Ct data is provided (e.g., each county) (geographic area data dimension not separately shown in
Features 341 and 343 (mean and skewness) are calculated based on all the Ct values collected for a given date within a given geographic area. Features 342 and 344 (smoothed mean and smoothed skewness) are calculated based on Ct values collected in a moving window of dates around the given date. In one example, the moving window is 14 days, meaning that, for example, the smoothed mean is based on the Ct values collected seven days prior and seven days after the given date. Furthermore, in one example, for each date in the moving window, daily average Ct values are used for the smoothed mean and smoothed skewness determinations.
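The four Ct features described above can be sketched as follows for a single geographic area. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `skewness` and `ct_features`, the dictionary-based input layout, the use of the population (biased) skewness formula, and the treatment of missing dates at the edges of the window are all choices made here for illustration.

```python
from datetime import date, timedelta
from statistics import fmean


def skewness(values):
    """Population (biased) skewness; returns 0.0 when it is undefined."""
    n = len(values)
    if n < 2:
        return 0.0
    m = fmean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def ct_features(ct_by_date, half_window=7):
    """Compute mean, skewness, smoothed mean and smoothed skewness per date.

    ct_by_date maps a date to the list of Ct values collected that day in
    one geographic area (hypothetical layout). Smoothed features use the
    daily average Ct of each date within half_window days before and after
    the given date; dates with no data are skipped (an assumption).
    """
    daily_mean = {d: fmean(v) for d, v in ct_by_date.items() if v}
    features = {}
    for d in sorted(daily_mean):
        window = [
            daily_mean[d + timedelta(days=k)]
            for k in range(-half_window, half_window + 1)
            if d + timedelta(days=k) in daily_mean
        ]
        features[d] = {
            "mean": daily_mean[d],
            "skewness": skewness(ct_by_date[d]),
            "smoothed_mean": fmean(window),
            "smoothed_skewness": skewness(window),
        }
    return features


# Hypothetical Ct values for three consecutive collection dates.
d0 = date(2022, 1, 1)
d1, d2 = d0 + timedelta(days=1), d0 + timedelta(days=2)
example_features = ct_features({d0: [20, 22, 30], d1: [25, 25], d2: [26, 30]})
```

Running the same function once per county (or other geographic area) would yield one set of features per date per area.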
Step 302 uses weekly Ct data to estimate incidence rates and generate estimated incident rate data 340. In this example, an estimated incident rate is generated for each day in each county. In one example, this is done using the Gaussian process model from the Hay virosolver R-package using the recommended parameters. That package is available at https://jameshay218.github.io/virosolver/index.html and is incorporated herein by reference in its entirety.
Step 303 uses estimated incident rate data to generate estimated effective reproduction rate (Rt) curves. In one example, this is done by first computing a smoothed moving average of the estimated incident rates using a 14-day window. Then, the resulting smoothed incident rates are used to estimate Rt curves using EpiEstim available at https://cran.r-project.org/web/packages/EpiEstim/index.html and incorporated herein by reference in its entirety. Each estimated Rt curve is a time-series of estimated Rt values, for example, a series of daily estimated Rt values. In one example, additional data other than Ct-derived incidence estimates can also be submitted to EpiEstim to enrich the estimated Rt curve determinations. For example, case count data can also be submitted. In one example, this data can also be smoothed using, for example, a moving average calculation with a 14-day moving window. In one example, the EpiEstim recommended parameters of a mean serial interval of 6.14 and standard deviation of 3.96 can be used.
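The smoothing and Rt estimation described above can be illustrated with a simplified sketch. It does not call EpiEstim (an R package) and does not reproduce EpiEstim's Bayesian procedure; it instead computes a plain renewal-equation point estimate, dividing each day's incidence by the serial-interval-weighted sum of earlier incidence. The function names, the trailing alignment of the moving window, the 30-day truncation of the serial interval, and the interpretation of 6.14 and 3.96 as days are assumptions made here for illustration.

```python
import math


def moving_average(series, window=14):
    """Trailing moving average; early points use however many days exist."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def serial_interval_weights(mean=6.14, sd=3.96, max_days=30):
    """Discretize a gamma serial interval onto days 1..max_days.

    The gamma density is evaluated at integer days and renormalized, which
    is a simplification of how the interval could be discretized.
    """
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    density = [s ** (shape - 1) * math.exp(-s / scale)
               for s in range(1, max_days + 1)]
    total = sum(density)
    return [d / total for d in density]


def rt_point_estimates(incidence, weights):
    """Renewal-equation ratio: incidence[t] / sum_s w_s * incidence[t - s]."""
    estimates = []
    for t in range(len(incidence)):
        pressure = sum(w * incidence[t - s]
                       for s, w in enumerate(weights, start=1) if t - s >= 0)
        estimates.append(incidence[t] / pressure if pressure > 0 else None)
    return estimates


# Hypothetical flat epidemic: constant incidence should give Rt near 1.
weights = serial_interval_weights()
smoothed_incidence = moving_average([10.0] * 60)
rt_curve = rt_point_estimates(smoothed_incidence, weights)
```

With constant incidence, the estimate settles at 1 once a full serial-interval history is available, which provides a quick sanity check on the weights and the ratio.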
Step 304 then uses the estimated Rt curves to determine features 345, 346, and 347. Specifically, it determines a median estimated Rt value and upper and lower confidence limits for each day.
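One way to picture the three per-day features is the sketch below, which summarizes a set of Rt values for a single day by their median and by lower and upper limits. The source does not specify how the limits are obtained; the assumption here is that a collection of sampled Rt values is available for each day and that the limits are the 2.5th and 97.5th percentiles. The function name `summarize_rt` and the sample values are hypothetical.

```python
from statistics import median, quantiles


def summarize_rt(samples):
    """Return (lower, median, upper) for one day's sampled Rt values.

    Assumes 95% limits taken as the 2.5th and 97.5th percentiles, computed
    with linear interpolation inside the observed range of the samples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    cuts = quantiles(samples, n=40, method="inclusive")  # 2.5% steps
    return cuts[0], median(samples), cuts[-1]


# Hypothetical sampled Rt values for one day in one geographic area.
lower_limit, median_rt, upper_limit = summarize_rt([0.8, 0.9, 1.0, 1.1, 1.2])
```

Applying the same summary to every day of an estimated Rt curve would yield three daily time series, one per feature.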
In some embodiments, the machine learning model performance will be automatically assessed over time, and features that show diminished utility will be excluded, and reconsidered if they appear to be of value again. In other embodiments, new features may be considered through test runs of machine learning model 203. If a new feature is determined to be of utility in forecasting case counts, such a new feature may be manually added to the machine learning model. The feature may be manually added using a user interface of case count forecasting system 102 in some embodiments.
In the example, computer system 400 may provide one or more of the components of an automated case count forecasting system configured to implement one or more logic modules and artificial neural networks and associated components for a computer-implemented case count forecasting system and associated interactive graphical user interface. Computer system 400 executes instruction code contained in a computer program product 460. Computer program product 460 comprises executable code in an electronically readable medium that may instruct one or more computers such as computer system 400 to perform processing that accomplishes the exemplary method steps performed by the embodiments referenced herein. The electronically readable medium may be any non-transitory medium that stores information electronically and may be accessed locally or remotely, for example, via a network connection. In alternative embodiments, the medium may be transitory. The medium may include a plurality of geographically dispersed media, each configured to store different parts of the executable code at different locations or at different times. The executable instruction code in an electronically readable medium directs the illustrated computer system 400 to carry out various exemplary tasks described herein. The executable code for directing the carrying out of tasks described herein would be typically realized in software. However, it will be appreciated by those skilled in the art that computers or other electronic devices might utilize code realized in hardware to perform many or all the identified tasks without departing from the present invention. Those skilled in the art will understand that many variations on executable code may be found that implement exemplary methods within the spirit and the scope of the present invention.
The code or a copy of the code contained in computer program product 460 may reside in one or more persistent storage media (not separately shown) communicatively coupled to computer system 400 for loading and storage in persistent storage device 470 and/or memory 410 for execution by processor 420. Computer system 400 also includes I/O subsystem 430 and peripheral devices 440. I/O subsystem 430, peripheral devices 440, processor 420, memory 410, and persistent storage device 470 are coupled via bus 450. Like persistent storage device 470 and any other persistent storage that might contain computer program product 460, memory 410 is a non-transitory medium (even if implemented as a typical volatile computer memory device). Moreover, those skilled in the art will appreciate that in addition to storing computer program product 460 for carrying out the processing described herein, memory 410 and/or persistent storage device 470 may be configured to store the various data elements referenced and illustrated herein.
Those skilled in the art will appreciate computer system 400 illustrates just one example of a system in which a computer program product in accordance with an embodiment of the present invention may be implemented. To cite but one example of an alternative embodiment, storage and execution of instructions contained in a computer program product such as, for example, computer program product 460, in accordance with an embodiment of the present disclosure may be distributed over multiple computers, such as, for example, over the computers of a distributed computing network.
This application claims the benefit of U.S. Provisional Application Ser. No. 63/391,740 filed on Jul. 23, 2022. To the extent permitted in applicable jurisdictions, the entire contents of this application are incorporated herein by reference.
Number | Date | Country
---|---|---
63391740 | Jul 2022 | US