Extract, transform, load (ETL) is typically a first step in machine learning systems. When applying supervised machine learning classification algorithms to longitudinal healthcare data (e.g., claims data), an important aspect of the ETL process is the creation of labelled patient cohorts—groups of patients who share similar symptoms and may be monitored over a period of time. Conventional ETL systems typically take a single snapshot of longitudinal healthcare data anchored on a clinical event of interest, such as a diagnosis, and group patients into positive (i.e., diagnosed) and negative (i.e., undiagnosed) cohorts. However, such conventional systems typically use only a single instance of patient medical history, which can result in models that generalize poorly in real-world deployments on new and recent patient data.
An unbiased ETL (extract, transform, load) system utilizes a rolling series of time-bound cross-sections, termed “rolling cross-sections” (RCS), of patient healthcare data to provide a dataset to a machine learning model for timed medical event prediction. Patients may be labelled as belonging to one of multiple classes (e.g., positive or negative in binary classification) for each cross-section in the series depending on their current healthcare status. Rather than using a single snapshot, the unbiased ETL system employs multiple snapshots of patients' medical histories, providing a capability to classify a patient as belonging to one of multiple classes at different points in time, as appropriate. Supervised learning for the machine learning model is thereby enabled over multiple different periods of a patient's medical journey, which advantageously supports a more statistically robust medical event prediction model and eliminates several classes of bias. Additionally, the unbiased ETL system enables customization of a prediction window to account for lags in data collection, lags in data processing, and the length of use of the medical event predictions, thereby assuring timely and properly utilizable predictions in real-world deployments.
Advantageously, the unbiased ETL system can produce more performant medical event predictions even in health data scenarios with a small sample size for an event of interest, such as diagnoses associated with an ultra-rare disease, because the availability of multiple snapshots of patient data boosts the sample size in the positive cohort and makes more information available for training, testing, and validating the prediction model. Accordingly, computing resources utilized in implementations of machine learning, such as processor cycles, memory, power, data transmission bandwidth, and storage, can be employed with greater efficiency as compared to conventional ETL systems.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure. It will be appreciated that the above-described subject matter may be implemented as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as one or more computer-readable storage media. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.
Like reference numerals indicate like elements in the drawings. Elements are not drawn to scale unless otherwise indicated.
Conventional ETL (extract, transform, load) systems typically group patients as positive or negative according to their medical histories from a single snapshot. Using a single-snapshot approach, patients are indexed on the date of the positive event. Patients may be assigned a fixed lookback period, which can limit the amount of medical history that can be included in the model. Selection of a negative cohort is required and, depending on the selection and matching technique, may introduce biases into the model.
Instead of using the single-snapshot approach, the present unbiased ETL system enables longitudinal transition of patients across one or more classes (i.e., the system may be configured to support both binary and non-binary classification schemes), providing several advantages over conventional systems. By sampling the same patients in different snapshots of time, the unbiased ETL system enables training to be performed on different patterns of cross-sectional data without sacrificing the length of longitudinal data that is made available to the system. Training on the different time-bound cross-sections allows for better monitoring and detection of indicators of potential model drift and may substantially reduce or eliminate bias in sampling cohorts that may arise, for example, due to (1) seasonality; (2) changes in data coverage; or (3) changes in market dynamics, such as new market launches or changes in medical protocol.
The unbiased ETL system utilizes multiple snapshots of patients' medical history to enable customized design of a timed window of prediction. The prediction model can be optimized to predict an event of interest for a predetermined number of days in advance. In addition, the prediction is valid for a predetermined number of days. Such customization can provide significant improvements over conventional systems in real-world deployments, as deployment strategies can vary by end-user and specific business needs. In addition, the customizable timed window of prediction enables a machine learning system to validate the prediction model based on the most recent cross-sections of data, which may provide a better indication of how well the model can be generalized to future data.
Turning now to the drawings,
The ETL system 100 supports a machine learning model 110 that is configured to operate on medical history data 102 that is extracted from the data sources 105 and transformed into a dataset using the rolling snapshot approach discussed below in the text accompanying
The prediction window 220 defines the time period over which an event of interest is predicted. For example, if the window is three months, the model will seek to predict patients that will transition over a three-month period. Thus, the prediction window further provides the time period over which the machine learning model is looking for the events of interest. Successive cross-sections are shifted by a given interval (e.g., monthly increments), as indicated by reference numeral 225, to form a final dataset containing multiple cross-sections of data defined by iterative timeframes of medical history from the initial dataset. An offset window 230 comprises a time period prior to the prediction window that can be incorporated to accommodate lags in data collection. Thus, the machine learning model can predict an event X amount of time in advance, where X is defined by the offset.
A key advantage of this approach is that it captures multiple snapshots of the patient journey in which patients are labelled according to their current therapeutic status within the specific snapshot of time, such as drug initiation versus no initiation. According to this definition, a patient may be considered by the machine learning model as a negative patient during an earlier snapshot of data, and subsequently be considered a positive patient in a later snapshot once the patient has exhibited the event of interest. This enables the machine learning model to learn from more varied and comprehensive representations of patient history regarding events of interest. It also helps overcome challenges arising from small sample sizes, which is often the case for rare events (e.g., a rare disease or adverse event, or the like) and/or niche products, recently launched products, or products with narrowly defined market segments, since the number of patient instances used for model training scales in proportion to the number of cross-sections. Thus, patients are used more than once for model training, amplifying the signal obtained from each individual patient.
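The per-snapshot labelling described above may be sketched in Python as follows. This is an illustrative sketch only; the function name, class labels, and sample dates are hypothetical and do not form part of the disclosed system.

```python
from datetime import date

def label_cross_sections(event_date, prediction_windows):
    """Label one patient per cross-section: the patient is positive in a
    cross-section only if the event of interest falls inside that
    cross-section's prediction window, and negative otherwise."""
    labels = []
    for start, end in prediction_windows:
        positive = event_date is not None and start <= event_date < end
        labels.append("positive" if positive else "negative")
    return labels

# A patient who exhibits the event in May 2017 is negative in an earlier
# cross-section and positive once a prediction window covers the event date.
windows = [
    (date(2017, 1, 1), date(2017, 4, 1)),
    (date(2017, 4, 1), date(2017, 7, 1)),
]
print(label_cross_sections(date(2017, 5, 15), windows))  # ['negative', 'positive']
```

Because the same patient contributes one labelled instance per cross-section, the training sample grows with the number of cross-sections, consistent with the scaling behavior noted above.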
There are several customization-related benefits of this multi-snapshot approach to patient data extraction and transformation that allow it to better suit a given clinical scenario or commercial application. For example, the prediction window can be designed such that patient predictions are valid for a given period, for example, a 3-month window. In addition, the offset period prior to the prediction window can be incorporated to accommodate lags in data collection, data processing, or the mobilization of clinical or commercial resources. For example, a 1-month time period prior to operationalizing machine learning model predictions may be used for such offset. Moreover, the opportunity to train on multiple cross-sections allows for better monitoring of indicators of model drift (reduced performance due to market or other changes in the data), allowing for mitigation of temporal biases in patient sampling due to seasonality, fluctuations in data coverage over time, or shifts in market dynamics such as, for example, changing treatment guidelines or regulations.
The multi-snapshot approach to patient data extraction and transformation also enables model validation strategies that evaluate model performance exclusively on “future” data. Specifically, a model can be trained on the bulk of historical medical history snapshots and validated only on the most recent snapshots to produce representative indicators of model performance after real-world commercial deployment. Within this framework, the model is evaluated on data from a future time period that it has not seen, as it is trained exclusively on earlier snapshots of data.
Each RCS includes a prediction window 320, an offset window 325, and a lookback window 330. As noted above, the prediction window includes a time period in which medical events of interest are examined for inclusion in the prediction model supported by the unbiased ETL system.
The offset window 325 provides a portion of a patient's medical history that is not introduced in the prediction model for that RCS to account for data that is not captured in an actual deployment due to data lags and processing times. For example, data lag can occur due to constraints from a given healthcare dataset, and processing lag may result from limitations on technological resources such as availability of processing, memory, storage, or bandwidth resources.
The lookback window comprises the time period for which the medical histories of the patients are observed in the ETL pipeline 305. Illustratively, and without limiting the scope of the invention, parameters for the ETL pipeline include 12 months for the lookback window, three months for the prediction window, and one month for the offset window. The interval between RCS #1 and RCS #2 is illustratively three months (i.e., the lookback window for RCS #1 begins in October 2016 and the lookback window for RCS #2 begins three months later in January 2017). Different parameters may be utilized to meet the needs of other implementations of the present unbiased ETL system. For example, RCS #2 is rolled ahead by three months in the ETL pipeline in this illustrative embodiment, but it may be desirable to shorten or lengthen the rolling period to suit a particular deployment scenario.
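The window arithmetic for the illustrative parameters above (12-month lookback, 1-month offset, 3-month prediction window, 3-month rolling period) may be sketched in Python. The helper names are hypothetical, and first-of-month boundaries are an assumption made for simplicity.

```python
from datetime import date

def add_months(d, n):
    # Simple month arithmetic on first-of-month dates.
    m = d.month - 1 + n
    return date(d.year + m // 12, m % 12 + 1, 1)

def build_rcs(first_lookback_start, n_sections,
              lookback_months=12, offset_months=1,
              prediction_months=3, roll_months=3):
    """Generate window boundaries for a rolling series of cross-sections:
    (lookback start, lookback end / offset start, offset end / prediction
    start, prediction end) per RCS."""
    sections = []
    for i in range(n_sections):
        lb_start = add_months(first_lookback_start, i * roll_months)
        lb_end = add_months(lb_start, lookback_months)   # offset begins here
        off_end = add_months(lb_end, offset_months)      # prediction begins here
        pred_end = add_months(off_end, prediction_months)
        sections.append((lb_start, lb_end, off_end, pred_end))
    return sections

# RCS #1 lookback begins October 2016; RCS #2 is rolled ahead three months.
for section in build_rcs(date(2016, 10, 1), 2):
    print(section)
```

With these defaults, the RCS #2 lookback window begins in January 2017, three months after RCS #1, matching the illustrative embodiment described above.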
Star symbols (representatively indicated by reference numeral 335) provide positive event of interest indicators for patients A, B, and C. Depending on the application, the positive event of interest can either be defined as an event that can only occur once in a patient's medical history (e.g., the first occurrence of the diagnosis for an autoimmune disease, or transition to a new medication or therapy), or an event that can reoccur (e.g., heart attack). The diagram in
As shown in
An illustrative implementation of RCS may utilize a software package developed in Python and PySpark. This package queries a Hadoop distributed file system (HDFS) of patient-level timestamp records for diagnosis, procedure, and prescription medical claims. PySpark may be utilized to extract relevant de-identified patient IDs and to then build a patient-level RCS table. The following illustrative, non-limiting inputs may be utilized in an exemplary iteration:
The software first queries the patient claims data to generate an HDFS table of de-identified patient IDs using inputs (a) and (b), herein referred to as the initial data pull (IDP). The IDP includes all patients that satisfy the inclusion criteria (a) during the study window (b). The software then uses patient IDs from the IDP and information from (b)-(h) to generate an HDFS RCS table. Each row in the RCS table contains a single patient ID and timestamp columns that define a single RCS (i.e., columns: lookback window start and end dates, offset window start and end dates, and outcome window start and end dates).
The RCS table contains all patient cross-sections that satisfy inputs (b)-(h) and is rolling such that for a single patient the most recent cross-section is anchored in relation to (h) and the timestamps for each subsequent cross-section are shifted by (g). To label each patient cross-section with an outcome label of interest, for example, positive or negative in the case of binary classification, logic may then be applied using PySpark to query the patient claims data to evaluate whether the positive indicator (e.g., a diagnosis or treatment of interest) occurs during each cross-section prediction window. This methodology is extensible such that additional filtering criteria may be applied to refine an RCS cohort. For example, the filtering may be applied to only patient cross-sections that satisfy inclusion criteria during the lookback window and/or drop patient cross-sections where the positive indicator occurs during the lookback window.
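The labelling and filtering logic described above may be illustrated in plain Python (standing in for the PySpark queries for readability). The function name, row layout, and sample claims are hypothetical; in a deployment, equivalent logic would be expressed as PySpark operations over the HDFS tables.

```python
from datetime import date

def label_rcs_rows(rcs_rows, positive_claims):
    """Assign an outcome label to each patient cross-section. Each RCS row
    is (patient_id, lookback_start, lookback_end, pred_start, pred_end);
    positive_claims maps patient_id -> dates of positive indicator claims."""
    labelled = []
    for pid, lb_start, lb_end, pred_start, pred_end in rcs_rows:
        claims = positive_claims.get(pid, [])
        # Extensible filtering criterion: drop cross-sections where the
        # positive indicator already occurs during the lookback window.
        if any(lb_start <= c < lb_end for c in claims):
            continue
        positive = any(pred_start <= c < pred_end for c in claims)
        labelled.append((pid, lb_start, "positive" if positive else "negative"))
    return labelled

rows = [
    ("p1", date(2016, 10, 1), date(2017, 10, 1), date(2017, 11, 1), date(2018, 2, 1)),
    ("p1", date(2017, 1, 1), date(2018, 1, 1), date(2018, 2, 1), date(2018, 5, 1)),
]
claims = {"p1": [date(2017, 12, 15)]}
# The first cross-section is labelled positive (claim falls in its prediction
# window); the second is dropped (claim falls in its lookback window).
print(label_rcs_rows(rows, claims))
```

This mirrors the extensibility point above: additional filtering criteria can be applied per cross-section without changing how the rolling table itself is generated.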
As discussed above, the prediction window defines a period of time over which the machine learning model is looking for an event of interest such as disease progression or change of therapy. Machine learning models may be trained with a relatively narrow window such that predictions of an event of interest are imminent. Such an approach can be reasonable as patient history close to the event is often highly predictive. For example, a medical procedure may commonly be performed before initiation of a new medication to assess suitability. A narrow window, however, may not always represent an optimal time period over which the machine learning model can reasonably predict patient transition, as the narrow window constrains the model to a highly select point in the patient's medical experience.
Assessing performance in this way can underestimate the usefulness of the machine learning model because a substantial percentage of false positive patients within the constrained time window do, in fact, transition to a therapy of interest when the outcome window is relaxed to six months beyond the time of prediction. As shown, precision increases to 29.3% for the same set of 2,500 patients. The precision value is more meaningful when compared to a baseline of performance in the absence of the machine learning model. In the case of the machine learning model presented here, model precision was four to five times better than selecting patients at random for disease progression, which is a substantial increase over baseline.
Machine learning model precision may be measured at the patient level as discussed above, where the performance is quantified by how many patients the model identifies correctly through its predictions. However, in some commercial targeting scenarios, effective prediction is important not only for patient events but also for events related to health care providers (HCPs). Accordingly, predicting whether an HCP will transition a patient to a new therapy or medication may also be an important factor in validating a model's performance and applicability in a real-world and/or commercial setting.
In a similar way as with the machine learning model developed for disease progression predictions in the autoimmune example discussed above in the text accompanying
In some real-world deployments of the present unbiased ETL system, the machine learning model may be subject to ongoing re-optimization after being trained on specific historical data. Through this process, a predictive model can be provided with more recent data, including additional positive patients, or those patients that have the outcome of interest, as well as updated timing and market influences or changes.
Two main options can be implemented for ongoing re-optimization of a model, including: 1) refreshing with newer data, and 2) refreshing with data collected prospectively. For the first option, on a routine basis, the model can be updated with additional data that is collected between the initial predictions and the new round of predictions. For the second option, an additional model can be developed to track previously predicted positives to understand how many of the predicted patients ultimately experience disease progression.
In step 605, medical histories are collected for each of a plurality of patients into a dataset. In step 610, a timeline is created from the collected medical histories in the dataset in which data for events of interest for the patients are included on the timeline. In step 615, a rolling time-bound window is implemented into which data is selectively captured as a snapshot of the medical histories of the plurality of patients. In step 620, the dataset is transformed by rolling the window along the timeline to selectively capture data at different points along the timeline to thereby generate multiple snapshots of the patient medical histories. In step 625, the transformed dataset with the multiple snapshots of patient medical histories is employed in the prediction model.
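Steps 605 through 625 may be sketched end to end in Python. This is a minimal sketch under assumed data shapes (per-patient event lists keyed by de-identified ID); the function name, window lengths, and sample history are illustrative, not part of the claimed method.

```python
from datetime import date, timedelta

def transform_dataset(histories, window_days=90, roll_days=30, n_snapshots=4,
                      timeline_start=date(2017, 1, 1)):
    """Collect patient histories on a timeline (steps 605-610), then roll a
    time-bound window along it (steps 615-620), capturing one snapshot of
    every patient's history per window position. The returned snapshots form
    the transformed dataset supplied to the prediction model (step 625)."""
    snapshots = []
    for i in range(n_snapshots):
        start = timeline_start + timedelta(days=i * roll_days)
        end = start + timedelta(days=window_days)
        # Keep only the events that fall inside this window position.
        snapshot = {
            pid: [e for e in events if start <= e[0] < end]
            for pid, events in histories.items()
        }
        snapshots.append(((start, end), snapshot))
    return snapshots

# One patient with a single event of interest in February 2017: the event is
# captured by early window positions and rolls out of later ones.
histories = {"p1": [(date(2017, 2, 10), "diagnosis")]}
for (start, end), snap in transform_dataset(histories):
    print(start, end, snap["p1"])
```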
A number of program modules may be stored on the hard disk, magnetic disk 933, optical disk 943, ROM 917, or RAM 921, including an operating system 955, one or more application programs 957, other program modules 960, and program data 963. A user may enter commands and information into the computer system 900 through input devices such as a keyboard 966 and pointing device 968 such as a mouse. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, trackball, touchpad, touchscreen, touch-sensitive device, voice-command module or device, user motion or user gesture capture device, or the like. These and other input devices are often connected to the processor 905 through a serial port interface 971 that is coupled to the system bus 914, but may be connected by other interfaces, such as a parallel port, game port, or universal serial bus (USB). A monitor 973 or other type of display device is also connected to the system bus 914 via an interface, such as a video adapter 975. In addition to the monitor 973, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The illustrative example shown in
The computer system 900 is operable in a networked environment using logical connections to one or more remote computers, such as a remote computer 988. The remote computer 988 may be selected as another personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer system 900, although only a single representative remote memory/storage device 990 is shown in
When used in a LAN networking environment, the computer system 900 is connected to the local area network 993 through a network interface or adapter 996. When used in a WAN networking environment, the computer system 900 typically includes a broadband modem 998, network gateway, or other means for establishing communications over the wide area network 995, such as the Internet. The broadband modem 998, which may be internal or external, is connected to the system bus 914 via a serial port interface 971. In a networked environment, program modules related to the computer system 900, or portions thereof, may be stored in the remote memory/storage device 990. It is noted that the network connections shown in
By way of example, and not limitation, computer-readable storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), Flash memory or other solid state memory technology, CD-ROM, DVD, HD-DVD (High Definition DVD), Blu-ray or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage device, or any other medium which can be used to store the desired information and which can be accessed by the architecture 1000.
According to various embodiments, the architecture 1000 may operate in a networked environment using logical connections to remote computers through a network. The architecture 1000 may connect to the network through a network interface unit 1016 connected to the bus 1010. It may be appreciated that the network interface unit 1016 also may be utilized to connect to other types of networks and remote computer systems. The architecture 1000 also may include an input/output controller 1018 for receiving and processing input from a number of other devices, including a keyboard, mouse, touchpad, touchscreen, control devices such as buttons and switches or electronic stylus (not shown in
It may be appreciated that the software components described herein may, when loaded into the processor 1002 and executed, transform the processor 1002 and the overall architecture 1000 from a general-purpose computing system into a special-purpose computing system customized to facilitate the functionality presented herein. The processor 1002 may be constructed from any number of transistors or other discrete circuit elements, which may individually or collectively assume any number of states. More specifically, the processor 1002 may operate as a finite-state machine, in response to executable instructions contained within the software modules disclosed herein. These computer-executable instructions may transform the processor 1002 by specifying how the processor 1002 transitions between states, thereby transforming the transistors or other discrete hardware elements constituting the processor 1002.
Encoding the software modules presented herein also may transform the physical structure of the computer-readable storage media presented herein. The specific transformation of physical structure may depend on various factors in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the computer-readable storage media, whether the computer-readable storage media is characterized as primary or secondary storage, and the like. For example, if the computer-readable storage media is implemented as semiconductor-based memory, the software disclosed herein may be encoded on the computer-readable storage media by transforming the physical state of the semiconductor memory. For example, the software may transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. The software also may transform the physical state of such components in order to store data thereupon.
As another example, the computer-readable storage media disclosed herein may be implemented using magnetic or optical technology. In such implementations, the software presented herein may transform the physical state of magnetic or optical media, when the software is encoded therein. These transformations may include altering the magnetic characteristics of particular locations within given magnetic media. These transformations also may include altering the physical features or characteristics of particular locations within given optical media to change the optical characteristics of those locations. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this discussion.
In light of the above, it may be appreciated that many types of physical transformations take place in the architecture 1000 in order to store and execute the software components presented herein. It also may be appreciated that the architecture 1000 may include other types of computing devices, including wearable devices, handheld computers, embedded computer systems, smartphones, PDAs, and other types of computing devices known to those skilled in the art. It is also contemplated that the architecture 1000 may not include all of the components shown in
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
This application claims benefit and priority to U.S. Provisional Application Ser. No. 62/903,428 filed Sep. 20, 2019, entitled “Unbiased ETL System for Timed Medical Event Prediction” which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62903428 | Sep 2019 | US