The present disclosure relates to a model-assisted system and method for predicting a date relating to an event.
It is important to understand the effectiveness of treatments (e.g., drugs administered orally) in real-world settings, particularly for diseases whose treatment landscapes are evolving rapidly. One such disease is renal cell carcinoma (RCC). Oral drugs are becoming increasingly common in oncology care. Since 2006, ten new targeted drugs have been approved for RCC, leading to uncertainties in guidelines that could benefit from studies using real-world evidence. In contrast to intravenous chemotherapy, which is administered in the clinic and carefully tracked via structured electronic health records (EHRs), oral drug treatments are typically self-administered at home and, therefore, less well-tracked. A challenge in conducting such studies on EHRs is that treatment information often appears only in free text in unstructured clinic notes, a phenomenon particularly prevalent for oral cancer treatments. Identifying and structuring this information is an important task in understanding a patient's treatment history. Additionally, most existing work on extracting drugs from EHRs has focused on discharge summaries. However, for chronic diseases such as cancer, drug treatment information is scattered longitudinally across clinic notes, requiring synthesis across the patient record.
Thus, there is a need for automated approaches for extracting drug treatment information from clinic notes.
Embodiments consistent with the present disclosure include systems and methods for predicting dates of an event associated with a patient. Embodiments of the present disclosure may overcome one or more aspects of existing techniques for predicting dates of an event by providing model-based, automated techniques for date prediction based on unstructured data. For example, a trained model may receive a plurality of unstructured documents and label the unstructured documents. The model may also predict and output a start date of an event associated with a patient (e.g., the patient taking a drug). The use of models in accordance with embodiments of the present disclosure thus allows for faster and more efficient prediction of dates of an event. In addition, the use of rules in accordance with embodiments of the present disclosure may be more accurate than extant techniques.
In one embodiment, a model-assisted selection system for predicting a date of an event relating to a patient may include at least one processor configured to obtain, from a storage device, a medical record of the patient. The medical record may include a plurality of unstructured documents. The at least one processor may also be configured to obtain a model for predicting the date of the event. The at least one processor may further be configured to input the medical record into the model and assign, for each of the plurality of unstructured documents, a label from the model. The label may be determined from among four labels, including a pre-event label, a mid-event label, a post-event label, and a non-event label. The pre-event label may indicate that a document relates to a date before the event. The mid-event label may indicate that a document relates to a date during the event. The post-event label may indicate that a document relates to a date after the event. The non-event label may indicate that a document is non-determinative or unrelated to the event. The at least one processor may also be configured to predict a start date of the event based on the labels of the plurality of unstructured documents and output the predicted start date.
In one embodiment, a model-assisted system for predicting a date of an event relating to a patient may include at least one processor configured to obtain a medical record of the patient. The medical record may include a plurality of unstructured documents. The at least one processor may further be configured to obtain a model for predicting the event. The at least one processor may also be configured to input the medical record into the model. According to the model and the medical record, for each of the plurality of unstructured documents, the at least one processor may further be configured to identify one or more time expressions in each of the plurality of unstructured documents. The at least one processor may also be configured to determine one or more dates relating to the identified one or more time expressions. The at least one processor may further be configured to determine a probability score for the determined one or more dates for being associated with the beginning of the event, the ending of the event, or a non-event date. The at least one processor may also be configured to predict a start date of the event based on the probability scores. The at least one processor may further be configured to output the predicted start date.
In one embodiment, a model-assisted system for predicting a date of an event relating to a patient may include at least one processor configured to obtain a first model for predicting the event. The at least one processor may also be configured to input a medical record of the patient into the first model. The medical record may include a plurality of unstructured documents. The at least one processor may further be configured to obtain, for each of the plurality of unstructured documents, a label from the first model. The label may be determined by the first model from among four labels, including a pre-event label, a mid-event label, a post-event label, and a non-event label. The pre-event label may indicate that a document relates to a date before the event. The mid-event label may indicate that a document relates to a date during the event. The post-event label may indicate that a document relates to a date after the event. The non-event label may indicate that a document is non-determinative or unrelated to the event. The at least one processor may also be configured to predict a first preliminary start date of the event based on the labels of the plurality of unstructured documents. The at least one processor may further be configured to obtain, from the first model, a probability score for the first preliminary start date. The at least one processor may also be configured to obtain a second model for predicting the event. The at least one processor may further be configured to input the medical record into the second model. According to the second model and the medical record, for each of the plurality of unstructured documents, the at least one processor may also be configured to identify one or more time expressions in each of the plurality of unstructured documents.
The at least one processor may further be configured to determine one or more dates relating to the identified one or more time expressions and determine a probability score for the determined one or more dates for being associated with a beginning of the event, an ending of the event, or a non-event date. The at least one processor may also be configured to predict a second preliminary start date of the event based on the determined probability scores. The at least one processor may further be configured to determine a probability score of the second preliminary start date. The at least one processor may also be configured to determine a start date of the event based on the first preliminary start date, the probability score of the first preliminary start date, the second preliminary start date, and the probability score of the second preliminary start date.
Consistent with other disclosed embodiments, non-transitory computer-readable storage media may store program instructions, which are executed by at least one processing device and perform any of the methods described herein.
The accompanying drawings, which are incorporated in and constitute a part of this specification, together with the description, illustrate and serve to explain the principles of various exemplary embodiments. In the drawings:
The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the following description to refer to the same or similar parts. While several illustrative embodiments are described herein, modifications, adaptations and other implementations are possible. For example, substitutions, additions or modifications may be made to the components illustrated in the drawings, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.
Embodiments herein include computer-implemented methods, tangible non-transitory computer-readable mediums, and systems. The computer-implemented methods may be executed, for example, by at least one processor (e.g., a processing device) that receives instructions from a non-transitory computer-readable storage medium. Similarly, systems consistent with the present disclosure may include at least one processor (e.g., a processing device) and memory, and the memory may be a non-transitory computer-readable storage medium. As used herein, a non-transitory computer-readable storage medium refers to any type of physical memory on which information or data readable by at least one processor may be stored. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage medium. Singular terms, such as “memory” and “computer-readable storage medium,” may additionally refer to multiple structures, such as a plurality of memories and/or computer-readable storage mediums. As referred to herein, a “memory” may comprise any type of computer-readable storage medium unless otherwise specified. A computer-readable storage medium may store instructions for execution by at least one processor, including instructions for causing the processor to perform steps or stages consistent with an embodiment herein. Additionally, one or more computer-readable storage mediums may be utilized in implementing a computer-implemented method. The term “computer-readable storage medium” should be understood to include tangible items and exclude carrier waves and transient signals.
In this disclosure, a Temporally Integrated Framework for Treatment Intervals (TIFTI), a robust, generalizable framework for extracting oral drug treatment intervals from a patient's unstructured notes, is presented. TIFTI may leverage distinct sources of temporal information by breaking the problem down into a document-level sequence labeling task and a date extraction task.
According to one embodiment, a system may be configured to predict a start date of taking a drug by a patient. The system may input the name of the drug and a plurality of unstructured data, such as clinic visit notes, into a model, which may predict whether the patient took the drug and, if so, predict the time interval over which the patient took the drug. A user of the disclosed systems and methods may encompass any individual who may wish to access a patient's clinical experience and/or analyze patient data. Thus, throughout this disclosure, references to a “user” of the disclosed systems and methods may encompass any individual, such as a physician, a quality assurance department at a health care institution, and/or the patient.
One or more data sources 101 may obtain or generate a medical record (or medical data thereof) of a patient. For example, a data source may be a computer (e.g., computer 101-1 illustrated in
Data sources 101 may include a computer (e.g., computer 101-1), a mobile device (e.g., smartphone 101-2), a scanner (e.g., scanner 101-3), a copier, a fax machine, a multi-function machine, a tablet computer, a personal digital assistant (PDA), or the like, or a combination thereof.
Computing device 102 may receive the medical record (or medical data) of the patient from one or more data sources 101 via network 104. In some embodiments, computing device 102 may receive medical data of the patient from one or more data sources 101 and compile the medical data into a medical record of the patient. Computing device 102 may also be configured to process the medical record (or medical data) to predict a date relating to an event associated with the patient. For example, computing device 102 may obtain a medical record of a patient and a model for predicting a start date of taking a particular drug by a patient (e.g., a trained neural network). Computing device 102 may further input the medical record into the model and obtain the prediction of the date from the model (e.g., via an output layer of the model). Computing device 102 may further output the prediction of the date to, for example, an output device. In some embodiments, computing device 102 may transmit the prediction to a physician or medical personnel associated with the patient. For example, computing device 102 may transmit the prediction to computer 101-1 located in a clinic office.
In some embodiments, computing device 102 may train a model for predicting a date relating to an event based on a training algorithm and training data. Alternatively or additionally, computing device 102 may obtain a model from a database (e.g., database 103 and/or database 160).
Database 103 may be configured to store information and data for one or more components of system 100. For example, database 103 may receive one or more medical records (or medical data thereof) from one or more data sources 101 and/or computing device 102 via, for example, network 104, and store the received data. Alternatively or additionally, database 103 may store one or more (untrained and/or trained) models and transmit one or more models to computing device 102 (e.g., if a request for a model is received) via network 104. In some embodiments, database 103 may store training data and transmit the training data to computing device 102 via, for example, network 104.
Network 104 may be configured to facilitate communications among the components of system 100. Network 104 may include a local area network (LAN), a wide area network (WAN), portions of the Internet, an Intranet, a cellular network, a short-ranged network (e.g., a Bluetooth™ based network), or the like, or a combination thereof.
The processor may be configured to perform one or more functions described in this disclosure. The processor may comprise at least one processing device, such as one or more generic processors, e.g., a central processing unit (CPU), a graphics processing unit (GPU), or the like and/or one or more specialized processors, e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or the like.
Computing device 102 may also include a memory 152 that may store instructions for various components of computing device 102. For example, memory 152 may store instructions that, when executed by processor 151, may be configured to cause processor 151 to perform one or more functions described herein.
Input device 153 may be configured to receive input from the user of computing device 102, and one or more components of computing device 102 may perform one or more functions in response to the input received. Output device 154 may be configured to output information and/or data to the user. For example, output device 154 may include a display configured to display a predicted date of an event to the user.
Database 160 may be configured to store various data and information for one or more components of computing device 102. For example, database 160 may include a medical record database 161 configured to store medical records of patients, from which processor 151 may receive one or more medical records. Database 160 may also include model database 162 configured to store one or more models for predicting a date of an event. A model may be a trained model or an untrained model. For example, processor 151 may receive a trained model for predicting a date of an event from model database 162. As another example, processor 151 may receive an untrained model and train the model based on training data (which may be stored in training data database 163). Database 160 may further include a training data database 163 configured to store training data, from which processor 151 may receive training data to train or modify a model.
Medical record 200 may include both structured data 210 and unstructured data 220. Structured data 210 may include quantifiable or classifiable data about the patient, such as gender, age, race, weight, vital signs, lab results, date of diagnosis, diagnosis type, disease staging (e.g., billing codes), therapy timing, procedures performed, visit date, practice type, insurance carrier and start date, medication orders, medication administrations, or any other measurable data about the patient. Unstructured data 220 may include information about the patient that is not quantifiable or easily classified, such as physician's notes or the patient's lab reports. Unstructured data 220 may include information such as a physician's description of a treatment plan, notes describing what happened at a visit, descriptions of how a patient is doing, radiology reports, pathology reports, etc. In some embodiments, the unstructured data may be captured by an abstraction process, while the structured data may be entered by the health care professional or calculated using one or more algorithms. Unstructured data 220 may include a plurality of unstructured documents (e.g., exemplary unstructured documents 221 and 222 illustrated in
In the data received from data sources 101, each patient may be represented by one or more records generated by one or more health care professionals or by the patient. For example, a doctor associated with the patient, a nurse associated with the patient, a physical therapist associated with the patient, or the like, may each generate a medical record (or a portion thereof) for the patient. In some embodiments, one or more records may be collated and/or stored in the same database. Alternatively or additionally, one or more records may be distributed across a plurality of databases. In some embodiments, the records may be stored and/or provided as a plurality of electronic data representations. For example, the patient records may be represented as one or more electronic files, such as text files, portable document format (PDF) files, extensible markup language (XML) files, or the like. If the documents are stored as PDF files, images, or other files without text, the electronic data representations may also include text associated with the documents derived from an optical character recognition process.
Labeled records 310 may be input to feature extraction 321. For example, labeled records 310 may be stored in one or more databases. Labeled records 310 may include data associated with a plurality of patients such that each patient is associated with one or more medical records. In some embodiments, a labeled record may include a plurality of unstructured documents (original or preprocessed) and a label associated with each of the unstructured documents. Alternatively or additionally, the labeled record may include a date and/or a period of an event (e.g., a start date, an end date, a time period, or the like, or a combination thereof). Alternatively or additionally, the labeled record may include one or more time expressions associated with an unstructured document and/or a revised unstructured document associated with the time expression(s) (as described elsewhere in this disclosure).
Feature extraction 321 may extract features (such as keywords, key phrases, or the like) from labeled records 310 and may score those features for a level of relevance to a date of the event. Accordingly, in some embodiments, the features may be represented as vectors.
A portion of the features extracted by feature extraction 321 may be collated with corresponding labels of records 310 and stored as training data 323. Training data 323 may then be used by one or more training algorithms 325. For example, training algorithm 325 may include a logistic regression that may generate one or more functions (or rules) that relate extracted features to particular labels (e.g., a label assigned to a document, a labeled date of the event, a labeled period of the event, a labeled time expression, or a labeled revised unstructured document), which may serve as ground truths. For example, training algorithm 325 may include a simple l2-regularized logistic regression, which may be featurized by ngrams. Additionally or alternatively, training algorithm 325 may include one or more neural networks that adjust weights of one or more nodes such that an input layer of features is run through one or more hidden layers and then through an output layer of labels (with associated probabilities). For example, the neural network may include an explicitly cascaded model, a long short-term memory (LSTM), or the like, or a combination thereof. Training algorithm 325 may output one or more models 330.
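By way of illustration, an l2-regularized logistic regression featurized by ngrams, such as the one described above, may be sketched with the scikit-learn library. The sketch below is illustrative only; the example documents, placeholder text, and label strings are hypothetical assumptions and do not appear in any disclosed embodiment.

```python
# Illustrative sketch of training algorithm 325 as an l2-regularized
# logistic regression over ngram features.  Documents and labels are
# hypothetical examples for demonstration purposes only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "plan to start DRUG next week",          # before the event
    "patient tolerating DRUG well",          # during the event
    "DRUG discontinued due to progression",  # after the event
    "no mention of systemic therapy",        # unrelated to the event
]
labels = ["PRE", "MID", "POST", "OTHER"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),               # unigram + bigram features
    LogisticRegression(penalty="l2", max_iter=1000),   # l2 regularization
)
model.fit(documents, labels)

# The trained model assigns one of the four labels to a new document.
prediction = model.predict(["patient continues DRUG"])[0]
```

In practice, such a model would be fit on a large corpus of labeled clinic notes rather than the four toy documents shown here.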
In some embodiments, every node in one layer is connected to every other node in the next layer. A node may take the weighted sum of its inputs and pass the weighted sum through a non-linear activation function, the results of which may be output as the input of another node in the next layer. The training data may flow from left to right, and the final output may be calculated at the output layer based on the calculation of all the nodes.
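The node computation described above, a weighted sum of inputs passed through a non-linear activation, may be sketched as follows. The weights, bias values, and choice of ReLU as the activation function are hypothetical assumptions for illustration.

```python
import numpy as np

def dense_layer(x, W, b):
    """Each node takes the weighted sum of its inputs (W @ x + b) and
    passes that sum through a non-linear activation (here, ReLU)."""
    return np.maximum(0.0, W @ x + b)

# Hypothetical layer with 3 inputs and 2 nodes.
x = np.array([1.0, 2.0, 0.5])            # input features
W = np.array([[0.2, 0.4, -0.1],          # weights of node 1
              [-0.3, 0.1, 0.5]])         # weights of node 2
b = np.array([0.05, -0.05])              # per-node bias terms
out = dense_layer(x, W, b)               # output feeds the next layer
```

Stacking several such layers, with the output of each layer serving as the input of the next, yields the left-to-right flow described above.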
Referring to
At step 501, computing device 102 may be configured to obtain a medical record of a patient from a storage device (e.g., database 103 and/or database 160). A medical record may include a plurality of unstructured documents. In some embodiments, the medical record may also include structured data, such as quantifiable or classifiable data about the patient. An unstructured document may include information about the patient that is not quantifiable or easily classified. Exemplary unstructured documents may include a patient's notes, clinic visit notes, a physician's description of a treatment plan, lab reports, descriptions of how a patient is doing, radiology reports, pathology reports, or the like, or a combination thereof. An unstructured document may be prepared by the patient, a nurse, a physician, a laboratory technician, or the like, or a combination thereof.
In some embodiments, computing device 102 may preprocess the received medical record. For example, for the unstructured documents, computing device 102 may remove the document(s) and sentence(s) without a mention of the drug (by either its generic or brand name). Alternatively or additionally, computing device 102 may remove redundant information included in the medical record. For example, computing device 102 may remove one or more sentences that already appeared in an earlier document (e.g., a clinic note that occurred prior to the present note). Alternatively or additionally, computing device 102 may replace each mention of the drug with the placeholder “DRUG” and each mention of other commonly taken drugs with the placeholder “OTHER-DRUG.” This preprocessing may ensure that the features learned by a model are generalizable across drugs.
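The preprocessing steps above, dropping sentences without a drug mention, removing sentences repeated from earlier notes, and substituting placeholders, may be sketched as follows. The drug names in the regular expressions are hypothetical examples, not drugs named in this disclosure.

```python
import re

# Hypothetical generic/brand names for the drug of interest and for
# another commonly taken drug (illustrative assumptions only).
DRUG = re.compile(r"\b(pazopanib|votrient)\b", re.IGNORECASE)
OTHER_DRUG = re.compile(r"\b(sunitinib|sutent)\b", re.IGNORECASE)

def preprocess_note(text, seen_sentences):
    """Keep only sentences mentioning the drug, drop sentences already
    seen in earlier notes, and replace drug mentions with placeholders."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not DRUG.search(sentence) or sentence in seen_sentences:
            continue                      # no drug mention, or redundant
        seen_sentences.add(sentence)      # remember for later notes
        sentence = DRUG.sub("DRUG", sentence)
        sentence = OTHER_DRUG.sub("OTHER-DRUG", sentence)
        kept.append(sentence)
    return " ".join(kept)

seen = set()
note = "Started pazopanib today. Continue sunitinib taper. Follow up in 4 weeks."
print(preprocess_note(note, seen))   # -> "Started DRUG today."
```

A second note repeating the same sentence would yield an empty result, reflecting the redundancy removal described above.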
Computing device 102 may also generate a preprocessed medical record. The preprocessed medical record may include a plurality of preprocessed unstructured documents based on the original unstructured documents. In some embodiments, two or more preprocessed unstructured documents may form a document timeline. A document timeline may include preprocessed unstructured documents sorted according to the time when a document was prepared, or a timestamp associated with the document.
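The construction of a document timeline described above may be sketched as a simple sort of the preprocessed documents by their timestamps. The dates and note snippets below are hypothetical:

```python
from datetime import date

# Hypothetical (timestamp, preprocessed note) pairs; a document
# timeline simply orders the notes by the date each was prepared.
notes = [
    (date(2019, 1, 28), "DRUG discontinued"),
    (date(2018, 12, 15), "patient started DRUG"),
    (date(2019, 1, 7), "tolerating DRUG well"),
]
timeline = sorted(notes, key=lambda pair: pair[0])
print([d.isoformat() for d, _ in timeline])
# -> ['2018-12-15', '2019-01-07', '2019-01-28']
```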
In some embodiments, computing device 102 may input original unstructured documents into a model for preprocessing and obtain the preprocessed unstructured documents from the model. In some embodiments, computing device 102 may input original unstructured documents into a model for preprocessing and predicting a date of an event (i.e., the model may be configured to preprocess the medical record and predict a date), and computing device 102 may receive the prediction from the model.
In some embodiments, the preprocessing may be part of step 701 of process 700, step 1001 of process 1000, and/or step 1101 of process 1100.
At step 503, computing device 102 may be configured to obtain a model for predicting the date of the event. In some embodiments, the model may include a trained model generated based on a training process (e.g., training process 300 as described elsewhere in this disclosure). In some embodiments, the model may be a simple l2-regularized logistic regression, which may be featurized by ngrams. Alternatively or additionally, the model may include one or more neural networks. The neural network may include an explicitly cascaded model, a long short-term memory (LSTM), or the like, or a combination thereof.
In some embodiments, computing device 102 may obtain a model based on a particular event of interest. For example, computing device 102 may obtain a first model for a first drug, but may obtain a second model for a second drug. Alternatively or additionally, computing device 102 may obtain a model based on the demographic information relating to the patient of interest (e.g., age, gender).
In some embodiments, the model may include an input layer, one or more hidden layers, and an output layer. Each layer may include one or more nodes. The input layer may receive input (e.g., a drug name, a medical record, a preprocessed medical record, unstructured documents, preprocessed unstructured documents, or the like, or a combination thereof). In some embodiments, the output layer may include one node configured to output a datum (e.g., a predicted start date of the event) or a set of data (e.g., a plurality of candidate dates and probability scores associated with the candidate dates). Alternatively, the output layer may include a plurality of nodes, and each of the nodes may output a different datum. In some embodiments, every node in one layer is connected to every other node in the next layer. A node may take the weighted sum of its inputs and pass the weighted sum through a non-linear activation function, the results of which may be output as the input of another node in the next layer. The input data may flow through the layers, and the final output may be calculated at the output layer based on the calculation of all the nodes.
At step 505, computing device 102 may be configured to input the medical record into the model. For example, the user may select the medical record to be input to the model via input device 153. In some embodiments, the model may include an input layer, and computing device 102 may input the medical record into the input layer of the model. In some embodiments, the medical record may include at least one preprocessed unstructured document.
At step 507, computing device 102 may be configured to assign, for each of the plurality of unstructured documents, a label from the model. In some embodiments, the model may assign a label to an unstructured document based on the timestamp and/or time expression (explicit or implicit) indicated in the document. Alternatively or additionally, the model may take the timestamp and/or time expression indicated in another document (or multiple documents) into consideration in determining a label for an unstructured document. For example, the model may include a classification algorithm configured to assign a label to an unstructured document as the output from the output layer. By way of example, the model may assign to the unstructured document a label from among four labels, including a pre-event label (referred to herein as a “PRE” label), a mid-event label (referred to herein as a “MID” label), a post-event label (referred to herein as a “POST” label), and a non-event label (referred to herein as an “OTHER” label). The PRE label may indicate that a document relates to a date before the event. The MID label may indicate that a document relates to a date during the event. The POST label may indicate that a document relates to a date after the event. The OTHER label may indicate that a document is non-determinative or unrelated to the event.
In some embodiments, the model may implement rules or restraints to assign a label to an unstructured document. For example, the rules or restraints may be configured such that no document labeled MID may precede a document labeled PRE, and no document labeled POST may precede a document labeled MID. In some embodiments, one or more hidden layers of the model may include at least one restraining module to implement the rules or restraints described in this disclosure.
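One way to implement such ordering restraints is a small Viterbi-style dynamic program that selects the highest-probability label sequence in which the PRE/MID/POST phases never run backwards (OTHER may appear anywhere). This is a sketch under stated assumptions; the per-document probabilities below are hypothetical, and a disclosed embodiment might enforce the restraints differently (e.g., inside a hidden layer).

```python
import numpy as np

LABELS = ["PRE", "MID", "POST", "OTHER"]

def constrained_labels(probs):
    """Highest-probability label sequence in which no MID precedes a
    PRE and no POST precedes a MID.  `probs` is an (n_docs x 4) array
    of per-document probabilities over [PRE, MID, POST, OTHER]."""
    logp = np.log(np.asarray(probs, dtype=float))
    n = len(logp)
    # For each document and phase, the better of the phase label or OTHER.
    emit = np.maximum(logp[:, :3], logp[:, 3:4])
    pick = np.where(logp[:, :3] >= logp[:, 3:4], np.arange(3), 3)
    prev = np.zeros((n, 3), dtype=int)
    best = emit[0].copy()
    for t in range(1, n):
        new_best = np.empty(3)
        for phase in range(3):
            p = int(np.argmax(best[:phase + 1]))   # phases never decrease
            prev[t, phase] = p
            new_best[phase] = best[p] + emit[t, phase]
        best = new_best
    phase = int(np.argmax(best))
    seq = []
    for t in range(n - 1, -1, -1):                 # backtrack
        seq.append(LABELS[int(pick[t, phase])])
        phase = int(prev[t, phase])
    return seq[::-1]

probs = [
    [0.7, 0.1, 0.1, 0.1],   # confidently PRE
    [0.1, 0.2, 0.6, 0.1],   # looks POST in isolation...
    [0.1, 0.7, 0.1, 0.1],   # ...but a confident MID follows
]
print(constrained_labels(probs))   # -> ['PRE', 'MID', 'MID']
```

An unconstrained per-document argmax would yield the inadmissible sequence PRE, POST, MID; the restraint repairs the middle document to MID.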
In some embodiments, the model may include an output layer, and computing device 102 may be configured to assign, for each of the plurality of unstructured documents, a label from the output layer of the model.
By way of example, referring to
In some embodiments, the model may also determine a probability score for the assignment of the label to the unstructured document. Alternatively or additionally, the model may determine for each document a probability distribution across two or more labels. The model may also assign the label having the highest probability score as the label of the document.
At step 509, the model (or computing device 102) may be configured to predict a start date (or an end date, a period, or the like, or a combination thereof) of the event based on the labels of the plurality of unstructured documents.
In some embodiments, the model may implement rules or restraints to predict a date of the event. For example, one or more hidden layers of the model may include at least one restraining module to implement rules or restraints such that if there is no document labeled MID or POST, the model may output an indication that the drug was not taken. As another example, the rules may be implemented such that the start date may be set to the timestamp (or time expression) of the first document with a MID label (if one exists) or, if no document is labeled MID, to the timestamp (or time expression) of the first document with a POST label (if one exists). By way of example, referring to
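The date-prediction rules above may be sketched as follows, assuming the labeled documents are ordered on the document timeline. The timestamps shown are the exemplary dates used elsewhere in this disclosure; the function name is a hypothetical convenience.

```python
from datetime import date

def predict_start_date(labeled_docs):
    """Apply the rules described above: if no document is labeled MID
    or POST, report that the drug was not taken (None); otherwise the
    start date is the timestamp of the first MID document if one
    exists, or else the timestamp of the first POST document.
    `labeled_docs` is a timeline-ordered list of (timestamp, label)."""
    for wanted in ("MID", "POST"):
        for timestamp, label in labeled_docs:
            if label == wanted:
                return timestamp
    return None   # no MID or POST: drug was not taken

docs = [
    (date(2018, 11, 30), "PRE"),
    (date(2018, 12, 15), "MID"),
    (date(2019, 1, 7), "OTHER"),
    (date(2019, 1, 28), "POST"),
]
print(predict_start_date(docs))   # -> 2018-12-15
```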
In some embodiments, the model may also determine a probability score for the predicted date(s). For example, the model may determine a probability score for the predicted start date of Dec. 15, 2018, and a probability score for the predicted end date of Jan. 28, 2019. The model may also output the dates and their corresponding probability scores. In some embodiments, the model may include an output layer, and the model may output the dates and their corresponding probability scores via the output layer.
In some embodiments, computing device 102 may receive the results of processing of the input by the model. For example, computing device 102 may receive the predicted date(s) and corresponding probability score(s) from the model. Alternatively or additionally, computing device 102 may receive from the model one or more labeled documents (e.g., one or more documents of documents 601, 603, 605, 607, and 609 with the assigned label(s)) and the probability scores associated with the labels.
At step 511, computing device 102 may be configured to output the predicted date(s). For example, computing device 102 may be configured to output the predicted start and end dates via output device 154 (e.g., a display). In some embodiments, computing device 102 may also be configured to output one or more results of the processing of the medical record by the model. For example, computing device 102 may be configured to output the probability scores associated with the dates.
At step 701, computing device 102 may obtain a medical record of the patient. In some embodiments, computing device 102 may obtain a medical record based on one or more operations similar to those described in connection with step 501 of process 500 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity.
At step 703, computing device 102 may obtain a model for predicting a date of an event associated with a patient. In some embodiments, computing device 102 may obtain a model based on one or more operations similar to those described in connection with step 503 of process 500 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity.
At step 705, computing device 102 may further be configured to input the medical record into the model. For example, the user may select the medical record to be input to the model via input device 153. In some embodiments, the medical record may include at least one preprocessed unstructured document. In some embodiments, the model may include an input layer, and computing device 102 may input the medical record into the input layer of the model.
At step 707, according to the model and medical data, computing device 102 may be configured to identify one or more time expressions in each of the plurality of unstructured documents. A time expression may be a defined term (e.g., “Jan. 28, 2019”), a relative term (e.g., “next Monday”), a term referring to another date or event (e.g., “since the last visit”), or the like, or a combination thereof. By way of example, referring to
At step 709, the model may determine one or more dates relating to the identified one or more time expressions. By way of example, referring to
In some embodiments, the model may determine a mapped date for a time expression based on the date of the document from which the time expression is identified. For example, as illustrated in
In some embodiments, the model may determine a mapped date for a time expression based on the date of the document from which the time expression is identified and the date of another document. For example, a document may include a time expression referring to a previous clinic visit (e.g., “since the last visit till last Monday”). The model may identify the time expression “since the last visit” in this document and determine a mapped date (or a period) for the time expression based on the dates of this document and a document associated with the previous visit (i.e., the “last visit” referred to in the document including the time expression).
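Mapping of this sort can be sketched with date arithmetic anchored to the document's own date. The following is a minimal illustration covering a few expression shapes only; the function name and the set of handled expressions are assumptions for illustration, not an exhaustive normalizer:

```python
from datetime import date, timedelta

def map_time_expression(expr, doc_date, prev_visit=None):
    """Map a few illustrative time expressions to concrete dates,
    anchored to the document's own date and, for references to a
    prior visit, the date of that earlier document."""
    expr = expr.lower().strip()
    if expr == "last monday":
        # most recent Monday strictly before doc_date
        delta = doc_date.weekday() % 7 or 7
        return doc_date - timedelta(days=delta)
    if expr.endswith("weeks ago"):
        n = int(expr.split()[0])  # e.g., "3 weeks ago"
        return doc_date - timedelta(weeks=n)
    if expr == "since the last visit" and prev_visit is not None:
        return prev_visit  # period begins at the prior document's date
    return None  # unrecognized expression

print(map_time_expression("3 weeks ago", date(2018, 12, 29)))  # → 2018-12-08
```

A relative expression such as “3 weeks ago” in a document dated Dec. 29, 2018 thus maps to Dec. 8, 2018, while “since the last visit” resolves only when the prior visit's document date is available.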
In some embodiments, the model may be configured to revise the content of a document based on the identified time expression and its mapped date. By way of example, referring to
In some embodiments, the model may be configured to update the medical record received and generate the updated medical record including at least one document having revised or new content. By way of example, referring to
In some embodiments, the model may also be configured to determine a probability score for the dates associated with the documents (e.g., a timestamp of a document, a date of a document, a mapped date associated with a document, or the like, or a combination thereof) for being associated with the beginning of the event (e.g., the start date), an ending of the event (e.g., the end date), or a non-event date. By way of example, the model may determine a probability score for the mapped date of Dec. 8, 2018 (which is associated with document 904) for being associated with the beginning of taking the drug by the patient. Alternatively or additionally, the model may be configured to label document 904 (and/or the mapped date) as “Start,” as illustrated in
In some embodiments, the model may determine whether to update a document based on the probability score for a date of the document for being associated with the beginning of the event (e.g., the start date), an ending of the event (e.g., the end date), or a non-event date. For example, referring to
At step 711, the model (or computing device 102) may be configured to predict one or more dates (and/or a period) associated with the event based on the probability scores associated with the dates of the documents. For example, the model may be configured to determine a date associated with a document (e.g., the timestamp of the document, the date of the document, a mapped date of the document) that has the highest probability score for being associated with the beginning (or the end) of taking the drug by the patient. By way of example, the model may determine that Dec. 8, 2018, which is associated with document 904 and has the highest probability score for being associated with the beginning of the event, is the start date.
In some embodiments, the model (or computing device 102) may be configured to predict one or more dates (and/or a period) associated with the event based on dates associated with the documents and the probability scores for the dates for being associated with the beginning or ending of the event. For example, the model may be configured to determine an earliest document in the document timeline (e.g., a document having an earliest timestamp) that has a probability score for being associated with the beginning of the event higher than a threshold. As another example, the model may identify, among the plurality of the unstructured documents, one or more documents having a mid-event label, select, among the one or more documents having a mid-event label, a document having an earliest timestamp, and assign a date of the timestamp of the selected document as the starting date of the event.
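The threshold-based selection described above can be sketched in a few lines, assuming each document has been scored with a probability of marking the event's start; the function name, data, and the example threshold are illustrative:

```python
from datetime import date

def predict_start(scored_docs, threshold=0.5):
    """Return the date of the earliest document whose probability of
    marking the start of the event exceeds the threshold.
    scored_docs: list of (timestamp, p_start) pairs."""
    for ts, p_start in sorted(scored_docs):  # earliest first
        if p_start > threshold:
            return ts
    return None  # no document is confident enough

scored = [
    (date(2018, 11, 20), 0.10),
    (date(2018, 12, 8), 0.85),
    (date(2018, 12, 15), 0.70),
]
print(predict_start(scored))  # → 2018-12-08
```

Note that raising the threshold trades recall for precision: with a threshold of 0.9, none of the example documents qualifies and no start date is predicted.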
At step 713, computing device 102 may be configured to output the predicted date(s). For example, computing device 102 may be configured to output the predicted start and end dates via output device 154 (e.g., a display). In some embodiments, computing device 102 may also be configured to output one or more results of the processing of the medical record by the model. For example, computing device 102 may also be configured to output the probability scores associated with the dates. As another example, computing device 102 may be configured to output the updated document timeline (e.g., updated document timeline 900). In some embodiments, the model may include an output layer configured to output one or more results of processing of the medical record by the model (e.g., one or more predicted dates, probability scores, updated documents, or the like, or a combination thereof).
At step 1003, computing device 102 may obtain a first model. In some embodiments, computing device 102 may obtain a first model similar to a model obtained in step 703 of process 700, and the detailed description is not repeated here for purposes of brevity.
At step 1005, computing device 102 may input the medical record into the first model. In some embodiments, computing device 102 may input the medical record into the first model based on one or more operations similar to those described in connection with step 705 of process 700 (or step 505 of process 500), and the detailed description is not repeated here for purposes of brevity.
At step 1007, the first model may generate and output an updated medical record, which may be received by computing device 102. The updated medical record may include at least one updated unstructured document having a mapped date. In some embodiments, the first model may generate one or more updated unstructured documents based on one or more operations similar to those described in connection with steps 707-711 of process 700. For example, the first model may be configured to identify one or more time expressions in an unstructured document of the medical record (similar to one or more operations described in connection with step 707 of process 700). The first model may also be configured to determine one or more dates (i.e., a mapped date) relating to the identified time expression(s) (similar to one or more operations described in connection with step 709 of process 700). The first model may further be configured to update the unstructured document by revising the content associated with the determined date relating to a time expression (similar to one or more operations described in connection with step 709 of process 700). In some embodiments, the first model may also be configured to create a “pseudo” document based on the determined date and the content of an original document. By way of example, the first model may generate document 904 illustrated in
In some embodiments, the first model may be configured to predict one or more preliminary dates (and/or a period) associated with the event based on the probability scores associated with the dates of the documents. For example, the first model may be configured to determine a date associated with a document (e.g., the timestamp of the document, the date of the document, a mapped date of the document) that has the highest probability score for being associated with the beginning (or the end) of taking the drug by the patient. The first model may also be configured to predict one or more preliminary dates (and/or a period) associated with the event based on dates associated with the documents and the probability scores for the dates for being associated with the beginning or ending of the event. The first model may further be configured to determine a probability score for the predicted preliminary date(s). If the probability score for the predicted preliminary date(s) is higher than a threshold (e.g., a value between 70% and 99%), the preliminary date(s) may be set as the dates associated with the event (e.g., a start date, an end date), and process 1000 may proceed to step 1015, where computing device 102 may output the predicted date(s).
At step 1009, computing device 102 may obtain a second model. In some embodiments, computing device 102 may obtain a second model that is similar to the model obtained at step 503 of process 500, and the detailed description is not repeated here for purposes of brevity.
At step 1011, computing device 102 may input the updated medical record into the second model. By way of example, computing device 102 may input an updated medical record including updated document timeline 900 into the second model.
At step 1013, computing device 102 may obtain one or more predicted dates associated with an event from the second model. In some embodiments, the second model may predict one or more dates associated with the event based on one or more operations similar to those described in connection with steps 507 and 509 of process 500, and the detailed description is not repeated here for purposes of brevity. By way of example, the second model may assign, for each of the updated documents (and/or original documents if they are not updated), a label based on the date associated with the updated document (e.g., a mapped date, a timestamp, a time expression, or the like, or a combination thereof). For example, the second model may assign a label, among PRE, MID, POST, and/or OTHER labels, to an updated (or original) document. The second model may further be configured to predict a start date (or an end date, a period, or the like, or a combination thereof) of the event based on the labels.
At step 1015, computing device 102 may output the predicted date(s) via, for example, output device 154. For example, computing device 102 may present the predicted start and end dates of taking the drug by the patient on a display. In some embodiments, computing device 102 may also present one or more results of the processing of the medical record and/or the updated medical record by the first and/or second models. By way of example, computing device 102 may present document timeline 500 and/or updated document timeline 900. As another example, computing device 102 may output the probability score of the predicted date(s).
At 1101, computing device 102 may be configured to obtain a medical record. In some embodiments, computing device 102 may be configured to obtain a medical record based on one or more operations similar to those described in connection with step 501 of process 500 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity. By way of example, computing device 102 may obtain a medical record including a plurality of unstructured documents from a database. The unstructured documents may include preprocessed documents. Alternatively or additionally, the unstructured documents may include updated documents.
At 1103, computing device 102 may be configured to obtain a first model and a second model for predicting a date associated with an event. In some embodiments, the first model may include a model similar to the model obtained in process 700, and the second model may include a model similar to the model obtained in process 500. Detailed descriptions are not repeated here for purposes of brevity.
At 1105, computing device 102 may be configured to input the medical record into the first model. In some embodiments, computing device 102 may be configured to input the medical record into the first model based on one or more operations similar to those described in connection with step 705 of process 700 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity.
At 1107, computing device 102 may be configured to obtain a first preliminary date associated with the event from the first model. The first preliminary date may include a start date and/or an end date of the event. In some embodiments, computing device 102 may be configured to predict the first preliminary date based on one or more operations similar to those described in connection with steps 707-711 of process 700 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity.
By way of example, the first model may be configured to identify one or more time expressions in an unstructured document of the medical record (similar to one or more operations described in connection with step 707 of process 700). The first model may also be configured to determine one or more dates (i.e., a mapped date) relating to the identified time expression(s) (similar to one or more operations described in connection with step 709 of process 700). The first model may further be configured to update the unstructured document by revising the content associated with the determined date relating to a time expression (similar to one or more operations described in connection with step 709 of process 700). The first model may also be configured to determine a probability score for a date associated with a document for being associated with the beginning of the event (e.g., the start date), an ending of the event (e.g., the end date), or a non-event date. The first model (or computing device 102) may be configured to predict the first preliminary date (and/or a period) associated with the event based on dates associated with the documents and the probability scores for the dates for being associated with the beginning or ending of the event.
At 1109, computing device 102 may be configured to input the medical record into the second model. In some embodiments, computing device 102 may be configured to input the medical record into the second model based on one or more operations similar to those described in connection with step 505 of process 500 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity.
At 1111, computing device 102 may be configured to obtain a second preliminary date from the second model. The second preliminary date may include a start date and/or an end date of the event. In some embodiments, computing device 102 may be configured to obtain a second preliminary date from the second model based on one or more operations similar to those described in connection with steps 507 and 509 of process 500 as described elsewhere in this disclosure, and the detailed description is not repeated here for purposes of brevity. By way of example, the second model may be configured to assign a label to unstructured documents based on the timestamps and/or time expressions (explicitly or implicitly) indicated in the documents. Computing device 102 or the second model may also be configured to predict a second preliminary date (e.g., a start date or an end date) of the event based on the labels of the unstructured documents. In some embodiments, the model may also determine a probability score for the second preliminary date.
At 1113, computing device 102 may be configured to predict a date of the event based on the first and second preliminary dates. For example, the first preliminary date may include a first preliminary start date of taking the drug by the patient, and the second preliminary date may include a second preliminary start date. Computing device 102 may receive the first and second preliminary start dates and their corresponding probability scores from the first and second models. Computing device 102 may determine a start date based on the first and second preliminary dates. For example, computing device 102 may select one of the first and second preliminary dates that has a higher probability score as the date of the event. As another example, computing device 102 may determine a date between the first and second preliminary dates by, for example, selecting a date around the midpoint of the first and second preliminary dates, and assign this determined date as the date of the event.
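Both combination strategies described above (selecting the higher-scored date, or taking a date between the two) can be sketched as follows; the function name and strategies are illustrative, assuming each model returns a date and a probability score:

```python
from datetime import date

def combine_dates(d1, p1, d2, p2, strategy="max_score"):
    """Combine two preliminary dates from the two models.
    strategy="max_score": keep the date with the higher probability.
    strategy="midpoint": pick the date halfway between the two."""
    if strategy == "max_score":
        return d1 if p1 >= p2 else d2
    if strategy == "midpoint":
        return d1 + (d2 - d1) // 2  # integer number of days
    raise ValueError("unknown strategy: " + strategy)

d1, d2 = date(2018, 12, 8), date(2018, 12, 16)
print(combine_dates(d1, 0.85, d2, 0.70))              # → 2018-12-08
print(combine_dates(d1, 0.85, d2, 0.70, "midpoint"))  # → 2018-12-12
```

The max-score strategy favors whichever model is more confident, while the midpoint strategy hedges between the two predictions when neither is clearly more reliable.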
At 1115, computing device 102 may be configured to output the date to the user. For example, computing device 102 may present the date to the user via output device 154 (e.g., a display).
Training data were obtained from the clinic visit notes of a set of patients with metastatic RCC, drawn from a longitudinal, demographically and geographically diverse database derived from electronic health record (EHR) data. Oral drug regimens, along with their start and end dates, were extracted by clinical experts via chart review. These dates were used for labeling and served as ground truth. The units of observation were patient-drug pairs. Only pairs in which the clinic notes contained at least one mention of the drug (either by the generic or brand name) were considered. There were 8259 such patient-drug examples from 172 different practices. Of these, the drug was actually taken in 4410 (53%) examples; in the rest, the drug was mentioned in the notes but not taken.
Eighty percent of the labeled data were used for training models, and 20% were used for testing. The dataset was split such that no patient appeared in both the training and test sets.
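A patient-level split of this kind can be sketched as follows; the function, the example patient/drug identifiers, and the grouping by the first tuple element are assumptions for illustration:

```python
import random

def split_by_patient(examples, test_frac=0.2, seed=0):
    """Split patient-drug examples so that no patient appears in both
    the training and test sets. Each example is a (patient_id, drug)
    pair; all of a patient's examples land on the same side."""
    patients = sorted({ex[0] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = int(len(patients) * test_frac)
    test_patients = set(patients[:n_test])
    train = [ex for ex in examples if ex[0] not in test_patients]
    test = [ex for ex in examples if ex[0] in test_patients]
    return train, test

examples = [("p1", "sunitinib"), ("p1", "pazopanib"),
            ("p2", "sunitinib"), ("p3", "axitinib"),
            ("p4", "sunitinib"), ("p5", "pazopanib")]
train, test = split_by_patient(examples)
```

Splitting at the patient level rather than the example level prevents leakage: a patient with several drug examples would otherwise contribute near-duplicate notes to both sides of the split.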
The performance on the binary task of predicting whether the patient took the drug was measured using the F1 score. On true positive examples (those for which the model correctly predicted that the patient took the drug), the agreement of start and stop dates was measured as follows. Let Start_i(t) and Stop_i(t) be indicator variables denoting whether, for the i-th example, the predicted start or stop date matches the ground truth date within a window of t days. For example, Stop_i(7)=1 if either the patient is still taking the drug and the model correctly identifies this, or the last-taken date identified by the model is within a week of the ground truth. To measure overall date agreement, we used Start(t) and Stop(t), defined as the means over the true positives of the Start_i(t) and Stop_i(t) values.
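The aggregate agreement metric can be sketched as below. This simplified version computes the mean indicator over paired predicted and ground-truth dates; it omits the special case for patients still taking the drug, and the function name and example dates are illustrative:

```python
from datetime import date

def date_agreement(pred_dates, true_dates, t):
    """Mean over true-positive examples of the indicator that the
    predicted date falls within t days of the ground-truth date."""
    hits = [abs((p - g).days) <= t for p, g in zip(pred_dates, true_dates)]
    return sum(hits) / len(hits)

pred = [date(2018, 12, 8), date(2019, 1, 20)]
true = [date(2018, 12, 15), date(2019, 1, 1)]
print(date_agreement(pred, true, 7))   # → 0.5 (only the first is within a week)
print(date_agreement(pred, true, 30))  # → 1.0 (both are within 30 days)
```

Widening the window t makes the metric more forgiving, which is why the 30-day scores reported below are substantially higher than the exact-match (t=0) scores.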
To remain flexible and sensitive to dataset size, the TIFTI framework does not specify the classification algorithm for either sub-task. We tried multiple algorithms for each. On the document timeline sequence labeling task, we saw the best performance with a bidirectional LSTM over documents featurized by n-grams. On the time expression classification task, we saw the best performance with a simple L2-regularized logistic regression, also featurized by n-grams. These optimizations, along with other hyperparameter tuning, were performed using 5-fold cross-validation over the development set, optimizing on a combination of the F1 score, Start(0), and Stop(0).
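The n-gram featurization shared by both classifiers can be sketched as a simple bag-of-n-grams count over lowercased tokens; the function below is an illustrative stand-in, not the feature pipeline actually used:

```python
from collections import Counter

def ngram_features(text, n_max=2):
    """Bag-of-n-grams featurizer: counts of lowercase word n-grams
    (unigrams and bigrams by default) over whitespace tokens."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

feats = ngram_features("patient started sunitinib last week")
print(feats["started sunitinib"])  # → 1
```

Feature vectors of this form can then be fed to either the sequence labeler or the logistic regression; a bigram such as "started sunitinib" carries exactly the kind of start-date signal both classifiers exploit.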
In order to perform well for rare drugs and generalize across diseases, TIFTI abstracts away the drug name during feature generation and models each drug independently. To test whether this design had the intended effect, we created a dataset of advanced non-small cell lung cancer (NSCLC) examples (a portion in the development set and a portion in the test set), using the same data preprocessing and feature generation process as for RCC. We then measured the performance on the NSCLC test set of the final TIFTI model trained on RCC and of a TIFTI model trained on NSCLC examples.
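Abstracting away the drug name can be done by replacing any mention of the target drug (generic or brand) with a placeholder token before feature generation; the function below is an illustrative sketch, and the note text and drug names are examples:

```python
import re

def abstract_drug(text, drug_names):
    """Replace any mention of the target drug (generic or brand name)
    with a placeholder token so features generalize across drugs."""
    pattern = re.compile("|".join(map(re.escape, drug_names)), re.IGNORECASE)
    return pattern.sub("<DRUG>", text)

note = "Patient tolerated Sutent well; continue sunitinib 50 mg."
print(abstract_drug(note, ["sunitinib", "sutent"]))
# → "Patient tolerated <DRUG> well; continue <DRUG> 50 mg."
```

Because the featurizer never sees the literal drug name, n-grams learned from one drug (or even one disease) transfer directly to another, which is what the RCC-to-NSCLC experiment tests.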
On the RCC test set, the model had an F1 score of 0.944, a Start(0) score of 45.8%, a Stop(0) score of 52.4%, a Start(30) score of 85.9%, and a Stop(30) score of 77.6%. In an ablation study (Table 1), the two best performing models were the explicitly cascaded models. The model with the simulated document timeline slightly outperformed its counterpart with the original document timeline, both at 0 and 30 days, confirming that the pseudo-documents in the simulated timeline added useful context. This effect is only visible for the start date statistics, which is consistent with the fact that start dates were more likely than stop dates to be explicitly mentioned in text.
On the NSCLC test set, the model trained on the RCC data had an F1 score of 0.936, a Start(0) score of 49.1%, and a Stop(0) score of 57.1%. This performance was comparable to the performance on the RCC test set and was almost as high as the model trained on the NSCLC examples (F1: 0.947, Start(0): 50.3%, Stop(0): 57.8%), indicating that the framework generalized as intended.
TIFTI is a framework for extracting the spans of drug regimens from longitudinal clinic visit notes. TIFTI predicts the treatment interval over a simulated patient timeline formed by combining the temporal information from both free text and document timestamps. It predicted approximately 80% of dates within 30 days and generalized well to a new type of cancer.
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed embodiments. Additionally, although aspects of the disclosed embodiments are described as being stored in memory, one skilled in the art will appreciate that these aspects can also be stored on other types of computer-readable media, such as secondary storage devices, for example, hard disks or CD ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, 4K Ultra HD Blu-ray, or other optical drive media.
Computer programs based on the written description and disclosed methods are within the skill of an experienced developer. The various programs or program modules can be created using any of the techniques known to one skilled in the art or can be designed in connection with existing software. For example, program sections or program modules can be designed in or by means of .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, Python, R, C++, Objective-C, HTML, HTML/AJAX combinations, XML, or HTML with included Java applets.
Moreover, while illustrative embodiments have been described herein, the scope of this disclosure encompasses any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, and/or alterations as would be appreciated by those skilled in the art based on the present disclosure. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application. The examples are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as illustrative only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
This application claims the benefit of priority of U.S. Provisional Application No. 62/747,428, filed on Oct. 18, 2018. The entire contents of the foregoing application are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2019/056207 | 10/15/2019 | WO | 00
Number | Date | Country
---|---|---
62747428 | Oct 2018 | US