The following relates generally to the medical research and development arts, the healthcare database curation arts, healthcare data mining arts, and related arts.
Numerous areas of healthcare research and development leverage healthcare databases containing data on medical patients. Medical histories or other clinical data, patient billing data, administrative records pertaining to matters such as hospital bed occupancy, and so forth are maintained by hospitals or other medical facilities and/or by individual units such as the cardiac care unit (CCU), intensive care unit (ICU), or emergency admittance department. These databases store sensitive patient data that generally must be maintained confidentially under financial and/or medical privacy laws such as (in the United States) the Health Insurance Portability and Accountability Act (HIPAA).
To enable a patient database to be used for data analytics for clinical, hospital administrative, or other purposes while maintaining patient privacy, it is known to anonymize the database by removing patient-identifying information (PII). Information that needs to be anonymized includes patient name and/or medical identification number (suitably replaced by a randomly assigned number or the like), address, and so forth. Other anonymization measures may include removing “rare” patients who might be identifiable by a combination of unusual characteristics; for example, a patient who is 102 years old with a particular illness might be identified on the basis of that information alone.
In addition to rare patients, a patient might be identifiable based on timestamp information for events recorded in the patient record. For example, if a patient is admitted to the hospital on a certain date with a certain condition, that information may be sufficient to narrow the number of possible patient identifications to a small number. However, longitudinal information, that is, the time sequence of events and the time intervals between various events, is sometimes useful in healthcare data analytics. For example, the time interval between admission and discharge may be useful or even critical for analyzing hospital efficiency and/or effectiveness of a certain treatment. To reduce the potential for using a timestamp to identify an anonymized patient while retaining the longitudinal information potentially of value for the healthcare data analysis, in some anonymized databases the timestamps are shifted by some random amount (generally different for each patient), using a rigid shift for all timestamped events of a given patient. The random rigid time shift in timestamps makes patient identification via timestamp more difficult, while the use particularly of a rigid time shift retains the longitudinal information, i.e. the time interval information between events.
In one disclosed aspect, an anonymized healthcare data source device comprises at least one electronic processor programmed to integrate N anonymized healthcare databases (10) where N is a positive integer having a value of at least three by performing a database integration process including the operations of: for a pair of databases (i,j) of the N anonymized healthcare databases, identifying a set of features each contained in both databases i and j of the pair of databases (i,j) and generating a conversion table matching patients of the pair of databases based on patient similarity measured by the set of features; repeating the identifying and generating operations for each unique pair of databases of the N anonymized healthcare databases to generate N(N−1)/2 conversion tables. The at least one electronic processor is further programmed to perform a patient data retrieval process including the operation of retrieving patient data for one or more anonymized patients contained in the N anonymized healthcare databases using the N(N−1)/2 conversion tables.
In another disclosed aspect, an anonymized healthcare data source device comprises at least one electronic processor programmed to integrate a healthcare database i and a healthcare database j by performing a database integration process including the operations of: for the pair of databases (i,j), identifying a set of features each contained in both databases i and j of the pair of databases (i,j) including at least one longitudinal feature defined by a pair of timestamped events separated by a time interval Δt between the timestamps of the events and generating a conversion table matching patients of the pair of databases (i,j) based on patient similarity measured by the set of features including comparison of the time interval Δt for patients in the two databases (i,j). The at least one electronic processor is further programmed to perform a patient data retrieval process including the operation of retrieving patient data for one or more anonymized patients contained in both anonymized healthcare databases (i,j) using the conversion table matching patients of the pair of databases (i,j).
In another disclosed aspect, a non-transitory storage medium stores instructions readable and executable by a computer to perform an anonymized population image reconstruction method to reconstruct an anonymized population image from N anonymized healthcare databases where N is a positive integer having a value of at least two. The anonymized population image reconstruction method comprises: for a pair of databases (i,j) of the N anonymized healthcare databases, identifying a set of features each contained in both databases i and j of the pair of databases (i,j) and generating a conversion table matching patients of the pair of databases based on patient similarity measured by the set of features. The identifying and generating operations are repeated for each unique pair of databases of the N anonymized healthcare databases to generate the anonymized population image comprising contents of the N anonymized healthcare databases integrated by the N(N−1)/2 conversion tables.
One advantage resides in providing for integration of two, three, four, or more anonymized healthcare databases to leverage the combined data contained in the databases for healthcare data analytic tasks.
Another advantage resides in providing for the foregoing in which one or more anonymized healthcare databases is an unstructured healthcare database.
Another advantage resides in providing the foregoing in which longitudinal information, that is, time intervals between events, is leveraged in matching anonymized patients in different anonymized healthcare databases.
A given embodiment may provide none, one, two, more, or all of the foregoing advantages, and/or may provide other advantages as will become apparent to one of ordinary skill in the art upon reading and understanding the present disclosure.
The invention may take form in various components and arrangements of components, and in various steps and arrangements of steps. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Numerous challenges are posed in integration of anonymized healthcare databases. The various anonymized healthcare databases may vary significantly in scope, with only a portion of the data overlapping between any two databases. Indeed, this partial overlap is a significant motivating factor in the desire to integrate multiple anonymized healthcare databases to “fill in” information missing in one database with content from another database. For example, as used herein an “anonymized healthcare database” may be (by way of illustration): a medical records database, such as an anonymized database extracted from a comprehensive Electronic Medical Record (EMR) or a domain-specific medical database such as a cardiovascular information system (CVIS) or an intensive care unit (ICU) information system; an anonymized database extracted from a hospital billing department database; an anonymized database extracted from a medical insurance company database; an anonymized database extracted from a hospital admissions departmental database; or so forth. An anonymized database extracted from a CVIS can be expected to contain medical records pertaining to diagnosis and treatment of cardiovascular disease, but may not include information on insurance coverage for those diagnoses/treatments. By contrast, an anonymized database extracted from the hospital billing department can be expected to contain insurance reimbursement information but not medical diagnosis/treatment data. Combining these databases could provide a more holistic image of the patient population; but the limited content overlap between the two databases which provides motivation for the integration also makes such integration challenging.
In various embodiments disclosed herein, these problems are overcome by leveraging the integration of multiple (three or more) healthcare databases. This can provide a greater degree of overlap overall which motivates toward performing integration of the N databases in a single process; however, paradoxically it is disclosed herein that a more efficient and reliable approach for performing the integration is to first integrate each pair of anonymized healthcare databases, so as to generate a conversion table for each pair, and then refine the resulting N(N−1)/2 conversion tables based on consistency of patient matching between the N(N−1)/2 conversion tables. This approach recognizes that the overlap of features between the N databases is likely to be small, and moreover even where overlap is present certain features may be unreliable in some databases. By employing the disclosed approach of first integrating pairs of databases, a set of features can be chosen for each such pairwise integration that is well-chosen for that pair of anonymized healthcare databases. The additional information provided by the multiple (N>2) databases is then leveraged in the subsequent refinement step, which in some embodiments does not rely upon the features.
Additionally or alternatively, in embodiments disclosed herein these problems are overcome by leveraging longitudinal information, that is, the time sequence of events and the time intervals between various events. In general, a longitudinal feature is defined by a pair of timestamped events for a single anonymized patient in an anonymized healthcare database which are separated by a time interval Δt between the timestamps of the events. Such longitudinal features are well-defined even in an anonymized healthcare database in which the anonymization process introduces a random, but rigid, shift of all timestamps for each patient, since the rigid time shift does not affect the time intervals Δt between events.
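By way of a minimal illustrative sketch (not a description of any particular anonymization tool), the following Python fragment shows why a rigid per-patient shift leaves the time intervals Δt unchanged. The function name `rigid_shift`, the event labels, the timestamps, and the ±365-day shift range are all assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta

def rigid_shift(events, max_days=365, rng=None):
    """Shift every timestamp of one patient by the same random offset.

    `events` maps a (hypothetical) event label to its timestamp.
    """
    rng = rng or random.Random()
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return {label: ts + offset for label, ts in events.items()}

# Illustrative record for a single patient (made-up values).
original = {
    "admission": datetime(2016, 3, 1, 8, 30),
    "discharge": datetime(2016, 3, 9, 14, 0),
}
shifted = rigid_shift(original, rng=random.Random(0))

# The interval between the two events is the same before and after the shift.
dt_original = original["discharge"] - original["admission"]
dt_shifted = shifted["discharge"] - shifted["admission"]
```

Because the same offset is added to both timestamps, it cancels in the subtraction, so `dt_original` and `dt_shifted` are equal regardless of the offset drawn.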
With reference to
In general, the anonymization of a particular datum can be done by removing the data (redaction) or by replacing the data with a placeholder, the latter being preferable in situations where correlations with that particular type of information are desirably retained, albeit with anonymization. For example, medical care unit (e.g. hospital or care unit) entries may be replaced by placeholders that are internally consistent for the database. These placeholders are internally consistent within a given database, but vary essentially randomly between databases. For example, in Database 1 the hospital “Blackacre General Hospital” may be always replaced by the placeholder, e.g. “8243”, while “Whiteacre Community Medical Center” may be always replaced by the placeholder “1238”. In this example, every instance of medical care unit “Blackacre General Hospital” in Database 1 is replaced by (same) placeholder medical care unit “8243” and every instance of medical care unit “Whiteacre Community Medical Center” in Database 1 is replaced by the (same) placeholder medical care unit “1238”. On the other hand, to continue the example for Database 2, each instance of medical care unit “Blackacre General Hospital” in Database 2 may be replaced by the same placeholder medical care unit “EADF” (which is different from the placeholder “8243” used for Blackacre in anonymized Database 1), and each instance of “Whiteacre Community Medical Center” may be replaced by the same placeholder medical care unit “JSDF” (which again is different from the placeholder “1238” used for Whiteacre in anonymized Database 1). Such anonymization of medical care units by medical care unit placeholders that are internally consistent within the anonymized database enables a healthcare data analytic process operating on a database to identify correlations with a particular medical care unit while maintaining patient anonymity. 
For example, if Blackacre has a statistically significantly higher success rate for heart transplants than the average hospital, this will show up in Database 1 (assuming it stores heart transplant outcome data) as a statistically significantly higher success rate for heart transplants performed at anonymized hospital “8243”.
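The placeholder substitution described above may be sketched as follows. This is an illustrative sketch only: the helper names `build_placeholder_map` and `anonymize_units`, the record layout, and the choice of four-digit numeric placeholders are assumptions, not features of any disclosed embodiment.

```python
import secrets

def build_placeholder_map(names, n_digits=4):
    """Assign each distinct name one random placeholder, unique within the map."""
    mapping, used = {}, set()
    for name in sorted(set(names)):
        code = None
        while code is None or code in used:
            code = "".join(secrets.choice("0123456789") for _ in range(n_digits))
        used.add(code)
        mapping[name] = code
    return mapping

def anonymize_units(records, mapping):
    """Replace the care-unit field of every record with its placeholder."""
    return [{**rec, "care_unit": mapping[rec["care_unit"]]} for rec in records]

# Made-up records for one database.
records = [
    {"care_unit": "Blackacre General Hospital", "outcome": "success"},
    {"care_unit": "Whiteacre Community Medical Center", "outcome": "failure"},
    {"care_unit": "Blackacre General Hospital", "outcome": "success"},
]
# One map is drawn per database, so placeholders generally differ between databases.
map_db1 = build_placeholder_map(r["care_unit"] for r in records)
anon_db1 = anonymize_units(records, map_db1)
```

Within `anon_db1` every Blackacre record carries the same placeholder, so per-unit statistics can still be computed, while a second database anonymized with its own freshly drawn map would in general use different placeholders.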
On the other hand, some information may be anonymized by redaction, that is, removal. For example, residential address information may be redacted entirely, as this is highly identifying and useful correlations with residential address may not be expected for a typical healthcare data analytic process. In a variant embodiment, if residential address correlations are expected to be a useful input for the healthcare data analytic process, address anonymization may be performed by replacing each residential address by a broader geographical area, e.g. the residential city if this city has a sufficiently large population to assure an acceptable level of anonymity. A residential city or county with sufficiently small population may be redacted entirely to avoid retaining “rare” data that could be personally identifying, or may be replaced by a suitably larger geographical unit such as the residential state.
The anonymized healthcare databases 10 are generally expected to each be formatted in some structured format, for example in a relational database format or other structured database format, as spreadsheets, searchable column-delimited rich text files, or so forth. However, in some embodiments one or more of the databases 10 may be an unstructured database, for example storing written text reports on patients, or may have limited structure, e.g. a structured heading providing information such as patient name and demographic information followed by unstructured text reports. In such a case, natural language processing (NLP) may be employed to extract structured representations of the database contents, such as bag-of-words representations of text documents.
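As a minimal sketch of one such structured representation, a bag-of-words count can be computed from free text as shown below. The tokenization rule (lowercase alphabetic runs) and the sample report text are assumptions for illustration; a practical NLP pipeline would typically be more elaborate.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphabetic tokens in a free-text report."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Made-up report text.
report = "Patient admitted with chest pain. Chest X-ray normal."
bow = bag_of_words(report)
```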
As illustrated in
In the present case, k=2 since a pair is being drawn, and the set is the N anonymized healthcare databases 10 so that n=N, so the combination reduces to N(N−1)/2. In general, where N>2 the number of matched patients m may differ for different pairs of databases (i,j), although some overlap of patients between database pairs is expected for useful integration of three or more anonymized healthcare databases.
It is contemplated for the N(N−1)/2 conversion tables 20 to be embodied as a single table, e.g. a concatenation of the N(N−1)/2 tables each of dimension m×2 to form a single m×[N(N−1)] table. In this case it is assumed that all N(N−1)/2 constituent m×2 conversion tables have the same number of matched patients m; if this is not the case, then padding can be used to account for “missing” anonymized patients, e.g. if patient 49 of Database 1 has no match in Database 3 then the constituent m×2 conversion table for the pair (i,j)=(1,3) is suitably filled in by <null> or zeros or other placeholders.
The computer 14 is also programmed to perform the patient data retrieval process 18 to retrieve anonymized patient data from the N anonymized healthcare databases 10 using the N(N−1)/2 conversion tables 20. For example, a query may be submitted to the patient data retrieval process 18 to acquire the value of a query feature for a given patient identified by an anonymized patient ID used in Database 1. This patient ID can be used directly to retrieve the value of the query feature from Database 1, while for each of Databases j=2, . . . , N the appropriate conversion table for the database pair (1,j) is used to match the patient ID in Database j in order to retrieve the query feature value from Database j.
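The lookup just described may be sketched as follows, under the assumption (made only for this illustration) that each database is held as a dictionary keyed by anonymized patient ID and that each conversion table is a list of ID pairs. The function name `lookup_across_databases` and all data values are hypothetical.

```python
def lookup_across_databases(patient_id, feature, databases, conversion_tables):
    """Collect `feature` for a Database 1 patient from every database.

    databases: {db_index: {patient_id: {feature: value}}}
    conversion_tables: {(1, j): [(id_in_db1, id_in_db_j), ...]}
    """
    values = {}
    record = databases[1].get(patient_id, {})
    if feature in record:
        values[1] = [record[feature]]
    for j in sorted(databases):
        if j == 1:
            continue
        # Map the Database 1 ID to zero, one, or several IDs in Database j.
        matched = [b for a, b in conversion_tables.get((1, j), []) if a == patient_id]
        found = [databases[j][pid][feature] for pid in matched
                 if feature in databases[j].get(pid, {})]
        if found:
            values[j] = found
    return values

# Made-up contents of three small databases.
databases = {
    1: {49: {"primary_diagnosis": "heart failure"}},
    2: {7: {"primary_diagnosis": "heart failure", "insurer": "A"}},
    3: {12: {"insurer": "A"}},
}
conversion_tables = {(1, 2): [(49, 7)], (1, 3): [(49, 12)]}
result = lookup_across_databases(49, "primary_diagnosis", databases, conversion_tables)
```

In this sketch Database 3 contributes nothing to `result` because it does not contain the query feature, which anticipates the case in which a feature is present in only some of the N databases.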
However, in general the query feature may not be contained in all N databases. If the query feature is contained in only one of the N anonymized healthcare databases then the query feature is retrieved from the (single) anonymized healthcare database containing the query feature. On the other hand, if the query feature is contained in two or more of the N anonymized healthcare databases, then a retrieved value is generated for the query feature from the values of the query feature in the two or more of the N anonymized healthcare databases containing the query feature. This may be done, for example, using a feature accuracy metric for the query feature in the respective anonymized healthcare databases containing the query feature. For example, if the query requests the primary diagnosis for patient 49 and Databases 1, 2, and 3 each contain a primary diagnosis field, then this provides three values for primary diagnosis of patient 49 (after conversion of the anonymized patient ID 49 for the Databases 2 and 3 using appropriate m×2 conversion tables). If Databases 1 and 3 are known to have accuracy rates of 97% for primary diagnosis while Database 2 has a much lower accuracy rate (e.g. 71%) for this feature, then the retrieved value is generated as the primary diagnosis obtained from Databases 1 and 3, which are most likely to be accurate. Where different databases store different values for a given query feature, various approaches can be used to generate the retrieved value, such as taking the value for the database of the N databases 10 having the highest accuracy metric for that feature, or taking the most common value (e.g. if six databases list a value for the feature and five of these agree then the value appearing in five of the six databases may be chosen), or in the case of numerical values taking an average of the values (or of the values in some subset of the databases for which the accuracy metric of that feature is highest, or after removing any identifiable outlier values), or so forth.
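The three reconciliation approaches mentioned above (highest accuracy metric, most common value, and averaging) may be sketched as follows. The function names, the dictionary layout keyed by database index, the diagnosis strings, and the body weight values are assumptions for illustration; the accuracy figures of 97% and 71% follow the example above.

```python
from collections import Counter
from statistics import mean

def by_highest_accuracy(values, accuracy):
    """Take the value from the database with the highest accuracy metric."""
    best_db = max(values, key=lambda db: accuracy[db])
    return values[best_db]

def by_majority(values):
    """Take the most common value across databases."""
    return Counter(values.values()).most_common(1)[0][0]

def by_average(values, accuracy=None, min_accuracy=0.0):
    """Average numerical values, optionally only from sufficiently accurate databases."""
    kept = [v for db, v in values.items()
            if accuracy is None or accuracy[db] >= min_accuracy]
    return mean(kept)

# Values of one query feature for one patient, keyed by database index (made up).
diagnosis = {1: "pneumonia", 2: "bronchitis", 3: "pneumonia"}
accuracy = {1: 0.97, 2: 0.71, 3: 0.97}
weight_kg = {1: 80.0, 2: 95.0, 3: 82.0}

chosen_diagnosis = by_highest_accuracy(diagnosis, accuracy)
voted_diagnosis = by_majority(diagnosis)
averaged_weight = by_average(weight_kg, accuracy, min_accuracy=0.9)
```

Which rule is appropriate depends on the feature type and on whether accuracy metrics are available; the sketch does not attempt outlier removal.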
The queries received and processed by the patient data retrieval process 18 may vary depending upon the purpose of the query. For example, it may be desired to obtain the primary diagnosis for all male patients in the age range 30-50 years old; in this case the query might be formulated as a request for the set of primary diagnoses (with an enumeration for each different diagnosis) after appropriate filtering by age and gender. The query result in this case may be the set of data pairs {(diagnosis,count)} where each element (diagnosis,count) stores a text string indicating the diagnosis and a count of the number of patients (after age/gender filtering) having that diagnosis. If the N databases 10 are relational databases then the patient data retrieval process 18 may be implemented as a Structured Query Language (SQL) query engine that receives SQL queries.
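A query of this kind may be sketched with an in-memory SQLite database from the Python standard library. The table name `patients`, its column names, and the rows are hypothetical and serve only to show the shape of the {(diagnosis,count)} result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients "
    "(patient_id INTEGER, gender TEXT, age INTEGER, primary_diagnosis TEXT)"
)
# Made-up anonymized rows.
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [
        (1, "M", 34, "hypertension"),
        (2, "M", 47, "hypertension"),
        (3, "F", 41, "asthma"),
        (4, "M", 62, "asthma"),
        (5, "M", 30, "asthma"),
    ],
)
# Filter by gender and age, then count patients per diagnosis.
query = """
    SELECT primary_diagnosis, COUNT(*) AS n
    FROM patients
    WHERE gender = 'M' AND age BETWEEN 30 AND 50
    GROUP BY primary_diagnosis
    ORDER BY primary_diagnosis
"""
diagnosis_counts = conn.execute(query).fetchall()
```

Each tuple in `diagnosis_counts` is one (diagnosis,count) pair of the kind described above.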
With continuing reference to
The illustrative anonymized healthcare data source device 12 is shown in
With reference to
In the following, an illustrative example is described for matching patients in the chosen databases (i,j). In an operation 42, inclusion/exclusion criteria are applied to select the database portions to match. In order to match the patient records from Database i and Database j, the subsets of the two databases that are possibly related are extracted. For example, if Database i covers only the data of Medical-surgical and Burn-Trauma ICU patients, then the subset of patients in Database j who were admitted to Medical-surgical and Burn-Trauma ICU wards during their hospitalizations is extracted (i.e. included), while data from other areas that do not overlap Database i are excluded. It should be noted that the excluded/included data is determined by the overlap for the particular database pair (i,j) and may differ for different pairs.
In an operation 44, a set of features is identified for use in integrating the database pair (i,j). Here a set of non-uniquely identifying features is selected with which Database i and Database j can be reliably integrated. The selected features are each contained in both databases i and j of the pair of databases (i,j). Moreover, the selected features are optionally chosen based on available information on reliability. For example, if it is known that one of the databases is relatively inaccurate in terms of patients' gender records, but both Database i and Database j are accurate in terms of body weight records, then body weight is suitably chosen as a feature, and gender is suitably not chosen as a feature.
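The feature selection of operation 44 may be sketched as an intersection of the feature sets of the two databases followed by an optional reliability filter. The function name `select_features`, the reliability scores, and the 0.9 threshold are assumptions for illustration; a feature with no reliability information is kept in this sketch, which is one possible policy among several.

```python
def select_features(features_i, features_j, reliability_i, reliability_j,
                    min_reliability=0.9):
    """Keep features present in both databases and not known to be unreliable."""
    common = set(features_i) & set(features_j)
    return sorted(
        f for f in common
        # Missing reliability information defaults to 1.0, i.e. the feature is kept.
        if reliability_i.get(f, 1.0) >= min_reliability
        and reliability_j.get(f, 1.0) >= min_reliability
    )

# Made-up feature inventories and reliability scores.
features_i = ["gender", "body_weight", "age", "insurer"]
features_j = ["gender", "body_weight", "age", "diagnosis"]
reliability_i = {"gender": 0.60, "body_weight": 0.98}
reliability_j = {"gender": 0.99, "body_weight": 0.97}

selected = select_features(features_i, features_j, reliability_i, reliability_j)
```

Here gender is dropped because it is unreliable in one database, and insurer and diagnosis are dropped because each appears in only one database.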
With brief reference to
With returning reference to
where pi and pj are feature vectors representing a patient being compared in Database i and a patient being compared in Database j, respectively, and pi(f) represents the value of the fth feature for patient pi and likewise pj(f) represents the value of the fth feature for patient pj. The parameters wf are feature weights and/or unit conversion factors chosen to indicate the relative importance of the various features f=1, . . . , F and (if necessary) to convert different feature types to a common unit to permit computing the sum. In this formulation, a smaller value for D(pi,pj) indicates more similar patients, so that two patients may be matched if D(pi,pj) is less than some threshold value. Any missing features can be dealt with in various ways, such as simply omitting them from the sum forming D(pi,pj) (and scaling 1/F accordingly), or assigning some default value for pi(f)−pj(f) in the case of a missing feature f. It is to be appreciated that the foregoing is merely an illustrative example and that substantially any other comparison formalism may be used to identify matching patients in the respective Databases i and j.
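A comparison of this general kind may be sketched as follows. The exact functional form is an assumption made for this sketch: a weighted mean of absolute feature differences, D = (1/F) Σ wf·|pi(f)−pj(f)|, with missing features omitted and F reduced accordingly. The function name `patient_distance`, the weights, the feature values, and the threshold are likewise hypothetical.

```python
def patient_distance(p_i, p_j, weights):
    """Weighted mean absolute difference over the features present in both patients.

    Assumed form: D = (1/F) * sum_f w_f * |p_i(f) - p_j(f)|; missing features
    are omitted from the sum and F is reduced accordingly.
    """
    shared = [f for f in weights
              if p_i.get(f) is not None and p_j.get(f) is not None]
    if not shared:
        return float("inf")  # nothing to compare: treat as non-matching
    total = sum(weights[f] * abs(p_i[f] - p_j[f]) for f in shared)
    return total / len(shared)

# Made-up weights and patients.
weights = {"age": 1.0, "body_weight": 0.5}
p_i = {"age": 64, "body_weight": 170}
p_j = {"age": 65, "body_weight": 172}
p_k = {"age": 30, "body_weight": None}  # body weight missing

THRESHOLD = 2.0  # illustrative matching threshold
is_match = patient_distance(p_i, p_j, weights) < THRESHOLD
```

With these values the distance between `p_i` and `p_j` is (1.0·1 + 0.5·2)/2 = 1.0, which falls below the illustrative threshold, while the distance between `p_i` and `p_k` is computed on age alone.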
In an operation 48, the cross-database patient matches identified in the operation 46 are tabulated in a patient ID conversion table for the database pair (i,j). For example, this table may be an m×2 table such as:
where it will be noted that patient ID=3 in Database i has no match in Database j in this example, and similarly patient ID=6, ID=9, and ID=23 in Database j have no match in Database i. The illustrative example of Table 1 is sorted by patient ID of Database i, but it is trivial to perform a sort by patient ID of Database j if doing so will enable more efficient readout of the table (for example, if the query received by the patient data retrieval process 18 of
It should be noted that in some embodiments the patient matching is not exclusive. This is illustrated in Table 1 where patient ID=5 of Database i is matched with both patient 2 of Database j and with patient 3 of Database j. This optional non-exclusivity enables capture of uncertainties in the patient matching. For medical data analytic applications such non-exclusive matching is not necessarily problematic if the number of such uncertain matches is relatively low, and in such cases allowing for multiple matches in this way can improve the overall accuracy on a statistical basis. In the illustrative conversion table for databases (i,j) shown in Table 1, the storage is by way of duplicate entries for Database i Patient ID 5, which has the advantage of facilitating sorting the table on either the patient IDs of Database i or the patient IDs of Database j.
In a decision operation 50, the processing repeats for each unique pair of databases (i,j) in the set of N databases 10 being integrated, in order to generate a patient ID conversion table for each unique pair of databases (i,j). Thus, this loop will be performed N(N−1)/2 times to generate N(N−1)/2 conversion tables for the N(N−1)/2 unique database pairs obtainable from N databases. For example, if N=3 then there are three iterations, one for the pair (1,2), one for the pair (1,3), and one for the pair (2,3). As another illustrative example, if N=5 then there are ten iterations: (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5). The loop implemented by decision operation 50 can, for example, be implemented by the nested loop i=1 to N−1; j=i+1 to N (where j is the inner loop).
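The nested loop just described may be sketched as follows; the function name `unique_pairs` is hypothetical, and in an actual implementation the loop body would perform the pairwise integration rather than merely collect the index pairs.

```python
def unique_pairs(n):
    """Enumerate database pairs (i, j) with 1 <= i < j <= n."""
    pairs = []
    for i in range(1, n):              # outer loop: i = 1 .. N-1
        for j in range(i + 1, n + 1):  # inner loop: j = i+1 .. N
            pairs.append((i, j))
    return pairs

pairs_for_three = unique_pairs(3)
pairs_for_five = unique_pairs(5)
```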
The output of the N(N−1)/2 loop iterations is the N(N−1)/2 conversion tables for the N(N−1)/2 unique database pairs of the N databases 10. In some embodiments, this is the final output providing the N(N−1)/2 conversion tables 20 (each of dimensions m×2) used by the patient data retrieval process 18. However, if the database integration process 12 terminates at this point then information from the multiple (three or more) healthcare databases (i.e. N≥3) is not effectively leveraged to improve the individual m×2 pairwise conversion tables.
With continuing reference to
In another embodiment, such consistency analysis could be performed during the iterative loop 40, 42, 44, 46, 48, 50. This approach can reduce processing time for performing later loop iterations by leveraging the already-created pairwise conversion tables. For example, consider the case of N=3 with the databases indexed X, Y, and Z, and with the iterative loop 40, 42, 44, 46, 48, 50 being performed to create the X-Y, X-Z, and Y-Z conversion tables in that order. After creation of the X-Y and X-Z conversion tables it may thereby be known that Patient 10 of Database X is linked to Patient 11 of Database Y, and that Patient 10 of Database X is also linked to Patient 15 of Database Z. Then, during the last iteration to create the Y-Z conversion table, it is already known that Patient 11 of Database Y should be linked to Patient 15 of Database Z in order to assure consistency of the Y-Z conversion table with the already-created X-Y and X-Z conversion tables.
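The X, Y, Z example above may be sketched as follows, assuming for illustration that each conversion table is a list of ID pairs. The function names `implied_links` and `inconsistent_links` are hypothetical, and the sketch covers only the simple transitive inference of the example, not a complete refinement procedure.

```python
def implied_links(table_xy, table_xz):
    """Derive Y-Z links implied by X-Y and X-Z links sharing a Database X patient."""
    z_by_x = {}
    for x_id, z_id in table_xz:
        z_by_x.setdefault(x_id, set()).add(z_id)
    return sorted({(y_id, z_id)
                   for x_id, y_id in table_xy
                   for z_id in z_by_x.get(x_id, ())})

def inconsistent_links(table_yz, implied):
    """Return Y-Z links that contradict an implied link for the same Database Y patient."""
    implied_by_y = {}
    for y_id, z_id in implied:
        implied_by_y.setdefault(y_id, set()).add(z_id)
    return [(y, z) for y, z in table_yz
            if y in implied_by_y and z not in implied_by_y[y]]

# Patient 10 of X is linked to Patient 11 of Y and to Patient 15 of Z.
table_xy = [(10, 11), (20, 21)]
table_xz = [(10, 15)]
implied = implied_links(table_xy, table_xz)

# A candidate Y-Z table can be checked against the implied links.
conflicts = inconsistent_links([(11, 16), (21, 30)], implied)
```

In this sketch the link (21, 30) is not flagged because nothing is implied for Patient 21 of Database Y, whereas (11, 16) is flagged because Patient 11 is expected to link to Patient 15.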
Additionally or alternatively, in some embodiments disclosed herein longitudinal information is leveraged to improve the patient matching. In general, a longitudinal feature is defined by a pair of timestamped events for a single anonymized patient in an anonymized healthcare database which are separated by a time interval Δt between the timestamps of the events. Such longitudinal features are well-defined even in an anonymized healthcare database in which the anonymization process introduces a random, but rigid, shift of all timestamps for each patient, since the rigid time shift does not affect the time intervals Δt between events.
With reference to
It is contemplated to have more complex longitudinal features, e.g. events of types g→e→f with events g→e separated by a first time interval Δt1 and events e→f separated by a second time interval Δt2. In other contemplated longitudinal features, the allowable variation in Δt may be large enough that practically the longitudinal feature is matched if the events of types e→f occur in sequence regardless of the time interval between them (within some limit defined by the allowable variation in Δt).
The illustrative longitudinal features employ the time interval Δt between events, rather than comparing timestamps of events for patients in the two databases (i,j). As discussed previously, this approach relying upon time intervals between events, rather than relying on absolute timestamps of events, is robust against the possibility that the patient timeline was rigidly shifted by a random amount as part of the anonymization process.
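Matching on a longitudinal feature may be sketched as follows. The function names `interval` and `longitudinal_match`, the event labels, the timestamps, and the tolerance values are assumptions for illustration; the second patient's timeline is deliberately shifted by about forty days to mimic a rigid anonymization shift.

```python
from datetime import datetime, timedelta

def interval(events, first, second):
    """Return the time interval between two event types, or None if either is absent."""
    if first not in events or second not in events:
        return None
    return events[second] - events[first]

def longitudinal_match(events_i, events_j, first, second, tolerance):
    """Compare the interval first->second for two patients within a tolerance."""
    dt_i = interval(events_i, first, second)
    dt_j = interval(events_j, first, second)
    if dt_i is None or dt_j is None:
        return False
    return abs(dt_i - dt_j) <= tolerance

# Made-up timelines; absolute dates differ, intervals nearly agree.
patient_i = {"admission": datetime(2016, 1, 10, 9, 0),
             "surgery": datetime(2016, 1, 12, 13, 0)}
patient_j = {"admission": datetime(2016, 2, 19, 9, 30),
             "surgery": datetime(2016, 2, 21, 13, 0)}

loose = longitudinal_match(patient_i, patient_j, "admission", "surgery",
                           tolerance=timedelta(hours=1))
strict = longitudinal_match(patient_i, patient_j, "admission", "surgery",
                            tolerance=timedelta(minutes=10))
```

The two intervals differ by thirty minutes, so the patients match under the one-hour tolerance but not under the ten-minute tolerance; the absolute timestamps never enter the comparison.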
In some embodiments, the longitudinal features are treated like other features of the set of features identified in operation 44 and used in operation 46 (see
In some embodiments, the non-longitudinal feature matching is performed (or is performed in part) using a universal patient ID (or UID) for each patient. The UID is constructed as a concatenation of a set of common features such as the patient's gender, race, age, and body weight. For example, the UID 1518170 for a patient could be generated from the following features: Male or Gender 1 (the first digit of 1518170); Native American or Race 5 (the second digit of 1518170); Age of 18 years (the third and fourth digits of 1518170); and body weight of 170 pounds (the fifth, sixth, and seventh digits of 1518170). Hence, every time a new record (medical report or claims record) is generated for a patient, a UID is assigned to the patient-record. Since the UID is feature-based, it should be the same across different anonymized databases. Optionally, some tolerance is accepted, e.g. Age of 80 in Database II is considered to be the same as Age of 79-81 in Database I, when using the tolerance threshold of ±1 year for Age. Such a UID approach for feature matching may be employed for all features of the set of features used to match the patient, or alternatively a smaller sub-set of features may be concatenated to form the UID, where the set of features forming the UID are common to all N databases 10. This latter approach advantageously enables the UID to be computed once and re-used for each iteration of the (i,j) loop of
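The UID construction and the optional age tolerance may be sketched as follows. The function names, the fixed field widths (one digit each for gender and race, two for age, three for body weight, with zero-padding), and the comparison rule are assumptions inferred from the 1518170 example; ages of 100 years or more would not fit the two-digit field assumed here.

```python
def build_uid(gender_code, race_code, age_years, weight_lb):
    """Concatenate coded features into a fixed-width UID string."""
    return f"{gender_code:d}{race_code:d}{age_years:02d}{weight_lb:03d}"

def parse_uid(uid):
    """Split a UID back into (gender, race, age, weight) using the assumed widths."""
    return int(uid[0]), int(uid[1]), int(uid[2:4]), int(uid[4:7])

def uids_match(uid_a, uid_b, age_tolerance=1):
    """Require equal gender, race, and weight; allow age to differ within a tolerance."""
    g_a, r_a, age_a, w_a = parse_uid(uid_a)
    g_b, r_b, age_b, w_b = parse_uid(uid_b)
    return (g_a, r_a, w_a) == (g_b, r_b, w_b) and abs(age_a - age_b) <= age_tolerance

uid = build_uid(1, 5, 18, 170)
within_tolerance = uids_match(build_uid(1, 5, 80, 170), build_uid(1, 5, 79, 170))
outside_tolerance = uids_match(build_uid(1, 5, 80, 170), build_uid(1, 5, 77, 170))
```

When a tolerance is used, the UIDs can no longer be compared as plain strings, which is why the sketch parses the age field before comparing.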
It will be appreciated that various combinations of disclosed aspects may be employed in a given embodiment. For example, longitudinal feature matching can be used both for dual-database integration (N=2) and for multi-database integration (N≥3). Natural language processing (NLP) can be used to generate a set of features from an unstructured or semi-structured database for both N=2 and N≥3 integration tasks.
In an alternative approach for viewing the disclosed healthcare data analytics device of
The invention has been described with reference to the preferred embodiments. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2017/059266 | 4/19/2017 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62324363 | Apr 2016 | US |