The present invention relates to the analysis of the handling of personally identifiable information (PII), such as patient data. More specifically, the present invention relates to the analysis and de-identification of patient data with respect to sequences of events related to a disease or treatment, such sequences containing time stamps or time related data and being called longitudinal data.
Nowadays, medical and health records of patients are collected and used for clinical bioinformatics research. In addition to clinical data, imaging data or biobanking data of patients, their patient data are also collected, and analyzing patient data plays a significant role in medical research and in diagnostics and anamnesis. For example, the patient data are analyzed for finding or improving treatments for different diseases.
However, analysis of patient data might pose threats for the patients that are sharing their patient data in that, for example, their privacy might be violated. The violation is due to the fact that the patient data of a person may contain personally identifiable information (PII) such as direct identifiers (e.g. name, email address, social security number, medical record number) and indirect identifiers such as locations, gender, age, weight, height, eye color, skin color. The longitudinal patient data, e.g. containing time stamps and events, possibly together with other data embedded in the patient data, may lead to identification of a person by analyzing the patient data. In order to protect the privacy of individuals, certain parts of the patient data need to be anonymized when the patient data are provided for medical bioinformatics research and analysis.
Recent regulations, e.g. GDPR (see [1]), HIPAA (see [5]), put very strict requirements on the handling of personally identifiable information (PII), while also putting huge fines on noncompliance. For instance, the GDPR requires a data controller to ask for explicit consent from all data subjects. This consent must be minimal, meaning that a data controller cannot ask for more permissions than the bare minimum necessary. This is especially inconvenient in the context of medical research, where huge amounts of medical data get combined and analyzed in many different ways in the hope of getting new insights. Getting consent from every single data subject for every single analysis is practically impossible.
Luckily, these regulations provide a way out: when the dataset does not contain PII, then the regulations do not apply. Thus, making sure all PII identifiers are removed from the data makes it a lot easier to work with the resulting dataset. This is a commonly used process called anonymization.
The easiest way to remove personally identifiable information (PII) from a dataset seems to be to just remove direct identifiers like names and birthdates, which may be done initially. However, PII can be defined as “any data that could potentially identify a specific individual”. As it turns out, this is much more than just direct identifiers. As an example: an ethnicity of ‘Asian’ may reveal no information when talking about people in a city in China, but can really stand out when talking about a small village in the Netherlands with only one inhabitant of Asian descent. Such potentially sensitive items of information, whose release must be controlled, are called quasi-identifiers or indirect identifiers.
Samarati and Sweeney [4] first studied this issue and came up with the concept of k-anonymity, a commonly used metric that is an example of an anonymity property. Other anonymity measures may also be considered, as elucidated further on. For some predefined value k, the k-anonymity property requires that each release of data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k individuals. So, the anonymity property defines that a concatenation of all indirect identifiers of a patient enables identifying an outlying patient in the data set if there are fewer than the predefined value k patients having the same concatenation of indirect identifiers.
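The k-anonymity property described above can be illustrated with a minimal sketch. The record layout, the field names and the helper names `equivalence_class_sizes` and `is_k_anonymous` are assumptions made for illustration only and are not taken from the cited work.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of values occurs at least k times."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(size >= k for size in sizes.values())


# Hypothetical example records; only gender and age_band act as quasi-identifiers.
records = [
    {"gender": "F", "age_band": "40-49", "diagnosis": "I10"},
    {"gender": "F", "age_band": "40-49", "diagnosis": "E11"},
    {"gender": "M", "age_band": "60-69", "diagnosis": "I10"},
]

two_anonymous = is_k_anonymous(records, ["gender", "age_band"], k=2)
```

In this example the combination ("M", "60-69") occurs only once, so the data set is not 2-anonymous and the third record would be an outlier for k=2.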
Longitudinal data is complex health data that contains information about patients over periods of time, e.g. as depicted in
Some of the existing methods for anonymization in bioinformatics research attempt to achieve de-identification of longitudinal timestamps by adding noise. However, this removes the temporal relation between consecutive timestamps and therefore may lead to wrong results during the analysis of this data. In order to de-identify longitudinal data, the following methods may be used: randomizing dates independently of one another, shifting the sequence while ignoring the intervals, or generalizing intervals while maintaining order, see [2].
Shifting dates while keeping intervals intact is considered unsafe because the intervals between consecutive events are preserved. This is true when the number of events is limited, but usually this is not the case in longitudinal data, where multiple timestamps and events are attached to the patients. Furthermore, randomizing these timestamps at de-identification is not done in a structured manner and may affect the research results.
Attributes of longitudinal data may be part of the data, as discussed in [3]. Examples are: length of stay in hospital, number of days since first claim computed from the first claim for that patient for each year, etc. These attributes may be indirect identifiers but do not completely describe the longitudinal record of a patient.
Furthermore, for the events attached to the timestamps the state of the art considers the number of events as an indirect identifier and truncates these events so that each bin of events has the required k-anonymity property.
According to the foregoing, the prior art has the following issues:
It is an object of the invention to provide a method and system for longitudinal data de-identification that takes into account at least one of the preceding issues.
For this purpose, devices and methods for anonymization of a data set of patient data are provided as defined in the appended claims. According to an aspect of the invention a method for anonymization of a data set of patient data from multiple patients for providing a predefined anonymity property is provided as defined in claim 1. A system is provided as defined in claim 14. According to a further aspect of the invention there is provided a computer program product downloadable from a network and/or stored on a computer-readable medium and/or microprocessor-executable medium, the product comprising program code instructions for implementing the above method when executed on a computer.
Advantageously, the method and system achieve that a data set of patient data, in particular longitudinal patient data, is anonymized to a predetermined level as defined by the anonymity property. The relevance of the data set is kept high by only removing outlying patients, while avoiding noise and generalization of time-related data.
Various embodiments may involve extracting indirect identifiers from the timestamps and events. The indirect identifiers may be properties of the data distribution, for example length of the time window, number of breaks in the data distribution, etc. Other elements of the data distribution can be categorized as indirect identifiers.
Further embodiments may involve treating events attached to the timestamps (e.g. ICD codes), when these are indirect identifiers, in the following structured manner:
If the number of these events is lower than a threshold N (e.g. 5), then the ordered set of the explicit events represents the indirect identifier;
If the number of events is higher than said threshold N, the number of events becomes an indirect identifier. Events are not truncated from the dataset, nor are dummy ones added to the dataset.
Events in a specific category that occur fewer than a threshold E times in the dataset will be generalized until they end up in a category with a size higher than the threshold E.
The above thresholds N and E may be selected in view of the power of an attacker and the nature of the data.
The methods according to the invention may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both. Executable code for a method according to the invention may be stored on a computer program product. Examples of computer program products include memory devices such as a memory stick, optical storage devices such as an optical disc, integrated circuits, servers, online software, etc.
The computer program product in a non-transient form may comprise non-transitory program code means stored on a computer readable medium for performing a method according to the invention when said program product is executed on a computer. In an embodiment, the computer program comprises computer program code means adapted to perform all the steps or stages of a method according to the invention when the computer program is run on a computer. Preferably, the computer program is embodied on a computer readable medium. There is also provided a computer program product in a transient form downloadable from a network and/or stored in a volatile computer-readable memory and/or microprocessor-executable medium, the product comprising program code instructions for implementing a method as described above when executed on a computer.
Another aspect of the invention provides a method of making the computer program in a transient form available for downloading. This aspect is used when the computer program is uploaded into, e.g., Apple's App Store, Google's Play Store, or Microsoft's Windows Store, and when the computer program is available for downloading from such a store.
Further preferred embodiments of the devices and methods according to the invention are given in the appended claims, disclosure of which is incorporated herein by reference.
These and other aspects of the invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the accompanying drawings, in which
The figures are purely diagrammatic and not drawn to scale. In the Figures, elements which correspond to elements already described may have the same reference numerals.
The present invention will be described with respect to particular embodiments and with reference to the figures, but the invention is not limited thereto, but only to the claims.
The term “individual” refers to a human subject. Said human subject may or may not be affected by or suffering from a disease to be studied. Hence, the terms “individual”, “person” and “patient” are synonymously used in the instant disclosure.
The expression “providing patient data” is understood to mean that the patient data of at least one individual need to be obtained. However, the patient data of the at least one individual do not have to be obtained in direct association with the method or for performing the method. Typically, the patient data of the at least one individual are obtained at a previous point or period of time, and are stored electronically in a suitable electronic storage device and/or database. For performing the method, the patient data can be retrieved from the storage device or database and utilized.
The method starts at node START 301, and step LOP 302 represents collecting and storing a set of longitudinal patient data of multiple individuals. Optionally, the step LOP includes replacing time stamps representing dates by time stamps representing intervals between the dates. Thereby, all time-related data is made relative and cannot be matched to actual, individual dates and events.
Also, the method may include determining, across the data set, respective numbers of events in respective event categories regarding a respective disease or treatment. Rare events may potentially help an attacker to identify an individual. Then, any outlying event category is determined where the respective number of events is less than an event threshold (E). All events of the respective outlying event category are generalized until these events end up in an event category where the respective number of events is higher than the threshold. For example, the threshold E may be 10.
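The optional replacement of dates by intervals in step LOP may be sketched as follows. This is a minimal illustration under the assumption that each event carries a calendar date; the function name `dates_to_intervals` is hypothetical.

```python
from datetime import date


def dates_to_intervals(dates):
    """Replace absolute dates by the number of days since the previous event.

    The first event gets interval 0; chronological order is preserved.
    """
    ordered = sorted(dates)
    if not ordered:
        return []
    return [0] + [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


visits = [date(2018, 3, 1), date(2018, 3, 8), date(2018, 4, 7)]
print(dates_to_intervals(visits))  # → [0, 7, 30]
```

After this transformation the sequence no longer reveals on which calendar dates the events took place, while the temporal relation between consecutive events is kept.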
In next step DII 303 indirect identifiers are determined, including at least one first indirect identifier representing a property of the data distribution of the time stamps and at least one second indirect identifier representing a number of events regarding a respective patient. In an embodiment, the first indirect identifier may be a length of a time window covering all time stamps from an individual, e.g. a total period in years.
In an embodiment, a first indirect identifier may be a number of breaks in such a time window. A break represents a local minimum in the distribution of the events during the time window, indicative of a substantial period in the total time window without, or with relatively few, events. For example, if events for a patient occur every day for one week, then nothing happens for one week, and then the events again occur every day, then the break is the week in the middle, which is therefore called a local minimum. In a further embodiment, periods of a predetermined length in a sequence of events from an individual are determined. Then, a number of breaks in the periods is determined as the first indirect identifier, a break being a local minimum in the distribution of the events during the periods. Optionally, the method comprises determining, as a first indirect identifier, intervals of a predetermined length that have no events in respective sequences of events of respective patients. For example, as the second indirect identifier, a logarithmic function of the number of events regarding a respective individual may be used, while the value may be rounded to an integer.
Optionally, in a next step EBNn 304, repeatedly for all n patients in the data set, it is determined whether the number of events regarding a respective patient is below a number threshold (N). For example, N=5, and for any patient having 5 or more events the number of events is considered to be an indirect identifier. However, when the number of events is below N, the set of events regarding the respective patient is taken as a further indirect identifier. In an embodiment, the set of events is an ordered list of events. Optionally, when the rounded base-10 logarithm of the number of events is taken as the second indirect identifier, this function is zero for 3 or fewer events, so the value N=4 coincides with the function round(log10(x)).
In next step DCOn 305, repeatedly for all n patients in the data set, concatenations of all indirect identifiers are determined, which concatenations represent equivalence classes of potentially identifiable individuals. The respective concatenations comprise the above determined first indirect identifier and the second indirect identifier, and any further indirect identifiers. Optionally, various first, second and further indirect identifiers may be included in said concatenation, where such a combination of indirect identifiers is considered to constitute a risk of identifying the individual.
Subsequently, in next step ROPn 306, repeatedly for all n patients in the data set, the patient data of each outlying patient is removed from the data set. An outlying patient is any patient for which there are fewer than a predefined value (e.g. k) patients in an equivalence class, e.g. having the same concatenation of indirect identifiers.
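The first indirect identifiers discussed above may be sketched as follows. As a simplifying assumption, a break is counted here as a maximal run of empty periods of a fixed length lying between periods that contain events; the description's notion of a local minimum is broader. Event times are assumed to be given as day offsets from an arbitrary origin, and the names `window_length_days` and `count_breaks` are hypothetical.

```python
def window_length_days(offsets):
    """Length of the time window covering all events, in days."""
    return max(offsets) - min(offsets) if offsets else 0


def count_breaks(offsets, period_days=7):
    """Count runs of empty periods that lie between periods containing events."""
    if not offsets:
        return 0
    start = min(offsets)
    # Index of every period of length period_days that contains at least one event.
    occupied = sorted({(t - start) // period_days for t in offsets})
    # A gap of more than one between consecutive occupied periods is one break.
    return sum(1 for a, b in zip(occupied, occupied[1:]) if b - a > 1)


# Events every day for one week, nothing for one week, then every day again.
offsets = list(range(0, 7)) + list(range(14, 21))
print(window_length_days(offsets))  # → 20
print(count_breaks(offsets))        # → 1
```

The period length of 7 days is only an example; as with the other parameters, it would be chosen in view of the nature of the data.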
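The event-related indirect identifiers of step EBNn may be illustrated as follows. The sketch assumes that the events of one patient are available as an ordered list of codes; the function name `event_identifiers` and the return layout are hypothetical.

```python
import math


def event_identifiers(events, n_threshold=4):
    """Return (count identifier, explicit events or None) for one patient.

    The count identifier is round(log10(number of events)). Below the
    threshold N, the ordered events themselves are a further identifier.
    """
    count = len(events)
    count_id = round(math.log10(count)) if count > 0 else 0
    explicit = tuple(events) if count < n_threshold else None
    return count_id, explicit


print(event_identifiers(["I10", "E11", "I10"]))  # → (0, ('I10', 'E11', 'I10'))
print(event_identifiers(["I10"] * 4))            # → (1, None)
```

The two calls show why N=4 fits the rounded logarithm: with 3 events the count identifier is still zero and the explicit events are used, whereas from 4 events onwards the count identifier becomes non-zero.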
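Steps DCOn and ROPn may be sketched together as follows, assuming that the indirect identifiers of each patient have already been determined and are stored as a tuple (the concatenation). The function name `suppress_outliers` and the example values are hypothetical.

```python
from collections import Counter


def suppress_outliers(concatenations, k):
    """Keep only patients whose concatenation occurs at least k times.

    concatenations maps a patient id to a tuple of indirect identifiers.
    """
    class_sizes = Counter(concatenations.values())
    return {
        patient_id: key
        for patient_id, key in concatenations.items()
        if class_sizes[key] >= k
    }


# Hypothetical tuples: (window length in years, number of breaks, count identifier).
concatenations = {
    1: (3, 1, 0),
    2: (3, 1, 0),
    3: (5, 0, 1),
}
remaining = suppress_outliers(concatenations, k=2)
print(sorted(remaining))  # → [1, 2]
```

Because whole equivalence classes are removed, a single pass suffices: every class that remains still has at least k members.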
Finally, the now anonymized data set may be provided as output to be used for further data analysis, research or statistics. The method terminates at node END 307.
Various embodiments may be implemented as a software framework that de-identifies longitudinal data by shifting dates, generalizing outlying events and suppressing outlying patients as depicted in
Main indirect identifiers can be the length of the time window covered by all timestamps, as well as other elements of the distribution of these timestamps, etc. The choice regarding these indirect identifiers depends on the assumed power of the attacker and the nature of the data. This choice may be made in a preparatory process based on statistical data. The evaluation may further be assisted by a de-identification expert. For example, for persons with a rich medical history of diseases, the shortest interval between consecutive timestamps may not be a difference maker, but intervals without events may be an indirect identifier. For example in
The number of categories is set in view of the power of an assumed attacker and the nature of the data. Once the number of categories is set, the category to which each value x belongs may be determined by means of normalization.
Normalizing the respective category between a minimum value (value_min) and a maximum value (value_max) can be done for example:
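The normalization formula itself is not reproduced in this text. Purely as an assumption, one common choice is min-max normalization followed by equal-width binning, sketched below; the function name `category_of` and the clamping of out-of-range values are illustrative and not taken from the description.

```python
import math


def category_of(x, value_min, value_max, n_categories):
    """Map x to a category index 0..n_categories-1 by min-max normalization.

    Assumed scheme: equal-width bins between value_min and value_max.
    """
    if value_max == value_min:
        return 0
    fraction = (x - value_min) / (value_max - value_min)
    index = math.floor(fraction * n_categories)
    # Clamp so that value_max and out-of-range values fall in a valid category.
    return max(0, min(n_categories - 1, index))


print(category_of(5, 0, 10, 5))   # → 2
print(category_of(10, 0, 10, 5))  # → 4
```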
Optionally, the rare events, which occur less than a threshold E in the total data set, are generalized for patients 1 and 10, where the respective disease code I48.91 is changed to Ix.x, and the respective disease code I25.10 is also changed to Ix.x. In the examples, the codes are ICD9 or ICD10 codes, e.g. diseases, procedures, as defined in [ICD]. In a further example, two codes needing generalization may have been I48.91 and I47.9. In that case the generalization may have been I4x.x.
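The generalization of rare event codes may be sketched as follows. The sketch assumes that generalizing an ICD-style code means keeping a shorter prefix (for example I48.91, then I48.x, then I4x.x, then Ix.x); the intermediate level I48.x and the function names `generalize` and `generalize_rare_events` are assumptions. A label is treated as rare when it occurs fewer than E times.

```python
from collections import Counter

MAX_LEVEL = 3  # level 3 keeps only the first character, e.g. "Ix.x"


def generalize(code, level):
    """Level 0 keeps the code; levels 1, 2, 3 keep 3, 2, 1 leading characters."""
    if level == 0:
        return code
    keep = 4 - level
    return code[:keep] + (".x" if keep == 3 else "x.x")


def generalize_rare_events(codes, e_threshold):
    """Generalize codes step by step until no label is rare or the top is reached."""
    levels = [0] * len(codes)
    while True:
        labels = [generalize(c, lvl) for c, lvl in zip(codes, levels)]
        counts = Counter(labels)
        changed = False
        for i, label in enumerate(labels):
            if counts[label] < e_threshold and levels[i] < MAX_LEVEL:
                levels[i] += 1
                changed = True
        if not changed:
            return labels


print(generalize_rare_events(["I10", "I10", "I48.91", "I25.10"], 2))
# → ['I10', 'I10', 'Ix.x', 'Ix.x']
print(generalize_rare_events(["I10", "I10", "I48.91", "I47.9"], 2))
# → ['I10', 'I10', 'I4x.x', 'I4x.x']
```

With the small threshold E=2 used here for brevity, the two calls reproduce the two generalizations mentioned above: I48.91 and I25.10 only meet at Ix.x, whereas I48.91 and I47.9 already meet at I4x.x.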
Also, the longitudinal records with fewer than a threshold N, e.g. 4, distinct events have as an indirect identifier the ordered distinct events (e.g. patients 5, 6 and 7). Establishing the respective thresholds N and E may be done depending on the data set, e.g. by a de-identification expert. For example, the ceiling for the threshold N is around 20. The ceiling is used when the timestamps and events are the source of the only indirect identifiers. If more indirect identifiers are used, these thresholds may be lowered.
After determining the indirect identifiers extracted from the longitudinal data, the following actions are performed for de-identifying the data. First, the values of all indirect identifiers are determined from the data set, while the set of values for each patient, also called a concatenation, is calculated. Then, all outlying patients are suppressed, these being outliers because their respective concatenation of indirect identifiers occurs fewer than k times in the data set. Removing such patients is not detrimental to the value of the data set, while traditional methods like generalizing dates are not advisable and may add noise to the data and risk affecting any research results. In the example, additionally, outlying events are generalized, as depicted in the column marked Generalization in
The above methods may be applied in a health data analysis platform or similar platforms. They may also be used in a client application that interacts with a data lake for making (k-anonymous) longitudinal data available to its clients. Furthermore, the methods may be applied to any form of privacy-preserving computation that results in a dataset that still contains personal information, and to any data export, e.g. for research.
In an embodiment, the method for anonymization of patient data may be used for performing medical research and can include bioinformatic means, e.g. by using software tools for an in silico analysis of biological queries using mathematical and statistical techniques to analyze and interpret biological data with respect to their relevance for the goal of the medical research. This embodiment typically requires use of genetic information of a plurality of individuals.
In another embodiment of the method for anonymization of patient data, the method may be used in diagnostics, wherein the genetic information of an individual is analyzed for the genetic disposition and/or occurrence of a specific disease or disorder of said individual.
The method may be applied to any disease, disorder or medical condition. A disease to be studied may be a specific disease that is chosen on purpose. In an embodiment, the disease to be studied is known to be a disease that is associated with a particular genotype. Examples of such diseases are cancers, immune system diseases, nervous system diseases, cardiovascular diseases, respiratory diseases, endocrine and metabolic diseases, digestive diseases, urinary system diseases, reproductive system diseases, musculoskeletal diseases, skin diseases, congenital disorders of metabolism, and other congenital disorders such as prostate cancer, diabetes, metabolic disorders, or psychiatric disorders.
Patient data not directly related to a disease to be studied may be anonymized by using techniques that are selected from the group consisting of statistical anonymization, encryption, and secure multiparty anonymization and computation.
These anonymization techniques allow analysis on the data, but this analysis is limited due to their properties. The statistical anonymization implies loss of information, but keeps the rest of the information in a human-readable shape. This allows analyses to be performed on the data, but the results are limited by the loss of information from the beginning. Encryption techniques do not lose information, but this information is not available. However, if there is ever any indication that the encryption information is necessary for research, a privacy officer is able to extend the core disease information by decrypting this set. Modern techniques like homomorphic encryption, multi-party computations and/or other operations on encrypted data may be used on the longitudinal data. In these situations the privacy-sensitive information will stay secret, while the result of these operations can be disclosed by the privacy officer. These techniques insert latency in the analysis and therefore are limiting the possible analyses that can be performed on the data.
In an embodiment, the anonymity property is selected from the group consisting of k-anonymity, l-diversity, t-closeness and δ-presence.
K-anonymity is a formal model of privacy created by Sweeney [4]. The goal is to make each record indistinguishable from a defined number (k) of other records if attempts are made to identify the data. A set of data is k-anonymized if, for any data record with a given set of attributes, there are at least k−1 other records that match those.
L-diversity improves anonymization beyond what k-anonymity provides. The difference between the two is that while k-anonymity requires each combination of quasi identifiers to have k entries, l-diversity requires that there are l different sensitive values for each combination of quasi identifiers, see [6].
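The difference between the two properties may be illustrated with a minimal check for l-diversity. The sketch implements the simplest variant, in which each equivalence class must contain at least l distinct sensitive values; the record layout and the function name `is_l_diverse` are assumptions.

```python
from collections import defaultdict


def is_l_diverse(records, quasi_identifiers, sensitive, l_value):
    """True if every equivalence class has at least l_value distinct sensitive values."""
    sensitive_values = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        sensitive_values[key].add(record[sensitive])
    return all(len(values) >= l_value for values in sensitive_values.values())


# Hypothetical records: both equivalence classes have size 2 (2-anonymous).
rows = [
    {"gender": "F", "age_band": "40-49", "diagnosis": "I10"},
    {"gender": "F", "age_band": "40-49", "diagnosis": "E11"},
    {"gender": "M", "age_band": "60-69", "diagnosis": "I10"},
    {"gender": "M", "age_band": "60-69", "diagnosis": "I10"},
]
print(is_l_diverse(rows, ["gender", "age_band"], "diagnosis", 2))  # → False
```

The example data set is 2-anonymous, yet it is not 2-diverse: both records in the ("M", "60-69") class share the same diagnosis, so knowing that a person belongs to that class reveals the sensitive value.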
T-closeness requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold T), see [7]. The l-diversity requirement ensures “diversity” of sensitive values in each group, but it does not take into account the semantic closeness of these values. This is done by t-closeness.
δ-presence is a metric to evaluate the risk of identifying an individual in a table based on generalization of publicly known data. δ-presence is a good metric for datasets where knowing that an individual is in the database poses a privacy risk, see [8].
The anonymization techniques may comprise “searchable encryption”, “homomorphic encryption”, and “secure multiparty computation”, which have the advantage that decryption of the encrypted data is not actually necessary, but it is feasible to perform data processing in the encrypted domain. The main difference between these techniques is the choice of trade-offs they make. Searchable encryption limits the processing to a simple keyword match. Fully homomorphic encryption can do any kind of processing, but has extremely big ciphertext sizes and is computationally very intensive. Multiparty computation scales better, but requires non-colluding computers to work together to do the processing.
In an embodiment, the method as described in
It will be appreciated that the invention applies to computer programs, particularly computer programs on or in a carrier, adapted to put the invention into practice. The program may be in the form of a source code, an object code, a code intermediate source and an object code such as in a partially compiled form, or in any other form suitable for use in the implementation of the method according to the invention. It will also be appreciated that such a program may have many different architectural designs. For example, a program code implementing the functionality of the method or system according to the invention may be sub-divided into one or more sub-routines. Many different ways of distributing the functionality among these sub-routines will be apparent to the skilled person. The sub-routines may be stored together in one executable file to form a self-contained program. Such an executable file may comprise computer-executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions). Alternatively, one or more or all of the sub-routines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at run-time. The main program contains at least one call to at least one of the sub-routines. The sub-routines may also comprise function calls to each other. An embodiment relating to a computer program product comprises computer-executable instructions corresponding to each processing stage of at least one of the methods set forth herein. These instructions may be sub-divided into sub-routines and/or stored in one or more files that may be linked statically or dynamically. Another embodiment relating to a computer program product comprises computer-executable instructions corresponding to each means of at least one of the systems and/or products set forth herein. 
These instructions may be sub-divided into sub-routines and/or stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a data storage, such as a ROM, for example, a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example, a hard disk. Furthermore, the carrier may be a transmissible carrier such as an electric or optical signal, which may be conveyed via electric or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or used in the performance of, the relevant method.
The system 1100 is configured to anonymize patient data as described with the above methods, e.g. elucidated with reference to
Furthermore, the system 1100 may have a user input interface configured to receive user input commands from a user input device to enable the user to provide user input, such as choose or define a particular disease, disorder or medical condition for subsequently determining a subset of patient data being related to said disease, disorder or medical condition. The user input device may take various forms, including but not limited to a computer mouse, touch screen, keyboard, etc.
It will be appreciated that, for clarity, the above description describes embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units or processors may be used without deviating from the invention. For example, functionality illustrated to be performed by separate units, processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization. The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
According to a further aspect, the invention concerns the use of the method and/or the computer program product in research and/or in diagnosis. In an embodiment, the method and/or computer program product is used in bioinformatics research. The use of the method, system and/or computer program product in bioinformatics research comprises acquisition of the patient data of a plurality of individuals. Examples of research fields are genomics, genetics, transcriptomics, proteomics and systems biology.
In an alternative embodiment, the method, system and/or computer program product may be used in diagnosis, wherein the patient data of an individual are utilized to analyze whether the individual is affected by a specific disease or at risk of getting said disease or being affected by said disease. The individuals are sure that their patient data are properly anonymized.
Where an indefinite or definite article is used when referring to a singular noun, e.g. “a”, “an”, “the”, this includes a plural of that noun unless something else is specifically stated. Furthermore, the terms first, second, third and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein. Moreover, the terms top, bottom, over, under, beyond and the like in the description and in the claims are used for descriptive purposes and not necessarily for describing relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other orientations than described or illustrated herein. It is to be noticed that the term “comprising”, used in the present description and claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. Thus, the scope of the expression “a device comprising means A and B” should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Number | Date | Country | |
---|---|---|---|
62743601 | Oct 2018 | US | |
62876082 | Jul 2019 | US |