The systems and methods described herein relate to identifying and masking or removing sensitive data contained in communications.
The present invention is directed to systems, methods and computer-readable media for applying policy enforcement rules to sensitive data. An unstructured data repository for storing unstructured data is maintained. A structured data repository for storing structured data is maintained. A request for information is received. The request is analyzed to determine its context. Based on the context, a policy enforcement action associated with generating a response to the request is identified. The policy enforcement action may be to remove sensitive data in generating the response to the request and/or to mask sensitive data in generating a response to the request. An initial response to the request is generated by retrieving unstructured data from the unstructured data repository. Using the structured data maintained in the structured data repository, sensitive data included within the initial response is identified. The policy enforcement action is applied to the sensitive data included within the initial response to generate the response to the request.
Clinical data masking and removal is a method for desensitizing raw, unstructured (e.g., free form) data. The desensitization process masks or removes specific data values whose presence would violate sensitive data protection regulations. These regulations may be defined internally as part of an organization's data management policies, or they may be defined by governmental departments and agencies. Desensitized, unstructured data is essential for many different applications, including training of machine learning components.
Embodiments of the systems and methods described herein are designed to be independent of the source systems and are able to apply clinical processing rules and pattern matching and extraction across various kinds of raw clinical data. Certain embodiments may also keep track of previous pattern search results and the human actions taken on them, in order to learn to apply the patterns better and to extract data that is more meaningful to the user in the future. Other embodiments may allow for introduction of new patterns as further needs arise, with little to no change to existing information processing rules. Still other embodiments may further allow for human intervention and oversight around the matching and masking decisions and continue to learn from that oversight.
With regard to data pattern matching tools and algorithms, some existing pattern matching tools are able to detect specific patterns within raw unstructured (free form) data. Such pattern matching tools can be effective in finding commonly identified data. However, existing pattern matching tools are not customized to detect uncommon data patterns (e.g., uncommon human names). Thus, the use of data pattern matching tools for desensitization of clinical data has proven to be imperfect. Consequently, additional desensitization of specific data attributes and data values is necessary. For example, a data pattern matching tool cannot differentiate between Nov. 10, 1964 (date of birth) and Dec. 25, 2011 (Christmas 2011). This creates a situation where a sensitive data policy that regulates the use of date of birth information is difficult to implement with a data pattern matching tool, as the date of birth and the Christmas 2011 date are both likely to be flagged as sensitive data by the pattern matching tool, even though only the date of birth is sensitive.
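The limitation described above can be illustrated with a minimal sketch. The regular expression below is a hypothetical generic date pattern chosen for illustration (it is not taken from any particular tool), and the sample note is invented. Because the pattern has no knowledge of the member, it flags both dates.

```python
import re

# Hypothetical generic date pattern (illustrative only): "Mon. DD, YYYY".
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2}, \d{4}\b"
)

# Invented sample note containing a date of birth and an unrelated date.
note = "Member born Nov. 10, 1964 called on Dec. 25, 2011 about a claim."

# A context-free pattern cannot tell which date is the date of birth,
# so every date in the note is flagged.
flagged = DATE_PATTERN.findall(note)
print(flagged)
```

Both dates appear in `flagged`, so a policy keyed only to this pattern would either over-remove (losing the visit date) or require further processing.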
The solution described herein aims to implement efficient algorithms for data pattern matching and the eventual masking and/or removal of sensitive information.
The approach to sensitive data management as detailed herein brings together the ability to include specific context in the form of structured data (e.g., Member Personal Health Information) and uses the structured data as a source for detecting sensitive data (e.g., PHI data) within unstructured data (e.g., Clinical RN notes).
Certain intelligent computer systems need large amounts of training data to achieve designed accuracy. Such systems are not designed and deployed to secure PHI. Certain embodiments of methodologies described herein scramble the PHI from unstructured data sources to generate the training data. For example, PHI information may be stored in two kinds of formats: structured formats (such as database table fields dedicated to a particular type of information such as DOB, member ID, names, SSN, etc.) and unstructured formats (such as phone conversation logs, faxes, nurse notes, etc.). By utilizing the structured PHI information to identify the PHI information in unstructured data, greater accuracy can be achieved.
A specific example is now illustrated with reference to
Referring particularly to
With reference to
Referring now to
Thus, structured PHI information is used to pattern match the PHI in unstructured data. This can be accomplished by doing searches (exact, like, or pattern matching) in the unstructured data to ensure the fields in the structured contextual data that need to be removed or redacted are not included in the output unstructured data.
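As a rough sketch of this idea, the following searches an unstructured note for each value held in a structured record. The function name `find_phi`, the field names, and the sample data are all hypothetical; only an exact, case-insensitive search is shown, and "like" or pattern searches would be added in a fuller version.

```python
import re

def find_phi(text, record):
    """Return (field, matched_text) pairs for structured values found in text."""
    hits = []
    for field, value in record.items():
        # Exact, case-insensitive search for each structured value.
        for m in re.finditer(re.escape(value), text, flags=re.IGNORECASE):
            hits.append((field, m.group(0)))
    return hits

# Hypothetical structured contextual data for one member.
record = {"member_name": "Jane Doe", "member_id": "A1234567"}

# Invented unstructured note; the case number is not in the structured record.
note = "RN spoke with jane doe (ID A1234567) regarding case 7654321."

hits = find_phi(note, record)
print(hits)
```

Because the search is driven by the member's own structured values, the case number is left alone even though it resembles an identifier.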
Configured rules may be used to fine tune pattern matching. Each field has different redaction or removal requirements. For example, there may be an age in the output data that needs to be removed, but the structured contextual data has only a date of birth. Subject matter experts may configure rules using the structured data that will accomplish the desired goal in the unstructured data. For example, in the age example, the method may look for the date of birth, the month/year, and the age to remove, not just an exact match on the source structured date of birth. The method would not simply pattern match and remove all dates; otherwise, valuable information in the unstructured data would be removed.
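The age example can be sketched as follows. This is one possible rule, under stated assumptions: the helper names (`dob_patterns`, `remove_dob`), the date formats covered, the "N year old" phrasing, and the `[REMOVED]` placeholder are all illustrative choices rather than details given in this description.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def dob_patterns(dob, as_of):
    """Build regexes for a date of birth, its month/year, and the derived age."""
    mon = MONTHS[dob.month - 1]
    # Age in whole years on the `as_of` date.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    return [
        rf"\b{mon}\.? {dob.day}, {dob.year}\b",            # e.g. "Nov. 10, 1964"
        rf"\b{dob.month:02d}/{dob.day:02d}/{dob.year}\b",  # e.g. "11/10/1964"
        rf"\b{dob.month:02d}/{dob.year}\b",                # e.g. "11/1964"
        rf"\b{age}[- ]years?[- ]old\b",                    # e.g. "47 year old"
    ]

def remove_dob(text, dob, as_of, placeholder="[REMOVED]"):
    """Remove only values derived from this member's date of birth."""
    for pattern in dob_patterns(dob, as_of):
        text = re.sub(pattern, placeholder, text)
    return text

# Invented note: the visit date should survive, the DOB and age should not.
note = "47 year old member, DOB 11/10/1964, seen Dec. 25, 2011 for follow-up."
cleaned = remove_dob(note, date(1964, 11, 10), as_of=date(2011, 12, 25))
print(cleaned)
```

The visit date is retained because the rule targets values derived from the structured date of birth rather than every date in the text.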
Action rules may be used to generate designed scrambling data. One example involves encrypting an identifier used to match the request and response on return. The customer profile key is encrypted so the service provider cannot see it, but the caller can unencrypt it on response to properly match or update source systems.
The clinical data masking and removal system and method may include the ability to detect specific contexts in which to apply specific sensitive data protection policy rules. This capability enables the method to detect semantic differences across syntactic similarities (for example, the case number and member ID being similar in data type and data lengths in the above example of
The system and method may also include the ability to mask (i.e., encrypt) parts of unstructured (i.e., free form) data. Data encryption tools generally encrypt the entire unstructured data. The methods and systems described herein can selectively encrypt data within unstructured (i.e., free form) text. The selective and granular application of the encryption logic is enabled by the systems and methods described herein.
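A minimal sketch of selective masking follows. The function `selective_mask` and the sample data are hypothetical. The standard library has no symmetric cipher, so `keyed_token` is a stand-in that produces a one-way keyed HMAC-SHA256 token; it is not encryption and cannot be decrypted. A real deployment would pass an actual encryption routine as the `protect` argument.

```python
import hashlib
import hmac
import re

def selective_mask(text, sensitive_values, protect):
    """Apply `protect` only to the listed values; leave all other text untouched."""
    if not sensitive_values:
        return text
    # Longest values first so a longer value is not split by a shorter one.
    ordered = sorted(sensitive_values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    return pattern.sub(lambda m: protect(m.group(0)), text)

def keyed_token(value, key=b"demo-key"):
    # Stand-in for a real cipher: keyed HMAC-SHA256 token (one-way, not decryptable).
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<{digest[:12]}>"

# Invented note; only the two listed values are transformed.
note = "Jane Doe (ID A1234567) approved for procedure."
masked = selective_mask(note, ["Jane Doe", "A1234567"], keyed_token)
print(masked)
```

Only the matched spans are transformed, so the clinically useful remainder of the note stays readable.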
The systems and methods may also provide the ability to generate desensitized, context sensitive unstructured data that conforms to multiple sensitive data protection policies (e.g., masking or removal).
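One way to picture context-dependent policies is a lookup from request context to enforcement action. The context names, the `POLICY_BY_CONTEXT` table, and the use of asterisks as the mask are assumptions made for illustration only.

```python
# Hypothetical mapping from request context to policy enforcement action.
POLICY_BY_CONTEXT = {
    "ml_training": "remove",
    "customer_service": "mask",
}

def apply_policy(text, sensitive_values, action):
    """Remove or mask each sensitive value according to the chosen action."""
    for value in sensitive_values:
        if action == "remove":
            text = text.replace(value, "")
        elif action == "mask":
            # Asterisks stand in for whatever masking scheme is configured.
            text = text.replace(value, "*" * len(value))
        else:
            raise ValueError(f"unknown policy action: {action}")
    return text

# Invented initial response and sensitive value.
initial = "Member Jane Doe called."
masked = apply_policy(initial, ["Jane Doe"], POLICY_BY_CONTEXT["customer_service"])
removed = apply_policy(initial, ["Jane Doe"], POLICY_BY_CONTEXT["ml_training"])
print(masked)
print(removed)
```

The same initial response yields different outputs depending on the context of the request, which is the behavior the policies above are meant to capture.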
The clinical data pattern matching masking and removal of sensitive data system and method may include the following characteristics, in some embodiments.
The systems and methods may standardize various data formats into a consistent meta model. Data from each source system may be processed as per business rules and context applicable to that system and is converted into a common model. The common model is agnostic of the source system.
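The conversion into a common model might look roughly like the sketch below. The `CommonRecord` fields, the source system names, and the per-source field mappings are all hypothetical; the description does not specify the actual meta model.

```python
from dataclasses import dataclass

@dataclass
class CommonRecord:
    """Hypothetical source-agnostic common model."""
    source: str
    member_ref: str
    note_text: str

# Hypothetical per-source field mappings.
SOURCE_FIELD_MAPS = {
    "fax_system": {"member_ref": "MBR_ID", "note_text": "BODY"},
    "call_log": {"member_ref": "memberId", "note_text": "transcript"},
}

def to_common(source, raw):
    """Convert a raw source record into the common model."""
    mapping = SOURCE_FIELD_MAPS[source]
    return CommonRecord(
        source=source,
        member_ref=raw[mapping["member_ref"]],
        note_text=raw[mapping["note_text"]].strip(),
    )

# Invented records from two different source systems.
fax = to_common("fax_system", {"MBR_ID": "A1234567", "BODY": " Request approved. "})
call = to_common("call_log", {"memberId": "A1234567", "transcript": "Member called."})
print(fax)
print(call)
```

Downstream rules can then operate on `CommonRecord` without knowing which system produced the data.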
Also, the systems and methods gather the rules that need to be applied. Rules may be categorized as source system rules or data driven rules. Source system rules are rules that need to be executed to understand the data model available within the source system so that meaningful data extraction can occur. Data driven rules are rules that are independent of the source from which the data was extracted, but pertain to understanding the context of the extracted data in order to generate interpreted sections from free form text.
Pattern matching algorithms may be run to obtain interpreted data. The pattern matching algorithm is primarily associated with the clinical data driven rules. Patterns such as keywords used to describe, e.g., the procedure or diagnosis codes, may be used to detect portions of text that are relevant for clinical purposes. Other examples include use of common vocabulary to determine an outcome. For example, “Approved”, “Pended”, “Referred to Physician” may be used to detect portions of text that refer to the clinical outcome. The common vocabulary used may be an expandable library of keywords and phrases that help to break down free form text into meaningful clinical data. Additional pattern matching algorithms may be employed (i.e., general patterns used to extract clinical data from free form text, such as faxes sent by physicians, nurse phone conversations, scripted text data used for data entry, etc.). These patterns are generalized such that relevant clinical data can be extracted. For example, the possible formats of data that may be found in a fax are configured within the system. When the algorithm is executed against the data, each pattern is evaluated and a “match-factor” is computed for it. The higher the match-factor, the higher the probability of a pattern match.
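The description does not define how the match-factor is computed. As one assumed scoring scheme, the sketch below uses the share of a pattern's keywords that appear in the text; the pattern names, keyword lists, and sample text are hypothetical.

```python
def match_factor(text, keywords):
    """Assumed scoring: share of a pattern's keywords present in the text (0.0 to 1.0)."""
    lowered = text.lower()
    found = sum(1 for kw in keywords if kw.lower() in lowered)
    return found / len(keywords) if keywords else 0.0

# Hypothetical configured patterns, each a small keyword library.
PATTERNS = {
    "clinical_outcome": ["approved", "pended", "referred to physician"],
    "fax_header": ["fax", "from:", "to:", "pages"],
}

# Invented free form text resembling a fax.
text = "FAX From: Dr. Smith To: Intake Pages: 2. Request approved."

scores = {name: match_factor(text, kws) for name, kws in PATTERNS.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Under this scheme the fax pattern scores highest for the sample text, while the outcome pattern still registers a partial match on "approved".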
The systems and methods may also allow for display of identified patterns and suggestions. Data as extracted from the source system by applying source system rules is made available for manual reference or validation. This data may then be represented in the common model. Data obtained by applying clinical data rules/pattern matching algorithms on the common model is available as interpreted data.
The systems and methods may also allow for the removal of clinically sensitive data. Extraction of data from the source system focuses on extracting meaningful clinical data and leaves out member-specific information. This is one of the initial steps for excluding sensitive data. Once the common model and interpreted data are generated, another set of cleansing rules can be applied on the entire data set. For example, data may be scanned for member ID numbers, dates of birth, member names, addresses, SSNs, phone numbers, etc. These exclusion rules can be configured within the system so that new patterns can be entered as applicable, making the system more efficient over iterations.
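Configurable exclusion rules could be sketched as a registry of labeled patterns that can be extended at run time. The registry name, the `add_rule` and `cleanse` helpers, and the specific formats (for example, a member ID of one letter followed by seven digits) are illustrative assumptions, not formats stated in this description.

```python
import re

# Hypothetical rule registry: label -> regex. Formats are illustrative, not exhaustive.
EXCLUSION_RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def add_rule(label, pattern):
    """Register a new exclusion pattern without changing existing rules."""
    EXCLUSION_RULES[label] = re.compile(pattern)

def cleanse(text):
    """Replace every match of every configured rule with its label."""
    for label, rule in EXCLUSION_RULES.items():
        text = rule.sub(f"[{label}]", text)
    return text

# A new pattern is entered as the need arises (assumed member ID format).
add_rule("MEMBER_ID", r"\b[A-Z]\d{7}\b")

cleaned = cleanse("Member A1234567, SSN 123-45-6789, call 555-123-4567.")
print(cleaned)
```

Adding the member ID rule required no change to `cleanse` or to the existing rules, which is the kind of extensibility described above.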
The systems and methods may also capture human feedback around final data abstraction/aggregation to create meaningful information with sensitive clinical data excluded. Data extraction in the common model and interpreted form may be made available to allow for processing of any manual edits to the extract. This serves several purposes. First, manual validation and correction of the extraction may be achieved. Further, additional patterns and rules that are observed during the manual process may be fed back to the extraction process to make it more efficient over iterations.
The systems described herein comprise a number of different hardware and software components. Exemplary hardware and software that can be employed in connection with the system are now generally described with reference to
To the extent data and information is communicated over the Internet, one or more Internet servers 708 may be employed. The Internet server 708 also comprises one or more processors 709, computer readable storage media 711 that store programs (computer readable instructions) for execution by the processor(s) 709, and an interface 710 between the processor(s) 709 and computer readable storage media 711. The Internet server 708 is employed to deliver content that can be accessed through the communications network. When data is requested through an application, such as an Internet browser employed by end user computer 712, the Internet server 708 receives and processes the request. The Internet server 708 sends the data or application requested along with user interface instructions for displaying a user interface.
The computers referenced herein are specially programmed, in accordance with the described algorithms, to perform the functionality described herein.
The non-transitory computer readable storage media that store the programs (i.e., software modules comprising computer readable instructions) may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer readable storage media may include, but is not limited to, RAM, ROM, Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer system and processed using a processor.
This application claims priority to U.S. Provisional Patent Application No. 61/580,480, filed Dec. 27, 2011, the entirety of which is incorporated herein by reference.
Number | Date | Country
---|---|---
61580480 | Dec 2011 | US