The present disclosure relates generally to systems and methods for detecting sensitive data in data sources. More specifically, the present disclosure relates to utilizing known sensitive data of a sample to identify sensitive data in database tables, electronic documents, and/or other electronic data sources.
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
In an ever-increasing data-driven world, data storage structures are becoming increasingly complex, increasing the number of locations where sensitive data may be stored. Unfortunately, as these data storage structures become more complex, tracking the information stored in them also becomes more difficult. In many instances, fields specified in data storage structures are misused, causing sensitive data to be stored in unexpected locations that are not regulated in the manner that fields known to store sensitive data are.
In some systems, sensitive data detection may rely on pattern matching and/or metadata searching to identify sensitive data. Unfortunately, these approaches oftentimes fail to detect sensitive data and/or flag data as sensitive when it is not (i.e., a false positive). For example, pattern matching may look for known data formats to identify potential sensitive data. However, this approach may result in false positives, flagging data that merely follows a sensitive data format, even when the data does not actually pertain to sensitive data. For example, if a user's social security number (e.g., 477-57-8177) is sensitive data, a pattern matching technique may identify data (e.g., 123-45-6789) with a similar format (e.g., 3 digits, 2 digits, and 4 digits separated by dashes) as sensitive data, even though this data does not match the user's actual social security number. Therefore, a false positive may be generated.
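By way of illustration only, the following sketch shows a naive regex-based pattern matcher of the kind described above; the sample records are hypothetical and the snippet is not part of the disclosed system. Any string in the 3-2-4 digit format is flagged, which is how the false positive arises.

```python
import re

# Naive pattern matching: flag anything shaped like a social security number (3-2-4 digits).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

records = [
    "Customer note: please call back after 5 pm",
    "Reference code 123-45-6789 printed on mailed statement",  # SSN-shaped, but not an SSN
    "SSN on file: 477-57-8177",
]

for record in records:
    for match in SSN_PATTERN.findall(record):
        # Every 3-2-4 digit string is flagged, so "123-45-6789" is a false positive.
        print("flagged as possible SSN:", match)
```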
Furthermore, metadata scanning may also be used to identify potential sensitive data. Under this approach, metadata associated with fields of data is searched to identify potential sensitive data. While this technique may capture entire swaths of potential sensitive data, much sensitive data may be missed, as this approach relies heavily on predictable/expected metadata in the data storage structures. For example, a database column name, such as “SSN”, may indicate sensitive data. Metadata searching may look for column names including “SSN” and flag the entire column as sensitive data. However, when the metadata does not descriptively indicate the sensitive data, such as a column name of “MEMO”, the column may not be flagged, even when the “MEMO” field includes sensitive data, such as stored social security numbers.
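Similarly, a minimal metadata scan might do nothing more than match column names against a list of name hints, as sketched below; the column names and hint list are hypothetical examples, not a disclosed schema. A descriptively named column such as "SSN" is flagged, while a generically named column such as "MEMO" is missed regardless of its contents.

```python
# Metadata scanning sketch: flag columns whose names suggest sensitive data.
SENSITIVE_NAME_HINTS = ("SSN", "SOCIAL", "TAXPAYER")

columns = ["CUSTOMER_ID", "SSN", "MEMO", "DEVICE_NM"]

flagged = [name for name in columns
           if any(hint in name.upper() for hint in SENSITIVE_NAME_HINTS)]
print(flagged)  # ['SSN'] -- "MEMO" is not flagged even if it holds social security numbers
```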
Accordingly, new sensitive data identification approaches are desirable. Such solutions are described in detail herein.
A summary of certain embodiments disclosed herein is set forth below. It should be understood that these aspects are presented merely to provide the reader with a brief summary of these certain embodiments and that these aspects are not intended to limit the scope of this disclosure. Indeed, this disclosure may encompass a variety of aspects that may not be set forth below.
The systems and methods provided herein relate to scanning for and/or identifying sensitive data in unexpected locations of data storage structures. Once sensitive data is found in unexpected locations of data storage structures, actions can be taken to remediate the unexpected storage of the sensitive data.
In particular, the current systems and methods scan entire data storage structure records associable with a particular user for the presence of defined variants of known sensitive data of the particular user. In this manner, a precise indication of sensitive data particular to the user may be identified as present in particular records of a data storage structure. This results in an understanding of where the user's sensitive data is being stored, without creating false positives from data that may look like sensitive data but is not particular to the user. This approach may be repeated for an entire sample population, resulting in an understanding of how and where sensitive data is stored in the data storage structures, even when in unpredictable and/or unexpected locations.
It is appreciated that implementations in accordance with the present disclosure can include any combination of the aspects and features described herein. That is, implementations in accordance with the present disclosure are not limited to the combinations of aspects and features specifically described herein, but also include any other appropriate combinations of the aspects and features provided.
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” and “the” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “one embodiment” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features.
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, to one skilled in the art that embodiments of the present disclosure may be practiced without some of these specific details.
Turning now to an overall system flow for scanning for sensitive data,
To scan for sensitive data, the sensitive data scanner 102 may identify a subset of the known sensitive data 106 associated with particular entities of a sample population. The sensitive data scanner 102 may identify all data store structures (e.g., database tables, electronic document fields, etc.) that can be joined/associated with particular entities of the sample population. For each entity of the sample population, the subset of the known sensitive data 106 that is associated with the entity is searched in all records associated with the entity in all of the data store structures. In this manner, precise sensitive data identification particular to the entity may be performed.
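One non-limiting way the per-entity scan could be organized is sketched below. The function names and data shapes (a mapping of entity identifiers to known sensitive values, and a fetch_records callable returning (column, value) pairs for a given structure and entity) are illustrative assumptions rather than the disclosed implementation.

```python
def scan_population(sample_population, known_sensitive_data, joinable_structures, fetch_records):
    """Search each entity's own records for that entity's own known sensitive values."""
    findings = []
    for entity_id in sample_population:
        # Only sensitive values known to belong to this entity are searched for,
        # which avoids false positives from look-alike data belonging to no one.
        entity_values = known_sensitive_data.get(entity_id, [])
        for structure in joinable_structures:
            # fetch_records is assumed to return (column_name, cell_value) pairs for
            # the records in `structure` that are joinable to `entity_id`.
            for column, cell in fetch_records(structure, entity_id):
                if any(value in str(cell) for value in entity_values):
                    findings.append((entity_id, structure, column))
    return findings
```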
In some instances, as described in more detail below, numerous variants of the sensitive data may be generated prior to the searching of the datastores 104. In this manner, the searching need not be dependent upon a rigid sensitive data format that the known sensitive data 106 is stored in. This may result in capture of additional sensitive data in the datastores 104, which may be stored in a format different than the known sensitive data 106. Example variants of known sensitive data are described in more detail with respect to
Results of the sensitive data scanning by the sensitive data scanner 102 may be provided for use by downstream systems 108. For example, the downstream systems 108 may include a sensitive data identification reporting system 110, which may provide a graphical user interface (GUI) reporting identified sensitive data found in the datastores 104. A process for sensitive data identification and use in such a GUI is provided in
The downstream systems 108 may also include an electronic document system 112. A process for identifying sensitive data for use in an electronic document system 112 is described in detail in
The downstream systems 108 may also include a customer service feedback system 114 and/or an Information Technology (IT) system 116, which may be interested in datastore structure usage patterns. A process for identifying sensitive data to identify datastore structure usage patterns is described in detail in
The downstream systems 108 may also include a Privacy Regulation Service 118. For example, the Privacy Regulation Service 118 may be charged with removal of sensitive data associated with a user from datastores 104. A process for identifying sensitive data for use with the Privacy Regulation Service 118 is described in detail in
Turning now to a more detailed discussion of the sensitive data scanning,
In some embodiments, the entities used for the sample population may be selected based upon particular characteristics. For example, in some embodiments, it may be desirable to maintain a diverse population. For example, when a company offers many different products, a diverse population may be one whose entities (e.g., customers) are collectively associated with a large number of the different products. This may help ensure that product-specific data sources are each scanned for at least a subset of sensitive data of the population. Further, in some embodiments, after the sensitive data scan process completes, it can be re-run (e.g., periodically at a user-provided time interval) with a different sample population. In some embodiments, the sample population size may dynamically change as certain variables change. For example, during periods when relatively less processing power is available (e.g., during business hours), the sample size may be relatively smaller than during periods when relatively more processing power is available.
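A diverse sample of this kind could be drawn, for instance, by rotating across product lines so that each product contributes entities to the sample; the sketch below is one hypothetical stratified approach and is not prescribed by the present disclosure.

```python
import random

def draw_diverse_sample(customers_by_product, sample_size):
    """Round-robin across product lines so every product contributes customers to the sample."""
    pools = {product: list(customers) for product, customers in customers_by_product.items()}
    sample = []
    while len(sample) < sample_size and any(pools.values()):
        for pool in pools.values():
            if pool and len(sample) < sample_size:
                sample.append(pool.pop(random.randrange(len(pool))))
    return sample
```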
The process includes detecting datastore structures (e.g., tables) that are relatable (e.g., joinable) to an identifier of the entity (block 204). For example, in the context of a relational database, the relationships defined in the relational database may indicate particular tables that can be associated with an entity field (e.g., a customer identifier). In other datastores, the structures may be identified based upon other features of the structures. For example, in an electronic document, an author or a signer may indicate an entity to which the fields of the electronic document may be related.
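In a relational database, the joinable structures might be discovered by asking the catalog which tables expose the entity identifier column, as in the sketch below. The standard information_schema views and the "CUSTOMER_ID" column name are assumptions made for illustration; placeholder syntax varies by database driver.

```python
def find_joinable_tables(cursor, entity_id_column="CUSTOMER_ID"):
    """Return (schema, table) pairs for tables that contain the entity identifier column.

    Assumes a database exposing the standard information_schema views and a DB-API
    cursor whose driver uses %s placeholders (e.g., psycopg2); adjust as needed.
    """
    cursor.execute(
        """
        SELECT table_schema, table_name
        FROM information_schema.columns
        WHERE column_name = %s
        """,
        (entity_id_column,),
    )
    return cursor.fetchall()
```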
The data (e.g., table data) associated with the detected datastore structures of block 204 may be obtained (block 205). This may be done, for example, by running a query that returns all data associated with unique identifiers of the sample entity population.
Known sensitive data for the population is obtained (block 206). For example, a database may include known sensitive data associated with unique identifiers that identify the entities. The known sensitive data associated with the sample population may be obtained by querying the datastore using the unique identifiers of the entities.
Once the known data is obtained, distinctive known data formats of the obtained sensitive data are created (block 208). For example, known sensitive data may be stored in a number of different formats. By creating the distinctive known data formats for the obtained sensitive data, each distinctive format of the known sensitive data may be searched for.
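For instance, a single known social security number might be expanded into several plausible storage formats before the scan runs; the variants below are illustrative guesses at common formats rather than an exhaustive or disclosed list.

```python
import re

def ssn_variants(ssn):
    """Expand one known social security number into distinct formats it might be stored in."""
    digits = re.sub(r"\D", "", ssn)                  # "477-57-8177" -> "477578177"
    return {
        digits,                                      # bare digits
        f"{digits[:3]}-{digits[3:5]}-{digits[5:]}",  # dash separated
        f"{digits[:3]} {digits[3:5]} {digits[5:]}",  # space separated
    }

print(ssn_variants("477-57-8177"))
```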
Returning to
The scan results are collected for reporting of detected stored sensitive data (block 212). Further, as mentioned herein, scans are run for all data structures (e.g., tables) that are relatable to the entity (e.g., joinable to an Entity Identifier column). Once scans for all tables are complete, the scan results may be reported via a GUI (block 216). In some embodiments, SQL commands may also be presented that enable a user to quickly return datastore results where the sensitive data was found. As may be appreciated, some of this identified sensitive data may be expected. For example, it may be wholly expected that a customer name field includes "NAME_FIRST_LAST" sensitive data. However, unexpected locations where sensitive data is stored may be revealed as well. For example, the entry 414 relating to the "DEVICE_NM" column of a table may be intended to store an electronic device name. However, several instances of "NAME_FIRST_LAST" sensitive data are indicated as having been identified in this column. This may mean that an entity has set the name of their electronic device to their first and last name, which is sensitive data. As may be appreciated, this is very useful information, as it may help reveal unexpected datastore locations where sensitive data may be stored, even when the storage was not intended.
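A reporting query of the kind mentioned above might take roughly the following shape; the table, column, and identifier names are hypothetical, and a production system would bind values as parameters rather than formatting them into the SQL text.

```python
def build_review_query(table, column, entity_id_column, entity_id):
    """Build a SQL statement a reviewer could run to pull the rows where sensitive data was found.

    All names here are hypothetical examples; real queries should bind `entity_id`
    as a parameter instead of interpolating it into the SQL string.
    """
    return (
        f"SELECT {entity_id_column}, {column} "
        f"FROM {table} "
        f"WHERE {entity_id_column} = '{entity_id}';"
    )

print(build_review_query("DEVICE_TABLE", "DEVICE_NM", "CUSTOMER_ID", "C12345"))
```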
The GUI 400 may include display filters 416, which may enable a user to home in on particular sensitive data criteria. For example, metric criteria 418 may enable only metrics that meet certain criteria to be displayed. In the current embodiment, the metric criteria includes a range of percentages of detected sensitive data within a given column of data. In addition, particular datastore structures of interest may be defined. For example, a particular database of interest may be defined at affordance 420, a particular table subject area may be defined by affordance 422, and a particular table name of interest may be defined by affordance 424. When selected, only the items of interest will be displayed in the GUI 400. In some embodiments, an affordance 426 may be used to select one or more particular known data format names of interest. Thus, if a user of the GUI 400 were only interested in finding data pertaining to a phone number, only the phone number format names would be selected by the affordance 426.
Having discussed identification and reporting of stored sensitive data, the discussion now turns to other uses of identified sensitive data.
To illustrate the effects of process 500,
As indicated by the "Match" arrow 608 and the dashed lines 610, a match has been found by the Sensitive Data Scanner 102 in the current example. This match results in a redaction 612 of the data 614 found in the matching field 616. In some embodiments, the redaction may be performed by calling a function of the electronic document 602 software that performs electronic redaction of the electronic document 602.
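The redaction step might be orchestrated roughly as sketched below, where redact_field stands in for whatever redaction function the electronic document software exposes; that callable, and the mapping of field names to values, are assumptions made for illustration.

```python
def redact_matches(document_fields, sensitive_variants, redact_field):
    """Redact any document field whose value contains a known sensitive data variant.

    `document_fields` maps field names to current values; `redact_field` is assumed to
    be supplied by the electronic document software (hypothetical here) and performs
    the actual electronic redaction of the named field.
    """
    for field_name, value in document_fields.items():
        if any(variant in str(value) for variant in sensitive_variants):
            redact_field(field_name)
```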
As with previous embodiments, the Sensitive Data Scanner 102 need not rely on a particular format of the social security number or a particular name of the fields 604 to identify the data to be redacted. Thus, data to be redacted may be redacted, even when the data is stored in an unintended and/or unexpected field, resulting in increased security and enhanced redaction. Further, to perform this redaction for multiple parties and/or multiple electronic documents 602, the sample may be set to all of the parties that the redaction should occur for and/or the datastore may be set to a multitude of electronic documents 602 that the redaction should occur on.
Turning now to detection of usage patterns of datastore structures,
Turning now to a discussion of regulation implementation, under some jurisdictional regulations, there may be requirements to delete stored sensitive data for a particular user upon request of the user. However, as may be appreciated, it may be difficult to identify sensitive data stored in unintended locations.
As may be appreciated, the techniques provided herein provide a significant improvement in computer functionality, enabling the computer to seek out sensitive data in numerous formats, even in unexpected locations. This is something neither a computer nor a human could previously perform. Through reliance on defined data structure relationships, particular data records associated with particular entities may be searched, resulting in more precise identification of actual sensitive data, even in unintended/unexpected locations with fewer false positive identifications.
While only certain features of the disclosure have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the present disclosure.
The present application claims priority to and the benefit of U.S. Provisional Application No. 63/115,961, entitled “SYSTEMS AND METHODS FOR PRECISE SENSITIVE DATA DETECTION” and filed on Nov. 19, 2020, the disclosure of which is incorporated by reference herein for all purposes.
Related U.S. Application Data

| Number | Date | Country |
| --- | --- | --- |
| 63/115,961 | Nov. 19, 2020 | US |