1. Technical Field
The present disclosure relates to a system and method for semi-automatically estimating the sensitivity of data in an organization. More specifically, the present disclosure describes methods for estimating the risk to a business posed by the potential loss, exposure, or alteration of the data items.
2. Discussion of Related Art
Data is typically a key asset within modern enterprises. Loss, exposure, or alteration of that data may cause damage to the enterprise; this damage can include tangible effects (such as loss of business or loss of competitive advantage) or intangible effects (such as loss of reputation), but in any case, the effects can be severe or even catastrophic. We term the aggregate of these tangible and intangible effects of data loss, exposure, or alteration the sensitivity of the data. Hence, businesses typically make a variety of efforts to safeguard their data, such as systems for content management, data loss protection (DLP), and data backup systems. Unfortunately, these systems are often too expensive to apply to all data in an enterprise, or fail because enterprises do not have a clear idea of where their sensitive data resides.
As an example, consider systems for data loss protection (DLP) and data-centric security, which are designed for safeguarding important and sensitive data. The state-of-the-art technologies for DLP do not have any automated mechanisms to measure the sensitivity of individual data items, and thus treat all data equally. Most known DLP solutions rely on the following two approaches to identify sensitive data.
The first method is applying automatic or semi-automatic classification techniques to detect the content type of the given data (e.g., documents containing credit card information or social security numbers). Solutions using this method can determine if a document or a data item belongs to a certain category or contains a certain kind of information, but they do not measure the estimated sensitivity of individual data items. Therefore, these technologies provide the same kind and level of security measures to all data items belonging to the same category.
The second method is having human experts inspect the data in an organization and judge the importance levels of the data manually. This does allow more sensitive data to be better protected. However, the limitations of this method are: (1) it is very labor-intensive and costly, so typically only a few data types or data sources are inspected; (2) human judgments are very subjective and often qualitative; and (3) human inspection may not be complete, since it is very hard for people to have a complete view of all the data in a large organization. Furthermore, the manual method also does not typically account for data which moves after the data inspection process occurs.
In a similar fashion, other data-centric processes, such as case management, records management, data backup, and content management systems typically do not differentiate between data that is highly sensitive and data that is of low sensitivity to the enterprise.
Therefore, a need exists for a system for locating an enterprise's data, estimating a sensitivity of the data and protecting that data in proportion to its value or sensitivity for the enterprise.
According to an embodiment of the present disclosure, a method for evaluating data includes defining a list of data categories, determining a relative sensitivity for each data category, determining one or more classifiers for each of the data categories, receiving a plurality of data items to be valued, determining one of the data categories for each of said plurality of data items according to the one or more classifiers, and determining a respective sensitivity for each of said plurality of data items.
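The evaluation method recited above can be sketched as a minimal pipeline. This is an illustrative realization only; the function names, category labels, and classifier interfaces below are hypothetical, not part of the disclosure.

```python
# Minimal sketch of the claimed evaluation method. All names and scores
# here are hypothetical illustrations of the recited steps.

def evaluate(data_items, categories, sensitivity_of, classifiers):
    """Assign each data item a category and a sensitivity.

    categories     -- list of category names
    sensitivity_of -- dict mapping category -> relative sensitivity
    classifiers    -- dict mapping category -> function(item) -> score
    """
    results = {}
    for item in data_items:
        # Score the item against every category's classifier; pick the best.
        scores = {c: classifiers[c](item) for c in categories}
        best = max(scores, key=scores.get)
        results[item] = (best, sensitivity_of[best])
    return results
```

In use, each data item receives the category whose classifier scores it highest, and inherits that category's relative sensitivity.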
A system for evaluating data includes a memory device storing a plurality of instructions embodying the system for evaluating data, and a processor for receiving a plurality of data items to be valued and a list of data categories and executing the plurality of instructions to perform a method including determining a relative sensitivity for each data category, determining one or more classifiers for each of the data categories, determining one of the data categories for each of said plurality of data items according to the one or more classifiers, and determining a respective sensitivity for each of said plurality of data items.
Preferred embodiments of the present disclosure will be described below in more detail, with reference to the accompanying drawings:
The present disclosure describes embodiments of a system for locating an enterprise's data, estimating its sensitivity, and protecting that data in proportion to its value or sensitivity for the enterprise.
According to an embodiment of the present disclosure, data sensitivity may be automatically or semi-automatically determined. Embodiments of the present disclosure may be incorporated into various other applications. Two exemplary applications include information technology (IT) security and smart storage. Data sensitivity can significantly enhance IT security of an organization, as organizations may objectively identify the sensitivity of data assets and adjust protection accordingly, e.g., providing increased protection for higher-sensitivity data assets. In another exemplary embodiment, one challenging task of a smart storage system is to create and manage storage backup intelligently. Since backing up all data in an enterprise requires a large amount of storage space and consumes computing time, a “smart” storage backup process may be implemented based on data sensitivity in which the sensitive data may be backed up more frequently, or in which the least sensitive data may never be backed up, thus saving significant amounts of storage space and computing time.
According to an embodiment of the present disclosure, an exemplary system includes an SME (Subject Matter Expert) interview (101), data selection (102) and data sensitivity estimation (103) components (see
According to an embodiment of the present disclosure, an exemplary method (see
At block 201 the definition of a list of data categories may be accomplished in several ways. For example, the definition may be built by interviewing subject matter experts (SMEs) to determine which data categories are of interest, or it can be adapted from a pre-prepared industry-standard list, or adapted from an available industry taxonomy of data or processes. The list of data categories would typically include, as much as possible, a complete inventory of different data categories (high and low sensitivity) that may be encountered in the enterprise.
SMEs can also provide some initial (possibly incomplete) knowledge of a company's business. Referring to
The SME interview process (101) of measuring data sensitivity includes interactions with an SME. The role of the SME is to provide an initial view on data sensitivity (e.g., what is high-sensitivity data). For example, the SME can provide a list of business processes (e.g., insurance claim process), possible locations where data resides (e.g., customer DB, patents DB) and categories of sensitive data (e.g., customer personal information, new product development plan). This information may be utilized for data selection, e.g., for selecting higher-sensitivity data. The SME is expected to provide estimated sensitivities for sample data that can help build tools for data sensitivity estimation. The labels can include any of several ways of measuring the sensitivity, such as a categorical range of data sensitivity (e.g., low, medium, high with attendant thresholds), numeric ranks (e.g., decile or percentile), dollar amounts (e.g., $1M), or unitless arbitrary values (e.g., a scale from 1.0 to 10.0).
Another role of the SME is to provide feedback to automated components in the system of
In a preferred embodiment, an SME interview tool (see 101 in
Referring again to
A metadata collector (311) collects metadata information of the enterprise data (see block 104 in
Business process provenance and mining (312) identifies important business processes in the target organization, and the data items used for the identified processes. The business process provenance and mining (312) produces an end-to-end view of a business process including (all) subprocesses, data artifacts and human resources involved in the process. The business process provenance and mining (312) further discovers new processes or new insights on an existing process by mining activity logs generated by software systems.
The business process provenance and mining (312) (or business process management tool) is used to analyze the business processes identified by SMEs in block 101 of
For example, the system measures the priority level (e.g., importance) of the processes. This can be carried out manually or semi-automatically. In a manual approach, human experts inspect the process graphs generated by the business process provenance and assign business sensitivities to the processes. In a semi-automatic (more preferable) method, a tool automatically determines the importance levels of the processes (Process-Based Sensitivity Estimator (322,
Data content classifiers (313) can determine data categories to which the data items belong using text and data mining technologies. For each data item (e.g., an email message, a document, or a database table), the component conducts content analysis and classification to determine the document category. Documents belonging to less important categories, such as personal emails with no sensitive entities, may be discarded.
The enterprise data includes both structured data (e.g., databases) and unstructured data (e.g., document files, email messages, application logs). The input of the system can be at any granularity; an entire database, a record of a database, all data in a server or a laptop, or a single document. The method may be embodied as a set of software (SW) programs that identify data items valuable for a business, for regulation compliances including patent documents, new product plans, acquisition plans, personal information (e.g., social security number, credit card number, personal health information), etc. In this context, the software programs are classifiers. Classifiers are particular kinds of software programs, which assign an input data item into a pre-defined class. For example, a classifier may be used to answer a question such as what is the topic of this document? According to an embodiment of the present disclosure, the classifiers are software programs that are able to distinguish data from one data category (e.g., “acquisition plans”) from any data belonging to any other data category (e.g., “HR records”). These classifiers can be built using any of several methods well-known in the art, such as machine-learning classifiers trained with examples of each data category, or pattern-based classifiers that look for particular patterns of words or phrases in a document (e.g., all documents containing the string of “for your eyes only”).
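A pattern-based classifier of the kind described above can be sketched as follows. The category names and regular expressions are illustrative assumptions (the "for your eyes only" phrase is taken from the example in the text).

```python
import re

# Sketch of a simple pattern-based classifier: a document is flagged as
# belonging to a category when a characteristic phrase appears. Category
# names and patterns are illustrative only.
CATEGORY_PATTERNS = {
    "restricted": re.compile(r"for your eyes only", re.IGNORECASE),
    "acquisition plans": re.compile(r"\bacquisition\b", re.IGNORECASE),
}

def pattern_classify(text):
    """Return the set of categories whose pattern matches the document."""
    return {cat for cat, pat in CATEGORY_PATTERNS.items() if pat.search(text)}
```

A machine-learning classifier trained on labeled examples of each category could serve in place of, or alongside, this pattern-based approach, as the text notes.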
At block 205, each of the data objects selected for classification is subjected to the set of classifiers, and the data category selected based on the scores from the set of classifiers. This selection is also a technique well-known within the art of data classification.
Data sensitivity estimation (302) measures the sensitivity of individual data items in an organization or in the selected data subset determined in block 301. Data sensitivity estimation (302) may be decomposed into components. These components may be built as machine learning-based classification or regression systems, or rule-based systems.
A metadata-based sensitivity estimator component (321) produces an importance value of data based on metadata information. For example, a data item created by an executive may be more important than a data item created by a regular employee. A data item used by many different organizations may have higher sensitivity than data items used by only a few users in the same organization. Similarly, a document last modified a few days ago is likely to be more important than a document modified several years ago. These metadata are combined to estimate the sensitivity of a given data item.
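The metadata combination described above can be sketched as a weighted score. The particular metadata fields and weights below are hypothetical; the disclosure states only that such signals are combined.

```python
from datetime import date

# Hedged sketch of a metadata-based sensitivity estimator (321). The
# weights (0.4, 0.3, 0.3) and the cutoff of 30 days are assumptions.
def metadata_sensitivity(creator_is_executive, num_organizations,
                         last_modified, today=None):
    today = today or date.today()
    score = 0.0
    if creator_is_executive:
        score += 0.4                                   # executive-authored data
    score += min(num_organizations, 10) / 10 * 0.3     # breadth of use
    age_days = (today - last_modified).days
    score += 0.3 if age_days < 30 else 0.1             # recency of modification
    return score
```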
A process-based sensitivity estimator component (322) automatically determines the importance levels of the processes by applying various graph analysis techniques. The predictors of process importance level include the business area the process is primarily used for, the number of in-degree links to the process node, the number of out-degree links, the frequency of the process being executed, the job roles of the humans involved in the process, the data types used in the process or generated by the process, physical location of the data, current security levels of the data, etc.
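The graph-feature scoring performed by the process-based estimator can be sketched as below. The disclosure lists the predictors (in-degree, out-degree, execution frequency, and others); the weights here are hypothetical and only a subset of the listed predictors is shown.

```python
# Hedged sketch of the process-based sensitivity estimator (322), scoring a
# process node by simple graph features. Weights and the frequency cap are
# illustrative assumptions.
def process_importance(in_degree, out_degree, executions_per_day):
    return (0.5 * in_degree
            + 0.3 * out_degree
            + 0.2 * min(executions_per_day, 100) / 100)
```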
A content-based sensitivity estimator component (323) determines the sensitivity level of the content of the data using text and data mining technologies. For each data item (e.g., an email message, a document, or a database table), the content-based sensitivity estimator component (323) conducts content analysis and classification to identify the content type and subsequently the sensitivity level. For example, the content-based sensitivity estimator component (323) identifies if a document contains intellectual property (IP), personally identifiable information (PII), or personal health information (PHI), and produces the sensitivity level based on various content and context information. For instance, a document containing 100 credit card numbers is likely to receive a higher sensitivity level than a document with a single credit card number.
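The credit-card example above can be sketched as an entity-counting estimator. The regular expression is a deliberately simplistic stand-in; a production detector would use validation (e.g., checksums) and many more entity types.

```python
import re

# Sketch of the content-based estimator (323): count occurrences of a
# sensitive entity and scale sensitivity with the count. The per-hit
# weight and cap are assumptions.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")

def content_sensitivity(text, per_hit=0.1, cap=1.0):
    hits = len(CARD_PATTERN.findall(text))
    return min(hits * per_hit, cap)
```

Under this scheme a document with many card numbers saturates at the cap, while a document with a single number receives a proportionally lower score, consistent with the example in the text.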
A data sensitivity calculator component (324) combines the estimated data sensitivities generated by the metadata-based sensitivity estimator (321), the process-based sensitivity estimator (322) and the content-based sensitivity estimator (323) and generates a sensitivity (105) for the given data. Various combination techniques can be used to combine the sensitivities such as selecting the highest value, the average value, etc. In a preferred embodiment, a classifier weighted mean is used in which the estimated sensitivities are fed into a regression analysis together with the confidence values of the estimators (i.e., how accurately an estimator has predicted data sensitivity in the past) to produce a final data sensitivity.
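The combination step can be sketched as a confidence-weighted mean, one of the strategies named above. This is an illustrative simplification of the described regression: each estimator's output is weighted by its past predictive confidence.

```python
# Sketch of the data sensitivity calculator (324): combine per-estimator
# sensitivities by a confidence-weighted mean. A full realization would fit
# the weights by regression, as the text describes.
def combine(estimates, confidences):
    total = sum(confidences)
    if total == 0:
        return 0.0
    return sum(e * c for e, c in zip(estimates, confidences)) / total
```

Selecting the maximum or the unweighted average, also mentioned in the text, are one-line variants of this function.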
Referring again to
At block 202, relative sensitivities are assigned to each category, or the categories are rank-ordered by sensitivity.
Let T={T1, . . . , Tn} be the list of data categories and types, and at block 203 for each category, Ti, there are one or more classifiers that collectively decide if a given data item belongs to Ti.
At block 205, let Ci be the set of classifiers for Ti and C={C1, . . . , Cm} be the list of all software programs or classifiers. Thus, the exemplary method outputs a set of classifiers Ci for Ti.
An exemplary method for determining a sensitivity level of a data item is shown in
At block 401, apply all Ci ∈ C to d.
At block 402, count the number of occurrences of all data types Ti found in d. The count is a number greater than or equal to 0. For some data types or topics, the count can be either 1 or 0, e.g., if d is a patent disclosure (1) or not (0).
At block 403, the sensitivity of the document d is determined based on the counts of data categories and their ranks or relative values provided by the SMEs. One exemplary method for determining the value of the data d is to combine all category values found in d. Let c1 be the count of category T1 found in d, and let v1 be the relative sensitivity of T1 provided at block 202; then d can be considered to have c1×v1 value for T1. Suppose d has k categories, Ti1, . . . , Tik; then the total sensitivity of d is the sum of the per-category values: sensitivity(d) = ci1×vi1 + . . . + cik×vik.
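The per-category weighted sum described above (count of each category times its SME-provided relative sensitivity, totaled over the categories found in d) can be sketched as:

```python
# Sketch of block 403: total sensitivity of document d as the sum, over
# each category found in d, of occurrence count times relative sensitivity.
def total_sensitivity(counts, values):
    """counts: category -> occurrences in d; values: category -> relative sensitivity."""
    return sum(counts[cat] * values[cat] for cat in counts)
```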
Another exemplary method for determining the value includes normalizing all sensitivities into a range of values, e.g., [0,1], and combining the normalized values. In other words, for all data types Ti ∈ T, the sensitivity of Ti in d can be normalized into a value between 0 and 1.
In the present exemplary embodiment, a logistic (or sigmoid) function is used for the normalization, but other functions are also applicable. The logistic function may be defined as:

f(x) = 1 / (1 + e^(-x)),

where x is determined based on the count and its relative sensitivity. A logistic function is defined for each Ti.
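The logistic normalization just described can be sketched as follows. How x is derived from the count and relative sensitivity is not fully specified in the text; the product used below is an assumption.

```python
import math

# Sketch of the normalization step: a logistic (sigmoid) function maps the
# weighted count for each category into (0, 1). Using count * relative_value
# as the input x is an illustrative assumption.
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normalized_sensitivity(count, relative_value):
    return logistic(count * relative_value)
```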
Optionally, the output of all data sensitivities from block 206 may be used to generate representations of the enterprise's data sensitivities. For example, this output may show the distribution of data sensitivity by machine, by user, by geography, or by organization within the business. This would allow a business manager, for example, to view areas in the enterprise that are at high risk because of excessive concentrations of sensitive data, or to find individuals who may have inappropriate levels of sensitive data on insufficiently protected machines. This output may be numeric or, in a preferred embodiment, displayed interactively as graphs, maps, or other visual representations that can be manipulated to see the data at varying levels of granularity as desired. This can be done using a variety of commercially available systems.
At block 404, all the sensitivities determined at block 403 are combined, and the aggregated value is set as the sensitivity level of d (105). This is also shown in the loop of
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or computer program product for estimating the sensitivities of data in an organization. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Referring to
The computer platform 501 also includes an operating system and micro-instruction code. The various processes and functions described herein may either be part of the micro-instruction code or part of the application program (or a combination thereof) that is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments for estimating the sensitivity of data in an organization, it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in exemplary embodiments of the disclosure, which are within the scope and spirit of the disclosure as defined by the appended claims. Having thus described an invention with the details and particularity required by the patent laws, what is claimed and desired to be protected by Letters Patent is set forth in the appended claims.