The present invention relates to the prevention of unauthorized access to sensitive data, and more particularly to a method for analyzing digital files to recognize any sensitive data in the textual content.
The prevention of sensitive data leakage is a top priority for today's consumers and organizations, and a preeminent concern in the evolving field of cybersecurity. Cybersecurity practitioners accordingly seek to help individuals and organizations prevent unauthorized access to sensitive data.
Current digital file analysis methods do not appear to use artificial intelligence (AI) and do not appear to consider the environmental context in which a document was discovered. Current technologies likely employ discrete algorithms rather than true artificial intelligence. A further limitation of these technologies is that they analyze documents without considering the environmental context in which the documents were created. Additionally, none of them appears to suggest using graph theory as a pre-processing means for extracting features or reducing the data set in preparation for analysis.
These prior art methods rely heavily on analyzing how the data is being accessed rather than on contextual features learned from the data itself. They are extremely limited in that one would need to control, and/or develop insight into, the underlying system on which the data resides, and perform extensive training on each system. They must run on the provider's specific platform in order to make an accurate prediction. The prior art methods all appear not to use AI and further appear to be platform specific, and are therefore not usable on all textual data. Consequently, these prior art methods are not something a user can run on his or her own computer, cell phone, or web site. Accordingly, there is a need for better techniques for analyzing digital files to recognize any sensitive data in the textual content.
It is an object of the invention to provide an improved method for analyzing existing digital files and those to come in the future. The method in essence extracts features describing the environmental context in which a file was created and the file content itself by modeling and analyzing:
These and other objects and advantages of the present invention will be understood by reading the following description along with reference to the drawings.
By extracting features beyond the text itself, the method captures extended metadata about a given document that previously would not have been captured. The method extracts features representing elements such as grammatical habits of authors, common document structures, and various linguistic characteristics. The method takes these extracted features (representing the data itself and its context) and analyzes them with artificial intelligence (AI) algorithms, such as decision trees and neural networks, in an effort to predict whether a document includes sensitive data. Leveraging AI algorithms rather than discrete algorithms carries with it the advantage of being able to handle massive volumes of data, as well as the ever-increasing varieties of data. The method proposed here can be easily included in software written by cybersecurity firms, and used by organizations or individuals on their own systems to discover the existence of sensitive data in places previously unknown to them. The method of the current invention is built with “Big Data” in mind, so that it will scale to meet the privacy needs of consumers and organizations.
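The feature-extraction step described above can be sketched as follows. This is a minimal illustration only: the function names `shannon_entropy` and `extract_features` and the particular features (line count, token statistics, assignment count, per-token character entropy) are hypothetical stand-ins chosen for the example, not the features actually used by the invention. The resulting feature dictionary is the kind of input that could then be passed to a decision tree or neural network; the classifier itself is omitted here.

```python
import math
import re
from collections import Counter


def shannon_entropy(text):
    """Return the entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(text):
    """Build a small feature dictionary from the text of a document.

    All feature names are assumptions made for this sketch.
    """
    lines = text.splitlines()
    tokens = re.findall(r"\S+", text)  # whitespace-delimited tokens
    return {
        # Coarse structural features of the document.
        "line_count": len(lines),
        "token_count": len(tokens),
        "mean_token_length": (
            sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
        ),
        # A crude stylistic habit: how often values are assigned.
        "assignment_count": text.count("="),
        # Random-looking tokens (keys, hashes) tend to have higher entropy.
        "max_token_entropy": max(
            (shannon_entropy(t) for t in tokens), default=0.0
        ),
    }


features = extract_features('api_key = "abcd"\nprint(api_key)')
```

In a fuller pipeline, one such dictionary would be computed per document and the collection used as training input for the learning algorithm.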
The current invention, which introduces a novel method for finding the existence of such sensitive data in textual content, is unique in the following ways:
Sensitive information such as passwords, credit card numbers, social security numbers, etc., is often embedded in digital text documents (computer files, web pages, spreadsheets, etc.). The problem arises when these documents are made broadly accessible, usually through unintended means, to individuals who are not authorized to access this sensitive information. This problem is exacerbated by the growth of cloud service providers and the increasing comfort with posting documents in the cloud. There are existing tools that leverage discrete algorithms for finding documents with sensitive data in them, but these algorithms are difficult to maintain and rely on human intelligence to hard code the methodology by which the documents are analyzed, thereby drastically limiting the software's ability to find certain indicators of documents with sensitive information. The current invention solves that problem. It relies on artificial intelligence algorithms that learn previously unobserved semantics of documents containing sensitive information, and then make predictions as to whether new, unseen documents contain sensitive data. This invention, while valuable for all textual content, is particularly well suited for structured textual content, such as text structured in markup languages, programming languages, etc.
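The limitation of hard-coded discrete algorithms noted above can be illustrated with a short sketch. The rule below is a hypothetical example of a hand-written pattern and is not taken from any particular existing tool; the names `SSN_RULE` and `rule_based_flag` are assumptions made for illustration.

```python
import re

# Hypothetical hand-written rule: a social security number written
# as three groups of digits separated by dashes.
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def rule_based_flag(text):
    """Return True if the hard-coded pattern appears anywhere in text."""
    return bool(SSN_RULE.search(text))


# The rule catches the exact format its author anticipated...
caught = rule_based_flag("ssn: 123-45-6789")
# ...but misses a trivially different format, and says nothing about
# other kinds of sensitive data such as an embedded password.
missed_format = rule_based_flag("ssn: 123 45 6789")
missed_password = rule_based_flag("db_password = hunter2")
```

Each missed case would require a person to write and maintain another rule, which is the maintenance burden that a learned model is intended to reduce.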
The method of the current invention would be beneficial to software developers who embed keys and passwords in code, businesses with sensitive data, home users with computers or cell phones, and any individual who uses cloud services.
Reference is made herein to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
The system of the present invention is capable of classifying program code (or a segment of code) as to whether it contains sensitive information. When code is written, programmers have a certain mindset; those who tend to incorporate sensitive information in code may exhibit certain writing traits or coding style habits. An experienced or well-trained programmer will generally avoid putting sensitive information in code; hence, it is more likely that a relatively new programmer will put sensitive information inside the code. The system looks at the actual text in the code along with the relationship of individual words to other words as well as to the whole text.
Instead of feeding the graph directly to an AI system, the invention proposes the use of an adjacency representation of the graph, since there may be more than one edge between two nodes, each representing a different feature. These customized graphs can be easily represented with 3-dimensional adjacency matrices.
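A 3-dimensional adjacency matrix of this kind can be sketched as follows, with one 2-dimensional layer per edge type. This is a minimal sketch under stated assumptions: the two edge types ("adjacent" and "same_line"), the whitespace tokenization, and the function name `build_adjacency` are hypothetical choices made for the example, since the feature types carried by the edges are not enumerated here.

```python
# Hypothetical edge types; each one becomes a layer of the 3-D matrix.
EDGE_TYPES = ["adjacent", "same_line"]


def build_adjacency(lines):
    """Return (nodes, matrix), where matrix[k][i][j] counts the edges of
    type EDGE_TYPES[k] from node i to node j. Nodes are distinct tokens."""
    tokens_per_line = [line.split() for line in lines]
    nodes = sorted({tok for toks in tokens_per_line for tok in toks})
    index = {tok: i for i, tok in enumerate(nodes)}
    n = len(nodes)
    matrix = [[[0] * n for _ in range(n)] for _ in EDGE_TYPES]

    for toks in tokens_per_line:
        # Layer 0 (directed): a token is immediately followed by another.
        for a, b in zip(toks, toks[1:]):
            matrix[0][index[a]][index[b]] += 1
        # Layer 1 (undirected): two distinct tokens share a line.
        uniq = sorted(set(toks))
        for x in range(len(uniq)):
            for y in range(x + 1, len(uniq)):
                i, j = index[uniq[x]], index[uniq[y]]
                matrix[1][i][j] += 1
                matrix[1][j][i] += 1
    return nodes, matrix


nodes, matrix = build_adjacency(["password = secret", "print ( password )"])
```

In this example the tokens `password` and `=` are connected in both layers, once as adjacent tokens and once as tokens on the same line, which is the multiple-edge situation that a single 2-dimensional adjacency matrix could not represent without losing the edge type.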
The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
This application claims the benefit of Provisional U.S. Patent Application Ser. No. 63/008,696 filed Apr. 11, 2020, the contents of which are incorporated herein by reference in their entirety.
The United States Government may have certain rights to this invention under Management and Operating Contract No. DE-AC05-06OR23177 from the Department of Energy.
Number | Date | Country
---|---|---
63008696 | Apr 2020 | US