The systems and methods of the present invention may be used to detect and thwart hackers or other unauthorized users of computer systems.
The system may process e-mails in .json format plus attachments and standalone files within directory structures matching certain specifications. E-mails and files may be processed one user at a time, in sorted order according to timestamps. Within each e-mail or file, dates and times may be detected and shifted according to user-specified deltas, and people's names may be detected and shifted according to user-provided templates. Formatting may be preserved exactly for .docx files and approximately for .pdf files. Text files and HTML-formatted e-mails may be handled similarly. The accuracy achieved for detecting recognized concepts may be high, based on a Bayesian machine learning algorithm for named entity recognition followed by a second phase to exclude false positives. During processing, fake e-mails including enticing content may occasionally be inserted to lure an unauthorized user into revealing themselves by visiting a fake website and entering generated credentials. The system may also be converted to a daemon that runs in the background and automatically detects and processes new users, e-mails, or files as they appear.
A further understanding of the invention can be obtained by reference to exemplary embodiments set forth in the illustrations of the accompanying drawings. Although the illustrated embodiments are merely exemplary of systems, methods, and apparatuses for carrying out the invention, both the organization and method of operation of the invention, in general, together with further objectives and advantages thereof, may be more easily understood by reference to the drawings and the following description. Like reference numbers generally refer to like features (e.g., functionally similar and/or structurally similar elements).
The drawings are not necessarily depicted to scale; in some instances, various aspects of the subject matter disclosed herein may be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. Also, the drawings are not intended to limit the scope of this invention, which is set forth with particularity in the claims as appended hereto or as subsequently amended, but merely to clarify and exemplify the invention.
The invention may be understood more readily by reference to the following detailed descriptions of embodiments of the invention. However, techniques, systems, and operating structures in accordance with the invention may be embodied in a wide variety of forms and modes, some of which may be quite different from those in the disclosed embodiments. Also, the features and elements disclosed herein may be combined to form various combinations without exclusivity, unless expressly stated otherwise. Consequently, the specific structural and functional details disclosed herein are merely representative. Yet, in that regard, they are deemed to afford the best embodiments for purposes of disclosure and to provide a basis for the claims herein, which define the scope of the invention. It must be noted that, as used in the specification and the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly indicates otherwise.
Use of the term “exemplary” means illustrative or by way of example, and any reference herein to “the invention” is not intended to restrict or limit the invention to the exact features or steps of any one or more of the exemplary embodiments disclosed in the present specification. Also, repeated use of the phrase “in one embodiment,” “in an exemplary embodiment,” or similar phrases does not necessarily refer to the same embodiment, although it may. It is also noted that terms like “preferably,” “commonly,” and “typically,” are not used herein to limit the scope of the claimed invention or to imply that certain features are critical, essential, or even important to the structure or function of the claimed invention. Rather, those terms are merely intended to highlight alternative or additional features that may or may not be used in a particular embodiment of the present invention.
For exemplary methods or processes of the invention, the sequence and/or arrangement of steps described herein are illustrative and not restrictive. Accordingly, it should be understood that, although steps of various processes or methods may be shown and described as being in a sequence or temporal arrangement, the steps of any such processes or methods are not limited to being carried out in any particular sequence or arrangement, absent an indication otherwise. Indeed, the steps in such processes or methods generally may be carried out in various different sequences and arrangements while still falling within the scope of the present invention.
The Decoy Generating System (“DGS” or “RAGS”) of the present invention processes e-mails along with their attachments and other user files in one user directory at a time. The user directories may exist within a specified base directory that is provided to the system. Each user directory may contain a subdirectory called “Email” containing e-mails and (as separate files) attachments; another subdirectory called “Files” may contain other user files. Processed (i.e., shifted) e-mails may be placed in a subdirectory called “Changed_Emails,” and processed files may be placed in a subdirectory called “Changed_Files.” This is summarized in
All e-mails may be stored in .json format. When the DGS processes e-mails and files, it may detect dates and times, and shift them according to deltas specified by the user. In its current state, the system may also detect names of people and shift them according to templates specified by the user. Other areas of investigation include the detection of other nouns, including locations and organizations.
The system may process one user directory at a time, and within each user directory, the system may process e-mails and files in sorted order according to their timestamps. The system may randomly insert a fake e-mail including enticing content. The content may include the URL of a fake website and login credentials for the website. The user may be alerted, for example via e-mail, if anyone tries to log in to the fake website using credentials associated with the user's account. The system may also run as a daemon that can detect when new user directories, e-mails, or files are added or created, and process them automatically at such times.
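The processing order described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `iter_user_items` is hypothetical, e-mails and files are merged into a single ordering, and each item's file modification time stands in for its timestamp (the specification does not say where the timestamp is read from).

```python
import os


def iter_user_items(base_dir, subdirs=("Email", "Files")):
    """Yield (user, path) pairs, one user directory at a time, oldest item first."""
    for user in sorted(os.listdir(base_dir)):
        user_dir = os.path.join(base_dir, user)
        if not os.path.isdir(user_dir):
            continue
        paths = []
        for sub in subdirs:
            sub_dir = os.path.join(user_dir, sub)
            if not os.path.isdir(sub_dir):
                continue
            for name in os.listdir(sub_dir):
                path = os.path.join(sub_dir, name)
                if os.path.isfile(path):
                    paths.append(path)
        # Assumption: modification time is used as the item's timestamp.
        for path in sorted(paths, key=os.path.getmtime):
            yield user, path
```

A daemon variant could call such a generator periodically and skip any path it has already processed.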
A. Training the DGS System
The system may automatically detect concepts including dates, times, people, and locations in e-mails and files, using an approach known as named entity recognition. A “chunker” predicts a category for every token (i.e., word) in a document using a Bayesian machine learning algorithm. Each token may begin a concept (e.g., label B-PERSON), continue a concept (e.g., label I-PERSON), or not be part of any recognized concept (label O). General features used for learning include the token itself, the token's part-of-speech, the next and previous token and part-of-speech (POS), and the previous token's label. Several concept-specific features have been added to improve accuracy (e.g., Boolean features representing the inclusion, or not, in lists of months, lists of names according to the U.S. Census Bureau, etc.). A second phase using hand-crafted rules is applied to eliminate some false positives. For example, predicted dates are excluded if they are not verified by Python's dateutil module, and names of people are excluded if they contain ‘@’, since these are probably e-mail addresses.
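As a minimal sketch of the labeling scheme described above, the following shows how the listed general features might be assembled for one token, how B-/I-/O labels might be grouped into concepts, and how the '@' rule of the second phase might be applied. The function names, the feature keys, and the `<START>`/`<END>` padding values are illustrative assumptions, and the Bayesian classifier itself is not shown.

```python
def token_features(tokens, pos_tags, i, prev_label, months=frozenset()):
    """Build a feature dict for token i from the general features listed above."""
    last = len(tokens) - 1
    return {
        "token": tokens[i],
        "pos": pos_tags[i],
        "prev_token": tokens[i - 1] if i > 0 else "<START>",
        "prev_pos": pos_tags[i - 1] if i > 0 else "<START>",
        "next_token": tokens[i + 1] if i < last else "<END>",
        "next_pos": pos_tags[i + 1] if i < last else "<END>",
        "prev_label": prev_label,
        # Example of a concept-specific Boolean feature (membership in a month list).
        "in_month_list": tokens[i].lower() in months,
    }


def decode_spans(tokens, labels):
    """Group B-/I-/O token labels into (category, text) concepts."""
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            current = [label[2:], [token]]
            spans.append(current)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            # "O", or an I- label that does not continue the open concept.
            current = None
    return [(category, " ".join(words)) for category, words in spans]


def drop_false_positives(spans):
    """Second-phase rule from the text: a PERSON containing '@' is probably an address."""
    return [(c, t) for c, t in spans if not (c == "PERSON" and "@" in t)]


tokens = ["Ken", "Lay", "wrote", "to", "jeff@enron.com", "on", "May", "5"]
labels = ["B-PERSON", "I-PERSON", "O", "O", "B-PERSON", "O", "B-DATE", "I-DATE"]
concepts = drop_false_positives(decode_spans(tokens, labels))
# concepts == [("PERSON", "Ken Lay"), ("DATE", "May 5")]
```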
The chunker is trained on files that have had instances of each relevant concept manually labeled. A sample training corpus may consist of 94 news documents from the publicly available Information Extraction: Entity Recognition (IEER) corpus, and 100 randomly selected e-mails from the Enron e-mail dataset. Cross-validation experiments may be performed within the training set to evaluate the chunker's accuracy in detecting dates, times, and people using standard metrics from the field of natural language processing (NLP). The metrics used may include recall, which indicates the percentage of actual tokens from the category that are correctly predicted to belong to the category; precision, which indicates the percentage of predicted tokens assigned to the category that actually do belong to the category; and F1, which combines recall and precision into a single metric that is closer to the lower of the two. Based on cross-validation experiments, it is possible for the system to achieve F1 scores for dates averaging about 94%, F1 scores for times averaging about 91%, and F1 scores for people averaging about 70%.
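The three metrics can be illustrated with a short token-level computation. This is a sketch rather than the evaluation code used for the reported figures; it assumes each token carries a plain category label (for example "DATE" or "O") and uses the standard definition of F1 as the harmonic mean of precision and recall.

```python
def precision_recall_f1(gold, predicted, category):
    """Token-level precision, recall, and F1 for one category."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == category and p == category)
    predicted_n = sum(1 for p in predicted if p == category)
    actual_n = sum(1 for g in gold if g == category)
    precision = true_pos / predicted_n if predicted_n else 0.0
    recall = true_pos / actual_n if actual_n else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["DATE", "DATE", "O"]
pred = ["DATE", "O", "O"]
p, r, f1 = precision_recall_f1(gold, pred, "DATE")
# p == 1.0, r == 0.5, and f1 is 2/3, nearer to the lower value than the mean of 0.75
```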
A typical user should never need to retrain the chunker. However, the system allows the user to train their own chunker, and to specify that chunker to be used by the system in place of a default chunker (which, for example, may have been trained using the training set and methodology indicated above). A graphical user interface may be implemented and shall be referred to herein as the Named Entity Labeler. In the NLP literature, the term “named entity” is used to represent the concepts that are detected by this sort of approach, including concepts such as dates, times, etc.
A screenshot of our Named Entity Labeler being used to label one of the e-mails in a training set is shown in
Once enough documents have been labeled to constitute a training set, a user can train a chunker using a Python script. This can easily be performed from the interactive Python shell.
B. Detecting and Shifting Concepts
When the DGS processes e-mails, attachments, or other files, it may first extract the textual content from the document, then segment the text into sentences, then tokenize each sentence (i.e., split the sentence into words plus important punctuation), then compute the part-of-speech (i.e., syntactic category) for each token, then compute other features used for learning, then apply the chunker to detect recognized concepts (e.g., dates, times, names of people, locations). For each predicted concept, a second phase may be applied to eliminate false positives. Then each date and time may be shifted according to deltas specified by the user (this makes use of Python's datetime module). Names of people that match .pshift files provided by the user may also be modified according to those templates, as explained below. After all shifts are applied, the document may be reconstructed and saved in the proper destination folder. A simplified outline explaining the system workflow for processing a single e-mail or file is shown in
Retrieving the text from a file, represented by the first box in the outline, may be more complicated for some file types than others. For e-mails represented as .json files, the Python json module can be used to obtain and potentially modify the various fields. Text files are also simple to deal with. The system may handle HTML-formatted e-mails (and other .html files, if any), .docx attachments and files, and .pdf attachments and files. Handling HTML and .docx files is similar, because .docx files are stored as compressed XML documents, and specific tags indicate textual fields; Python's lxml module is useful for handling both formats. Complications can still arise, as sentences may be split between HTML or XML nodes. The system may restore all modified tokens to their original nodes to preserve formatting. It is difficult, however, to manipulate .pdf files directly. The system may therefore rely on publicly available utilities to convert .pdf files to .html, process the .html, and convert the file back to .pdf. The conversion is not perfect, so formatting of .pdf files is only approximately preserved. Any other file type, either as an attachment or standalone file, is copied to the destination directory unmodified.
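The per-type handling described above might be organized roughly as follows. This is a hedged sketch: the handler names, the `classify` and `rewrite_email_field` functions, and the "body" field name are assumptions made for illustration, since the specification does not list the actual .json fields. Only the standard json module is used; the lxml-based HTML and .docx handling is not shown.

```python
import json
import os

# File types the text says are processed; any other type is copied unmodified.
HANDLED = {
    ".json": "email",
    ".txt": "text",
    ".html": "html",
    ".docx": "docx",
    ".pdf": "pdf",
}


def classify(path):
    """Return the (hypothetical) handler name for a path, or "copy" if unhandled."""
    extension = os.path.splitext(path)[1].lower()
    return HANDLED.get(extension, "copy")


def rewrite_email_field(raw_json, field, transform):
    """Apply transform to one string field of a .json e-mail and return JSON text."""
    message = json.loads(raw_json)
    if isinstance(message.get(field), str):
        message[field] = transform(message[field])
    return json.dumps(message, indent=2)


# "body" is an assumed field name used only for this example.
raw = '{"subject": "Meeting", "body": "See you Monday."}'
changed = rewrite_email_field(raw, "body", lambda text: text.replace("Monday", "Friday"))
```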
Shifting dates and times, once predicted and verified, may be achieved using Python's datetime module (examples are described below). To specify names of people to shift, and how to shift them, the user can specify one or more templates in the form of .pshift files. Each template specifies a person to shift, if detected, and how to shift the person. Each template must include: (1) all allowable variations of the person's first name, middle name, and last name; (2) how each allowable variation of any part of a name should be modified; and (3) which parts of the person's name are required to count as a match.
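A minimal sketch of both kinds of shift follows. The date shift uses `datetime.timedelta` from the standard library. The name shift uses an in-memory dictionary as a stand-in for a template, because the on-disk .pshift syntax is not reproduced here; the dictionary layout, the function names, and the replacement names are all hypothetical.

```python
from datetime import datetime, timedelta


def shift_timestamp(value, days=0, minutes=0):
    """Shift a detected date/time by the user-specified deltas."""
    return value + timedelta(days=days, minutes=minutes)


# Hypothetical in-memory template with the three required ingredients:
# allowable variations, their replacements, and the parts required for a match.
TEMPLATE = {
    "first": {"Ken": "Bob", "Kenneth": "Robert"},
    "last": {"Lay": "Dunn"},
    "required": ("first", "last"),
}


def shift_name(name, template):
    """Return the shifted name, or the name unchanged if the template does not match."""
    matched, shifted = set(), []
    for token in name.split():
        for part in ("first", "middle", "last"):
            variants = template.get(part, {})
            if token in variants:
                matched.add(part)
                shifted.append(variants[token])
                break
        else:
            return name  # a token outside the allowable variations: not this person
    if not set(template["required"]) <= matched:
        return name  # a required part of the name is missing
    return " ".join(shifted)


shifted = shift_timestamp(datetime(2001, 5, 14, 9, 30), days=-500, minutes=630)
# shifted == datetime(1999, 12, 31, 20, 0)
```

With this illustrative template, "Kenneth Lay" becomes "Robert Dunn", while "Ken" alone is left unchanged because the last name is required for a match.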
An example of a .pshift file specifying rules for shifting variations of the name Ken Lay is shown in
C. Generating Fake E-Mails
At random points with configurable frequencies, believable fake e-mails are generated and inserted into a user's destination e-mail directory. The system may be limited to at most one fake e-mail generated per user. The content of the fake e-mails is based on configurable templates, and each template is applied at most once during a single run of the DGS system. Each generated fake e-mail may contain fake credentials. The fake e-mails are designed to entice a hacker who steals data into using the fake credentials at a fake website. Victims are automatically notified via e-mail when fake credentials have been used, indicating that their data has been stolen.
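The generation step might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the template text, the placeholder URL, the dictionary returned, and the function name `maybe_fake_email`. It shows only the three behaviors named above: insertion with a configurable probability, each template used at most once per run, and freshly generated credentials.

```python
import random
import secrets
from string import Template

# Illustrative template; real templates would come from the configuration.
TEMPLATES = {
    "portal": Template(
        "Hi,\n\nThe new portal is live at $url\n"
        "Username: $username\nPassword: $password\n"
    ),
}


def maybe_fake_email(templates, used, probability, rng, url, username):
    """Return a fake e-mail dict, or None if no e-mail is inserted at this point."""
    available = [name for name in templates if name not in used]
    if not available or rng.random() >= probability:
        return None
    name = rng.choice(available)
    used.add(name)  # each template is applied at most once per run
    password = secrets.token_urlsafe(12)
    body = templates[name].substitute(url=url, username=username, password=password)
    return {"template": name, "body": body, "credentials": (username, password)}


used_templates = set()
fake = maybe_fake_email(
    TEMPLATES, used_templates, 1.0, random.Random(0),
    "https://portal.example.invalid/login", "klay",
)
```

The generated credentials would need to be recorded so that a later login attempt at the fake website can be tied back to the affected user.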
D. Running the DGS System
To run the system, the user may be required to specify the base directory within which all user directories reside. Additionally, the user may specify various optional parameters. If the user specifies a command with an incorrect format, a message may be displayed, such as the example screenshot depicted in
The system has been tested on a corpus consisting of: (1) a subset of the Enron E-mail Dataset including 8,419 e-mails from 20 users, all converted to the proper .json format; (2) 215 .docx and .pdf files from the MITRE corpus; these MITRE files have been randomly scattered across user file folders and randomly added as attachments to e-mails; (3) 118 .txt files, representing the MITRE .pdf files converted to text (these are Unicode text files), plus one additional manually created ASCII .txt file; these .txt files were randomly scattered across user file folders (but these are not used as attachments); and (4) one additional complex .json file, including complex, formatted attachments and a .json field with an HTML-formatted body. Also included were five .pshift files. Assuming that the test corpus is placed in the directory “corpus/enron_plus_mitre” relative to the main system, a test run using all of the provided .pshift files, with specified deltas of −500 days and +630 minutes, can be run as follows:
python batch_process_json.py corpus/enron_plus_mitre -d -500 -m 630 -l log1.txt -p KenLay.pshift -p DougGilbert-smith.pshift -p NatalieMcCarthy.pshift -p WandaCuny.pshift -p CarlReiber.pshift
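A command line of this shape could be parsed with Python's argparse module along the following lines. This is a sketch, not the system's actual parser: the long option names are invented for readability, and the log-file option (shown here as -l) is an assumption. Only the required base directory, the day and minute deltas, and the repeatable template option are taken from the description.

```python
import argparse


def build_parser():
    """Build a parser for a base directory plus optional deltas, log, and templates."""
    parser = argparse.ArgumentParser(description="Shift dates, times, and names in a corpus.")
    parser.add_argument("base_dir", help="directory containing all user directories")
    parser.add_argument("-d", "--days", type=int, default=0, help="day delta")
    parser.add_argument("-m", "--minutes", type=int, default=0, help="minute delta")
    parser.add_argument("-l", "--log", default=None, help="log file (assumed option)")
    parser.add_argument(
        "-p", "--pshift", action="append", default=[],
        help="name-shifting template; may be given more than once",
    )
    return parser


args = build_parser().parse_args(
    ["corpus/enron_plus_mitre", "-d", "-500", "-m", "630",
     "-p", "KenLay.pshift", "-p", "WandaCuny.pshift"]
)
# args.days == -500, args.minutes == 630,
# args.pshift == ["KenLay.pshift", "WandaCuny.pshift"]
```

Because none of the defined options looks like a negative number, argparse accepts `-500` as the value of `-d` rather than treating it as another option.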
In addition to the required and optional command line arguments, the user may also configure many different aspects of the system through a configuration file. These configurable properties may include: (1) the default name of the log file; (2) the names of the subdirectories for original and modified files and e-mails in the corpus; (3) the expected fields in the .json files; (4) the probabilities determining how often fake e-mails are randomly generated; (5) the content of the templates for generating fake e-mails; (6) the range of random offsets from the base e-mails for timestamps of fake e-mails; (7) whether or not to delete original e-mails and files after the modified versions have been created; and (8) the user information for the user running the system, so they may be notified when an unauthorized user has been lured to a fake website. In general, these properties tend to be more technical properties that are not likely to change frequently between runs of the system.
E. Examining System Output
The system may include a json_diff utility, written in Python and runnable from the command line, which displays the differences between two specified json files in a diff-like format.
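One way such a utility might work is sketched below using the standard json and difflib modules. It is not the system's own json_diff; it assumes that normalizing both files (sorted keys, fixed indentation) before comparison is acceptable, so that only differences in content are reported.

```python
import difflib
import json


def json_diff(old_text, new_text, old_name="original", new_name="changed"):
    """Return unified-diff lines between two JSON documents after normalization."""
    def normalize(text):
        # Sort keys and re-indent so key order and whitespace do not show up as changes.
        return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()

    return list(
        difflib.unified_diff(
            normalize(old_text), normalize(new_text),
            fromfile=old_name, tofile=new_name, lineterm="",
        )
    )


before = '{"date": "2001-05-14", "subject": "Hi"}'
after = '{"subject": "Hi", "date": "1999-12-31"}'
for line in json_diff(before, after):
    print(line)
```

Identical documents produce an empty list, so the utility can also serve as a quick check that an e-mail was left unmodified.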
To compare modified .docx files or .pdf files with the corresponding originals, the user may need to open both files and compare them by eye. Of course, for these file types, we are interested not only in the content that has changed, but also in ensuring that the formatting has stayed the same, or has changed in an acceptable manner.
Various other modifications will be obvious to a person of skill in the art without deviating from the inventions claimed herein.
This application is a continuation of U.S. patent application Ser. No. 15/233,563, filed on Aug. 10, 2016, which claims the benefit of U.S. Provisional Patent Application No. 62/202,997, filed on Aug. 10, 2015. The entire contents of these applications are incorporated herein by reference.
Number | Date | Country
---|---|---
62202997 | Aug 2015 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 15233563 | Aug 2016 | US
Child | 16680873 | | US