There are a wide variety of searching and filtering technologies available; however, they have various shortcomings. For example, the ability of users to precisely express complex filter concepts is limited. And, the performance of various tools can degrade as the search criteria become more complex.
A variety of techniques can be used for filtering. Hardware acceleration can be used to provide superior performance.
A rich set of features can be supported to enable powerful filtering via compact filter rule sets. For example, filter rules can implement locality operators that are helpful for establishing context. Concept rules can be used to specify concepts that can be re-used when specifying rules. Rules can support weighting.
Considerable ease-of-use and performance improvements in the filtering process can be realized.
Such technologies can be used in a variety of domains, such as search, email filtering (e.g., outgoing or incoming), and the like. As described herein, a variety of other features and advantages can be incorporated into the technologies as desired.
The foregoing and other features and advantages will become more apparent from the following detailed description of disclosed embodiments, which proceeds with reference to the accompanying drawings.
The technologies described herein can be used for a variety of hardware-accelerated context-sensitive filtering scenarios. Adoption of the technologies can provide performance superior to software-only implementations.
The technologies can be helpful to those desiring to reduce the amount of processing time to filter documents. Beneficiaries include those in the domain of search, security, or the like, who wish to perform large-scale filtering tasks. End users can also greatly benefit from the technologies because they enjoy higher performance computing and processing.
The engine 160 includes coordinating software 162, which coordinates the participation of the specialized hardware 164, which stores a pattern list 145 (e.g., derived from the rules 130 as described herein).
The filter engine 160 can classify incoming documents 120 as permitted documents 170 or not permitted documents 180. For example, the filter engine 160 can output an indication of whether a document is permitted (e.g., by an explicit output or by placing the document in a different location depending on whether it is permitted or the like). Although the example shows classifying documents as permitted or not permitted, other scenarios are possible, such as matching or not matching, or the like.
In practice, the systems shown herein, such as system 100, can be more complicated, with additional functionality, more complex inputs, more instances of specialized hardware, and the like. Load balancing can be used across multiple filter engines 160 if the resources needed to process incoming documents 120 exceed processing capacity.
In any of the examples herein, the inputs, outputs, and engine can be stored in one or more computer-readable storage media or computer-readable storage devices, except that the specialized hardware 164 is implemented as hardware. As described herein, a software emulation feature can emulate the hardware 164 if it is not desired to be implemented as hardware (e.g., for testing purposes).
At 210, filter rules are received. Such rules can comprise conditions indicating when the rule is met and associated word patterns. As described herein, locality conditions, supplemental definitions, and weightings can be supported.
At 220, configuration information derived from the rules is sent to the specialized hardware. As described herein, such information can comprise the word patterns appearing in the rules.
At 230, a document to be filtered is sent to specialized hardware for evaluation.
At 240, the document is evaluated in specialized hardware according to the configuration information. Such an evaluation can be a partial evaluation of the document. Further evaluation can be done elsewhere (e.g., by software that coordinates the filtering process).
At 250, the evaluation results are received from the specialized hardware. As described herein, such results can include an indication of which patterns appeared where within the document.
At 260, the document is classified based on the evaluation by the specialized hardware.
The acts 210 and 220 can be done as a part of a configuration process, and acts 230, 240, 250, and 260 can be done as part of an execution process. The two processes can be performed by the same or different entities. The acts of 230, 250, and 260 can be performed outside of specialized hardware (e.g., by software that coordinates the filtering process).
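For purposes of illustration, the division of labor between the coordinating software and the specialized hardware can be sketched in Java as follows. The type and method names (e.g., HardwareAccelerator, loadPatterns, evaluate) are hypothetical placeholders for whatever driver or API a particular hardware product exposes; the sketch only shows the shape of acts 210-260, not a required implementation.

    import java.util.List;

    // Minimal sketch of the configuration/execution split; all names are
    // hypothetical placeholders for the actual hardware driver or API.
    public class FilterCoordinator {

        public interface HardwareAccelerator {
            void loadPatterns(List<String> patterns);         // acts 210-220
            List<PatternHit> evaluate(String documentText);   // act 240 (in hardware)
        }

        public static class PatternHit {
            public final int patternId;
            public final int offset;
            public PatternHit(int patternId, int offset) {
                this.patternId = patternId;
                this.offset = offset;
            }
        }

        public interface RuleScorer {                         // software-side rule logic
            double score(String documentText, List<PatternHit> hits);
        }

        private final HardwareAccelerator hw;
        private final RuleScorer scorer;
        private final double threshold;

        public FilterCoordinator(HardwareAccelerator hw, RuleScorer scorer,
                                 double threshold) {
            this.hw = hw;
            this.scorer = scorer;
            this.threshold = threshold;
        }

        // Acts 230-260: submit the document, receive pattern-hit locations,
        // and finish the classification in coordinating software.
        public boolean isPermitted(String documentText) {
            List<PatternHit> hits = hw.evaluate(documentText);
            return scorer.score(documentText, hits) < threshold;
        }
    }

In such an arrangement, acts 210 and 220 correspond to calling loadPatterns during configuration, while acts 230 through 260 correspond to isPermitted during execution.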
The method 200 and any of the methods described herein can be performed by computer-executable instructions stored in one or more computer-readable media (e.g., storage or other tangible media) or stored in one or more computer-readable storage devices.
The compilation tool 360 can comprise a pattern extractor 365, which can extract patterns from the rules 320, supplemental definitions 325, or both. As part of the compilation process, various pre-processing and other techniques can be used to convert the rules 320, 325 into a hardware pattern list 380 and rule processing data structure 390. The compilation tool 360 can expand the compact representation of the rules 320, 325 to a more exhaustive and explicit representation of the rules that is acceptable to the specialized hardware.
The hardware pattern list 380 can include the patterns appearing in the rules 320, 325. In practice, the pattern list 380 can be implemented as a binary image that is loadable into specialized hardware, causing the hardware to provide the document evaluation results described herein.
The rule processing data structure 390 can be used as input by software that coordinates the hardware-accelerated context-sensitive filtering. For example, it can take the form of a Java class that implements various functionality associated with processing the evaluation results provided by the specialized hardware (e.g., to process the rules 320, 325).
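One possible shape for such a generated rule processing data structure is sketched below in Java. The field names and layout are illustrative assumptions rather than a description of the actual generated class; the idea is simply that each compiled rule carries its locality condition, weighting, and the identifiers of the hardware patterns it references, so that runtime software can map evaluation results back to rules.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of a generated rule processing data structure.
    // Names and fields are assumptions, not the actual generated class.
    public class CompiledRuleSet {

        public enum Locality { DOCUMENT, PARAGRAPH, SENTENCE, CLAUSE }

        public static class CompiledRule {
            public final String name;               // e.g., "HorribleWeather"
            public final Locality locality;         // the "in the same x" condition
            public final double weight;             // applied when the rule is satisfied
            public final List<Integer> patternIds;  // hardware patterns the rule references

            public CompiledRule(String name, Locality locality, double weight,
                                List<Integer> patternIds) {
                this.name = name;
                this.locality = locality;
                this.weight = weight;
                this.patternIds = patternIds;
            }
        }

        // patternId -> the rules that reference it, so hardware hits can be
        // mapped back to candidate rules quickly at run time.
        public final Map<Integer, List<CompiledRule>> rulesByPattern;
        public final List<CompiledRule> rules;

        public CompiledRuleSet(List<CompiledRule> rules,
                               Map<Integer, List<CompiledRule>> rulesByPattern) {
            this.rules = rules;
            this.rulesByPattern = rulesByPattern;
        }
    }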
At 400, filter rules are received. Any of the rules and supplemental definitions described herein can be supported.
The rules are compiled at 420. Compilation places the rules into a form acceptable to the specialized hardware (e.g., a driver or API associated with the hardware). Some stages (e.g., the later hardware-related stages) of compilation can be implemented via software bundled with the specialized hardware.
At 430, the rules are pre-processed. Such pre-processing can include expanding a compact representation of the rules to a more exhaustive and explicit representation of the rules that is acceptable to the specialized hardware (e.g., by expanding references to supplemental definitions).
At 440, patterns are extracted from the rules. As described herein, such patterns can take the form of regular expressions.
At 450, a rule processing data structure is generated. Such a data structure can be used in concert with evaluation results provided by the specialized hardware to process the rules.
At 470, configuration information for the specialized hardware is output. Such configuration information can take the form of a binary image or the list of patterns that can be used to generate a binary image. To achieve configuration of the hardware, the patterns are converted to or included in a specialized hardware format.
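The pre-processing and pattern extraction of acts 430 and 440 can be illustrated with the following Java sketch, which expands concept-rule references (written here, consistent with the rule syntax described below, as tokens beginning with "=") into an explicit set of word patterns. The helper names and the sample concept definitions are hypothetical.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative pre-processing step: expand concept-rule references
    // (written here as "=name") into explicit patterns for the hardware.
    public class PatternExtractor {

        // Recursively expands tokens; a token beginning with '=' refers to a
        // concept rule (supplemental definition), possibly nested.
        public static Set<String> expand(List<String> tokens,
                                         Map<String, List<String>> concepts) {
            Set<String> patterns = new LinkedHashSet<String>();
            for (String token : tokens) {
                if (token.startsWith("=")) {
                    List<String> nested = concepts.get(token.substring(1));
                    if (nested != null) {
                        patterns.addAll(expand(nested, concepts));
                    }
                } else {
                    patterns.add(token);   // a literal word pattern / regex
                }
            }
            return patterns;
        }

        public static void main(String[] args) {
            Map<String, List<String>> concepts = Map.of(
                "weather", List.of("rain\\w*", "storm\\w*", "=wind"),
                "wind", List.of("wind\\w*", "gust\\w*"));
            // A compact rule such as "c:(horrible =weather)" expands to the
            // explicit pattern list ultimately sent to the specialized hardware.
            System.out.println(expand(List.of("horrible", "=weather"), concepts));
        }
    }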
The specialized hardware 530 typically includes a hardware (e.g., binary) image 540 that is understandable by the processor(s) of the specialized hardware 530. Incorporated into the hardware image 540 is a pattern list 545 (e.g., derived from the filter rules as described herein). In practice, the pattern list may not be recognizable within the hardware image 540 because it may be arranged in a way that provides superior performance.
The specialized hardware 530 outputs evaluation results that include location information 550. For example, the location information 550 can indicate where within the document 520 the patterns in the pattern list 545 appear (e.g., for patterns appearing in the document, respective locations are indicated).
The document scorer can accept the location information 550 as input and use rules logic 562 and a rule processing data structure 564 to score the document 520, providing scoring results 580. As described herein, the rules logic 562 and data structure 564 can support locality conditions that can be determined via examination of the location information 550 (e.g., in combination with locality analysis of the document 520).
The scoring results 580 can be used to classify the document 520 (e.g., via a threshold or other mechanism).
In practice, coordinating software can coordinate document submission to the specialized hardware 530. Such hardware can be incorporated with the document scorer 560, or be run as a separate module.
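A simplified Java sketch of the document scorer's role follows. The interfaces, the weighted-sum scoring, and the threshold classification are illustrative assumptions; in practice the rules logic 562 and rule processing data structure 564 can be considerably richer.

    import java.util.List;

    // Illustrative document scorer: combines hardware-reported pattern hit
    // locations with rule logic to produce a score and a classification.
    // All names and the scoring policy are assumptions for illustration.
    public class DocumentScorer {

        public static class Hit {
            public final int patternId;
            public final int offset;
            public Hit(int patternId, int offset) {
                this.patternId = patternId;
                this.offset = offset;
            }
        }

        public interface LocalityChecker {
            // True when a rule's required patterns co-occur within the same
            // document/paragraph/sentence/clause, per the rule's condition.
            boolean satisfied(String text, List<Hit> hits, int ruleId);
        }

        public interface WeightLookup {
            double weight(int ruleId);
        }

        public static double score(String text, List<Hit> hits, int[] ruleIds,
                                   LocalityChecker locality, WeightLookup weights) {
            double score = 0.0;
            for (int ruleId : ruleIds) {
                if (locality.satisfied(text, hits, ruleId)) {
                    score += weights.weight(ruleId);   // weighted rules contribute
                }
            }
            return score;
        }

        // Classification via a simple threshold, as one possible mechanism.
        public static boolean permitted(double score, double threshold) {
            return score < threshold;
        }
    }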
At 610, a document is received for filtering.
At 620, pattern hit locations (e.g., locations of patterns extracted from the rules) are determined via specialized hardware. For example, the document can be sent to the hardware and location information can be received in response.
At 630, the document is scored via the pattern hit locations. As described herein, rules can be weighted, resulting in a score that reflects such weightings.
At 640, a document score is output.
The filter engine comprises at least two stages 761 and 762, which apply respective rule sets 731 and 732 (e.g., using the hardware accelerated filtering technologies described herein). The stages 761 and 762 can differ fundamentally in their function. For example, one may provide context-sensitive filtering, while the other need not. One may implement locality conditions, while the other need not.
The engine 760 can provide output in the form of a classification of the document 720 (e.g., permitted 770 or not permitted 780). An intermediate not permitted classification 768 can be supported (e.g., for documents that are knocked out before being submitted to the last stage).
At 810, a document is received.
At 820, the document is filtered according to a first stage.
At 830, the document is filtered according to a second stage. Additional stages can be supported. A knock-out indication can prevent the second stage filtering from occurring.
At 840, the document is classified based on results of one or more of the stages.
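The staged arrangement can be sketched as a short pipeline in which an early stage can knock a document out before a later, more expensive stage runs. The stage interface and the toy stages in the example are illustrative assumptions only.

    import java.util.List;

    // Illustrative two-stage filter pipeline with knock-out between stages.
    public class StagedFilter {

        public enum Result { PASS, KNOCK_OUT }

        public interface Stage {
            Result apply(String documentText);
        }

        // Returns true if the document is permitted by every stage; a knock-out
        // at an earlier stage prevents later stages from running at all.
        public static boolean run(String documentText, List<Stage> stages) {
            for (Stage stage : stages) {
                if (stage.apply(documentText) == Result.KNOCK_OUT) {
                    return false;   // intermediate "not permitted" outcome
                }
            }
            return true;
        }

        public static void main(String[] args) {
            Stage englishGate = text ->
                text.toLowerCase().matches(".*\\b(the|and|of)\\b.*")
                    ? Result.PASS : Result.KNOCK_OUT;      // crude language gate
            Stage contextFilter = text ->
                text.toLowerCase().contains("proprietary")
                    ? Result.KNOCK_OUT : Result.PASS;      // stand-in for rule scoring
            System.out.println(run("The report contains proprietary data.",
                    List.of(englishGate, contextFilter))); // prints false
        }
    }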
In any of the examples herein, hardware acceleration can take the form of submitting a document to specialized hardware to achieve some or all of the context-sensitive filtering functionality. For example, a software/hardware cooperation arrangement can divide labor so that certain tasks (e.g., pattern matching) are performed in hardware that is specially designed to accommodate such tasks.
In any of the examples herein, specialized hardware can take the form of hardware dedicated to specialized tasks, such as pattern matching, signature recognition, high-speed document processing, or the like.
Although any of a variety of hardware can be used, some examples herein make use of a NetLogic Microsystems® NLS220HAP platform card with NLS205 processors available from NetLogic Microsystems of Santa Clara, Calif. Other hardware products by NetLogic Microsystems or other manufacturers can be used as appropriate.
In any of the examples herein, a context-sensitive filter can be implemented by a collection of filter rules. Context sensitivity can be achieved via locality conditions as described herein.
In any of the examples herein, filter rules can take a variety of forms. In one arrangement, concept rules and weighted rules are supported. Concept rules can be defined to identify concepts and used (e.g., reused) in other rule definitions that piece together the concepts to implement a filter.
Concept rules can be used to define words, phrases, or both. Such a rule can specify a plurality of conceptually-related words for later reuse. Other rules can incorporate such concept rules by reference. A concept rule can specify one or more other concept rules as words for the concept rule, thereby implementing nesting.
Weighted rules can indicate a weighting to be applied when the rule is satisfied. For example, a highly weighted rule can result in a higher score when the rule is met. Negative weightings can be supported to tend to knock out documents that have certain specified conditions.
A rule can be satisfied more than once, resulting in multiple applications of the weight. Other weighting techniques include constant weight, slope weight, step weight, and the like.
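For illustration, the constant, slope, and step weighting techniques can be sketched as simple functions of the number of times a rule is satisfied. The exact functional forms used in any particular implementation may differ; these are assumed forms chosen to show the distinction.

    // Illustrative weighting functions applied per rule, as a function of how
    // many times the rule was satisfied in a document. Forms are assumptions.
    public class Weighting {

        // Constant weight: the same contribution regardless of hit count
        // (beyond the first hit).
        public static double constant(int hits, double w) {
            return hits > 0 ? w : 0.0;
        }

        // Slope weight: contribution grows linearly with each additional hit.
        public static double slope(int hits, double wPerHit) {
            return hits * wPerHit;
        }

        // Step weight: contribution applies once the hit count reaches a threshold.
        public static double step(int hits, double w, int minHits) {
            return hits >= minHits ? w : 0.0;
        }

        // Negative weights can be used to tend to knock out documents that
        // exhibit specified (undesirable) conditions.
        public static void main(String[] args) {
            System.out.println(constant(3, 5.0));   // 5.0
            System.out.println(slope(3, 5.0));      // 15.0
            System.out.println(step(3, 5.0, 4));    // 0.0
            System.out.println(slope(2, -4.0));     // -8.0
        }
    }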
Nested rule definitions can be used with advantage to achieve robust, reusable rule sets without having to manually explicitly define complex rules for different domains.
In any of the examples herein, supplemental definitions can take the form of rules that are reused from a library of rules that may be of use to rule developers. For example, concept rules can be used as supplemental definitions. The supplemental definitions can be supplied in addition to the rules and processed together with the rules. For example, a simple rule in compact form may result (e.g., via preprocessing) in a large number of rules and associated patterns that are ultimately sent to the hardware.
In any of the examples herein, locality operations can support specifying a locality condition. Such a condition can take the form of “in the same x,” where x can be a document, paragraph, sentence, clause, or the like. Words specified must satisfy the specified condition (e.g., be present and be in the same x) to meet the locality condition.
In any of the examples herein, the syntax for specifying a rule having a locality condition can be specified via indicating the locality type, and words and/or concept rule references (e.g., enclosed in delimiters). The syntax can support having the locality type followed by the words and/or concept rule references.
For example, a possible syntax is <locality type>:(<words and/or concept rule references>).
For example, the locality type can be reduced to a single character (e.g., d, p, s, c, or the like), and the words and/or concept rule references can be specified in between parenthesis after the colon.
Concept rules can be indicated by a special character (e.g., “=”) next to (e.g., preceding) the concept rule name.
So, for example, c:(horrible =weather) specifies that the word “horrible” and any word defined by the concept rule “weather” must be in the same clause in order for the rule to be met.
In any of the examples herein, a pattern can take the form of an expression that can be matched against text. Wildcard, any-letter, and other operators can be supported. Regular expressions can be supported to allow a wide variety of useful patterns.
Such patterns are sometimes called “word” or “word patterns” because they typically attempt to match against words in text and can be processed accordingly.
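A few hypothetical word patterns of the kind that might be extracted from filter rules are shown below as regular-expression strings, along with a small check that reports where they appear, mirroring the kind of location information the specialized hardware returns. The specific patterns are made-up examples.

    // Hypothetical word patterns (regular expressions) of the kind extracted
    // from filter rules and sent to the specialized hardware.
    public class ExamplePatterns {
        public static final String[] PATTERNS = {
            "\\bpropriet\\w+\\b",     // proprietary, proprietor, ...
            "\\bconfidential\\b",     // exact word
            "\\bstorm\\w*\\b",        // storm, storms, stormy, ...
            "\\bhorrible\\b"
        };

        public static void main(String[] args) {
            String text = "The stormy weather was horrible.";
            for (String p : PATTERNS) {
                java.util.regex.Matcher m =
                    java.util.regex.Pattern.compile(p).matcher(text);
                while (m.find()) {
                    // Report the pattern and its location, much as the hardware
                    // evaluation results report where patterns appear.
                    System.out.println(p + " at " + m.start());
                }
            }
        }
    }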
In any of the examples herein, the hardware pattern list can take the form of a list of patterns that are sent to the specialized hardware for identification in documents. In practice, the pattern list can be incorporated into a binary image that, when executed by the hardware, produces evaluation results that can be processed by software to determine whether conditions in the rules have been met.
In any of the examples herein, evaluation results can take the form of results provided by the specialized hardware to indicate evaluation performed by the hardware against a document. For example, the locations of patterns (e.g., extracted from filter rules) within a document can be indicated in the results.
In any of the examples herein, a document can take any of a variety of forms, such as email messages, word processing documents, web pages, database entries, or the like. Because the technologies herein are directed primarily to text, such documents typically include textual components as part of their content.
In any of the examples herein, documents can be classified according to filtering. So, for example, outgoing email messages can be classified as permitted or not permitted. Any number of other classification schemes can be supported as described herein.
In some cases, it may be desirable to have a human reviewer process certain documents identified via filtering.
In any of the examples herein, filter rules can be compiled to a form acceptable to the specialized hardware and usable by the coordinating software. Pre-processing can include expanding the rules (e.g., according to concept rules referenced by the rules).
In any of the examples herein, multiple stages can be used. For example, a first stage may determine whether a document has sufficient content in a particular human language (e.g., English). A subsequent stage can take documents that qualify (e.g., have sufficient English content) and perform context-sensitive filtering on them. A given stage may or may not use hardware acceleration and pattern matching.
Such an arrangement can use an earlier hardware-accelerated pattern matching stage to knock out documents that are inappropriate or unexpected by a subsequent stage.
In any of the examples herein, the technologies can be applied to implement an email filter. For example, incoming or outgoing email can be processed by the technologies to determine whether an email message is permitted or not permitted (e.g., in an outgoing scenario, whether the message contains sensitive or proprietary information that is not permitted to be exposed outside the organization).
After filtering is performed, a document can be displayed such that the display of the document depicts words in the document that satisfy the filter rules with distinguishing formatting (e.g., highlighting, bold, flashing, different font, different color, or the like).
Navigation between hits can be achieved via clicking on a satisfying word. For example, clicking the beginning of a word can navigate to a previous hit, and clicking on the end of a word can navigate to a next hit.
In any of the examples herein, a software emulator can be used in place of the specialized hardware. Such an arrangement can be made switchable via a configuration setting and can be useful for testing purposes.
The technologies described herein can use a variety of rule compilation techniques. User Target Rules can be implemented as filter rules representing concepts of interest to the user. They can use a syntax that allows reuse of synonym definitions and supports hierarchical relationships via nested definitions. The rules contain references to locality (e.g., Entire Document, Paragraph, Sentence, and Clause). A line expresses a locality followed by a set of Unix regular expressions. The rules can be compiled to optimize data representation and also generate Hardware Patterns. The hardware patterns can be derived from the rules by identifying the unique regular expressions found in a rule set.
The rules can be compiled after any changes are made. Filtering of content uses a compiled and optimized Java class at run-time. The hardware pattern matching side also compiles the regular expressions internally into an optimized representation supported by the hardware (e.g., NetLogic or the like) that is likewise used at run-time.
Input text is analyzed by pattern matching hardware to identify index locations of respective matching patterns in the original text. The locality index identifies the start and ending index of paragraphs, sentences and clauses in the original text. These two outputs are then combined to determine which target rules matched. Based on the number of matches for respective rules, the total input score is computed as illustrated in
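The combination of pattern hit indices with a locality index can be sketched as follows. The boundary detection shown is deliberately naive (splitting on simple end punctuation); a real implementation would perform a more careful locality analysis, and all names are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative combination of pattern hit indices with a locality index:
    // two hits satisfy a "same sentence" condition when they fall between the
    // same pair of sentence boundaries. The boundary detection is naive.
    public class LocalityIndex {

        // Returns the end index (exclusive) of each sentence in the text.
        public static List<Integer> sentenceEnds(String text) {
            List<Integer> ends = new ArrayList<Integer>();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c == '.' || c == '!' || c == '?') {
                    ends.add(i + 1);
                }
            }
            ends.add(text.length());   // trailing text without end punctuation
            return ends;
        }

        // Maps a character offset to the sentence that contains it.
        public static int sentenceOf(int offset, List<Integer> ends) {
            for (int s = 0; s < ends.size(); s++) {
                if (offset < ends.get(s)) {
                    return s;
                }
            }
            return ends.size() - 1;
        }

        public static boolean inSameSentence(int offsetA, int offsetB, String text) {
            List<Integer> ends = sentenceEnds(text);
            return sentenceOf(offsetA, ends) == sentenceOf(offsetB, ends);
        }

        public static void main(String[] args) {
            String text = "The weather was horrible. The picnic was fun.";
            System.out.println(inSameSentence(4, 16, text));   // same sentence: true
            System.out.println(inSameSentence(4, 30, text));   // different sentences: false
        }
    }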
The matched rules can be used to distinctively depict (e.g., highlight) the matching concepts in the original text, using color coding to represent the weighted contribution of each matched pattern. Patterns that contribute to a rule with high weights are colored in a first color (such as red), and the ones with the smallest contribution are colored in a second color (such as green). Shades in between the first color and the second color are used to indicate the amount of contribution to the total score by a specific pattern. The distinctively depicted text can also be hyperlinked to allow hit-to-hit navigation.
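One way such a weight-to-color mapping can be implemented is sketched below; the linear interpolation between a red end and a green end of the scale is an illustrative choice rather than a description of the actual rendering code.

    // Illustrative mapping from a pattern's weighted contribution to a display
    // color: high contributions shade toward red, low contributions toward green.
    public class HitColoring {

        // contribution is normalized to [0, 1] relative to the highest-weighted rule.
        public static String cssColor(double contribution) {
            double t = Math.max(0.0, Math.min(1.0, contribution));
            int red = (int) Math.round(255 * t);
            int green = (int) Math.round(255 * (1.0 - t));
            return String.format("#%02x%02x00", red, green);
        }

        public static void main(String[] args) {
            System.out.println(cssColor(1.0));   // #ff0000 (highest contribution)
            System.out.println(cssColor(0.5));   // #808000 (intermediate shade)
            System.out.println(cssColor(0.0));   // #00ff00 (lowest contribution)
        }
    }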
Matching patterns are highlighted and color coded based on the weight of the target rule that contains the pattern. Hit-to-hit navigation allows the user to go to the next instance of that pattern by clicking on the last part of the word. The user can go to the previous instance by clicking on the first part of the word. Also as depicted in
Various features of the technologies can be implemented in a tool entitled “Indago,” various features of which are described herein.
Indago's approach to deep-context data searches can address demands for the rapid and accurate analysis of content. Indago can provide in-line analysis of large volumes of electronically transmitted or stored textual content; it can rapidly and accurately search, identify, and categorize the content in documents or the specific content contained in large data streams or data repositories (e.g., any kind of digital content that contains text) and return meaningful responses with low false-positive and false-negative rates. Indago can perform automated annotation, hyperlinking, categorization, and scoring of documents, and its ability to apply advanced rule-syntax concept-matching language enables it, among other things, to identify and protect sensitive information in context.
Indago's capabilities meet the needs of many applications including, but not limited to:
Such applications can benefit from rapid, accurate, context-sensitive search capabilities, as well as the potential to block the loss of sensitive information or intellectual property.
Indago's benefits include, but are not limited to, the following:
Indago can search, identify, and categorize the content in documents or specific content in large data streams or data repositories and return meaningful and accurate responses.
Indago can perform context-sensitive analysis of large repositories of electronically stored or transmitted textual content.
As data generation and storage technologies have advanced, society itself has become increasingly reliant upon electronically generated and stored data. Digital content is proliferating faster than humans can consume it. Current search/filter technology is well suited for simple searches/matches (that is, using a single keyword), but a more powerful paradigm is called for when performing complex searches.
Current products do not meet the need for the rapid and accurate context-sensitive analysis of content. Current approaches merely match patterns, but do not attempt to understand the content. Existing products become bogged down to the point of being ineffective when there are more than a small number of search rules; they produce unacceptably high numbers of false positives, and filtering, or using only specific characteristics and discarding anything that does not match, may result in the loss of desired targets. The demand on user time is enormous.
Indago can address such issues via software algorithms, open-source search tools, and a unique, first-time use of off-the-shelf hardware to provide in-line analysis (that is, analysis of data within the flow of information) of large volumes of electronically transmitted or stored textual content. Indago can rapidly and accurately search, identify, and categorize the content in documents or the specific content contained in large data streams or data repositories and return meaningful responses with low false positive and false negative rates.
The algorithms contained in the Indago software can compute the relevance (or goodness-of-fit) of the text contained in an electronic document based on a predefined set of rules created by a subject matter expert. Indago can identify concepts in context, such that searching for an Apple computer does not return information on the fruit apple, and can effectively search and analyze any kind of digital content that contains text (e.g., email messages, HTML documents, corporate documents, and even database entries with free-form text). These software abilities can be implemented via a combination of Indago's software algorithms and the acceleration provided by this use of off-the-shelf hardware that greatly speeds the action of the software.
Indago can benefit from a unique synergy of two significant advancements: developed algorithms that compute concept matches in context combined with a unique, first-time use of off-the-shelf hardware to achieve acceleration of the process. The innovative algorithms can implement the intelligent searching capability of the software, and the integrated hardware, a NetLogic Hardware Acceleration Platform, reduces document-content processing time.
The rules used by the algorithms are easy to implement because they are modular, can be reused or combined in different ways, and are expressed as hierarchical concepts in context. The rules make it easy to encode the subject matter expert's interpretation of complex target concepts (e.g., the sought-after ideas). An example is sensitive corporate knowledge, such as proprietary information that may be inadvertently released to the public if not identified. The rules incorporate user-specified scoring criteria that are used to compute a goodness-of-fit score for each document. This score is used for filtering and identifying “relevant” documents with high accuracy. Rules can be weighted via four different types of weighting functions that are built into the algorithms; the weightings can therefore be used to filter, to optimize precision and/or recall, and to identify duplicate documents. Indago's contextual analysis rules allow the creation of complex target models of concepts. These target models are used to build increasingly sophisticated rules that operate well beyond simple word occurrence, enabling Indago to make connections and discover patterns in large data collections.
An end user of Indago will typically work with Indago's graphical user interface and determine the data repositories, emails, or other information to be searched. The user then receives the annotated results, which will have color-coded highlighted words and text blocks signifying the relative importance of the identified text. An example of Indago's annotated analysis result available to the end user, using a publicly available article (Dan Levin, “China's Own Oil Disaster,” The Daily Beast, Jul. 27, 2010) is shown in
A difference between the disclosed Indago and other approaches is the weighted identification of concepts-in-context. Most filters and search engines use either a simple word list or Boolean logic, a logical calculus of truth values using “and” and “or” operators, to express desired patterns. Simple word-occurrence searching techniques, in which a set of terms is used to search for relevant content, result in a high rate of false positives, often seriously impacting accuracy and usefulness. Current search/filter technology is suited for simple searches/matches (e.g., a Google search for “apple” will return information for both apple the fruit and Apple computers), but a more powerful paradigm, Indago, is required for complex searches, such as finding sensitive corporate knowledge that may be flowing in the intranet and could be accidentally or maliciously sent out via the Internet.
Although Boolean logic can express complex context relationships, it can be problematic: users of Boolean searches are often forced to trade precision for recall and vice versa. A single over-hitting pattern can cause false positives; however, filtering out all documents containing that pattern in a bad context may eliminate documents that do have other targeted concepts in the desired context. If the query contains hundreds of terms, finding the one causing the rejection may require significant effort. For example, if searching for “word A” and “word B”, anything that does not contain both of these words will be rejected.
In contrast to the currently available approaches, Indago includes sophisticated weighting that can be applied to both simple target patterns and to multiple patterns co-occurring in the same clauses, sentences, paragraphs, or an entire document. The concepts-in-context approach of the software allows more precise definition, refinement, and augmentation of target patterns and therefore reduces false positives and false negatives. Search targets have their context within a document taken into account, so that meaning is associated with the responses returned by the software. The goodness-of-fit algorithm contained within the software allows a tight matching of the intended target patterns and also provides a mechanism for tuning rules, thereby reducing both false positives and false negatives. The goodness-of-fit algorithm uses a dataflow process that starts with the extraction of text from electronic documents. The extracted text is then either sent to hardware or software for basic pattern matching. Finally, the results of the matches are used to determine which target pattern rules were satisfied and what score to assign to a particular match. Scores for each satisfied rule are added to compute the overall document score.
Another difference from current approaches is that Indago uses off-the-shelf hardware in a unique way for implementation. This first-time hardware-assisted approach is lossless, highly scalable, and highly adaptable. Tests of Indago's hardware-accelerated prototype have shown a base performance increase of 10 to 100 times on pre-processing tasks.
One possible Indago deployment is an email filter capable of using a hardware-accelerated interface to a Netlogic NLS220HAP platform card. The NetLogic NLS220HAP Layer 7 Hardware Acceleration Platform can be leveraged to provide hardware acceleration for document-content processing. This card is designed to provide content processing and intensive signature recognition functionality for next-generation enterprise and carrier-class networks. This performance far exceeds the capability of current-generation multicore processors while consuming substantially less energy. Further, multiple documents may be processed in parallel against rule databases consisting of hundreds of thousands of complex patterns. Although one embodiment of the technology is designed for the data communications industry, the deep-packet processing technology can be applied to the field of digital knowledge discovery as described herein.
Two open-source software tools which can be used within the Indago software are the following:
The current email exfiltration implementation also has a command-line interface for batch processing. The current design does not preclude the hardware integration of additional functionality, including rules processing, which will result in significant speed-up and unlock additional capability in a fully integrated system.
Indago includes a number of features which provide improvements related to, but not limited to, accuracy, cost, speed, energy consumption, rule-set flexibility, and performance.
While improvement can also be estimated in terms of cost, speed, or energy consumption, for this problem space, a primary improvement Indago offers is best understood in terms of accuracy of matches.
Number of Returns.
A Google search with thousands of results is not useful if the most-relevant document doesn't appear until the 10th page of results; few if any have the patience to scan through many pages of false positives.
False Positives.
Simple search implementations often return many hits with the significant drawback of a high false positive rate, that is, many marginally related or unrelated results that are not useful or are out of context. For instance, the term “Apple” might return results related to both fruit and computers. Indago's low false positive and on-target matching returns only the highly relevant content.
False Negatives.
Similarly, a very fast Boolean search is not useful if relevant documents are missed by the use of a “NOT” clause. Indago's concepts-in-context capability allows searching for very general terms that would otherwise be eliminated because of the high rate of false positives. For example, in the legal field missed documents, which are “false negatives,” may contain key evidence. Indago is focused on finding relevant information with a minimum number of false positives and false negatives.
Speed.
Software solutions become bogged down to the point of being ineffective when there are more than a small number of search rules. Indago's use of hardware acceleration cuts the processing time by a third, and future releases may push additional operations to the hardware for even greater speed-up and added capability.
Recent accuracy tests of the software-only rules implementation used a large body of roughly 450,000 publicly available email messages from the mailboxes of 150 users. The messages and attachments totaled some 1.7 million files. Subject matter experts reviewed a subset of over 12,000 messages under fictitious litigation criteria to identify a set of responsive and non-responsive documents. Tested against their results, Indago demonstrated a successful retrieval rate of responsive documents of 80%.
Indago is an efficient hardware-accelerated approach designed to enable the inspection of every bit of data contained within a document at rates of up to 20 Gbps, far exceeding the capability of current generation multi-core processors, while consuming substantially less energy. Indago is an in-line content analysis service as compared to implementations that require batch processing and large computer clusters with high-end machines. Indago's current hardware acceleration is based upon Netlogic technology. Netlogic technical specifications quote power consumption at approximately 1 Watt per 1 Gbps of processing speed; at the full rate of 20 Gbps of contextual processing, estimated power consumption would be 20 Watts, which is at least ten times better than the power consumption of a single computer node. Comparison to a cluster of computer nodes, as some competing approaches require, is far more impressive. Further, the anticipated cost is less than competing options, while performance is greater.
The degree of flexibility afforded by Indago is not possible with Boolean-only query methods. Indago provides a variety of weighting functions in its grammar definition. Additionally, Indago provides the option of negative weighting to accommodate undesirable contexts. Finally, the query rules are modular and thus easier to maintain than Boolean queries.
The amount of digital content that must be analyzed to solve real-world problems continues to grow exponentially. Humans are excellent at quickly grasping the general concept of a document, but person-to-person variability can be significant, and the performance of a single individual can vary greatly depending on competing demands for attention. Teams of people simply cannot consistently analyze very large collections of documents. Indago, on the other hand, excels at performing these repetitive tasks quickly and consistently.
Indago can consistently analyze large collections of unstructured text such that the post-processed output contains scoring and annotation information that focuses analyst attention on the precise documents, and words within those documents, that are most likely to be of interest. By offloading the repetitive tasks associated with the systematic baseline analysis of a large body of documents, humans can do what they do best. Indago's contextual analysis includes color-coded triage to pinpoint attention on high-interest text matches with single-click navigation to each specific textual instance. The efficiency and effectiveness of the subsequent analyst review is enhanced, allowing more time to interpret meaning and trends. In addition, Indago's scoring mechanism allows the added flexibility of tweaking the balance between precision and recall, if desired.
Analysis, sorting, manipulation, and protection of data are required across a diverse set of industries and applications. Digital data is pervasive and the need to analyze textual information exists everywhere. Indago is particularly powerful because its ability to support a hardware-assisted, concept-in-context approach allows domain-optimized algorithm adaptation.
Today's data deluge crosses scientific and business domains and political boundaries. While of great use to many fields, Indago may be particularly useful in foreign policy as it can search out information on specific topics of interest worldwide, thus bringing attention to potential threats or geographical areas that should be monitored for significant events.
“Exfiltration” is a military term meaning the removal of assets from within enemy territory by covert means. It has found a modern usage in computing, meaning the illicit extraction of data from a system. Indago protects sensitive information as follows:
Email Exfiltration.
Exfiltration, General.
Indago allows more complex searches that retrieve relevant content by focusing on the context. For example, the word “Apple” for a non-computer person usually refers to a fruit, but for most computer people it can be either the fruit or the computer company. The use of the rule set and hierarchical concept-in-context matching enables more precise matching for the target interpretation.
Anyone who needs the accurate search and analysis of digital content, and particularly the search of large data streams or large data repositories, is a potential beneficiary of Indago's context-based capability. These needs include, but are not limited to, the following:
Legal Discovery.
Data Mining.
Product Marketing.
Patent Research.
Public Sector.
As technology has continued to advance, modern society has become increasingly reliant upon electronically generated and stored data and information. Digital content is proliferating faster than humans can consume it, and digital archives are growing everywhere, both in number and in size. Correspondingly, the need to process, analyze, sort, and manipulate data has grown tremendously. Applications that alleviate the processing burden and allow users to access and manipulate data faster and to more effectively cross the data-to-knowledge threshold are in demand, particularly if they enable informed, actionable decision-making.
Indago can address this need. Indago can implement a context-based data-to-knowledge tool that provides a powerful paradigm for rapidly, accurately, and meaningfully searching, identifying, and categorizing textual content. Context-based search and analysis creates new and transformative possibilities for the exploration of data and the phenomena the data represent. Indago can reduce the data deluge and give users the power to access and fully exploit the data-to-knowledge potential that is inherent—but latent—in nearly every data collection. End-users and human analysts are thus more efficient and effective in their efforts to find, understand, and synthesize the critical information they need.
Indago can be implemented as a cost-effective solution that provides unparalleled performance and capability, and that leverages proven, commercial off-the-shelf technology. It can enable a broad and diverse set of users (e.g., IT personnel, law firms, scientists, etc.) to engage their data faster, more accurately, and more effectively, thus allowing them to solve problems faster, more creatively, and more productively. Furthermore, Indago can be domain-adaptable, power-efficient, and fully scalable in terms of rule-set size and complexity.
Indago can monitor in-line communications and data exchanges and search large repositories for more complex patterns than possible with today's technologies. Furthermore, current technology is prohibitively burdensome to use because of its high false-positive rates.
Indago can present a unique, outside-the-box innovation in terms of how it can exploit deep-packet-processing technology, its adaptability and breadth of applicability, and its unparalleled performance potential.
As data generation and storage technologies have advanced, society itself has become increasingly reliant upon electronically generated and stored data. Digital content is proliferating faster than humans can consume it. Proliferation of electronic content benefits from automated tools to analyze large volumes of unstructured text. Network traffic as well as large corporate repositories can be scanned for content of interest, either to stop the flow of unwanted information such as corporate secrets or to identify relevant documents for an area of interest.
Network filters rely mostly on simple word list matches to identify “interesting” content, and searches rely on Boolean logic queries. Both approaches have their advantages and limitations. A keyword list is simple and can be implemented easily. Boolean logic with word proximity operators allows finer definition of target patterns of interest. However, both may retrieve too many false positives. A document may contain the right words, but not in the right context. For example, a web search for apple brings both references to the computer company and the fruit. Disclosed herein is an approach that searches for concepts-in-context with a reduced number of false positives. Furthermore, by the use of commercial off-the-shelf hardware, the process can be accelerated significantly so that the analysis can be done in near real time.
Context-based search and analysis can greatly enhance the exploration of data and phenomena by essentially reducing the data deluge and increasing the efficiency and effectiveness of the human analyst and/or end-user to access and fully exploit the data-to-knowledge potential that is inherent but latent in nearly every collection.
The rules allow synonym definition, reuse, nesting, and negative weight to balance precision and recall.
The rules can be an encapsulation of the target knowledge of interest.
The rules can be shown as a graph that depicts the concepts and their relationships, which can serve as a map of the target knowledge.
A scoring algorithm can use rules and weights to determine which parts of the map are matched by a particular document and compute a goodness-of-fit score for an entire document.
Comparable rules for Huge, Weather, HorribleWeather, BadWeather, etc., translate to 31 simple queries.
A more complex example related to “oil disaster preparation” translates to approximately 2.5 million simple queries used to score and annotate incoming documents.
Indago can be implemented as a context-based data-to-knowledge tool that rapidly searches, identifies, and categorizes documents, or specific content contained in large data streams or data repositories, and returns meaningful and accurate responses.
Indago can be implemented as a high-performance, low-power, green solution, and an inline service as compared to implementations that require large computer clusters with high-end machines. Some characteristics can include:
Advanced rule-syntax concept-matching language
The ability to identify concepts-in-context has significant advantages over current approaches.
Indago computes the relevance (or goodness-of-fit) of the text contained in an electronic document based on a predefined set of rules created by a subject matter expert. This approach permits user-defined, domain-optimized content to be translated into a set of rules for fast contextual analysis of complex documents.
Indago can include hundreds of rules that contain “context” references and may be weighted to give a more accurate goodness-of-fit to the target pattern of interest. The goodness-of-fit algorithm allows a tight matching of the intended target patterns and also provides a mechanism for tuning rules. In addition, the concepts-in-context approach allows more precise definition, refinement, and augmentation of target patterns and therefore reduces false positives and false negatives.
Due to the concurrency offered by the specialized hardware (e.g., Netlogic NLS220HAP platform card), Indago can process multiple documents in parallel against rule databases consisting of hundreds of thousands of complex patterns for fast throughput. The matched pattern indices are used to determine context and identify the rules matched. The scoring function then computes the contribution of each match to generate a complete document score for the goodness-of-fit. The results of these matches are then evaluated to construct matches in context in software.
The initial Indago deployment is an email filter capable of using a hardware-accelerated interface to a Netlogic NLS220HAP platform card. Indago leverages the NetLogic NLS220HAP Layer 7 Hardware Acceleration Platform to provide hardware acceleration for document-content processing. This card is designed to provide content-processing and intensive signature-recognition functionality for next-generation enterprise and carrier-class data communications networks. The NLS220HAP is a small-form-factor, PCIe-attached accelerator card, which can be integrated into commercial off-the-shelf (COTS) workstation-class machines. The Netlogic card contains five Netlogic NLS205 single-chip knowledge-based processors.
Each NLS205 processor has the ability to concurrently support rule databases consisting of hundreds of thousands of complex signatures. The unique design and capability of these processors enable the inspection of every bit of data traffic being transferred—at rates up to 20 Gbps—by accelerating the compute-intensive content-inspection and signature-recognition tasks. This performance far exceeds the capability of current generation multicore processors while consuming substantially less energy. While this technology is designed for the data communications industry, deep-packet processing technology can be applied to the field of digital knowledge discovery.
Indago has demonstrated a hardware accelerated base performance increase of one to two orders of magnitude on pre-processing tasks over a software-only implementation.
Exemplary technical supporting information for Indago is described below.
Indago eMail Filter-Hardware Acceleration Interface (eMF-HAI) provides a software interface to the NetLogic NLS220HAP Hardware Acceleration Platform card. This interface allows for the seamless integration of the NLS220HAP card into the larger eMF-HAI application. The software leverages the NetLogic NETL7 knowledge-based processor Software Development Kit (SDK).
The SDK has been used to develop C/C++ based codes that enable the following on the NLS220HAP card: Binary databases generated from application-specific rule sets specified using Perl Compatible Regular Expressions; configuration, initialization, termination, search control, search data, and the setting of device parameters; document processing at extreme rates; and transparent interface between Java-based code and C/C++.
A NetLogic NLS220HAP Layer 7 Hardware Acceleration Platform (HAP) card (see
The NLS220HAP card is supported by a full-featured Software Development Kit (SDK). It is supplied as source code and presents Application Programming Interfaces (API) that provide runtime interfaces to the NLS220HAP card. The SDK includes a database/compiler API and a data plane API. The database/compiler API enables the compilation of databases expressed in either PCRE or raw format. The compiler takes all the pattern groups expressed in PCRE and produces a single binary image for the specific target platform (in our case the NLS220HAP card). The data plane API provides a runtime interface to the NLS220HAP card. It provides interfaces and data structures for configuration, initialization, termination, search control, search data, and the setting of device parameters.
The TWF group is based upon the one thousand most frequent word families developed by Paul Nation at the School of Linguistics and Applied Language Studies at Victoria University of Wellington, New Zealand. The list contains 1,000 word families that were selected according to their frequency of use in academic texts. This list is more sophisticated than the lists created with the Brown corpus, as it contains not only the actual high frequency words themselves but also derivative words which may, in fact, not be used so frequently. With the inclusion of the derived words the total number of words in the list is over 4,100 words.
The DWL group is a domain-specific set of rules defined using PCRE. It can vary in size depending upon the application. The DWL can be defined as individual words of interest, or more complex rules can be defined using PCRE. For the eMF-HAI, code was created (CreateRules.cpp) that combines the two rule-group files, TWF and DWL, into a single, properly formatted, input file suitable for compilation using the NetLogic compiler for the NLS220HAP card. The output file generated by the compiler is a binary image file containing the rule database for the application. This functionality has been subsumed by the higher level DWL rule generation software and should no longer be needed. It is included here for historical and reference purposes. For eMF-HAI, the dataplane API was used to construct the main code for use with the NLS220HAP card. This code accomplishes two functions that are leveraged in the larger document-processing application: (1) the determination whether the document contains enough readable English text to warrant further processing, and (2) the identification and reporting of matching rules, defined in the DWL, within the document. Both functions are accomplished concurrently in a single pass of the document through the hardware.
For the first function, the lengths of matches (in bytes) reported by the hardware for the TWF group are counted. The code understands and corrects for multiple overlapping matches, selecting and counting only the longest match and ignoring any other co-located matches. Once the document has been passed through the interface and the matched bytes have been counted, the code calculates the ratio of matched bytes to the total number of bytes in the document.
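An illustrative Java reconstruction of that readable-text ratio computation follows (the actual eMF-HAI code is written in C/C++): among co-located matches only the longest is counted, matched bytes are summed, and the sum is divided by the document length.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Illustrative computation of the matched-byte ratio used to decide whether
    // a document contains enough readable English text. Among overlapping
    // matches, only the longest is counted; co-located shorter matches are ignored.
    public class ReadableTextRatio {

        public static class Match {
            public final int offset;
            public final int length;
            public Match(int offset, int length) {
                this.offset = offset;
                this.length = length;
            }
        }

        public static double ratio(List<Match> twfMatches, int documentBytes) {
            List<Match> sorted = new ArrayList<Match>(twfMatches);
            sorted.sort(Comparator.comparingInt((Match m) -> m.offset)
                    .thenComparingInt(m -> -m.length));   // longest first per offset
            long matchedBytes = 0;
            int coveredUpTo = 0;   // end of the last counted match
            for (Match m : sorted) {
                if (m.offset >= coveredUpTo) {
                    matchedBytes += m.length;             // longest match at this location
                    coveredUpTo = m.offset + m.length;
                }
                // overlapping, shorter matches are skipped
            }
            return documentBytes == 0 ? 0.0 : (double) matchedBytes / documentBytes;
        }

        public static void main(String[] args) {
            // Two co-located matches at offset 0 (lengths 3 and 7): only the
            // longest (7) is counted; plus a match of length 4 at offset 10.
            List<Match> matches = List.of(new Match(0, 3), new Match(0, 7),
                    new Match(10, 4));
            System.out.println(ratio(matches, 40));   // (7 + 4) / 40 = 0.275
        }
    }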
For the second function the hardware reports back all matches found in the DWL group. For each rule, a count of matches is maintained along with a linked list of elements consisting of the index into the file where the match occurred and the length of the match. Two other counts are maintained per document for the DWL: (1) the number of unique rules that are triggered, and (2) the total rules matched.
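The per-rule bookkeeping for the DWL group can be sketched similarly; this Java version is an illustrative analogue of the described C/C++ structures, with a per-rule match count, a list of (index, length) elements, and the unique-rule and total-match counts.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative per-document bookkeeping for DWL rule matches reported by the
    // hardware: per-rule counts with (index, length) elements, plus the number
    // of unique rules triggered and the total number of matches.
    public class DwlMatchTally {

        public static class Element {
            public final int index;    // offset of the match within the document
            public final int length;   // length of the match in bytes
            public Element(int index, int length) {
                this.index = index;
                this.length = length;
            }
        }

        private final Map<Integer, List<Element>> matchesByRule = new LinkedHashMap<>();
        private int totalMatches = 0;

        public void record(int ruleId, int index, int length) {
            matchesByRule.computeIfAbsent(ruleId, id -> new ArrayList<>())
                         .add(new Element(index, length));
            totalMatches++;
        }

        public int uniqueRulesTriggered() { return matchesByRule.size(); }
        public int totalMatchCount()      { return totalMatches; }
        public int countFor(int ruleId) {
            List<Element> elements = matchesByRule.get(ruleId);
            return elements == null ? 0 : elements.size();
        }
    }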
Since the majority of the document processing in the eMF is written in Java, the C/C++ code produced for the eMF-HAI includes a reasonably simple way of interfacing: a purely file-based interface leveraging inotify was utilized. Inotify is a Linux kernel subsystem that extends file systems to notice changes to their internal structure and report those changes to applications. The inotify-cxx (inotify-cxx.cpp) implementation was used, which provides a C++ interface. POSIX Threads (Pthreads) are used to map five instances of the eMF-HAI code to the five NLS205 processors available on the NLS220HAP card.
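Because the code sketches in this description are given in Java, the file-based handoff can be illustrated with Java's WatchService standing in for inotify; this is an analogue for illustration only, not the eMF-HAI implementation, and the directory and file naming convention shown is assumed.

    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.StandardWatchEventKinds;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;

    // Illustrative analogue of the file-based handoff: the Java side drops a
    // document file into a directory and waits for the hardware-side process to
    // write a corresponding results file. WatchService stands in for inotify.
    public class ResultWatcher {

        public static Path waitForResult(Path resultsDir, String expectedName)
                throws Exception {
            try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
                resultsDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                while (true) {
                    WatchKey key = watcher.take();   // blocks until a file appears
                    for (WatchEvent<?> event : key.pollEvents()) {
                        Path created = (Path) event.context();
                        if (created.getFileName().toString().equals(expectedName)) {
                            return resultsDir.resolve(created);
                        }
                    }
                    key.reset();
                }
            }
        }
    }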
In
The overall eMF was written to provide a software emulation of the eMF-HAI to help with development. The selection of using the hardware or software is accomplished by simply modifying a configuration variable before running the eMF application.
The use of the hardware over an all-software solution can reduce processing time for the whole process by a third because it improves the pattern-matching step by six orders of magnitude over the existing software implementation. Software-only approaches are typically limited to on the order of thousands of implementable rules before severe performance limitations begin to arise. As rule-set size increases, performance decreases due to effects such as memory caching and thrashing. Indago's hardware-accelerated implementation is effectively unlimited. It can be co-designed for fully optimal performance based upon domain and complexity requirements (up to hundreds of thousands of rules, if required) without reaching hardware-imposed limitations.
The approach taken has various possible advantages, at least some of which are the following. First is the use of the hardware to accelerate basic pattern matching, and second is the nested word matching in context that is performed. The hardware was designed to match simple patterns, but nothing precludes its usage with more complex patterns. This is the first step in the filtering process. The matches are then taken into a hierarchical set of rules that estimate goodness of fit towards a target pattern of interest.
The hardware acceleration reduces clock time by five orders of magnitude for that part of the process, making the technology usable in near-real-time applications. The matching-in-context portion significantly reduces the number of false positives and false negatives. A disadvantage of simple word list matching is that it may generate too many results that are not relevant. The context rules are user defined and therefore can be used in any domain.
General email filtering can benefit from this approach, as commercial organizations cannot easily monitor intellectual property that may be divulged accidentally in outgoing emails. Data mining for law firms can also benefit, as the relevant document set for a litigation may be large. The rules can be customized to represent the responsive set of target patterns that would be used to search for documents that may be relevant. This task is typically done by interns today.
The filter software computes the goodness of fit of a given text to a user-defined set of target patterns. The target patterns are composed of rules which specify concepts in context. The rules can be nested without limit. The rules are user specified and therefore are an input to the filter. The rules are transformed internally for pattern matching, and a version of these is sent to the hardware. The hardware returns initial pattern matches, which are then combined to provide context. The original text is then scored using criteria specified for each rule. The filter uses a standard “spam/virus” filter interface as well as a command line interface for testing or batch uses. The filter can intercept and reroute suspected messages for review, such as for human review.
The current end user interface is invoked using the Zimbra Collaboration Suite (ZCS). Suspected messages are rerouted to a special email account for adjudication. ZCS uses a web interface which includes an email client. All sent messages are filtered for suspected content. The rules and configuration parameters drive the process; therefore, it should be applicable to any domain and easily changed for a different setting. Early testing was done using the open-source Apache JAMES project.
The technologies described herein can be implemented in a variety of ways.
Proliferation of electronic content requires automated tools to analyze large volumes of unstructured text. Network traffic as well as large corporate repositories can be scanned for content of interest, either to stop the flow of unwanted information such as corporate secrets or to identify documents relevant to an area of interest. Network filters rely mostly on simple word list matches to identify “interesting” content and manual searches typically rely on Boolean logic queries. Both approaches have their advantages and limitations. Word-list-based filters are simple to implement. Boolean logic with word proximity operators allows a finer definition of the patterns of interest. However, both often retrieve too many false positives. A document may contain the right words, but not in the right context. For example, a web search for apple brings both references to the computer company and the fruit. In this paper we will document an approach that we have developed that searches for concepts-in-context with a reduced number of false positives. Furthermore, by the use of commercial-off-the-shelf hardware, we have accelerated the process significantly so that data feeds can be processed in-line.
As technology has continued to advance, modern society has become increasingly reliant upon electronically generated and stored data and information. Digital archives are growing everywhere both in number and in size. Correspondingly, the need to process, analyze, sort, and manipulate data has also grown tremendously. Researchers have estimated that by the year 2000, digital media accounted for just 25 percent of all information in the world. After that, the prevalence of digital media began to skyrocket, and in 2002, digital data storage surpassed non-digital for the first time. By 2007, 94 percent of all information on the planet was in digital form. The task of processing data can be complex, expensive, and time-consuming. Applications that alleviate the processing burden and allow users to access and manipulate data faster and more effectively to cross the data-to-knowledge threshold, and to enable informed, actionable decision-making, particularly for large data streams or digital repositories, are in demand.
A real-life test set is publicly available in the form of ENRON email messages. The set has some 0.5 million email messages containing data from about 150 users. The messages and attachments total some 1.7 million files. Consistent analysis of this set by humans is impossible to achieve. Many of the analysis aspects are subjective, so human experiences and biases become a significant factor and introduce inconsistency. This problem nullifies the ability to analyze a large set of documents with a divide-and-conquer approach. In contrast, computer-based tools may consistently analyze large collections of unstructured text contained in documents. These tools can generate consistent results. The result of the analysis can then be used by humans to interpret the meaning of the changes, such as trends. Various questions such as the following can be addressed: why a sender no longer discusses information on a certain topic; why a different topic is used; why a sender uses a new topic area; whether this is an evolution of previous discussions; whether it is a new problem or a new perspective on a problem; why the topic area was found to be a dead end. These questions are likely best answered by a human, and the technology can support such decisions by doing the repetitive preparation task and systematically analyzing a large corpus of documents to expose the patterns for human consumption.
A filter can be used in line to monitor near-real-time information flow. A hardware-accelerated concepts-in-context filter called “Indago,” described herein, is one that can be used.
Indago can perform contextual analysis of large repositories of electronically stored or transmitted textual content. The most common and simplest form of analyzing a large repository uses simple word-occurrence searching techniques in which a set of terms is used to search for relevant content. Simple word list search results have a high rate of false positives, thus impacting accuracy and usefulness. Indago's contextual analysis allows creation of complex models of language contained in unstructured text to build increasingly sophisticated tools that move beyond word occurrence to make connections and discover patterns found in large collections.
Indago computes the relevance (or goodness-of-fit) of the text contained in an electronic document based on a predefined set of rules created by a subject matter expert. These rules are modular and are expressed as hierarchical concepts-in-context. The rules can be nested and are complex in nature to encode the subject matter expert's interpretation of a target concept, such as sensitive corporate knowledge. The rules have user-specified scoring criteria that are used to compute a goodness-of-fit score for each individual document. This score can be used for filtering or for matching relevant documents with high accuracy. Rules can be weighted via four different types of weighting functions and thus can be used to optimize precision and recall. The process can also be used to identify duplicate documents, as they would generate identical matches.
Software-only approaches are typically limited to on the order of hundreds of implementable rules before severe performance limitations begin to arise. Researchers have documented that simple phrase searching dramatically increased search times. As rule set size increases, performance decreases due to effects such as memory caching and thrashing. Indago's hardware-accelerated implementation is effectively unlimited. It can be co-designed for fully optimal performance based upon domain and complexity requirements (up to hundreds of thousands of rules, if required), without reaching hardware-imposed limitations.
Indago rule sets can be adapted to the application space. User-defined, domain-optimized content is translated into a set of rules for fast contextual analysis of complex documents. Simple search and filtering applications use a “one rule only” approach (typically a single Boolean rule or list of keywords). While the rule is user specified, the processing is done one rule at a time. In contrast, Indago can include hundreds of rules that contain “context” references and may be weighted to give a more accurate goodness-of-fit to the target pattern of interest.
Indago can employ rule-set flexibility that is not possible with Boolean-only query methods. Indago provides a variety of weighting functions in its grammar definition. In addition, Indago provides the option of negative weighting to accommodate undesired contexts. Finally, the query rules are modular and thus easier to maintain than long Boolean queries.
Indago enhances analyst-in-the-loop and/or end user effectiveness. Indago can analyze large collections of unstructured text with the end result of focusing the analyst's attention on precisely the documents, and words within those documents, that are most likely of interest. By offloading the repetitive tasks associated with the systematic baseline analysis of a large corpus of documents, the efficiency and effectiveness of the subsequent analyst review is enhanced, allowing more time to interpret meaning and trends. In addition, Indago's scoring mechanism allows the added flexibility of tweaking the balance between precision and recall, if desired, by the use of the weighting functions.
Indago uses a combination of software and hardware to achieve near-real-time analysis of large volumes of text. The currently deployed implementation is used as a context filter for an email server. However, the technology has broad applicability, as the need for fast and accurate search, analysis, and/or monitoring of digital information transcends industry boundaries.
A difference from current approaches is that Indago can provide a unique, hardware-assisted, lossless, highly scalable, and highly adaptable solution that exploits commercial off-the-shelf (COTS) hardware end-to-end. Tests of Indago's hardware-accelerated implementation have shown a base performance increase of 1 to 2 orders of magnitude on pre-processing tasks compared to the existing, unoptimized software and cut the overall processing time by a third. Acceleration of additional functionality, including rules processing, will result in significant speed-up and unlock additional capability in a fully integrated system. The initial implementation is an email filter hardware acceleration interface (eMF-HAI) to a Netlogic NLS220HAP platform card. This card is designed to provide content processing and intensive signature recognition functionality for next-generation enterprise and carrier-class data communications networks. The unique design and capability of the five Netlogic NLS205 single-chip knowledge-based processors enable the inspection of data traffic being transferred at rates up to 20 Gbps by accelerating the compute-intensive content inspection and signature recognition tasks. While this technology is designed for the data communications industry, deep-packet processing technology can be applied to the field of digital knowledge discovery.
The NLS220HAP is a small form factor, PCI-e attached, accelerator card that can be integrated into commercial off-the-shelf workstation class machines. It utilizes five NetLogic NLS205 single-chip knowledge-based processors. Each NLS205 processor has the ability to concurrently support rule databases consisting of hundreds of thousands of complex signatures. The unique design and capability of these processors enable the inspection of every bit of document data being processed at rates up to 20 Gbps by accelerating the compute intensive content inspection and signature recognition tasks. This performance far exceeds the capability of current generation multicore processors while consuming substantially less energy.
Due to the concurrency offered by the NLS220HAP, multiple documents may be processed in parallel against rule databases consisting of hundreds of thousands of complex patterns. The matched pattern indices are then used to determine context and identify the rules matched. The scoring function then computes the contribution of each match to generate a complete document score for the goodness-of-fit.
Indago can be scaled in multiple ways depending on the operating environment and the application requirements, see
Referring to
The highest level of scalability is at the node-level where multiple servers, each potentially containing multiple NetLogic PCIe boards, are interconnected with a high-performance network (e.g., 10GE) to form a traditional compute cluster. In this scenario, the Indago application runs on each of the servers independently using some type of network load balancing to distribute the data to be processed. If the servers each contain multiple NetLogic boards, then the amount of processing that could be achieved with even a modest sized cluster would be significant.
The Indago approach has advantages over current applications in the two closest technology areas of text searching and content filtering.
A key player in the internet field and, to some extent, intranet searching is Google. This is easy-to-use search technology; however, most searches are simple word lists. In the Content Management System arena, Autonomy is a key player that touts Corporate Knowledge Management and includes algorithms for searching, clustering, and other types of text management operations. TextOre is a commercial product based on Oak Ridge National Laboratory's Piranha project, a 2007 R&D award winner. This product can help mine, cluster, and identify content at very large scale. The original Piranha algorithm can run from a desktop machine but uses a supercomputer to achieve best performance. The implementation is based on word frequency for finding and clustering documents. In general, text searching most often uses a simple word list; other operations such as clustering may use word co-occurrence indices and frequency counts to cluster like documents. Furthermore, the documents need to be preprocessed in preparation for these operations, and therefore these techniques may not be suitable for inline content filtering.
Simple word-list-based tools plug into a web browser to block unwanted content. These usually target parental control customers. Antivirus software can also be considered in this category, but virus definitions are simple bit-sequence matching. Most filtering engines are designed to handle one web page at a time, while Indago is designed to filter large volumes of content in near real time.
Neither text searching nor filtering attempts to understand the content; both merely match patterns. Both use simple rules and are limited to a small rule set. If the rule set grows by an order of magnitude, these systems would begin to degrade. Indago's hardware-accelerated performance is independent of rule set complexity.
Another difference from current approaches is the weighted identification of concepts-in-context. Most filters and search engines use either a simple word list or Boolean logic to express target patterns. For example, Google uses a simple list, augmented by additional data (e.g., prestige of pages linking to the document, page ranking, etc.), to produce good retrieval rates; however, the number of matches can be impracticably high with many false positives.
Simple search implementations often return many hits but with the significant drawback of a high false positive rate.
Boolean logic can express complex context relationships but can be problematic. A single over-hitting pattern can cause false positives; however, filtering out all documents containing that pattern may eliminate documents that have it, or other targeted concepts, in a desired context. Users of this style of searching are forced to trade precision for recall and vice versa, as opposed to being able to enhance both. And when the query contains hundreds or thousands of terms, just finding the few culprit patterns may require significant effort.
Indago includes sophisticated weighting that can be applied to both simple patterns and to multiple patterns co-occurring in the same clauses, sentences, paragraphs, or an entire document. The goodness-of-fit algorithm allows a tight matching of the target patterns and also provides a mechanism for tuning rules, thereby reducing both false positives and false negatives. The concepts-in-context approach allows more precise definition, refinement, and augmentation of target patterns and therefore reduces false positives and false negatives. Researchers have documented the negotiation process for creating an acceptable Boolean query in the Request to Produce documents for a Complaint. The basis for their complaint is:
The goodness-of-fit algorithm uses a dataflow process that starts with extraction of text from electronic documents; the text is then either sent to hardware for basic pattern matching or a software emulator. The matches are used to determine the satisfied rules from which the score is computed.
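As a minimal sketch (in Java) of the dataflow just described, the flow can be organized as extraction, pattern matching, and scoring. The class and method names below are hypothetical placeholders, not the actual Indago API, and the simple substring emulator merely stands in for the hardware pattern matcher:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ScoringPipelineSketch {

    /** Stand-in for the NLS220HAP card or its software emulator. */
    interface PatternMatcher {
        List<String> match(String text);   // returns the patterns found in the text
    }

    /** A naive emulator: reports which target patterns occur as substrings. */
    static class SoftwareEmulator implements PatternMatcher {
        private final List<String> patterns;
        SoftwareEmulator(List<String> patterns) { this.patterns = patterns; }
        public List<String> match(String text) {
            List<String> hits = new ArrayList<>();
            for (String p : patterns) {
                if (text.toLowerCase().contains(p.toLowerCase())) hits.add(p);
            }
            return hits;
        }
    }

    /** Combine raw pattern hits into a document score using per-rule weights. */
    static double score(List<String> hits, Map<String, Double> ruleWeights) {
        double total = 0.0;
        for (String hit : hits) {
            total += ruleWeights.getOrDefault(hit, 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("pipeline", "leak", "explosion");
        Map<String, Double> weights = Map.of("pipeline", 10.0, "leak", 25.0, "explosion", 25.0);
        PatternMatcher matcher = new SoftwareEmulator(patterns);
        String text = "The report describes a pipeline leak near the river.";
        System.out.println("score = " + score(matcher.match(text), weights));
    }
}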
For the first step of the process, the integrated Tika module does the extraction of text from most electronic documents, including archives such as zip and tar files. It is an open source framework with many developers creating data parsers. It does not rely on the file extension to determine content; it reads the data stream to make that determination and calls the appropriate parser to extract text. Most common formats are supported, such as the MS Office file formats, PDFs, HTML pages, and even compressed formats like tar and zip. The framework is extensible; therefore, new formats can be incorporated into the framework.
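For example, assuming the Apache Tika jars are on the classpath, the Tika facade class can extract text and detect the content type from the data stream rather than the file extension. This is a generic Tika usage sketch, not the eMF integration itself:

import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File input = new File(args[0]);            // e.g., a PDF, DOCX, or zip archive
        String type = tika.detect(input);          // content type from the data stream, not the extension
        String text = tika.parseToString(input);   // extracted plain text
        System.out.println("Detected type: " + type);
        System.out.println(text);
    }
}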
The Porter word stemmer is also integrated and allows the identification of word roots, which can then be used to find word families so that all of the variations of a word can be matched as needed instead of having to specify each variation.
Rules can be used for defining the targeted concepts and determining the goodness-of-fit of a document's text to those concepts. The two main types of rules are concept rules and weighted rules. Concept rules are used to define words and/or phrases. The main purpose of this type of rule is to group related concepts for later reuse. Concept rule blocks begin with the label “SYN” and end with “ENDSYN”. Weighted rules are used to assign a specific weight to one or more concept rules. Weighted rule blocks begin with the label “WF<weight function parameters>” and end with “ENDWF”. They are usually composed of one or more references to concept rules. Only the weighted rules contribute to the total score of the document when they are matched.
A concept rule definition can start with the following:
As well as: “The water is so blue that it's hard to find the line where the sky meets the ocean.” (2)
Rule lines may contain regular expressions.
This rule line matches all of the following sentences:
It is also possible to specify that all elements of a set of two or more words appear within a particular syntactic locality. The supported locality constrainers are listed below, in increasing order of restrictiveness:
Document locality, which is specified with the following rule syntax:
Paragraph locality, which is activated with the following rule syntax:
Sentence locality, which is specified with the following rule syntax:
Clause locality, which is specified with the following rule syntax:
SYN rules can group concepts that are related to one another in some meaningful way so that the SYN can be incorporated into other SYN rules or weighted rules. Each line in a SYN rule definition is considered to be interchangeable with each other line within the same rule definition. Once a rule is defined, it can be reused in other rule definitions by referring to its unique name. It can be referenced by preceding the name with an equal sign (‘=’) anywhere in rule definition lines. Comment lines, which are ignored, start with the “#” symbol.
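For illustration only (the original reuse example is not reproduced here), a pair of concept rules consistent with the grammar described above might look like the following; the rule names, the listed synonyms, and the convention of placing the rule name on the “SYN” line are assumptions made for this sketch:

# concept rule for the superlative form
SYN huge
enormous
gigantic
colossal
ENDSYN

# concept rule for the general form, reusing the superlative
SYN big
large
sizable
=huge
ENDSYN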
In the above reuse example, it may be beneficial to remember subset and superset relationships. Anything that is “huge” is at least “big.” It is likely beneficial to reference the superlative form from the less extreme SYN, rather than the other way around. Similarly, it is useful to create very specific concepts and then reference them within more general ones. That way, the specific form can be used in a heavily-weighted rule while the more general concept can be used to establish context and/or be used in a lesser-weighted rule.
Weighted rules can be implemented as collections of one or more concepts, with a weighting function assigned to each collection. The syntax is generally the same as for a concept rule, except that weighted rules have no unique name identifier and are composed mainly of references to concept rules. As mentioned before, their primary use is to define the weight function by which the included concept rules will contribute to the total score of the document. A variety of weight functions are available for defining how the rule is weighted. They are:
Weighting function blocks begin with “WF” and end with “ENDWF.” The weighting functions are described in more detail below.
Consider the following text as it is scored by various weight functions. “numHits” is the number of matches found in the text for a particular rule.
CONST
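As a minimal Java sketch of how a weight function such as CONST might be applied to the “numHits” value described above, the following interface and constant-weight behavior are assumptions made for illustration; the actual signatures and semantics are defined by the Indago grammar:

public class WeightFunctionSketch {
    interface WeightFunction {
        double weight(int numHits);   // numHits = number of matches found for the rule
    }

    /** Assumed semantics of CONST: a fixed contribution once the rule matches. */
    static class ConstWeight implements WeightFunction {
        private final double value;
        ConstWeight(double value) { this.value = value; }
        public double weight(int numHits) {
            return numHits > 0 ? value : 0.0;
        }
    }

    public static void main(String[] args) {
        WeightFunction wf = new ConstWeight(50.0);
        System.out.println(wf.weight(0));  // 0.0  (rule not matched)
        System.out.println(wf.weight(3));  // 50.0 (fixed contribution regardless of numHits)
    }
}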
Finally, the concept rules are modular and thus are easier to maintain than Boolean queries. Incorporating synonymy into a Boolean query often leaves it looking like a run-on sentence. The resulting complexity often causes internal inconsistencies and logical gaps with respect to synonyms. Indago's modules allow the user to build modular concepts and then refer to them as many times as necessary. When new synonyms are discovered, they can be easily added to the relevant module. A complete weather-related rule set may look like the following:
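Purely as an illustrative sketch (not the original example), a compact weather-related rule set consistent with the grammar described herein might look like the following; the concept names, synonym lists, and the “WF CONST <points>” parameter form are assumptions:

# concept rules
SYN precipitation
rain
snow
sleet
hail
ENDSYN

SYN storm
thunderstorm
hurricane
blizzard
ENDSYN

# weighted rules: only these contribute to the document score
WF CONST 50
=storm
ENDWF

WF CONST 10
=precipitation
ENDWF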
On the right-middle section of
This graph shows the encapsulation of the target concepts of interest. It shows the concepts and relationships. It is a mental model of the kind of information being sought. A very simple example is shown here; the one used in the results section contains hundreds of rules. A text of interest matches certain parts of this model, and these matches are then shown through the use of text highlighting on a web page for user consumption. The analyzed text is highlighted, with each matched concept contributing to the score of the document.
Indago delivers higher accuracy and scalability than Boolean queries and more consistency than humans. While improvement can be estimated in terms of cost, speed, or energy consumption, for this problem space it is perhaps best understood in terms of relevancy of matches. A Google search with thousands of results is not useful if the most-relevant document doesn't appear until the 10th page of results; few users have the patience to scan through many pages of false positives. Similarly, a very fast Boolean search is not useful if relevant documents are missed because important concepts were omitted from some portion of the query or were filtered out by the use of a NOT operator. In the legal field, missed documents (e.g., false negatives) may contain the “smoking-gun” evidence. Failure to produce such a document may lead to a contempt charge, and failure to read it might mean losing the case. The focus of this system has been to find relevant information while minimizing both false positives and false negatives. The use of hardware acceleration cuts the processing time by a third; future releases may push additional operations to the hardware for even more speed-up and added processing capability.
Indago was used to score a large, publicly-available data set of email messages, referenced earlier, and their respective attachments for relevance to the general concept of environmental damage as a result of pipeline leakage or explosions. From the more than 750,000 discrete documents, just over 12,000 were selected by a variety of search algorithms for human judicial review. Indago also identified around 1000 additional documents that were not included in the judicial relevance review, suggesting that Indago may have a higher recall than the other algorithms that defined the 12,000-document set.
The email collection was used to quantify the discrimination capability of the technologies. The concept model included a vast array of concepts-in-context in an effort to find relevant documents among the huge collection of irrelevant ones. “Relevance” is usually subjective, and this was no exception. Judicial reviewers were relied on to establish relevance for litigation purposes, and a sample of the documents was then evaluated for conceptual relevance. For example, if a document contained relevant background information and discussed actual pipeline blowouts or oil leaks, or insurance against, or response to, potential blowouts, it was deemed conceptually relevant. This process was, however, incomplete, and in many cases the litigation assessment was used as a proxy for conceptual relevance. This will, by definition, increase the number of false positives for Indago scoring, but there was inadequate time to fully evaluate each of the 12,000 documents without using Indago's scoring and markup.
Indago read the model and scored each document. An adjusted minimum of −500 and maximum of 1000 was used to facilitate creating the scattergram with score as the X axis and conceptual relevance as the Y axis, as shown in
The goodness-of-fit results against the target model are documented in Table 1. The Indago Score Group column categorizes documents by their total adjusted score, from “Huge Negative” (lowest score, set at −500) to “Huge Positive” (largest score, set at +1000). The document groups are further broken down as “yes” versus “no” in terms of meeting the legal relevance and/or conceptual relevance criteria. Of the documents judged for legal relevance, there was no clear verdict on 189.
As shown in Table 1 below, there were a grand total of 12,087 documents analyzed (not counting six image-only files). Of this total, those with “Huge Negative” to “Tiny Positive” scores (i.e. less than 25) were deemed not to meet enough of the target model. These 11,492 were considered to be irrelevant from the perspective of the computer-based model, independent of the conceptual and litigation relevance assigned through human review. The remaining 595 (101+107+387 from the “Total” column) had a high-enough score to be relevant to the target model.
Only 302 documents in the entire collection were deemed to be legally responsive, according to the judicial reviewers. Of these, Indago found 185 (21+22+142), giving a 61% retrieval rate and a 39% false negative rate. Similarly, of the 565 (185+380) adjudicated documents found to have a high-enough score, 380 (76+81+223) were not responsive, giving a false positive rate of 67%. The other 30 (4+4+22) documents had no verdict.
There were a total of 377 documents deemed conceptually relevant through human review. Of those, only 3 received negative scores from Indago, and only one of those passed our “not simply generic oil pollution” concept relevance test. The responsiveness cut-off was set at 25 points, which eliminated nearly 300 documents, of which only 7% were relevant. Depending on the optimal precision and recall required by the problem space, in conjunction with the resource levels available for manual review, this number can be easily adjusted without re-processing the documents. For conceptual relevance, Indago's relevant retrieval rate is 93%, with a 7% false negative rate. The false positive rate for all documents found to have a high-enough score is (59+63+92=214)/(377+59+63+92=591), or 36%.
The model can be improved. Testing indicated the ability to search for generic concepts in the targeted context, though sometimes the contextual items themselves were out of context, and a handful of concepts appearing in close proximity triggered false positives. For example, generic definitions of pollution and leaks were included in the model, but upon review, documents were deemed conceptually irrelevant if they addressed only resultant air pollution, the leaks were quite small, or the oil was merely incidental. Although it may be argued that Indago found precisely what it was asked to find, such documents were still considered false positives for this study.
Manually reviewing annotated documents with near-zero (<25) scores facilitated identification and suppression of false positive contexts. In other words, the concept tagging of these low-scoring documents revealed a useful set of context filters that were subsequently incorporated into the targeting rules to further improve precision without sacrificing recall. The initial processing of the entire collection revealed 1,389 documents with scores exceeding 50. The median score was 999. After incorporating the new filters, the median dropped to 47, and the scores of 272 files dropped to zero or less. Of the newly-negative-scoring documents that were part of the adjudicated set, not a single one had been deemed responsive. Of 54 documents, the highest scoring 10 were all deemed responsive, while only four of the lowest scoring 28 documents were responsive. Depending upon the user's goal in conducting the search, it may only be necessary to read a few top-scoring documents in order to get the gist of an issue. A Boolean “yes” doesn't help differentiate between “really yes” and “maybe yes.”
Indago has several advantages over the current alternatives. For example, Indago allows one to make use of negative weightings for undesired contexts. Indago's raw scoring of the adjudicated documents ranged from a low of −2,370 to a high of 35,120. A Boolean query doesn't provide a mechanism for sorting the documents by matched content, whereas ours does. An unfortunate result of this exercise is that it highlighted the inaccuracy of the human adjudication process. There are a number of high-scoring documents deemed to be non-responsive that are exact copies of other documents that were judged to be responsive.
Document size has an impact on scoring. As yet, scores are not normalized by size, so exceptionally positive and, to a lesser extent, negative, scores are more likely with large documents. As shown in Table 2, below, there were no false negatives among the larger file groups, and in fact the vast majority of the false negatives were among the smallest files. In many cases, these were the “body” documents with large attachments and thus “inherited” their relevance from wording that was not actually included in the document itself. The rates of false positives were significantly worse with large documents, primarily because negation for bad contexts was limited to the first few instances whereas the points for good context were attributed for each encounter of the concept. Techniques for using negative weightings can be improved in the example.
Indago performed better against some types of documents than others. The lack of “natural language” contained in typical spreadsheets rendered the differential weightings for conceptual co-occurrence within single clauses, sentences, and/or paragraphs somewhat ineffective in the example.
In spite of concerns about score variability based on document size, for each size group, the average score Indago gave to true positives was quite a bit higher than that given to the false positives. Similarly, Indago's average scores for true negative (e.g., irrelevant based on human evaluation) documents were less than the averages for similarly-sized documents it misidentified.
As stated, some of the false positives were identical copies of other “responsive” documents and, more importantly, had been labeled as non-responsive. The duplication comes from the fact that messages to multiple recipients are treated as unique even though the content is the same. Apparently, some of these different files were assigned to different judges, and one judge decided that the document was responsive while the other decided it was non-responsive. Since many of the decision aspects are subjective, one's experiences and biases become significant factors and introduce inconsistencies. It is difficult for different people to provide consistent responses when manually evaluating thousands of pages of documents. Some researchers state that the difference in responsiveness adjudication is most often a result of human error as opposed to “gray area” documents. Some researchers recommend the simplification of the target patterns to minimize variation. By contrast, Indago results are consistent across large or small collections, and the rule set can be composed of hundreds of rules.
Analysis, sorting, management, and protection of data can be applied across a diverse set of industries and applications. Indago is particularly powerful because its hardware-assisted, concept-in-context approach allows domain-optimized algorithm adaptation.
Protection of Corporate Intellectual Property or Sensitive Information—Exfiltration is a military term for the removal of assets from within enemy territory by covert means. It has found a modern usage in computing, meaning the illicit extraction of data from a system.
Email Exfiltration: Indago can be used to search transmitted data streams to identify sensitive information in context and, based upon that identification, take action to prohibit or allow the transmittal of digital content. One implementation of Indago is an electronic Mail Filter that has advantages over current approaches, as the state of the art is limited to word list matches and does not contain the ability to search concepts-in-context. The advantage of current technology is that it is very fast, but it can be easily defeated by a knowledgeable individual. Similarly, it can miss target material. For example, many email filters depend on the extension of the file name to filter potentially harmful content. The filter looks for “.exe” files; however, a real “.exe” could be renamed and still be harmful.
Exfiltration (general): Similarly, Indago can be used to search data repositories to identify sensitive information, and based upon that identification, the information can be flagged for additional protection or action can be taken to prohibit or allow access, as appropriate.
Just as viruses pose a threat, so does disclosure of corporate knowledge. Indago can be used to monitor different forms of internet traffic and flag suspected sources/individuals.
An insider threat is a significant concern. Who is looking for sensitive corporate knowledge that they should not have access to? Intranet web sites contain vast amounts of corporate knowledge that may not be properly protected. Monitoring of such flow could be facilitated by the use of this technology.
Fast and Accurate Large Repository Search—Indago allows more complex searches that retrieve more relevant content by focusing on the context. As stated before, the word “Apple” for non-computer folks usually refers to a fruit; for most technical computer folks it can be either the fruit or the computer company. The use of the rule set and hierarchical concept-in-context matching allows more precise matching for the target interpretation. Researchers have documented a fictitious negotiation for a Boolean query to be used to retrieve relevant documents for a legal case.
As data generation and storage technologies have advanced, society itself has become increasingly reliant upon electronically generated and stored data. Digital content is proliferating faster than humans can consume it. Tools are needed to perform repetitive tasks so that humans can focus on what they do best: recognizing patterns at a higher level. Current search/filter technology is well suited for simple searches/matches, but a more powerful paradigm is required for complex searches, such as finding sensitive corporate knowledge that may be flowing in the intranet and could be accidentally or maliciously sent out to the internet. Context-based search and analysis can greatly enhance the exploration of data and phenomena by reducing the data deluge and increasing the efficiency and effectiveness of the human analyst and/or end user in accessing and fully exploiting the data-to-knowledge potential that is inherent but latent in nearly every collection.
Indago can be implemented as a cost-effective solution that provides unparalleled performance and capability using proven, commercial off-the-shelf technology. It allows users (e.g., IT personnel, law firms, scientists, etc.) to engage their data faster, more accurately, and more effectively, thus allowing them to solve problems faster, more creatively, and more productively. Furthermore, Indago is domain-adaptable, power efficient, and fully scalable in terms of rule set size and complexity.
Indago can provide a unique, outside-the-box innovation in terms of how it exploits deep packet processing technology, its adaptability and breadth of applicability, and its unparalleled performance potential.
Analyst-in-the-Loop applications leverage the speed and consistency of the algorithm to enhance the productivity, efficiency, and accuracy of an expert by accurately focusing attention on content of potential interest for final and actionable context-based inspection and decision-making. Indago's contextual analysis includes color-code triage to focus attention on high interest text matches with single click navigation to each specific textual instance.
Testing utilized a Java program that runs the entire Indago processing flow. This program can be configured to use the NetLogic NLS220HAP hardware-assisted flow, or a software-only emulation of the hardware-assisted flow. The program can also be configured to output millisecond accurate timing for each of the major processing steps of the Indago processing flow.
For testing, the program was configured to print out timing information and was executed using input test files of increasing size. All test files were generated from a single base file “bad_text.txt” (base size=91,958 Bytes) that contains content that will generate a high score when analyzed by the Indago processing flow against a target rule set. Larger file sizes are generated by concatenating “bad_text.txt” multiple times. For reference one typewritten page is ˜2,048 Bytes and a short novel ˜1 MB.
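As a sketch of how such test inputs could be generated by concatenating the base file, the following Java program uses only the standard library; the output file names and repetition counts are illustrative, not those used in the reported tests:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MakeTestFiles {
    public static void main(String[] args) throws Exception {
        byte[] base = Files.readAllBytes(Path.of("bad_text.txt"));   // 91,958-byte base file
        for (int copies : new int[] {1, 2, 4, 8, 16, 32}) {
            Path out = Path.of("bad_text_x" + copies + ".txt");
            Files.deleteIfExists(out);
            for (int i = 0; i < copies; i++) {
                // append another copy of the base content to grow the test file
                Files.write(out, base, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}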
For both hardware-assisted and software-only testing we present timing results for the main processing steps: SW emul (software emulation), Input File (Input File Processing), Score (Scoring), and TOT user (overall run time of Java program). All timings with the exception of TOT user are measured in milliseconds.
Table 4 presents the results obtained for the hardware-assisted testing. The hardware-assisted processing implements a high resolution timer that measures the actual processing time of both the hardware-assisted C/C++ code and the actual NLS220HAP hardware. These results are listed under the HARDWARE/HW timer column of the results.
Table 5 displays the results for the software-only testing. Since this testing does not utilize the hardware the column HW timer is replaced by SW emul.
The steps for both software-only and hardware-assisted testing are shown below. In the hardware-assisted steps, the hardware functionality is shown in italics. The Java code communicates with the hardware-assisted C/C++ code via the file system, writing input text files into a process directory and results into a related output directory. In each case the software waits for files to appear; the hardware-assisted code utilizes the lightweight kernel subsystem inotify, while the software polls the file system directly.
1) Text extraction to text file
2) Create Results Directory for input text file
3) Place text file for inspection into process directory
4) Further processing
1) Text extraction to text file
2) Create Results Directory for input text file
3) Place text file for inspection into process directory
4) Wait for Results binary file to appear (polling)
5) Further processing
It is noted that the software-only processing is done in a single thread of execution. There are no waits for the input text file to be placed into the process directory or for the binary results file to be placed into the results directory. For the hardware-assisted processing, this wait can incur a 500 millisecond delay, and one sees this for file sizes up to 1,471,328 bytes.
The hardware-assisted code implements five processing threads which watch five corresponding input processing directories. The upper level Java program does simple load balancing on input files; placing incoming text files for processing into one of the available input processing directories. Once a file is deposited into the input processing directory the kernel, via the inotify subsystem, notifies the thread that there is a file to be processed.
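A minimal Java sketch of this simple load balancing is shown below; the directory names are illustrative rather than the actual process directories, and the hardware-assisted threads watching those directories (via inotify) are not represented:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class ProcessDirBalancer {
    private final List<Path> processDirs;
    private int next = 0;

    public ProcessDirBalancer(List<Path> processDirs) {
        this.processDirs = processDirs;
    }

    /** Move a text file into the next process directory in round-robin order. */
    public synchronized Path submit(Path textFile) throws Exception {
        Path dir = processDirs.get(next);
        next = (next + 1) % processDirs.size();
        return Files.move(textFile, dir.resolve(textFile.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        List<Path> dirs = List.of(Path.of("process0"), Path.of("process1"),
                Path.of("process2"), Path.of("process3"), Path.of("process4"));
        for (Path d : dirs) Files.createDirectories(d);
        ProcessDirBalancer balancer = new ProcessDirBalancer(dirs);
        Path sample = Files.writeString(Path.of("doc1.txt"), "sample text");
        System.out.println("placed in " + balancer.submit(sample));
    }
}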
Table 6 presents the overall results computed from the timings displayed in Table 4 and Table 5. The column TOT user % decrease displays the percentage decrease in overall run-time of the Java program when utilizing the hardware-assist. HW speedup measures the speedup for processing steps 3a through 3d shown above when hardware-assist is used. The last two columns, % inc HW timer and % inc SW emul, measure the percentage increase in processing time moving from one file size value to the next for hardware and software, respectively. Notice that both processing times double with every increase of the file size.
The hardware-assisted flow provides five hardware-based, fully concurrent, processing pipelines. It is designed to handle situations where there are many documents in progress and so it is informative to look at the scaling performance of the hardware-assisted flow and once again compare with software only.
More functionality could be moved to the hardware acceleration to speed up the process further. Similarly, a simple parallelization process across several servers could speed up the process for in-line filtering of documents in a large set. The analysis of each document is independent of other documents therefore the process is trivial to parallelize.
The technologies described herein can be used to implement an email filter.
The disclosed eMail Filter (eMF) is designed to monitor email messages and score them for “goodness of fit” to a predefined target domain model. Additionally, content is analyzed for other factors, such as the MIME type of the attachments. Messages scoring highly against the models (e.g., closely resembling the targeted concepts) and those containing images or non-English text are routed for human review.
eMF was designed and developed to loosely couple with the Zimbra Collaboration Suite (ZCS). However, the use of standard libraries and communication protocols makes eMF capable of being used with other email servers. eMF uses the virus checker/spam filter communication protocol. As far as ZCS is concerned, eMF operates and communicates like any other email filter.
For the purpose of the Gateway project, eMF has been tested extensively using ZCS Version 6.0.10 on Red Hat Linux 5.5. ZCS email functionality can be coupled with different email clients. However, the system was designed for, and has been tested using, Zimbra's web mail client. The scoring algorithm runs on the server as a daemon that is invoked for each email message. Depending upon the score, the message will either flow through to the intended recipient or be re-routed for human review. The reviewer's interface is also web-based. Therefore, the ZCS, eMF, and reviewer interface can be installed and run on a single machine.
As depicted in
The system is intended to be coupled with other software packages such as ZCS and Apache Tomcat server in a secure environment. Apache Tomcat server is an open source webserver and servlet container developed by the Apache Software Foundation (ASF).
The eMF can be run on a single machine or multiple machines. The machine(s) may be placed in a demilitarized zone (DMZ) as a gateway between two networks. Users sending and receiving messages are authenticated to the machine, and content to be sent is uploaded to the server, but does not flow beyond the server until it has either been automatically scored as being allowable, or has been declared allowable by a human reviewer. Therefore, disallowed content is stopped at the server and not disseminated beyond it. This allows the ability to control flow of information as needed.
User authentication is accomplished at the Zimbra-user level, thus only valid users are able to send, receive, and review messages. Only authorized reviewers may review messages.
The eMF software is designed to integrate with state-of-the-art pattern-matching hardware from NetLogic. These high-throughput pattern matchers can achieve throughput of 10 GB per second. If needed for testing, a software module can emulate the hardware (e.g., while waiting for operating system upgrades, etc.).
The system has four major components, each made up of many modules. The two developed components are the eMF and the Reviewer Interface. The other two required components are ZCS and the authentication/validation software. As stated earlier, these can be collectively run on a single server, but may also be distributed for load balancing, security, and other operational reasons.
This description does not cover installation of the non-developed modules; information is publicly available so that they can be installed correctly and verified to be functioning correctly.
The eMF is made up of several modules: daemon (Gateway-Milter), content extraction, target pattern matching, target model scoring, message flow logic, and content highlight. The eMF is loosely coupled with the ZCS, as eMF receives a call for each message. The eMF is not unlike a virus scanner plug-in, and it uses the same communication protocol. One of the eMF processing byproducts is that the content of the message is highlighted based on the goodness of fit to the targeted models.
The Reviewer Interface allows a human reviewer to access the highlighted content via a web-based interface. The interface facilitates the review process and allows the operator to adjudicate the content and take action.
The components are loosely coupled, as they only share the location of the highlighted content. The eMF only suspends the flow of messages that require review; messages scoring below the user-defined threshold flow directly to the recipient's inbox.
The workflow is depicted in
Next, a message decision flow is consulted for the appropriate action for each message. A message and its attachments are analyzed together. The first step of the process is to unpack the original files, and then the text is extracted using the open source Apache Tika package. eMF does not analyze images or non-English text. Messages containing image files and those having a low ratio of English to non-English words must be routed for human review. The message is then either forwarded to the recipient's inbox or rerouted to a reviewer's inbox for adjudication. In the case of rerouted messages, there is a configuration option to inform the sender that the message has been delayed.
i. Environment
The system uses a combination of open source packages, developed code, Apache Tomcat, and the Zimbra Collaboration Suite (ZCS). Because of ZCS dependencies and operational requirements, the software has been extensively tested on Red Hat Linux Version 5.5. The developed code runs on many other platforms, but it is dependent on the platforms that support ZCS. Infrastructure includes the following: Linux/Unix platform with Pthreads; ZCS Version 6.0.10+; Java Version 1.6+; HTML browser with JavaScript support; C and C++; Tomcat Version 5+; Sendmail/libmilter; and the eMF Distribution CD.
ii. Features
The code was written in a modular fashion with unit-level testing and testing at the overall system level as well. Some of the major modules are documented below.
Rules:
Rules are built by a knowledgeable domain expert using a grammar that facilitates the definition and targeting of disallowed content. The syntax of the language and how it is used in scoring are documented in detail herein.
Scoring:
This algorithm transforms the user-written rules into machine-usable patterns that are sent either to the software emulator or the high-throughput pattern matching hardware. The choice to use the emulator or hardware is a user-configurable option, as documented herein. The results of the pattern matching are then coupled with complex domain rules for in-context scoring of matching terms.
Message Flow Logic:
The message flow module uses values from different algorithms, such as the “English” word ratio, the presence or absence of image attachments, and other content indicators, to decide whether the message contains disallowed content.
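A minimal sketch of such decision logic is shown below; the parameter names and thresholds are hypothetical placeholders standing in for the actual configuration values:

public class MessageFlowSketch {

    static boolean requiresReview(double score, double scoreTrigger,
                                  double englishRatio, double englishThreshold,
                                  boolean hasImageAttachment) {
        if (hasImageAttachment) return true;              // images cannot be scored as text
        if (englishRatio < englishThreshold) return true; // mostly non-English content
        return score > scoreTrigger;                      // close fit to the target model
    }

    public static void main(String[] args) {
        System.out.println(requiresReview(120.0, 100.0, 0.9, 0.7, false)); // true: high score
        System.out.println(requiresReview(10.0, 100.0, 0.4, 0.7, false));  // true: low English ratio
        System.out.println(requiresReview(10.0, 100.0, 0.9, 0.7, false));  // false: deliver normally
    }
}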
Gateway-Milter:
The gateway-milter daemon code runs continuously, listening for requests and email messages being sent from the ZCS. Once a request is received, it spins off a separate thread to handle that request. In this manner, the processing of one message should not delay the delivery of another. The software emulator is not as fast as the pattern-matching hardware. Messages are processed in parallel, up to the maximum number of threads specified in the configuration file.
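A sketch of this thread-per-request pattern, using a standard Java fixed-size thread pool and illustrative names (not the actual gateway-milter implementation), is shown below:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MilterDispatcherSketch {
    private final ExecutorService pool;

    public MilterDispatcherSketch(int maxThreads) {
        // at most maxThreads messages are scored concurrently, per the configured limit
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    /** Hand one incoming message off to a worker so other messages are not delayed. */
    public void handle(Runnable scoreMessageTask) {
        pool.submit(scoreMessageTask);
    }

    public void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        MilterDispatcherSketch dispatcher = new MilterDispatcherSketch(4);
        for (int i = 0; i < 8; i++) {
            final int id = i;
            dispatcher.handle(() -> System.out.println("scoring message " + id));
        }
        dispatcher.shutdown();
    }
}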
Derivative Classifier (DC) Review Tools:
The review tools are the modules that a human reviewer will use to adjudicate the contents of suspected messages. The interface is web-based and uses the Zimbra email client interface. The account is a special “reviewer” email account and actions taken within this account have special meaning. If a reviewer “forwards” the message, it is interpreted by eMF as the reviewer stating that the message is no longer suspected (e.g., the system has encountered a false positive). If the reviewer “replies” to a message, the message goes back to the sender for a classification marking correction, and if it is deemed to contain unallowable content, the message may be kept for future reference/action. In addition to containing a standard inbox, the reviewer account may have additional email folders to hold adjudicated messages. Suspicious content in each message is highlighted, as described below.
Highlighter:
The highlighter tool highlights the targeted content in the context of the original file. It uses results from the pattern matching as well as the goodness-of-fit results from the scoring algorithm. It displays the file in an HTML browser with hyperlinked and highlighted text.
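A minimal sketch of the idea, using an illustrative <mark> tag rather than the actual eMF highlighting markup, is shown below:

import java.util.List;
import java.util.regex.Pattern;

public class HighlighterSketch {
    /** Wrap each matched term in a <mark> element so a browser renders it highlighted. */
    static String highlight(String text, List<String> matchedTerms) {
        String html = text;
        for (String term : matchedTerms) {
            // case-insensitive, literal match; $0 re-inserts the matched text as found
            html = html.replaceAll("(?i)" + Pattern.quote(term), "<mark>$0</mark>");
        }
        return "<html><body><p>" + html + "</p></body></html>";
    }

    public static void main(String[] args) {
        String text = "The pipeline leak caused local pollution.";
        System.out.println(highlight(text, List.of("pipeline", "leak")));
    }
}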
Inventory:
Setup software and applications are contained on a Distribution CD with the eMF software, including all required third-party open source software and libraries as well as all the code for the Reviewer Interface. This distribution disk assumes that the other required packages have been installed. This CD contains sample unclassified rules and testing material.
eMF Installation:
The gateway-milter and other modules do the email content scoring, redirection of emails, and notification to the user that the email has been redirected. The gateway-milter is based upon the design of the Clam Antivirus software and the Libmilter API software, which is part of the sendmail 8.14.4 distribution and is publicly available as open source. The documentation for the Libmilter library can be found on the milter website.
Preparing for Installation:
It is assumed that ZCS 6.0.10, Apache Tomcat 6 or newer, and Java JDK 1.6 or newer are installed and configured properly. Testing the functionality of these packages is preferably done prior to eMF installation.
User accounts must be created and passwords assigned. Also, the Linux account “zimbra” is the owner of the installation directories and other zimbra work files. Ideally the installation location is the default “/opt/zimbra” directory.
To test ZCS, sending a simple email message can be used. This test will walk through all the paces of login, email composition, and email reading using the ZCS web interface. User validation/authentication is required, and only valid users will be able to get to the ZCS web mail interface. Any issues encountered should be resolved before proceeding.
To test Tomcat, using a web browser to open the default URL will verify proper functioning. Furthermore, Tomcat should be set up to run under the Linux zimbra user account. How this is done depends on the way Tomcat is installed. For example, for “package” installations a configuration value (TOMCAT_USER) needs to be changed to zimbra. In others, the Tomcat process needs to be started from the zimbra account.
To test Java, the "java -version" command in a terminal window will verify that it is configured correctly. The first line of the output should be something like: java version "1.6.0_24". Output such as "command not found" or "java not recognized" indicates that Java is not properly installed.
Gateway-Milter Installation:
The recommended location of the gateway-milter installation is “/opt/zimbra/data/output” assuming that zimbra is installed in the default location “/opt/zimbra”. An alternate location is possible. Similarly, special accounts such as “dc_scanner” and “special” are created on the ZCS system. The configuration file in the installation directory reflects these values. The configuration file also specifies the location of the Tomcat server. This value is updated prior to installation. Also, if the default location is not used, several test scripts are updated to reflect the chosen installation directory.
Scripts that are to be updated are:
ZCS to Gateway-Milter Configuration Parameters:
There are many parameters in the configuration file. The default values should work well, though any changes to these values should be carefully chosen before installation as they may have a significant impact on the correct running and communications between processes. After initial installation this file is located at /opt/zimbra/clamav/etc and the parameters can be modified to affect scoring and message rerouting.
Four parameters are involved in correct communication between ZCS and the milter daemon. They are: MilterSocket, TemporaryDirectory, OutputDirectory, and LogFile. Details for each are documented below.
The MilterSocket value should correspond to the value that will be used when running the installation script. In the sample command line depicted below, the value 2704 should be consistent. This value was chosen because it is in the proximity of the other filter daemons and does not conflict with other standard installed Red Hat Linux packages.
The next three values are used for logging, processing, and the location of analysis output files. The recommended installation values are depicted below, assuming that Zimbra is installed in the default “/opt/zimbra” location. Adjust as necessary for the local environment.
The values listed in this section are used for scoring and can be updated after installation by editing the configuration file. These parameters are used in the communication between the milter and the scoring script and are crucial to the scoring process.
DerivativeScanner is the name of the account where suspected email messages would be rerouted. This account should be a zimbra email account. Notice that there is only one Derivative Classifier account value. All suspect messages will be routed to this account.
ScoreScript identifies the scoring script that will be invoked for every message to be processed. The output of this script is passed back to the milter. Warnings and other messages generated by this script are logged to the milter log file identified in the previous section.
Messages that score higher than the ScoreTrigger value will be rerouted for human review. This value needs to be carefully chosen and must be correlated to the rules used in scoring. Refer to Section 0 of this manual for more information.
EnglishPercentThreshold is another trigger parameter: the ratio of English words to non-English words. If this ratio is below the number indicated by this value, the message is rerouted. The scoring works well with good English text, but OCR output or foreign language documents should be reviewed by a DC.
SpecialAddress is used by the DC Reviewer to identify false positives. Those messages deemed to be false positives will be “forwarded” by the DC Reviewer to this special address to indicate that the message should be released.
Hyperlink is the string associated with the URL for the DC review browser. This value is added to the top of each message rerouted for review. This is a web-based URL that points to the installation location of Tomcat and the linked results directory under Tomcat. The unique ID of a message is added to the value in this line to create the hyperlink added to rerouted messages. This is the hyperlink a DC reviewer will follow to review the contents of a suspected message. The hypertext transfer protocol (HTTP) value shown here, localhost:8080, should reflect the values to be used at this particular site. The value as shown here may only be appropriate for testing.
After any change is made, the gateway-milter process must be restarted using the following commands executed by the “root” user:
The installation script (install.sh) needs to reflect three values as intended for installation at this site:
To verify that these values are set as intended, edit the “install.sh” script located at the top level of the installation directory as copied from the distribution CD. Suggested values can be included in the distribution install.sh script.
INSTALL_DIR is the location of the installation files. This is the destination directory for copying the contents of the distribution CD. This directory is transient and therefore can be erased after successful testing has been completed.
DEPLOY_TO directory is the location for software deployment. If changed from the recommended value as provided in the distribution CD, it should be changed in the other scripts. Furthermore, all instances of this value must be changed in the configuration file.
TOMCAT_DIR is the top level Apache Tomcat directory. At this level, the webapps, bin, etc. folders are located. The Tomcat setup allows the hyperlinks to be able to refer to the URL that will display the highlighted content. This directory is dependent on the installation and has no default location. The recommended value is:
where X is the version of Tomcat.
The distribution disk contains an archive named gateway_install.tar.gz; it should be unpacked to "/opt". Beware: the /opt partition might be too small for the eMF code and zimbra. If this is the case, create "opt" under "/local" and create a symbolic link from "/".
The entire /opt/gateway_install/README file is disclosed herein; the following are a few salient points of the installation procedure as it directly relates to the milter process. The "postconf" command is the one that registers the milter with postfix as listening on port 2704 of the local machine. Prior to installation, shut down Zimbra.
Copy the folder gateway_install to /opt.
After the script runs, execute the following commands:
To permanently turn off default sendmail, the following command should be run once.
Exit from zimbra account to root,
At this point the gateway milter software should be running and filtering content.
The Tomcat server should be restarted at this time.
Access Control: Zimbra and Tomcat operations should be executed from the “zimbra” Linux account. This account should also be used to make configuration changes.
The milter process should be started from the “root” account.
Configuration Files: The main configuration file will be in /opt/zimbra/clamav/etc/gateway_milter.conf. This file can then be updated if installation values change later on.
The whitelisted_addresses file contains a list of addresses that will be ignored by the scanner. It is beneficial if this list is correct and complete, as otherwise messages can be misrouted (e.g., allowed to pass through unscanned or looped back to the reviewer repeatedly). The list also includes addresses for system messages and other internal routine mail messages. It is located at:
Messages are not accepted by the system until scoring has been completed. The default Zimbra system timeout may be too short to accommodate very large messages. As a result, the web interface will indicate that the message was not received by the system, when in fact it may have been processed correctly. This usually occurs with messages with very large attachments. To address this issue, it is recommended that the following Zimbra option be set accordingly. As shown in
The eMF process generates a series of temporary files and analysis folders that should be periodically removed and/or archived. Files located in /opt/zimbra/data/tmp and /opt/zimbra/data/output/process should be cleaned after 24 hours. Folders in the /opt/zimbra/data/output/results folder should be archived for future reference. These folders contain the expanded attachments, text extracted from files, highlighted content, and scoring information. They have a unique name of the form msgYYMMDD_#######_dat. These are the folders referenced by the hyperlink provided to the reviewer in the messages rerouted to their inbox. It is beneficial to give the reviewer ample time to adjudicate the message, such as not less than one week after the message was sent.
Suggested Linux commands are as follows:
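One possible form of these commands is shown below (retention periods and the archive destination are illustrative and should follow site-specific guidance):

    # Remove temporary files older than 24 hours.
    find /opt/zimbra/data/tmp -type f -mtime +0 -exec rm -f {} \;
    find /opt/zimbra/data/output/process -type f -mtime +0 -exec rm -f {} \;
    # Archive the results folders (archive path is illustrative), then remove those
    # older than one week so that reviewers have ample time to adjudicate.
    tar -czf /opt/zimbra/archive/results_`date +%Y%m%d`.tar.gz /opt/zimbra/data/output/results
    find /opt/zimbra/data/output/results -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;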
These commands should be part of a cleanup script run on a daily basis to clean up and archive files following site-specific guidance. Care should be taken with the use of forward and backward quotes and spacing as shown in the commands listed above.
ZCS and Tomcat management documentation is beyond the scope of this manual. The only new process is the gateway-milter. The best way to clear up errors is to systematically check the packages for basic functionality. Start with Zimbra; it is best to use the “zmcontrol” utility. As the zimbra user, execute the “zmcontrol status” command to verify that all of the Zimbra modules are operational. If they are not, this should be resolved first. Second, the milter should be restarted as a precaution, as instructed below.
To check to see if the milter is running, use the following command:
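For example, the following lists any running gateway-milter process and its process ID:

    ps -ef | grep gateway-milter | grep -v grep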
If you suspect that the milter is hung, it can be restarted by first killing the process identified by the “ps” command listed above, and then starting it again. One indication of a hung milter is the inability to send emails. If the milter is not running, a correctly configured ZCS/eMF system will not allow messages to flow.
By default the milter is not configured to run at system start time and should be set up according to each site's start-up procedures. It may be appropriate to use “/etc/init.d” scripts for this purpose.
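By way of illustration, a minimal “/etc/init.d” wrapper could take the following form (the binary path and its options are hypothetical and must match the site installation):

    #!/bin/sh
    # /etc/init.d/gateway-milter -- minimal sketch; binary path and options are hypothetical.
    MILTER=/opt/gateway/bin/gateway-milter
    CONF=/opt/zimbra/clamav/etc/gateway_milter.conf
    case "$1" in
      start)
        $MILTER -c $CONF &
        ;;
      stop)
        kill `ps -ef | grep "[g]ateway-milter -c" | awk '{print $2}'` 2>/dev/null
        ;;
      restart)
        $0 stop; sleep 2; $0 start
        ;;
      *)
        echo "Usage: $0 {start|stop|restart}"
        ;;
    esac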
The same procedure used to restart a hung milter should be used if changes are made to the configuration file.
Start-up or run time errors are written to the log files listed below. If problems occur, consult these two logs for error indicators.
Zimbra has many different error conditions, and they will change with newer releases. However, it should be noted that once the milter is installed, mail will not flow through Zimbra unless the milter is running correctly. This is a fail-safe condition: no email will flow unless it has been scanned. Therefore, if mail cannot be delivered (it can be composed, but an error occurs when sending), the milter may be the problem. One can check the milter log file to see whether an error condition is recorded; as a precaution, the milter may then be restarted and the message sent again. A simple text message should be used for basic testing first. If the milter process dies after each test, then eMF support should be contacted for in-depth guidance.
The zimbra.log file contains Zimbra-specific messages and may have indicators of Zimbra installation problems.
The gateway_milter.log file contains detailed entries about the messages as they are being analyzed. It can be monitored in real time by using the “tail -f” command. This is useful for testing the installation.
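For example (the zimbra.log location is the Zimbra default; the gateway_milter.log location is site-dependent and shown only as an illustration):

    # Follow the logs in real time while testing.
    tail -f /var/log/zimbra.log
    tail -f /opt/zimbra/log/gateway_milter.log    # illustrative path; see the eMF configuration for the actual location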
Rules can be used for defining the targeted concepts and determining the goodness-of-fit of a document's text to those concepts. Rules reside in the /opt/zimbra/data/output/rules folder. Changes to the rules file should be made in this directory and require a “recompile” of the rules. This process preprocesses the rules and optimizes them for use in real-time filtering. The script that compiles the rules is located in the /opt/zimbra/data/output/distGateway folder and is called compile.sh. Changes to the rules and execution of the script should be done under the zimbra account.
The main command is:
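The exact invocation is given in compile.sh; a hypothetical form, using the parameters described below (the executable invoked by compile.sh and the file paths are illustrative), is:

    # Hypothetical form of the rule-compilation command; the actual executable and
    # default paths are defined in compile.sh.
    cd /opt/zimbra/data/output/distGateway
    ./compile.sh -ruleFile /opt/zimbra/data/output/rules/myrules.txt \
                 -output /opt/zimbra/data/output/rules/compiled.rules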
The -ruleFile parameter specifies the input rule file. This is a text file that follows the syntax described below. This file can have any name and location. The -output parameter is the location of the preprocessed rules file. This file is referenced by other scripts and therefore should maintain the name and location as specified in the compile.sh script. The compile script can be invoked as many times as needed to compile changes made to the rules. This should be an iterative process of rule development and testing. For rule testing, there is a runtest1.sh script located in the same folder as the compile script. This allows off-line testing of rules and scoring. A correctly configured runtest1.sh requires only the name of the input file. The output is stored in “/opt/zimbra/data/output/results/<file_name>.” For testing purposes, the <file_name> folder should be deleted between runs. This is not an issue when running the milter, as it generates a unique name for each message.
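For example, a hypothetical test run on a single file could look like the following (the test file name is illustrative):

    # Off-line scoring of a single test file; results appear under
    # /opt/zimbra/data/output/results/<file_name>.
    cd /opt/zimbra/data/output/distGateway
    ./runtest1.sh test_message.txt
    # Delete the corresponding <file_name> folder before re-running the same test.
    rm -rf /opt/zimbra/data/output/results/test_message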
The two main types of rules are concept rules and weighted rules. Concept rules are used to define words and/or phrases. The main purpose of this type of rule is to group conceptually related terms for later reuse. Concept rule blocks begin with the label “SYN” and end with “ENDSYN”. Weighted rules are used to assign a specific weight to one or more concept rules. Weighted rule blocks begin with the label “WF<weight function parameters>” and end with “ENDWF”. They usually comprise one or more references to concept rules. Only the weighted rules contribute to the total score of the document when they are matched.
Concept Rule Syntax
Every concept rule definition must start with the following:
Every rule definition must close with the following line:
The word list enclosed in paragraph-level locality translates to matching all of these words within the same paragraph. A paragraph is defined as a series of words or numbers ending with any one of {‘.’ ‘!’ ‘?’}. By default, paragraphs are limited to having no more than 10 sentences. Sentence locality is activated with the following rule syntax:
The word list enclosed in sentence-level locality requires that each of the listed words appear within the same sentence. This is the DEFAULT locality: if no locality is specified and the words do not appear in double quotes, each of the specified words must appear, in any order, within the same sentence to trigger a match to the specified definition. By default, sentences are limited to having no more than 30 words.
Clause locality is activated with the following rule syntax:
The word list enclosed in clause-level locality requires each of the words to appear within the same clause in order to count as a match for this rule line. A clause cannot be longer than the sentence that contains it and is therefore limited to having no more than 30 words.
A primary purpose of SYN rules is to group concepts that are related to one another in some meaningful way so that the SYN can be incorporated into other SYN rules or weighted rules. Each line in a SYN rule definition is considered to be interchangeable with each other line within the same rule definition. Once a rule is defined, it can be reused in other rule definitions by referring to its unique name. If the rule name is one single word, without any spaces, it can be referenced by preceding the name with an equal sign (‘=’) anywhere in rule definition lines:
If the rule name contains empty spaces in its name, the following syntax must be used:
Weighted rules are collections of one or more concepts, with a weighting function assigned to each collection. The syntax is generally the same as for a concept rule, except that weighted rules do not have a unique name identifier and are meant mainly to reuse concept rules. As mentioned before, their primary use is to define the weight function by which the included concept rules will contribute to the total score of the document.
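By way of a purely hypothetical illustration, a small rule set could look like the following. The words, the placement of the rule name next to the SYN label, and the weight-function parameters inside “WF< >” are assumptions for illustration only; the SYN/ENDSYN and WF/ENDWF labels and the “=” reference form are as described above. The concept rule groups interchangeable phrases, and the weighted rule assigns them a score contribution:

    SYN launch_terms
    launch date
    release schedule
    ship date
    ENDSYN

    WF<CONST 5>
    =launch_terms
    ENDWF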
A variety of weight functions are available for defining how the rule is weighted. They are:
CONST
For the weight rule:
Similarly,
SLOPE
STEP
if numHits <= numSteps:
increment = step_0 + step_1 + . . . + step_{numHits-1} (that is, the sum of the first numHits step values)
else:
increment = step_0 + step_1 + . . . + step_{numSteps-1} + step_{numSteps-1}*(numHits - numSteps) (that is, the sum of all numSteps step values, with the last step value repeated for each hit beyond numSteps)
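As an illustration of the STEP computation, a minimal sketch in shell follows; the step values and hit count are arbitrary example inputs, not values taken from the eMF rules:

    #!/bin/bash
    # Illustrative computation of the STEP weight-function increment described above.
    step=(5 3 2)              # step_0, step_1, step_2 (numSteps = 3); example values
    numHits=5                 # number of matches observed in the document; example value
    numSteps=${#step[@]}
    increment=0
    if (( numHits <= numSteps )); then
      for (( i = 0; i < numHits; i++ )); do
        (( increment += step[i] ))
      done
    else
      for (( i = 0; i < numSteps; i++ )); do
        (( increment += step[i] ))
      done
      (( increment += step[numSteps-1] * (numHits - numSteps) ))
    fi
    echo "increment = $increment"    # for these example inputs: 5+3+2 + 2*(5-3) = 14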
The DC Reviewer Interface provides the Derivative Classifier with access to the suspected content via a web-based interface. The directory for each message contains the analyzed files that are needed to adjudicate the messages that scored above the predefined limit. The interface can be easily accessed through the provided link in any redirected message. The original message, extracted text, and any attachments, are accessible through the interface. Package files such as zip, tar, gzip, and bzip are expanded into folders so that the reviewer can see their raw content by following the links provided within the interface.
An overview of the process is depicted in
If a message scores higher than the threshold or contains unsupported formats, the message is routed for review. At this time, the possibly-disallowed content, in context, is highlighted, to facilitate a rapid review of the message. The re-routed messages are sent to a special account as defined in the configuration file. The email message contains a hyperlink to the material for review. The DC then reviews the content and decides what action to take. If this is a “false positive,” the reviewer may allow the message to flow to the intended recipients. If the message is not properly marked, the message should be sent back to the original sender for correction. If the message contains disallowed content, the message is retained and not allowed to flow, requiring further actions outside of the eMF software.
The DC reviewer will log on to ZCS using the web-based interface shown in
The value to modify is called:
Logging into the system will bring up an unmodified web email interface under ZCS. As with any email account, it will display the messages in the inbox and give the ability to manage messages as well as to create other mailboxes as needed. Redirected messages are displayed in the interface. Because of the special nature of this account, the actions taken on it are interpreted by the eMF. A sample inbox is shown in
The top part of the web page displays a list of the incoming messages. In
In
When the hyperlink is clicked, the user will see something similar to the image depicted in
In the top pane, the score and the system-generated email message identifier are provided. The pane on the left contains indices of the matched words that contributed to the scoring of the document. They are arranged from the highest-weighted rule (starting from the top) to the least significantly weighted rule. The color-coding of red to green corresponds to the contribution of a term in a concept rule. The number in parentheses is the number of matches of that term in the combined text. Finally, the right panel contains the actual combined text content of the message.
The left index panel is interactive, allowing the user to navigate through the text to the first instance of a term in the text panel. The right panel also supports an interactive mode; it allows hit-to-hit navigation. A click on the right side of a term takes the user to the previous instance of that term, if a previous instance exists. A click on the left side of a term takes the user to the next instance of that term, if a subsequent instance exists.
Here is the list of commonly seen files in the viewed directory:
For options 1 to 3, the message can be moved to another mail folder created by the reviewer to denote action taken, pending action, or any other organization of the messages the reviewer needs.
Sample error messages and their interpretation are depicted below. Only messages that require adjudication are rerouted to the DC Reviewer. Rerouting occurs for several reasons. A snapshot of the message as it is displayed in the ZCS web mail window is shown.
At the top of the message sent to the reviewer, there is a hyperlink to the details for that message, and following the hyperlink is a brief explanation of why the message was rerouted. The possible conditions that reroute a message are as follows:
1) The message contains disallowed content
2) The message contains unsupported file formats
3) The message contains foreign language or jumbled text
4) A software processing error occurred.
The reviewer may need to take different action depending on the condition of the rerouted message. If the reviewer determines that this message does contain disallowed content, he or she should follow site-specific guidance. Otherwise the reviewer may determine that the message is a false positive and should be allowed to flow through the system or sent back to the originator for changes. The system-generated messages that may appear at the top of the rerouted message are displayed below. Following these messages there may be additional information such as the file names of offending attachments and/or the programming error codes.
“There is probable disallowed content in this message based on a score of ###.##”
Condition one means that the goodness of fit to the disallowed-content model is higher than a configured threshold, and therefore the system suspects that the message contains disallowed content.
“This message contains images or non-supported file formats. Those attachments could not be assigned a score.”
Condition two signifies that the body of the message or one or more of its attachments contains files in formats that are not supported for text extraction. These are files such as images, audio and video, or other special formats. A list of the offending files follows the message for this condition. The reviewer should carefully consider all factors, such as the scoring and the content of image and other files, to make a decision.
“A score could not be assigned to an attachment to this message. Some part probably contains foreign language or jumbled text.”
Condition three signifies that text extracted from the body of the message or one or more of its attachments does not look like English text. These may be spreadsheets, files that used Optical Character Recognition (OCR) to extract text out of images, or those in a foreign language. As for condition two, the reviewer should carefully consider all options and decide if this is allowed or disallowed content.
“There was a program error in processing this message.”
Condition four signifies that there was a software error in processing this message and scoring and other analysis may be unreliable. The data available in the review interface may still be good, but may not be complete, in which case the reviewer should carefully examine the message and decide what action to take.
Various actions can be taken by the system for messages generated using ZCS and eMF. It should be noted that email messages that flow through this system will be filtered for unallowable content. Messages containing such content, as well as messages containing audio, video, and images, will be redirected for human review and will therefore be delayed. Similarly, content that is not marked appropriately may be delayed pending review. Therefore, users are encouraged to carefully select the material, including attachments, to be sent, and to ensure that it contains only “allowed” content and is marked appropriately.
Email users will log on to the system using the site-specific provided URL. This will prompt the user for account name and password. Once these have been provided, a screen, like the one depicted in
Clicking on the “New->Compose” under the Mail tab will bring the user to a screen depicted in
Messages that do not contain disallowed content, images or other audio/visual attachments will flow through the system unchanged. Other messages will be redirected for human review and will be delayed if a reviewer is not readily available.
Selecting the “Send” button will initiate the analysis of the message and will wait for the server's reply before allowing the user to continue. Messages with long attachments may take a minute or more to process. The user should be patient and wait for the system's reply before proceeding.
Messages that are rerouted for DC review will generate a warning back to the original sender to indicate that the message will be delayed until it has been reviewed. A sample message sent to original sender is depicted in
Another condition that causes emails to be rejected is the size of the attachments. Most sites restrict the size of an email and attachments to 10 MB. Please adhere to local site guidance, as these large email messages will be rejected by the system.
Tomcat under Red Hat Linux is usually installed via a simple extraction of the distribution contents from a “.gz” file. An exemplary sequence is as follows:
1. Download tomcatX.Y.Z.gz for Red Hat from the internet.
2. As root:
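For example (directory names are illustrative; X.Y.Z is the downloaded version):

    # Extract the Tomcat distribution (run as root) and start the server.
    cd /opt
    tar -xzf /path/to/tomcatX.Y.Z.gz            # creates the top-level Tomcat directory (TOMCAT_DIR)
    /opt/apache-tomcat-X.Y.Z/bin/startup.sh     # use the actual name of the extracted directory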
The rule-graphing package is located in the /opt/zimbra/data/output/GTree directory. To run it, the user must be in the GTree directory (where FixFile.jar, guess.bat, and the guess directory are located).
The command is:
guess.bat (this is the name of the script being called) must contain the right script commands for the OS, and in Linux/Unix it must be an executable (chmod 775 guess.bat).
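An illustrative run is shown below; whether guess.bat takes any arguments depends on how the script was configured, so none are shown here:

    # Run the rule-graphing tool from within the GTree directory.
    cd /opt/zimbra/data/output/GTree
    chmod 775 guess.bat        # make the script executable on Linux/Unix
    ./guess.bat                # any arguments are defined by the script itself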
A sample screen is depicted in
In the center, the main rule in this file is provided, then synonyms (depicted, e.g., in red), sentences (such as in purple), clauses (such as in blue), and simple terms (literals, such as in orange). These can be toggled on and off to display or hide the items listed.
This function is provided for advanced users, and it can be very powerful for displaying the details of complicated rule sets in one graph.
The techniques and solutions described herein can be performed by software, hardware, or both, in a computing environment such as one or more computing devices. For example, computing devices include server computers, desktop computers, laptop computers, notebook computers, handheld devices, netbooks, tablet devices, mobile devices, PDAs, and other types of computing devices.
With reference to
A computing environment may have additional features. For example, the computing environment 900 includes storage 940, one or more input devices 950, one or more output devices 960, and one or more communication connections 970. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment 900. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 900, and coordinates activities of the components of the computing environment 900.
The storage 940 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other computer-readable media which can be used to store information and which can be accessed within the computing environment 900. The storage 940 can store software 980 containing instructions for any of the technologies described herein.
The input device(s) 950 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing environment 900. For audio, the input device(s) 950 may be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment. The output device(s) 960 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 900.
The communication connection(s) 970 enable communication over a communication mechanism to another computing entity. The communication mechanism conveys information such as computer-executable instructions, audio/video or other information, or other data. By way of example, and not limitation, communication mechanisms include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
The techniques herein can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment.
Any of the computer-readable media herein can be non-transitory (e.g., memory, magnetic storage, optical storage, or the like).
Any of the storing actions described herein can be implemented by storing in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).
Any of the things described as stored can be stored in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).
Any of the methods described herein can be implemented by computer-executable instructions in (e.g., encoded on) one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Such instructions can cause a computer to perform the method. The technologies described herein can be implemented in a variety of programming languages.
Any of the methods described herein can be implemented by computer-executable instructions stored in one or more computer-readable storage devices (e.g., memory, magnetic storage, optical storage, or the like). Such instructions can cause a computer to perform the method.
The technologies from any example can be combined with the technologies described in any one or more of the other examples. In view of the many possible embodiments to which the principles of the disclosed technology may be applied, it should be recognized that the illustrated embodiments are examples of the disclosed technology and should not be taken as a limitation on the scope of the disclosed technology. Rather, the scope of the disclosed technology includes what is covered by the following claims. We therefore claim as our invention all that comes within the scope and spirit of the claims.
This application claims priority to U.S. Provisional Application No. 61/609,792, filed Mar. 12, 2012, which is incorporated herein in its entirety by reference.
This invention was made with government support under Contract No. DE-AC52-06NA25396 awarded by the U.S. Department of Energy. The government has certain rights in the invention.