System and method for protecting specified data combinations

Description

TECHNICAL FIELD OF THE INVENTION

This invention relates in general to the field of data management and, more particularly, to a system and a method for protecting specified combinations of data.

BACKGROUND OF THE INVENTION

Computer networks have become indispensable tools for modern business. Enterprises can use networks for communications and, further, can store data in various forms and at various locations. Critical information frequently propagates over a network of a business enterprise. Certain federal and state regulations provide restrictions covering the dissemination of particular types of information by various organizations or businesses. Thus, in addition to the potential loss of proprietary information and the resulting negative impact to business, an enterprise may also face legal liability for the inadvertent or intentional leakage of certain data. Modern enterprises often employ numerous tools to control the dissemination of such information and many of these tools attempt to keep outsiders, intruders, and unauthorized personnel from accessing or receiving confidential, valuable, or otherwise sensitive information. Commonly, these tools can include firewalls, intrusion detection systems, and packet sniffer devices.

Offering an effective data management system or protocol that is capable of securing and controlling the movement of important information can be a significant challenge to security professionals, component manufacturers, service providers, and system administrators alike.

BRIEF DESCRIPTION OF THE DRAWINGS

To provide a more complete understanding of the present invention and features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying figures, wherein like reference numerals represent like parts, in which:

FIG. 1 is a simplified block diagram of an exemplary implementation of a system for protecting specified data combinations in a network environment in accordance with one embodiment of the present disclosure;

FIG. 2 is a simplified block diagram of a computer, which may be utilized in embodiments of the data combination protection system in accordance with the present disclosure;

FIG. 3 is a block diagram of a registration system in the data combination protection system in accordance with one embodiment of the present disclosure;

FIG. 4 is a block diagram of various data file structures in the data combination protection system in accordance with one embodiment of the present disclosure;

FIG. 5 is a simplified block diagram with example data input and output in accordance with one aspect of the registration system of the present disclosure;

FIGS. 6A, 6B, and 7 are simplified flowcharts illustrating a series of example steps associated with the registration system;

FIG. 8 illustrates file contents in an example scenario associated with the registration system processing in accordance with one embodiment of the present disclosure;

FIG. 9 is a block diagram of a detection system in the data combination protection system in accordance with one embodiment of the present disclosure;

FIG. 10 is a simplified block diagram with example data input and output in accordance with one aspect of the detection system of the present disclosure;

FIGS. 11-12 are simplified flowcharts illustrating a series of example steps associated with the detection system; and

FIG. 13 illustrates file contents in an example scenario associated with the detection system processing in accordance with one embodiment of the present disclosure.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

Overview

A method in one example embodiment includes extracting a plurality of data elements from a record of a data file, tokenizing the plurality of data elements into a plurality of tokens, and storing the plurality of tokens in a first tuple of a registration list. The method further includes selecting one of the plurality of tokens as a token key for the first tuple, where the token key occurs less frequently in the registration list than each of the other tokens in the first tuple. In more specific embodiments, at least one data element is an expression element having a character pattern matching a predefined expression pattern, where the predefined expression pattern represents at least two words and a separator between the words. In other specific embodiments, at least one data element is a word defined by a character pattern of one or more consecutive essential characters. Other more specific embodiments include determining an end of the record by recognizing a predefined delimiter.
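The token key selection described above can be illustrated with a minimal Python sketch. It assumes tokens are plain integers and that each tuple of the registration list is a Python tuple; the function name `select_token_keys`, the sample token values, and the tie-breaking rule (earliest token wins) are hypothetical and are not taken from the disclosure.

```python
from collections import Counter

def select_token_keys(registration_list):
    """Return one token key per tuple: the tuple's least frequent token."""
    # Count how often each token occurs across the whole registration list.
    frequency = Counter(token for tup in registration_list for token in tup)
    # min() returns the first minimal item, so ties go to the earliest token
    # in the tuple (an assumption made only for this sketch).
    return [min(tup, key=lambda token: frequency[token])
            for tup in registration_list]

# Hypothetical tokens for three registered records.
registration_list = [
    (101, 202, 303),
    (101, 404, 505),
    (101, 202, 606),
]
keys = select_token_keys(registration_list)
print(keys)
```

Choosing the rarest token as the key means that a lookup on that key is likely to return few candidate tuples, which keeps later comparisons small.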

A method in another example embodiment includes extracting a plurality of data elements from an object, tokenizing the plurality of data elements into a plurality of object tokens, and identifying a first tuple in the registration list. The method further includes determining if each one of a plurality of associated tokens in the first tuple corresponds to at least one of the object tokens. Additionally, the method includes validating an event if an amount of correspondence between the plurality of associated tokens in the first tuple and the plurality of object tokens meets a predetermined threshold. In more specific embodiments, the predetermined threshold is met when each of the associated tokens in the first tuple corresponds to at least one of the plurality of object tokens.
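The threshold test in this method can be sketched as follows. This is an illustration under stated assumptions: tokens are integers, the threshold is expressed as a fraction of the tuple's tokens, and the name `validate_event` is hypothetical. A threshold of 1.0 corresponds to the more specific embodiment in which every associated token must be found.

```python
def validate_event(tuple_tokens, object_tokens, threshold=1.0):
    """Return True when enough of the tuple's tokens appear in the object."""
    if not tuple_tokens:
        return False  # an empty tuple cannot validate an event
    object_set = set(object_tokens)
    matched = sum(1 for token in tuple_tokens if token in object_set)
    # Compare the matched fraction against the predetermined threshold.
    return matched / len(tuple_tokens) >= threshold

# Hypothetical tokens extracted from an object.
object_tokens = [7, 101, 9, 303, 202, 55]

full_match = validate_event((101, 202, 303), object_tokens)
partial_fail = validate_event((101, 202, 999), object_tokens)
partial_ok = validate_event((101, 202, 999), object_tokens, threshold=0.6)
```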

Example Embodiments

FIG. 1 is a simplified block diagram illustrating an example implementation of a data combination protection system 10 for registering and detecting specified combinations of data in an exemplary network 100. Data combination protection system 10 may include multiple network elements such as a network appliance 12 having a registration system 22 and a plurality of network appliances 14, 16, and 18 having detection systems 24, 26, and 28, respectively. These network appliances 12, 14, 16, and 18 can be managed by or otherwise coupled to another network element such as network appliance 30 with a data protection manager 32. In addition, a network security platform 140 may provide an existing infrastructure of network security for network 100 and may be suitably integrated with data combination protection system 10.

The network environment illustrated in FIG. 1 may be generally configured or arranged to represent any communication architecture capable of exchanging packets. Such configurations may include separate divisions of a given business entity such as that which is shown for purposes of illustration in FIG. 1 (e.g., a Marketing segment 152, a Sales segment 154, a Production segment 156). In addition, other common network elements such as an email gateway 162, a web gateway 164, a switch 172, a firewall 174, and at least one client device 130 may also be provided in network 100. Network 100 may also be configured to exchange packets with other networks, such as Internet 180, through firewall 174.

Data combination protection system 10 can help organizations protect against the inadvertent and intentional disclosures of confidential data from a network environment. Embodiments of data combination protection system 10 can be used to register specified combinations of data elements and to detect registered data combinations within objects of the network environment. For example, data elements that are sufficiently distinctive when combined to identify an individual, and which can potentially expose confidential or sensitive information about the individual, can be registered as a combination and detected in objects in the network by data combination protection system 10. System 10 can create a registration list with each specified combination or set of data elements represented in a separate tuple or record of the registration list. The registering operations to create these tuples in the registration list can be performed on any data file having one or more sets of data elements with each set of data elements delimited from other sets of data elements by a predefined delimiter. The registration list can be indexed with keys, where each key corresponds to one of the data elements represented in a tuple.

Data combination protection system 10 can perform detecting operations to find one or more registered combinations of data elements in an object (e.g., word processing document, spreadsheet, database, electronic mail document, plaintext file, any human language text file, etc.) in the network environment. The object could be captured in the network and formatted for transmission (e.g., HTML, FTP, SMTP, Webmail, etc.), or stored in a database, file system, or other storage repository. In one embodiment, when all of the data elements in a registered combination of data elements (i.e., represented in one tuple of the registration list) are detected in an object, an event is flagged or validated and the object may be prevented from being transmitted and/or may be reported for a network operator or other authorized person to monitor and take any appropriate remedial actions. In other embodiments, if a particular threshold amount of a registered combination of data elements is found in an object, then an event may be validated.

For purposes of illustrating the techniques of data combination protection system 10, it is important to understand the activities and security concerns that may be present in a given network such as the network shown in FIG. 1. The following foundational information may be viewed as a basis from which the present disclosure may be properly explained. Such information is offered earnestly for purposes of explanation only and, accordingly, should not be construed in any way to limit the broad scope of the present disclosure and its potential applications.

A challenge in many security environments is the ability to control confidential electronic data. In one example security issue, many organizations collect and store data that can be used to identify individuals who may be associated with the organization or may simply be members of the general public or various segments thereof. This sensitive data may include, for example, name, social security number, credit card number, address, telephone number, date of birth, citizenship, account number, employer, marital status, and the like. A sensitive data element alone in an object, or even a small number of sensitive data elements in an object, may not be sufficiently distinctive to identify a particular person or to reveal confidential information. As the number of sensitive data elements associated with a particular person increases within an object, however, the possibility of the person becoming identifiable also increases and, therefore, the risk of exposing related confidential information increases. Similarly, other types of confidential information may also become identifiable as the number of associated data elements related to the confidential information increases (e.g., data elements related to intellectual property, corporate financial data, confidential government information, etc.).

Various federal and state laws also regulate the disclosure of individuals' nonpublic personal information and personally identifiable information by certain organizations or entities. For example, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) regulates the use and disclosure of protected health information (PHI) if the information is individually identifiable (i.e., containing information such as name, address, date of birth, social security number, or other information that could be used to identify a particular person). Similarly, the Gramm-Leach-Bliley Act of 1999 (GLBA) seeks to protect individuals' personal financial information by regulating the disclosure of non-public personal information by financial institutions. In another example, the Payment Card Industry (PCI) Data Security Standard also regulates the use and disclosure of data elements on payment cards. Such regulations may proscribe unauthorized dissemination of electronic data containing predetermined combinations of data elements (e.g., name, social security number, and date of birth) that could potentially identify particular individuals and their personal information.

Monitoring objects for sensitive data elements can be problematic for several reasons. First, the volume of data maintained in some networks requires sophisticated processing techniques to minimize network performance degradation. With roughly 300 million people in the United States alone, the number of data elements related to just those individuals could quickly increase to billions of data elements. Standard computer memory and processing capabilities need to be optimized in order to efficiently process objects to register and evaluate billions of data elements.

Another monitoring problem occurs because certain data is not always presented in a standard format. For example, numerous formats can be used for a date of birth (e.g., ‘Jun. 25, 1964’, ‘06-25-1964’, ‘1964.Jun.25’, etc.) or a telephone number (e.g., ‘(000) 000-0000’, ‘000-000-0000’, ‘000.000.0000’, etc.). In one example scenario, data elements may be stored in a network in one format, and then disclosed in an object in a different format. Regulations and resulting penalties for an unauthorized data disclosure, however, may apply to a disclosure of confidential information regardless of the format used in the disclosure. Thus, detecting sensitive data elements in objects requires recognizing varying formats of particular data.
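One way to make differently formatted values comparable is to reduce each to a canonical form before comparison. The sketch below is an assumption for illustration only: it strips every non-digit character from a ten-digit telephone number, and the helper name `normalize_phone` is hypothetical rather than part of the disclosed system.

```python
import re

def normalize_phone(text):
    """Reduce a ten-digit phone number to its digits, or return None."""
    digits = re.sub(r"\D", "", text)  # drop every non-digit character
    return digits if len(digits) == 10 else None

# The three telephone formats given as examples above.
formats = ["(000) 000-0000", "000-000-0000", "000.000.0000"]
canonical = {normalize_phone(f) for f in formats}
print(canonical)
```

All three formats collapse to the same canonical string, so a value registered in one format could be recognized when disclosed in another.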

The multitude of formats in which data can be shared electronically may also hinder security systems from successfully monitoring disclosures of confidential information. Electronic data can be provided in numerous configurations (e.g., spreadsheets with predefined columns and rows, email messages, word processing documents, databases, transmitted objects formatted using a defined protocol, etc.). Consequently, in a system in which specified combinations of data elements are being monitored, such elements may not necessarily be located in close proximity to other associated data elements of the same specified combination. The data elements in a particular specified combination could be separated by words, formatting characters, lines, or any separator or delimiter within an object. Sophisticated techniques are needed to evaluate and validate objects containing specified combinations of data elements, regardless of where such data elements appear within the object.

A system for protecting specified data combinations outlined by FIG. 1 can resolve many of these issues. In accordance with one example implementation of data combination protection system 10, registration system 22 is provided in network 100 to create a registration list of specified combinations or sets of data elements to be monitored. The registration system can recognize and register data elements presented in various character formats or patterns and provided in various electronic file formats having a predefined delimiter between each set of data elements. Multiple detection systems 24, 26, and 28 may also be provided to evaluate captured and/or stored objects in the network environment to determine which objects contain one or more of the registered sets of data elements. The detection systems may be configured to recognize data elements within an object and to determine whether each data element of a registered combination of data elements is contained somewhere within the confines of the object. The registration list may be indexed and searched by the detection system in a manner that optimizes computer resources and that minimizes any network performance issues.
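The indexed search mentioned above can be sketched in Python. This is a simplified illustration, not the disclosed data structures: the index is assumed to be a dictionary from a token key to the positions of tuples in the registration list, tokens are assumed to be integers, and the names `build_index` and `find_registered_combinations` are hypothetical.

```python
from collections import defaultdict

def build_index(token_keys):
    """Map each token key to the positions of the tuples it identifies."""
    index = defaultdict(list)
    for position, key in enumerate(token_keys):
        index[key].append(position)
    return index

def find_registered_combinations(object_tokens, registration_list, index):
    """Return positions of tuples whose tokens all appear in the object."""
    object_set = set(object_tokens)
    hits = []
    for token in object_set:
        # Only tuples keyed by a token present in the object are examined.
        for position in index.get(token, ()):
            if all(t in object_set for t in registration_list[position]):
                hits.append(position)
    return sorted(hits)

# Hypothetical registration list with one token key chosen per tuple.
registration_list = [(101, 202, 303), (101, 404, 505)]
index = build_index(token_keys=[303, 404])
hits = find_registered_combinations(
    [9, 303, 101, 202, 404], registration_list, index)
```

Because membership tests use a set of object tokens, the position of a data element within the object does not affect the result.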

Note that in this Specification, references to various features (e.g., elements, structures, modules, components, steps, etc.) included in “one embodiment”, “example embodiment”, “an embodiment”, “another embodiment”, “some embodiments”, “various embodiments”, “other embodiments”, “alternative embodiment”, and the like are intended to mean that any such features may be included in one or more embodiments of the present disclosure, but may or may not necessarily be included in the same embodiments.

Turning to the infrastructure of FIG. 1, data combination protection system 10 may be implemented in exemplary network 100, which may be configured as a local area network (LAN) and implemented using various wired configurations (e.g., Ethernet) and/or wireless technologies (e.g., IEEE 802.11x). In one embodiment, network 100 may be operably coupled to Internet 180 by an Internet Service Provider (ISP) or through an Internet Server with dedicated bandwidth. Network 100 could also be connected to other logically distinct networks configured as LANs or any other suitable network type. Furthermore, network 100 could be replaced with any other type of network where appropriate and according to particular needs. Such networks include a wireless LAN (WLAN), a metropolitan area network (MAN), a wide area network (WAN), a virtual private network (VPN), or any other appropriate architecture or system that facilitates communications in a network environment. The connection to Internet 180 and other logically distinct networks may include any appropriate medium such as, for example, digital subscriber lines (DSL), telephone lines, T1 lines, T3 lines, wireless, satellite, fiber optics, cable, Ethernet, etc. or any combination thereof. Numerous networking components such as gateways, routers, switches (e.g., 172), and the like may be used to facilitate electronic communication within network 100 and between network 100, Internet 180, and any other logically distinct networks linked to network 100.

Network 100 may be configured to permit transmission control protocol/internet protocol (TCP/IP) communications for the transmission or reception of electronic packets. Network 100 may also operate in conjunction with a user datagram protocol/IP (UDP/IP) or any other suitable protocol where appropriate and based on particular needs. In addition, email gateway 162 may allow client computers such as client device 130, which is operably connected to network 100, to send and receive email messages using Simple Mail Transfer Protocol (SMTP) or any other suitable protocol.

Client device 130 represents one or more endpoints or customers wishing to affect or otherwise manage electronic communications in network 100. The term ‘client device’ may be inclusive of devices used to initiate an electronic communication, such as a computer, a personal digital assistant (PDA), a laptop or electronic notebook, a cellular telephone, or any other device, component, element, or object capable of initiating voice, audio, or data exchanges within network 100. The endpoints may also be inclusive of a suitable interface to a human user, such as a microphone, a display, or a keyboard or other terminal equipment. The endpoints may also be any device that seeks to initiate an electronic communication on behalf of another entity or element, such as a program, a database, or any other component, device, element, or object capable of initiating a voice or a data exchange within network 100.

Network appliances having registration and detection systems can provide a data combination protection system 10 in network 100 that enables protection against inadvertent or intentional information leaking, in which particular combinations of leaked data can potentially expose confidential information. These network appliances may be able to access communication pathways associated with the network configuration, such that one or more appliances have access to e-mail traffic, other network traffic, or data that is simply residing somewhere in the business infrastructure (e.g., on a server, a repository, etc.). In particular, network appliance 12 with registration system 22 can be deployed in network 100 for access to databases and repositories 112 containing sensitive data elements. Registration system 22 can register specific combinations of data from databases and repositories 112, or from other files or objects in a suitable format. The registered combinations of data can be used by detection systems 24, 26, and 28 of network appliances 14, 16, and 18 to detect leaks of any complete registered data combination, or a predetermined portion thereof, in network traffic or to detect the presence of such data combinations, or predetermined portions thereof, residing in an unauthorized segment of the business infrastructure.

Network appliances 14, 16, and 18 with detection systems 24, 26, and 28 can be deployed at network egress points (e.g., email gateway 162, web gateway 164, switch 172, etc.) to protect internal-to-external and internal-to-internal network traffic. When a network appliance detects a risk event, it can alert an administrator, who can leverage existing infrastructure to block or quarantine sensitive information before it leaves the network. When deployed using passive interception techniques, such as a network tap or traffic mirroring, the network appliances can operate non-disruptively, requiring no changes to applications, servers, workstations, or the network itself. The network appliances can monitor and analyze all applications, protocols, and content types and trigger enforcement actions in real time.

Data protection manager 32 in network appliance 30 illustrated in FIG. 1 may be designed to simplify administration of data combination protection system 10 as it can offer a centralized interface to manage registration system 22 and all detection systems 24, 26, and 28 across multiple network appliances. Data protection manager 32 may be configured to centrally maintain data generated from registration system 22 and detection systems 24, 26, and 28 and to coordinate data flow between the distributed registration and detection systems, which can reside in various network appliances as shown in FIG. 1. In particular, one embodiment includes a registration list and an index to the registration list created by registration system 22, which can be distributed by data protection manager 32 to each of the distributed detection systems 24, 26, and 28.

Data protection manager 32 may also be configured to allow an authorized security professional (e.g., IT administrator, network operator, etc.) to determine what data input is provided to registration system 22 including which databases or other repositories registration system 22 crawls for data input, to designate enforcement or monitoring states associated with individual detection systems, and to designate who can access the corresponding findings. Enforcement actions can include alerting an appropriate administrator, directing an enforcement device to block or quarantine the suspect traffic, and/or reporting on the traffic. Monitoring actions can include alerting an appropriate administrator and/or reporting on the suspect traffic, without blocking or quarantining actions.

Data protection manager 32 may also provide a centralized query mechanism, which allows organizations to quickly search through capture databases contained on multiple distributed network appliances simultaneously. By allowing the administrator a unified view over all historical data captured throughout points in the network where network appliances are deployed, organizations can quickly perform forensic analysis, conduct investigations, and leverage captured data to update security posture to safeguard sensitive information or to handle emerging threats. In addition, the data protection manager may provide unified reports and diagnostic information.

One or more tables and lists may be included in these network appliances. In some embodiments, these tables and lists may be provided externally to these elements, or consolidated in any suitable fashion. The tables and lists are memory elements for storing information to be referenced by their corresponding network appliances. As used in this document, the terms ‘table’ and ‘list’ are inclusive of any suitable database or storage medium (provided in any appropriate format) that is capable of maintaining information pertinent to the operations detailed in this Specification. For example, the tables and lists may store information in an electronic register, diagram, record, index, or queue. The tables and lists may keep such information in any suitable random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electronically erasable PROM (EEPROM), application specific integrated circuit (ASIC), software, hardware, or in any other suitable component, device, element, or object where appropriate and based on particular needs.

A capture system 29 may also be a part of (or coupled to) one or more network appliances, such as network appliance 18, and may be operably connected to a corresponding capture database 118. In one example embodiment, capture system 29 may be the capture system as shown and described in co-pending U.S. patent application Ser. No. 12/358,399, filed Jan. 23, 2009, entitled “SYSTEM AND METHOD FOR INTELLIGENT STATE MANAGEMENT,” by William Deninger et al., which was previously incorporated by reference herein in its entirety. Capture system 29 may be configured to intercept data leaving a network, such as network 100, or being communicated internally to a network such as network 100. Capture system 29 can reconstruct objects (e.g., files or other documents) leaving the network or being communicated internally, and store the reconstructed objects in a searchable manner in, for example, capture database 118.

In some embodiments, capture system 29 may also be implemented in conjunction with the other various detection systems 24 and 26 of network 100 for capturing data from the corresponding egress points (e.g., email gateway 162 and web gateway 164). Capture system 29 may also be implemented in conjunction with detection systems in other associated but logically and/or geographically distinct networks. These capture systems may be included within a network appliance with a detection system as shown in FIG. 1, or provided as a separate component. In other embodiments, any other suitable form of intercepting network traffic may be used to provide detection systems 24, 26, and 28 with internal and outbound network traffic of network 100 to be analyzed.

In FIG. 1, switch 172 is connected to network appliance 18 and to Internet 180 through firewall 174. Switch 172, which may be implemented as a router or other network device capable of interconnecting network components, can transmit an outgoing data stream to Internet 180 and a copy of that stream to capture system 29. Switch 172 may also send incoming data to capture system 29 and to network 100. In alternative embodiments, capture system 29, registration system 22, detection systems 24, 26, and 28, and data protection manager 32 may be included as part of other network devices such as switches, routers, gateways, bridges, load balancers, servers, or any other suitable device, component, or element operable to exchange information in a network environment.

Data combination protection system 10 is also scalable as distributed networks can include additional detection systems for protecting data leakage across distributed network segments (e.g., having separate access points, being geographically dispersed, etc.) of a network infrastructure. Data protection manager 32 may continue to coordinate data flow between registration system 22 and detection systems 24, 26, and 28 in addition to detection systems provided in distributed segments of network 100.

Turning to FIG. 2, FIG. 2 is a simplified block diagram of a general or special purpose computer 200, such as network appliances 12, 14, 16, 18, and 30 or other computing devices, connected to network 100. Computer 200 may include various components such as a processor 220, a main memory 230, a secondary storage 240, a network interface 250, a user interface 260, and a removable memory interface 270. A bus 210, such as a system bus, may provide electronic communication between processor 220 and the other components, memory, and interfaces of computer 200.

Processor 220, which may also be referred to as a central processing unit (CPU), can include any general or special-purpose processor capable of executing machine readable instructions and performing operations on data as instructed by the machine readable instructions. Main memory 230 may be directly accessible to processor 220 for accessing machine instructions and can be in the form of random access memory (RAM) or any type of dynamic storage (e.g., dynamic random access memory (DRAM)). Secondary storage 240 can be any non-volatile memory such as a hard disk, which is capable of storing electronic data including executable software files. Externally stored electronic data may be provided to computer 200 through removable memory interface 270. Removable memory interface 270 represents a connection to any type of external memory such as compact discs (CDs), digital video discs (DVDs), flash drives, external hard drives, or any other external media.

Network interface 250 can be any network interface controller (NIC) that provides a suitable network connection between computer 200 and any network elements (e.g., email gateway 162, web gateway 164, switch 172, databases and repositories 118 and 112, other network appliances, etc.) and networks to which computer 200 connects for sending and receiving electronic data. For example, network interface 250 could be an Ethernet adapter, a token ring adapter, or a wireless adapter. A user interface 260 may be provided to allow a user to interact with the computer 200 via any suitable means, including a graphical user interface display. In addition, any appropriate input mechanism may also be included such as a keyboard, mouse, voice recognition, touch pad, input screen, etc.

Not shown in FIG. 2 is additional hardware that may be suitably coupled to processor 220 and bus 210 in the form of memory management units (MMU), additional symmetric multiprocessing (SMP) elements, read only memory (ROM), erasable programmable ROM (EPROM), electronically erasable PROM (EEPROM), peripheral component interconnect (PCI) bus and corresponding bridges, small computer system interface (SCSI)/integrated drive electronics (IDE) elements, etc. Any suitable operating systems may also be configured in computer 200 to appropriately manage the operation of hardware components therein. Moreover, these computers may include any other suitable hardware, software, components, modules, interfaces, or objects that facilitate the operations thereof. This may be inclusive of appropriate algorithms and communication protocols that facilitate the registration and detection operations detailed herein.

These elements, shown and/or described with reference to computer 200, are intended for illustrative purposes and are not meant to imply architectural limitations of computers such as network appliances 12, 14, 16, 18, and 30, utilized in accordance with the present disclosure. In addition, each computer, including network appliances 12, 14, 16, 18, and 30, may include more or fewer components where appropriate and based on particular requirements. As used in this Specification, the term ‘computer’ is meant to encompass any personal computers, network appliances, routers, switches, gateways, processors, servers, load balancers, firewalls, or any other suitable device, component, element, or object operable to affect or process electronic information in a network environment.

Registration System

Turning to FIG. 3, a simplified block diagram of one embodiment of a registration system 300 is shown. Registration system 300 can include a registration list module 310 and an index table module 320. Input to registration list module 310 can include a delimited data file 330 and a regular expressions table 350 and output of registration list module 310 can include a registration list 360. In one embodiment, delimited data file 330 may represent a plurality of delimited data files generated for various databases and/or files in a network and provided as input to registration list module 310. These delimited data files include specified combinations or sets of data elements to be registered by registration system 300.

Registration list module 310 may perform the functions of extraction 312, tokenization 314, and tuple storage 316. In one embodiment, delimited data file 330 includes a plurality of records delimited by a predefined delimiter such as, for example, a carriage return. Each record may include one or more data elements, which are extracted by extraction function 312. The set of data elements within a record can be a specified combination of related data elements (e.g., a name, a phone number, a social security number, an account number, etc.) that requires safeguarding. Each of the data elements of a record is tokenized by tokenization function 314 into a token (e.g., a numerical representation), which can then be stored in a tuple or record of registration list 360 by tuple storage function 316. Thus, a tuple in registration list 360 may include numerical representations or tokens of each data element in one particular combination of related data elements that is sought to be protected.

The data elements extracted and tokenized from delimited data file 330 can include words and/or expression elements, which can have multiple possible formats (e.g., phone number, date of birth, account number, etc.). A data element can be compared to regular expressions table 350 to determine whether the particular character pattern of the data element matches a predefined expression pattern (i.e., a regular expression), as described in U.S. patent application Ser. No. 12/358,399, filed Jan. 23, 2009, entitled “SYSTEM AND METHOD FOR INTELLIGENT STATE MANAGEMENT,” which has been previously incorporated herein by reference in its entirety. Although data combination protection system 10 could be configured to use a regular expression table as shown and described in U.S. patent application Ser. No. 12/358,399, it will be apparent that regular expressions table 350 used by data combination protection system 10 may be configured in numerous other ways, as long as the table 350 includes the predefined expression patterns.

In one embodiment, regular expressions table 350 includes numerous expression patterns, including a plurality of expression patterns for the same concept. For example, a telephone number concept could include the following regular expression patterns: ‘(nnn) nnn-nnnn’, ‘nnn-nnn-nnnn’, and ‘nnn.nnn.nnnn’ with ‘n’ representing numbers 0-9. Similarly, different states use different sequences of characters and separators for driver's license numbers. Thus, a driver's license concept could include a regular expression pattern for each unique sequence of characters and separators representing possible numbers of a driver's license in different states. For example, ‘dnnn-nnnn-nnnn-nn’, and ‘dnnn-nnnn-nnnn’ could be expression patterns for license numbers in Wisconsin and Illinois, with ‘n’ representing numbers 0-9 and ‘d’ representing letters A-Z.
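As an illustration only, the placeholder notation above could be translated into conventional regular expressions roughly as follows. The function and constant names (`template_to_regex`, `matches_concept`, `PHONE_TEMPLATES`, `LICENSE_TEMPLATES`) are hypothetical and are not part of the described system; the sketch assumes that ‘n’ and ‘d’ are the only placeholder characters and that every other character is a literal separator.

```python
import re

# Hypothetical templates taken from the examples in the text:
# 'n' stands for a digit 0-9 and 'd' stands for a letter A-Z.
PHONE_TEMPLATES = ["(nnn) nnn-nnnn", "nnn-nnn-nnnn", "nnn.nnn.nnnn"]
LICENSE_TEMPLATES = ["dnnn-nnnn-nnnn-nn", "dnnn-nnnn-nnnn"]


def template_to_regex(template):
    """Translate a placeholder template into a compiled regular expression."""
    parts = []
    for ch in template:
        if ch == "n":
            parts.append("[0-9]")
        elif ch == "d":
            parts.append("[A-Z]")
        else:
            parts.append(re.escape(ch))  # separators must match literally
    return re.compile("".join(parts))


def matches_concept(text, templates):
    """Return True if the whole text matches any template of the concept."""
    return any(template_to_regex(t).fullmatch(text) for t in templates)
```

Grouping several templates under one concept, as in this sketch, lets a single concept such as a telephone number be recognized regardless of which separator convention a record happens to use.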

Expression patterns in regular expression table 350 may be user-configurable through an interface that allows a user to define expression patterns for a particular concept. In addition, some expression patterns may be automatically generated or may be preconfigured in data combination protection system 10. For example, a list of common or popular regular expression patterns can be preconfigured in regular expressions table 350 that may be tailored specifically to the industry into which the data combination protection system 10 is sold.

Index table module 320 may perform the functions of token count operation 322, token key selection 324, and index storage 326 to create index table 370. Token count operation function 322 processes registration list 360 to count all of the occurrences of each token in registration list 360. A temporary prime count table 340 may be created to store the count sums. Token key selection function 324 can then process each tuple and, using prime count table 340, select the least frequently occurring one of the tokens from each tuple as a token key. Each unique token key may then be stored in an index of index table 370. Thus, index table 370 can contain a plurality of indexes, each having a unique token key and each being associated with one or more tuples of registration list 360.

FIG. 4 provides a more detailed illustration of exemplary file structures of delimited data file 330 with an example record 1, registration list 360 with an example tuple 362, and index table 370 with an example index 372. Delimited data file 330 is shown with a detailed first record 332 illustrating a possible configuration of record 1 with an example combination of data element types (i.e., words and expression elements). First record 332 corresponds to tuple 362 of registration list 360, where each word and expression element from first record 332 corresponds to one token in tuple 362. Tuple 362 is indexed in registration list 360 by index 372 of index table 370, which includes a registration list offset that is a pointer (i.e., offset 4) to the beginning (i.e., token 1) of tuple 362.

In one example embodiment, delimited data file 330 may be configured as a file with a plurality of records (e.g., record 1, record 2, record 3, etc.) having a predefined delimiter between each record. A delimiter can be any formatting character or other character used to designate the end of one record and the beginning of a next record. Some common delimiters include carriage returns, line feeds, semi-colons, and periods. However, any character could be designated as a delimiter if the data file is appropriately configured with the particular delimiter. In one example embodiment, if a carriage return is defined as the delimiter for delimited data file 330, then each record would end with a carriage return.

As shown in expanded first record 332, each record may be comprised of a plurality of data elements (i.e., words or expression elements). The data elements within each record of delimited data file 330 are separated by at least one separator (e.g., comma, space, dash, etc.). A word may be comprised of a string of characters having one or more consecutive essential characters without any separators. An expression element may be comprised of a string of characters having at least two words and one or more separators between the words. In one embodiment, essential characters can include a fundamental unit in a written language including numerical digits, letters of a written language, and/or symbols representing speech segments of a written language (e.g., syllabograms, etc.). Speech segments of a language can include words, syllables of words, distinct sounds, phrases, and the like.

Separators can include any character that is not an essential character and that is not recognized as a predefined delimiter indicating an end of a record in the data file. Examples of separators include punctuation marks, word dividers and other symbols indicating the structure and organization of a written language (e.g., dashes, forward slashes, backward slashes, left parentheticals, right parentheticals, left brackets, right brackets, periods, spaces, an at symbol, an ampersand symbol, a star symbol, a pound symbol, a dollar sign symbol, a percent sign symbol, a quote, a carriage return, a line feed, etc.). In some data file configurations, separators can include characters that are equivalent to the predefined delimiter for the data file. However, in such data files, the equivalent character within a record must be differentiated from the predefined delimiter that indicates an end of the record. Thus, the equivalent character within the record would be processed either as a separator between data elements or as a separator included within an expression element.

In an example embodiment, delimited data file 330 is a comma separated variable (CSV) list, which can be a text format generated for a database or other file having a tabular data format. A CSV list can include multiple data elements in each record with the data elements being separated by commas. Each record in the CSV list includes a character designated as a predefined delimiter to indicate an end of the record, such as a carriage return or line feed. These predefined delimiters conform to Request for Comments (RFC) 4180, in which carriage returns and line feeds within a record are encapsulated in quotes or appropriately escaped in order to differentiate them from a predefined delimiter indicating an end of record. Additionally, in CSV lists, quotes may also be used as separators between data elements or within an expression element if appropriately escaped (i.e., an empty set of quotes to indicate a literal quote).
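The distinction between an end-of-record delimiter and an equivalent character inside a record can be seen with a standard CSV parser. The following is a minimal sketch using Python's `csv` module; the sample data in `raw` is hypothetical and chosen only to show a quoted line break and a doubled quote within a field.

```python
import csv
import io

# Hypothetical two-record CSV. The second field of the first record contains
# an embedded line feed, and the third field contains an escaped (doubled)
# quote; both appear inside quoted fields.
raw = 'Carol,"123 Apple\nLane","said ""hi"""\r\nDave,99999,x\r\n'

# newline="" leaves line endings untranslated, so the csv module can tell a
# line break inside quotes (part of a field) from one that ends a record.
records = list(csv.reader(io.StringIO(raw, newline="")))

for number, record in enumerate(records, start=1):
    print(number, record)
```

Here the line feed inside the quoted address is kept as part of the data element, while the unquoted carriage return and line feed pair ends each record.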

Generally, for a database or other file having a tabular data format, each CSV record includes the same number of data elements. Embodiments of registration system 300, however, can accommodate varying numbers of data elements in each record, because each record is delineated by a predefined delimiter that is recognized by system 300. Moreover, registration system 300 can also accommodate other formats of delimited data file 330 as long as each record (containing a desired combination of data elements) is delineated by a predefined delimiter, which is designated for the data file 330 and recognized by registration system 300. For example, a free form textual document, in which a variety of separators (e.g., spaces, dashes, etc.) separate data elements, may be provided as a delimited data file if a predefined delimiter (e.g., line feed, carriage return, period, etc.) is used to separate successive pairs of records and is designated as the delimiter for the data file such that it is recognized by registration system 300.

In the example first record 332 of FIG. 4, ten data elements are shown, including 2 words, 2 expression elements, and 6 words in succession. A separator is provided between each of the successive data elements and a delimiter is provided at the end of first record 332. After a data element has been identified and extracted from first record 332 by registration list module 310 of registration system 300, the data element may be tokenized into one token (e.g., token 1 through token 10) and stored in tuple 362 of registration list 360. An end tag may also be provided to denote the end of a tuple in registration list 360. Registration list module 310 can process each record of delimited data file 330 and create a separate tuple in registration list 360 corresponding to each record.

Once registration list 360 is complete with tuples corresponding to each record of delimited data file 330, index table module 320 may process registration list 360 to create index table 370. In the example shown in FIG. 4, index table module 320 generates index 372 to provide an index for locating tuple 362 in registration list 360. Prime count table 340, which stores the sums of occurrences for each token in registration list 360, can be generated. A token key for tuple 362 can then be computed by searching prime count table 340 to find a token from tuple 362 that appears with the least frequency in the entire registration list 360, relative to the other tokens in tuple 362. In this example illustration, token 2 is shown as the token occurring with the least frequency (i.e., the lowest sum of occurrences), compared to the sums of occurrences of token 1 and tokens 3-10. Thus, token 2 may be selected as the token key and used to create index 372.

In one embodiment, index table 370 can be generated using a known technique of forcing hash numbers (e.g., token keys) into a narrow boundary with modulus, in which the boundary is defined by a prime number. This can be advantageous for particularly large amounts of data, where a smaller area of memory may be allocated to accommodate the data and the data is generally distributed uniformly within the allocated memory. Thus, extremely large amounts of data can be more efficiently processed. The size of index table 370 could be generated by, for example, data protection manager 32 of system 10, based on resources selected by an authorized user during resource provisioning of system 10. Once the memory is allocated, each index can be placed in a space within index table 370 corresponding to a value (e.g., a remainder) calculated by performing a modulo operation on the token key with the prime number size of the index table. If statistical collisions occur (i.e., different token keys have the same result from a modulo operation), then the different token keys can be link-listed in the same space of index table 370.
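The placement of token keys by modulo operation, with colliding keys chained in the same space, might be sketched as follows. The table size of 11 and the names `TABLE_SIZE` and `place_token_keys` are assumptions made for illustration; a deployed table would use a much larger prime chosen during resource provisioning.

```python
TABLE_SIZE = 11  # hypothetical small prime; a real table would be far larger


def place_token_keys(token_keys, table_size=TABLE_SIZE):
    """Place each token key in the slot given by token_key % table_size."""
    table = [[] for _ in range(table_size)]  # each slot holds a chain of keys
    for key in token_keys:
        slot = key % table_size  # remainder selects the space in the table
        if key not in table[slot]:
            table[slot].append(key)  # colliding keys share the slot's chain
    return table


table = place_token_keys([55, 23, 66, 99])
```

In this example the keys 55, 66, and 99 all leave a remainder of 0 and are therefore chained in the same slot, while 23 occupies slot 1 alone.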

A registration list offset, which points to a beginning of tuple 362 (e.g., offset 4 pointing to token 1), may be added to index 372 and associated with the token key. In addition, a document identifier (“document ID” or “docID”), which can identify delimited data file 330, may also be added to index 372 and associated with the token key. Thus, when multiple delimited data files are used to create registration list 360, the document ID field in an index identifies which delimited data file is associated with the tuple to which the accompanying registration list offset points. In addition, if two or more token keys are link-listed in a space within index table 370, then the offsets and document IDs corresponding to a particular token key are associated with that particular token key in the index.

The <NEXT> field of index 372 represents additional registration list offsets and document IDs that may be associated with the same token key in index 372. For example, a second tuple having a second offset in registration list 360 may also contain token 2. If token 2 is the token in the second tuple that occurs with the least frequency in the registration list 360 relative to the other tokens in the second tuple, then token 2 of the second tuple could be selected as the token key for the second tuple. Thus, the same index 372 could be used to designate the second tuple by adding a second registration list offset and an appropriate document ID after the <NEXT> pointer.

Turning to FIG. 5, FIG. 5 is a simplified block diagram illustrating example data input and a resulting prime count table 540, which may be generated by token count operation 322 of index table module 320. Data element 501 (word 1), data element 502 (word 1), data element 503 (expression element 1), and data element 504 (expression element 2) represent example data elements of a delimited data file, such as delimited data file 330, which are stored as tokens in one or more tuples of a registration list such as registration list 360. Token count operation function 322 may count the tokens generated for each of the data elements 501, 502, 503, and 504 and may produce prime count table 540. In one embodiment, prime count table 540 may include ‘n’ entries 542 with corresponding token sums 544. In this example, ‘n’ is equal to a prime number and a modulo operation is performed on each token to determine which entry corresponds to the token sum to be incremented. Thus, in this example, entry 2 corresponds to tokens representing data element 501 (word 1) and data element 502 (word 1) and, therefore, has a token sum of 2. In addition, entries 4 and 7 correspond to tokens representing data element 503 (expression element 1) and data element 504 (expression element 2), respectively, and each has a token sum of 1.

Turning to FIGS. 6A, 6B, and 7, simplified flowcharts illustrate operational processing of registration system 300. FIGS. 6A and 6B are simplified flowcharts illustrating example operational steps for registration list module 310 of registration system 300. FIG. 7 is a simplified flowchart illustrating example operational steps for index table module 320 of registration system 300.

FIG. 6A shows the overall flow 600 of registration list module 310, including the processing of one or more delimited data files, the processing of each record of each delimited data file, and the processing of each data element in each record of the one or more delimited data files. Flow may begin in step 602 of FIG. 6A, where a first delimited data file is obtained. In one embodiment, registration system 300 can be configured to crawl one or more desired databases or other data files and convert the databases or other data files to one or more delimited data files. As previously discussed herein, in one example, a database or other data file could be converted to a comma separated variable list (CSV), which could be provided as the delimited data file.

Once the delimited data file is obtained, a first record is fetched in step 604. In step 606 a start of a first data element is identified in the fetched record. In step 608, applicable extraction, tokenization, and storage operations are performed on the current data element, which will be described in more detail herein with reference to FIG. 6B. After applicable extraction, tokenization, and storage operations have been performed for the current data element, flow moves to decision box 610 to determine whether more data elements exist in the record. If more data elements exist in the record, then a start of a next data element in the record is identified in step 612. Flow then loops back to step 608 to perform extraction, tokenization, and storage on the new data element.

With reference again to decision box 610, if a predefined delimiter is recognized in the record after the current data element, then it is determined that no more data elements exist in the record. Flow may then move to decision box 614 to determine whether there are more records in the delimited data file. If more records exist in the delimited data file, then a next record is fetched in step 616 and flow loops back to step 606 to identify a start of a first data element in the new record.

If it is determined in decision box 614 that no more records exist in the delimited data file, however, then flow passes to decision box 618 to determine whether there are more delimited data files to be processed. If it is determined that one or more delimited data files exist that have not been processed, then a next delimited data file is obtained in step 620, flow loops back to step 604, and a first record is fetched from the new delimited data file. However, if it is determined in decision box 618 that all delimited data files have been processed, then the flow ends.

FIG. 6B shows the overall flow of step 608 in FIG. 6A, illustrating example operational steps to extract, tokenize, and store a data element from a record of a delimited data file. Flow may begin in step 652 where regular expression table 350 is searched to find a longest match to a character pattern of a string of characters beginning at the start of the data element. In one embodiment, expression patterns from regular expression table 350 are compared in order of size from longest to shortest to determine if there is a match. In decision box 654 a query is made as to whether a match from the regular expression table 350 was found.

If it is determined that none of the regular expression patterns match a character pattern of any string of characters beginning at the start of the data element (i.e., the data element does not match any regular expression patterns in regular expression table 350), then the data element represents a word and flow moves to step 660 to find an end of the data element (i.e., the word). The end of the word is the last consecutive essential character beginning at the start of the data element. After the word is extracted in step 662, flow passes to decision box 664, where the word may be evaluated to determine whether it is a ‘stop word’. ‘Stop words’ can include any words determined by an administrator or otherwise specified as a stop word, such as simple grammar construction words (e.g., like, and, but, or, is, the, an, a, as, etc.). If the word is determined to be a stop word, then it is ignored and the flow ends without tokenizing or storing the word. However, if the word is determined not to be a stop word, then flow moves to step 668 where the word may be stemmed. A stemming process such as, for example, the known Porter stemming algorithm, may be applied to the word, in which any suffixes and/or other affixes can be stripped from the stem of the word.

After stemming has been performed if necessary, flow may pass to step 670 where the word (or stemmed word) is tokenized. In one embodiment, tokenization includes converting the word (or stemmed word) into a 32-bit numerical representation or token. In step 672, the token is stored in a tuple of registration list 360, where the tuple corresponds to the record from which the data element was extracted. After the token has been stored, flow ends and processing continues at step 610 of FIG. 6A.

In one embodiment, the numerical representation for the token is generated using a Federal Information Processing Standards (FIPS) approved hash function. Typically, if the hash function has a lesser degree of numerical intensity, and is, therefore, a less secure hash, then less computer resources are used to calculate the hash. However, because registration list 360 may be stored in multiple places throughout a network and searched repeatedly by a plurality of detection systems as shown in FIG. 1, a greater numerical intensity may be desirable for the hash function. Thus, it may be desirable to generate more secure tokens for words and expression elements containing personal and otherwise sensitive information, even if generating such tokens requires more computer resources.

Another consideration is the size of the numerical representation used for the tokens. A 32-bit numerical value alone may not be statistically viable. That is, one word or expression element alone could generate many false positive results if one of the detection systems searches a target document or file for only one 32-bit token representing the data element. The probability of a false positive can be reduced, however, when a record includes two or more data elements that must be found in a document to validate a match. The probability of a false positive can be reduced by a factor of 2³² for each additional token that is included in a tuple and that must be found in a document to validate a match. For example, the probability of a false positive for a pair of words is on the order of 1 in 2⁶⁴ and for three words is on the order of 1 in 2⁹⁶. Accordingly, in one embodiment, each tuple includes at least two tokens.
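The scaling described above can be written out under an idealized assumption that is not stated in the source: tokens behave as independent, uniformly distributed 32-bit values, and a single comparison is considered. The symbol P_fp(k), denoting the chance that k unrelated tokens all match by coincidence, is introduced here only for illustration.

```latex
% Assumption: tokens are independent and uniform over 2^{32} values.
\[
P_{\mathrm{fp}}(1) \approx 2^{-32}, \qquad
P_{\mathrm{fp}}(k) \approx \left(2^{-32}\right)^{k} = 2^{-32k}
\]
% Two tokens:   P_fp(2) \approx 2^{-64}
% Three tokens: P_fp(3) \approx 2^{-96}
```

Because a target document typically contains many tokens, the chance of a coincidental match somewhere in a document is higher than this single-comparison figure, but the reduction by a factor of 2³² per additional required token still applies.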

Referring again to decision box 654, if it is determined that a match was found between an expression pattern of regular expression table 350 and the character pattern of a string of characters beginning at the start of the data element, then the data element represents an expression element and has the same length as the matching expression pattern. The expression element can be extracted at step 656 and normalized in step 658. In one embodiment, normalizing the expression element may include eliminating any separators from the expression element. For example, a phone number could be normalized to ‘nnnnnnnnnn’ with ‘n’ representing any number 0 through 9. In other embodiments, normalization may include modifying separators and/or particular essential characters of the expression element to achieve a predefined standard form for the expression element. For example, all dates could be standardized to the form ‘YYYY-MM-DD’ with ‘YYYY’ representing the year, ‘MM’ representing the month, and ‘DD’ representing the day.
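The two normalization approaches described above might look roughly like the following sketch. The helper names `strip_separators` and `standardize_date`, and the list of accepted input formats in `DATE_FORMATS`, are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical input formats that a date concept might accept.
DATE_FORMATS = ["%m/%d/%Y", "%Y.%m.%d", "%Y-%m-%d"]


def strip_separators(element):
    """Normalize by removing every character that is not a letter or digit."""
    return "".join(ch for ch in element if ch.isalnum())


def standardize_date(element):
    """Normalize a date to the standard form 'YYYY-MM-DD'."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(element, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next accepted format
    raise ValueError("unrecognized date: %r" % element)
```

Either approach serves the same purpose: differently formatted occurrences of the same value normalize to one string and therefore produce one token.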

Once the expression element has been extracted and normalized, flow may move to step 670 where the expression element is tokenized and, in step 672, the resulting token is stored in a tuple of registration list 360. After the token has been stored in registration list 360, flow returns to step 610 of FIG. 6A.
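The extraction, normalization, and tokenization steps of FIG. 6B can be sketched as shown below. This is a simplified illustration under several assumptions: the two expression patterns and the stop word set are hypothetical stand-ins for regular expression table 350, essential characters are approximated by `str.isalnum`, stemming (step 668) is omitted, and the 32-bit token is obtained by truncating a SHA-256 digest, which is only one possible reading of a FIPS approved hash function.

```python
import hashlib
import re

# Hypothetical expression patterns, ordered from longest to shortest.
EXPRESSION_PATTERNS = [
    re.compile(r"\d{3}-\d{2}-\d{4}"),  # social security number (11 characters)
    re.compile(r"\d{4}-\d{2}-\d{2}"),  # date (10 characters)
]
STOP_WORDS = {"like", "and", "but", "or", "is", "the", "an", "a", "as"}


def tokenize(text):
    """Reduce a data element to a 32-bit token (truncated SHA-256; assumed)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def match_expression(record, pos):
    """Return the first (longest) pattern match starting at pos, or None."""
    for pattern in EXPRESSION_PATTERNS:
        found = pattern.match(record, pos)
        if found:
            return found
    return None


def extract_elements(record):
    """Return the normalized data elements of one record, in order."""
    elements, pos = [], 0
    while pos < len(record):
        found = match_expression(record, pos)
        if found:  # expression element: remove separators to normalize
            elements.append("".join(c for c in found.group() if c.isalnum()))
            pos = found.end()
        elif not record[pos].isalnum():
            pos += 1  # separator or delimiter: advance to the next element
        else:
            end = pos
            while end < len(record) and record[end].isalnum():
                end += 1  # a word ends at its last consecutive essential character
            word = record[pos:end]
            if word.lower() not in STOP_WORDS:  # stop words are ignored
                elements.append(word)  # stemming omitted in this sketch
            pos = end
    return elements


def register_record(record):
    """Build one tuple of tokens for a record."""
    return [tokenize(element) for element in extract_elements(record)]
```

Expression patterns are tried before the separator check so that a pattern beginning with a separator, such as a parenthesized area code, could still be recognized.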

Turning to FIG. 7, FIG. 7 shows the overall flow 700 of index table module 320, which generates index table 370 with token keys and associated offsets to the corresponding tuples stored in registration list 360. To reduce the overhead of processing by detection systems 24, 26, and 28, shown in FIG. 1, each of the tuples can be indexed by a token key. In one embodiment, a token key can be a token that, compared to other tokens in the same tuple, has the lowest frequency occurrence in all tuples of the entire registration list 360. Thus, if multiple delimited data files are used to create registration list 360, a token key could be selected having the lowest frequency of all tuples created from multiple delimited data files.

In one example embodiment, a token key can be determined using a prime count table, such as prime count table 340 shown in FIG. 3, and further illustrated in an example prime count table 540 in FIG. 5. Beginning in step 702 of flow 700, prime count table 340 can be generated for the tokens stored in registration list 360 using the known technique, as previously described herein, of forcing hash numbers (e.g., tokens) into a narrow boundary with modulus, in which the boundary is defined by a prime number. Using a prime count table can reduce the computer resources needed to process data elements potentially numbering in the billions. Theoretically, the 32-bit numerical representation (2³²) could represent greater than 4 billion possible tokens. In a real-world example scenario, if an enterprise has four different entries of sensitive data for 300 million individuals, then the number of entries would exceed 1 billion. Computer resources may not be able to adequately perform processing functions if each individual entry is counted to produce index table 370. The use of prime count table 340, however, allows a smaller area of memory to be allocated and used to count the tokens in registration list 360 and select lowest frequency tokens as token keys.

In one embodiment, the size of a prime count table may be generated by, for example, data protection manager 32 of system 10, based on resources selected by an authorized user during resource provisioning of system 10. In one example scenario, for an enterprise having collected sensitive data for 300 million people, if 100 million entries are determined to be adequate to count tokens, then the size of the prime count table could be defined by the next closest prime number (e.g., 100,000,007). Thus, a table with 100,000,007 entries can be created and each of the entries cleared with a zero value.

Once memory has been allocated and defined for a prime count table, each token in registration list 360 can be processed to determine which entry to increment in prime count table 340. In one embodiment, registration list 360 may be sequentially processed from the first token in the first tuple to the last token in the last tuple. For each token, a modulo operation can be performed using the prime number and the numerical value of the particular token. The remainder value of the modulo operation is located in prime count table 340 and incremented by 1. Some statistical collisions may occur in which tokens generated for two different data elements result in the same remainder. In this case the same entry in prime count table 340 can be incremented, thus artificially increasing the number count of the entry, which corresponds to more than one token. However, an artificial increase of a word count does not significantly diminish the viability of determining the token in each tuple having the lowest frequency in the registration list.
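The counting pass described above can be sketched in a few lines. The function name `build_prime_count_table` and the representation of the registration list as a list of token lists are assumptions made for illustration.

```python
def build_prime_count_table(registration_list, prime):
    """Count token occurrences in a table with a prime number of entries.

    registration_list is assumed to be a list of tuples, each a list of
    integer tokens. Tokens whose remainders collide share one entry, which
    inflates that entry's count.
    """
    counts = [0] * prime  # every entry is cleared to zero
    for tuple_tokens in registration_list:
        for token in tuple_tokens:
            counts[token % prime] += 1  # remainder selects the entry
    return counts


# Tiny example: tokens 23 and 30 collide at entry 2 when the prime is 7.
counts = build_prime_count_table([[23, 55], [23, 30]], prime=7)
```

In this example entry 2 holds a count of 3 even though token 23 occurs only twice, which illustrates the artificial increase caused by a collision.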

After prime count table 340 is generated in step 702, flow passes to step 704 where a first tuple is identified in registration list 360. Steps 706 through 722 then perform looping to determine a token key for each tuple and to generate index table 370. Accordingly, the loop begins in step 706 where prime count table 340 is searched to determine which one of the tokens in the current tuple has the lowest count or frequency. In step 708, the token of the current tuple having the lowest frequency according to prime count table 340 is selected as a token key for the current tuple.

After selecting the token key for the current tuple, flow may pass to step 710 where all indexes in index table 370 can be searched for a matching token key. With reference to decision box 712, if no index is found with a token key matching the selected token key for the current tuple, then flow passes to step 716, where a new index is created in index table 370 using the selected token key. Flow then passes to step 718 where a document identifier and offset are added to the new index. In one embodiment, the document ID may be obtained from header information of the corresponding tuple in registration list 360. The offset may be a pointer or index to the corresponding tuple in registration list 360. For example, the offset can be an index number of the first token appearing in the corresponding tuple.

With reference again to decision box 712, if an index is found in index table 370 with a token key matching the selected token key for the current tuple, then an index has already been created for another tuple using the same token key. In this scenario, flow may pass to step 714 where the current tuple information can be added to the existing index. A pointer (e.g., <NEXT> pointer) can be added to the end of the existing index and then a document ID and offset corresponding to the current tuple can be added. Thus, any number of tuples having the same token key can use the same index.

After the index is created in step 718 or updated in step 714, flow passes to decision box 720 to determine whether the current tuple is the last tuple in registration list 360. If the current tuple is not the last tuple, then the next tuple is identified in step 722 and flow passes back to step 706 to begin processing the next tuple to select a token key and update index table 370. However, if it is determined in decision box 720 that the current tuple is the last tuple in registration list 360, then all tuples have been processed and flow 700 ends.
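Steps 704 through 722 might be sketched as follows. For brevity the sketch substitutes a Python dictionary for the prime-sized index table with linked entries, and represents each tuple as a hypothetical `(doc_id, offset, tokens)` triple; the prime count table is passed in as `count_table`.

```python
def build_index_table(registration_list, count_table, prime):
    """Select a token key for each tuple and record where the tuple starts.

    registration_list: list of (doc_id, offset, tokens) triples (assumed form).
    count_table: prime count table with `prime` entries.
    Returns a dict mapping token key -> list of (doc_id, offset) entries.
    """
    index_table = {}
    for doc_id, offset, tokens in registration_list:
        # Steps 706-708: pick the token with the lowest count in the table.
        token_key = min(tokens, key=lambda t: count_table[t % prime])
        # Steps 710-718: create the index if needed, then append this tuple,
        # which plays the role of the <NEXT> entries of an existing index.
        index_table.setdefault(token_key, []).append((doc_id, offset))
    return index_table
```

When two tokens in a tuple have the same count, `min` keeps the first one encountered; the source does not specify a tie-breaking rule, so this is an arbitrary choice.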

Selecting a lowest frequency token as a token key for a tuple helps improve processing efficiency during detection processing activities, which will be further described herein with reference to FIGS. 9-13. By using lowest frequency tokens as token keys in the index table, tuples in the registration list need not be compared to an object being evaluated unless the object contains a data element that, when tokenized, is equivalent to a token key in the index table. Thus, more tuples may be excluded from unnecessary processing in this embodiment than if a more commonly occurring token is selected as a token key.

Alternative embodiments could be implemented to reduce the processing required to generate the lowest frequency token keys for an index table. Although such embodiments could reduce the backend registration processing, additional processing may be required by the detection system. In one such alternative embodiment, a different token key selection criterion (i.e., other than the lowest frequency selection criterion) may be used. For example, tokens from tuples could be selected as token keys based upon a predetermined column or position of a data element in a record. Although the index table may be more quickly generated as a result, more tuples may be evaluated during the detection processing, particularly if at least some of the token keys correspond to more commonly occurring data elements. Nevertheless, this embodiment may be desirable based on the particular needs of an implementation. In addition, the token key selection criteria may be user-configurable, such that an authorized user can determine the selection criteria to be used by registration system 300 when selecting the token keys.

FIG. 8 illustrates a scenario in which a record 802 with example data elements is processed by registration system 300. Record 802 is an example single record of a delimited data file, such as delimited data file 330, which may have a plurality of records. Record 802 includes data elements separated by spaces and ending with a carriage return, which is the predefined delimiter. Each of the data elements is evaluated to determine if it is a word or an expression element. The data elements represented as words (i.e., Carol, Deninger, 123, Apple, Lane, Boise, ID, and 99999) are extracted and tokenized. The data elements that are determined to match a regular expression pattern are extracted and normalized. In this example case, normalizing the expression element includes removing any nonessential characters. The normalized expression element is then tokenized.

The following table represents the type of data, the example data element contents of record 802 corresponding to each type of data, and the tokens generated for each data element:

TABLE 1

Type of Data              Data Element/Normalized Data Element    Token (Numerical Representation of Data Element)
First Name                Carol                                   23
Last Name                 Deninger                                55
Social Security Number    000-00-0000/000000000                   99
Date of Birth             1960 Jan. 1/19600101                    69
Street Address 1          123                                     19
Street Address 2          Apple                                   44
Street Address 3          Lane                                    32
City                      Boise                                   73
State                     ID                                      29
Zip Code                  99999                                   07

A tuple 812 of registration list 810 is created by registering record 802. Tokens 804 generated from record 802 may be stored in sequential order in tuple 812 of registration list 810. In one embodiment tuple 812 includes header information (not shown) including a document identifier identifying the delimited data file or associated data storage (e.g., Customer records database in Sales) associated with record 802. Also, an end of each tuple in registration list 810 can be defined by a termination entry such as a zero, as shown at the end of tuple 812. In addition, offsets 814 are provided with registration list 810, with each offset pointing to a separate token entry in registration list 810.

Index table 820 may be generated for registration list 810, with index 822 corresponding to tuple 812. Index 822 includes a token key (55), which is shown as the second occurring token in tuple 812. Token key (55) may be selected if it is the token of tuple 812 having the lowest frequency occurrence in the entire registration list 810, as previously described herein. In addition, offset (1001) is provided with token key (55) and points to the first occurring token (23) in tuple 812. Thus offset (1001) indicates the beginning of tuple 812. Index 822 may also include a docID or document identifier indicating the delimited data file or data storage associated with record 802.
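As a non-limiting sketch of the layout described for FIG. 8, the registration list could be modeled as a flat sequence of tokens in which each tuple ends with a zero termination entry, and the index could map a token key to the offset of the first token of its tuple. The names BASE_OFFSET, build_registration_list, and read_tuple are hypothetical, header information such as the document identifier is omitted, and the sketch assumes that no real token has the value zero.

```python
BASE_OFFSET = 1001  # offset of the first entry, following the FIG. 8 example
TERMINATOR = 0      # termination entry marking the end of each tuple


def build_registration_list(tuples):
    """Flatten tuples into one list and record the offset where each starts."""
    flat, starts = [], []
    for t in tuples:
        starts.append(BASE_OFFSET + len(flat))
        flat.extend(t)
        flat.append(TERMINATOR)
    return flat, starts


def read_tuple(flat, offset):
    """Read tokens from the given offset up to the termination entry."""
    i = offset - BASE_OFFSET
    tokens = []
    while flat[i] != TERMINATOR:
        tokens.append(flat[i])
        i += 1
    return tokens


# Tokens of record 802 in the order listed in Table 1 (07 written as 7).
tokens_802 = [23, 55, 99, 69, 19, 44, 32, 73, 29, 7]
flat, starts = build_registration_list([tokens_802])

# Index keyed by token key 55, pointing at the start of the tuple.
index_table = {55: [starts[0]]}
print(index_table)                                   # {55: [1001]}
print(read_tuple(flat, index_table[55][0]) == tokens_802)  # True
```

Because the offset stored with token key (55) points to the first token (23) rather than to the key itself, the whole tuple can be read from its beginning once the key is matched.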

Detection System

Turning to FIG. 9, a simplified block diagram of one embodiment of a detection system 900 is shown. Detection system 900 can include an evaluate module 910 and a validate module 920. Input to evaluate module 910 can include an input object 930, regular expressions table 350, and index table 370. Output of evaluate module 910 can include a bit hash table 940 and a pending key list 950, both of which may be temporary. Evaluate module 910 may perform the functions of extraction 912, tokenization 914, bit set operation 916, and pending key list creation 918. Generally, evaluate module 910 processes a file, such as input object 930, to extract and tokenize each data element of the file in substantially the same manner that registration system 300 extracted and tokenized data elements of delimited data file 330. Thus, extracted and tokenized data elements from the file can be compared to the extracted and tokenized data elements from the delimited data file 330.

Input object 930 can include any type of data file or document to be analyzed to determine if any registered combination of data elements, or a predetermined threshold amount thereof, is present in the file or document. In one embodiment, input object 930 can be provided by capture system 29, as shown in FIG. 1, when packets are intercepted by capture system 29 in network 100 and the objects are reconstructed from the intercepted packets, as previously described herein and described in U.S. patent application Ser. No. 12/358,399, which has been previously incorporated herein by reference in its entirety. Example input objects include, but are not limited to, Microsoft Office documents (such as Word, Excel, PowerPoint, etc.), portable document format (PDF) files, text files, email messages, email attachments, any human language text document (e.g., English text, French text, German text, Spanish text, Japanese text, Chinese text, Korean text, Russian text, etc.), and the like. In addition to these various objects, a storage repository such as, for example, a database, may also be processed by detection system 900 to evaluate the contents for the presence of any registered data combinations. In one example embodiment, a common file, such as a CSV list, can be generated for a database or other file and provided to detection system 900 as input object 930.

Input object 930 can include words and/or expression elements separated by any number of separators and/or delimiters. In one embodiment, the contents of input object 930 can be sequentially processed. A character pattern of each data element of input object 930 can be compared to regular expressions table 350 to determine whether the data element matches a predefined expression pattern as previously described herein and as described in U.S. patent application Ser. No. 12/358,399, filed Jan. 23, 2009, entitled “SYSTEM AND METHOD FOR INTELLIGENT STATE MANAGEMENT,” which has been previously incorporated herein by reference in its entirety. If the data element matches a predefined expression pattern, then the entire expression element can be extracted and normalized, such that tokenization function 914 can be performed on the normalized expression element. If the data element does not match a predefined expression pattern, then the data element is a word, which may be extracted and tokenized by tokenization function 914.

Bit set operation 916 and pending key list creation 918 may also be performed by evaluate module 910. Bit set operation 916 sets bits corresponding to each tokenized data element in bit hash table 940, thereby providing an efficient way of indicating each tokenized data element of input object 930. Pending key list creation 918 compares each tokenized data element of input object 930 to index table 370 to identify a corresponding token key in an index. In one embodiment, a corresponding token key is identified when the token key is equivalent to the tokenized data element (i.e., having the same numerical representation). If a corresponding token key is identified, then the tokenized data element or object token is saved to pending key list 950 for further analysis by validate module 920.

Validate module 920 of detection system 900 may perform the functions of registration list and bit hash table comparison 922 and event list update 924. Registration list and bit hash table comparison 922 can process pending keys (i.e., tokens) from pending key list 950 to find corresponding indexes in index table 370. In one embodiment, a pending key corresponds to a token key in an index when the pending key is equivalent to the token key (i.e., having the same numerical representation). The indexes can then be used to locate corresponding tuples in registration list 360. In one embodiment, the tokens in the identified tuples can be compared to bit hash table 940 to determine how many tokens in an identified tuple are present in input object 930. If it is determined that input object 930 contains data elements that, when tokenized, correspond to all of the tokens for a tuple, or correspond to a predetermined threshold amount thereof, then an event is validated. The use of bit hash table 940 to determine whether tokenized data elements of input object 930 correspond to tokens in a tuple will be further described herein with reference to FIGS. 10-13. Event list update 924 can update an event list 960, indicating the particular registered data combination that is found in input object 930, the document identifier associated with the particular registered data combination, and any other desired information (e.g., date and time stamp, source and/or destination addresses of network traffic, port numbers, etc.).

Turning to FIG. 10, FIG. 10 is a simplified block diagram illustrating example data input and a resulting bit vector or bit hash table 1040, which may be generated by bit set operation 916 of evaluate module 910. Data element 1001 (word 1), data element 1002 (word 1), data element 1003 (expression element 1), and data element 1004 (expression element 2) represent example data elements of an input object, such as input object 930. Setting a bit position is done by changing a bit from “0” to “1” or from “1” to “0”, depending on which value is the default. In one embodiment, all bits in bit hash table 1040 are initialized to “0” and a bit associated with a particular bit position in bit hash table 1040 can be set to a “1” if a data element corresponding to the same bit position is found in the input object.

In one example embodiment, bit set operation 916 can determine which data elements correspond to which bit positions of bit hash table 1040 by using a known prime number hashing technique. Bit hash table 1040 may include m bits, where m is equal to a prime number. When a modulo operation is performed on a token generated for one of the data elements 1001-1004, the result of the modulo operation can indicate the bit position corresponding to the data element represented by the token. Thus, the bit corresponding to the particular bit position can then be set to indicate the presence of the data element in the input object. In the example in FIG. 10, bit position 2 may correspond to data element 1001 (word 1) and data element 1002 (word 1), bit position 5 may correspond to data element 1003 (expression element 1), and bit position 10 may correspond to data element 1004 (expression element 2). Accordingly, each of the bits corresponding to bit positions 2, 5, and 10 may be set to a 1.
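The prime number hashing described above can be sketched, for illustration only, as a small bit table whose bit position is the remainder of the token divided by the table size m. The class name BitHashTable, the table size of 11, and the token values 13, 27, and 21 are assumptions chosen so that the set positions happen to be 2, 5, and 10 as in the FIG. 10 example; they are not values disclosed in the figures.

```python
class BitHashTable:
    """Minimal bit table indexed by token modulo a prime size m (illustrative)."""

    def __init__(self, m=11):
        self.m = m              # m is assumed to be a prime number
        self.bits = [0] * m     # all bits initialized to the default "0"

    def position(self, token):
        return token % self.m   # remainder selects the bit position

    def set(self, token):
        self.bits[self.position(token)] = 1

    def is_set(self, token):
        return self.bits[self.position(token)] == 1

    def reset(self):
        self.bits = [0] * self.m


table = BitHashTable(11)
# Two occurrences of the same word share a token and therefore a bit position.
for token in (13, 13, 27, 21):
    table.set(token)

set_positions = [i for i, bit in enumerate(table.bits) if bit]
print(set_positions)  # [2, 5, 10]
```

A token such as 24 would also map to position 2 in this sketch, which is an instance of the statistical collisions that can occur between different data elements.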

With reference to FIGS. 11 and 12, simplified flowcharts illustrate operational processing of detection system 900. FIG. 11 is a simplified flowchart illustrating example operational steps for evaluate module 910 of detection system 900 and FIG. 12 is a simplified flowchart illustrating example operational steps for validate module 920 of detection system 900.

Turning to FIG. 11, evaluation processing flow 1100 may include extraction and tokenization functions for input object 930 similar to the extraction and tokenization functions applied to delimited data files by registration list processing flow 600 of FIGS. 6A and 6B. Evaluation processing flow 1100 may begin in step 1102 where a start of a first data element in input object 930 is identified. In step 1104, regular expression table 350 is searched to find a longest match to a character pattern of a string of characters beginning at the start of the data element. In one embodiment, expression patterns from regular expression table 350 are compared in order of size from longest to shortest to determine if there is a match.

In decision box 1106 a query is made as to whether a match from the regular expression table 350 was found. If it is determined that none of the regular expression patterns match a character pattern of any string of characters beginning at the start of the data element (i.e., the data element does not match any regular expression patterns in regular expression table 350), then the data element represents a word and flow moves to step 1112 to find an end of the data element (i.e., the word), which can be extracted in step 1114. The end of the word is the last consecutive essential character beginning at the start of the data element. After the word has been extracted in step 1114, flow moves to decision box 1116, where the word may be evaluated to determine whether it is a ‘stop word’, as previously described herein. If the word is determined to be a stop word, then it is ignored and the flow proceeds to decision box 1128 to determine whether the current word is the last data element in input object 930. If the current word is the last data element, then processing ends. However, if the word is not the last data element in input object 930, then flow moves to step 1130 to find the start of the next data element. Flow then loops back to step 1104 to perform the extraction, tokenization, and storage of the new data element.

With reference again to decision box 1116, if the current word is determined not to be a stop word, then flow moves to step 1118 where the word may be stemmed. A stemming process such as, for example, a Porter stemming algorithm, may be applied to the word, in which any suffixes and/or affixes can be removed from a stem of the word. After stemming has been performed if necessary, flow may pass to step 1120 where the word (or stemmed word) is tokenized. In one embodiment, tokenization includes converting the word (or stemmed word) into a 32-bit numerical representation or token, which is accomplished using the same technique used by registration list module 310 (e.g., Federal Information Processing Standards (FIPS) approved hash function).

After a token has been generated for the word in step 1120, a bit may be set in bit hash table 940 in step 1122. The set bit corresponds to a bit position in bit hash table 940 determined by performing a modulo operation on the token using the prime number size of the bit hash table, as previously described herein. The bit is set to indicate that the word, represented by the token, was found in input object 930. Some statistical collisions may occur in which tokens generated for two different data elements result in the same remainder. However, the system maintains statistical viability, at least in part because triggering an event requires a particular combination of data elements to be found in a document, rather than a single individual data element. In addition, collisions are typically infrequent when the table is sufficiently sized to a prime number.

After setting the proper bit in bit hash table 940, flow passes to decision box 1124 to determine whether the token corresponds to a token key in one of the indexes of index table 370. If the token corresponds to a token key in one of the indexes, then flow passes to step 1126 and the token is saved to pending key list 950. After the token is saved to pending key list 950, or if the token did not correspond to any token key of the indexes in index table 370, then flow passes to decision box 1128 to determine whether the data element corresponding to the current token is the last data element in input object 930. If the data element is not the last data element in input object 930, then flow passes to step 1130 where a start of the next data element is found. Flow then loops back to step 1104 to perform the extraction, tokenization, and storage of the new data element. With reference again to decision box 1128, if the data element is the last data element in input object 930, then the entire input object 930 has been processed and flow 1100 ends.

Referring back to decision box 1106, if it is determined that a match was found between an expression pattern of regular expression table 350 and a character pattern of a string of characters beginning at the start of the data element, then the data element represents an expression element and has the same length as the matching expression pattern. The expression element can be extracted in step 1108 and normalized in step 1110. In one embodiment, the particular type of normalizing employed by evaluate module 910 is the same type of normalizing employed in registration list module 310. As previously described herein, normalizing the expression element may include eliminating any separators from the expression element or modifying separators and/or particular essential characters of the expression element to achieve a predefined standard form for the expression element.

Once the expression element has been extracted and normalized, flow may move to step 1120 where the normalized expression element is tokenized. In step 1122, a bit may be set in bit hash table 940 corresponding to the value of a remainder resulting from a modulo operation on the token using the prime number size of the bit hash table, as previously described herein. After setting the proper bit in bit hash table 940, flow passes to decision box 1124 to determine whether the token corresponds to a token key in one of the indexes of index table 370. If the token corresponds to a token key in one of the indexes, then flow passes to step 1126 and the token is saved to pending key list 950. After the token is saved to pending key list 950, or if the token did not correspond to any token key in the indexes of index table 370, then flow passes to decision box 1128 to determine whether the data element corresponding to the current token is the last data element in input object 930. If the data element is not the last data element in input object 930, then flow passes to step 1130 where a start of the next data element is found. Flow then loops back to step 1104 to perform the extraction, tokenization, and storage of the new data element. With reference again to decision box 1128, if the data element is the last data element in input object 930, then the entire input object 930 has been processed and flow 1100 ends.
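For illustration only, evaluation processing flow 1100 might be approximated by the following sketch. The expression patterns, the stop word list, the table size of 1009, and the use of the first four bytes of a SHA-256 digest as the 32-bit token are all assumptions made for this example; stemming (step 1118) is omitted, and treating stop words case-insensitively is likewise an assumption.

```python
import hashlib
import re

# Hypothetical stand-ins for regular expressions table 350 and a stop word list.
EXPRESSION_PATTERNS = [
    re.compile(r"\d{3}-\d{2}-\d{4}"),  # e.g., a Social Security number
    re.compile(r"\d{4}-\d{2}-\d{2}"),  # e.g., a date
]
STOP_WORDS = {"the", "and", "of"}
WORD = re.compile(r"\w+")
PRIME_SIZE = 1009  # assumed prime size of the bit hash table


def tokenize(text):
    """Assumed tokenization: first 32 bits of a SHA-256 digest."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def normalize(expression):
    """Remove nonessential characters (here, anything not alphanumeric)."""
    return re.sub(r"[^0-9A-Za-z]", "", expression)


def evaluate(text, token_keys):
    bits = [0] * PRIME_SIZE
    pending = []
    pos = 0
    while pos < len(text):
        # Step 1104: look for the longest expression pattern match at pos.
        candidates = [p.match(text, pos) for p in EXPRESSION_PATTERNS]
        matches = [m for m in candidates if m]
        if matches:
            best = max(matches, key=lambda m: m.end())
            element = normalize(best.group())      # steps 1108 and 1110
            pos = best.end()
        else:
            word = WORD.match(text, pos)
            if word is None:
                pos += 1                           # separator; step 1130
                continue
            pos = word.end()                       # steps 1112 and 1114
            if word.group().lower() in STOP_WORDS:
                continue                           # decision box 1116
            element = word.group()
        token = tokenize(element)                  # step 1120
        bits[token % PRIME_SIZE] = 1               # step 1122
        if token in token_keys:                    # decision box 1124
            pending.append(token)                  # step 1126
    return bits, pending


token_keys = {tokenize("Deninger")}
bits, pending = evaluate("Carol Deninger and 000-00-0000", token_keys)
print(pending == [tokenize("Deninger")])  # True
```

Words and expression elements converge on the same tokenize, bit set, and pending key steps in this sketch, which mirrors the way both branches of flow 1100 rejoin at step 1120.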

Turning to FIG. 12, FIG. 12 illustrates example operational steps in a validation processing flow 1200 of validate module 920 of detection system 900. Generally, validation processing flow 1200 uses bit hash table 940, pending key list 950, registration list 360, and index table 370 to determine whether a registered combination of data elements, or a predetermined threshold amount thereof, is contained in input object 930.

Flow may begin in step 1202 where a first pending key is retrieved from pending key list 950. Flow then moves to step 1204 where index table 370 is searched for an index with a token key corresponding to the pending key. Once an index is found, flow moves to step 1206 to find a first tuple identified in the index. The first tuple can be identified by using a first offset linked to the token key in the index. The offset may point to a location in the registration list of a token at the beginning of the corresponding tuple.

Once the first token of the corresponding tuple has been identified in registration list 360 in step 1208, operational steps 1210 through 1222 process the tuple until an event is either validated (i.e., all data elements or a threshold amount of data elements of a registered data combination are present in input object 930) or not validated (i.e., neither all data elements nor a threshold amount of data elements of a registered data combination are found in input object 930). In decision box 1210, a query is made as to whether a bit corresponding to the token is set in bit hash table 940. Thus, a modulo operation may be performed on the token using the prime number size of bit hash table 940 to determine which bit position to check in bit hash table 940. If the bit in the appropriate bit position is set, then flow may pass to step 1212 where a data element count can be incremented. The data element count indicates a total number of tokens, from the tuple being processed, that are found in bit hash table 940. After the data element count has been incremented, or if the bit was not set in bit hash table 940, then flow passes to decision box 1214 to determine whether the current token is the last token in the tuple. If the current token is not the last token in the tuple, then flow passes to step 1216 to identify the next token in the tuple. Flow then loops back to decision box 1210 to determine whether a bit corresponding to the new token is set.

Once every token in the tuple has been processed, in decision box 1214 it is determined that the last token in the tuple has been evaluated. Flow may then pass to decision box 1218 where a query is made as to whether the data element count is greater than or equal to a predetermined threshold amount. In one embodiment, an event may be validated when all data elements from a single record of a delimited data file are found in an input document. Thus, in this embodiment, the predetermined threshold amount would equal the number of data elements in the record (i.e., the number of tokens in the corresponding tuple). However, other embodiments may use a certain percentage (e.g., 50%, 75%, etc.) or particular minimum number (e.g., 2, 3, 4, etc.) of the total number of data elements from a single record. Administratively, data protection manager 32 shown in FIG. 1 may be configured to allow an authorized user to set the predetermined threshold amount as desired.
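The threshold alternatives described above (all data elements, a percentage, or a minimum number) can be illustrated with the following sketch. The function names, the mode and value parameters, the choice to round a percentage up to a whole number of data elements, and the capping of a minimum number at the tuple length are assumptions made for this example only.

```python
import math


def required_count(tuple_len, mode="all", value=None):
    """Number of matching tokens needed to validate an event (illustrative)."""
    if mode == "all":
        return tuple_len
    if mode == "percent":
        # Assumption: round up so that, e.g., 75% of 10 requires 8 matches.
        return math.ceil(tuple_len * value / 100)
    if mode == "minimum":
        # Assumption: never require more matches than the tuple has tokens.
        return min(value, tuple_len)
    raise ValueError(f"unknown threshold mode: {mode}")


def is_event(data_element_count, tuple_len, mode="all", value=None):
    """Decision box 1218: compare the count to the predetermined threshold."""
    return data_element_count >= required_count(tuple_len, mode, value)


print(required_count(10))                 # 10
print(required_count(10, "percent", 75))  # 8
print(required_count(10, "minimum", 3))   # 3
print(is_event(7, 10, "percent", 75))     # False
```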

If the data element count meets or exceeds the predetermined threshold amount in decision box 1218, then an event is validated and the flow passes to step 1220 where task and file information are retrieved. In one example embodiment, file information may be retrieved from the document ID (docID) corresponding to the particular offset in the index used to locate the current tuple. In addition, other information related to input object 930 (e.g., transmission information such as source and destination addresses, source and destination ports, date and time, email addresses of an associated email message, file path of document, database, or other storage repository, etc.) may be obtained in order to correctly identify the particular object containing registered data combinations. In addition, the particular data elements of the registered combination of data elements found in input object 930 may be stored and/or displayed for an authorized user to review.

Once all of the desired information for a validated event has been retrieved, flow passes to step 1222 in which the event may be recorded in event list 960 and/or appropriate notifications (e.g., email notification, Syslog notification, status messages, etc.) may be provided to an authorized user including some or all of the retrieved information. The validation of an event can also trigger actions to prevent the transmission of an object that triggered the event validation or to lock down a database or other storage repository that triggered the event validation. Such enforcement actions can be implemented via capture system 29 or other existing infrastructure designed to stop the flow of data transmissions.

With reference again to decision box 1218, if the data element count does not meet the predetermined threshold, then no event is validated and steps 1220 and 1222 are bypassed. After all of the tokens of the current tuple have been processed and either an event has been validated or no event has been validated, then flow passes to decision box 1224 where a determination is made as to whether the tuple being processed is the last tuple identified in the index. If the current tuple is not the last tuple in the index, then the subsequent <NEXT> pointer in the index indicates the next tuple to be processed by designating an offset for the next tuple in registration list 360. Thus, if the index has a <NEXT> pointer that is not null, then flow passes to step 1226 and the next tuple is identified by the offset linked to the <NEXT> pointer. Flow then loops back to step 1208 to begin processing tokens of the next tuple to determine whether to validate an event for the next tuple.

With reference again to decision box 1224, if the current tuple is determined to be the last tuple in the index, then flow passes to decision box 1228 to determine whether the pending key is the last pending key in pending key list 950. If the current pending key is not the last one in pending key list 950, then the next pending key is retrieved from pending key list 950 in step 1230 and flow loops back to step 1204, where index table 370 is searched for a token key that corresponds to the new pending key. Flow then continues processing to determine whether to validate an event for each tuple indicated by the particular index of index table 370.

With reference again to decision box 1228, if the current pending key is the last pending key in pending key list 950, then all of the pending keys identified in input object 930 have been processed and events have been validated for corresponding tuples, if appropriate. Not shown in FIG. 12, however, are additional steps that may be performed after all of the pending keys have been processed to prepare memory allocations for subsequent detection system processing. For example, all bits in bit hash table 940 may be set to the default value (e.g., “0”), and a pointer of pending key list 950 may be reset to the beginning of the list.
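As a simplified, non-authoritative sketch of validation processing flow 1200, the nested loops over pending keys, tuples, and tokens might look as follows. Here the chain of <NEXT> pointers in an index is modeled as a plain list of (offset, docID) pairs, the table size of 101 and all token and docID values are made up, and the threshold parameter is a hypothetical minimum count that defaults to requiring every token of the tuple.

```python
PRIME_SIZE = 101    # assumed prime size of the bit hash table
BASE_OFFSET = 1001  # assumed offset of the first registration list entry
TERMINATOR = 0      # termination entry at the end of each tuple


def validate(pending_keys, index_table, registration_list, bits, threshold=None):
    events = []
    for key in pending_keys:                              # steps 1202, 1228, 1230
        for offset, doc_id in index_table.get(key, []):   # steps 1204, 1206, 1224, 1226
            i = offset - BASE_OFFSET                      # step 1208
            count = total = 0
            while registration_list[i] != TERMINATOR:     # boxes 1210 through 1216
                total += 1
                if bits[registration_list[i] % PRIME_SIZE]:
                    count += 1                            # step 1212
                i += 1
            needed = total if threshold is None else threshold
            if count >= needed:                           # decision box 1218
                events.append((doc_id, offset, count))    # steps 1220 and 1222
    return events


# Two registered tuples that share token key 55.
registration_list = [23, 55, 99, 0, 55, 61, 70, 0]
index_table = {55: [(1001, "docA"), (1005, "docB")]}

# Bits set for object tokens 23, 55, 99, and 40 found in an input object.
bits = [0] * PRIME_SIZE
for token in (23, 55, 99, 40):
    bits[token % PRIME_SIZE] = 1
pending_keys = [55]

events_all = validate(pending_keys, index_table, registration_list, bits)
events_min1 = validate(pending_keys, index_table, registration_list, bits, threshold=1)
print(events_all)   # [('docA', 1001, 3)]
print(events_min1)  # [('docA', 1001, 3), ('docB', 1005, 1)]

# Prepare memory for the next input object, as described above.
bits[:] = [0] * PRIME_SIZE
pending_keys.clear()
```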

Turning to FIG. 13, FIG. 13 illustrates a scenario in which an example input document 1302 is processed by detection system 900. A representative sample of data elements is shown in input document 1302, with ellipses indicating additional data elements not shown. In addition, a registration list 1310 and an index table 1320 are shown already created from registration system 300. Tokenized words 1304 show the object tokens generated for each of the data elements shown in input document 1302. During evaluation processing of detection system 900, a bit is set for each of the object tokens shown in bit hash table 1350. In addition, for each object token, index table 1320 is searched for a token key in an index corresponding to the object token. In the example data of FIG. 13, object token (55) is found in an index having a token key (55) and, therefore, object token (55) is stored in a pending key list 1340.

After bit hash table 1350 and pending key list 1340 have been generated, each of the pending keys in pending key list 1340 is processed to determine if a corresponding tuple of tokens, or a predetermined threshold amount of tokens in the corresponding tuple, is represented in bit hash table 1350. In the example scenario of FIG. 13, index table 1320 is searched for an index with a token key corresponding to pending key (55). Index 1322, having token key (55), is found and validation processing is performed as indicated at box 1306. The offset 1001 of index 1322 is used to identify tuple 1312. Each of the tokens in tuple 1312 is analyzed to determine if a corresponding bit is set in bit hash table 1350. In this case, all of the tokens of tuple 1312 are represented by a bit set in bit hash table 1350. Therefore, the predetermined threshold is met, an event is validated, and an event list may be updated as indicated in box 1308. Thus, in this example, detection system 900 determines that input document 1302 contains a threshold amount of a registered combination of data elements (i.e., data elements represented by tuple 1312) and, consequently, validates an event.

While the above described processing flows illustrate an example embodiment, alternatively, other processing flows may be implemented. For example, instead of sequentially processing each data element of a record in delimited data file 330, or sequentially processing each data element of input object 930, a parser may be used as described in U.S. patent application Ser. No. 12/358,399, which was previously incorporated herein by reference. In such an embodiment, a parser can parse extracted data to identify all of the expression elements within the particular record or object. Expression elements can be identified by parsing expression patterns from regular expressions table 350 over the record or object. In one embodiment, expression patterns are parsed over the record or object in descending order from longest to shortest. Once all of the expression elements are identified, then each word could be extracted from the remaining data in the record or object.

Software for achieving the registration and detection operations outlined herein can be provided at various locations (e.g., the corporate IT headquarters, network appliances distributed to egress points of a network, etc.). In other embodiments, this software could be received or downloaded from a web server (e.g., in the context of purchasing individual end-user licenses for separate networks, devices, servers, etc.) in order to provide this system for protecting specified combinations of data. In one example implementation, this software is resident in one or more computers sought to be protected from a security attack (or protected from unwanted or unauthorized manipulations of data).

In various examples, the software of the system for protecting specified data combinations in a computer network environment could involve a proprietary element (e.g., as part of a network security solution with McAfee® Network Data Loss Prevention (NDLP) software, McAfee® ePolicy Orchestrator (ePO) software, etc.), which could be provided in (or be proximate to) these identified elements, or be provided in any other device, server, network appliance, console, firewall, switch, information technology (IT) device, distributed server, etc., or be provided as a complementary solution (e.g., in conjunction with a firewall), or provisioned somewhere in the network.

In certain example implementations, the registration and detection activities outlined herein may be implemented in software. This could be inclusive of software provided in network appliances 12, 14, 16, 18, and 30 (e.g., registration system 22, detection systems 24, 26, and 28, and capture system 29). These elements and/or modules can cooperate with each other in order to perform registration and detection activities as discussed herein. In other embodiments, these features may be provided external to these elements, included in other devices to achieve these intended functionalities, or consolidated in any appropriate manner. For example, some of the processors associated with the various elements may be removed, or otherwise consolidated such that a single processor and a single memory location are responsible for certain activities. In a general sense, the arrangement depicted in FIG. 1 may be more logical in its representation, whereas a physical architecture may include various permutations, combinations, and/or hybrids of these elements.

In various embodiments, all of these elements (e.g., network appliances 12, 14, 16, 18, and 30) include software (or reciprocating software) that can coordinate, manage, or otherwise cooperate in order to achieve the registration and detection operations, as outlined herein. One or all of these elements may include any suitable algorithms, hardware, software, components, modules, interfaces, or objects that facilitate the operations thereof. In the implementation involving software, such a configuration may be inclusive of logic encoded in one or more tangible media (e.g., embedded logic provided in an application specific integrated circuit (ASIC), digital signal processor (DSP) instructions, software (potentially inclusive of object code and source code) to be executed by a processor, or other similar machine, etc.), which may be inclusive of non-transitory media.

In some of these instances, one or more memory elements (e.g., main memory 230, secondary storage 240, etc.) can store data used for the operations described herein. This includes the memory element being able to store software, logic, code, or processor instructions that are executed to carry out the activities described in this Specification. A processor can execute any type of instructions associated with the data to achieve the operations detailed herein in this Specification. In one example, the processor (as shown in FIG. 2) could transform an element or an article (e.g., data) from one state or thing to another state or thing. In another example, the activities outlined herein may be implemented with fixed logic or programmable logic (e.g., software/computer instructions executed by a processor) and the elements identified herein could be some type of a programmable processor, programmable digital logic (e.g., a field programmable gate array (FPGA), an erasable programmable read only memory (EPROM), an electrically erasable programmable read only memory (EEPROM)), an ASIC that includes digital logic, software, code, electronic instructions, flash memory, optical disks, CD-ROMs, DVD ROMs, magnetic or optical cards, other type of machine-readable medium suitable for storing electronic instructions, or any suitable combination thereof.

In various embodiments, the registration and detection systems 22, 24, 26, and 28 have been described above as systems implemented in stand-alone devices, such as network appliances 12, 14, 16, and 18. In one embodiment, the registration and detection systems 22, 24, 26, and 28 can be implemented in an appliance constructed using commonly available computing equipment and storage systems capable of supporting the software requirements. However, the registration and detection systems could alternatively be implemented on any computer capable of intercepting and accessing data from a network. For example, registration system 22 could be implemented on a server of network 100 shown in FIG. 1. In another example, detection systems 24, 26, and 28 could be implemented on their respective gateways and routers/switches.

Any of the memory items discussed herein should be construed as being encompassed within the broad term ‘memory element.’ Similarly, any of the potential processing elements, modules, and machines described in this Specification should be construed as being encompassed within the broad term ‘processor.’ Each of the computers may also include suitable interfaces for receiving, transmitting, and/or otherwise communicating data or information in a network environment.

Note that with the numerous examples provided herein, interaction may be described in terms of two, three, four, or more network elements. However, this has been done for purposes of clarity and example only. It should be appreciated that the system can be consolidated in any suitable manner. Along similar design alternatives, any of the illustrated computers, modules, components, and elements of FIG. 1 may be combined in various possible configurations, all of which are clearly within the broad scope of this Specification. In certain cases, it may be easier to describe one or more of the functionalities of a given set of flows by only referencing a limited number of network elements. It should be appreciated that the system of FIG. 1 (and its teachings) is readily scalable and can accommodate a large number of components, as well as more complicated/sophisticated arrangements and configurations. Accordingly, the examples provided should not limit the scope or inhibit the broad teachings of data combination protection system 10 as potentially applied to a myriad of other architectures.

It is also important to note that the operations described with reference to the preceding FIGURES illustrate only some of the possible scenarios that may be executed by, or within, the system. Some of these operations may be deleted or removed where appropriate, or these steps may be modified or changed considerably without departing from the scope of the discussed concepts. In addition, the timing of these operations may be altered considerably and still achieve the results taught in this disclosure. The preceding operational flows have been offered for purposes of example and discussion. Substantial flexibility is provided by the system in that any suitable arrangements, chronologies, configurations, and timing mechanisms may be provided without departing from the teachings of the discussed concepts.

Claims

1. At least one non-transitory, computer readable medium comprising instructions that, when executed, cause one or more processors to perform a method comprising: identifying an object including a plurality of data elements, wherein the plurality of data elements correspond to a plurality of object tokens; identifying a tuple or record based, at least in part, on an identification of a token key associated with one of the plurality of object tokens, wherein the token key is one of a plurality of registered tokens included in the tuple; and taking an action based on a determination that a number of the plurality of registered tokens corresponding to the plurality of object tokens at least satisfies a predetermined threshold, wherein the action includes preventing transmission of the object or locking down a database or a storage repository.
2. The at least one computer readable medium of claim 1, the method further comprising: tokenizing the plurality of data elements into the plurality of object tokens, wherein the object is a data file, document, or storage repository.
3. The at least one computer readable medium of claim 2, wherein the plurality of data elements are tokenized by converting each of the data elements to a respective hash value.
4. The at least one computer readable medium of claim 1, the method further comprising: using an offset related to the token key to identify a beginning of the tuple or record.
5. The at least one computer readable medium of claim 1, wherein the token key occurs with less frequency across a plurality of tuples in a registration list than frequencies at which other registered tokens of the tuple or record occur across the plurality of tuples.
6. The at least one computer readable medium of claim 1, the method further comprising: representing the plurality of object tokens in a bit hash table by setting a respective bit in the bit hash table for the plurality of object tokens; and determining, for each registered token of the plurality of registered tokens, whether a bit is set in a bit position of the bit hash table that corresponds to the respective registered token.
7. The at least one computer readable medium of claim 1, wherein, if two or more tuples of a registration list are indexed by the token key, an index includes two or more offsets indicating respective locations of the two or more tuples, each of the two or more tuples includes a respective set of data file tokens, and each of the respective sets of data file tokens includes the token key.
8. An apparatus, comprising: a memory device including a set of instructions; and a processor, coupled to the memory device, that, when executing the set of instructions, identifies an object including a plurality of data elements, wherein the plurality of data elements correspond to a plurality of object tokens, identifies a tuple or record based, at least in part, on an identification of a token key associated with one of the plurality of object tokens, wherein the token key is one of a plurality of registered tokens included in the tuple, and takes an action based on a determination that a number of the plurality of registered tokens corresponding to the plurality of object tokens at least satisfies a predetermined threshold, wherein the action includes preventing transmission of the object or locking down a database or a storage repository.
9. The apparatus of claim 8, wherein the processor, when executing the set of instructions, tokenizes the plurality of data elements into the plurality of object tokens, and the object is a data file, document, or storage repository.
10. The apparatus of claim 9, wherein the plurality of data elements are tokenized by converting each of the data elements to a respective hash value.
11. The apparatus of claim 8, wherein the processor, when executing the set of instructions, uses an offset related to the token key to identify a beginning of the tuple or record.
12. The apparatus of claim 8, wherein the token key occurs with less frequency across a plurality of tuples in a registration list than frequencies at which other registered tokens of the tuple or record occur across the plurality of tuples.
13. The apparatus of claim 8, wherein the processor, when executing the set of instructions, represents the plurality of object tokens in a bit hash table by setting a respective bit in the bit hash table for the plurality of object tokens, and determines, for each registered token of the plurality of registered tokens, whether a bit is set in a bit position of the bit hash table that corresponds to the respective registered token.
14. The apparatus of claim 8, wherein, if two or more tuples of a registration list are indexed by the token key, an index includes two or more offsets indicating respective locations of the two or more tuples, each of the two or more tuples includes a respective set of data file tokens, and each of the respective sets of data file tokens includes the token key.
15. A method, comprising: identifying an object including a plurality of data elements, wherein the plurality of data elements correspond to a plurality of object tokens; identifying a tuple or record based, at least in part, on an identification of a token key associated with one of the plurality of object tokens, wherein the token key is one of a plurality of registered tokens included in the tuple; and taking an action based on a determination that a number of the plurality of registered tokens corresponding to the plurality of object tokens at least satisfies a predetermined threshold, wherein the action includes preventing transmission of the object or locking down a database or a storage repository.
16. The method of claim 15, further comprising: tokenizing the plurality of data elements into the plurality of object tokens, wherein the object is a data file, document, or storage repository.
17. The method of claim 16, wherein the plurality of data elements are tokenized by converting each of the data elements to a respective hash value.
18. The method of claim 15, further comprising: using an offset related to the token key to identify a beginning of the tuple or record.
19. The method of claim 15, wherein the token key occurs with less frequency across a plurality of tuples in a registration list than frequencies at which other registered tokens of the tuple or record occur across the plurality of tuples.
20. The method of claim 15, further comprising: representing the plurality of object tokens in a bit hash table by setting a respective bit in the bit hash table for the plurality of object tokens; and determining, for each registered token of the plurality of registered tokens, whether a bit is set in a bit position of the bit hash table that corresponds to the respective registered token.
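
For illustration only, the detection flow recited in the claims above (hash-based tokenization, a bit hash table of object tokens, lookup of a tuple through its token key and an offset, and a threshold test) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of any embodiment: the function names, the use of SHA-256, the whitespace delimiter, the 65,536-bit table size, the flat list layout (a length entry followed by the tuple's tokens), and the "block"/"allow" return values are all hypothetical choices made for the example.

```python
import hashlib
from collections import Counter

BITS = 1 << 16  # assumed size of the bit hash table


def hash_element(element):
    """Tokenize one data element by converting it to a hash value (assumed: SHA-256)."""
    return hashlib.sha256(element.lower().encode("utf-8")).hexdigest()


def tokenize(text):
    """Split an object's text on whitespace (assumed delimiter) and hash each element."""
    return [hash_element(e) for e in text.split()]


def bit_position(token):
    """Map a token to a bit position in the bit hash table."""
    return int(token[:8], 16) % BITS


def build_bit_table(tokens):
    """Set one bit per object token."""
    table = bytearray(BITS // 8)
    for tok in tokens:
        pos = bit_position(tok)
        table[pos // 8] |= 1 << (pos % 8)
    return table


def bit_is_set(table, token):
    pos = bit_position(token)
    return bool(table[pos // 8] & (1 << (pos % 8)))


def build_registration(records):
    """Build a flat registration list plus an index keyed by each tuple's token key.

    The token key is taken to be the tuple's least frequent token across all tuples.
    The index maps a token key to the offsets at which its tuples begin.
    """
    tuples = [[hash_element(e) for e in rec] for rec in records]
    freq = Counter(tok for tup in tuples for tok in set(tup))
    reg_list, index = [], {}
    for tup in tuples:
        key = min(tup, key=lambda tok: freq[tok])
        index.setdefault(key, []).append(len(reg_list))  # offset of tuple start
        reg_list.append(len(tup))                        # assumed layout: length first
        reg_list.extend(tup)
    return reg_list, index


def detect(text, reg_list, index, threshold):
    """Return "block" if any registered tuple has at least `threshold` tokens in the object."""
    obj_tokens = tokenize(text)
    table = build_bit_table(obj_tokens)
    for tok in obj_tokens:
        for offset in index.get(tok, []):       # tok is a token key
            n = reg_list[offset]
            registered = reg_list[offset + 1: offset + 1 + n]
            matched = sum(bit_is_set(table, r) for r in registered)
            if matched >= threshold:
                return "block"                  # e.g., prevent transmission
    return "allow"


records = [["alice", "555-12-3456", "1975-04-02"],
           ["alice", "555-98-7654", "1980-11-23"]]
reg_list, index = build_registration(records)
result = detect("Contact alice SSN 555-12-3456 born 1975-04-02", reg_list, index, 3)
```

Because a bit hash table can report a bit as set when two different tokens share a position, a sketch like this can over-count matches; a larger table or a follow-up exact comparison would reduce that risk. Selecting the least frequent token as the token key keeps the number of tuples that must be examined for a given object token small.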

RELATED U.S. APPLICATION INFORMATION

This application is a continuation of (and claims the benefit under 35 U.S.C. § 120) from U.S. application Ser. No. 15/700,826, filed Sep. 11, 2017, entitled “SYSTEM AND METHOD FOR PROTECTING SPECIFIED DATA COMBINATIONS,” which is a continuation of (and claims the benefit under 35 U.S.C. § 120) from U.S. application Ser. No. 14/457,038, filed Aug. 11, 2014, entitled “SYSTEM AND METHOD FOR PROTECTING SPECIFIED DATA COMBINATIONS,” issued as U.S. Pat. No. 9,794,254 on Oct. 17, 2017, which is a continuation of (and claims the benefit under 35 U.S.C. § 120) from U.S. application Ser. No. 12/939,340, filed Nov. 4, 2010, entitled “SYSTEM AND METHOD FOR PROTECTING SPECIFIED DATA COMBINATIONS,” issued as U.S. Pat. No. 8,806,615 on Aug. 12, 2014, and this application is related to U.S. patent application Ser. No. 12/358,399, filed Feb. 25, 2009, entitled “SYSTEM AND METHOD FOR INTELLIGENT STATE MANAGEMENT,” issued as U.S. Pat. No. 8,473,442 on Jun. 25, 2013, commonly assigned to the assignee hereof. The disclosures of these applications are considered part of and are incorporated by reference herein in their entireties.

Non-Patent Literature Citations (103)

Entry
Non-Final Office Action from U.S. Appl. No. 10/854,005 dated Nov. 5, 2008.
Final Office Action from U.S. Appl. No. 10/854,005 dated May 11, 2009.
Non-Final Office Action from U.S. Appl. No. 10/854,005 dated Oct. 15, 2009.
Non-Final Office Action from U.S. Appl. No. 10/854,005 dated Mar. 25, 2010.
Final Office Action from U.S. Appl. No. 10/854,005 dated Sep. 14, 2010.
Final Office Action from U.S. Appl. No. 10/854,005 dated Dec. 2, 2010.
Office Action from U.S. Appl. No. 10/854,005, dated Feb. 16, 2011.
Final Office Action from U.S. Appl. No. 10/854,005 dated Aug. 4, 2011.
Notice of Allowance for U.S. Appl. No. 10/854,005 dated Aug. 23, 2012.
Notice of Allowance for U.S. Appl. No. 10/854,005 dated Jun. 3, 2013.
Office Action from U.S. Appl. No. 11/388,734, dated Feb. 5, 2008.
Final Office Action from U.S. Appl. No. 11/388,734, dated Jul. 24, 2008.
Notice of Allowance for U.S. Appl. No. 11/388,734 dated Dec. 11, 2012.
Notice of Allowance for U.S. Appl. No. 11/388,734 dated Apr. 4, 2013.
Office Action from U.S. Appl. No. 14/042,202, dated Aug. 21, 2015.
Notice of Allowance from U.S. Appl. No. 14/042,202, dated Feb. 19, 2016.
Non-Final Office Action from U.S. Appl. No. 14/457,038, dated May 11, 2015.
Final Office Action from U.S. Appl. No. 14/457,038, dated Aug. 24, 2015.
Office Action from U.S. Appl. No. 14/457,038, dated Feb. 22, 2016.
Office Action from U.S. Appl. No. 14/457,038 , dated Sep. 6, 2016.
Notice of Allowance from U.S. Appl. No. 14/457,038, dated Jan. 27, 2017.
Notice of Allowance from U.S. Appl. No. 14/457,038, dated May 22, 2017.
Office Action from U.S. Appl. No. 14/942,587, dated Jun. 30, 2016.
A Model-Driven Approach for Documenting Business and Requirements Interdependencies for Architectural Decision Making Berrocal, J.; Garcia Alonso, J.; Vicente Chicote, C.; Murillo, J.M. Latin America Transactions, IEEE (Revista IEEE America Latina) Year: 2014, vol. 12, Issue: 2 pp. 227-235, DOI: 10.1109/TLA.2014.6749542.
ACM Digital Library, “Tuple Token Registration,” search on Mar. 8, 2018 4:36:08 PM, 5 pages. retrieved and printed from https://dl.acm.org/results.cfm?query=tuple+registration+token&Go.x=44&Go.y=2.
Analysis of Stroke Intersection for Overlapping PGF Elements Yan Chen; Xiaoqing Lu; Jingwei Qu; Zhi Tang 2016 12th IAPR Workshop on Document Analysis Systems (DAS) Year: 2016; pp. 245-250, DOI: 10.1109/DAS.2016.11 IEEE Conference Publications.
Chapter 1. Introduction, “Computer Program product for analyzing network traffic,” Ethereal. Computer program product for analyzing network traffic, pp. 17-26, http://web.archive.org/web/20030315045117/www.ethereal.com/distribution/docs/user-guide, approximated copyright 2004-2005, printed Mar. 12, 2009.
Compression of Boolean inverted files by document ordering Gelbukh, A.; Sangyong Han; Sidorov, G. Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on Year: 2003 pp. 244-249, DOI: 10.1109/NLPKE.2003.1275907.
Compressing Inverted Files in Scalable Information Systems by Binary Decision Diagram Encoding Chung-Hung Lai; Tien-Fu Chen Supercomputing, ACM/IEEE 2001 Conference Year: 2001 pp. 36-36, DOI: 10.1109/SC.2001.10019.
Further Result on Distribution Properties of Compressing Sequences Derived From Primitive Sequences Over Qun-Xiong Zheng; Wen-Feng Qi; Tian Tian Information Theory, IEEE Transactions on Year: 2013, vol. 59, Issue: 8 pp. 5016-5022, DOI: 10.1109/TIT.2013.2258712.
Google Scholar, “Token Registration Tuples” search on Mar. 8, 2018 4:35:17 PM, 2 pages retrieved and printed from https://scholar.google.com/scholar?hl=en&as_sdt=0%2C44&q=token+registration+tuples&btnG=.
Peter Gordon, “Data Leakage—Threats and Mitigation”, IN: SANS Inst. (2007). http://www.sans.org/reading-room/whitepapers/awareness/data-leakage-mitigation-1931?show=data-leakage-threats-mitigation-1931&cat=awareness (69 pages).
Han, OLAP Mining: An Integration of OLAP with Data Mining, Oct. 1997, pp. 1-18.
IEEE Xplore, “Tuple Token Registration,” search on Mar. 8, 2018, 8 pages retrieved and printed from http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=tuple%20token%20registration.
Integrated Modeling and Verification of Real-Time Systems through Multiple Paradigms Marcello M. Bersani; Carlo A. Furia; Matteo Pradella; Matteo Rossi 2009 Seventh IEEE International Conference on Software Engineering and Formal Methods Year: 2009 pp. 13-22, DOI: 10.1109/SEFM.2009.16 IEEE Conference Publications.
Mao et al. “MOT: Memory Online Tracing of Web Information System,” Proceedings of the Second International Conference on Web Information Systems Engineering (WISE '01); pp. 271-277, (IEEE0-0-7695-1393-X/02) Aug. 7, 2002 (7 pages).
Microsoft Outlook, Outlook, copyright 1995-2000, 2 pages.
Niemi, Constructing OLAP Cubes Based on Queries, Nov. 2001, pp. 1-7.
Preneel, Bart, “Cryptographic Hash Functions”, Proceedings of the 3rd Symposium on State and Progress of Research in Cryptography, 1993, pp. 161-171.
Schultz, Data Mining for Detection of New Malicious Executables, May 2001, pp. 1-13.
Walter Allasia et al., Indexing and Retrieval of Multimedia Metadata on a Secure DHT, University of Torino, Italy, Department of Computer Science, Aug. 31, 2008, 16 pages.
Webopedia, definition of “filter”, 2002, p. 1.
Werth, T. et al., “Chapter 1—DAG Mining in Procedural Abstraction,” Programming Systems Group; Computer Science Department, University of Erlangen-Nuremberg, Germany (in Sep. 19, 2011 Nonfinal Rejection).
International Search Report and Written Opinion and Declaration of Non-Establishment of International Search Report for International Application No. PCT/US2011/024902 dated Aug. 1, 2011.
International Preliminary Report on Patentability Written Opinion of the International Searching Authority for International Application No. PCT/US2011/024902 dated May 7, 2013.
Office Action issued by the Chinese Patent Office dated Mar. 10, 2016 in Chinese Patent Application No. 201180058414.4.
Notice of Allowance issued by the Chinese Patent Office dated Sep. 17, 2016 in Chinese Patent Application No. 201180058414.4.
EPO Official Action for EP Application No. 11 704 904.9 dated Feb. 15, 2017.
EPO Official Action for EP Application No. 11 704 904.9 dated Feb. 19, 2018.
English Translation of the Notice of Allowance, KIPO dated Apr. 15, 2015, Notice of Allowance Summary.
Korean Patent Office Notice of Preliminary Rejection for Korean Patent Application No. 2013-7014404 dated Oct. 8, 2014 [Translation provided].
Korean Patent Office Notice of Preliminary Rejection for Korean Patent Application No. 2013-7014404 dated Apr. 22, 2014 [Translation provided].
Japanese Patent Office Notification of Reasons for Refusal for JP Patent Application No. 2013537659 dated Jul. 22, 2014 [Translation provided].
U.S. Appl. No. 13/024,923, filed Feb. 10, 2011, entitled “High Speed Packet Capture,” Inventor(s) Weimin Liu, et al.
U.S. Appl. No. 13/047,068, filed Mar. 14, 2011, entitled “Cryptographic Policy Enforcement,” Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 13/049,533, filed Mar. 16, 2011, entitled “File System for a Capture System,” Inventor(s) Rick Lowe, et al.
U.S. Appl. No. 13/089,158, filed Apr. 18, 2011, entitled “Attributes of Captured Objects in a Capture System,” Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 13/099,516, filed May 3, 2011, entitled “Object Classification in a Capture System,” Inventor(s) William Deninger, et al.
U.S. Appl. No. 11/254,436, filed Oct. 19, 2005, entitled “Attributes of Captured Objects in a Capture System,” Inventor(s) William Deninger et al.
U.S. Appl. No. 11/900,964, filed Sep. 14, 2007, entitled “System and Method for Indexing a Capture System,” Inventor(s) Ashok Doddapaneni et al.
U.S. Appl. No. 12/190,536, filed Aug. 12, 2008, entitled “Configuration Management for a Capture/Registration System,” Inventor(s) Jitendra B. Gaitonde et al.
U.S. Appl. No. 12/352,720, filed Jan. 13, 2009, entitled “System and Method for Concept Building,” Inventor(s) Ratinder Paul Singh Ahuja et al.
U.S. Appl. No. 12/354,688, filed Jan. 15, 2009, entitled “System and Method for Intelligent Term Grouping,” Inventor(s) Ratinder Paul Ahuja et al.
U.S. Appl. No. 12/358,399, filed Jan. 23, 2009, entitled “System and Method for Intelligent State Management,” Inventor(s) William Deninger et al.
U.S. Appl. No. 12/360,537, filed Jan. 27, 2009, entitled “Database for a Capture System,” Inventor(s) Rick Lowe et al.
U.S. Appl. No. 12/410,875, filed Mar. 25, 2009, entitled “System and Method for Data Mining and Security Policy Management,” Inventor(s) Ratinder Paul Singh Ahuja et al.
U.S. Appl. No. 12/410,905, filed Mar. 25, 2009, entitled “System and Method for Managing Data and Policies,” Inventor(s) Ratinder Paul Singh Ahuja et al.
U.S. Appl. No. 12/690,153, filed Jan. 20, 2010, entitled “Query Generation for a Capture System,” Inventor(s) Erik de la Iglesia, et al.
U.S. Appl. No. 12/751,876, filed Mar. 31, 2010, entitled “Attributes of Captured Objects in a Capture System,” Inventor(s) William Deninger, et al.
U.S. Appl. No. 12/829,220, filed Jul. 1, 2010, entitled “Verifying Captured Objects Before Presentation,” Inventor(s) Rick Lowe, et al.
U.S. Appl. No. 12/873,061, filed Aug. 31, 2010, entitled “Document Registration,” Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 12/873,860, filed Sep. 1, 2010, entitled “A System and Method for Word Indexing in a Capture System and Querying Thereof,” Inventor(s) William Deninger, et al.
U.S. Appl. No. 12/939,340, filed Nov. 3, 2010, entitled “System and Method for Protecting Specified Data Combinations,” Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 12/967,013, filed Dec. 13, 2010, entitled “Tag Data Structure for Maintaining Relational Data Over Captured Objects,” Inventor(s) Erik de la Iglesia, et al.
U.S. Appl. No. 13/168,739, filed Jun. 24, 2011, entitled “Method and Apparatus for Data Capture and Analysis System,” Inventor(s) Erik de la Iglesia, et al.
U.S. Appl. No. 13/187,421, filed Jul. 20, 2011, entitled “Query Generation for a Capture System,” Inventor(s) Erik de la Iglesia, et al.
U.S. Appl. No. 13/188,441, filed Jul. 21, 2011, entitled “Locational Tagging in a Capture System,” Inventor(s) William Deninger et al.
U.S. Appl. No. 13/422,791, filed Mar. 16, 2012, entitled “System and Method for Data Mining and Security Policy Management”, Inventor, Weimin Liu.
U.S. Appl. No. 13/424,249, filed Mar. 19, 2012, entitled “System and Method for Data Mining and Security Policy Management”, Inventor, Weimin Liu.
U.S. Appl. No. 13/431,678, filed Mar. 27, 2012, entitled “Attributes of Captured Objects in a Capture System”, Inventors William Deninger, et al.
U.S. Appl. No. 13/436,275, filed Mar. 30, 2012, entitled “System and Method for Intelligent State Management”, Inventors William Deninger, et al.
U.S. Appl. No. 13/337,737, filed Dec. 27, 2011, entitled “System and Method for Providing Data Protection Workflows in a Network Environment”, Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 13/338,060, filed Dec. 27, 2011, entitled “System and Method for Providing Data Protection Workflows in a Network Environment”, Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 13/338,159, filed Dec. 27, 2011, entitled “System and Method for Providing Data Protection Workflows in a Network Environment”, Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 13/338,195, filed Dec. 27, 2011, entitled “System and Method for Providing Data Protection Workflows in a Network Environment”, Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 14/157,130, filed Jan. 16, 2014, entitled “System and Method for Providing Data Protection Workflows in a Network Environment”, Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 14/042,202, filed Sep. 30, 2013, entitled “Document De-Registration”, Inventor(s) Ratinder Paul Singh Ahuja, et al.
U.S. Appl. No. 13/896,210, filed May 16, 2013, entitled “System and Method for Data Mining and Security Policy Management”, Inventor(s) Ratinder Paul Singh Ahuja et al.
U.S. Appl. No. 14/181,521, filed Feb. 14, 2014.
U.S. Appl. No. 14/222,477, filed Mar. 21, 2014.
U.S. Appl. No. 14/457,038, filed Aug. 11, 2014.
Advisory Action from U.S. Appl. No. 10/815,239 dated May 13, 2009.
Advisory Action from U.S. Appl. No. 10/854,005 dated Aug. 5, 2009.
Advisory Action from U.S. Appl. No. 11/388,734 dated Jan. 26, 2009.
Office Action from U.S. Appl. No. 10/815,239, dated Jun. 13, 2007.
Final Office Action from U.S. Appl. No. 10/815,239, dated Feb. 8, 2008.
Final Office Action from U.S. Appl. No. 10/815,239 dated Mar. 17, 2009.
Non-Final Office Action from U.S. Appl. No. 10/815,239 dated Aug. 18, 2009.
Final Office Action from U.S. Appl. No. 10/815,239 dated Nov. 30, 2009.
Non-Final Office Action from U.S. Appl. No. 10/815,239 dated Jun. 8, 2009.
Notice of Allowance for U.S. Appl. No. 10/815,239 dated Feb. 24, 2010.
Notice of Allowance for U.S. Appl. No. 10/815,239 dated Jun. 1, 2010.
Non-Final Office Action from U.S. Appl. No. 10/854,005 dated Feb. 5, 2008.

Related Publications (1)

	Number	Date	Country
	20190230076 A1	Jul 2019	US

Continuations (3)

	Number	Date	Country
Parent	15700826	Sep 2017	US
Child	16365812		US
Parent	14457038	Aug 2014	US
Child	15700826		US
Parent	12939340	Nov 2010	US
Child	14457038		US

System and method for protecting specified data combinations

Abstract