The invention relates to the field of data security. More particularly, the invention relates to detecting and protecting data in computer files.
Many businesses receive correspondence, such as from customers or vendors, which may contain sensitive data, such as confidential financial information. This correspondence may be stored in computer data files. For example, the stored correspondence may include emails that are stored in email archives or other storage. The stored correspondence may also include documents scanned into a computer system and stored as text or other data files.
The stored correspondence files may be accessible by a large number of people in a data-driven company, such as a bank. Since it is not always known which stored correspondence files contain sensitive information, when they were received or archived, or where they are currently stored, it is difficult to protect the correspondence files that contain sensitive information. The stored correspondence files or other files containing sensitive information may occupy a large amount of space in a computer system. It is time consuming to go through each correspondence file to determine if sensitive information is contained in the file. Other problems exist.
It is therefore desirable to address the drawbacks in conventional network file filtering systems.
The invention, which overcomes these and other problems in the art, relates to a system and method for network file filtering that include scanning at least one data file for the density of a selected pattern. The invention may restrict access to the file if the density of the selected pattern in the file is greater than or equal to a predetermined key word density threshold.
The invention is described in relation to a network file filtering system and method. Nonetheless, the characteristics and parameters pertaining to the system and method may be applicable to other types of file filtering systems and other data or file identification or search systems. Like elements are referred to using like numerals for clarity throughout the drawings and description.
Although only four sites or nodes 1-4 are shown, any number of sites 1-4 may exist in system 10. In one embodiment, system 10 may include only one site 1-4. In another embodiment, system 10 may include as many sites as necessary or desired by a user.
In one embodiment, system 10 may include a server for managing network-related traffic. In one embodiment, each of sites 1-4 may include a network server. The server may be or include, for instance, a workstation running the Microsoft Windows™ NT™, Windows™ 2000, Unix, Linux, Xenix, IBM AIX™, Hewlett-Packard UX™, Novell Netware™, Sun Microsystems Solaris™, OS/2™, BeOS™, Mach, Apache, OpenStep™ or other operating system or platform.
Each of sites 1-4 may communicate to each other and to network 5 through communications link 7. Communications link 7 may be a part of network 5 in one embodiment. Communications link 7 may be, include or interface to any one or more of, for instance, the Internet, an intranet, a PAN (Personal Area Network), a LAN (Local Area Network), a WAN (Wide Area Network) or a MAN (Metropolitan Area Network), a storage area network (SAN), a frame relay connection, an Advanced Intelligent Network (AIN) connection, a synchronous optical network (SONET) connection, a digital T1, T3, E1 or E3 line, Digital Data Service (DDS) connection, DSL (Digital Subscriber Line) connection, an Ethernet connection, an ISDN (Integrated Services Digital Network) line, a dial-up port such as a V.90, V.34 or V.34bis analog modem connection, a cable modem, an ATM (Asynchronous Transfer Mode) connection, or an FDDI (Fiber Distributed Data Interface) or CDDI (Copper Distributed Data Interface) connection.
Communications link 7 may furthermore be, include or interface to any one or more of a WAP (Wireless Application Protocol) link, a GPRS (General Packet Radio Service) link, a GSM (Global System for Mobile Communication) link, a CDMA (Code Division Multiple Access) or TDMA (Time Division Multiple Access) link such as a cellular phone channel, a GPS (Global Positioning System) link, CDPD (cellular digital packet data), a RIM (Research in Motion, Limited) duplex paging type device, a Bluetooth radio link, or an IEEE 802.11-based radio frequency link. Communications link 7 may yet further be, include or interface to any one or more of an RS-232 serial connection, an IEEE-1394 (Firewire) connection, a Fibre Channel connection, an IrDA (infrared) port, a SCSI (Small Computer Systems Interface) connection, a USB (Universal Serial Bus) connection or other wired or wireless, digital or analog interface or connection.
Sites 1-4 may communicate with each other and with network 5 using network enabled code. Network enabled code may be, include or interface to, for example, HyperText Markup Language (HTML), Dynamic HTML, Extensible Markup Language (XML), Extensible Stylesheet Language (XSL), Document Style Semantics and Specification Language (DSSSL), Cascading Style Sheets (CSS), Synchronized Multimedia Integration Language (SMIL), Wireless Markup Language (WML), Java™, Jini™, C, C++, Perl, UNIX Shell, Visual Basic or Visual Basic Script, Virtual Reality Markup Language (VRML), ColdFusion™ or other compilers, assemblers, interpreters or other computer languages or platforms.
Database 26 may be used to store data regarding scanning criteria, restricting criteria, scanning and restricting algorithms, identification of files that need to be restricted and any other data associated with filtering files having sensitive information in a system 10. The database 26 may be, include or interface to, for example, the Oracle™ relational database sold commercially by Oracle Corp. Other databases, such as Informix™, DB2 (Database 2), Sybase or other data storage or query formats, platforms or resources such as OLAP (On Line Analytical Processing), SQL (Structured Query Language), a storage area network (SAN), Microsoft Access™ or others may also be used, incorporated or accessed in the invention.
As will be described in more detail below, the various processes illustrated in
At step 301, the scanning module 22 may scan at least one data file in the system 10 for density of a selected pattern. In one embodiment, scanning the data files for the density of a selected pattern may include scanning the data file for occurrences of the selected pattern and determining the density of the selected pattern in the data file. In one embodiment, the density of the selected pattern may be determined by dividing the number of occurrences of the selected pattern by the size of the data file, which may be given in any format known today or later developed.
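By way of illustration only, the density calculation described above might be sketched as follows. The function name `pattern_density`, the use of a regular expression to count occurrences, and the choice of bytes as the unit of file size are assumptions made for this sketch and are not required by the invention.

```python
import re


def pattern_density(data: bytes, pattern: bytes) -> float:
    """Return occurrences of `pattern` divided by the file size in bytes.

    Measuring size in bytes is an assumption; any size format may be used.
    """
    if not data:
        # An empty file has no occurrences; avoid dividing by zero.
        return 0.0
    occurrences = len(re.findall(pattern, data))
    return occurrences / len(data)


# Hypothetical file content used only to exercise the function.
sample = b"acct 1234; acct 5678; note"
density = pattern_density(sample, rb"acct")
```

Here the sample content is 26 bytes long and contains the pattern twice, so the resulting density is 2/26.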
In one embodiment, scanning the data file for the selected pattern density may also include comparing the density of the selected pattern in the file to a threshold density. In one embodiment, the threshold density may be a predetermined threshold density. In another embodiment, the threshold density may be a selectable threshold density. The threshold density may be selected by a systems administrator or other user. The threshold density may be selected based on the type of data being scanned. The types of data being scanned may include ASCII (American Standard Code for Information Interchange) text, streaming audio, graphics, etc.
In one embodiment, the threshold density may be selected based on the length of the file. For example, the longer the file is, the lower the threshold density. In one embodiment, the threshold density may be a variable threshold density that varies with the size of the file scanned.
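One possible form of a threshold density that falls as the file grows is sketched below. The specific formula and the parameters `base`, `scale` and `floor` are hypothetical and chosen only for illustration; the description does not prescribe any particular relationship beyond a lower threshold for a longer file.

```python
def variable_threshold(file_size: int,
                       base: float = 0.01,
                       scale: int = 10_000,
                       floor: float = 0.001) -> float:
    """Return a threshold density that decreases as file_size increases.

    base  : threshold applied to a zero-length file (assumed value)
    scale : file size at which the threshold has halved (assumed value)
    floor : lowest threshold ever returned (assumed value)
    """
    return max(floor, base / (1 + file_size / scale))


def exceeds_threshold(density: float, file_size: int) -> bool:
    """Compare a measured density with the size-dependent threshold."""
    return density >= variable_threshold(file_size)
```

With these assumed defaults, a 10,000 byte file is held to half the threshold of a very small file, and very large files are held to the floor value.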
In one embodiment, the threshold density may be determined after analyzing the selected pattern density in at least one data file. The threshold density may be based on predetermined formulas or trial and error methods.
In one embodiment, the selected pattern may include a key word string. In one embodiment, the key word string may be a key word string of a predetermined length including at least one predetermined substring. For example, the key word string may be a numeric string having a length equal to the length of a standard credit card number. In one embodiment, the predetermined substring may be a substring associated with a specific credit card issuer. For example, the substring may be the first four digits of a credit card number identifying a specific bank issuing the credit card if the filtering is being done by a bank or other financial service company.
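A key word string of predetermined length that begins with a predetermined substring might be expressed as a regular expression, as in the following sketch. The 16-digit length, the prefix "4000" and the function name `build_card_pattern` are assumptions used only for illustration and do not identify any actual issuer.

```python
import re


def build_card_pattern(prefix: str, total_length: int = 16) -> "re.Pattern[str]":
    """Build a pattern for a digit string of total_length starting with prefix."""
    remaining = total_length - len(prefix)
    if not prefix.isdigit() or remaining < 0:
        raise ValueError("prefix must be digits no longer than total_length")
    # The lookarounds keep the match from being part of a longer digit run.
    return re.compile(rf"(?<!\d){re.escape(prefix)}\d{{{remaining}}}(?!\d)")


# Hypothetical issuer prefix and sample text.
pattern = build_card_pattern("4000")
text = "card 4000123412341234 and 5000123412341234 and 40001234123412345"
matches = pattern.findall(text)
```

In this sample only the first number is matched: the second has a different prefix and the third is seventeen digits long.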
The selected pattern may be any pattern that works in an industry. Thus, the pattern may be determined by finding patterns that tend preferentially to be present in proprietary data in the business area of the data being filtered. For example, if the filtering is being performed by a chemical company, a chemical name or process technical term may be used for the selected pattern. In another embodiment, the selected pattern may include a non-text pattern. For example, the selected pattern may be a symbol or other graphic representation.
In one embodiment, the selected pattern may include a plurality of selected patterns. For example, the selected patterns may include social security number, date of birth, and credit card number. The threshold density of the selected patterns may be an aggregate threshold density. For example, the density of the set of selected patterns may be calculated by determining the individual density of each selected pattern. The individual densities of the selected patterns that are predetermined to be ‘less useful’ may be subjected to a range constraint so that if the individual density of the less useful pattern is below the range, the individual density will be set to the minimum density of the range. All of the individual densities may be multiplied together to produce a product density. The product density may then be compared to the aggregate threshold density. The aggregate threshold density may be determined empirically.
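The product density described above might be computed as in the following sketch. The pattern names, the numeric densities and the range minimum are hypothetical values chosen for illustration; in practice the aggregate threshold density would be determined empirically.

```python
from math import prod
from typing import Mapping


def product_density(densities: Mapping[str, float],
                    range_minimums: Mapping[str, float]) -> float:
    """Multiply individual densities after applying the range constraint.

    range_minimums lists only the 'less useful' patterns; a density below
    its range minimum is raised to that minimum before multiplication.
    """
    adjusted = []
    for name, density in densities.items():
        minimum = range_minimums.get(name)
        if minimum is not None and density < minimum:
            density = minimum
        adjusted.append(density)
    return prod(adjusted)


# Hypothetical measured densities for one file.
measured = {"credit_card": 0.02, "ssn": 0.01, "date_of_birth": 0.0}
# Date of birth is treated here as the 'less useful' pattern.
result = product_density(measured, {"date_of_birth": 0.5})
is_sensitive = result >= 0.00005  # assumed aggregate threshold density
```

Without the range constraint, the absence of the less useful pattern would drive the product to zero; with it, the product here is 0.02 × 0.01 × 0.5 = 0.0001.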
By combining several selected patterns or discriminators that are particularly usable, in the manner described above, a discriminator may be obtained that performs much better than any of the individual selected patterns. A well chosen combination of selected patterns may result in filtering that produces a very low alarm rate (rate of filtering files that do not need filtering). For example, in a financial services company, a combination of addresses and credit card numbers may produce an alarm rate of less than 10%. Protecting the extra 10% of files would be negligible overhead compared to the time and cost investment of examining each data file for sensitive data.
In one embodiment, the selected threshold density may be predetermined based on the type of data for which the scan is performed. In another embodiment, the selected threshold density may be selected after scanning at least one data file to determine what the selected pattern should be. The selected threshold density may be selected by a user or selected by performing an electronically performed algorithm to select the selected pattern.
At step 302, the restricting system 24 may restrict access to each file where the selected pattern density is greater than or equal to the threshold density. In one embodiment, restricting access to the file may include activating a security system for each file having a selected pattern density greater than or equal to the threshold density. In one embodiment, the file having a selected pattern density greater than or equal to the threshold density may be assigned an identifier or label to identify the file as a sensitive file. The identifier or label may alert system 10 to activate restricting system 24 when access of the sensitive file is attempted. In one embodiment, the identifier may be stored in database 26.
In one embodiment, the restricting system may perform an algorithm to restrict access to all files having a sensitive file identifier stored in database 26.
In one embodiment, activating the security system may include scanning the database 26 or all of the files in system 10 to identify data files having an identifier stored in database 26 or having an associated sensitive file identifier.
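The labeling and lookup of sensitive files might be sketched as follows, with an in-memory set standing in for the identifiers stored in database 26. The file paths, densities, threshold and function names are hypothetical and serve only to illustrate the flow.

```python
from typing import Mapping, Set


def label_sensitive_files(densities: Mapping[str, float],
                          threshold: float) -> Set[str]:
    """Return identifiers of files whose density meets or exceeds threshold."""
    return {path for path, density in densities.items() if density >= threshold}


def access_allowed(path: str, sensitive_ids: Set[str],
                   user_is_authorized: bool) -> bool:
    """Unlabeled files are open; labeled files require authorization."""
    if path in sensitive_ids:
        return user_is_authorized
    return True


# Hypothetical scan results keyed by file path.
scanned = {"mail/0001.txt": 0.004, "mail/0002.txt": 0.0, "scans/0003.txt": 0.002}
sensitive = label_sensitive_files(scanned, threshold=0.002)
```

A file whose density equals the threshold is labeled, consistent with the greater than or equal to comparison of step 302.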
In one embodiment, the security system may include restricting access to a sensitive file by password protecting the file. In one embodiment, restricting access to a file may include controlling access to the file based on the time of day when a file is being accessed. In one embodiment, access to a sensitive file may be restricted based on the time of day a specific user is trying to access the file.
In one embodiment, access to the file may be restricted based on the user trying to access the file. In one embodiment, only certain users or a certain subset of users may have access to the file. For example, for a first set of restricted files, only clerical staff may have access to the files in the first set. For a second set of restricted files, only management may have access to the files of the second set.
In one embodiment, the place of access by the user may be restricted. For example, a user may only be able to access the file from the user's own desktop terminal. In one embodiment, the user may only be able to access the file from a certain central terminal.
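Restrictions based on the time of day, the user and the place of access might be combined in a single policy record, as in the sketch below. The class name `AccessPolicy`, the group and terminal names and the permitted hours are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import time
from typing import FrozenSet


@dataclass(frozen=True)
class AccessPolicy:
    allowed_groups: FrozenSet[str]     # e.g. clerical staff or management
    allowed_terminals: FrozenSet[str]  # terminals from which access is permitted
    start: time                        # earliest permitted time of day
    end: time                          # latest permitted time of day

    def permits(self, group: str, terminal: str, at: time) -> bool:
        """Return True only if user group, terminal and time all qualify."""
        return (group in self.allowed_groups
                and terminal in self.allowed_terminals
                and self.start <= at <= self.end)


# Hypothetical policy for a first set of restricted files.
policy = AccessPolicy(frozenset({"clerical"}), frozenset({"desk-17"}),
                      time(9, 0), time(17, 0))
```

Each restriction could equally be applied on its own; combining them here simply shows that the criteria are independent checks.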
In one embodiment, the type of file authorization assigned to the user may be used to restrict access to the file. For example, a user may be authorized to view certain types of files such as financial information, etc. In one embodiment, a person assigned a highly sensitive file authorization may not be able to view a low sensitivity file. In another embodiment, a user having authorization to view low sensitivity files may not have authorization to view files having a higher sensitivity. Thus, there may be a minimum or maximum security authorization, or both, assigned to the file to restrict access.
In one embodiment, the type of privileges authorization assigned to the user may be used to restrict access to a file. In one embodiment, the types of privileges authorization may include privilege to view a file, privilege to copy a file, privilege to back up a file, or privilege to edit a file. In one embodiment, controlling access based on the types of privileges authorized may include a privilege ceiling where a user with a greater amount of privilege than the privilege ceiling may be restricted from accessing the sensitive file. Thus, a user with the privilege of copying or editing files may not have access to a restricted file having a privilege ceiling of viewing the file, whereas a user having a privilege of viewing files would have access to the file.
In one embodiment, controlling access based on the types of privileges authorized may include a privilege floor where a user with a lesser amount of privilege than the privilege floor is restricted from accessing the file. In this embodiment, a user having a privilege of only viewing a file may not have access to a restricted file having a privilege floor of editing the file.
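The privilege ceiling and privilege floor might be checked as in the following sketch. Ranking the privileges in the order view, copy, back up, edit is an assumption made for this illustration, as are the names `Privilege` and `may_access`.

```python
from enum import IntEnum
from typing import Optional


class Privilege(IntEnum):
    # Assumed ordering from least to greatest privilege.
    VIEW = 1
    COPY = 2
    BACKUP = 3
    EDIT = 4


def may_access(user: Privilege,
               floor: Optional[Privilege] = None,
               ceiling: Optional[Privilege] = None) -> bool:
    """Deny access below the floor or above the ceiling, if either is set."""
    if floor is not None and user < floor:
        return False
    if ceiling is not None and user > ceiling:
        return False
    return True
```

Under this ordering, `may_access(Privilege.EDIT, ceiling=Privilege.VIEW)` is denied while `may_access(Privilege.VIEW, ceiling=Privilege.VIEW)` is granted, matching the examples given above.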
In one embodiment, restricting access to a sensitive file may include hiding the file from an unauthorized access. In one embodiment, hiding the file may include redirecting an unauthorized user to another file in any location of the system 10 when the unauthorized user tries to access the sensitive file.
In one embodiment, restricting access to the file may also include activating an alarm to indicate when an unauthorized access is occurring. In one embodiment, the restricting system 24 may execute site specific commands to gather evidence of what actions an unauthorized user is performing when the unauthorized user is trying to access the sensitive file. The restricting system 24 may execute the site specific commands to gather evidence without exposing the file to the unauthorized user.
In one embodiment, restricting access to the file may include granting identifiers to a file opening process for the file at the time the file is opened and then revoking the identifiers when the file is closed. In one embodiment, the restricting system 24 may prevent a covert code from running in association with the sensitive file. In one embodiment, preventing the covert code from running may include attaching a crypt checksum to the file. In one embodiment, preventing the covert code from running may include attaching a privilege mask to the file.
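Granting identifiers when a file is opened and revoking them when it is closed might be modeled with a scoped construct, as sketched below. Representing a process's identifiers as a set of strings, and the names `granted` and `SENSITIVE_READ`, are assumptions made solely for illustration.

```python
from contextlib import contextmanager
from typing import Iterator, Set


@contextmanager
def granted(process_ids: Set[str], identifier: str) -> Iterator[Set[str]]:
    """Grant `identifier` for the duration of the block, then revoke it."""
    process_ids.add(identifier)
    try:
        yield process_ids
    finally:
        # Revocation runs even if the file operation raises an error.
        process_ids.discard(identifier)


# Hypothetical identifiers held by a file opening process.
ids: Set[str] = set()
with granted(ids, "SENSITIVE_READ"):
    held_while_open = "SENSITIVE_READ" in ids
held_after_close = "SENSITIVE_READ" in ids
```

The identifier is held only while the block is active, so it cannot be reused after the file is closed.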
In one embodiment, full network awareness may be implemented so that an extended access control is very powerful. Cross-network checks for access control may be performed. In one embodiment, distributed firewall checks of access rates may be performed for access control and alarms, providing statistical quality control. Checks can be done of the access frequency of users to files. For example, a clerk who normally must access a customer file to answer phone queries might access a few hundred customer records per day. By watching access frequencies, a clerk accessing thousands of customer records per day might be flagged, since he might be doing this access for unauthorized purposes. Checks of network operations may be used to control files as they are created or inherited from a directory protection profile.
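A check of access frequency might be sketched as follows. The log format of (user, record) pairs, the daily limit of 1,000 accesses and the user names are hypothetical and chosen only to mirror the clerk example above.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def flag_heavy_users(access_log: Iterable[Tuple[str, str]],
                     daily_limit: int) -> Set[str]:
    """Return users whose number of accesses in the log exceeds daily_limit.

    access_log holds one (user, record_id) pair per access for a single day.
    """
    counts = Counter(user for user, _record in access_log)
    return {user for user, count in counts.items() if count > daily_limit}


# Hypothetical one-day log: one clerk at a typical rate, one far above it.
log = [("clerk_a", f"cust-{i}") for i in range(300)]
log += [("clerk_b", f"cust-{i}") for i in range(3000)]
flagged = flag_heavy_users(log, daily_limit=1000)
```

In this sample only the second clerk is flagged, since 3,000 accesses exceed the assumed limit while 300 do not.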
A database management system may be used as a lookup agent. The “change dir” command may be overloaded so that some preselected patterns might imply looking for files flagged with some security labels when seen, instead of selecting file names only, which could allow selection of more attributes including security attributes. This may speed up finding of content. In one embodiment, search engine techniques may be used to populate the database management system. In one embodiment, the database management system may also return “not-yet-classified” files in directory lists. The system may allow full soft linking and full conditioned soft links, not just on access fail. These access control methods are published in the program Safety, published on the DECUS VMS SIG tapes in 1996. Soft links are also known to Unix users as “symbolic links”.
As was described in relation to
A system and method for filtering files is described where the files may stay at the location in which they are stored. Thus, there is no need to add large databases or use additional memory in existing databases to store the files found to include sensitive data. A method for filtering files is described where each file does not have to be read by an individual to determine whether the file contains sensitive data. Thus, the speed of file filtering is greatly increased by using a program to scan documents for selected pattern strings.
While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only, and are not to be interpreted as limitations of the invention. Many modifications to the embodiments described above can be made without departing from the spirit and scope of the invention, as is intended to be encompassed by the following claims and their legal equivalents.
The subject matter of this application is related to the subject matter of provisional application U.S. Ser. No. 60/284,940, filed Apr. 20, 2001, assigned or under obligation of assignment to the same entity as this application, from which application priority is claimed, and which application is incorporated by reference.
Number | Date | Country |
---|---|---|
60284940 | Apr 2001 | US |