The present invention relates to computer memory management, and more particularly, is related to monitoring access to computer files of interest.
Hashing is a technique that converts a range of key values into a range of indexes of an array. A hash table is a data structure which stores data in an associative manner. In a hash table, data is stored in an array format, where an index (hash) value is used to look up (access) a corresponding data entry (or “bucket” or “slot” or “bin”) containing data. Accessing data in a hash table is very fast if the index of the desired data entry in the hash table is known. Thus, a hash table is a data structure in which insertion and search operations are very fast irrespective of the size of the data. A hash table uses an array as a storage medium and uses a hash technique to generate an index to locate the entry in the hash table where an element is to be inserted or retrieved.
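By way of a non-limiting illustration, the background concept above may be sketched as a small hash table in which each array slot (bucket) holds a list of entries sharing the same index. The class name `ChainedHashTable`, the default size of 64, and the use of Python's built-in `hash` are assumptions made only for this sketch and are not taken from the embodiments.

```python
class ChainedHashTable:
    """Toy hash table: an array of buckets, each a list of (key, value) pairs."""

    def __init__(self, size=64):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Convert the key into an array index (the hashing step).
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def lookup(self, key):
        # Only the single bucket selected by the index is scanned.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return None
```

The cost of `lookup` depends on the length of one bucket rather than on the total number of stored entries, which is why long bucket lists are the main performance concern.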
Organizations have an interest in protecting sensitive information. In particular, an organization may wish to restrict and/or monitor access to computer files containing sensitive information. Monitoring of computer file access typically involves detecting file access commands, such as copy, move, delete, et cetera. Previously, access to a file of interest could be determined by comparing a file access command against a list of files of interest. While this is relatively simple when comparing access of a single file to a small list of files of interest (for example, less than one hundred files of interest), the process may become resource intensive when comparing a long list of accessed files to a long list of files of interest. Adding a wildcard (a place-holder such as ‘*’, or ‘?’ indicating one or more unspecified characters) to the search pattern can increase the complexity by at least an order of magnitude, thereby straining computer processing resources.
This problem may be analogized to a dictionary search. There are multiple methods for pattern match searches within a dictionary, but none of them efficiently addresses the case of a dictionary search using patterns with and without wildcards while preserving the capability of making changes within the dictionary. The nearest approach is linear matching, which suffers from high complexity. Therefore, there is a need in the industry to address the abovementioned shortcomings.
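For comparison, the linear matching approach mentioned above may be sketched as follows. The function name `linear_match` is hypothetical, and Python's standard `fnmatch.fnmatchcase` is used here only as a stand-in wildcard matcher; the sample paths are invented for illustration.

```python
from fnmatch import fnmatchcase


def linear_match(path, patterns):
    """Return the first pattern that matches path, or None.

    Every stored pattern may be examined, so the cost of one lookup
    grows with the size of the pattern list.
    """
    lowered = path.lower()  # case-insensitive comparison
    for pattern in patterns:
        if fnmatchcase(lowered, pattern.lower()):
            return pattern
    return None


patterns = ["c:/secret/plan.txt", "c:/user/test*.txt"]
print(linear_match("C:/User/test01.TXT", patterns))
```

Because each lookup walks the list, checking a long list of accessed files against a long list of files of interest multiplies the two lengths, which is the resource problem described above.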
Embodiments of the present invention provide a system and method for scalable file filtering using wildcards. Briefly described, the present invention is directed to a system that monitors access to computer files against a dynamically changeable non-heterogeneous collection load balanced across two hash tables. User activity is monitored on a target device to detect a user entered pattern including a wildcard character. A server in communication with the target device receives the user entered pattern, selects one of the two hash tables, and calculates an index for the selected hash table based on the user entered pattern. The index is used to access the selected hash table to receive a stored pattern. The hash tables each have a plurality of entries, and each entry includes a list of one or more patterns that have the same hash index but different pattern values, sorted by length in characters from longest to shortest.
Other systems, methods and features of the present invention will be or become apparent to one having ordinary skill in the art upon examining the following drawings and detailed description. It is intended that all such additional systems, methods, and features be included in this description, be within the scope of the present invention and protected by the accompanying claims.
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
The following definitions are useful for interpreting terms applied to features of the embodiments disclosed herein, and are meant only to define elements within the disclosure.
As used within this disclosure, a “pattern” refers to a sequence of characters organized as a text string. While embodiments described below generally use the pattern to refer to a file pathname, the invention is applicable to any pattern of characters.
As used within this disclosure, a “wildcard character” refers to a placeholder represented by a single character in a pattern, such as an asterisk (*), which can be interpreted as a number of literal characters or an empty pattern. Wildcard characters are often used in file searches so the full pattern need not be typed. In Unix-like and DOS operating systems, the asterisk character (*, also called “star”) matches zero or more characters, while the question mark (?) matches exactly one character. The characters occurring in the pattern before the wildcard character are referred to as the “prefix,” and the characters occurring in the pattern after the wildcard are referred to as the “postfix.”
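The prefix and postfix defined above may be illustrated with a short sketch. The function name `split_pattern` is hypothetical. For a pattern with several wildcards, the sketch assumes the prefix ends at the first wildcard and the postfix begins after the last wildcard, which agrees with the treatment of “*text*.sys” later in this description; returning the whole pattern as the prefix when no wildcard is present is likewise an assumption.

```python
WILDCARDS = "*?"


def split_pattern(pattern):
    """Return (prefix, postfix) around the wildcard(s) of a pattern."""
    positions = [i for i, ch in enumerate(pattern) if ch in WILDCARDS]
    if not positions:
        # Assumption: with no wildcard, the whole pattern is the prefix.
        return pattern, ""
    prefix = pattern[:positions[0]]         # before the first wildcard
    postfix = pattern[positions[-1] + 1:]   # after the last wildcard
    return prefix, postfix


print(split_pattern("te*ster"))     # ('te', 'ster')
print(split_pattern("*text*.sys"))  # ('', '.sys')
```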
As used within this disclosure, a “collection” refers to a non-heterogeneous list of patterns, for example a list of file names stored as entries in a database. For the embodiments described herein, the collection is distributed across two tables. Use of two or more hash tables greatly reduces the chance that a bucket in the hash table will contain a long linked list; ideally, each bucket will contain only one entry.
As used within this disclosure, a “non-heterogeneous list” refers to a collection containing patterns with wildcards mixed with patterns without wildcards.
As used within this disclosure, a “search” refers to a case insensitive character search of a dynamically changeable non-heterogeneous collection.
As used herein, a “hash function” is a process (see
As used within this disclosure a “hash table” is a data structure (see
As used within this disclosure, a “stack” is a data structure containing a variable plurality of ordered entries, where adding a subsequent entry to the stack pushes all previous entries down in the stack, and “popping” the stack refers to removing the topmost entry in the stack such that the next sequential stack entry moves up to become the topmost entry.
As used within this disclosure a “collision” refers to a condition in which a database search by a hash index yields more than one result.
As used within this disclosure “amortized complexity” relates to amortized analysis and formally can be described as the total expense (in terms of consumption of computer resources) per operation, evaluated over a sequence of operations. This guarantees the total expense of the entire sequence, while permitting individual operations to be much more expensive than the amortized cost.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
Embodiments of the method disclosed herein are generally applicable to a dynamic (changeable at run time) collection of pathnames to tracked files, for example, files containing information sensitive to an organization. The collection may contain thousands of entries that change in real time in response to user actions. The tracked file may be entered into the collection either by full path name without wildcards, or a (partial) pathname including wildcards. An operation matching the pattern against the collection is expected to be very fast, not affected (performance wise) by the size of the collection and/or changes performed within the collection.
As described below in greater detail with regard to
For example, on an Apple Macintosh computer, the OS Accessibility API provides extensions so user interface devices (for example, keyboards, mice, trackpads, etc.) may be customized to accommodate users with special needs. An Accessibility OS profile provides access to events such as keystrokes, mouse clicks, and other user activities that may be leveraged to monitor usage of the host device. Similar OS hooks are available for other operating systems. The agent 820 may be implemented as a background process, such as a daemon, which may be installed in the computer 810 by a (human) system administrator 802 in a manner that is invisible and unobtrusive to a user 801 of the host device. Further, unlike a stand-alone application, the background process may not be inadvertently (or intentionally) disabled by the user 801 who does not have system administrator privileges.
The agent 820 may be configured to monitor for specific patterns of user activity, and to log and transmit log entries to the monitor application server 830. The monitor application server 830 may then catalog the user activity in a database stored within the server data store 863, and/or scan the log entries against a table of rules to determine if the host device 810 is being used in a manner of interest/concern. A console user (human) 803 may access the monitor application server 830, for example, using a web browser.
In general, it is desirable that the agent 820 operate in an unobtrusive manner, for example, without noticeably drawing on resources of the host device 810, such as processor power, storage capacity/throughput, and/or communication bandwidth.
The agent 820 is notified by the OS 815 when the user 801 enters a command to find and/or access a file on the computer 810. The agent 820 extracts a pattern (string) from the command that may contain a wildcard. The agent 820 uses the pattern to access a collection, for example a database on the server data store 863 or the agent data store 862 containing filenames of sensitive files that the system administrator 802 wishes to monitor. The agent 820 uses one or more of the scalable file filtering methods using wildcards described in the embodiments below.
When each pattern in a collection is discrete (i.e. contains no wildcards), the performance characteristics of the operations shown in
For the incremental search, in a first branch, if the pattern is not empty and the incremental character taken in block 310 is not a wildcard, as shown by block 315, an incremental hash key is calculated by an incremental hash function for the accumulated characters, as shown by block 320. For example, an incremental hash function that calculates a hash H(n) for n characters is: H(n)=F(H(n−1))+T(n), where F represents a function that transforms the value of H(n−1) and T represents a function that changes the value of the n−1 character. Given the hash value of H(n−1) and the value of the n-th character, H(n) may be calculated. The incremental hash key is placed on (pushed onto) a LIFO (last-in-first-out) stack of incremental indexes, as shown by block 330, and another character is taken from the input pattern, as shown by block 310. For example, based on the first incremental hash, a second incremental pattern is created for the first and second characters of the pattern, and a second incremental index is calculated based upon the stored index value. Additional incremental patterns and incremental indexes are created by continuing in a similar fashion for subsequent characters of the search pattern. When the end of the pattern or a wildcard is reached, the LIFO stack is processed, as shown starting in block 340, until either there is a match or the list of indexes in the LIFO stack is exhausted. Specifically, the hash table is searched for the topmost calculated incremental index in the LIFO stack, as shown by block 340. If the incremental index is found in the hash table, the table entry indexed by the hash value is fetched, as shown by block 350, and the process ends, as shown by block 390. Since the entry may contain a linked list of values with the same index, after the entry is fetched, each element of the linked list is compared to the search pattern until either a match is found or the end of the list is reached. If the end of the list is reached without a match, the process continues.
Returning to block 340, if the incremental index is not found in the hash table and there are more indexes on the stack, as shown by block 360, the top hash value on the stack is popped, as shown by block 380, and control branches back to block 340. If none of the indexes in the stack are found, that is, the index stack has been depleted without finding a match in the hash table, a failed lookup is reported, as shown by block 370, and the process ends, as shown by block 390.
The approach in the flowchart 300 may be used to distinguish between, for example, a stored hash pattern of “test*this” and a real pattern of “test the very string.” Since both patterns share the same hash index calculated for the pattern “test”, on lookup a match occurs when the LIFO stack of stored hashes reaches the hash value for “test” for either of these patterns.
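The incremental search described above may be sketched as follows, as an illustration only and not as the implementation of the embodiments. The choices F(h)=31·h and T equal to the character code, the 32-bit mask, the seed value of 0 for the empty prefix, the function names, and the use of `fnmatch.fnmatchcase` as the final comparison are all assumptions of this sketch. A Python `dict` keyed by hash index stands in for the hash table.

```python
from fnmatch import fnmatchcase

WILDCARDS = "*?"
MASK = 0xFFFFFFFF  # keep hash values within 32 bits (assumed)


def step(prev_hash, ch):
    """One incremental step, H(n) = F(H(n-1)) + T(n), with assumed F and T."""
    return (prev_hash * 31 + ord(ch)) & MASK


def prefix_hashes(text):
    """Hashes of each leading substring up to the first wildcard, shortest first."""
    hashes = [0]  # assumed seed: hash of the empty prefix
    for ch in text.lower():
        if ch in WILDCARDS:
            break
        hashes.append(step(hashes[-1], ch))
    return hashes


def insert(table, pattern):
    """Store a pattern under the hash of its prefix, longest pattern first."""
    bucket = table.setdefault(prefix_hashes(pattern)[-1], [])
    bucket.append(pattern.lower())
    bucket.sort(key=len, reverse=True)


def lookup(table, path):
    """Pop incremental indexes from the stack until a stored pattern matches."""
    stack = prefix_hashes(path)      # last element is the top of the stack
    lowered = path.lower()
    while stack:
        index = stack.pop()          # topmost (longest-prefix) index first
        for stored in table.get(index, []):
            if fnmatchcase(lowered, stored):
                return stored
    return None                      # stack depleted: failed lookup


table = {}
insert(table, "test*this")
print(lookup(table, "Test all of this"))      # test*this
print(lookup(table, "test the very string"))  # None
```

In this sketch the hashing work is proportional to the length of the looked-up text and the number of stored patterns sharing a visited bucket, not to the total size of the table.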
Hash table degradation may occur if a search pattern begins with the wildcard “*” or has very few characters before the “*”. Here, the hash is based on very few characters, resulting in many matching paths. For example, patterns like “te*ster” and “te*sting” have the same prefix of “te”, which will generate an identical hash value. If enough entries with the same prefix are accumulated, performance degradation may result due to performing linear searches within the hash bin matching the prefix (key) of “te”. To avoid hash table degradation due to too many entries associated with the same hash index, the present embodiments employ load balancing, which may use a plurality of hash tables, for example,
Upon an insert operation request, the embodiments evaluate which hash table is most efficient in terms of performance for use with a given pattern. The choice of hash table may be determined by comparing the number of characters from the beginning or end of the search pattern before a wildcard is encountered. The hash table may be selected based on whether the direct or reverse hash would include a larger number of characters. Alternatively, the method may keep a count (list length, or “hit count”) of entries contained in each hash table bucket, and when a new entry is created, count entries for the direct hash entry bin and the reverse hash, and select the table with the smaller hit count. Other criteria may also be used to select the direct or reverse hash table. Use of a direct and reverse hash table significantly reduces the possibility of hash table degradation.
Returning to block 455, if the forward and reverse lists are not of equal length and the forward list is longer, as shown by block 456, the reverse hash table is selected, as shown by block 480; otherwise, the forward linked list is selected, as shown by block 470.
Table 1 contains a sequence of eight user entered patterns that serves as an example to illustrate the load balancing performed by the embodiments, which assess each pattern in order to decide which kind of hash table is most appropriate to use. For simplicity, the example starts with empty forward and reverse hash tables. With empty tables, the hit count (the number of patterns in the linked list referenced by each bucket of the respective hash table) for each pattern may be assumed to be equal (zero), so no load balancing is required for the first pattern. The pattern “*text*.sys” will be used with the reverse hash table because the pattern has a more informative postfix of “.sys” in comparison with an empty prefix.
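The table selection described above may be sketched as follows, under stated assumptions. Buckets are keyed directly by the prefix or postfix text in a Python `dict` rather than by a computed hash index, ties in both hit count and length fall to the forward table, and the function names and sample patterns are hypothetical and are not the patterns of Table 1.

```python
WILDCARDS = "*?"


def split_pattern(pattern):
    """Return (prefix, postfix), or None when the pattern has no wildcard."""
    positions = [i for i, ch in enumerate(pattern) if ch in WILDCARDS]
    if not positions:
        return None
    return pattern[:positions[0]], pattern[positions[-1] + 1:]


def choose_table(pattern, forward, reverse):
    """Pick 'forward' or 'reverse' for a new pattern."""
    parts = split_pattern(pattern.lower())
    if parts is None:
        return "forward"                  # nothing to compare
    prefix, postfix = parts
    forward_hits = len(forward.get(prefix, []))
    reverse_hits = len(reverse.get(postfix, []))
    if forward_hits != reverse_hits:
        # Prefer the table whose bucket currently holds fewer patterns.
        return "reverse" if forward_hits > reverse_hits else "forward"
    # Equal hit counts: prefer the side with more characters.
    return "reverse" if len(postfix) > len(prefix) else "forward"


def insert(pattern, forward, reverse):
    """Insert a pattern into the chosen table and report the choice."""
    lowered = pattern.lower()
    choice = choose_table(lowered, forward, reverse)
    parts = split_pattern(lowered)
    if choice == "forward":
        key = parts[0] if parts else lowered
        forward.setdefault(key, []).append(lowered)
    else:
        reverse.setdefault(parts[1], []).append(lowered)
    return choice


forward, reverse = {}, {}
for p in ["c:/user/test*.txt", "c:/user/test*.exe", "*report*.sys", "*backup*.sys"]:
    print(p, "->", insert(p, forward, reverse))
```

With these hypothetical inputs the four patterns land in the forward, reverse, reverse, and forward tables respectively, so no single bucket accumulates more than one pattern.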
The sequence column of Table 1 indicates the order of insertion of the corresponding pattern into the collection. The pattern column has the pattern to be stored in the collection, here a text string corresponding to a file pathname. The hash table column indicates the hash table chosen to store the pattern (either forward or reverse) based on the method 400 of
In this example, the user adds the patterns in Table 1 to the collection in the sequence order of 1 to 8. An evaluation of pattern (1) places pattern (1) in the forward hash table because the prefix of “c:\user\test” has more characters than the postfix of “.txt”. When a subsequent pattern is processed, the embodiments make a similar evaluation as for the first pattern, but also find that, due to the entry of the first pattern in the forward hash table, there is a hash table entry with a similar prefix of “c:\user\test”. The embodiment compares the hit count of the pattern (2) prefix “c:\user\test” (hit count=1) with the hit count of the postfix “.exe” (hit count=0). Since the postfix hit count is lower than the prefix hit count, pattern (2) is placed into the reverse hash table.
Using the same approach, patterns (3) and (4) are placed into the reverse hash table. Patterns (5) and (6) contain no wildcards, so there is no prefix and postfix to compare; in such cases the patterns are placed in the forward hash table. Pattern (7) has an empty prefix, but the empty prefix has a hit count of 0 in the forward table, while the postfix of “.sys” already appears in the reverse table (because of pattern (3)); therefore pattern (7) is placed into the forward table. Pattern (8) is placed into the reverse table because, although the hit counts are equal for the forward table (based on the prefix) and the reverse table (based on the postfix), the postfix of “.sys” is more detailed than the empty prefix.
The present system for executing the functionality described in detail above may be a computer, an example of which is shown in the schematic diagram of
The processor 702 is a hardware device for executing software, particularly that stored in the memory 706. The processor 702 can be any custom made or commercially available single core or multi-core processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the present system 700, a semiconductor based microprocessor (in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions.
The memory 706 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.). Moreover, the memory 706 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 706 can have a distributed architecture, where various components are situated remotely from one another, but can be accessed by the processor 702.
The software 708 defines functionality performed by the system 700, in accordance with the present invention. The software 708 in the memory 706 may include one or more separate programs, each of which contains an ordered listing of executable instructions for implementing logical functions of the system 700, as described below. The memory 706 may contain an operating system (O/S) 720. The operating system essentially controls the execution of programs within the system 700 and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.
The I/O devices 710 may include input devices, for example but not limited to, a keyboard, mouse, scanner, microphone, etc. Furthermore, the I/O devices 710 may also include output devices, for example but not limited to, a printer, display, etc. Finally, the I/O devices 710 may further include devices that communicate via both inputs and outputs, for instance but not limited to, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, or other device.
When the system 700 is in operation, the processor 702 is configured to execute the software 708 stored within the memory 706, to communicate data to and from the memory 706, and to generally control operations of the system 700 pursuant to the software 708, as explained above. The operating system 720 is read by the processor 702, perhaps buffered within the processor 702, and then executed.
When the system 700 is implemented in software 708, it should be noted that instructions for implementing the system 700 can be stored on any computer-readable medium for use by or in connection with any computer-related device, system, or method. Such a computer-readable medium may, in some embodiments, correspond to either or both the memory 706 or the storage device 704. In the context of this document, a computer-readable medium is an electronic, magnetic, optical, or other physical device or means that can contain or store a computer program for use by or in connection with a computer-related device, system, or method. Instructions for implementing the system can be embodied in any computer-readable medium for use by or in connection with the processor or other such instruction execution system, apparatus, or device. Although the processor 702 has been mentioned by way of example, such instruction execution system, apparatus, or device may, in some embodiments, be any computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In the context of this document, a “computer-readable medium” can be any means that can store, communicate, propagate, or transport the program for use by or in connection with the processor or other such instruction execution system, apparatus, or device.
Such a computer-readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a nonexhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory) (electronic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical). Note that the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
In an alternative embodiment, where the system 700 is implemented in hardware, the system 700 can be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.
The agent 820 may be tailored to communicate with a specific operating system 815 resident on the computer 810. For example, the agent 820 may be specific to Windows OS, MacOS, or Unix/Linux, among others. While
In general, the agent 820 may be configured to act as an intermediary between the operating system 815 and the monitor application server 830, in particular, the agent 820 generally conveys collected data to the monitor application server 830, and the monitor application server operates upon the collected data to determine if targeted activities have been performed by a user 801, here a human operating the computer 810.
As noted previously within this disclosure the user 801 is a human who interacts with the computer 810, the system administrator 802 is a human who controls and configures the operating system 815 of the computer 810, and the console user 803 is a human who controls and interacts with the monitor application 800. Of course, there may be a plurality of users 801, system administrators 802, and/or console users 803, and in some circumstances a system administrator 802 and the console user 803 may be the same individual.
The flow of activity and communication between the components is as follows: The monitor application 800 includes an agent 820 which is installed locally on the computer 810. The agent 820 captures information about user activity, secures it, and sends it to the monitor application server 830. In embodiments where there is more than one monitor application server 830, they may be load balanced with either a software or hardware-based device (not shown). In that case the agents 820 communicate with the load balancer's virtual IP (VIP). The monitor application server 830 analyzes and compresses received data, then stores the data, for example by splitting textual data in an SQL Server database, and graphic images on a file share, where the SQL server database and the file share are stored in the server data store 863. The console user 803 connects to a Web Console Web-based interface to the monitor application 800, for example using a web browser, and can search for, replay, run reports on, and inspect alerts based on the captured user activity. Any component of the data transfer or data storage process can be encrypted, if desired.
In an exemplary application, if the collection includes a list of files each containing data deemed by the system administrator 802 and/or the monitor application server 830 as containing sensitive information, a pattern match in the collection indicates that the user 801 was accessing a file containing sensitive information. In response, the monitor application server 830 may alert the system administrator 802 of the file access by the user 801.
When patterns do not include wildcards, the amortized complexity of the pattern search (insert/remove/match) may be described as O(n), where n is the size of the pattern on which the operation is performed, independent of collection size. Such complexity is the best possible result for the given case, as it is impossible to perform pattern matching without accessing each character of the pattern being matched, so O(n) cannot be improved.
Under the disclosed embodiments, with a collection that contains wildcards, the amortized complexity of matching a given pattern is identical to the complexity of matching against the wildcard pattern that matches it, which is the optimal result for the given case.
Therefore, the described embodiments provide matching with optimal performance impact without preventing changes to the collection and without dependency on the collection size. As a result, the user can perform searches with many rules including full path names or path names with wildcards and maintain low central processing unit (CPU) overhead when searching for a path within the collection. The embodiments provide for very efficient searches for small file path sizes. The embodiments are still efficient with larger file paths, since on typical operating systems the file path is size limited. The described embodiments are suitable for use in both kernel level and user space level for file path lookups.
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the following claims and their equivalents.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/788,175, filed Jan. 4, 2019, entitled “Scalable File Filtering Methods Using Wildcards” which is incorporated by reference herein in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US20/12116 | 1/3/2020 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62788175 | Jan 2019 | US |