System and method for clustering host inventories

Description

TECHNICAL FIELD

This disclosure relates in general to the field of computer network administration and support and, more particularly, to identifying similar software inventories on selected hosts.

BACKGROUND

The field of computer network administration and support has become increasingly important and complicated in today's society. Computer network environments are configured for virtually every organization and usually have multiple interconnected computers (e.g., end user computers, laptops, servers, printing devices, etc.). Typically, each computer has its own set of executable software, each of which can be represented by an executable software inventory. For Information Technology (IT) administrators, congruency among executable software inventories of similar computers (e.g., desktops and laptops) simplifies maintenance and control of the network environment. Differences between executable software inventories, however, can arise in even the most tightly controlled network environments. In addition, each organization may develop its own approach to computer network administration and, consequently, some organizations may have very little congruency and may experience undesirable diversity of executable software on their computers. Particularly in very large organizations, executable software inventories may vary greatly among computers across departmental groups. Varied executable software inventories on computers within organizations present numerous difficulties to IT administrators to maintain, to troubleshoot, to service, and to provide uninterrupted access for business or other necessary activities. Innovative tools are needed to assist IT administrators to successfully support computer network environments with computers having incongruities between executable software inventories.

BRIEF DESCRIPTION OF THE DRAWINGS

To provide a more complete understanding of the present disclosure and features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying figures, wherein like reference numerals represent like parts, in which:

FIG. 1 is a pictorial representation of an exemplary network environment in which various embodiments of a system and method for clustering host inventories may be implemented in accordance with the present disclosure;

FIG. 2 is a simplified block diagram of a computer, which may be utilized in embodiments in accordance with the present disclosure;

FIG. 3 is a simplified flowchart illustrating a series of example steps associated with the system in accordance with one embodiment of the present disclosure;

FIG. 4 illustrates an n×m vector matrix format used in accordance with an embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating a series of example steps for generating values for an n×m vector matrix as shown in FIG. 4 in accordance with one embodiment of the present disclosure;

FIG. 6 illustrates an example selected group of hosts in a network environment to which embodiments of the present disclosure may be applied;

FIG. 7 illustrates a vector matrix created by application of the flow of FIG. 5 to the example selected group of hosts of FIG. 6;

FIG. 8 is an example cluster diagram of the hosts of FIG. 6 that could be created from the system in accordance with embodiments of the present disclosure;

FIG. 9 is a simplified flowchart illustrating a series of example steps associated with the system in accordance with another embodiment of the present disclosure;

FIG. 10 illustrates an n×n similarity matrix format used in accordance with one embodiment of the present disclosure;

FIG. 11 is a flowchart illustrating a series of example steps for generating values for an n×n similarity matrix as shown in FIG. 10 in accordance with one embodiment of the present disclosure; and

FIG. 12 illustrates a similarity matrix created by application of the flow of FIG. 11 to the example selected group of hosts of FIG. 6.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

Overview

A method in one example implementation includes obtaining a plurality of host file inventories corresponding respectively to a plurality of hosts, calculating input data using the plurality of host file inventories, and providing the input data to a clustering procedure to group the plurality of hosts into one or more clusters of hosts. The method further includes each cluster of hosts being grouped using predetermined similarity criteria. More specific embodiments include each of the plurality of host file inventories including a set of one or more file identifiers with each file identifier representing a different executable software file on a corresponding one of the plurality of hosts. In another more specific embodiment, the method includes each of the one or more file identifiers including a token sequence of one or more tokens. In other more specific embodiments, the calculating the input data includes transforming the plurality of host file inventories into a similarity matrix. In another more specific embodiment, the calculating the input data includes transforming the plurality of host file inventories into a matrix of keyword vectors in Euclidean space, where each keyword vector corresponds to one of the plurality of hosts.

Example Embodiments

FIG. 1 is a pictorial representation of a computer network environment 100 in which embodiments of a system for clustering host inventories may be implemented in accordance with the present disclosure. Computer network environment 100 illustrates a network of computers including a plurality of hosts 110a, 110b, and 110c (referred to collectively herein as hosts 110), which may each have, respectively, a set of executable files 112a, 112b, and 112c (referred to collectively herein as sets of executable files 112) and a host inventory feed 114a, 114b, and 114c (referred to collectively herein as host inventory feeds 114). Hosts 110 may be operably connected to a central server 130 through communication link 120. Central server 130 may include an administrative module 140, a host inventory preparation module 150, and a clustering module 160. A management console 170 can also be suitably connected to central server 130 to provide an interface for users such as Information Technology (IT) administrators, network operators, and the like.

In example embodiments, the system for clustering host inventories may be utilized to provide valuable information to users (e.g., IT administrators, network operators, etc.) identifying computers having similar operating systems and installed executable software files. In one example, when the system for clustering host inventories is applied to a computer network environment such as network environment 100 of FIG. 1, software inventories from hosts 110 may be transformed by host inventory preparation module 150 into input data for a clustering algorithm or procedure. Clustering module 160 may apply the clustering algorithm to the prepared input data to create a clustering diagram or other information identifying logical groupings of hosts 110 having similar operating systems and installed sets of executable files 112. The clustering diagram may also identify any outlier hosts 110 having significant differences in operating systems and/or executable files relative to the other hosts 110 in network environment 100. Thus, the IT administrator or other user is provided with valuable information that enables the discovery of trends and exceptions of computers, such as hosts 110, in the particular network environment. As a result, common policies may be applied to computers within logical groupings and remedial action may be taken on any identified outlier computers.

For purposes of illustrating the techniques of the system for clustering host inventories, it is important to understand the activities occurring within a given network. The following foundational information may be viewed as a basis from which the present disclosure may be properly explained. Such information is offered earnestly for purposes of explanation only and, accordingly, should not be construed in any way to limit the broad scope of the present disclosure and its potential applications.

Typical network environments used in organizations and by individuals often include a plurality of computers such as end user desktops, laptops, servers, network appliances, and the like, and each may have an installed set of executable software. In large organizations, network environments may include hundreds or thousands of computers, which may span different buildings, cities, and/or geographical areas around the world. IT administrators may be tasked with the extraordinary responsibility of maintaining these computers in a way that minimizes or eliminates disruption to business activities.

One difficulty IT administrators face includes maintaining multiple computers in a chaotic or heterogeneous network environment. In such an environment, congruency between executable software of the computers may be minimal. For example, executable files may be stored in different memory locations on different computers, different versions of executable files may be installed in different computers, executable files may be stored on some computers but not on others, and the like. Such networks may require additional time and resources to be adequately supported as IT administrators may need to individualize policies, maintenance, upgrades, repairs, and/or any other type of support to suit particular computers having nonstandard executable software and/or operating systems.

Homogeneous network environments, in which executable software of computers is congruent or at least similar, may also benefit from a system and method for clustering host inventories. In homogeneous environments or substantially homogeneous environments, particular computers may occasionally deviate from standard computers within the network environment. For example, malicious software may break through the various firewalls and other network barriers creating one or more deviant computers. In addition, end users of computers may install various executable software files from transportable disks or download such software creating deviant computers. In accordance with the present disclosure, a system for clustering host inventories could readily identify any outliers having nonstandard and possibly malicious executable software.

A system and method for clustering host inventories, as outlined in FIG. 1, could greatly enhance abilities of IT administrators or other users managing computer networks to effectively support both heterogeneous and homogeneous network environments. The system, which may be implemented in a computer such as server 130, enables identification of logical groupings of computers with similar executable file inventories and identification of outliers (e.g., computers with drastically different executable file inventories). In accordance with one example implementation, host file inventories of executable files from hosts 110 are provided for evaluation. The host file inventories are transformed into input data for a clustering algorithm. Once the input data is prepared, the clustering algorithm is applied and one or more diagrams or charts may be created to show logical clusters or groupings of hosts 110 having the same or similar software inventories. In addition, the diagrams or charts may also show any of the hosts 110 that drastically deviate from other hosts 110. Thus, the system provides network or IT administrators with valuable information that may be used to more effectively manage hosts 110 within network environment 100.

Note that in this Specification, references to various features (e.g., elements, structures, modules, components, steps, etc.) included in “one embodiment”, “example embodiment”, “an embodiment”, “another embodiment”, “some embodiments”, “various embodiments”, “one example”, “other embodiments”, and the like are intended to mean that any such features may be included in one or more embodiments of the present disclosure, but may or may not necessarily be included in the same embodiments.

Turning to the infrastructure of FIG. 1, the example network environment 100 may be configured as one or more networks and may be configured in any form including, but not limited to, local area networks (LANs), wide area networks (WANs) such as the Internet, or any combination thereof. In some embodiments, communication link 120 may represent any electronic link supporting a LAN environment such as, for example, cable, Ethernet, wireless (e.g., WiFi), ATM, fiber optics, etc. or any combination thereof. In other embodiments, communication link 120 may represent a remote connection to central server 130 through any appropriate medium (e.g., digital subscriber lines (DSL), telephone lines, T1 lines, T3 lines, wireless, satellite, fiber optics, cable, Ethernet, etc. or any combination thereof) and/or through any additional networks such as a wide area network (e.g., the Internet). In addition, gateways, routers and the like may be used to facilitate electronic communication between hosts 110 and central server 130.

In an example embodiment, hosts 110 may represent end user computers that could be operated by end users. The end user computers may include desktops, laptops, and mobile or handheld computers (e.g., personal digital assistants (PDAs) or mobile phones). Hosts 110 can also represent other computers (e.g., servers, appliances, etc.) having executable software, which could be similarly evaluated and clustered by the system, using executable file inventories derived from sets of executable files 112 on such hosts 110. It should be noted that the network configurations and interconnections shown and described herein are for illustrative purposes only. FIG. 1 is intended as an example and should not be construed to imply architectural limitations in the present disclosure.

Sets of executable files 112 on hosts 110 can include all executable files on respective hosts 110. In this Specification, references to “executable file”, “program file”, “executable software file”, and “executable software” are meant to encompass any software file comprising instructions that can be understood and processed by a computer such as executable files, library modules, object files, other executable modules, script files, interpreter files, and the like. In one embodiment, the system could be configured to allow the IT administrator to select a particular type of executable file to be clustered. For example, an IT Administrator may choose only dynamic-link library (DLL) modules for clustering. Thus, sets of executable files 112 would include only DLL modules on the respective hosts 110. In addition, the IT administrator may also be permitted to select particular hosts to which clustering is applied. For example, all end user computers in a network or within a particular part of the network may be selected. In another example, all servers within a network or within a particular part of the network may be selected.

Central server 130 of network environment 100 represents an exemplary server or other computer linked to hosts 110, which may provide services to hosts 110. The system of clustering host inventories may be implemented in central server 130 using various embodiments of host inventory preparation module 150 and clustering module 160. For example, keyword techniques may be used with vector based clustering in one example embodiment. In this example embodiment, host inventory preparation module 150 creates an (n×m) vector matrix where columns of the matrix may correspond to a determined number (i.e., “m”) of unique keywords, each of which is associated with one or more executable files in a selected number (i.e., “n”) of hosts. The rows of the vector matrix may correspond to the n selected hosts. Clustering module 160 can then apply a clustering algorithm to the vector matrix to create logical groupings of the n selected hosts. In another example embodiment, compression techniques may be used with similarity based clustering. In this example embodiment, host inventory preparation module 150 may create an (n×n) similarity matrix using compression techniques for a selected number (i.e., “n”) of hosts. Clustering module 160 may then apply a clustering algorithm to the similarity matrix to create logical groupings of the n selected hosts. In one embodiment, selected hosts may include all of the hosts 110 in a particular network environment such as network environment 100. In other embodiments, selected hosts may include particular hosts selected by a user or predefined by policy, with such hosts existing in one or more network environments.

Management console 170 linked to central server 130 may provide viewable cluster data for the IT administrators or other authorized users. Administrative module 140 may also be incorporated to allow IT administrators or other authorized users to add the logical groupings from a cluster analysis to an enterprise management system and to apply common policies to selected groupings. In addition, deviant or exceptional groupings or outliers can trigger various remedial actions (e.g., emails, vulnerability scans, etc.). In addition, management console 170 may also provide a user interface for the IT Administrator to select particular hosts and/or particular types of executable files to be included in the clustering procedure, in addition to other user provided configuration data for the system. One exemplary enterprise management system that could be used includes McAfee® electronic Policy Orchestrator (ePO) software manufactured by McAfee, Inc. of Santa Clara, Calif.

Turning to FIG. 2, FIG. 2 is a simplified block diagram of a general or special purpose computer 200, such as hosts 110, central server 130, or other computing devices connected to network environment 100. Computer 200 may include various components such as a processor 220, a main memory 230, a secondary storage 240, a network interface 250, a user interface 260, and a removable memory interface 270. A bus 210, such as a system bus, may provide electronic communication between processor 220 and the other components, memory, and interfaces of computer 200.

Processor 220, which may also be referred to as a central processing unit (CPU), can include any general or special-purpose processor capable of executing machine readable instructions and performing operations on data as instructed by the machine readable instructions. Main memory 230 may be directly accessible to processor 220 for accessing machine instructions and can be in the form of random access memory (RAM) or any type of dynamic storage (e.g., dynamic random access memory (DRAM)). Secondary storage 240 can be any non-volatile memory such as a hard disk, which is capable of storing electronic data including executable software files. Externally stored electronic data may be provided to computer 200 through removable memory interface 270. Removable memory interface 270 may provide connection to any type of external memory such as compact discs (CDs), digital video discs (DVDs), flash drives, external hard drives, or any other external media.

Network interface 250 can be any network interface controller (NIC) that provides a suitable network connection between computer 200 and any networks to which computer 200 connects for sending and receiving electronic data. For example, network interface 250 could be an Ethernet adapter, a token ring adapter, or a wireless adapter. A user interface 260 may be provided to allow a user to interact with the computer 200 via any suitable means, including a graphical user interface display. In addition, any appropriate input mechanism may also be included such as a keyboard, mouse, voice recognition, touch pad, input screen, etc.

Not shown in FIG. 2 is additional hardware that may be suitably coupled to processor 220 and bus 210 in the form of memory management units (MMU), additional symmetric multiprocessing (SMP) elements, read only memory (ROM), peripheral component interconnect (PCI) bus and corresponding bridges, small computer system interface (SCSI)/integrated drive electronics (IDE) elements, etc. Any suitable operating systems will also be configured in computer 200 to appropriately manage the operation of hardware components therein. These elements, shown and/or described with reference to computer 200, are intended for illustrative purposes and are not meant to imply architectural limitations of computers such as hosts 110 and central server 130, utilized in accordance with the present disclosure. As used herein in this Specification, the term ‘computer’ is meant to encompass any personal computers, network appliances, routers, switches, gateways, processors, servers, load balancers, firewalls, or any other suitable device, component, element, or object operable to affect or process electronic information in a network environment.

Turning to FIG. 3, an example system flow 300 of a keyword-based embodiment of the system and method for clustering host inventories is illustrated. Flow may begin at step 310 where host file inventories (I₁ through I_n, with n=number of selected hosts) are generated for each of the selected hosts 110. Each host file inventory can include a set of file identifiers, with each file identifier representing a different executable file in the set of executable files 112a, 112b, or 112c of the corresponding selected host 110a, 110b, or 110c. Each file identifier may include a sequence of one or more tokens associated with the executable file represented by the file identifier. In one embodiment, the user may be provided with the option of choosing the number of tokens and the configuration of each token. For example, a simple file identifier could include a single token having a checksum configuration. A checksum can be a mathematical value or hash sum (e.g., a fixed string of numerical digits) derived by applying an algorithm to an executable file. If the algorithm is applied to another executable file that is identical to the first executable file, then the checksums should match. However, if the other executable file is different in any way (e.g., different type of software program, same software program but different version, same software program that has been altered in some way, etc.), then the checksums are very unlikely to match. Thus, the same executable file stored in different hosts or stored in different locations on disk of the same host should have identical checksums.
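The checksum behavior described above can be sketched in Python. This is a minimal illustration only: the use of SHA-256 is an assumption (the disclosure does not name a particular algorithm), and the helper names file_checksum and _write_temp are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=65536):
    """Return a hex digest of the file at `path`.

    SHA-256 is an assumed choice; any algorithm that yields matching
    values for identical files would serve the same purpose.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large executables need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _write_temp(data):
    """Write `data` to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


# Two byte-identical files stored at different locations, plus an altered copy.
path_a = _write_temp(b"example executable bytes")
path_b = _write_temp(b"example executable bytes")
path_c = _write_temp(b"example executable bytes, altered")

same = file_checksum(path_a) == file_checksum(path_b)
different = file_checksum(path_a) != file_checksum(path_c)

for temp_path in (path_a, path_b, path_c):
    os.remove(temp_path)
```

In this sketch, the two identical files yield matching checksums regardless of where they are stored, while the altered copy yields a different one.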

In other examples, more complex file identifiers could be selected to provide a higher level of distinctiveness of an executable file. In one such example, a file identifier could include a sequence of first and second tokens having a checksum configuration and file path configuration, respectively, where the file path indicates where the executable file is stored on disk in the particular host in which it is installed. Thus, if identical executable files X and Y are installed on host 110a and host 110b, respectively, but are stored in different locations of memory, then the first token of the file identifier generated for executable file X on host 110a could be the same as the first token of the file identifier generated for executable file Y on host 110b. However, the second token of the file identifier generated for executable file X on host 110a could be different than the second token of the file identifier generated for executable file Y on host 110b.

Numerous other file identifiers may be configured by using any number of tokens and configuring the tokens to include any combination of available program file attributes, checksums, and/or file paths. Program file attributes may include, for example, creation date, modification date, security settings, vendor name, and the like. Although file identifiers may be configured with any number of such tokens, an executable file without a particular program file attribute, which is selected as one of the tokens, may have a file identifier with only the tokens available to that executable file. For example, if the file identifier is configured to include a first token (e.g., a checksum) and a second token (e.g., a vendor name), then an executable file without an embedded vendor name would have a file identifier with only a first token corresponding to the file checksum. In contrast, an executable file having an embedded vendor name would have a file identifier with both first and second tokens corresponding to the file checksum and vendor name, respectively.
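One possible way to build such token sequences is sketched below in Python. The record layout (a dictionary with checksum, path, and vendor keys), the sample values, and the function name make_file_identifier are all hypothetical and are used only to illustrate how a missing attribute shortens the file identifier.

```python
def make_file_identifier(file_record, token_config):
    """Build a token sequence for one executable file.

    `token_config` lists the attributes selected as tokens, in order.
    An attribute the file lacks (value None or absent) is skipped, so the
    identifier holds only the tokens available to that file.
    """
    tokens = []
    for attribute in token_config:
        value = file_record.get(attribute)
        if value is not None:
            tokens.append(value)
    return tuple(tokens)


# Assumed configuration: first token is a checksum, second is a vendor name.
token_config = ("checksum", "vendor")

# Hypothetical file records; the values are placeholders, not real data.
with_vendor = {"checksum": "ab12", "path": "/opt/app/tool", "vendor": "XYZ"}
without_vendor = {"checksum": "cd34", "path": "/opt/app/other", "vendor": None}

id_with = make_file_identifier(with_vendor, token_config)        # two tokens
id_without = make_file_identifier(without_vendor, token_config)  # one token
```

The file with an embedded vendor name receives a two-token identifier, while the file without one receives an identifier containing only its checksum.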

The file identifiers and resulting host file inventories I₁ through I_n may be provided by various implementations. In one embodiment, the file identifiers and resulting host file inventories may be generated by host inventory feeds 114 for each host 110 and pushed to central server 130. For embodiments in which a user configures the file identifier by selecting a number of tokens for the token sequence and by selecting individual token configurations, central server 130 may provide the user selected configuration criteria to each host 110. Host inventory feeds 114 may then generate file identifiers with token sequences having the particular user-selected configuration. In another embodiment, checksums for each executable file may be generated on hosts 110 by host inventory feeds 114 and then pushed to central server 130 along with other file attributes and file paths such that host inventory preparation module 150 of central server 130 can generate the file identifiers and resulting host file inventories for each of the selected hosts 110. In one embodiment, enumeration of executable files from the sets of executable files 112 of selected hosts 110 can be achieved by existing security technology such as, for example, Policy Auditor software or Application Control software, both manufactured by McAfee, Inc. of Santa Clara, Calif.

Referring again to FIG. 3, after file identifiers and host file inventories have been determined for all of the selected hosts 110 in step 310, flow then moves to step 320 where a keyword method is used to transform host file inventories I₁ through I_n into a vector matrix, which will be further described herein with reference to FIGS. 4 and 5. Once a vector matrix is created, flow moves to step 330 where a vector-based clustering analysis is performed on the vector matrix. Exemplary types of clustering analysis that may be performed on the vector matrix include agglomerative hierarchical clustering and partitional clustering. The results of such clustering techniques may be stored in a memory element of central server 130 (e.g., secondary storage 240 of computer 200), or may be stored in a database or other memory element external to central server 130.
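As a rough illustration of agglomerative hierarchical clustering applied to row vectors, the following Python sketch merges the closest clusters until the nearest pair is farther apart than a threshold. The single-linkage rule, the Euclidean distance, the threshold value, and the sample vectors are all assumptions chosen for brevity; the disclosure does not prescribe them.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerative_clusters(vectors, threshold):
    """Single-linkage agglomerative clustering (assumed linkage rule).

    Starts with one cluster per host and repeatedly merges the two
    closest clusters, stopping once the closest pair is farther apart
    than `threshold`. Returns lists of host indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        distance, a, b = best
        if distance > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(cluster) for cluster in clusters]


# Hypothetical binary row vectors for four hosts over five keywords.
vectors = [
    [1, 1, 1, 0, 0],  # host 0
    [1, 1, 1, 0, 0],  # host 1, identical inventory to host 0
    [1, 1, 0, 0, 0],  # host 2, one keyword missing
    [0, 0, 0, 1, 1],  # host 3, entirely different inventory
]
groups = agglomerative_clusters(vectors, threshold=1.0)
```

With these sample vectors, hosts 0 through 2 fall into one grouping and host 3 remains alone, which corresponds to the notion of an outlier host.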

After vector-based clustering has been performed on the vector matrix in step 330, flow moves to step 340 where one or more reports can be generated indicating the clustered groupings determined during the clustering analysis and can be provided to authorized users by various methods (e.g., screen displays, electronic files, hard copies, emails, etc.). Exemplary reports may include a textual report and/or a visual representation (e.g., a proximity plot, a dendrogram, heat maps of a permuted keyword matrix, heat maps of a reduced keyword matrix where rows and columns have been merged to illustrate clusters, other cluster plots, etc.) enabling the user to view logical groupings of the selected hosts. For example, after the clustering analysis has been performed, a graphical user interface of management console 170 may display a proximity plot having physical representations of each host, with identifiable logical groupings (e.g., uniquely colored groupings, circled or otherwise enclosed groupings, representations of groupings with connected lines, etc.). Once the similar groupings and outlier hosts have been identified, an IT Administrator or other authorized user can apply common policies to hosts within the logical groupings and remedial action may be taken on any identified outlier hosts. For example, outlier hosts may be remediated to a standard software configuration as defined by the IT Administrators.

Turning to FIG. 4, FIG. 4 illustrates a matrix format 400 used when generating a vector matrix in one embodiment of the system and method of clustering host inventories. In this embodiment, host file inventories are transformed into a vector matrix in Euclidean space using a keyword method. The following variables may be identified when generating a vector matrix:

- n=number of selected hosts
- m=number of keywords
- H_i=host in network, with i=1 to n
  - (e.g., H₁=host 110a, H₂=host 110b, H₃=host 110c)
- K_j=unique keyword, with j=1 to m
- I_i=host file inventory on H_i, with i=1 to n

The number of keywords associated with an executable file equals the number of tokens in the token sequence of the file identifier representing the executable file. Therefore, one or more keywords can be associated with each executable file in sets of executable files 112 of selected hosts 110. In addition, each keyword could be associated with multiple executable files in the same or different hosts. Thus, a keyword sequence km may be defined as a sequence of unique keywords K₁ through K_m, where each keyword is associated with one or more executable files in sets of executable files 112 of all selected hosts 110.

Vector matrix format 400 includes n rows 460 and m columns 470, with n and m defining the dimensions of the resulting n-by-m (i.e., n×m) vector matrix. Each row of vector matrix format 400 is denoted by a unique host H_i (i=1 to n), and each column is denoted by a unique keyword K_j (j=1 to m) of keyword sequence km. Each entry 480 is denoted by a variable with subscripts i and j (i.e., a_i,j) where i and j correspond to the respective row and column where the entry is located. For example, entry a_2,1 is found in row 2, column 1 of vector matrix format 400. Each row of entries represents a row vector 410, 420, and 430 for its corresponding host H₁, H₂, and H_n. For example, a_1,1, a_1,2, through a_1,m define row vector 410 for host H₁. Once each of the entries 480 has been filled with a determined value, row vectors 410, 420, through 430 can be provided as input data to a vector-based clustering algorithm to create a cluster graph or plot showing logical groupings of hosts H₁ through H_n having similar inventories of executable files and any host outliers having dissimilar host inventories.

Turning to FIG. 5, FIG. 5 illustrates a flow 500 using a keyword method to transform host file inventories into a list of vectors in Euclidean space, represented by vector matrix format 400. Flow 500 corresponds to step 320 in flow 300 of FIG. 3 and may be implemented, at least in part, by host inventory preparation module 150 of central server 130 shown in FIG. 1. Flow may begin at step 510 to determine keyword sequence km, which is a sequence of m unique keywords (km=K₁, K₂, . . . K_m) and is a basis for m-dimensional keyword space. In one embodiment, to determine km the file identifiers of all host file inventories of selected hosts 110 can be evaluated to find each unique keyword. In one example, each of the file identifiers of the host file inventories (I₁through I_n) includes a sequence of first and second tokens with the first token having a file checksum configuration and the second token having a file path configuration. If the same version of Microsoft® Word software is installed on each of the selected hosts 110a, 110b, and 110c in the same location on disk, then each file identifier representing the Microsoft® Word software in each of the host file inventories includes a first token containing a checksum of the Microsoft® Word software and a second token containing a file path of the software. In this example, keyword sequence km could include a first keyword (K₁) containing the checksum, which is the same on each of the selected hosts 110a, 110b, and 110c, and a second keyword (K₂) containing the file path, which is the same on each of the selected hosts 110a, 110b, and 110c.
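The determination of the keyword sequence in step 510 might be sketched in Python as follows. The representation of an inventory as a list of token tuples, the first-seen ordering of keywords, and the placeholder checksum and path values are assumptions made for illustration.

```python
def keyword_sequence(inventories):
    """Collect each unique keyword across all host file inventories.

    Each inventory is a list of file identifiers, and each file
    identifier is a tuple of tokens. Keywords are returned in first-seen
    order (an assumed ordering; the text does not specify one).
    """
    seen = {}  # dicts preserve insertion order in Python 3.7+
    for inventory in inventories:
        for file_identifier in inventory:
            for token in file_identifier:
                seen.setdefault(token, None)
    return list(seen)


# Hypothetical inventories for three hosts. The same program is installed
# at the same path on every host; the third host has one extra file.
shared = ("checksum-word", "/apps/word.exe")
inventories = [
    [shared],
    [shared],
    [shared, ("checksum-extra", "/apps/extra.exe")],
]
km = keyword_sequence(inventories)
```

Because the shared program has the same checksum and path on every host, it contributes only two keywords to the sequence, no matter how many hosts contain it.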

Once keyword sequence km has been determined, the algorithm of flow 500 computes a list of position vectors for an n×m vector matrix. Variables ‘i’ and ‘j’ are used to construct the vector matrix having n×m vector matrix format 400, in m-dimensional keyword space, for each host H_i by iterating over j through km and producing appropriate values for the position vectors indicating whether each host file inventory I_i contains each keyword K_j.

The iterative flow to find keywords of keyword sequence km in file identifiers of host inventories is illustrated in steps 520 through 575 of FIG. 5. In step 520, variable i is set to 1 and steps 530, 570, and 575 form an outer loop iterating through the hosts. Variable j is set to 1 in step 530 and steps 540 through 565 form an inner loop iterating over j through km. After variables i and j are set to 1, flow moves to step 540 where keyword K_j is retrieved. Flow moves to decision box 545 where a query is made as to whether keyword K_j is found in host file inventory I_i of host H_i. Thus, host file inventory I_i of host H_i is searched for a file identifier containing keyword K_j. If keyword K_j is not found in host file inventory I_i, then flow moves to step 550 where row i, column j (i.e., a_i,j) in the vector matrix may be updated with an appropriate value indicating keyword K_j was not found in host file inventory I_i. However, if in step 545, keyword K_j is found in host file inventory I_i, then flow moves to step 555 where row i, column j (i.e., a_i,j) in the vector matrix may be updated with an appropriate value indicating keyword K_j was found in host file inventory I_i.

The values of entries a_i,j in the vector matrix, which indicate whether keyword K_j is found in a host file inventory I_i, may vary depending upon the particular implementation of the system. In one embodiment, an entry a_i,j is assigned a ‘1’ value in step 555, indicating keyword K_j was found in host file inventory I_i, or a ‘0’ value in step 550, indicating keyword K_j was not found in host file inventory I_i. Thus, in this embodiment, the vector matrix contains only ‘0’ and/or ‘1’ values. In another embodiment, entry a_i,j is assigned a value in step 555 or 550 corresponding to a frequency of occurrence of keyword K_j in host file inventory I_i. For example, assume file identifiers in a host file inventory I₁ include a first token configured as a checksum and a second token configured as a vendor name, with three executable files on host H₁ having the same embedded vendor name, XYZ, resulting in keyword K₂ of keyword sequence km being assigned the embedded vendor name XYZ. In this embodiment, when host file inventory I₁ of host H₁ is searched for keyword K₂, entry a_1,2 could be updated with a value of 3 because of the three occurrences of vendor name XYZ in file identifiers of host file inventory I₁. Thus, in this embodiment, the vector matrix may contain ‘0’ values and/or positive integer values.
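
The two value conventions described above could be sketched, by way of example only, as a single Python function. The name `entry_value`, the `mode` parameter, and the sample checksums are hypothetical; the fourth file identifier with vendor ABC is added purely so that the inventory contains a non-matching entry.

```python
def entry_value(inventory, keyword, mode="binary"):
    """Value of a_i,j for one host file inventory and one keyword."""
    # Count the file identifiers that carry the keyword as one of their tokens.
    occurrences = sum(1 for file_identifier in inventory if keyword in file_identifier)
    if mode == "binary":
        return 1 if occurrences > 0 else 0
    if mode == "frequency":
        return occurrences
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical inventory I_1: (checksum, vendor name) pairs, three from vendor XYZ.
inventory_1 = [
    ("sum-a", "XYZ"),
    ("sum-b", "XYZ"),
    ("sum-c", "XYZ"),
    ("sum-d", "ABC"),
]

print(entry_value(inventory_1, "XYZ", mode="binary"))     # 1
print(entry_value(inventory_1, "XYZ", mode="frequency"))  # 3
```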

After row i, column j is filled with an appropriate value in step 555 or 550, flow moves to decision box 560 where a query is made as to whether j<m. If j<m, then host file inventory I_i of host H_i has not been checked for all of the keywords in km. Therefore, flow moves to step 565 where j is set to j+1, and flow loops back to step 540 to get the next keyword K_j (with j=j+1) in km and search for K_j in host file inventory I_i. If, however, in decision box 560 it is determined that j is not less than m (i.e., j≧m), then host file inventory I_i has been searched for all of the keywords K₁ through K_m in keyword sequence km, so flow moves to decision box 570, which is part of the outer loop of flow 500. A query is made in decision box 570 to determine whether i<n, and if i<n, then not all of the hosts have been evaluated to generate corresponding keyword vectors. Therefore, flow moves to step 575 where i is set to i+1, and flow loops back to step 530. In step 530, j is set to 1 again, so that a vector for the next host H_i (with i=i+1) can be generated by inner loop steps 540 through 565. With reference again to decision box 570, if i is not less than n (i.e., i≧n), then all of the hosts H₁ through H_n have been evaluated such that all of the vectors have been created in the n×m vector matrix and, therefore, the flow ends.

The embodiment of the flow 500 shown in FIG. 5 creates a vector in keyword space successively for each host H₁ through H_n. Other embodiments, however, could be configured to switch the inner and outer loops in flow 500 such that for each keyword K_j, a column of vector matrix entries is produced by iterating over i through hosts H_i and producing an appropriate value when keyword K_j is found in host file inventory I_i of host H_i and an appropriate value when K_j is not found in host file inventory I_i of host H_i. This processing could be repeated until all columns are filled, thereby generating the list of vectors in rows 1 through n.
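
A minimal sketch of both loop orders, assuming the ‘1’/‘0’ value convention, is given below in Python. The function names and the two small inventories are hypothetical, and Python's zero-based indices stand in for the one-based i and j of flow 500.

```python
def contains(inventory, keyword):
    """True if any file identifier in the inventory carries the keyword."""
    return any(keyword in file_identifier for file_identifier in inventory)


def build_by_rows(inventories, km):
    """Host-major order: fill one complete row (host vector) at a time."""
    n, m = len(inventories), len(km)
    matrix = [[0] * m for _ in range(n)]
    for i in range(n):        # outer loop over hosts
        for j in range(m):    # inner loop over keywords
            matrix[i][j] = 1 if contains(inventories[i], km[j]) else 0
    return matrix


def build_by_columns(inventories, km):
    """Keyword-major order: fill one complete column at a time."""
    n, m = len(inventories), len(km)
    matrix = [[0] * m for _ in range(n)]
    for j in range(m):        # outer loop over keywords
        for i in range(n):    # inner loop over hosts
            matrix[i][j] = 1 if contains(inventories[i], km[j]) else 0
    return matrix


# Hypothetical single-token file identifiers for two hosts.
inventories = [[("k1",), ("k2",)], [("k2",), ("k3",)]]
km = ["k1", "k2", "k3"]

by_rows = build_by_rows(inventories, km)
by_columns = build_by_columns(inventories, km)
print(by_rows)                # [[1, 1, 0], [0, 1, 1]]
print(by_rows == by_columns)  # True
```

Either order yields the same matrix; the choice affects only the order in which entries are produced.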

The clustering analysis performed on the resulting vector matrix may include commonly available clustering techniques such as agglomerative hierarchical clustering or partitional clustering. In agglomerative hierarchical clustering, each element begins as a separate cluster and elements are merged into successively larger clusters, which may be represented in a tree structure called a dendrogram. A root of the tree represents a single cluster of all of the elements and the leaves of the tree represent separated clusters of the elements. Generally, merging schemes in agglomerative hierarchical clustering used to achieve logical groupings may include schemes well-known in the art such as single-link (i.e., the distance between clusters is equal to the shortest distance from any member of one cluster to any member of another cluster), complete-link (i.e., the distance between clusters is equal to the greatest distance from any member of one cluster to any member of another cluster), group-average (i.e., the distance between clusters is equal to the average distance from any member of one cluster to any member of another cluster), and centroid (i.e., the distance between clusters is equal to the distance from the center of any one cluster to the center of another cluster).
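
For illustration only, the four merging schemes named above could be expressed in Python as distance functions over two clusters of points. The function names and the two sample clusters are assumptions for this sketch; Euclidean distance is assumed as the underlying point-to-point distance.

```python
import math
from itertools import product


def single_link(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Greatest distance between any member of a and any member of b."""
    return max(math.dist(p, q) for p, q in product(a, b))


def group_average(a, b):
    """Average distance over all cross-cluster pairs of members."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the members of a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_link(a, b):
    """Distance between the centers of the two clusters."""
    return math.dist(centroid(a), centroid(b))


# Hypothetical clusters of two-dimensional points.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (3, 2)]

for scheme in (single_link, complete_link, group_average, centroid_link):
    print(scheme.__name__, scheme(cluster_a, cluster_b))
```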

Known techniques may be implemented in which predetermined similarity criteria set the point at which clustering is halted (e.g., cut point determination). Cut point determination may be made, for example, at a specified level of similarity or where the gap between consecutive merge similarities is the greatest, which is known in the art. As an example, a tree structure representing clusters could be cut at a predetermined height, resulting in more or fewer clusters depending on the selected height at which the cut is made. Cut point determinations may be made based on a particular network environment or particular hosts being clustered. In one example embodiment, an IT administrator or other authorized user could define the cut point determination used by the clustering procedure by determining a desired threshold for similarity based on the particular network environment.

In other embodiments, partitional clustering may be used. Partitional clustering typically involves an algorithm that determines all clusters at one time. In partitional clustering, predetermined similarity criteria may provide, for example, a selected number of clusters to be generated or a maximum diameter for the clusters. One exemplary software package that implements these various clustering techniques is CLUTO Software for Clustering High-Dimensional Datasets developed by George Karypis, Professor at the Department of Computer Science & Engineering, University of Minnesota, Minneapolis and Saint Paul, Minn., which may be found on the World Wide Web at http://glaros.dtc.umn.edu/gkhome/view/cluto.

Turning to FIGS. 6, 7, and 8, an example selected plurality of hosts 600 with executable files, a vector matrix 700 generated using the executable files of selected hosts 600, and an example resulting cluster plot 800 are illustrated, respectively. In FIG. 6, host 1 (H₁) is shown with a set of executable files 601 including executable files 610, 620, 630, and 640. Host 2 (H₂) is shown with a set of executable files 602 including executable files 610, 650, 660, and 670. Host 3 (H₃) is shown with a set of executable files 603 including executable files 610, 620, 630, and 680. Host 4 (H₄) is shown with a set of executable files 604 including executable files 610, 650, and 660. Host 5 (H₅) is shown with a set of executable files 605 including executable files 640, 670, and 680.

FIG. 7 illustrates the resulting vector matrix 700 after the keyword method of flow 500 has been applied to the sets of executable files 601 through 605 of the selected plurality of hosts 600 of FIG. 6. Vector matrix 700 shows hosts 1 through 5 (H₁ through H₅) corresponding to rows 760 containing keyword vectors 710, 720, 730, 740, and 750, respectively. Columns 770 of vector matrix 700 are designated by keywords K₁ through K₈. In this example vector matrix 700, entries 780 include a ‘1’ indicating that keyword K_j is contained in a file identifier of host file inventory I_i, or a ‘0’ indicating that keyword K_j is not contained in a file identifier of host file inventory I_i.

In the example scenario of applying the keyword method of flow 500 to the sets of executable files 601 through 605 of selected plurality of hosts 600 in order to create vector matrix 700, the following variables may be identified:

- n=5 (host computers)
- H₁ through H₅=hosts in network (e.g., H₁=host 1, H₂=host 2, etc.)
- I₁ through I₅=host file inventories representing sets of executable files 601 through 605, respectively
- m=8 (keywords)
- K₁ through K₈=unique keywords
  
Each of the host file inventories I₁ through I₅ includes a set of file identifiers representing one of the sets of executable files 601, 602, 603, 604, and 605, respectively. Each executable file in a set of executable files is represented by a separate file identifier in the particular host file inventory. In this exemplary scenario, file identifiers each include a first token having a checksum configuration. Unique keywords are determined among all sets of executable files of selected hosts H₁ through H₅. Thus, 8 unique keywords may be determined for selected hosts 600:
- K₁=checksum for executable file 610
- K₂=checksum for executable file 620
- K₃=checksum for executable file 630
- K₄=checksum for executable file 640
- K₅=checksum for executable file 650
- K₆=checksum for executable file 660
- K₇=checksum for executable file 670
- K₈=checksum for executable file 680
  
A keyword sequence km can then be created in step 510 with the 8 unique keywords:
- km=K₁, K₂, K₃, K₄, K₅, K₆, K₇, K₈

Thus, in this example scenario, the following host file inventories I₁ through I₅ could include file identifiers having first tokens equivalent to the following keywords:
- I₁→K₁, K₂, K₃, K₄
- I₂→K₁, K₅, K₆, K₇
- I₃→K₁, K₂, K₃, K₈
- I₄→K₁, K₅, K₆
- I₅→K₄, K₇, K₈

Once keyword sequence km is determined, flow moves to step 520 where variable i is set to 1 and then the iterative flow begins to create n×m (5×8) vector matrix 700 shown in FIG. 7. In step 530, variable j is set to 1 and keyword K_j (K₁) is retrieved from km in step 540. Flow moves to decision box 545 where host file inventory I_i (I₁) of host H_i (H₁) is searched for keyword K_j (K₁). In this example, keyword K₁ is found in host file inventory I₁ of host H₁, so flow moves to step 555 where a ‘1’ entry is added to row i, column j (row 1, column 1) of vector matrix 700. After vector matrix 700 has been updated, flow moves to decision box 560 where a query is made as to whether j<m. Since 1 is less than 8, the flow moves to step 565 where j is set to 2 (i.e., j=j+1). Flow then loops back to step 540 to search for the next keyword K_j (K₂) in host file inventory I_i (I₁) of host H_i (H₁). In this case, keyword K₂ is found in host file inventory I₁, so a ‘1’ entry is added to row i, column j (row 1, column 2) of vector matrix 700. The variable j is still less than 8 (i.e., 2<8) as determined in decision box 560, so flow moves to step 565 and j is set to 3 (i.e., j=j+1). This iterative processing continues for each value of j until j=8, thereby filling in each entry 780 of keyword vector 710 for host H_i (H₁).

After the last entry 780 of keyword vector 710 has been added to vector matrix 700, flow moves to decision box 560 where a query is made as to whether j<m (i.e., Is 8<8?). Because j is not less than 8, flow moves to decision box 570 where a query is made as to whether i<n (i.e., Is 1<5?). Because 1 is less than 5, flow moves to step 575 where i is set to 2 (i.e., i=i+1) and flow loops back to step 530 where j is set to 1. The inner iterative loop then begins in step 540 to search for all keywords in host file inventory I_i (I₂) of host H_i (H₂) beginning with keyword K_j (K₁). Thus, in the embodiment used in this example scenario, rows 760 are successively filled with a ‘1’ or a ‘0’ value for each entry a_i,j until each vector row 710 through 750 has been completed. As previously discussed herein, however, another embodiment provides that each entry a_i,j in rows 760 could be filled with a value corresponding to the frequency of occurrence of keyword K_j found in host file inventory I_i.
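
By way of a worked example, the rows that flow 500 would produce for the five hosts of FIG. 6 under the ‘1’/‘0’ convention can be derived in a few lines of Python. The string labels `K1` through `K8` and `H1` through `H5` are stand-ins for the checksums and hosts; the rows follow from the inventories listed above and are offered as a sketch rather than as a reproduction of FIG. 7.

```python
km = ["K1", "K2", "K3", "K4", "K5", "K6", "K7", "K8"]

# Keywords present in each host file inventory, per executable files 610-680.
inventories = {
    "H1": {"K1", "K2", "K3", "K4"},
    "H2": {"K1", "K5", "K6", "K7"},
    "H3": {"K1", "K2", "K3", "K8"},
    "H4": {"K1", "K5", "K6"},
    "H5": {"K4", "K7", "K8"},
}

# One keyword vector per host: 1 if K_j is in the inventory, otherwise 0.
matrix = {
    host: [1 if keyword in inventory else 0 for keyword in km]
    for host, inventory in inventories.items()
}

for host, row in matrix.items():
    print(host, row)
# H1 [1, 1, 1, 1, 0, 0, 0, 0]
# H2 [1, 0, 0, 0, 1, 1, 1, 0]
# H3 [1, 1, 1, 0, 0, 0, 0, 1]
# H4 [1, 0, 0, 0, 1, 1, 0, 0]
# H5 [0, 0, 0, 1, 0, 0, 1, 1]
```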

Vector matrix 700 can be provided as input data to a vector-based clustering procedure, as previously described herein. Information generated from the clustering procedure could be provided in numerous ways such as, for example, reports, screen displays, files, emails, etc. In one example, the information could be provided in a proximity plot such as example proximity plot 800 illustrated in FIG. 8. Proximity plot 800 is an example graph that could be created by a vector-based clustering procedure applied to vector matrix 700 of FIG. 7. If agglomerative hierarchical clustering is used, clusters 810, 820, and 830 may be determined based on a cut point determination. If partitional clustering is used, clusters 810, 820, and 830 may be generated based on a predetermined number of clusters. Proximity plot 800 shows two clusters and one outlier. Cluster 810 represents hosts H₁ and H₃, and cluster 820 represents hosts H₂ and H₄. Hosts H₁ and H₃ may be clustered together because they have three common executable files 610, 620, and 630. Hosts H₂ and H₄ may be clustered together because they also have three common executable files 610, 650, and 660. Outlier 830 represents host H₅, which may be indicated as an outlier because it has, in this example, none or only one common executable file with each of the other hosts. Although the clustering information is displayed on proximity plot 800 shown in FIG. 8, other textual reports and/or visual representations, as previously described herein with reference to FIG. 3, may be used to show clusters 810 and 820 and outlier 830.
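
One way such groupings could arise is sketched below in Python using single-link agglomerative clustering with a distance cut point. The function name, the choice of Euclidean distance, the single-link scheme, and the cut distance of 2.0 are all assumptions made for this sketch; the keyword vectors are those derived from the inventories of hosts H₁ through H₅ described above.

```python
import math

# Binary keyword vectors for the five example hosts (keywords K1..K8).
vectors = {
    "H1": [1, 1, 1, 1, 0, 0, 0, 0],
    "H2": [1, 0, 0, 0, 1, 1, 1, 0],
    "H3": [1, 1, 1, 0, 0, 0, 0, 1],
    "H4": [1, 0, 0, 0, 1, 1, 0, 0],
    "H5": [0, 0, 0, 1, 0, 0, 1, 1],
}


def single_link_clusters(vectors, cut_distance):
    """Merge the closest clusters until the closest pair exceeds cut_distance."""
    clusters = [[name] for name in vectors]  # each host starts as its own cluster

    def link(a, b):
        # Single-link: shortest distance between any two members.
        return min(math.dist(vectors[p], vectors[q]) for p in a for q in b)

    while len(clusters) > 1:
        distance, x, y = min(
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if distance > cut_distance:  # cut point reached: stop merging
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


clusters = single_link_clusters(vectors, cut_distance=2.0)
print(clusters)  # [['H1', 'H3'], ['H2', 'H4'], ['H5']]
```

With this cut distance, H₂ and H₄ merge first (their vectors differ in one keyword), H₁ and H₃ merge next (two keywords), and H₅ remains alone because every other host differs from it in at least five keywords.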

Turning to FIG. 9, FIG. 9 illustrates an example system flow 900 of a compression-based embodiment of a system and method for clustering host inventories. Flow may begin at step 910 where file identifiers and host file inventories (I₁ through I_n, with n=number of selected hosts) may be generated for each of the selected hosts 110, as previously described herein with reference to FIG. 3.

After file identifiers and host file inventories have been determined for each of the selected hosts 110, flow then moves to step 920 where a compression technique may be used to transform host file inventories into a similarity matrix, which will be further described herein with reference to FIGS. 10 and 11. Once a similarity matrix is created, flow moves to step 930 where a similarity-based clustering analysis can be performed on the similarity matrix. The similarity-based clustering analysis performed on the similarity matrix may include, for example, agglomerative hierarchical clustering or partitional clustering. The results of such clustering techniques may be stored in a memory element of central server 130 (e.g., secondary storage 240 of computer 200), or may be stored in a database or other memory element external to central server 130.

After similarity-based clustering has been performed on the similarity matrix in step 930, flow moves to step 940 where one or more reports can be generated indicating the clustered groupings determined during the clustering analysis, as previously described herein with reference to FIG. 3. Such reports for similarity-based clustering may include a textual report and/or a visual representation (e.g., a proximity plot, a dendrogram, heat maps of a similarity matrix where rows and columns have been merged to illustrate clusters, other cluster plots, etc.) enabling the user to view logical groupings of the selected hosts. Once the similar groupings and outlier hosts have been identified, an IT Administrator or other authorized user can apply common policies to computers within the logical groupings and remedial action may be taken on any identified outlier computers. For example, outlier computers may be remediated to a standard software configuration as defined by the IT Administrators.

Turning to FIG. 10, FIG. 10 illustrates a matrix format 1000 used when generating a similarity matrix in one embodiment of the system and method of clustering host inventories. The similarity matrix is generated by applying a compression method to a plurality of host file inventories, each of which includes a set of file identifiers. As an example, each of the sets of file identifiers may represent one of the sets of executable files 112a, 112b, or 112c on the corresponding selected host 110a, 110b, or 110c. In addition, the following variables may be identified when generating a similarity matrix:

- n=number of selected hosts
- H_i=host in network, with i=1 to n (e.g., H₁=host 110a, H₂=host 110b, H₃=host 110c)
- H_j=host in network, with j=1 to n (e.g., H₁=host 110a, H₂=host 110b, H₃=host 110c)
- I_i=host file inventory of H_i
- I_j=host file inventory of H_j

Similarity matrix format 1000 includes n rows 1060 and n columns 1070, with ‘n’ defining the number of dimensions of the resulting n-by-n (i.e., n×n) similarity matrix. Each row of similarity matrix format 1000 is denoted by host H_i (i=1 to n), and each column is denoted by host H_j (j=1 to n). Each entry 1080 is denoted by a variable with subscripts i and j (i.e., a_i,j) where i and j correspond to the respective row and column where the entry is located. For example, entry a_2,1 is found in row 2, column 1 of similarity matrix format 1000.

When a similarity matrix is created in accordance with one embodiment of this disclosure, each entry a_i,j has a numerical value representing the similarity distance between host H_i and host H_j, with 1 representing the highest degree of similarity. In one embodiment, the similarity distances represented by entries a_1,1 through a_n,n can include any numerical value from 0 to 1, inclusively (i.e., 0≦a_i,j≦1). In this embodiment, the closer a_i,j is to 1, the greater the similarity is between host file inventories I_i and I_j of hosts H_i and H_j, and the closer a_i,j is to zero, the greater the difference is between host file inventories I_i and I_j of hosts H_i and H_j. Thus, a value of 1 in a_i,j may indicate hosts H_i and H_j have identical host file inventories and, therefore, identical sets of executable files, whereas a value of zero in a_i,j may indicate hosts H_i and H_j have no common file identifiers in their respective host file inventories and, therefore, no common executable files in their respective sets of executable files. Once each of the entries 1080 has been filled with a calculated value, the resulting similarity matrix can be provided as input data into a similarity-based clustering algorithm to create a cluster graph or plot showing logical groupings of hosts H₁ through H_n having similar sets of executable files and outlier hosts having dissimilar sets of executable files. The clustering analysis performed on the resulting similarity matrix may include commonly available clustering techniques such as agglomerative hierarchical clustering or partitional clustering, as previously described herein with reference to clustering analysis of a vector matrix.

Turning to FIG. 11, FIG. 11 illustrates a flow 1100 using a compression method to transform host file inventories I₁ through I_n of hosts H₁ through H_n, respectively, into a similarity matrix. Flow 1100 corresponds to step 920 of FIG. 9 and may be implemented, at least in part, by host inventory preparation module 150 of central server 130, shown in FIG. 1. When flow 1100 begins, i is set to 1 in step 1110 and j is set to 1 in step 1115. Variables ‘i’ and ‘j’ are used to construct the n×n similarity matrix for the selected plurality of hosts being clustered. Steps 1115, 1175, and 1180 form an outer loop iterating through the rows of hosts and steps 1120 through 1170 form an inner loop iterating through the columns of hosts.

In step 1120, a list of file identifiers (e.g., checksums, checksums combined with a file path, checksums combined with one or more file attributes, etc.) representing a set of executable files on host H_i is extracted from host file inventory I_i and put in a file F_i. In step 1125, a list of file identifiers representing a set of executable files on host H_j is extracted from host file inventory I_j and put in a file F_j. In step 1130, files F_i and F_j are concatenated and put in file F_ij. It will be apparent that the use of files F_i, F_j, and F_ij to store file identifiers is an example implementation of the system, and that memory buffers or any other suitable representation allowing concatenation, compression, and length determination of data may also be used.

After files F_i, F_j, and F_ij are prepared, compression is applied to each of the files. A compression utility such as, for example, gzip, bzip, bzip2, zlib, or zip compression utilities may be used to compress files F_i, F_j, and F_ij. Also, in some embodiments, the list of file identifiers in files F_i, F_j, and F_ij may be sorted to enable more accurate compression by the compression utility. In step 1140, file F_i is compressed and the length of the result is represented as C_i. In step 1145, file F_j is compressed and the length of the result is represented as C_j. In step 1150, file F_ij is compressed and the length of the result is represented as C_ij. After compressing each of the files, normalized compression distance (NCD_i,j) between H_i and H_j is computed in step 1155.

Normalized compression distance (NCD) is used for clustering and is a computable approximation of normalized information distance (NID), which is defined in terms of Kolmogorov complexity. NCD is discussed in detail in Rudi Cilibrasi's 2007 thesis entitled “Statistical Inference Through Data Compression,” which may be found at http://www.illc.uva.nl/Publications/Dissertation/DS-2007-01.text.pdf, and can be used to compute the distance between similar data. NCD may be computed using the following equation:

NCD_i,j=[C_ij−min{C_i,C_j}]/max{C_i,C_j}
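
A minimal sketch of this computation, assuming the zlib module of the Python standard library as the compression utility, might read as follows. The helper `identifier_list`, which hashes made-up file names to obtain checksum-like strings, is hypothetical and merely stands in for real file identifiers extracted from a host file inventory.

```python
import hashlib
import zlib


def identifier_list(file_names):
    """Hypothetical stand-in for F_i: one SHA-256 hex digest per line, sorted."""
    digests = sorted(hashlib.sha256(name.encode()).hexdigest() for name in file_names)
    return ("\n".join(digests) + "\n").encode()


def compressed_length(data):
    """Length in bytes of the zlib-compressed data (C_i, C_j, or C_ij)."""
    return len(zlib.compress(data))


def ncd(f_i, f_j):
    """NCD_i,j = [C_ij - min{C_i, C_j}] / max{C_i, C_j}."""
    c_i = compressed_length(f_i)
    c_j = compressed_length(f_j)
    c_ij = compressed_length(f_i + f_j)  # F_ij is F_i concatenated with F_j
    return (c_ij - min(c_i, c_j)) / max(c_i, c_j)


f_1 = identifier_list(["610", "620", "630", "640"])
f_4 = identifier_list(["610", "650", "660"])
f_5 = identifier_list(["640", "670", "680"])

print(ncd(f_1, f_1))  # near 0: an inventory compared with itself
print(ncd(f_4, f_5))  # near 1: no identifiers in common
```

With a real compressor the distance between identical inputs is typically small but not exactly zero, because the repeated half of the concatenation still costs a few bytes to encode.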

Once NCD_i,j has been computed, flow moves to step 1160 where a_i,j is computed by the following equation: a_i,j=1−NCD_i,j. The value a_i,j is then used to construct the similarity matrix by adding a_i,j to row i, column j. After the similarity matrix has been updated in step 1160, flow moves to decision box 1165 and a query is made as to whether j<n. If j<n, then additional entries in row i of the similarity matrix need to be computed (i.e., similarity distance has not been computed between host H_i and all of the hosts H_j (j=1 to n)). In this case, flow moves to step 1170 where j is set to j+1. Flow then loops back to step 1120 where the inner loop of flow 1100 repeats and the similarity distance is computed between host H_i and the next host H_j with j=j+1.

With reference again to decision box 1165, if j is not less than n (i.e., j≧n), then all of the entries in row i have been computed and flow moves to decision box 1175 where a query is made as to whether i<n. If i<n, then not all rows of similarity matrix 1000 have been computed, and therefore, flow moves to step 1180 where i is set to i+1. Flow then loops back to step 1115 where j is set to 1 so that entries a_i,j for the next row i (H_i, with i=i+1) can be generated by inner loop steps 1120 through 1170. With reference again to decision box 1175, if i is not less than n (i.e., i≧n), then entries for all of the rows 1 through n have been computed and, therefore, the similarity matrix has been completed and flow ends.

It will be apparent that flow 1100 could be optimized in numerous ways. One optimization technique includes caching the lengths of compressed files C_i and C_j, which are used multiple times during flow 1100 to calculate entries 1080 in the similarity matrix. In addition, the extracted lists of file identifiers F_i and F_j may also be cached for use during flow 1100. It will also be noted that the matrix should be symmetric along the diagonal a_1,1 through a_n,n. This symmetry could be used in the implementation of the system to compute only one-half of the matrix and then reflect the results over the diagonal.
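
The caching and symmetry optimizations could be combined, as one possible sketch, in the following Python function. It assumes zlib as the compression utility; `similarity_matrix` and `identifier_list` are hypothetical names, and the hashed file names merely stand in for real checksums of executable files 610 through 680.

```python
import hashlib
import zlib


def identifier_list(file_names):
    """Hypothetical stand-in for F_i: one SHA-256 hex digest per line, sorted."""
    digests = sorted(hashlib.sha256(name.encode()).hexdigest() for name in file_names)
    return ("\n".join(digests) + "\n").encode()


def similarity_matrix(files):
    """Build the n-by-n matrix of a_i,j = 1 - NCD_i,j from the lists F_1..F_n."""
    n = len(files)
    # Cache each C_i once instead of recompressing F_i for every pair.
    lengths = [len(zlib.compress(f)) for f in files]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # upper triangle, including the diagonal
            c_ij = len(zlib.compress(files[i] + files[j]))
            ncd = (c_ij - min(lengths[i], lengths[j])) / max(lengths[i], lengths[j])
            matrix[i][j] = matrix[j][i] = 1 - ncd  # reflect over the diagonal
    return matrix


hosts = [
    ["610", "620", "630", "640"],
    ["610", "650", "660", "670"],
    ["610", "620", "630", "680"],
    ["610", "650", "660"],
    ["640", "670", "680"],
]
matrix = similarity_matrix([identifier_list(h) for h in hosts])
for row in matrix:
    print([round(value, 2) for value in row])
```

Reflecting the upper triangle treats C_ij and C_ji as interchangeable, which holds only approximately for real compressors. Values may also fall marginally outside the range 0 to 1, so an implementation might clamp them before clustering.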

Turning to FIG. 12, FIG. 12 shows an example similarity matrix 1200 generated by applying the compression method of flow 1100 of FIG. 11 to host file inventories I₁ through I₅ of the example selected plurality of hosts 600 of FIG. 6. FIG. 12 shows hosts 1 through 5 (H₁ through H₅) corresponding to rows 1260 and columns 1270, forming a 5×5 similarity matrix 1200. Entries 1280 of similarity matrix 1200 include values from 0 to 1, inclusively. The closer the value is to 1, the closer the distance or greater the similarity of the corresponding hosts in row i, column j. For example, each entry in matrix 1200 with the same host in the corresponding row and column (e.g., a_1,1, a_2,2, a_3,3, etc.) has a value of 1 because the hosts, and therefore the host file inventories, are identical. In contrast, each entry in similarity matrix 1200 in which the corresponding hosts H_i and H_j have respective executable file inventories I_i and I_j with no common executable files has a value of zero (e.g., a_5,4, a_4,5).

Applying the compression method flow 1100 of FIG. 11 to the example selected plurality of hosts 600 of FIG. 6, in order to transform host file inventories into similarity matrix 1200, the following variables can be identified:

- n=5 (hosts)
- H₁ through H₅=hosts in network (e.g., H₁=host 1, H₂=host 2, etc.)
- I₁ through I₅=host file inventories representing sets of executable files 601 through 605, respectively

Each of the host file inventories I₁ through I₅ includes a set of file identifiers representing one of the sets of executable files 601, 602, 603, 604, and 605. Each executable file in a set of executable files is represented by a separate file identifier in the particular host file inventory. In this example scenario in which each file identifier includes a single token having a checksum configuration, the following host file inventories of hosts H₁ through H₅ could include file identifiers D₁ through D₈, which represent executable files 610 through 680, respectively:

- I₁→D₁, D₂, D₃, D₄
- I₂→D₁, D₅, D₆, D₇
- I₃→D₁, D₂, D₃, D₈
- I₄→D₁, D₅, D₆
- I₅→D₄, D₇, D₈

In step 1110, i is set to 1 and then the iterative looping begins to create an n×n (5×5) similarity matrix 1200 shown in FIG. 12. In step 1115 j is set to 1 and flow passes to steps 1120 through 1125 where the following variables can be determined:

- F_i (F₁)=D₁D₂D₃D₄ (i.e., list of file identifiers for I_i (I₁))
- F_j (F₁)=D₁D₂D₃D₄ (i.e., list of file identifiers for I_j (I₁))
- F_ij (F₁F₁)=D₁D₂D₃D₄D₁D₂D₃D₄ (i.e., concatenated files F_i (F₁) and F_j (F₁))

Flow then moves to steps 1140 through 1150 where compression is applied to these files and the length of the compressed files is represented as follows:
- C_i (C₁)=length of compressed file F_i (F₁)
- C_j (C₁)=length of compressed file F_j (F₁)
- C_ij (C₁C₁)=length of compressed file F_ij (F₁F₁)

For simplicity of explanation, example arbitrary values are provided in which each file identifier has a defined length of 1, such that C₁=4 and C₁C₁=4. It will be apparent, however, that these values are provided for example purposes only and may not accurately reflect actual values produced by a compression utility. After compression has been applied to the files, NCD_i,j is computed using the compressed values C_i, C_j, and C_ij. In this example,

$\begin{aligned} NCD_{1,1} &= [C_{1}C_{1} - \min\{C_{1}, C_{1}\}] / \max\{C_{1}, C_{1}\} \\ &= [4 - \min\{4, 4\}] / \max\{4, 4\} \\ &= 0 \end{aligned}$

After the NCD_1,1 value is computed in step 1155, flow moves to step 1160 and a_i,j is computed:

$\begin{aligned} a_{1,1} &= 1 - NCD_{1,1} \\ &= 1 - 0 \\ &= 1 \end{aligned}$

The ‘1’ value is added to row i, column j (row 1, column 1) of similarity matrix 1200. After similarity matrix 1200 has been updated, flow moves to decision box 1165 where a query is made as to whether j<n. Since 1 is less than 5, flow moves to step 1170 where j is set to 2 (i.e., j=j+1). Flow then loops back to step 1120 to determine the similarity distance between H_i (H₁) and the next host H_j (H₂). In this case, after extraction and compression are performed, NCD_i,j (NCD_1,2) is computed as 0.75, because H₁ and H₂ have only one common file identifier D₁ and, therefore, only one common executable file 610. In step 1160, NCD_1,2 is used to compute a_1,2 as 0.25, which is added to row i, column j (row 1, column 2) of similarity matrix 1200. The variable j is still less than 5 (i.e., 2<5) as determined in decision box 1165, so flow moves to step 1170 and j is set to 3 (i.e., j=j+1). This iterative processing continues for each value of j until j=5, thereby filling in each entry for H₁ in row i (row 1) of similarity matrix 1200.

After the last entry of row i (row 1) has been added to similarity matrix 1200, flow moves to decision box 1165 where a query is made as to whether j<n (i.e., Is 5<5?). Because j is not less than 5, flow moves to decision box 1175 where a query is made as to whether i<n (i.e., Is 1<5?). Because 1 is less than 5, flow moves to step 1180 where i is set to 2 (i.e., i=i+1) and flow loops back to step 1115 where j is set to 1. The inner iterative loop then begins in step 1120 to determine the similarity distance between host file inventory I_i (I₂) of host H_i (H₂) and each host file inventory I_j (I₁ through I₅). Thus, rows 1260 are successively filled with similarity distance values a_i,j until each row has been completed.
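
The arithmetic of this walkthrough can be reproduced under the simplified length model used above, in which each file identifier has a length of 1 and duplicate identifiers in a concatenated file compress away entirely. The Python sketch below makes that idealization explicit by taking C_i as the number of identifiers in I_i and C_ij as the number of distinct identifiers in the two inventories combined; the function name `ideal_similarity` is hypothetical, host 2 is taken to hold executable files 610, 650, 660, and 670 as shown in FIG. 6, and the resulting values are illustrative rather than a reproduction of FIG. 12.

```python
# File identifiers D1..D8 for executable files 610..680, per the hosts of FIG. 6.
inventories = {
    "H1": {"D1", "D2", "D3", "D4"},
    "H2": {"D1", "D5", "D6", "D7"},
    "H3": {"D1", "D2", "D3", "D8"},
    "H4": {"D1", "D5", "D6"},
    "H5": {"D4", "D7", "D8"},
}


def ideal_similarity(inv_i, inv_j):
    """a_i,j = 1 - NCD_i,j under the idealized unit-length compression model."""
    c_i, c_j = len(inv_i), len(inv_j)
    c_ij = len(inv_i | inv_j)  # duplicates in the concatenation cost nothing
    ncd = (c_ij - min(c_i, c_j)) / max(c_i, c_j)
    return 1 - ncd


hosts = list(inventories)
matrix = [
    [ideal_similarity(inventories[hi], inventories[hj]) for hj in hosts]
    for hi in hosts
]

print(matrix[0])     # row 1: [1.0, 0.25, 0.75, 0.25, 0.25]
print(matrix[3][4])  # a_4,5: 0.0
```

Row 1 agrees with the values stated above (a_1,1=1 and a_1,2=0.25), and a_4,5 is zero because hosts H₄ and H₅ share no executable files.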

After the compression method of flow 1100 has finished processing, similarity matrix 1200 can be provided as input to a similarity-based clustering procedure, as previously described herein with reference to clustering techniques used with a vector matrix. Information generated from the clustering procedure could be provided in numerous ways, as previously described herein with reference to FIG. 9. In one example, the information could be provided in a proximity plot such as example proximity plot 800 illustrated in FIG. 8, which has been previously shown and described herein.

Software for achieving the operations outlined herein can be provided at various locations (e.g., the corporate IT headquarters, end user computers, distributed servers in the cloud, etc.). In other embodiments, this software could be received or downloaded from a web server (e.g., in the context of purchasing individual end-user licenses for separate networks, devices, servers, etc.) in order to provide this system for clustering host inventories. In one example implementation, this software is resident in one or more computers sought to be protected from a security attack (or protected from unwanted or unauthorized manipulations of data).

In other examples, the software of the system for clustering host inventories in a computer network environment could involve a proprietary element (e.g., as part of a network security solution with McAfee® EPO software, McAfee® Application Control software, etc.), which could be provided in (or be proximate to) these identified elements, or be provided in any other device, server, network appliance, console, firewall, switch, information technology (IT) device, distributed server, etc., or be provided as a complementary solution (e.g., in conjunction with a firewall), or provisioned somewhere in the network.

In certain example implementations, the clustering activities outlined herein may be implemented in software. This could be inclusive of software provided in central server 130 (e.g., via administrative module 140, host inventory preparation module 150 and clustering module 160) and hosts 110 (e.g., via host inventory feed 114). These elements and/or modules can cooperate with each other in order to perform clustering activities as discussed herein. In other embodiments, these features may be provided external to these elements, included in other devices to achieve these intended functionalities, or consolidated in any appropriate manner. For example, some of the processors associated with the various elements may be removed, or otherwise consolidated such that a single processor and a single memory location are responsible for certain activities. In a general sense, the arrangement depicted in FIG. 1 may be more logical in its representation, whereas a physical architecture may include various permutations/combinations/hybrids of these elements.

In various embodiments, all of these elements (e.g., hosts 110, central server 130) include software (or reciprocating software) that can coordinate, manage, or otherwise cooperate in order to achieve the clustering operations, as outlined herein. One or all of these elements may include any suitable algorithms, hardware, software, components, modules, interfaces, or objects that facilitate the operations thereof. In the implementation involving software, such a configuration may be inclusive of logic encoded in one or more tangible media (e.g., embedded logic provided in an application specific integrated circuit (ASIC), digital signal processor (DSP) instructions, software (potentially inclusive of object code and source code) to be executed by a processor, or other similar machine, etc.), which may be inclusive of non-transitory media. In some of these instances, one or more memory elements (as shown in FIG. 2) can store data used for the operations described herein. This includes the memory element being able to store software, logic, code, or processor instructions that are executed to carry out the activities described in this Specification. A processor can execute any type of instructions associated with the data to achieve the operations detailed herein in this Specification. In one example, the processors (as shown in FIG. 2) could transform an element or an article (e.g., data) from one state or thing to another state or thing. In another example, the activities outlined herein may be implemented with fixed logic or programmable logic (e.g., software/computer instructions executed by a processor) and the elements identified herein could be some type of a programmable processor, programmable digital logic (e.g., a field programmable gate array (FPGA), an erasable programmable read only memory (EPROM), an electrically erasable programmable read only memory (EEPROM)) or an ASIC that includes digital logic, software, code, electronic instructions, or any suitable combination thereof.

Any of the memory items discussed herein should be construed as being encompassed within the broad term ‘memory element.’ Similarly, any of the potential processing elements, modules, and machines described in this Specification should be construed as being encompassed within the broad term ‘processor.’ Each of the computers, servers, and other devices may also include suitable interfaces for receiving, transmitting, and/or otherwise communicating data or information in a network environment.

Note that with the examples provided herein, interaction may be described in terms of two, three, four, or more network components. However, this has been done for purposes of clarity and example only. It should be appreciated that the system can be consolidated in any suitable manner. Along similar design alternatives, any of the illustrated computers, modules, components, and elements of FIG. 1 may be combined in various possible configurations, all of which are clearly within the broad scope of this Specification. In certain cases, it may be easier to describe one or more of the functionalities of a given set of flows by only referencing a limited number of components or network elements. Therefore, it should also be appreciated that the system of FIG. 1 (and its teachings) is readily scalable. The system can accommodate a large number of components, as well as more complicated or sophisticated arrangements and configurations. Accordingly, the examples provided should not limit the scope or inhibit the broad teachings of the system as potentially applied to a myriad of other architectures.

It is also important to note that the operations described with reference to the preceding FIGURES illustrate only some of the possible scenarios that may be executed by, or within, the system. Some of these operations may be deleted or removed where appropriate, or these operations may be modified or changed considerably without departing from the scope of the discussed concepts. In addition, the timing of these operations may be altered considerably and still achieve the results taught in this disclosure. The preceding operational flows have been offered for purposes of example and discussion. Substantial flexibility is provided by the clustering system in that any suitable arrangements, chronologies, configurations, and timing mechanisms may be provided without departing from the teachings of the discussed concepts.

Claims

1. One or more non-transitory media including code for execution that, when executed by a processor, is operable to: obtain a plurality of host file inventories corresponding respectively to a plurality of hosts in a network environment, wherein each of the plurality of host file inventories includes one or more file identifiers, each of the file identifiers of a particular host file inventory representing a different executable file on one of the plurality of hosts corresponding to the particular host file inventory; calculate input data by transforming the plurality of host file inventories into a similarity matrix for the plurality of hosts, wherein for at least each unique pair of host file inventories of the plurality of host file inventories, the transforming includes: determining a normalized compression distance (NCD) between the unique pair of host file inventories; determining a numerical value representing a similarity distance between the unique pair of host file inventories, the numerical value being determined based on the NCD; and updating the similarity matrix to include the numerical value representing the similarity distance between the unique pair of host file inventories; and provide the input data to a clustering procedure to group the plurality of hosts into one or more clusters of hosts, wherein the one or more clusters of hosts are grouped using a predetermined similarity criteria.
2. The one or more non-transitory media of claim 1, wherein the determining the NCD includes: storing, in a first file, one or more file identifiers of a first host file inventory of the unique pair of host file inventories; storing, in a second file, one or more file identifiers of a second host file inventory of the unique pair of host file inventories; concatenating the first and second files in a concatenated file; compressing the first file into a compressed first file; compressing the second file into a compressed second file; and compressing the concatenated file into a compressed concatenated file, wherein the NCD is computed based on the compressed first file, the compressed second file, and the compressed concatenated file.
3. The one or more non-transitory media of claim 1, wherein a variable n represents a total number of the plurality of hosts, the similarity matrix being an n by n (“n×n”) matrix.
4. The one or more non-transitory media of claim 1, wherein each entry in the similarity matrix is one of a plurality of numerical values included in the similarity matrix, wherein each entry represents a similarity distance between a pair of hosts of the plurality of hosts.
5. The one or more non-transitory media of claim 4, wherein only one-half of the plurality of numerical values that are not entries in a diagonal line of symmetry in the similarity matrix are determined, and wherein the determined numerical values are reflected over the diagonal line of symmetry in the similarity matrix.
6. The one or more non-transitory media of claim 1, wherein the code for execution, when executed by a processor, is further operable to: generate information indicating the one or more clusters of hosts, wherein each of the one or more clusters includes at least one host.
7. The one or more non-transitory media of claim 6, wherein the information is a proximity plot.
8. The one or more non-transitory media of claim 1, wherein the clustering procedure is an agglomerative hierarchical clustering technique with the predetermined similarity criteria including a cut point determination to define a stopping point of the clustering procedure.
9. The one or more non-transitory media of claim 1, wherein the clustering procedure is a partitional clustering technique.
10. An apparatus, comprising: at least one processor coupled to at least one memory element; a host inventory preparation module that when executed by the at least one processor, is configured to: obtain a plurality of host file inventories corresponding respectively to a plurality of hosts in a network environment, wherein each of the plurality of host file inventories includes one or more file identifiers, each of the file identifiers of a particular host file inventory representing a different executable file on one of the plurality of hosts corresponding to the particular host file inventory; and calculate input data by transforming the plurality of host file inventories into a similarity matrix for the plurality of hosts, wherein for at least each unique pair of host file inventories of the plurality of host file inventories, the transforming includes: determining a normalized compression distance (NCD) between the pair of host file inventories; determining a numerical value representing a similarity distance between the pair of host file inventories, the numerical value being determined based on the NCD; and updating the similarity matrix to include the numerical value representing the similarity distance between the pair of host file inventories; and a clustering module that when executed by the at least one processor, is configured to: receive the input data; and group the plurality of hosts into one or more clusters of hosts, wherein the one or more clusters of hosts are grouped using a predetermined similarity criteria.
11. The apparatus of claim 10, wherein the clustering module, when executed by the at least one processor, is further configured to: generate information indicating the one or more clusters of hosts, wherein each of the one or more clusters includes at least one host.
12. The apparatus of claim 10, wherein the information is a proximity plot.
13. The apparatus of claim 10, wherein the clustering module includes an agglomerative hierarchical clustering technique with the predetermined similarity criteria including a cut point determination to define a stopping point of the clustering module.
14. The apparatus of claim 10, wherein the clustering module includes a partitional clustering technique.
15. The apparatus of claim 10, wherein the determining the NCD includes: storing, in a first file, one or more file identifiers of a first host file inventory of the unique pair of host file inventories; storing, in a second file, one or more file identifiers of a second host file inventory of the unique pair of host file inventories; concatenating the first and second files in a concatenated file; compressing the first file into a compressed first file; compressing the second file into a compressed second file; and compressing the concatenated file into a compressed concatenated file, wherein the NCD is computed based on the compressed first file, the compressed second file, and the compressed concatenated file.
16. The apparatus of claim 10, wherein a variable n represents a total number of the plurality of hosts, the similarity matrix being an n by n (“n×n”) matrix.
17. The apparatus of claim 10, wherein each entry in the similarity matrix is one of a plurality of numerical values included in the similarity matrix, wherein each entry represents a similarity distance between a pair of hosts of the plurality of hosts.
18. The apparatus of claim 17, wherein only one-half of the plurality of numerical values that are not entries in a diagonal line of symmetry in the similarity matrix are determined, and wherein the determined numerical values are reflected over the diagonal line of symmetry in the similarity matrix.
19. A computer implemented method executed by one or more processors, comprising: obtaining a plurality of host file inventories corresponding respectively to a plurality of hosts in a network environment; calculating input data by transforming the plurality of host file inventories into a similarity matrix for the plurality of hosts, wherein, for at least each unique pair of host file inventories of the plurality of host file inventories, the transforming includes: storing, in a first file, one or more file identifiers of a first host file inventory of the pair of host file inventories; storing, in a second file, one or more file identifiers of a second host file inventory of the pair of host file inventories; concatenating the first and second files in a concatenated file; compressing the first file into a compressed first file; compressing the second file into a compressed second file; and compressing the concatenated file into a compressed concatenated file; determining a normalized compression distance (NCD) between the first and second host file inventories based on the compressed first file, the compressed second file, and the compressed concatenated file; determining a numerical value representing a similarity distance between the pair of host file inventories, the numerical value being determined based on the NCD; and updating the similarity matrix to include the numerical value representing the similarity distance between the pair of host file inventories; and providing the input data to a clustering procedure to group the plurality of hosts into one or more clusters of hosts, wherein the one or more clusters of hosts are grouped using a predetermined similarity criteria.
20. The computer implemented method of claim 19, wherein each of the plurality of host file inventories includes one or more file identifiers, each of the file identifiers of a particular host file inventory representing a different executable file on one of the plurality of hosts corresponding to the particular host file inventory.
21. The method of claim 19, wherein each entry in the similarity matrix is one of a plurality of numerical values included in the similarity matrix, wherein each entry represents a similarity distance between a pair of hosts of the plurality of hosts.
22. The method of claim 21, wherein only one-half of the plurality of numerical values that are not entries in a diagonal line of symmetry in the similarity matrix are determined, and wherein the determined numerical values are reflected over the diagonal line of symmetry in the similarity matrix.
23. The method of claim 19, wherein the clustering procedure is an agglomerative hierarchical clustering technique with the predetermined similarity criteria including a cut point determination to define a stopping point of the clustering procedure.
24. The method of claim 19, wherein the clustering procedure is a partitional clustering technique.

RELATED APPLICATION

This application is a continuation (and claims the benefit of priority under 35 U.S.C. §120) of U.S. patent application Ser. No. 12/880,125, filed Sep. 12, 2010, entitled, “SYSTEM AND METHOD FOR CLUSTERING HOST INVENTORIES,” by inventors Rishi Bhargava et al. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.

Non-Patent Literature Citations (42)

Entry
Bjornar Larsen et al., Fast and effective text mining using linear-time document clustering , 1999, ACM, 16-22.
Notification of International Preliminary Report on Patentability and Written Opinion mailed May 24, 2012 for International Application No. PCT/US2010/055520, 5 pages.
Sailer et al., sHype: Secure Hypervisor Approach to Trusted Virtualized Systems, IBM research Report, Feb. 2, 2005, 13 pages.
Kurt Gutzmann, “Access Control and Session Management in the HTTP Environment,” Jan./Feb. 2001, pp. 26-35, IEEE Internet Computing.
Eli M. Dow, et al., “The Xen Hypervisor,” INFORMIT, dated Apr. 10, 2008, http://www.informit.com/articles/printerfriendly.aspx?p=1187966, printed Aug. 11, 2009 (13 pages).
“Xen Architecture Overview,” Xen, dated Feb. 13, 2008, Version 1.2, http://wiki.xensource.com/xenwiki/XenArchitecture?action=AttachFile&do=get&target=Xen+architecture—Q1+2008.pdf, printed Aug. 18, 2009 (9 pages).
Desktop Management and Control, Website: http://www.vmware.com/solutions/desktop/, printed Oct. 12, 2009, 1 page.
Secure Mobile Computing, Website: http://www.vmware.com/solutions/desktop/mobile.html, printed Oct. 12, 2009, 2 pages.
Cilibrasi, Rudi Langston, “Statistical Inference Through Data Compression,” Institute for Logic, Language and Computation, ISBN: 90-6196-540-3, Copyright 2007, retrieved Sep. 10, 2010 from http://www.illc.uva.nl/Publications/Dissertations/DS-2007-01.text.pdf, 225 pages.
Karypis, George, Contact/METIS/CLUTO/MONSTER/YASSPP/Forums, Internal Lab Website, copyright 2006-2010, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome, 1 page.
Tagarelli, et al., “A Segment-based Approach to Clustering Multi-Topic Documents,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Ying Zhao and George Karypis, “Hierarchical Clustering Algorithms for Document Datasets,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Ying Zhao and George Karypis, “Topic-Driven Clustering for Document Datasets,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Ying Zhao and George Karypis, “Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Ying Zhao and George Karypis, “Clustering in Life Sciences,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Ying Zhao and George Karypis, “Evaluation of Hierarchical Clustering Algorithms for Document Datasets,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Ying Zhao and George Karypis, “Criterion Fuctions for Document Clustering: Experiments and Analysis,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Steinbach, et al., “A Comparison of Document Clustering Techniques,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Karypis, et al., “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modelings,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications, 1 page.
Matt Rasmussen and George Karypis, “gCLUTO: An Interactive Clustering, Visualitzation, and Analysis System,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/gcluto/publications, 1 page.
Matthew Rasmussen, et al., “wCLUTO: A Web-enabled Clustering Toolkit,” copyright 2005-2010, George Karypis, Internal Lab Website, retrieved Sep. 10, 2010 from http://glaros.dtc.umn.edu/gkhome/cluto/wcluto/publications, 1 page.
Dommers, Calculating the normalized compression distance between two strings, Jan. 20, 2009, retrieved Sep. 10, 2010 from http://www.c-sharpcorner.com/UploadFile/acinonyx72/NCD01202009071004AM/NCD.aspx, 5 pages.
A Tutorial on Clustering Algorithms, retrieved Sep. 10, 2010 from http://home.dei.polimi.it/matteucc/lustering/tutorial.html, 6 pages.
Barrantes et al., “Randomized Instruction Set Emulation to Dispurt Binary Code Injection Attacks,” Oct. 27-31, 2003, ACM, pp. 281-289.
Gaurav et al., “Countering Code-Injection Attacks with Instruction-Set Randomization,” Oct. 27-31, 2003, ACM, pp. 272-280.
Check Point Software Technologies Ltd.: “ZoneAlarm Security Software User Guide Version 9”, Aug. 24, 2009, XP002634548, 259 pages, retrieved from Internet: URL:http://download.zonealarm.com/bin/media/pdf/zaclient91—user—manual.pdf.
Notification of Transmittal of the International Search Report and the Written Opinion of the International Searching Authority (1 page), International Search Report (4 pages), and Written Opinion (3 pages), mailed Mar. 2, 2011, International Application No. PCT/US2010/055520.
Notification of Transmittal of the International Search Report and the Written Opinion of the International Searching Authority, or the Declaration (1 page), International Search Report (6 pages), and Written Opinion of the International Searching Authority (10 pages) for International Application No. PCT/US2011/020677 mailed Jul. 22, 2011.
Notification of Transmittal of the International Search Report and Written Opinion of the International Searching Authority, or the Declaration (1 page), International Search Report (3 pages), and Written Opinion of the International Search Authority (6 pages) for International Application No. PCT/US2011/024869 mailed Ju. 14, 2011.
Tal Garfinkel, et al., “Terra: A Virtual Machine-Based Platform for Trusted Computing,” XP-002340992, SOSP'03, Oct. 19-22, 2003, 14 pages.
IA-32 Intel® Architecture Software Developer's Manual, vol. 3B; Jun. 2006; pp. 13, 15, 22 and 145-146.
Mung-Sup Kim et al., “A load cluster management system using SNMP and web”, [Online], May 2002, pp. 367-378, [Retrieved from Internet on Oct. 24, 2012], <http://onlinelibrary.wiley.com/doi/10.1002/nem.453/pdf>.
G. Pruett et al., “BladeCenter systems management software”, [Online], Nov. 2005, pp. 963-975, [Retrieved from Internet on Oct. 24, 2012], <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.5091&rep=rep1&type=pdf>.
Philip M. Papadopoulos et al., “NPACI Rocks: tools and techniques for easily deploying manageable Linux clusters” [Online], Aug. 2002, pp. 707-725, [Retrieved from internet on Oct. 24, 2012], <http://onlinelibrary.wiley.com/doi/10.1002/cpe.722/pdf>.
Thomas Staub et al., “Secure Remote Management and Software Distribution for Wireless Mesh Networks”, [Online], Sep. 2007, pp. 1-8, [Retrieved from Internet on Oct. 24, 2012], <http://cds.unibe.ch/research/pub_files/B07.pdf>.
Taskar et al., Probabilistic Classification and Clustering in Relational Data, 2001, Google, 7 pages.
USPTO May 24, 2013 Notice of Allowance from U.S. Appl. No. 12/880,125.
International Preliminary Report on Patentability received from the PCT Application No. PCT/US2011/020677, mailed on Feb. 7, 2013, 9 pages.
International Preliminary Report on Patentability received for the PCT Application No. PCT/US2011/024869, mailed on Feb. 7, 2013, 6 pages.
Office Action received for the U.S. Appl. No. 12/880,125, mailed on Jul. 5, 2012, 12 pages.
Ex Parte Quayle Action received for the U.S. Appl. No. 12/880,125, mailed on Dec. 21, 2012, 4 pages.
USPTO Mar. 28, 2014 Nonfinal Rejection in U.S. Appl. No. 13/012,138, 21 pages.

Related Publications (1)

	Number	Date	Country
	20140006405 A1	Jan 2014	US

Continuations (1)

	Number	Date	Country
Parent	12880125	Sep 2010	US
Child	14016497		US
