The present invention relates to data processing systems, and more particularly to systems capable of scanning and indexing data.
Security, compliance, and search software programs each require adaptive, and often overlapping, knowledge about the content, state, location, access, and usage of a dynamic corpus of data located within respective domains. For example, anti-malware software typically scans and stores information indicative of threats and implements remedial actions. Further, compliance software conventionally scans file content and logs file location and other state information, in order to apply predetermined policies to data usage and storage. Still yet, search software indexes data content to facilitate rapid searching and concept mapping, by using computer algorithms to automatically associate related words, phrases, concepts, etc.
Any attempt to combine the foregoing disparate solutions poses a variety of interoperability challenges by requiring multiple software agents, management layers, indexes, etc. Further, implementing disparate solutions would reduce system efficiency by virtue of the competing and overlapping use of system and network resources. Even still, any attempt to combine such systems would inevitably diminish human productivity by requiring multiple interfaces, policies, workflows, etc., and would be cost-prohibitive, since each solution typically requires an enterprise to scale installation to maximize effectiveness.
There is thus a need for addressing these and/or other issues associated with the prior art.
A system, method and computer program product are provided for scanning and indexing data for different purposes. Included is a universal engine operable to scan and index data stored in at least one device, for a plurality of different purposes. Further provided is at least one application for controlling the universal engine to perform the scanning and indexing for at least one of the different purposes.
Further, such scanning may include any analysis of data, while the aforementioned indexing may refer to any processing which results in a data structure that is representative, at least in part, of the data, for facilitating subsequent analysis. Just by way of example, in one optional embodiment, the scanning may include the analysis of the data and/or indexed data, utilizing various criteria, patterns (e.g. signatures, etc.), rules, etc. for the purpose of reaching at least one conclusion.
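Just by way of illustration, pattern-based scanning of the kind described above may be sketched in Python as follows. The `Signature` class, the `scan` function, and both example patterns are hypothetical names chosen for this sketch and are not part of the described embodiments.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A hypothetical scan pattern; the name and fields are illustrative only."""
    name: str
    pattern: str  # regular expression applied to the text being scanned


def scan(text: str, signatures: list[Signature]) -> list[str]:
    """Return the names of all signatures that match the text (the 'conclusions')."""
    return [s.name for s in signatures if re.search(s.pattern, text)]


# Assumed example patterns; a real deployment would supply its own.
SIGNATURES = [
    Signature("card-number", r"\b(?:\d{4}[- ]){3}\d{4}\b"),
    Signature("confidential-marker", r"(?i)\bconfidential\b"),
]

findings = scan("CONFIDENTIAL: card 1234-5678-9012-3456", SIGNATURES)
print(findings)
```

In this sketch the conclusion reached by a scan is simply the list of matching signature names; any richer result type could be substituted.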
Still yet, the indexing may, in different embodiments, include an automatic classification or manipulation of the data based on content of the data, a creator of the data, a location of the data, metadata associated with the data, and/or any other desired aspect of the data. In an embodiment where the indexing is based on a content of the data, various text analysis may be performed to identify key or repeated terms (e.g. nouns, verbs, etc.). Still yet, such words may be weighted as appropriate (e.g. based on location, use, etc.), Bayesian algorithms may be used, etc. To this end, content-related insight into the data may be provided by a data structure that has a size that is less than that of the data itself. Of course, such examples of contextual indexing are set forth for illustrative purposes only, as any indexing may be used that meets the above definition.
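By way of a minimal sketch only, content-based indexing with location-dependent weighting may look like the following Python. The stop-word list, the two weights, and the function names are assumptions made for illustration; simple term counting is used here in place of the Bayesian algorithms mentioned above.

```python
import re
from collections import Counter

# Hypothetical stop-word list and weights, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}
TITLE_WEIGHT = 3.0  # a term in the title counts more than one in the body
BODY_WEIGHT = 1.0


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(title: str, body: str, top_n: int = 5) -> dict[str, float]:
    """Return the top_n terms with their location-weighted scores."""
    scores = Counter()
    for term in tokenize(title):
        scores[term] += TITLE_WEIGHT
    for term in tokenize(body):
        scores[term] += BODY_WEIGHT
    return dict(scores.most_common(top_n))


index = build_index(
    "Quarterly revenue report",
    "Revenue grew in the quarter. Revenue from services is up.",
    top_n=3,
)
print(index)
```

The resulting dictionary holds at most `top_n` entries, so it is smaller than the text it summarizes, consistent with the definition of indexing given above.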
With continuing reference to
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may or may not be implemented, per the desires of the user. For example, the aforementioned scanning and indexing, as well as possibly any action prompted based on such scanning/indexing, may be performed based on predetermined policies. In such embodiment, different policies may be used in conjunction with different applications. In another embodiment, heuristics may be used to control such scanning and indexing, for improved performance, efficiency, etc.
It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Thus, any of the following features may be optionally incorporated with or without the exclusion of other features described.
As shown, a plurality of devices 202A-N are provided. In the context of the present description, the devices 202A-N may each include a desktop computer, lap-top computer, hand-held computer, mobile phone, personal digital assistant (PDA), peripheral (e.g. printer, etc.), any component of a computer or related system, and/or any other type of logic for that matter. In additional embodiments, virtualization techniques may be used in conjunction with the devices 202A-N.
As shown, the devices 202A-N are equipped with universal engines 204A-N capable of scanning and indexing data stored on the respective device. In one embodiment, the universal engines 204A-N may include any combination of hardware and/or software for providing a “natural language processor” that is capable of sorting through a plethora of business-formatted information files, regardless of the data type, file type, file location, etc.
Agents 206A-N remain in communication with the associated universal engines 204A-N, as shown, for controlling such scanning and indexing, as well as taking any desired resulting action, etc. For example, each agent 206A-N may store results of such scanning and indexing in a local database 208A-N, for reasons that will soon become apparent. While the agents 206A-N are shown to reside on the devices 202A-N, embodiments are contemplated in which the agents 206A-N communicate with, but remain separate from the devices 202A-N. In one embodiment, the agents 206A-N may take the form of self-populating/self-propagating bots that automatically crawl a network in the background.
Coupled to one or more of the devices 202A-N is at least one hub 210. Such coupling may, in one embodiment, be accomplished via a network including, but not limited to a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a personal area network (PAN), etc.
Further, it should be noted that, in various embodiments, a plurality of the hubs 210 may be situated in different regions and further be coupled to different subsets of the devices 202A-N. Still yet, a hierarchical framework may further be provided such that the hubs 210 (or subsets thereof) are coupled to additional hubs (not shown). For instance, a hierarchy of regional servers and at least one central server may be provided.
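As one possible illustration of such a hierarchy, the following Python sketch models each hub with a reference to its parent hub and computes the chain of hubs from a regional hub up to a central server. The `Hub` class, the `escalation_path` function, and the hub names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hub:
    """A hypothetical hub node; 'parent' is None for the top-level (central) hub."""
    name: str
    parent: Optional["Hub"] = None


def escalation_path(hub: Hub) -> list[str]:
    """Return hub names from the given hub up to the top of the hierarchy."""
    path = []
    node = hub
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


central = Hub("central")
region_east = Hub("region-east", parent=central)
site_hub = Hub("site-12", parent=region_east)
print(escalation_path(site_hub))
```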
With continuing reference to
To this end, any aspect (e.g. priority, sequence, location, etc.) of the scanning, indexing, and/or resulting action may be controlled based on one or more policies. For example, one or more policies may dictate the criteria, patterns (e.g. signatures, etc.), etc. with which the data is scanned. Further, such policies may specify what particular data is indexed, based on specific criteria (e.g. a creator of the data, a location of the data, metadata associated with the data, words in the data, etc.). Even still, the policies may indicate which actions are to be taken based on the scanned/indexed data. This may be accomplished, for example, using specific rules that trigger an action based on results of a policy-specific scan of data that has been indexed in a policy-specific manner.
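The relationship among scan criteria, index criteria, and triggered actions described above may be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Policy` fields, the `apply_policy` function, the example pattern, and the action name are all hypothetical and do not reflect any particular policy format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy record; field names are assumptions, not from the source."""
    name: str
    scan_patterns: dict[str, str]      # finding name -> regular expression
    index_locations: tuple[str, ...]   # only paths under these prefixes are covered
    actions: dict[str, str] = field(default_factory=dict)  # finding name -> action


def apply_policy(policy: Policy, path: str, content: str) -> list[str]:
    """Scan one file under a policy and return the actions its rules trigger."""
    if not path.startswith(policy.index_locations):
        return []  # the policy does not cover this location
    triggered = []
    for finding, pattern in policy.scan_patterns.items():
        if re.search(pattern, content) and finding in policy.actions:
            triggered.append(policy.actions[finding])
    return triggered


compliance = Policy(
    name="compliance",
    scan_patterns={"ssn": r"\b\d{3}-\d{2}-\d{4}\b"},
    index_locations=("/shared/hr/",),
    actions={"ssn": "quarantine"},
)
print(apply_policy(compliance, "/shared/hr/review.txt", "SSN 123-45-6789"))
```

A second policy with different patterns and actions could be applied to the same file, which is one way different applications might share a single engine.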
In one embodiment, different policies may be used in conjunction with different applications. While such applications are not shown in
By this design, the policies may be used to scan/index data in a manner that makes it more effectively available for use by a purpose-specific application. For instance, a security application may require data to be scanned/indexed differently with respect to a financial application. Further, the actions taken as a result of the uniquely scanned/indexed data will also vary significantly. To this end, the policies may be used to tailor the various aspects of the system 200 to accommodate the purpose of a particular application executed by the system 200.
To generate and/or modify the aforementioned policies and provide additional administrative functions (e.g. propagation of scan/index results among the local/remote databases, etc.), a centralized management console 216 remains in communication with the hub 210. To this end, policies may be dynamically created and applied on a real-time basis. In various embodiments, the management console 216 may be integrated with or separate from the hub(s) 210. Further, the management console 216 may include a graphical user interface (GUI) for facilitating such operation. In one specific embodiment, the management console 216 may include the ePolicy Orchestrator® offered by McAfee®, Inc.
With continuing reference to
As an option, heuristics may be employed in any desired capacity in the administration of the system 200. For example, the aforementioned policies may be configured or dynamically adapted based on heuristics gathered across the system 200 by way of a feedback loop. In one embodiment, the heuristics are fed back from the devices 202A-N utilizing the associated agents 206A-N. Such heuristics may include, but are not limited to an amount of processing/communication resources available at the associated device 202A-N, a schedule of such resource availability, etc.
Thus, the system 200 is capable of intelligently implementing the foregoing policies in view of such heuristics. For example, the heuristics may drive when and where the indexing and scanning takes place; a location where results of the indexing and scanning are stored; a timing of a communication of the policies, scan/index results, etc.; a timing of any actions taken based on the scan/index results, etc. Of course, such heuristics-driven controls are set forth for illustrative purposes only, as any aspect of the system 200 may be heuristically controlled.
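A heuristics-driven decision of this kind may be sketched, purely for illustration, as the following Python. The `DeviceHeuristics` fields and the numeric thresholds are assumptions; the description above does not specify particular measurements or cut-off values.

```python
from dataclasses import dataclass


@dataclass
class DeviceHeuristics:
    """Hypothetical feedback reported by an agent; fields are assumptions."""
    cpu_idle_fraction: float        # 0.0 (fully busy) to 1.0 (fully idle)
    bandwidth_free_fraction: float  # share of network capacity currently unused
    in_work_hours: bool


def plan(h: DeviceHeuristics) -> dict[str, str]:
    """Decide when to index and where to keep results, given reported heuristics."""
    # Assumed thresholds: index when at least half the CPU is idle, and upload
    # only outside work hours when at least 70% of the bandwidth is free.
    index_now = h.cpu_idle_fraction >= 0.5
    upload_now = (not h.in_work_hours) and h.bandwidth_free_fraction >= 0.7
    return {
        "indexing": "run now" if index_now else "defer",
        "results": "send to hub" if upload_now else "store locally",
    }


print(plan(DeviceHeuristics(0.8, 0.9, in_work_hours=False)))
print(plan(DeviceHeuristics(0.2, 0.9, in_work_hours=True)))
```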
To this end, the system 200 coordinates and/or consolidates scanning, indexing, and policy enforcement efforts using a distributed, heuristic data management system and a feedback loop that is governed by a common set of policies that are managed using the single management console 216. The system 200 is thus self-tuning, self-evolving, and self-modifying to provide an ever-increasingly capable data collection/analysis botnet hierarchy. With guiding scripts/policies entered by humans, the system 200 is capable of narrowing its focus in order to provide increasingly relevant data and/or conclusions based on the analysis of data collected thus far. These refined data results may then be delivered to an inference engine which is able to coalesce the sorted/prioritized data in order to present to the user a result that is best tuned to the original request issued to the system 200.
In one example of use, each agent 206A-N updates, coordinates, and enforces a set of electronic policies, and the multi-purpose universal engines 204A-N analyze system data as directed by the policy set, and may act based on correlating findings with the policy set. Still yet, each local database 208A-N stores scan results in accordance with the policy set, while each hub 210 communicates with the local databases 208A-N to facilitate data retrievals, as needed, for further analysis or use.
The management console 216, in turn, controls the system 200, updates agent software, and directs data migration from local to centralized indexes. To facilitate such control, the local agents 206A-N communicate local operating conditions, in addition to predetermined indicators, back to the management console 216, thus providing heuristic feedback that can be used at the administrator level to adjust a priority, nature and sequence of policy enforcement actions across an enterprise, or take specialized action on a specific resource or group of resources.
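The storage and retrieval steps in the foregoing example of use may be sketched as follows, with an in-memory SQLite database standing in for a local database 208A-N. The table schema and the function names `open_local_db`, `store_result`, and `hub_query` are assumptions made for this sketch.

```python
import sqlite3


def open_local_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for a local database; the schema is assumed."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE results (path TEXT, policy TEXT, finding TEXT)")
    return db


def store_result(db: sqlite3.Connection, path: str, policy: str, finding: str) -> None:
    """Record one scan result under the policy that produced it."""
    db.execute("INSERT INTO results VALUES (?, ?, ?)", (path, policy, finding))
    db.commit()


def hub_query(db: sqlite3.Connection, policy: str) -> list[tuple[str, str]]:
    """Retrieve results for one policy, as a hub might when data is needed."""
    rows = db.execute(
        "SELECT path, finding FROM results WHERE policy = ? ORDER BY path",
        (policy,),
    )
    return rows.fetchall()


db = open_local_db()
store_result(db, "/docs/b.txt", "security", "malware-signature")
store_result(db, "/docs/a.txt", "security", "suspicious-macro")
store_result(db, "/docs/a.txt", "compliance", "ssn")
print(hub_query(db, "security"))
```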
As shown, a protected network 302 is provided including a plurality of components including, but not limited to servers 304, workstations 306, an email system 308, etc. Coupled to such components is logic 310 adapted for scanning and indexing data stored on such components, storing results of such processing, and taking any resulting action based on heuristically-driven policies.
As further shown, an additional network 312 as well as additional devices 314 may be provided. In one embodiment, such additional network 312 and/or additional devices 314 may communicate with the protected network 302 by way of a virtual private network (VPN) connection 316 or utilizing any other desired technique. To this end, scan/index results and policy information may be distributed among multiple networks and devices, in a secure manner. For example, the protected network 302 may include data that is to remain most secure, while other data may be stored at the additional network 312 as well as the additional devices 314.
As illustrated, the device 400 includes an agent 402 loaded thereon which allows local indexing of data and policy enforcement that is synchronized by a central administrator (e.g. via a control console 404, etc.). Further included are a variety of components including a plurality of policy files 406, a policy application engine 408, an index component 410, and a heuristic management component 412.
In use, the policy files 406 are received under the direction of the control console 404 for use by the policy application engine 408 to provide for specific actions to be invoked by different applications. Such policy application engine 408 executes policies in the policy files 406 based on pre-set factors and heuristic analysis of local and system-level conditions. Further, the index component 410 provides for a dynamic repository of file content and metadata, thus serving as an enterprise knowledge storehouse. The heuristic management component 412 controls the timing/size of data flow between resources based on priority, bandwidth, and/or asset usage, etc. To this end, data may be locally indexed and transferred to a central repository; and updates, queries, commands, etc. may be transmitted back.
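One way the size of a data transfer might be derived from priority, bandwidth, and asset usage is sketched below in Python. The formula, the `PRIORITY_WEIGHT` values, and the `transfer_budget` function are assumptions chosen for illustration; the description above names the factors but not how they are combined.

```python
# Hypothetical priority weights; the factors are named above but no values are given.
PRIORITY_WEIGHT = {"low": 0.25, "normal": 0.5, "high": 1.0}


def transfer_budget(pending_bytes: int, free_bandwidth_bps: float,
                    asset_busy_fraction: float, priority: str,
                    window_seconds: int = 10) -> int:
    """Return how many bytes to send during the next transfer window."""
    bytes_per_second = free_bandwidth_bps / 8  # bits per second -> bytes per second
    # Scale the raw capacity down by how busy the asset is and by priority.
    budget = (bytes_per_second * window_seconds
              * (1.0 - asset_busy_fraction) * PRIORITY_WEIGHT[priority])
    return min(pending_bytes, int(budget))


high = transfer_budget(10_000_000, 8_000_000, 0.5, "high")
low = transfer_budget(10_000_000, 8_000_000, 0.5, "low")
print(high, low)
```

With the assumed values, a half-busy asset on an 8 Mbit/s link sends 5,000,000 bytes per 10-second window at high priority and 1,250,000 bytes at low priority.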
As shown in
In use, the management console 505 may administer the policies 506 by distributing the same to enterprise and host-based applications 508. Such applications 508 may include security, compliance, search and any other type of programs that depend, at least in part, on scanning and indexing of data. While
Further, a unified index storage 510 connects to all data sources, providing enterprise search and data classification which can be leveraged by the designated applications 508. Under the control of a resource management module 512, complex computing tasks can be performed using idle machines in another geographical location to minimize impact on network performance during work hours. To accomplish this, the resource management module 512 may remain in communication with a variety of enterprise solutions 514 and data repositories 516.
Thus, the resource management module 512 may provide for workload distribution across the network, based on bandwidth, usage, and priority factors. Further, large data transfers and complex computing functions may take place when resources are idle or underutilized. For example, during work hours, indexing may occur locally as a background task (e.g. using servers, workstations, etc.). Further, at night or during idle time, indexed data may be transferred from the workstations and servers to regional data hubs. Likewise, policy updates and instructions may be distributed from the hubs to each network device.
A distributed knowledge management system is thus provided including policy-based content and meta-data indexing of electronic data in a computer network utilizing a distributed indexing/storage architecture. In a variety of embodiments, such architecture may include, among other things, a central data indexing hub, regional data hubs, and local agents capable of performing and/or directing local indexing/storage functions based on predetermined policies and/or at the direction of the central or regional data hubs.
Further, heuristic resource management functionality may be provided that connects every network asset to the central data hub and/or regional hubs and provides real-time and/or on-access assessments of asset usage, state information, and data content. This data may, in turn, be used by the central or regional data hubs to regulate the implementation of data management policies and information requests, including a scope/frequency of indexing and security protocols. The data may further be used to coordinate and execute distributed computing functions, and monitor overall network integrity, efficiency, and usage.
Still yet, integrated policy application functionality may be provided that leverages the distributed data index and heuristic resource management modules to execute electronic policies across the network or on specific network assets. Policies may thus encompass data management, accessibility, security and compliance functions.
To this end, various features may or may not be provided, as desired. For example, the system may provide improved access to corporate knowledge stored as electronic data. It may employ automated data classification technologies to enforce policies and manage information access/use (security), maintenance, storage and deletion. Further, investigative efforts, including audits and electronic discovery, may be streamlined. Network assets may also be leveraged to perform complex computing tasks and minimize under-utilization of resources. Thus, provided is a comprehensive data management model that integrates resource management, information accessibility, security, and policy enforcement in the context of a networked computing environment.
The workstation shown in
The workstation may have resident thereon any desired operating system. It will be appreciated that an embodiment may also be implemented on platforms and operating systems other than those mentioned. One embodiment may be written using JAVA, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology. Object oriented programming (OOP) has become increasingly used to develop complex applications.
Of course, the various embodiments set forth herein may be implemented utilizing hardware, software, or any desired combination thereof. For that matter, any type of logic may be utilized which is capable of implementing the various functionality set forth herein.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application is a continuation (and claims the benefit of priority under 35 U.S.C. §120) of U.S. application Ser. No. 11/959,113, filed Dec. 18, 2007, now U.S. Pat. No. 8,086,582, entitled “SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR SCANNING AND INDEXING DATA FOR DIFFERENT PURPOSES,” Inventor(s) Ronald Holland Wills, et al. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.
Number | Date | Country | |
---|---|---|---|
20120079117 A1 | Mar 2012 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11959113 | Dec 2007 | US |
Child | 13311089 | | US |