Typical web search engines are unable to crawl networks on the Internet that have limited or restricted access. Thus, the corpus of content discoverable by web search engines is limited, and certain types of content may not be amenable to discovery by typical web search engines.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Cyber criminals rely on remaining anonymous over networks such as the Internet when engaging in various malicious activities. Despite attempts to maintain anonymity, malicious entities often nevertheless inadvertently expose sensitive and potentially identifying information during benign use of various internet facilities. Various techniques for monitoring cyber activity across one or more internet portals and collecting and analyzing information as well as employing such information to profile malicious or suspect entities and activities and to alert potential targets are disclosed herein.
Data collection modules 104 may comprise any appropriate hardware and/or software components, such as user accounts and/or host devices, configured to monitor activity with respect to associated internet facilities 102 and gather relevant information. Although depicted as a single block in
Data input into data collection system 100 is processed by data processing engine 106. Data processing may comprise normalizing data, analyzing data, data mining, identifying relevant data, computing statistics, categorizing data, correlating data, aggregating related data, indexing data for searches, etc. Related sets of data, such as data associated with a particular entity or keyword, are stored in database 108. In some embodiments, database 108 comprises a searchable database from which data of interest may be retrieved using search engine 110. Data from data processing engine 106 and/or database 108 may in some embodiments be employed by alert engine 112 to generate alerts when certain data, such as data associated with malicious activity, is detected. Data collection system 100 further comprises an interface 114. In some embodiments, interface 114 comprises a dashboard. In some embodiments, interface 114 comprises an API (Application Programming Interface). In some embodiments, interface 114 may be employed to at least in part configure and/or tune data collection system 100. For example, the types of data to monitor and collect and/or actions to take if particular types of data are found may in some embodiments be configurable via interface 114. Moreover, interface 114 may comprise an interface for searching database 108 via search engine 110 and presenting search results. Furthermore, interface 114 may present other data that may be of interest, such as real time data collection, traffic analysis, and/or processing results, which may be presented in some embodiments via one or more gauges or other appropriate user interface widgets.
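By way of illustration only, the processing flow described above (normalizing data, indexing it for searches, and generating alerts when data of interest is detected) might be sketched as follows. The names `Record`, `CollectionSystem`, `process`, and `search`, the source labels, and the whitespace tokenization are assumptions introduced for this sketch and are not part of the disclosed embodiments.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    source: str  # hypothetical label for the originating facility, e.g. "forum"
    entity: str  # hypothetical identifier of the observed entity
    text: str    # raw content gathered by a data collection module


@dataclass
class CollectionSystem:
    watch_keywords: set = field(default_factory=set)
    index: dict = field(default_factory=lambda: defaultdict(list))
    alerts: list = field(default_factory=list)

    def process(self, record: Record) -> None:
        # Normalize: lowercase the text and split it on whitespace.
        tokens = set(record.text.lower().split())
        # Index: map each token to the records that contain it.
        for token in tokens:
            self.index[token].append(record)
        # Alert: record every configured keyword seen in this record.
        for keyword in sorted(tokens & self.watch_keywords):
            self.alerts.append((keyword, record.entity, record.source))

    def search(self, keyword: str) -> list:
        # Look up records by keyword without modifying the index.
        return list(self.index.get(keyword.lower(), []))
```

In this sketch, the set of watch keywords stands in for the configuration that may be supplied via interface 114, and `search` stands in for a query issued through search engine 110.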
In some embodiments, data is input into data collection system 100 from a network of one or more open proxy servers 102(a) that have been configured to monitor traffic and collect data. The IP (Internet Protocol) address of a proxy server serves as the source address for activity conducted using the proxy server, thereby concealing the actual source of the activity and preserving anonymity. Although open proxy servers may be employed for benign activity such as circumventing internet censorship, they are often employed by entities who desire to remain anonymous when conducting malicious activity. A highly monitored stealth network of open proxy servers 102(a) is in some embodiments employed to lure malicious entities desiring to mask their identities. The existence of such open proxy servers may be publicized by manually adding them to lists of open proxy servers available on the Internet and/or may be discovered by entities actively scanning for open proxy servers. Due to their public nature, open proxy servers typically experience an enormous amount of traffic, and such traffic can be monitored, analyzed, and/or cataloged as desired. Open proxy servers may be advantageously employed not only to detect malicious or suspect activity by entities but also to learn sensitive information about such entities if they continue to use the proxy servers for benign purposes that may reveal or aid in revealing their actual identities, such as logging into and/or establishing sessions with respect to personal accounts.
In some embodiments, data is input into data collection system 100 from host devices configured to operate as nodes of an anonymity network 102(b) such as Tor. Anonymous communications over such a network may be facilitated, for example, using onion routing. Each node of an anonymity network may operate as an entrance node, a transit node, and/or an exit node of the network. In some embodiments, a sufficiently large number of devices may be deliberately configured to operate as nodes of anonymity network 102(b) so that a substantial portion of traffic associated with network 102(b) traverses the devices. Such traffic may be analyzed, and traffic seen by different devices may be correlated, possibly at least partially compromising the obfuscation of such communications. Moreover, any collected data may be further correlated with other data collected by data collection system 100.
In some embodiments, spam 102(c) is collected from various sources and input into data collection system 100. Spam may be collected, for instance, using a dedicated set of email accounts deliberately set up to elicit spam. In such cases, the associated email addresses may be employed to sign up for or create accounts on various sites expected to make the email addresses available to spammers. Spam harvested from these email accounts is analyzed by data collection system 100, for example, to identify threats such as phishing and spoofing attacks and to provide early notifications or alerts to potential targets. The analysis may include searching for keyword matches as well as correlating data from spam with other data collected by data collection system 100 via other internet facilities, for instance, to aid in identifying the origin or source of the spam. For example, if references to a prominent financial institution are found to occur frequently in spam messages, the financial institution may be alerted, and the origin or source of a potential attack may be identified by recognizing relationships that may exist between data harvested from spam and other information processed by data collection system 100.
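As a minimal sketch of the keyword matching described above, the following example counts how many harvested messages mention each of a set of institution names and reports those mentioned in at least a given fraction of messages. The function names, the substring matching, and the `min_fraction` default of 0.2 are assumptions chosen for illustration; the specification does not prescribe a particular matching method or threshold.

```python
from collections import Counter


def mentions_by_institution(messages, institutions):
    """Count the messages that mention each institution (case-insensitive)."""
    counts = Counter()
    for message in messages:
        lowered = message.lower()
        for name in institutions:
            if name.lower() in lowered:
                counts[name] += 1
    return counts


def institutions_to_alert(messages, institutions, min_fraction=0.2):
    """Return institutions mentioned in at least min_fraction of messages."""
    if not messages:
        return []
    counts = mentions_by_institution(messages, institutions)
    total = len(messages)
    return sorted(name for name, n in counts.items() if n / total >= min_fraction)
```

A deployed system would likely use more robust matching than substring search, for example to handle deliberately misspelled names, but the thresholding step would remain similar.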
In some embodiments, data is input into data collection system 100 from one or more social media networks 102(d). Many social media networks are not fully accessible without a registered user account. Moreover, an account holder may have access to only certain portions of the network. Thus, much of the data on such networks cannot be discovered or surfaced by search engine crawlers. However, a set of dedicated accounts may be deliberately set up or created to collect information from such networks. Crawlers may be employed with respect to such accounts to facilitate gathering of data. Any data gathered from a social media network may be mined and correlated with other data processed by data collection system 100.
In some embodiments, data is input into data collection system 100 from one or more forums 102(e). Content on many forums is accessible only to registered and/or vetted users and, thus, is not discoverable by search engine crawlers. However, forums such as those associated with the hacker community are typically rife with intelligence on existing security breaches, security vulnerabilities, targets or potential targets, and other malicious activity. In some embodiments, a set of user accounts is deliberately created to gain access or entry into such forums. Crawlers may be employed with respect to such accounts to facilitate gathering of data. Furthermore, one or more dedicated forums may deliberately be deployed to attract various types of malicious entities. Such forums and/or forum accounts may be employed to seed posts related to particular topics and to entice other forum members to post information related to the topics. Any data gathered from a forum may be mined and correlated with other data processed by data collection system 100.
Although some examples of internet facilities that may be employed to feed data into data collection system 100 have been described, data may be input into data collection system 100 in various embodiments from any other appropriate data sources. Similar to the manner described for open proxy servers, data may be mined from any internet access point or resource that is left or configured open such as an open VPN (Virtual Private Network) server. Moreover, data may be mined from web sites, chat rooms, messaging services, IRC (Internet Relay Chat) networks, P2P (peer-to-peer) networks, etc. Malicious activity or intent may be detected by specifically surveilling internet facilities that are often or may be used for nefarious purposes. Data received by data collection system 100 from various sources is analyzed and correlated so that data associated with particular entities, activities, keywords, etc., may be aggregated and stored in database 108 as well as used by alert engine 112 to generate appropriate alerts for targets or potential targets of malicious activity. In some embodiments, data associated with both benign and malicious use is aggregated. Some of the data associated with an entity that is harvested from benign use by the entity, for example, may be employed to at least in part unmask the identity of the entity, for example, if the entity is found to be associated with malicious activity.
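One possible way to aggregate data associated with a particular entity across sources, offered here only as a sketch, is to merge observations that share at least one identifier. The example below uses a union-find structure for the merge. The function name `aggregate_profiles`, the `(source, identifiers)` observation format, and the choice of union-find are assumptions made for illustration and are not drawn from the specification.

```python
from collections import defaultdict


def aggregate_profiles(observations):
    """Group observations that share at least one identifier.

    Each observation is a (source, identifiers) pair, where identifiers is a
    set of strings such as an IP address, an email address, or a username.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link all identifiers that appear together in one observation.
    for _, identifiers in observations:
        ids = sorted(identifiers)
        for other in ids[1:]:
            union(ids[0], other)

    # Collect identifiers and sources under each resulting group.
    profiles = defaultdict(lambda: {"identifiers": set(), "sources": set()})
    for source, identifiers in observations:
        if not identifiers:
            continue
        root = find(next(iter(identifiers)))
        profiles[root]["identifiers"] |= set(identifiers)
        profiles[root]["sources"].add(source)
    return list(profiles.values())
```

Under this sketch, an address seen on an open proxy server and again alongside a forum username would place both observations in one profile, which corresponds to the aggregation of benign and malicious use described above. Shared identifiers such as a common gateway address can produce false merges, so any such grouping would need further corroboration.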
As described, the data collection system disclosed herein aids in generating awareness of current or real time Internet activity and strives to prevent or at least mitigate attacks or exploits as well as identify perpetrators of such activities. Services available via such a data collection system include, but are not limited to, providing a criminal profile database, providing criminal tracking, providing threshold triggers and alerts (e.g., distributed denial-of-service (DDoS) attacks may be detected based on increased traffic to targets, and both perpetrating and targeted parties may be identified), gathering performance data (e.g., on a network or host on the Internet), identifying Internet usage patterns, etc.
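A threshold trigger of the kind mentioned above might, for example, compare the request count per target in a current time window against a baseline count and flag targets whose traffic has increased sharply. The following is a minimal sketch under that assumption. The function names, the `factor` and `min_requests` defaults, and the `(source, target)` event format are hypothetical and would need tuning for any real deployment.

```python
from collections import Counter


def traffic_spikes(baseline_counts, current_counts, factor=5.0, min_requests=100):
    """Return targets whose current request count exceeds factor x baseline."""
    flagged = []
    for target, current in current_counts.items():
        # Treat a missing or zero baseline as 1 to avoid a zero threshold.
        baseline = max(baseline_counts.get(target, 0), 1)
        if current >= min_requests and current > factor * baseline:
            flagged.append(target)
    return sorted(flagged)


def top_sources(events, target, n=3):
    """Return the n sources sending the most requests to the given target."""
    counts = Counter(src for src, dst in events if dst == target)
    return [src for src, _ in counts.most_common(n)]
```

Here `traffic_spikes` identifies possible targeted parties, and `top_sources` lists the heaviest sources of traffic toward a flagged target as candidates for further analysis.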
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.