Internet traffic and the number of web servers and websites continue to grow at an enormous rate. At the same time, malicious websites are becoming an increasingly serious problem. Users often are provided with URLs to such websites in unsolicited emails, SMS or MMS messages, or other communications. If a user then visits the website using that URL, the website can harm the user or his or her computer in a multitude of different ways, including loading malware onto the user's computer or gathering sensitive data from the user's computer. For example, a malicious website can load a harmful virus or worm onto the user's computer as soon as the computer accesses the website.
There are existing methods for warning users about malicious websites. For example, a user can install security software onto his or her computer that will produce a warning message if the user attempts to visit a website that is a known malicious website. This type of software is dependent upon databases or lists of known malicious websites and requires that the database or list be constantly updated. These methods are effective for avoiding malicious websites that are already known. However, they provide no protection against new malicious websites that have not yet been added to the database or list.
What is needed is a method and apparatus for identifying malicious websites with a high probability, even if the website is new and not a known malicious website.
What is further needed is a method and apparatus for identifying malicious websites on an extremely large scale, as might be required for an Internet Service Provider or corporate network server that wishes to protect all of its end users from visiting malicious websites.
The aforementioned problems and needs are addressed by a method and apparatus for analyzing a URL and predicting whether the URL corresponds to a malicious website.
A prior art system is depicted in
With reference now to
An embodiment is now described with reference to
The embodiment is further described in
Additional description will now be provided of domain classification engine 110. The internal operation of an embodiment of domain classification engine 110 is shown in
An example of a domain name 300 is shown in
With reference again to
In parallel with feature extraction 210, domain classification engine 110 also performs Markov analysis (step 220). Markov analysis is a known method in the field of statistics whereby a probability for an event is determined based on the probability of its sub-events. As applied in this embodiment, domain classification engine 110 determines the probability of a character occurring in normal language (such as English) given the preceding two (or other number of) characters. For example, if the received URL is google.com, domain classification engine 110 will determine the probability of a “g” occurring at the beginning of a word, the probability of an “o” occurring after a “g,” the probability of an “o” occurring after a “g” and “o,” the probability of a “g” occurring after an “o” and “o,” and so forth. In this manner, domain classification engine 110 determines a probability for each character. It then multiplies the probabilities for the individual characters to obtain a probability for the entire domain name. This can be referred to as the Markov Probability for the domain name, and it indicates how random the domain name appears: the lower the probability, the more random the name. The probabilities for each character can be determined based on a database of existing usage, such as a dictionary, or a list of known, good (non-malicious) domain names. This Markov analysis takes advantage of the fact that malicious domain names often look like “gibberish” and do not make sense in everyday English or other spoken language.
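The Markov analysis of step 220 can be illustrated with a minimal sketch. It is not the embodiment's implementation; it only shows one way a per-character probability conditioned on the two preceding characters might be estimated from a list of known good names and then combined. The function names, the "^" start-of-name padding symbol, the add-alpha smoothing, and the alphabet size of 37 (letters, digits, and hyphen) are all assumptions introduced for illustration. The sketch sums logarithms rather than multiplying raw probabilities, which gives an equivalent ranking while avoiding numeric underflow on long names.

```python
import math
from collections import defaultdict

ORDER = 2    # number of preceding characters used as context
START = "^"  # hypothetical padding symbol marking the start of a name


def train(names, order=ORDER):
    """Count how often each character follows each `order`-character context."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        padded = START * order + name.lower()
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts


def log_markov_probability(name, counts, order=ORDER, alphabet_size=37, alpha=1.0):
    """Return the sum of log-probabilities of each character given its context.

    Add-alpha smoothing (an assumption, not taken from the text) keeps
    transitions never seen in training from yielding a probability of zero.
    """
    padded = START * order + name.lower()
    total = 0.0
    for i in range(order, len(padded)):
        context, char = padded[i - order:i], padded[i]
        seen = counts.get(context, {})
        numerator = seen.get(char, 0) + alpha
        denominator = sum(seen.values()) + alpha * alphabet_size
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    # Toy training list; a real deployment would use a large corpus.
    model = train(["google", "goodreads", "github"])
    for candidate in ("google", "xqzvkj"):
        print(candidate, log_markov_probability(candidate, model))
```

A name built from transitions that are common in the training list receives a higher (less negative) log-probability than a gibberish name of the same length. Because the value shrinks with every added character, names of different lengths would need some normalization, such as dividing by length, before being compared; that step is likewise an assumption and is omitted here.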
Domain classification engine 110 then performs random forest classification (step 230). Random forest classification is a known method in the field of statistics whereby a classification is made of an input based upon an existing dataset. Here, random forest classification can comprise classifying a domain name as malicious based on a dataset of known malicious domain names. Random forest classification also can comprise classifying a domain name as good (non-malicious) based on a dataset of known good (non-malicious) domain names.
Domain classification engine 110 then generates a maliciousness rating (step 240) based on the results of the Markov analysis (step 220), feature extraction (step 210), and random forest classification (step 230). The maliciousness rating indicates the likelihood that the domain name corresponds to a malicious website. A threshold can be chosen (e.g., 0.60 on a scale of 0 to 1.00) that is used to determine whether a website is malicious or not.
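The text does not specify how the three results are combined into the maliciousness rating of step 240. The following sketch assumes, purely for illustration, that each step has already been reduced to a score between 0 and 1 and that the rating is a weighted average of those scores. The function names, the weights, and the choice to treat a rating equal to the threshold as malicious are hypothetical; only the example threshold of 0.60 comes from the text.

```python
THRESHOLD = 0.60  # example threshold from the text, on a scale of 0 to 1.00


def maliciousness_rating(markov_score, feature_score, forest_score,
                         weights=(0.3, 0.2, 0.5)):
    """Combine three scores in [0, 1] into a single rating in [0, 1].

    The weighted average and the default weights are assumptions made for
    illustration; the source does not state how the results are combined.
    """
    scores = (markov_score, feature_score, forest_score)
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("each score must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def is_malicious(rating, threshold=THRESHOLD):
    """Treat a rating at or above the threshold as malicious (an assumption)."""
    return rating >= threshold


if __name__ == "__main__":
    rating = maliciousness_rating(0.9, 0.5, 0.8)
    print(round(rating, 2), is_malicious(rating))
```

Raising the threshold reduces the number of good websites flagged in error at the cost of missing more malicious ones, so the value would ordinarily be tuned against labeled data.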
In response to a high maliciousness rating (indicating a high likelihood that the website is malicious), computer 100 can take any number of different actions, such as preventing access by computer 10 (or a plurality of computers) to website 40 or server 30; sending a message to computer 100; generating an alert for a user of computer 10 or the operator of computer 100; updating a list or database of known malicious websites or known good websites; or generating a user interface for an operator of computer 100 or a user of computer 10 that provides the maliciousness rating or data reflective of that rating (such as a graph). These actions optionally can be performed by an execution engine 120 (not shown), which is software running on computer 100.
The database or list of known malicious websites or known good websites can be continually updated. Thereafter, the probabilities for the Markov analysis can be updated, as can the models for the random forest classification. Thus, the quality of the predictions made by the embodiments as to whether a domain name corresponds to a malicious website or a good website will remain high even as the operators of malicious websites change their strategies in selecting domain names.
In another application of the embodiments, domain classification engine 110 can be used to identify computers that already have been infected by malware. It is a common practice for malware to cause the infected computer to perform a DNS lookup on a domain name that the malware attacker controls. The infected computer will then obtain the IP address for that domain name and will be directed to a server at that IP address. The server will be controlled by the malware attacker, and the server will provide commands and/or instructions to the infected computer. Domain classification engine 110 can be used to analyze the domain names during the DNS lookup events and can generate a maliciousness rating for the domain names using the same methods and apparatuses discussed previously. If the maliciousness rating indicates a malicious domain name, then the same type of actions described previously can be taken (e.g., adding the domain to a list of known malicious websites), and in addition, an operator can be notified that the computer that initiated the DNS lookup likely has been infected with malware.
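The DNS lookup application described above can be sketched as a simple pass over lookup events. The event format (pairs of client address and queried domain name), the function name, and the idea of passing the rating method in as a callable are assumptions for illustration; the rating function itself stands in for whatever the domain classification engine produces.

```python
from collections import defaultdict


def flag_infected_clients(dns_events, rate, threshold=0.60):
    """Group suspicious lookups by the client that issued them.

    dns_events: iterable of (client_address, domain_name) pairs (hypothetical format).
    rate: callable returning a maliciousness rating in [0, 1] for a domain name.
    Returns a dict mapping each flagged client to the suspicious domains it queried.
    """
    flagged = defaultdict(list)
    for client, domain in dns_events:
        if rate(domain) >= threshold:
            flagged[client].append(domain)
    return dict(flagged)


if __name__ == "__main__":
    # Toy ratings standing in for the output of the classification engine.
    toy_ratings = {"google.com": 0.05, "xqzvkj.biz": 0.93}
    events = [
        ("10.0.0.5", "google.com"),
        ("10.0.0.7", "xqzvkj.biz"),
        ("10.0.0.7", "google.com"),
    ]
    print(flag_infected_clients(events, toy_ratings.get))
```

Each client in the returned mapping is a candidate for operator notification, and each listed domain is a candidate for addition to the list of known malicious websites.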
The embodiments described herein are valuable in detecting the domain names of malicious websites, even those not yet known to be malicious. The embodiments also are very scalable and can be used in environments involving a large number of DNS requests, as is the case with ISPs or corporate network servers.
References to the present invention herein are not intended to limit the scope of any claim or claim term, but instead merely make reference to one or more features that may be covered by one or more of the claims. Materials, processes and numerical examples described above are exemplary only, and should not be deemed to limit the claims.