It has become common for users of computers connected to the World Wide Web (the “web”) to employ web browsers and search engines to locate web pages (or “documents”) having specific content of interest to them. A web-based commercial search engine may index tens of billions of web documents maintained by computers all over the world. Users compose queries, and the search engine identifies documents (known as the search results or result set) that match the queries, either because such documents include keywords from the queries or because the documents otherwise seem relevant to the queries.
Unfortunately, in addition to valid documents on the Internet, there also exists a substantial number of documents that contain or point to malicious software (“malware”), masquerading as useful and relevant documents. This malware is intentionally designed to cause harm to other computers or computer users (ranging in degree from minor to substantial) either directly (e.g., by harming the computer system itself) or indirectly (e.g., by spying and/or stealing information to facilitate identity theft of the computer's user). For example, recent research suggests that nearly 11,000 domains are being used to serve malware in the form of fake anti-virus software.
The proliferation of malware throughout the Internet over the past several years has been extensive. In addition to commonplace techniques for spreading malware, such as spam email, attackers are constantly devising new and more sophisticated methods to infect user computers. One such technique is the widespread use of search engines as the medium for distributing malware. By manipulating search engine ranking algorithms using a variety of search engine optimization (SEO) techniques, attackers are able to poison search results for popular terms with seemingly relevant and harmless links to malicious web pages. By one estimate, in 2010, over half of the most popular keyword searches led to first-page results containing at least one link (if not many links) to a malicious web page.
Techniques are described that automatically detect search results poisoning attacks. Several implementations identify groups of suspicious uniform resource locators (URLs) containing multiple keywords and exhibiting patterns that deviate from other URLs in the same domain without crawling and evaluating the actual contents of each web page.
Several implementations are directed to identifying suspicious websites from among a plurality of websites, extracting lexical features for each suspicious website, clustering each suspicious website into a plurality of groups based on the extracted lexical features, and performing group analysis on each group to identify at least one suspicious group. Other implementations are directed to systems for automatically identifying suspicious websites based on a change in behavior, automatically clustering suspicious websites into groups based on the lexical features of the suspicious websites, and automatically analyzing the groups based on page structure similarity of the websites comprising each such group. Yet other implementations are directed to software for detecting a search engine optimization (SEO) attack comprising instructions that process a large population of URLs to identify suspicious URLs based on the presence of a subset of keywords in each URL, a similarity in structure between each URL, and the relative newness of each URL.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
To facilitate an understanding of and for the purpose of illustrating the present disclosure and various implementations, exemplary features and implementations are disclosed in, and are better understood when read in conjunction with, the accompanying drawings—it being understood, however, that the present disclosure is not limited to the specific methods, precise arrangements, and instrumentalities disclosed. Similar reference characters denote similar elements throughout the several views. In the drawings:
The web 131 allows the client computers 110 to access documents 121 containing text or multimedia and maintained and served by the server computers 120. Typically, this is done with a web browser application program 114 executing on the client computers 110. The location of each of the documents 121 may be indicated by an associated uniform resource locator (URL) 122 that is entered into the web browser application program 114 to access the document (and thus the document and the URL for that document may be used interchangeably herein without loss of generality). Many of the documents may include hyperlinks 123 to other documents 121 (with each hyperlink in the form of a URL to its corresponding document).
In order to help users locate content of interest, a search engine 140 may maintain an index 141 of documents in a memory, for example, disk storage, random access memory (RAM), or a database. In response to a query 111, the search engine 140 returns a result set 112 of documents that contain the terms (e.g., the keywords) of the query 111. To provide a high-quality user experience, search engines order search results using a ranking function that, based on the query 111 and for each document in the search result set 112, produces a score indicating how well the document matches the query 111. The ranking process may be implemented as part of a ranking engine 142 within the search engine 140.
Search Engine Optimization (SEO) is the process of optimizing web pages to rank higher in search engine results. SEO techniques can be classified into one of two categories: white-hat and black-hat. With white-hat SEO techniques, the focus of website creation is on the end user, although the page itself is still structured so that search engine crawlers can easily navigate the site; such techniques typically support legal and legitimate goals pertaining to commerce or public relations without deceptiveness or fraud. Black-hat SEO techniques, in contrast, attempt to manipulate the rankings using deceptive and possibly illegal techniques such as keyword stuffing (filling the page with irrelevant keywords), hidden text and links, cloaking (where the content provided to crawlers is different from the content provided to users), redirects, and link farming. Websites engaged in black-hat SEO techniques are often removed from the search engine's index when detected. To detect black-hat SEO web pages, some approaches may evaluate the content of suspect web pages while others may evaluate the link structures leading to suspect pages.
In an SEO attack (also referred to as a “search-result poisoning attack”), an attacker “poisons” popular search terms by making extensive use of them in SEO pages so that links to these pages show up in the search results for those terms. When a victim then uses a search engine to search for something using the popular terms, the search results will include links that point to servers controlled by the attackers. These servers are often legitimate web servers hosting legitimate websites, but these servers have been compromised by the attackers and unwittingly host the fake SEO pages. When the unsuspecting victim clicks on one of the links to a fake SEO page from among the search results, the victim is immediately redirected by the fake SEO page (and possibly redirected again several times) to an exploit server that displays a “scareware” page, for example, a page that looks like an anti-virus scan with large flashy warnings of multiple infections found on the victim's system. The purpose of this page is to scare the user into downloading and installing the purported anti-virus software offered by the website. Alternatively, the exploit servers might also try to directly compromise the victim's browser.
Sophisticated SEO attacks may leverage a very large number of compromised servers, and thus these attacks are much harder to detect. Search engines also present a particularly attractive medium for distributing malware because such attacks are relatively low cost and because search engines provide an appearance of legitimacy. Since malicious pages are typically hosted on compromised web servers, they can be utilized by the attacker for malware distribution at no cost. Moreover, since malicious web pages look relevant to search engines, these pages are indexed and presented to end users the same as other relevant pages, and users generally click on search engine results without hesitation.
Automatically detecting malicious SEO links is extremely challenging because it is prohibitively expensive to test the maliciousness of every link on the Web using content-based approaches. Even testing only those links containing keywords is difficult as malicious SEO pages may look legitimate for most requests and only deliver malicious content when certain environmental requirements are met (e.g., the use of a vulnerable browser, the redirection by search engines, or user actions like mouse movements). Indeed, without intensive reverse engineering, ascertaining the environmental settings needed to obtain malicious content on a particular page is a challenge.
To automatically detect malicious SEO links, various implementations disclosed herein may rely on the observation that successful SEO attacks have three common features: (1) successful SEO attacks automatically generate pages with relevant content; (2) successful SEO attacks target multiple popular (i.e., “trendy”) search keywords to maximize coverage; and (3) successful SEO attacks form dense link structures among their black-hat SEO pages in order to boost PageRank scoring. Stated differently, attackers first automatically generate pages that look relevant to search engines, and since one page alone is not enough to trap many victims, attackers typically generate many pages to cover a wide range of popular search keywords. Then, to promote these pages to the top of the search results, attackers hijack the reputation of compromised servers and create dense link structures to boost PageRank scoring. With regard to automatic detection of SEO attacks, the first two features described above provide distinct signs or characteristics utilized by the various implementations disclosed herein.
Since SEO links are often set up on compromised web servers, these servers usually change their behavior after being compromised. For example, new links are often added to these servers once compromised, and these links have different URL structures than the older links and URLs present on the server before being compromised. In addition, since attackers control a large number of compromised servers and generate pages using scripts, their URL structures are often very similar across several compromised domains. Therefore, SEO attacks can be detected by looking for newly created pages that share the same structure on different domains, and thereby also identify a corresponding group of compromised servers controlled by the same attacker (or the same SEO campaign) across multiple servers.
In a search engine context, a universe of URLs (e.g., with regard to 302) may correspond to over 100 billion websites, and information on these websites is gathered and indexed by the search engine using existing means (such as webcrawlers) well-known in the art. Certain of these websites, however, are unlikely to have been compromised—constituting high-confidence websites (e.g., with regard to 304)—and thus can be excluded from the analysis at the outset (although they may be later verified once an SEO attack has been identified, as described further herein).
Since search engines take into account the presence of keywords found in a URL when computing the relevance of pages for a search request, the pages produced by an SEO attack will typically feature several keywords in the URL, and in particular these keywords may comprise a group of recent keywords (e.g., popular keywords) as tracked and reported by certain search engines. Since popular keywords account for more search requests than other search terms, an SEO attack benefits from using such keywords in its fake SEO page URLs, and thus such words are extensively utilized during an SEO attack. While it is also common for legitimate websites to use links containing some popular keywords, fake URLs on compromised servers tend to make extensive use of such keywords and will be relatively new, with URL structures often different from the older URLs from that domain, thus making fake SEO pages identifiable.
For each URL that contains keywords (or a subset of keywords such as popular keywords), several implementations may extract (e.g., with regard to 306) the URL prefix before the keywords to identify the domain. For example, consider the following URL:
http://www.askania-fachmaerkte.de/images/news.php?page=lisa+roberts+gillan
In this example, the keywords in the URL are “lisa,” “roberts,” and “gillan,” and the URL prefix before the keywords is “http://www.askania-fachmaerkte.de/images/news.php?page=”. Using this information, these several implementations can determine if the corresponding website had pages starting with this particular URL prefix in the past, and thus would consider the appearance of a new URL prefix as suspicious and subject to further processing (e.g., with regard to 308).
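The prefix extraction and the new-prefix check described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the tokenization on non-alphanumeric characters, and the representation of a site's history as a set of previously seen prefixes are all assumptions made for the example.

```python
import re

def split_prefix_and_keywords(url, keywords):
    """Return (prefix, found) where prefix is the URL text before the first keyword.

    Assumption: keywords are matched as whole alphanumeric tokens, case-insensitively.
    """
    known = {k.lower() for k in keywords}
    # Remember where each alphanumeric token starts in the URL.
    tokens = [(m.group(0).lower(), m.start())
              for m in re.finditer(r"[A-Za-z0-9]+", url)]
    for i, (token, start) in enumerate(tokens):
        if token in known:
            found = [t for t, _ in tokens[i:] if t in known]
            return url[:start], found
    return url, []  # no keyword present


def is_new_prefix(prefix, historical_prefixes):
    """Hypothetical check: a prefix never seen on this site before is suspicious."""
    return prefix not in historical_prefixes


url = "http://www.askania-fachmaerkte.de/images/news.php?page=lisa+roberts+gillan"
prefix, found = split_prefix_and_keywords(url, {"lisa", "roberts", "gillan"})
```

In practice the keyword set would come from a list of popular search terms, and the history would come from the search engine's index of older URLs for the same site; both are placeholders here.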
Once websites that exhibit a change in behavior have been identified (together comprising the set of “suspicious websites”), several implementations (e.g., with regard to 206) cluster the suspicious URLs so that malicious links from the same SEO campaign will be grouped together under the assumption that they are generated by the same script. However, to accomplish this clustering, these implementations first extract the lexical features from URLs (e.g., with regard to step 204). For several implementations, one or more of the following lexical features might be selected: (a) string features, such as the separator between keywords, argument names, filenames, subdirectory names before the keywords, etc.; (b) numerical features, such as the number of arguments in the URL, the length of arguments in the URL, the length of the filename, the length of keywords, etc.; and (c) the “bag of words” comprising the keywords. Thus, for the URL from the above example, the separator between keywords is “+”, the argument name is “page”, the filename is “news.php”, the directory before the keywords is “images”, the number of arguments in the URL is one, the length of the argument name is four, the length of the filename is eight, and the bag of words is {lisa, roberts, gillan}. These particular features form a kind of signature, perhaps with an acceptable degree of variation, for fake SEO pages that are the result of an SEO attack campaign.
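A minimal sketch of this feature extraction is given below. It assumes, for illustration only, that the keywords appear in a query-string argument (as in the example URL); the function name and the dictionary keys are hypothetical and not taken from the disclosure.

```python
import re
from urllib.parse import urlsplit

WORD = re.compile(r"[A-Za-z0-9]+")

def lexical_features(url, keywords):
    """Extract string, numerical, and bag-of-words features from one URL."""
    known = {k.lower() for k in keywords}
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    filename = segments[-1] if segments else ""
    directory = segments[-2] if len(segments) > 1 else ""
    # Split the query by hand: a standard query parser would decode '+' to a
    # space and hide the separator used by the generating script.
    arguments = [pair.partition("=") for pair in parts.query.split("&") if pair]
    arg_name, separator, bag = "", "", set()
    for name, _, value in arguments:
        words = [w.lower() for w in WORD.findall(value)]
        if known.intersection(words):
            separators = re.findall(r"[^A-Za-z0-9]+", value)
            arg_name = name
            separator = separators[0] if separators else ""
            bag = set(words)
            break
    return {
        "separator": separator,            # string features
        "arg_name": arg_name,
        "filename": filename,
        "directory": directory,
        "arg_count": len(arguments),       # numerical features
        "arg_name_length": len(arg_name),
        "filename_length": len(filename),
        "bag_of_words": bag,               # bag of words
    }


features = lexical_features(
    "http://www.askania-fachmaerkte.de/images/news.php?page=lisa+roberts+gillan",
    {"lisa", "roberts", "gillan"},
)
```

URLs that embed keywords in the path rather than in an argument would need an additional branch, which is omitted here to keep the sketch short.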
Given that most malicious URLs created on the same domain have similar structures, certain implementations disclosed herein next aggregate these URL features together by domain names and study the similarities across domains (although sub-domains might be considered separate from each other since it is possible for a sub-domain to get compromised and not the entire domain). For string features, for example, certain implementations might aggregate them by taking the string that covers the most URLs in the domain. Similarly, for numerical features, certain implementations might take the median, and for bag of words might take the union of words. As such, in an implementation, this analysis might comprise the entirety of the lexical features derived for suspicious websites.
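The per-domain aggregation might look like the following sketch, which takes the most common value for each string feature, the median for each numerical feature, and the union of the bags of words. The feature names and the dictionary layout are assumptions carried over for illustration; the input list must be non-empty.

```python
from collections import Counter
from statistics import median

STRING_KEYS = ("separator", "arg_name", "filename", "directory")
NUMERIC_KEYS = ("arg_count", "arg_name_length", "filename_length")

def aggregate_domain(url_features):
    """Collapse the feature dicts of all suspicious URLs in one domain into one dict."""
    aggregated = {}
    for key in STRING_KEYS:
        # The string value that covers the most URLs in the domain.
        aggregated[key] = Counter(f[key] for f in url_features).most_common(1)[0][0]
    for key in NUMERIC_KEYS:
        aggregated[key] = median(f[key] for f in url_features)
    # Union of the keyword sets seen across the domain's URLs.
    aggregated["bag_of_words"] = set().union(*(f["bag_of_words"] for f in url_features))
    return aggregated


sample = [
    {"separator": "+", "arg_name": "page", "filename": "news.php", "directory": "images",
     "arg_count": 1, "arg_name_length": 4, "filename_length": 8,
     "bag_of_words": {"lisa", "roberts"}},
    {"separator": "+", "arg_name": "page", "filename": "news.php", "directory": "images",
     "arg_count": 1, "arg_name_length": 4, "filename_length": 8,
     "bag_of_words": {"gillan"}},
    {"separator": "-", "arg_name": "id", "filename": "index.php", "directory": "",
     "arg_count": 2, "arg_name_length": 2, "filename_length": 9,
     "bag_of_words": {"weather"}},
]
domain_features = aggregate_domain(sample)
```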
Using these lexical features, several implementations (e.g., with regard to 206) then cluster URLs together into groups. Certain such implementations might use the well-known K-means++ method to initially select K centroids that are distant from each other and then apply the K-means algorithm to compute K clusters. Clusters that are tight (i.e., having a residual sum of squares within a certain threshold) are then outputted (and temporarily set aside) while, for the remaining data points, the K-means algorithm is reapplied iteratively until no more big clusters (e.g., having 10 domains or more) can be selected.
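The iterative clustering loop can be sketched as below. To stay self-contained, the sketch assumes every domain has already been reduced to a tuple of numbers and uses squared Euclidean distance; the function names, the fixed iteration count, and the parameters `max_rss`, `min_size`, and `max_rounds` are illustrative assumptions rather than values from the disclosure.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """K-means++ seeding: prefer new centroids far from those already chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # every point coincides with a chosen centroid
            break
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd iterations; returns a list of (centroid, members) pairs."""
    centroids = kmeanspp_init(points, k, rng)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return list(zip(centroids, clusters))

def iterative_tight_clusters(points, k, max_rss, min_size, seed=0, max_rounds=10):
    """Set aside tight, big clusters; re-cluster the rest until none qualify."""
    rng = random.Random(seed)
    remaining, accepted = list(points), []
    for _ in range(max_rounds):
        if len(remaining) < min_size:
            break
        tight = []
        for centroid, members in kmeans(remaining, min(k, len(remaining)), rng):
            rss = sum(sq_dist(p, centroid) for p in members)
            if len(members) >= min_size and rss <= max_rss:
                tight.append(members)
        if not tight:
            break
        accepted.extend(tight)
        taken = {id(p) for cluster in tight for p in cluster}
        remaining = [p for p in remaining if id(p) not in taken]
    return accepted, remaining
```

A call such as `iterative_tight_clusters(domain_vectors, k=20, max_rss=5.0, min_size=10)` would return the accepted groups together with the domains that never joined a tight cluster; the numbers shown are placeholders that would need tuning on real data.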
The computation of distances between data points and a cluster centroid may not be straightforward for certain features as many of the features are not numerical values, yet can be made into numerical values depending on the implementation. For example, for string features in certain implementations, it may be determined (e.g., by comparison) whether two values are identical. Similarly, the distance between two bags of words (e.g., A = {a1, a2, ..., an} and B = {b1, b2, ..., bm}) might be defined as a ratio measure of normalized union and disjunction between the words comprising each bag of words. It is contemplated that several implementations may use the same approach taken for aggregating URL features into domain features for computing centroids.
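One way to turn these non-numerical comparisons into distances is sketched below. The string distance is a simple identity test. For the bag-of-words distance, the text does not give an exact formula, so the sketch substitutes the Jaccard distance (one minus the ratio of shared words to all words) as a stand-in; that choice, like the function names, is an assumption.

```python
def string_distance(a, b):
    """0.0 when two string feature values are identical, 1.0 otherwise."""
    return 0.0 if a == b else 1.0

def bag_distance(a, b):
    """Jaccard distance between two bags of words (assumed stand-in measure)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty bags are treated as identical
    return 1.0 - len(a & b) / len(union)


same_separator = string_distance("+", "+")
different_filename = string_distance("news.php", "index.php")
overlap = bag_distance({"lisa", "roberts"}, {"roberts", "gillan"})
```

Both distances lie between 0 and 1, so they can be summed with suitably scaled numerical differences to obtain a single distance between a domain and a centroid.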
Several implementations (e.g., with regard to step 208) perform group analysis to identify the compromised domain groups and filter out the legitimate groups formed during clustering. Since SEO links in an SEO attack campaign possess similar page structures, one approach for such analysis is to quantify simple page features such as, for example, the number of URLs in each web page in each cluster. In operation, these implementations might sample N (e.g., 100) pages from each cluster group and then crawl these pages to extract the number of URLs per page and build a histogram from this data. The premise here is that a histogram for a legitimate group would have a diverse and varying number of links (URLs) per web page (a relatively flat and well-distributed histogram), while a histogram for a malicious cluster group (one that has very similar pages with an almost identical number of URLs per page) would exhibit peaks around the number of links (e.g., URLs) consistently found in fake SEO pages. Therefore, it is fairly straightforward for certain such implementations to normalize each histogram and then compute peaks in the normalized histograms such that, if the histogram has unusually high peak values, which would not be expected in unrelated web pages, the group having such an unexpected peak is confirmed as suspicious.
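The histogram test can be illustrated with the short sketch below. It assumes the sampled pages have already been crawled and reduced to a list of link counts; the bin width and the peak threshold of 0.5 are hypothetical tuning parameters, not values from the disclosure.

```python
from collections import Counter

def peak_fraction(link_counts, bin_width=1):
    """Height of the tallest bin in the normalized histogram of links per page."""
    histogram = Counter(count // bin_width for count in link_counts)
    return max(histogram.values()) / len(link_counts)

def is_suspicious_group(link_counts, peak_threshold=0.5, bin_width=1):
    """Flag a group whose pages cluster around one link count."""
    return peak_fraction(link_counts, bin_width) >= peak_threshold


# 90 near-identical pages plus 10 unrelated ones, versus 100 varied pages.
templated_group = [40] * 90 + list(range(10))
varied_group = list(range(100))
templated_peak = peak_fraction(templated_group)
varied_peak = peak_fraction(varied_group)
```

A wider bin tolerates pages whose link counts differ slightly, at the cost of making legitimate groups look a little more peaked.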
These final groups that are confirmed as suspicious may then be manually checked and verified in certain implementations to ensure that they are indeed the result of a malware campaign and not the result of some other legitimate effort. For example, these final groups—which are expected to be relatively small in number and therefore easily manageable—might be manually checked or, in certain implementations, might have the content of such pages automatically analyzed using methodologies that were not suitable for finding the universe of URLs but are workable for confirming the validity of smaller groups of URLs.
It is also expected that, for each malicious group identified, certain implementations might also output regular expression signatures using signature generation systems known and appreciated by skilled artisans. Since URLs within a group have similar features, it is expected that most groups would only output one signature. In such implementations, these derived signatures could then be applied to large search engines to capture a broad set of attacks appearing in search engine results, and the fake SEO pages stemming from these attacks could be blocked and prevented from being listed in search engine results. In other words, search engines could employ subsystems for evaluating search results using the regular expression signatures to identify malicious links and then filter out those links to protect the search engine users. These signatures could also be applied to the websites of high-confidence servers skipped early in the detection process, thereby performing a back-end check, since it is possible that even high-confidence servers have been compromised.
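Applying such signatures to a list of search results might look like the following sketch. The regular expression shown is hand-written for illustration from the example URL discussed earlier and is not the output of any actual signature generation system; the function name is likewise hypothetical.

```python
import re

# Hypothetical signature: a "news.php" script with a "page" argument holding
# two or more keywords joined by '+'.
EXAMPLE_SIGNATURE = re.compile(
    r"^https?://[^/]+/(?:[^/]+/)*news\.php\?page=[a-z0-9]+(?:\+[a-z0-9]+)+$",
    re.IGNORECASE,
)

def filter_results(urls, signatures):
    """Split result URLs into (kept, blocked) according to the signatures."""
    kept, blocked = [], []
    for url in urls:
        if any(signature.search(url) for signature in signatures):
            blocked.append(url)
        else:
            kept.append(url)
    return kept, blocked


results = [
    "http://www.askania-fachmaerkte.de/images/news.php?page=lisa+roberts+gillan",
    "http://example.com/news.php?page=2",
    "https://example.org/about",
]
kept, blocked = filter_results(results, [EXAMPLE_SIGNATURE])
```

The same function could be run over URLs from the high-confidence websites that were skipped earlier, which corresponds to the back-end check described above.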
Computer-executable instructions, such as program modules, being executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computing device 400 may have additional features/functionality. For example, computing device 400 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in
Computing device 400 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by device 400 and includes both volatile and non-volatile media, removable and non-removable media.
Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Memory 404, removable storage 408, and non-removable storage 410 are all examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 400. Any such computer storage media may be part of computing device 400.
Computing device 400 may contain communications connection(s) 412 that allow the device to communicate with other devices. Computing device 400 may also have input device(s) 414 such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 416 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium where, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter.
Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include personal computers, network servers, and handheld devices, for example.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.