A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
With the increased use of online advertising models that reward the owners of web hosts whenever an advertisement is viewed (pay-per-view advertising) or clicked on (pay-per-click advertising), it has become common for people to create hosts solely for the purpose of getting advertisements viewed by users. Such hosts are referred to as “spam” hosts (because of their similarity to “spam” e-mail, i.e., they are unsolicited displays of advertising) to distinguish them from hosts that primarily provide content of value (e.g., news articles, blogs, some service, etc.), other than advertisements.
Because most hosts on the Internet are found through search engines such as those provided by Yahoo! and Google, operators of spam hosts have an incentive to make their hosts appear as close as possible to the top of the results of any given search query entered into a search engine. Spam host operators typically do this by placing random or targeted content and links on the pages of their hosts in order to trick the analysis algorithms used by search engines into ranking the spam host higher in a given query's search results.
Conversely, the operators of the search engines recognize that users performing searches are not interested in viewing spam hosts. Thus, search engines have an incentive to make spam hosts appear near or at the bottom of the search results generated by any given search query, or to omit the spam hosts altogether from the search results. However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host.
Systems and methods for identifying spam hosts are disclosed in which hosts are known to the system and initially classified as spam or non-spam. Then the hosts are partitioned into clusters based on how each host is linked to other hosts. Each cluster is then analyzed and, depending on the number of spam and non-spam hosts it contains, the cluster may be classified as a spam cluster or a non-spam cluster. The hosts within the cluster may then be reclassified based on the cluster's classification. The results may then be used in many different ways, including to filter search results based on host classifications so that spam hosts are either not displayed or are displayed last in a results set.
In one aspect, the disclosure describes a method for identifying spam hosts within a set of hosts. The method includes indexing content on each host within the set of hosts on a network and indexing links on each host within the set of hosts on the network. Then each host is classified with a host spamicity value identifying the host as spam or non-spam based on an analysis of the information known about that host. A subset of the set of hosts is partitioned into a cluster based on each host's links to other hosts. The cluster is then classified with a cluster spamicity value based on the host spamicity values of the subset of hosts within the cluster. Based on the cluster spamicity value, all hosts in the cluster are reclassified with the same host spamicity value, thereby identifying all hosts in the cluster as either spam or non-spam. The hosts in the cluster may be reclassified only if the cluster spamicity value exceeds some predetermined threshold, with those hosts not being reclassified retaining their original classification.
In another aspect, the disclosure describes a computer-readable medium storing computer executable instructions for a method of presenting a list of hosts as search results in response to a search query. The method includes receiving, from a requestor, a search query requesting a list of hosts matching a search term and identifying hosts matching the search term. The method further includes assigning a host spamicity value to each host matching the search term based on content and links on that host, the host spamicity value of each host identifying the host as either a spam host or a non-spam host. The list of the hosts matching the search term is then presented to the requestor, in which the list is sorted at least in part based on the host spamicity value of each host in the list.
In another aspect, the disclosure describes a system for generating a list of search results. The system includes a spam host identification module that identifies each of a plurality of hosts as either a spam host or a non-spam host based on content and links on that host. The spam host identification module may include a prediction module that initially classifies each host in the plurality of hosts as either a spam host or a non-spam host based on at least the content on that host. The spam host identification module may also include a clustering module that partitions the plurality of hosts into one or more clusters based on each host's links to other hosts and classifies each of the one or more clusters with a different cluster spamicity value based on the number of hosts within the cluster initially classified as spam hosts and non-spam hosts. The spam host identification module may also include a reclassification module that changes the initial classifications for each host in a first cluster based on a comparison of the first cluster's cluster spamicity value to one or more predetermined threshold values. The reclassification module may reclassify all hosts within the first cluster as spam hosts if the cluster spamicity value of the first cluster is less than a spam host threshold value. The reclassification module may also reclassify all hosts within the first cluster as non-spam hosts if the cluster spamicity value of the first cluster is greater than a non-spam host threshold value.
These and various other features as well as advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. Additional features are set forth in the description that follows and, in part, will be apparent from the description, or may be learned by practice of the described embodiments. The benefits and features will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
The following drawing figures, which form a part of this application, are illustrative of embodiments of the systems and methods described below and are not meant to limit the scope of the disclosure in any manner, which scope shall be based on the claims appended hereto.
The systems and methods described herein will classify each host 102, 104, 106, 108, 110 as either spam or non-spam. Because each host may provide access to one or more web pages or other types of content, classifying a host as spam will result in the classification of web pages on the host as spam also. For example, in an embodiment in which the TLD “www.cnn.com” is on a host 102, all the web pages and other content accessible via sub domains under the TLD (e.g., cnn.com/politics/todaysstory.htm and cnn.com/technology/fuelcell.htm) are classified the same as that host 102.
The architecture 100 illustrated is a networked client/server architecture in which some of the computing devices, such as the hosts 102, 104, 106, 108, 110, are referred to as “servers” in that they serve requests for content (e.g., web pages and services), and other computing devices are referred to as “clients” that issue requests for content to servers. For example, in the embodiment shown the spam identification module 114 is incorporated into a server 112 that can serve web page search requests from clients, such as the client computing device 130 shown. The systems and methods described herein are suitable for use with other architectures as discussed in greater detail below.
Computing devices are well known in the art. For the purposes of this disclosure, a computing device such as the client 130, server 112 or host 102, 104, 106, 108, 110 includes a processor and memory for storing and executing data and software. Computing devices may be provided with operating systems that allow the execution of software applications in order to manipulate data. Examples of operating systems include an operating system suitable for controlling the operation of a networked server computer, such as the WINDOWS XP or WINDOWS 2003 operating systems from MICROSOFT CORPORATION.
In order to store the software and data files, a computing device may include a mass storage device in addition to the memory of the computing device. Local data structures, including discrete web pages such as .HTML files, may be stored on a mass storage device (not shown) that is connected to, or part of, any of the computing devices described herein. A mass storage device includes some form of computer-readable media and provides non-volatile storage of data for later use by one or more computing devices. Although the description of computer-readable media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, it should be appreciated by those skilled in the art that computer-readable media can be any available media that can be accessed by a computing device.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
As discussed above,
As shown in
The server 112 includes a number of modules. The term module is used to remind the reader that the functions and systems described herein may be embodied in a software application executing on a processor, in a piece of hardware purpose-built to perform a function or in a system embodied by a combination of software, hardware and firmware.
The server 112 includes a spam host identification module 114 as shown. As discussed in greater detail below, the spam host identification module 114 takes information previously indexed about the hosts, such as by a search engine 122 or web crawler (not shown), and initially classifies all of the hosts as either spam hosts or non-spam hosts. Then, by clustering, it may further reclassify each host with a final classification as either spam or non-spam.
In the embodiment shown, the spam host identification module 114 includes a prediction module 116. The prediction module 116 analyzes the information in the index 124 and initially classifies each host with a value indicative of the results of a preliminary analysis of whether that host is a spam host or a non-spam host. This value shall be referred to as the host “spamicity” value.
In an embodiment, the spamicity value may be a simple binary 0 or 1, such that, for example, 1 indicates a spam host and 0 indicates a non-spam host (or vice versa). Alternatively, a more complicated algorithm may be used that provides a host spamicity value ranging between an upper limit and a lower limit that reflects the confidence of the algorithm in identifying a host as spam or non-spam.
In the embodiment shown, the spamicity values for hosts evaluated by the spam host identification module 114 are stored in a classification datastore 150. In an alternative embodiment, the spamicity values or equivalent information classifying hosts as spam/non-spam may be stored in the index 124 of host information so that all host information is stored in a single datastore.
In addition to the prediction module, a clustering module 118 may be provided as shown. As discussed in greater detail below, the clustering module clusters hosts using any of a known set of techniques based on how the hosts are linked to other hosts on the Internet. In an embodiment, the clustering module clusters all of the hosts known in the index into any number of clusters that may be previously defined or may be determined by the clustering module itself during analysis. In an embodiment, hosts that are heavily cross-linked are clustered into a single cluster and no host appears in more than one cluster. Thus, the clustering module may be considered to partition the hosts known in the index into different clusters.
In addition to the partitioning, the clustering module also classifies each cluster it creates as either a spam or non-spam cluster. In an embodiment, this is done by assigning a cluster spamicity value to each cluster. The clustering module generates a cluster spamicity value by an analysis of the host spamicity values of each of the hosts within each cluster as described in greater detail below.
The spam host identification module 114 further includes a reclassification module 120 as shown. The reclassification module 120 uses the information determined by the clustering module, e.g., the cluster spamicity value, and subsequently reclassifies each of the hosts based on the cluster spamicity value. Because the cluster spamicity value is based on the initial host spamicity values generated by the prediction module 116, reclassifying is ultimately based on both the results of the clustering module 118 and the results of the prediction module 116.
In the embodiment shown in
In the embodiment shown, the spam host identification (ID) module 114 is used to classify each of the hosts known to the system as either spam or non-spam. Using this classification, the search engine then can order the search results 126 so that spam hosts, i.e., those hosts identified as spam hosts by the spam host ID module 114, appear toward the end of the search results. Alternatively, spam hosts may be filtered so that they do not appear in the search results 126 at all.
In yet another embodiment, the spam host ID module 114 may be implemented as a standalone module or service on its own server or computing device (not shown) that continuously analyzes the data maintained in the index 124 and assigns each host a host spamicity value, which is stored in the index. Independent search engines on other servers then need only inspect the data maintained in the index 124, which now includes the host spamicity value for each host so that search results 126 may be easily ordered based on whether a host is spam or non-spam.
The web pages 204, 210 further include one or more links 206. The links may be hyperlinks or other references to other web pages on the same or other hosts accessible via the network 101. It is now common for websites to have many links on each web page. A link may be user-selectable text, a control or an icon that causes the user's browser to retrieve the linked page. Alternatively, the link may be a simple address in text that the user's device identifies as a link to the page identified by the address.
Based on the information contained in the index, the method 300 further classifies each host in the index as either spam or non-spam in a classification operation 304. The initial classification of hosts as spam or non-spam may be done using any method known in the art, including methods in which the content of the pages of the hosts is analyzed, the number and type of links on each page in the host are analyzed, or both. Methods and algorithms for initially classifying a set of host web pages based on content and link features are known in the art. Any such method may be used in order to obtain an initial classification of each of the hosts.
In an embodiment, each host, regardless of the number of web pages associated with the host, is assigned its own value. In an alternative embodiment, each web page on a host may be assigned a web page spamicity value, thereby, in effect, modifying the systems described herein to be a spam web page identification system, as opposed to a spam host identification system.
After the hosts have been classified in the classification operation 304, a partition operation 306 is performed. In the partition operation, the hosts, based on their previous identification as spam or non-spam, are partitioned into one or more clusters. Several clustering techniques are known in the art for clustering data based on information known about the data and any suitable technique may be used. In the embodiment shown, the hosts are clustered based on their content, their links, or both. The result of the partitioning operation is a plurality of clusters, each containing a subset of the hosts analyzed by the system.
After the partitioning, a cluster classification operation 308 assigns each cluster created by the partitioning operation 306 a cluster spamicity value based on the initial classifications of the hosts in the cluster. In an embodiment, the cluster spamicity value may be a simple average of the host spamicity values of the hosts in the cluster. For example, if a cluster contained five hosts, three of which were spam and assigned a host spamicity value of 1, and two of which were non-spam and assigned a host spamicity value of 0, the cluster spamicity value may be three divided by five, or 0.6. As another example, take the case of a cluster having 100 hosts in which 90 percent of the hosts are identified as non-spam and ten percent of the hosts are identified as spam, thereby resulting in a cluster spamicity value of 0.1. Many other ways of assigning a cluster spamicity value to the clusters may be used.
After a cluster spamicity value has been assigned to each cluster, each cluster spamicity value is then compared in a comparison operation 310 to predetermined thresholds for spam and non-spam. Based on the comparison, a reclassification operation 312 is performed in which one of three actions is performed in response to the detection of one of three conditions. First, if a cluster spamicity value exceeds a predetermined spam threshold, then each host in that cluster is reclassified as spam regardless of its initial classification assigned in the classification operation 304. For example, if the spam threshold is 0.85, then neither of the cluster examples provided above would exceed the spam threshold, as the cluster spamicity values 0.6 and 0.1 are both below it.
Likewise, a non-spam threshold may also be designated. If the cluster spamicity value for any of the clusters exceeds (in this case, falls below) a predetermined non-spam threshold, then each host in that cluster is reclassified with a host spamicity value indicating it is non-spam regardless of the initial classification determined in the classification operation 304. Continuing the examples provided above, if a non-spam threshold of 0.15 has been predetermined by the operators of the spam host detection system, then the cluster having a cluster spamicity value of 0.6 would not be reclassified; however, the cluster having a cluster spamicity value of 0.1 would be reclassified so that all of the 100 hosts within that cluster would now have a host spamicity value of 0, indicating that they are non-spam hosts.
The third possible outcome of the reclassification operation 312 occurs if the cluster spamicity value exceeds neither threshold, in which case the initial classification as determined in the classification operation 304 is retained. These three conditions (i.e., reclassifying all hosts in a cluster that exceeds a spam threshold, reclassifying all hosts in a cluster that exceeds a non-spam threshold, or not reclassifying a cluster that exceeds neither threshold) may be considered a single reclassification operation 312 in which all hosts in each cluster are reclassified, if necessary, based on the cluster spamicity value, which itself was derived from the initial classifications of the hosts. Regardless of the outcome of the reclassification operation 312, all hosts known to the system or being analyzed under this method are now identified as either a spam host or a non-spam host by the assignment of a final host spamicity value.
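The cluster classification, comparison and reclassification operations described above can be illustrated with a minimal Python sketch. It assumes binary host spamicity values (1 for spam, 0 for non-spam), the simple-average cluster spamicity, and the example thresholds of 0.85 and 0.15; the function names are hypothetical and are not taken from the disclosure.

```python
def cluster_spamicity(host_values):
    """Simple average of the host spamicity values (1 = spam, 0 = non-spam)."""
    return sum(host_values) / len(host_values)


def reclassify(host_values, spam_threshold=0.85, non_spam_threshold=0.15):
    """Return the final host spamicity values for one cluster.

    The threshold defaults are the example values from the text; they are
    illustrative assumptions, not required settings.
    """
    score = cluster_spamicity(host_values)
    if score > spam_threshold:
        # Condition 1: the cluster is spam, so every host becomes spam.
        return [1] * len(host_values)
    if score < non_spam_threshold:
        # Condition 2: the cluster is non-spam, so every host becomes non-spam.
        return [0] * len(host_values)
    # Condition 3: neither threshold is exceeded; keep the initial labels.
    return list(host_values)


# The two example clusters from the text.
five_host_cluster = [1, 1, 1, 0, 0]           # cluster spamicity 0.6
hundred_host_cluster = [1] * 10 + [0] * 90    # cluster spamicity 0.1

final_five = reclassify(five_host_cluster)
final_hundred = reclassify(hundred_host_cluster)
```

With these inputs the five-host cluster keeps its initial labels, while all 100 hosts of the second cluster end up with a host spamicity value of 0.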
In the embodiment shown, a search engine receives a search query including search terms from a user in a receive search operation 402. In response, the search engine identifies hosts matching the search terms in a host identification operation 404. The host identification operation 404 may include identifying specific web pages within hosts that match the search terms from the search query or may only identify host sites that contain pages that match the search query. In order to match a host to a search query, the same content in the index used to classify hosts as spam or non-spam may be used to match the hosts to the query. Such matching algorithms are known in the art and any suitable algorithm may be used to identify hosts matching a search term.
Following the host identification operation 404, a retrieve classification operation 406 is performed. In the embodiment, the spam/non-spam classification derived by the method described with reference to
After the spam/non-spam classification of each host has been retrieved, search results are then filtered and/or sorted, based on whether each host identified in the search results is spam or non-spam and then presented to the requestor of the search in a search result presentation operation 408. The search results may be sorted so that spam hosts appear at the bottom of the search results. Alternatively, spam hosts may be identified in some way to the user/requestor in the search result. In yet another embodiment, hosts identified as spam may be filtered out of the search result entirely so that they are not presented to the user in response to the user's search query. Such filtering may be in response to a user default in which the user requests that the search engine not transmit any search results likely to be spam hosts.
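The sorting and filtering alternatives of the presentation operation 408 can be sketched as follows. This is only an illustration: the function name, the `filter_spam` flag, the host names, and the choice to treat hosts with no stored classification as non-spam are all assumptions, and the input list is assumed to already be in relevance order.

```python
def present_results(matching_hosts, spamicity, filter_spam=False):
    """Order or filter search results by host spamicity.

    matching_hosts: host names in relevance order (assumed input format).
    spamicity: dict mapping host name to 1 (spam) or 0 (non-spam).
    Hosts missing from the dict are treated as non-spam (an assumption).
    """
    if filter_spam:
        # Drop spam hosts entirely from the result list.
        return [h for h in matching_hosts if spamicity.get(h, 0) == 0]
    # sorted() is stable, so relevance order is kept within each group
    # while spam hosts move to the end of the list.
    return sorted(matching_hosts, key=lambda h: spamicity.get(h, 0))


# Hypothetical hosts and classifications.
hosts = ["spam-a.example", "news.example", "spam-b.example", "blog.example"]
labels = {"spam-a.example": 1, "news.example": 0,
          "spam-b.example": 1, "blog.example": 0}

sorted_results = present_results(hosts, labels)
filtered_results = present_results(hosts, labels, filter_spam=True)
```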
The method 400 for generating search results identified in
In yet another embodiment, the systems and methods described above may be used to detect spam registrations in directories of hosts. Hosts identified as spam by the clustering method may then be removed from the registry or flagged as spam.
In yet another embodiment, the systems and methods described above may be used to detect spam or abusive replies in a forum context, e.g., Yahoo! Answers. Again, answers and hosts associated with answers identified as spam by the clustering method may then be removed from the forum or flagged, sorted or displayed as spam.
The following is a description of an embodiment of a spam host identification system and method that was created and tested against a known dataset of spam and non-spam hosts to determine its efficacy. The WEBSPAM-UK2006 dataset, a publicly available Web spam collection, was used as the initial dataset upon which to test the spam host identification system. It is based on a crawl of the .uk domain done in May 2006, including 77.9 million pages and over 3 billion links in about 11,400 hosts.
This reference collection has been tagged at the host level by a group of volunteers. The assessors labeled hosts as “normal”, “borderline” or “spam”, and were paired so that each sampled host was labeled by two persons independently. For the ground truth, only hosts for which the assessors agreed were used, plus the hosts in the collection marked as non-spam because they belong to special domains such as police.uk or .gov.uk.
The benefit of labeling hosts instead of individual pages is that a large coverage can be obtained, meaning that the sample includes several types of Web spam, and the useful link information among them. Since about 2,725 hosts were evaluated by at least two assessors, a tagging with the same resources at the level of pages would have been either completely disconnected (if pages were sampled uniformly at random), or would have had a much smaller coverage (if a subset of sites were picked at the beginning for sampling pages). On the other hand, the drawback of the host-level tagging is that a few hosts contain a mix of spam/non-spam content, which increases the classification errors. Domain-level tagging is another embodiment that could be used.
Evaluation. The evaluation of the overall process is based on a set of measures commonly used in Machine Learning and Information Retrieval. Given a classification algorithm C, we consider its confusion matrix:

                          Predicted non-spam    Predicted spam
True label: non-spam              a                   b
True label: spam                  c                   d

where a represents the number of non-spam examples that were correctly classified, b represents the number of non-spam examples that were falsely classified as spam, c represents the number of spam examples that were falsely classified as non-spam, and d represents the number of spam examples that were correctly classified. We consider the following measures: true positive rate (or recall), false positive rate and F-measure. The recall R is d/(c+d). The false positive rate is defined as b/(b+a). The F-measure is defined as F=2PR/(P+R), where P is the precision P=d/(b+d).
For evaluating the classification algorithms, we focus on the F-measure F as it is a standard way of summarizing both precision and recall. We also report the true positive rate and false positive rate as they have a direct interpretation in practice. The true positive rate R is the fraction of spam that is detected (and thus deleted or demoted) by the search engine. The false positive rate is the fraction of non-spam objects that are mistakenly considered to be spam by the automatic classifier.
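The three measures can be computed directly from the confusion-matrix counts a, b, c and d defined above. The following Python sketch shows the arithmetic; the function name and the example counts are hypothetical and do not come from the experiments described here.

```python
def evaluation_measures(a, b, c, d):
    """Compute recall, false positive rate and F-measure.

    a: non-spam correctly classified     b: non-spam classified as spam
    c: spam classified as non-spam       d: spam correctly classified
    """
    recall = d / (c + d)                 # true positive rate R
    false_positive_rate = b / (b + a)
    precision = d / (b + d)              # P
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, false_positive_rate, f_measure


# Hypothetical counts, chosen only to illustrate the formulas.
recall, fpr, f_measure = evaluation_measures(a=90, b=10, c=20, d=80)
```

For these illustrative counts the recall is 0.8, the false positive rate is 0.1, and the F-measure is 16/19, or roughly 0.84.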
Cross-validation. All the predictions reported here were computed using tenfold cross-validation. For each classifier, the true positive rate, false positive rate and F-measure were reported. A classifier whose performance is to be estimated is trained ten times, each time using nine of the ten partitions as training data and computing the confusion matrix using the tenth partition as test data. The resulting ten confusion matrices are then averaged and the evaluation metrics are computed on the average confusion matrix.
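The fold-and-average procedure can be sketched in Python as shown below. The sketch assumes labels of 0 (non-spam) and 1 (spam), partitions the data by interleaving indices, and uses a toy one-feature threshold learner purely as a stand-in; none of these choices is specified by the experiment, and the names are hypothetical.

```python
def cross_validated_confusion(examples, labels, train, folds=10):
    """Average the per-fold confusion matrices (a, b, c, d).

    train(xs, ys) must return a predict(x) function yielding 0 or 1.
    Partition k holds indices k, k + folds, k + 2*folds, ... (an assumption).
    """
    totals = [0.0, 0.0, 0.0, 0.0]  # a, b, c, d summed over all folds
    n = len(examples)
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        train_x = [x for i, x in enumerate(examples) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        predict = train(train_x, train_y)
        for i in test_idx:
            # Index 0..3 maps to a, b, c, d: 2 * actual + predicted.
            totals[2 * labels[i] + predict(examples[i])] += 1
    return tuple(t / folds for t in totals)


def train_threshold(xs, ys):
    """Toy learner: split at the midpoint between the two class means."""
    spam = [x for x, y in zip(xs, ys) if y == 1]
    ham = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda x: 1 if x > cut else 0


# Synthetic, well-separated data: ten non-spam and ten spam examples.
xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
average_matrix = cross_validated_confusion(xs, ys, train_threshold)
```

The averaged counts returned here can then be used to compute recall, false positive rate and F-measure as defined in the evaluation section.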
For the content analysis, a summary of the content of each host was obtained by taking the first 400 pages reachable by breadth-first search. The summarized sample contains 3.3 million pages. All of the content data used in the rest of this description were extracted from a summarized version of the crawl. Note that the assessors spent on average 5 minutes per host, so the vast majority of the pages they inspected were contained in the summarized sample.
The spam detection system then used a cost-sensitive decision tree as part of the classification process. Experimentally, this classification algorithm was found to work better than other methods tried. The features used to learn the tree were derived from a combined approach based on link and content analysis to detect different types of Web spam pages.
The features used to build the classifiers were link-based features extracted from the Web graph and host graph, and content-based features extracted from individual pages. For the link-based features, the method described by L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates in “Link-based characterization and detection of Web Spam,” in AIRWeb, 2006, was used. For the content-based features, the method described by A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly in “Detecting spam web pages through content analysis,” in WWW, pages 83-92, Edinburgh, Scotland, May 2006, was used.
Most of the link-based features were computed for the home page and for the page in each host with the maximum PageRank. The remainder of the link features were computed directly over the graph between hosts (obtained by collapsing together pages of the same host).
Degree-related measures. A number of measures were computed related to the in-degree and out-degree of the hosts and their neighbors. In addition, various other measures were considered, such as the edge-reciprocity (the number of links that are reciprocal) and the assortativity (the ratio between the degree of a particular page and the average degree of its neighbors). A total of 16 degree-related features was obtained.
PageRank. PageRank is a well known link-based ranking algorithm that computes a score for each page. Various measures related to the PageRank of a page and the PageRank of its in-link neighbors were calculated to obtain a total of 11 PageRank-based features.
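For readers unfamiliar with PageRank, the following is a minimal textbook power-iteration sketch in Python. It is not the implementation used in the experiment: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the toy graph are all assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    out_links: dict mapping each page to the list of pages it links to.
    Every linked page must also appear as a key in the dict.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives the same teleportation share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its score over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank


# Toy graph: a three-page cycle plus one page "d" that only links in.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(toy_graph)
```

In this toy graph page "a" receives links from both "c" and "d" and ends up with the highest score, while "d", which has no in-links, keeps only the teleportation share.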
TrustRank. Gyöngyi et al. (Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, 2004) introduced the idea that if a page has high PageRank, but it does not have any relationship with a set of known trusted pages, then it is likely to be a spam page. TrustRank is an algorithm that, starting from a subset of hand-picked trusted nodes and propagating their labels through the Web graph, estimates a TrustRank score for each page. Using TrustRank, the spam mass of a page, i.e., the amount of PageRank received from a spammer, may be estimated. The performance of TrustRank depends on the seed set. In the experiment, 3,800 nodes were chosen at random from the Open Directory Project, excluding those that were labeled as spam. It was observed that the relative non-spam mass for the home page of each host (the ratio between the TrustRank score and the PageRank score) is a very effective measure for separating spam from non-spam hosts. However, using this measure alone is not sufficient for building an automatic classifier because it would yield a high number of false positives (around 25%).
Truncated PageRank. Becchetti et al. described Truncated PageRank, a variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors.
Estimation of supporters. Given two nodes x and y, it is said that x is a d-supporter of y if the shortest path from x to y has length d. Let $N_d(x)$ be the set of the d-supporters of page x. An algorithm for estimating the set $N_d(x)$ for all pages x based on probabilistic counting is described in Becchetti et al. For each page x, the cardinality of the set $N_d(x)$ is an increasing function with respect to d. A measure of interest is the bottleneck number $b_d(x)$ of page x, which we define to be $b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. This measure indicates the minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and that they have smaller bottleneck numbers than non-spam pages.
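The bottleneck number can be illustrated on a small graph with the Python sketch below. It departs from the experiment in two labeled ways: it counts supporters exactly with a breadth-first search over in-links rather than using the probabilistic counting of Becchetti et al., and it reads the definition literally, so that the set for distance j holds the pages whose shortest path to x has length exactly j (with the set for distance 0 containing only x itself). Ratios whose denominator is zero are skipped, which is also an assumption. The function names and toy graphs are hypothetical.

```python
def supporter_counts(in_links, x, max_distance):
    """Return [|N_0(x)|, ..., |N_d(x)|] using exact breadth-first search.

    in_links: dict mapping a page to the list of pages that link to it.
    """
    seen = {x}
    frontier = [x]
    counts = [1]  # N_0(x) is taken to be {x}
    for _ in range(max_distance):
        next_frontier = []
        for node in frontier:
            for source in in_links.get(node, []):
                if source not in seen:
                    seen.add(source)
                    next_frontier.append(source)
        counts.append(len(next_frontier))
        frontier = next_frontier
    return counts


def bottleneck_number(in_links, x, max_distance):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for 1 <= j <= max_distance."""
    counts = supporter_counts(in_links, x, max_distance)
    ratios = [counts[j] / counts[j - 1]
              for j in range(1, max_distance + 1) if counts[j - 1] > 0]
    return min(ratios)


# Well-supported page: 3 supporters at distance 1, 6 at distance 2.
well_linked = {"x": ["a", "b", "c"],
               "a": ["d", "e"], "b": ["f", "g"], "c": ["h", "i"]}
# Isolated pair of pages that only link to each other.
isolated = {"s": ["t"], "t": ["s"]}

b_well_linked = bottleneck_number(well_linked, "x", 2)
b_isolated = bottleneck_number(isolated, "s", 2)
```

The isolated pair yields a smaller bottleneck number than the well-supported page, which matches the expectation stated above for spam clusters.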
For each web page in the data set a number of features were extracted based on the content of the pages. Most of the features reported by Ntoulas et al. were used, with the addition of new ones such as the entropy (see below), which is meant to capture the compressibility of the page. Ntoulas et al. use a set of features that measures the precision and recall of the words in a page with respect to the set of the most popular terms in the whole web collection. A new set of features was also used that measures the precision and recall of the words in a page with respect to the q most frequent terms from a query log, where q=100, 200, 500, 1000.
Number of words in the page, number of words in the title, average word length. For these features, only the words in the visible text of a page were counted, and only words consisting solely of alphanumeric characters were considered.
Fraction of anchor text. This feature was defined as the fraction of the number of words in the anchor text to the total number of words in the page.
Fraction of visible text. The fraction of the number of words in the visible text to the total number of words in the page, including HTML tags and other invisible text, was also used as a feature.
Compression rate. The visible text of the page was compressed using bzip. Compression rate is the ratio of the size of the compressed text to the size of the uncompressed text.
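The compression rate may be illustrated with the bz2 module of the Python standard library, which implements bzip2. The choice of bzip2, the UTF-8 encoding, and the value returned for an empty page are assumptions made for this sketch.

```python
import bz2


def compression_rate(visible_text):
    """Size of the compressed visible text divided by its uncompressed size."""
    raw = visible_text.encode("utf-8")
    if not raw:
        # Assumption: an empty page is given a rate of 0.0.
        return 0.0
    return len(bz2.compress(raw)) / len(raw)


# Hypothetical page text made of one phrase repeated many times.
repetitive_rate = compression_rate("buy cheap pills online " * 500)
```

Highly repetitive text compresses to a small fraction of its original size, so it yields a low rate.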
Corpus precision and corpus recall. The k most frequent words in the dataset, excluding stopwords, were used as the set of popular terms. Corpus precision refers to the fraction of words in a page that appear in the set of popular terms. Corpus recall was defined to be the fraction of popular terms that appear in the page. For each of corpus precision and corpus recall, four features were extracted, for k=100, 200, 500 and 1000, giving eight features in total.
Query precision and query recall. The set of the q most popular terms in a query log was used, and query precision and query recall were defined analogously to corpus precision and corpus recall. As with corpus precision and recall, eight features were extracted, for q=100, 200, 500 and 1000.
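The same computation serves for both the corpus-based and the query-based features; only the set of popular terms changes. In the sketch below, the function name is hypothetical, and counting every word occurrence (rather than each distinct word once) in the precision is an assumption.

```python
def precision_recall(page_words, popular_terms):
    """Return (precision, recall) of a page against a set of popular terms.

    precision: fraction of word occurrences in the page that are popular.
    recall: fraction of popular terms that occur somewhere in the page.
    """
    popular = set(popular_terms)
    if not page_words or not popular:
        return 0.0, 0.0
    hits = sum(1 for w in page_words if w in popular)
    precision = hits / len(page_words)
    recall = len(popular & set(page_words)) / len(popular)
    return precision, recall


# Hypothetical page and hypothetical top-4 term list.
page = ["cheap", "pills", "cheap", "news"]
top_terms = ["cheap", "news", "weather", "sports"]
precision, recall = precision_recall(page, top_terms)
```

Three of the four words on the page are popular terms, and two of the four popular terms appear on the page, giving a precision of 0.75 and a recall of 0.5.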
Independent trigram likelihood. A trigram is three consecutive words. Let {p_w} be the probability distribution of trigrams in a page. Let T={w} be the set of all trigrams in a page and k=|T| be the number of distinct trigrams. Then the independent trigram likelihood is a measure of the independence of the distribution of trigrams. It is defined in Ntoulas et al. to be −(1/k) Σ_{w∈T} log p_w.
Entropy of trigrams. The entropy is another measure of the compressibility of a page, in this case more macroscopic than the compressibility ratio feature because it is computed on the distribution of trigrams. The entropy of the distribution of trigrams, {p_w}, is defined as H = −Σ_{w∈T} p_w log p_w.
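A minimal sketch of the trigram entropy follows. It assumes the page has already been split into words, estimates each trigram probability by its relative frequency in the page, and uses the natural logarithm; the base of the logarithm and the function names are assumptions.

```python
import math
from collections import Counter


def trigram_distribution(words):
    """Relative frequency of each trigram (three consecutive words)."""
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    total = len(trigrams)
    return {w: c / total for w, c in counts.items()}


def trigram_entropy(words):
    """Entropy H = -sum over trigrams w of p_w * log(p_w)."""
    dist = trigram_distribution(words)
    return -sum(p * math.log(p) for p in dist.values())


# Two distinct, equally likely trigrams give an entropy of log 2.
varied = trigram_entropy("a b c d".split())
# A single repeated trigram gives an entropy of zero.
repeated = trigram_entropy("x x x x x".split())
```

A page that repeats the same phrase has few distinct trigrams and therefore a low entropy, which is consistent with it being highly compressible.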
The above list gives a total of 24 features for each page. In general, it was found that, for this dataset, the content-based features do not provide as good a separation between spam and non-spam pages as for the data set used in Ntoulas et al. For example, in the dataset used here, the distributions of average word length in spam and non-spam pages were almost identical. In contrast, for the data set of Ntoulas et al. that particular feature provides very good separation. The same is true for many of the other content features. Some of the best features (judging only from the histograms) are the corpus precision and query precision, which are shown in
In total, 140 link-based features were extracted for each host, as described above, and 24 content-based features were extracted for each page. The content-based features of pages were aggregated in order to obtain content-based features for hosts.
Let h be a host containing m web pages, denoted by the set P={p_1, . . . , p_m}. Let p̂ denote the home page of host h and p* denote the page with the largest PageRank among all pages in P. Let c(p) be the 24-dimensional content feature vector of page p. For each host h, the content-based feature vector c(h) of h is formed as follows:
c(h) = (c(p̂), c(p*), E[c(p)], Var[c(p)]).
Here E[c(p)] is the average of all vectors c(p), p ∈ P, and Var[c(p)] is the variance of c(p), p ∈ P. Therefore, for each host there were 4×24=96 content features. In total, there were 140+96=236 link and content features.
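The aggregation of page vectors into a host vector can be sketched as a concatenation of four blocks. The function name, the dictionary layout, and the use of the population variance (rather than the sample variance) are assumptions made for this illustration; two-dimensional vectors stand in for the 24-dimensional ones.

```python
from statistics import fmean, pvariance


def host_content_vector(page_vectors, home_page, max_pr_page):
    """Concatenate c(home page), c(max-PageRank page), mean, and variance.

    page_vectors maps a page identifier to its content feature vector;
    all vectors are assumed to have the same length.
    """
    dims = range(len(page_vectors[home_page]))
    # One column per feature, collecting that feature across all pages.
    columns = [[vec[i] for vec in page_vectors.values()] for i in dims]
    mean = [fmean(col) for col in columns]
    var = [pvariance(col) for col in columns]  # assumption: population variance
    return (list(page_vectors[home_page])
            + list(page_vectors[max_pr_page])
            + mean + var)


# Hypothetical host with two pages and two features per page.
pages = {"home": [1.0, 2.0], "deep": [3.0, 4.0]}
c_h = host_content_vector(pages, home_page="home", max_pr_page="deep")
```

With 24 features per page, the same function would return a vector of 4×24=96 entries.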
In the process of aggregating page features, hosts h were ignored for which the home page p̂ or the maxPR page p* was not present in the summary sample. This left a total of 8,944 hosts, of which 5,622 were labeled; of the labeled hosts, 12% were spam hosts.
The base classifier used in the experiment was the implementation of C4.5 (decision trees) given in Weka (see e.g., I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999). Using both link and content features, the resulting tree used 45 unique features, of which 18 were content features.
In the data used, the non-spam examples outnumber the spam examples to such an extent that the classifier accuracy improves by misclassifying a disproportionate number of spam examples. To minimize the misclassification error, and compensate for the imbalance in class representation in the data, a cost-sensitive decision tree was used. A cost of zero was imposed for correctly classifying the instances, and the cost of misclassifying a spam host as normal was set to be R times more costly than misclassifying a normal host as spam. Table 1 shows the results for different values of R. The value of R is a parameter that can be tuned to balance the true positive rate and the false positive rate. In this case, it was desired to maximize the F-measure. Incidentally note that R=1 is equivalent to having no cost matrix, and is the baseline classifier.
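One common way to apply such a cost matrix, shown here only as an illustrative sketch and not necessarily the mechanism used in the experiment, is to choose for each host the label with the lower expected cost given an estimated spam probability. The F-measure used to compare values of R is also shown. All function names and example values are hypothetical.

```python
def min_cost_label(p_spam, r):
    """Pick the label with the lower expected misclassification cost.

    Calling a spam host normal costs r; calling a normal host spam costs 1;
    correct decisions cost 0.
    """
    cost_if_predict_normal = p_spam * r
    cost_if_predict_spam = (1.0 - p_spam) * 1.0
    return "spam" if cost_if_predict_spam < cost_if_predict_normal else "normal"


def f_measure(true_labels, predicted_labels):
    """Harmonic mean of precision and recall for the spam class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")
    fp = sum(1 for t, p in pairs if t == "normal" and p == "spam")
    fn = sum(1 for t, p in pairs if t == "spam" and p == "normal")
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# With R=1 a host estimated at 30% spam is labeled normal;
# with R=10 the same host is labeled spam.
baseline = min_cost_label(0.3, 1)
weighted = min_cost_label(0.3, 10)
score = f_measure(["spam", "spam", "normal", "normal"],
                  ["spam", "normal", "spam", "normal"])
```

Raising R moves borderline hosts toward the spam label, which raises the true positive rate at the expense of the false positive rate.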
It was then attempted to improve the results of the baseline classifier using bagging. Bagging is a technique that creates an ensemble of classifiers by sampling with replacement from the training set to create N classifiers whose training sets contain the same number of examples as the original training set, but may contain duplicates. The labels of the test set are determined by a majority vote of the classifier ensemble. In general, any classifier can be used as a base classifier, and in this experiment the cost-sensitive decision trees described above were used. Bagging improved our results by reducing the false-positive rate, as shown in Table 2. The decision tree created by bagging was roughly the same size as the tree created without bagging, and used 49 unique features, of which 21 were content features.
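The bagging procedure can be sketched independently of the base classifier. In the sketch below, a one-feature threshold rule stands in for the cost-sensitive decision tree purely for brevity; that stand-in, the function names, and the fixed random seed are assumptions.

```python
import random
from collections import Counter


def fit_threshold_stump(sample):
    """Toy base learner (hypothetical): split halfway between class means."""
    spam = [x for x, y in sample if y == "spam"]
    normal = [x for x, y in sample if y == "normal"]
    if not spam or not normal:
        # A bootstrap sample may contain a single class only.
        only = "spam" if spam else "normal"
        return lambda x: only
    cut = (sum(spam) / len(spam) + sum(normal) / len(normal)) / 2
    return lambda x: "spam" if x > cut else "normal"


def bagging_fit(train, fit_base, n_classifiers, seed=0):
    """Train n_classifiers models, each on a bootstrap sample of train."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_classifiers):
        # Sample with replacement; same size as the original training set.
        sample = [rng.choice(train) for _ in train]
        models.append(fit_base(sample))
    return models


def bagging_predict(models, x):
    """Label x by majority vote of the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature training set: (feature value, label).
train = [(0.1, "normal"), (0.2, "normal"), (0.3, "normal"),
         (0.8, "spam"), (0.9, "spam")]
ensemble = bagging_fit(train, fit_threshold_stump, n_classifiers=11)
label = bagging_predict(ensemble, 0.95)
```

Using an odd number of classifiers avoids ties in the two-class vote.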
The results of classification reported in Tables 1 and 2 use both link and content features. Table 3 shows the contribution of each type of feature to the classification. The content features serve to reduce the false-positive rate without diminishing the true positive rate, and thus improve the overall performance of the classifier.
Given the above, an embodiment of the methods described above was then analyzed. During the extraction of link-based features, all nodes in the network were anonymous. In this regularization phase, by contrast, the identity of each node (its predicted label or initial host spamicity value) is known and is important to the algorithm.
In the experiment, a graph clustering algorithm was used to partition the hosts into clusters based on links between hosts. In a first step, the undirected graph G=(V, E, w) is created where V is the set of hosts, w is a weighting function from V×V to integers so that the weight w(u, v) is equal to the number of links between any page in host u and any page in host v, and E is the set of edges with non-zero weight. Ignoring the direction of the links may result in a loss of information for detecting spam, but it drastically simplifies the graph clustering algorithm.
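Construction of the weighted undirected host graph from page-level links may be sketched as follows. Identifying a host by the hostname of the URL and skipping links between pages of the same host are assumptions of this sketch, as are the function name and the example URLs.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_graph(page_links):
    """Return {frozenset({u, v}): w(u, v)} for hosts u != v.

    page_links is an iterable of (source_url, target_url) pairs. The
    direction of each link is ignored, so u->v and v->u add to one edge.
    """
    weights = defaultdict(int)
    for src, dst in page_links:
        u = urlsplit(src).hostname
        v = urlsplit(dst).hostname
        if u is None or v is None or u == v:
            # Assumption: malformed URLs and intra-host links are skipped.
            continue
        weights[frozenset((u, v))] += 1
    return dict(weights)


links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://b.example/y", "http://a.example/2"),
    ("http://a.example/1", "http://a.example/2"),
    ("http://a.example/3", "http://c.example/"),
]
edges = host_graph(links)
```

Only pairs with non-zero weight appear in the result, which corresponds to the edge set E.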
Next, the graph G was partitioned into clusters using the METIS graph clustering algorithm (see G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96-129, 1998). The 11,400 hosts of the graph were partitioned into 1,000 clusters, so as to split the graph into many small clusters. By experimentation it was found that the number of clusters is not crucial; similar results were obtained when partitioning the graph into 500 and 2,000 clusters.
The clustering algorithm can be described as follows. Let the clustering of G consist of m clusters C_1, . . . , C_m, which form a disjoint partition of V. Let p(h) ∈ [0, 1] be the prediction of a particular classification algorithm C, so that for each host h a value of p(h) equal to 0 indicates non-spam, while a value of 1 indicates spam. (Informally, p(h) is called the predicted spamicity of host h.) For each cluster C_j, j=1, . . . , m, its average spamicity is p(C_j) = (1/|C_j|) Σ_{h∈C_j} p(h).
The algorithm used two thresholds, a lower threshold t_l and an upper threshold t_u. For each cluster C_j, if p(C_j) ≤ t_l then all hosts in C_j were marked as non-spam, and p(h) was set to 0 for all h in the cluster C_j. Similarly, if p(C_j) ≥ t_u then all hosts in C_j were marked as spam, and p(h) was set to 1.
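The thresholding step can be sketched as follows, assuming the clusters have already been computed. The function name, the example predictions, and the threshold values 0.25 and 0.65 are hypothetical and are chosen only for illustration.

```python
def smooth_by_cluster(predictions, clusters, t_lower, t_upper):
    """Relabel whole clusters whose average spamicity is extreme.

    predictions maps host -> predicted spamicity in [0, 1].
    clusters is a list of host lists forming a partition of the hosts.
    """
    smoothed = dict(predictions)
    for hosts in clusters:
        if not hosts:
            continue
        average = sum(predictions[h] for h in hosts) / len(hosts)
        if average <= t_lower:
            for h in hosts:
                smoothed[h] = 0.0  # whole cluster marked non-spam
        elif average >= t_upper:
            for h in hosts:
                smoothed[h] = 1.0  # whole cluster marked spam
    return smoothed


predicted = {"a": 0.9, "b": 0.8, "c": 0.4,
             "d": 0.1, "e": 0.0,
             "f": 0.5, "g": 0.6}
partition = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
result = smooth_by_cluster(predicted, partition, t_lower=0.25, t_upper=0.65)
```

In this example host "c" is relabeled as spam because its cluster averages about 0.7, the second cluster is relabeled as non-spam, and the third cluster, whose average lies between the thresholds, keeps its original predictions.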
The results of the clustering algorithm are shown in Table 4. The improvement of the F-measure obtained over the classifier without bagging is statistically significant at the 0.05 confidence level; the improvement for the classifier with bagging is much smaller. Note that this algorithm never has access to the true labels of the data, but uses only predicted labels. The true labels are used only to determine the effectiveness of the method.
Those skilled in the art will recognize that the methods and systems of the present disclosure may be implemented in many manners and as such are not to be limited by the foregoing exemplary embodiments and examples. In other words, functional elements being performed by single or multiple components, in various combinations of hardware and software or firmware, and individual functions, may be distributed among software applications at either the client level or server level or both. In this regard, any number of the features of the different embodiments described herein may be combined into single or multiple embodiments, and alternate embodiments having fewer than, or more than, all of the features described herein are possible. Functionality may also be, in whole or in part, distributed among multiple components, in manners now known or to become known. Thus, myriad software/hardware/firmware combinations are possible in achieving the functions, features, interfaces and preferences described herein. Moreover, the scope of the present disclosure covers conventionally known manners for carrying out the described features and functions and interfaces, as well as those variations and modifications that may be made to the hardware or software or firmware components described herein as would be understood by those skilled in the art now and hereafter.
Furthermore, the embodiments of methods presented and described as flowcharts in this disclosure are provided by way of example in order to provide a more complete understanding of the technology. The disclosed methods are not limited to the operations and logical flow presented herein. Alternative embodiments are contemplated in which the order of the various operations is altered and in which sub-operations described as being part of a larger operation are performed independently.
While various embodiments have been described for purposes of this disclosure, such embodiments should not be deemed to limit the teaching of this disclosure to those embodiments. Various changes and modifications may be made to the elements and operations described above to obtain a result that remains within the scope of the systems and processes described in this disclosure. For example, a spam host identification system could be incorporated into an automated news aggregation system so that pages from spam hosts are not accidentally aggregated as non-spam news items. Numerous other changes may be made that will readily suggest themselves to those skilled in the art and which are encompassed in the spirit of the invention disclosed and as defined in the appended claims.