The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
Embodiments of a method and apparatus for web spam filtering are described herein. In the following description, numerous specific details are set forth (such as the C and C++ programming languages indicated as the languages in which the described software is implemented) to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Thus, embodiments of this invention may be used as or to support a software program executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a machine-readable medium. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium can include a read-only memory (ROM), a random access memory (RAM), magnetic disk storage media, optical storage media, a flash memory device, etc. In addition, a machine-readable medium can include propagated signals such as electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
As shown in
If the configuration indicates that results are to be retrieved from a local database, the PHP page issues an SQL query to a MySQL database to obtain the URLs and associated descriptions that match the keyword or keywords passed as part of the search request, step 116. Alternatively, the PHP page contacts one or more search engines over a network connection, querying them for results that match the specified keyword or keywords, step 114. These search engines can be queried serially or in parallel.
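The parallel querying option of step 114 can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the PHP of the described embodiment, and the names `query_engines_parallel`, `engine_a`, and `engine_b` are hypothetical stand-ins, not part of the described software. Each engine is assumed to be a callable that accepts a list of keywords and returns (URL, description) pairs.

```python
from concurrent.futures import ThreadPoolExecutor


def query_engines_parallel(engines, keywords):
    """Query every engine concurrently and merge the results, dropping duplicate URLs."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        # pool.map preserves the order of `engines`, so the merge is deterministic.
        result_lists = list(pool.map(lambda engine: engine(keywords), engines))

    merged, seen = [], set()
    for results in result_lists:
        for url, description in results:
            if url not in seen:  # keep the first description seen for each URL
                seen.add(url)
                merged.append((url, description))
    return merged


# Hypothetical stand-ins for real search engine clients.
def engine_a(keywords):
    return [("http://example.com/a", "Result A for " + " ".join(keywords))]


def engine_b(keywords):
    return [
        ("http://example.com/a", "Duplicate of A"),
        ("http://example.org/b", "Result B for " + " ".join(keywords)),
    ]


merged_results = query_engines_parallel([engine_a, engine_b], ["cheap", "flights"])
```

Querying serially would amount to replacing the thread pool with a plain loop over `engines`; the merge step would be unchanged.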
For each URL in the returned list obtained from the search engines or from the database, the PHP page determines if the URL exists in a blacklist file, step 118, which indicates URLs that are already known to contain spam and do not need to be reprocessed. If the URL is in the blacklist, it is returned to the client with an indicator that it is spam or, optionally, it is removed altogether from the list of results returned to the client, step 130. If the URL is not in the blacklist, the PHP page determines if it is in the whitelist, step 120, which indicates that the specified URL is known not to contain spam and does not need to be processed. In this case, the result is returned to the client with an indicator that it is not spam, or alternatively, with no special indicator, step 128.
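The list lookups of steps 118 and 120 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the described PHP page: the blacklist and whitelist are assumed to be already loaded into in-memory sets, and the function names and the `drop_spam` flag are hypothetical.

```python
def triage_url(url, blacklist, whitelist):
    """Return "spam", "not_spam", or None when the URL still needs full evaluation."""
    if url in blacklist:  # step 118: already known to contain spam
        return "spam"
    if url in whitelist:  # step 120: already known not to contain spam
        return "not_spam"
    return None


def label_results(urls, blacklist, whitelist, drop_spam=False):
    """Label each result; optionally remove known spam from the returned list."""
    labeled = []
    for url in urls:
        label = triage_url(url, blacklist, whitelist)
        if label == "spam" and drop_spam:
            continue  # optional behavior of step 130: omit the result entirely
        labeled.append((url, label))
    return labeled
```

A label of `None` marks a URL that appears in neither list and therefore must be retrieved and evaluated.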
The software also supports a learning process, by which the user can indicate to the toolbar that a page or link identified as spam or not spam has been incorrectly identified. If the user clicks on the “spam” button that appears next to a URL displayed on a page, this indicates to the server software, step 132, that the user is initiating a correction to the specified link. The server software then reprocesses the specified URL but identifies it as spam, step 134. Alternatively, if the user clicks on a “not spam” button, the server software reprocesses the associated URL as containing valid, non-spam content.
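One possible way to record such a correction is sketched below in Python. The sketch assumes that reprocessing ends by storing the user's label in the blacklist or whitelist, here modeled as in-memory sets; the function name `apply_correction` is hypothetical, and any retraining of the evaluation algorithms is outside the scope of this sketch.

```python
def apply_correction(url, is_spam, blacklist, whitelist):
    """Record a user correction so that the URL carries the user's label from now on."""
    if is_spam:
        whitelist.discard(url)  # no error if the URL was not whitelisted
        blacklist.add(url)
    else:
        blacklist.discard(url)
        whitelist.add(url)
```

Removing the URL from the opposite list keeps the two lists mutually exclusive, so a later lookup cannot return a stale label.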
If the URL is in neither the blacklist nor the whitelist, the page to which the URL points is retrieved, step 122. In one implementation, the retrieved page is processed, and any URLs it contains are parsed and their contents retrieved (recursively, to a recursion level specified in a configuration file). In step 124, one or more algorithms are used to evaluate the retrieved page(s) to determine if they are spam or not. In step 126, a determination is made as to whether the retrieved content is spam; if it is, step 130 is executed and the blacklist is updated to include the specified URL. If it is not spam, step 128 is executed and the whitelist is updated to include the specified URL.
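Steps 122 through 130 can be sketched in Python as follows. The sketch makes several assumptions that are not part of the described software: `fetch` is a hypothetical callable that returns a page's HTML or `None`, `is_spam` is a hypothetical placeholder for the evaluation algorithms of step 124, `max_depth` stands in for the recursion level read from the configuration file, and the lists are in-memory sets.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_pages(url, fetch, max_depth, seen=None):
    """Return {url: html} for `url` and the pages it links to, up to `max_depth` levels."""
    seen = {} if seen is None else seen
    if url in seen:
        return seen  # already retrieved; also prevents loops between pages
    page = fetch(url)
    if page is None:
        return seen  # page could not be retrieved
    seen[url] = page
    if max_depth > 0:
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            # Resolve relative links against the page that contains them.
            collect_pages(urljoin(url, href), fetch, max_depth - 1, seen)
    return seen


def evaluate_url(url, fetch, is_spam, max_depth, blacklist, whitelist):
    pages = collect_pages(url, fetch, max_depth)  # step 122
    spam = is_spam(pages)                         # steps 124 and 126
    (blacklist if spam else whitelist).add(url)   # step 130 or step 128
    return spam
```

Because the traversal is depth-first, a page first reached at the deepest allowed level is not expanded further even if a shorter path to it exists; a breadth-first traversal would avoid this at the cost of slightly more bookkeeping.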
Due to the amount of processing power required to determine whether a page is spam or not, it is preferable in some instances to have the server evaluate URLs and their contents on an on-going basis, as illustrated in
It should be noted that the web filter software can use multiple pages to evaluate whether a URL is spam or not; that is, in addition to the frames concept described earlier, depending on how it is configured, the software may download multiple pages from a particular web site, evaluating them conjointly, to determine if they are spam. While a particular page may not be identified as spam, sometimes multiple pages when evaluated together, or sub-links, when evaluated, are determined to be spam. In step 326, sub-links contained within an evaluated page are then added to the URL list for further processing.
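The idea of evaluating several pages conjointly can be illustrated with a small Python sketch. The description does not state which signals are combined, so the sketch substitutes a simple duplicate-content ratio purely as a placeholder: pages that look unremarkable one at a time are flagged when most of them carry identical text. The function names and the threshold value are assumptions made for illustration.

```python
from collections import Counter


def normalize(text):
    """Lower-case the text and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def duplicate_ratio(pages):
    """Fraction of pages whose normalized text also appears on another page."""
    if not pages:
        return 0.0
    counts = Counter(normalize(text) for text in pages.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(pages)


def site_is_spam(pages, threshold=0.5):
    """Placeholder conjoint test: flag the site when most of its pages are duplicates."""
    return duplicate_ratio(pages) >= threshold
```

Here `pages` is assumed to map each downloaded URL to its text. A single page can never be flagged by this test, which is the point of the illustration: the signal only exists when pages are examined together.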
The illustrations described so far have focused primarily on the server-based aspects of filtering web spam. A client-based implementation is described in
In step 412, the toolbar waits for a new page to be loaded into the browser by the user. When the toolbar detects that a new page has been loaded, it evaluates the page to determine whether the page itself, or the links contained within it, are search spam or web spam, step 414. The toolbar then indicates, using an image in the toolbar, whether the page itself is spam or not spam; it also modifies the page so that when the user places the mouse cursor over a URL link in the page, a popup will indicate whether the particular link points to a page that is spam or not spam, step 416. In this way, the toolbar indicates to the user whether content the user is currently viewing, or thinking about viewing, is spam.
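The page modification of step 416 can be sketched as follows. A browser toolbar would normally perform this inside the browser; the Python sketch below only illustrates the transformation, using the standard `title` attribute so that the browser shows a popup when the cursor rests on a link. The function `annotate_links` and the `is_spam_url` callable are hypothetical, and the regular expression handles only anchors with double-quoted `href` values.

```python
import re

# Matches an opening anchor tag and captures the text around a double-quoted href.
ANCHOR_RE = re.compile(r'<a\s+([^>]*?)href="([^"]*)"([^>]*)>', re.IGNORECASE)


def annotate_links(page_html, is_spam_url):
    """Add a title attribute to each link stating whether its target is spam."""

    def add_title(match):
        before, href, after = match.groups()
        label = "Spam" if is_spam_url(href) else "Not spam"
        return '<a {}href="{}"{} title="{}">'.format(before, href, after, label)

    return ANCHOR_RE.sub(add_title, page_html)
```

An anchor that already carries a `title` attribute would end up with two; a fuller implementation would parse the page and merge the attributes instead of rewriting the markup textually.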
The toolbar also supports a learning process, by which the user can indicate to the toolbar that a page or link identified as spam or not spam has been incorrectly identified. If the user clicks on the “spam” button that appears on the toolbar, this indicates to the toolbar, step 418, that the user is initiating a correction to the current page or to one of the links identified in the page currently visible in the browser. The toolbar then reprocesses the contents of the current URL but identifies it as spam, step 420. Alternatively, if the user clicks on the “not spam” button in the toolbar, the toolbar reprocesses the current page, identifying it as containing valid, non-spam content.
Toolbars installed on individual client computers can use a server both to back up their blacklist and whitelist information and to benefit from the network effects of multiple users determining which pages on the World Wide Web constitute spam, as shown in
As shown in
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.