The present invention relates generally to software. More specifically, content evaluation is disclosed.
Unsolicited content, often referred to as “spam,” is problematic in that large amounts of undesirable data are sent to and received by users over various electronic media including the World Wide Web (“web”). Spam may be delivered using e-mail or other electronic content delivery mechanisms, including messaging, the Internet, web, or other electronic communication media. In the context of search engines, crawlers, bots, and other content filtering mechanisms, the detection of undesirable content on the web (“web spam”) is a growing problem. For example, when a search is performed, all web pages that fit a given search may be listed in a results page. Included with the search results pages may be web pages that have been generated to specifically increase the visibility of a particular web site. Web spam “pushes” undesired content to users, hoping to entice users to visit a particular web site. Web spam also generates significant amounts of unusable or uninteresting data for users and can slow or prevent accurate search engine performance. There are various types of mechanisms for raising the visibility of particular web pages in a search listing or ranking.
In many cases, spam may occur over the web and Internet for commercial purposes. For example, search engine optimizers (SEOs) generate spam web pages (“web spam”), either automatically or manually, in order to enhance the desirability or “searchability” of a particular web page. SEOs attempt to raise web site rankings in search listings and consequently generate substantial amounts of spam web pages. A destination web site or web page may be able to increase its ranking or priority in a particular search, thus enabling more prominent positioning and placement on a results page, leading to increased traffic from users. As a result, SEOs are able to generate revenue based on improving the exposure of a client web site to increased amounts of traffic and users. Some SEOs may employ keyword stuffing to create web pages, which may include keywords, but no actual content. Another problem is link spam, which creates a large number of pages linking to a particular web page (the commercial client), thus misleading search engines and causing them to raise the ranking within search results for a particular web site or web page. In other cases, web spam may be created by generating a large number of web pages that may slightly vary from each other, with the intent that one of these pages will be ranked highly by a search engine.
Thus, what is needed is a solution for detecting unsolicited online content without the limitations of conventional techniques.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings:
The invention can be implemented in numerous ways, including as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Detection of web spam is an important goal in reducing and eliminating undesirable content. Depending upon a user's preferences, some content may not be desirable and detection may be performed to determine whether web spam is present. By using statistical distributions formed from various parameters or attributes associated with a set of crawled web pages, a graph may be developed of all pages in the search results. Here, a graph may refer to a diagram, figure, or plot of data using various parameters. As an example, a graph may be developed where a point may be plotted for each page crawled by a search engine, where one or more attributes of the pages are used to plot the graph. Statistical outliers may then be identified from the resulting distribution. In some examples, web spam detection techniques may be performed during the creation of a search engine index, rather than when a query is performed, so as to not delay search results to a user. In other examples, web spam detection may be performed differently. Once outliers have been identified, web pages associated with the outliers may be further evaluated using various techniques. Once web spam has been detected, deletion, filtering, reduction of search engine rankings, or other actions may be performed. Software or hardware applications (e.g., computer programs, software, software systems, and other computing systems) may be used to implement techniques for evaluating content to detect web spam.
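As an illustrative, non-limiting sketch, the identification of outliers in a distribution of a page attribute may be expressed as follows. The function name find_outliers, the z_threshold parameter, and the use of a standard-deviation rule are assumptions made for this example; the description above does not prescribe a particular outlier test.

```python
from statistics import mean, pstdev


def find_outliers(attribute_by_page, z_threshold=3.0):
    """Return pages whose attribute value lies far from the mean.

    attribute_by_page maps a page URL to one numeric attribute
    (hypothetical input format). z_threshold is an assumed cutoff,
    expressed in standard deviations.
    """
    values = list(attribute_by_page.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every page has the same value, so nothing stands out.
        return []
    return sorted(
        page
        for page, value in attribute_by_page.items()
        if abs(value - mu) / sigma > z_threshold
    )
```

In such a sketch, the pages returned would be candidates for further evaluation rather than pages already determined to be web spam.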
As an example, various outliers may result when a statistical distribution is formed using a variety of attributes or parameters, such as a uniform resource locator (URL). A URL represents an address for a web page that may be used as a parameter to determine whether a page addressed by the URL is web spam. In some examples, a synthetic URL may be used to address a page. Synthetic URLs are generated automatically rather than manually by a developer, administrator or other web content provider. These URLs may appear differently, for example, having random sequences of digits, characters, or other items contained in the address. Synthetic URLs may be automatically generated by an application, program, or machine. Several examples of statistical distributions formed to detect web spam are shown in
A host name may be used with the domain name system (DNS), which is a global, distributed system for mapping symbolic host names to numeric IP addresses. DNS is implemented by a large number of independent computers (“DNS servers”). Each DNS server is responsible for some part of the mapping and may be operated by an organization that has registered ownership of a domain name. A symbolic host name may be resolved by a client, which sends the host name to a DNS server. The host name is forwarded directly or indirectly to a DNS server responsible (e.g., authoritative) for the domain in which the host resides, which returns an associated IP address. As an example, a DNS server may be responsible for a small and fixed (or slowly evolving) set of host names. However, it is possible to configure a DNS server to resolve any given host name within a particular domain to an IP address. Thus, a web server may generate web pages that contain hyperlinks (e.g., URLs) such that the host components of the hyperlinks may appear to refer to different hosts (e.g., “belgium.sometravelagency.com”, “holland.sometravelagency.com”, “france.sometravelagency.com”), but where all host names resolve to the same IP address. Each of the different hosts may be categorized as machine-generated host names or “synthetic host names”.
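As a minimal sketch of the situation described above, host names may be grouped by the IP address to which they resolve, so that many apparent hosts sharing a single address become visible. The function name group_hosts_by_address and the resolve callable are hypothetical; in this example the resolver is supplied by the caller and returns None for a host name that cannot be resolved.

```python
from collections import defaultdict


def group_hosts_by_address(host_names, resolve):
    """Group host names by the IP address each one resolves to.

    resolve is an assumed callable taking a host name and returning
    an IP address string, or None when resolution fails.
    """
    groups = defaultdict(set)
    for host in host_names:
        address = resolve(host)
        if address is not None:
            groups[address].add(host)
    return dict(groups)
```

A resolver could, for instance, wrap socket.gethostbyname from the Python standard library and return None when the lookup raises an error; a lookup table built during crawling would serve equally well.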
A synthetic host name may be dynamically created. Synthetic host names often include more dots, dashes, digits, or other characters than a standard host name. In some examples, a synthetic host name may have a different appearance than a standard host name. Synthetic host names may also be referred to as domain name system (DNS) spam. If a synthetic host name is present, then all web pages originating from that host name may be marked or indicated as web spam (408). If a synthetic host name is not present, then no action is taken. The process may be repeated for every host name crawled by a search engine.
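By way of example only, a check for synthetic host names based on the counts of dots, dashes, and digits might be sketched as follows. The function names and the limits max_dots, max_dashes, and max_digits are assumptions chosen for illustration; the description above does not specify particular values.

```python
def host_name_features(host_name):
    """Count the characters said to be frequent in synthetic host names."""
    return {
        "dots": host_name.count("."),
        "dashes": host_name.count("-"),
        "digits": sum(ch.isdigit() for ch in host_name),
    }


def looks_synthetic(host_name, max_dots=3, max_dashes=2, max_digits=4):
    """Return True if any count exceeds its (assumed) limit."""
    features = host_name_features(host_name)
    return (
        features["dots"] > max_dots
        or features["dashes"] > max_dashes
        or features["digits"] > max_digits
    )
```

In practice, such limits might instead be derived from the distribution of these counts across all crawled host names, so that only statistical outliers are marked.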
Spam web pages may contain numerous hyperlinks with different host names that appear to refer to different unaffiliated web servers, but may refer to affiliated web servers. This creates an impression that a web page links to and endorses other web sites, creating an appearance of impartiality. In order to reduce costs associated with operating independent web servers, a web spam author may configure a DNS server to resolve different host names to a single machine, as described above. Authors of web spam may employ this technique to provide the appearance of a normal web page while appearing to link to other different web sites. This behavior may be detected by computing a host-machine ratio. Host names may be mapped to one or more physical machines, where each machine is identified by an IP address. As an example, a host-machine ratio may be determined by dividing the number of web sites or host names that a given web page links to and appears to endorse by the number of machines that are actually endorsed. Web pages that endorse many more web sites than machines have a high host-machine ratio. Subsequently, these web pages may be detected and identified as web spam. If a high host-machine ratio is associated with a web page, then it may be marked or indicated as web spam. If a high host-machine ratio is not present, the web page is not marked or indicated as web spam. A host-machine ratio may have a threshold above which spam is identified. The host-machine ratio threshold may be adjusted higher or lower. If a page has a high host-machine ratio, that page may appear to link to many different web sites, but actually link to and endorse fewer web servers. In another example, the average host-machine ratio is the average of host-machine ratios for pages served by a machine. Web pages served by a machine with high average host-machine ratio are marked or indicated as web spam.
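The host-machine ratio described above may be sketched as follows, as one possible illustration rather than a definitive implementation. The function names, the resolve callable (returning an IP address or None), and the default threshold value are assumptions; only the ratio itself, distinct host names divided by distinct machines, and the average over pages served by a machine are taken from the description.

```python
from urllib.parse import urlsplit


def host_machine_ratio(link_urls, resolve):
    """Divide distinct linked host names by distinct machines (IP addresses).

    Host names that cannot be resolved are ignored. Returns 0.0 when
    the page has no resolvable links.
    """
    hosts = {urlsplit(url).hostname for url in link_urls}
    hosts.discard(None)
    resolved = {host: resolve(host) for host in hosts}
    resolved = {host: ip for host, ip in resolved.items() if ip is not None}
    if not resolved:
        return 0.0
    return len(resolved) / len(set(resolved.values()))


def average_host_machine_ratio(ratios):
    """Average the ratios of the pages served by one machine."""
    ratios = list(ratios)
    return sum(ratios) / len(ratios) if ratios else 0.0


def exceeds_threshold(ratio, threshold=2.0):
    """Apply an adjustable threshold; 2.0 is an assumed example value."""
    return ratio > threshold
```

For a page linking to “belgium.sometravelagency.com”, “holland.sometravelagency.com”, and “france.sometravelagency.com”, all resolving to one IP address, the ratio is three host names over one machine, or 3.0.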
In the above examples, different attributes and characteristics may be evaluated to implement these techniques for evaluating content to detect web spam. In some examples, different characteristics of a data set may be graphed to develop a statistical distribution, from which statistical outliers may be identified and selected. In other examples, the statistical distribution, analysis, and evaluation techniques described above may be used in other environments or systems to determine statistical outliers and the items, properties, or attributes associated with them when evaluating a data set.
According to one embodiment of the invention, computer system 1100 performs specific operations by processor 1104 executing one or more sequences of one or more instructions contained in system memory 1106. Such instructions may be read into system memory 1106 from another computer readable medium, such as static storage device 1108 or disk drive 1110. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention.
The term “computer readable medium” refers to any medium that participates in providing instructions to processor 1104 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as disk drive 1110. Volatile media includes dynamic memory, such as system memory 1106. Transmission media includes coaxial cables, copper wire, and fiber optics, including wires that comprise bus 1102. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, carrier wave, or any other medium from which a computer can read.
In an embodiment of the invention, execution of the sequences of instructions to practice the invention is performed by a single computer system 1100. According to other embodiments of the invention, two or more computer systems 1100 coupled by communication link 1120 (e.g., LAN, PSTN, or wireless network) may perform the sequence of instructions to practice the invention in coordination with one another. Computer system 1100 may transmit and receive messages, data, and instructions, including program, i.e., application code, through communication link 1120 and communication interface 1112. Received program code may be executed by processor 1104 as it is received, and/or stored in disk drive 1110, or other non-volatile storage for later execution.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.