This invention relates to a system, method and computer medium for crawling the web to find relevant internet content.
In internet technology, web crawlers are used to find new web pages by collecting and following URLs (Uniform Resource Locators). By following a URL and downloading the corresponding web page, the links within that web page can be added to the web crawler's URL collection. The web pages are stored for indexing and ranking by internet search engines. Internet search engines use web page ranking algorithms that relate the links within a web page to the relevance of the web page.
The use of link popularity algorithms to rank web pages has led to the problem of “link farms”. In order to manipulate a web page's ranking, a large sub-web of interlinked web pages is created and linked to a web page so that the page receives a high search engine ranking. In addition to distorting web page rankings, link farms cause a web crawler to spend significant resources following links and collecting web pages for eventual indexing into a search engine, even though many of these pages are created only for page ranking and are neither used by, nor useful to, humans.
What is required is a system, method and computer readable medium that provides enhanced web crawling.
In one aspect of the disclosure, there is provided a method for web crawling comprising determining a plurality of Uniform Resource Locators (URLs), determining a subset of the plurality of URLs that have associated interaction data, selecting at least one URL of the subset, and downloading a web page corresponding to the at least one selected URL.
In one aspect of the disclosure, there is provided a web crawler comprising at least one Uniform Resource Locator (URL) data store that stores a plurality of URLs, at least one interaction data store that stores interaction data for a plurality of web pages, at least one download module that downloads web page content corresponding to a URL, and at least one URL selection module in communication with the at least one URL data store and the at least one interaction data store. The interaction data indicates an interaction between a human and a web page corresponding to a URL. The at least one URL selection module selects at least one URL from the at least one URL data store that has interaction data in the at least one interaction data store. The at least one URL selection module provides the at least one selected URL to the at least one download module.
In one aspect of the disclosure, there is provided a computer-readable medium comprising computer-executable instructions for execution by a processor, that, when executed, cause the processor to select a Uniform Resource Locator (URL) from a URL data store, look up the selected URL in an interaction data store to determine if interaction data exists for the selected URL in the interaction data store, and if interaction data exists for the selected URL, provide the selected URL to a download module.
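By way of illustration only, the selection step recited above may be sketched in Python as follows. This is a minimal sketch and not the claimed implementation: the names `select_urls`, `url_store`, `interaction_store` and `download` are hypothetical, the data stores are modelled as in-memory collections, and an empty record is assumed to mean that no interaction data exists.

```python
from typing import Callable, Dict, Iterable, List


def select_urls(
    url_store: Iterable[str],
    interaction_store: Dict[str, list],
    download: Callable[[str], None],
) -> List[str]:
    """Provide to `download` only those URLs that have interaction data."""
    selected: List[str] = []
    for url in url_store:
        # Look up the URL in the interaction data store (assumed dict-like).
        # A missing key or an empty record is treated as "no interaction data".
        if interaction_store.get(url):
            download(url)  # stands in for the download module
            selected.append(url)
    return selected
```

For example, calling `select_urls` with a URL store of two URLs and an interaction store containing a record for only one of them would pass only that one URL to `download`.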
Reference will now be made, by way of example only, to specific embodiments and to the accompanying drawings in which:
A system 10 for providing web crawling in accordance with an embodiment of the disclosure is illustrated in
A web crawling method using the system 10 of
The download module 14 downloads web pages 22 from the internet 20 and extracts linked URLs 13 from the downloaded pages. The operation of the download module in accordance with an embodiment of the disclosure is illustrated in the flowchart 200 of
The operation of the URL selection module 18 in accordance with an embodiment of the disclosure is shown in the flowchart 300 of
The interaction data in the interaction data store 19 may be derived from interactions between users and the web page at client browsers, for example as described in any of the Applicant's co-pending patent applications Attorney Docket Nos. HAUSER001, HAUSER002, HAUSER006, HAUSER007, HAUSER007B, HAUSER008, HAUSER009, HAUSER010, the entire contents of each of which are explicitly incorporated herein by reference. In particular, event recorders provided within the web pages may record event data during these interactions and provide event streams to an event server. An example of an event data processing system is illustrated in
The web server 114 may be modified such that the web page content provided to the client 118 includes an event observer module 126 which may be provided as appropriate code or scripts that run in the background of the client's browser 115. In one embodiment, code for providing the event observer module 126 is provided to the web server 114 by a third party service, such as provided from an event server 112, described in greater detail below.
The event observer module 126 observes events generated in a user interaction with the web page 111 at the client 118. The event observer module 126 records events generated within the web browser 115, such as mouse clicks, mouse moves, text entries etc., and generates event streams 121 including an event header message 122 and one or more event stream messages 123. It will be apparent to a person skilled in the art that terms used to describe mouse movements are to be considered broadly and to encompass all such cursor manipulation devices and will include a plug-in mouse, on board mouse, touch pad, pixel pen, eye-tracker, etc.
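The structure of an event stream, namely a single event header message followed by one or more event stream messages, may be sketched as follows. The sketch is written in Python purely for illustration (an event observer would ordinarily run as a script in the browser), and the class names, field names and `flush` method are hypothetical assumptions rather than the message formats of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class EventHeader:
    interaction_id: str  # hypothetical field
    url: str             # hypothetical field


@dataclass
class EventStreamMessage:
    interaction_id: str
    events: List[Tuple[str, float]]  # (event type, timestamp), assumed shape


class EventObserver:
    """Buffers recorded events and emits a header once, then stream messages."""

    def __init__(self, interaction_id: str, url: str) -> None:
        self.interaction_id = interaction_id
        self.url = url
        self._header_sent = False
        self._pending: List[Tuple[str, float]] = []

    def record(self, event_type: str, timestamp: float) -> None:
        self._pending.append((event_type, timestamp))

    def flush(self) -> List[Union[EventHeader, EventStreamMessage]]:
        """Return the messages to send now; nothing if no events are pending."""
        if not self._pending:
            return []
        out: List[Union[EventHeader, EventStreamMessage]] = []
        if not self._header_sent:
            # The header precedes the first stream message only.
            out.append(EventHeader(self.interaction_id, self.url))
            self._header_sent = True
        out.append(EventStreamMessage(self.interaction_id, self._pending))
        self._pending = []
        return out
```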
The event observer module 126 provides the event streams 121 to the event server 112. An example of an event header message 30 is illustrated in
During an interaction with the web page 111, a user navigates the web page 111 and may enter content where appropriate, such as in the HTML form elements. During this interaction events are generated and recorded by the event observer module 126. Periodically, the event observer module 126 formulates an event stream message 123 preceded by an event header message 122 if one has not yet been sent. The event observer module 126 passes the event stream messages 123 to an event module 125 of the event server 112. In the embodiment illustrated in
The event server 112 processes the event stream 121 in the event module 125 or an equivalent component, to analyze the event stream data. Analyzed data may be stored with the raw event stream messages in a content data store 128. Additional modules of the event server may include an attention analysis module 139 as described in the Applicant's co-pending application HAUSER008 referenced above, and a content interest processing module 138 as described in the Applicant's co-pending application HAUSER009 referenced above. In one embodiment, the event stream data can be analyzed to determine the probability that the interaction that created the event stream at the client is a human dependent interaction, for example as described in the Applicant's co-pending patent application Attorney Docket No. HAUSER001 referenced above. In the present embodiment, the existence of any human interaction within the content areas of the web page, such as hints, lingers or clicks within the content areas, may be used to indicate the validity of a URL, and such statistics may be loaded into the interaction data store 19. In one embodiment, the web crawler 12 may include the event server 112 such that the web crawler is self-contained. In an alternative embodiment, human interaction data may be provided to the interaction data store as a third party service by an event server operator. Alternatively, the event server 112 may maintain its own interaction data store and provide access to the interaction data store as a service.
The interaction data store 19 may store raw event streams with processing of the event streams being performed by the URL selection module 18, for example to rank the URLs accordingly. Alternatively, the interaction data store may have an associated processing module (not shown) that pre-processes the interaction data so that the interaction data store stores the URLs in a ranked form. For example, a processing module may process the event streams to determine an event generator type (e.g. human, non-human, computer assisted human, etc.) as described in the Applicant's co-pending patent applications HAUSER001 and HAUSER006 referenced above. Once an interaction with a web page has been classified as a human interaction, the data may be further processed to rank the particular behavior of the interactions. For example, the event streams may be processed to select those event streams containing out-click events, i.e. events that a user produces to exit a web page. The event streams and/or the page content may also be analyzed to determine additional preferred behavior, such as a breadth-first traversal of the web site, backlink count, partial page-rank calculations, page-rank calculations using a link graph with URLs only if those URLs have sufficient human interaction, etc. In one embodiment, the interaction pattern for parked pages, link farms, and auto-generated “spam” pages (that use random snippets from a variety of authentic web pages just to get high search engine ranking based on the keywords in the snippets) may be identified and used to remove these URLs from the crawl graph (i.e. not pursue the links) and/or remove such URLs from page-rank calculations.
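One way in which statistics of human interaction within content areas might be accumulated for loading into an interaction data store is sketched below. The event tuple layout `(url, event_type, in_content_area)`, the event type strings and the function name `build_interaction_store` are assumptions made for illustration; the disclosure does not prescribe a particular data format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Event types taken to indicate human interest in content (assumed labels).
CONTENT_EVENTS = {"hint", "linger", "click"}


def build_interaction_store(
    events: Iterable[Tuple[str, str, bool]],
) -> Dict[str, Counter]:
    """Count hints, lingers and clicks inside content areas, per URL.

    Each event is a tuple (url, event_type, in_content_area).
    URLs with no qualifying event do not appear in the result.
    """
    store: Dict[str, Counter] = {}
    for url, event_type, in_content_area in events:
        if in_content_area and event_type in CONTENT_EVENTS:
            store.setdefault(url, Counter())[event_type] += 1
    return store
```

Under this sketch, the mere presence of a URL as a key in the returned store corresponds to the existence of human interaction for that URL.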
A summary of the event statistics including any data used to rank the web pages may be stored in the interaction data store 19.
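A possible pre-processing step that produces a ranked form of the URLs is sketched below. It assumes, for illustration only, that each interaction has already been classified by event generator type and that ranking is by the total number of out-click events in human interactions; the `Interaction` record, its field names and the `rank_urls` function are hypothetical, and other ranking criteria described above could equally be used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    url: str
    generator_type: str  # e.g. "human", "non-human" (assumed labels)
    out_clicks: int      # number of out-click events in the event stream


def rank_urls(interactions: List[Interaction]) -> List[str]:
    """Rank URLs by total out-clicks, counting human interactions only."""
    scores: Dict[str, int] = {}
    for interaction in interactions:
        if interaction.generator_type != "human":
            continue  # non-human interactions do not contribute
        scores[interaction.url] = scores.get(interaction.url, 0) + interaction.out_clicks
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))
```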
An alternative embodiment is illustrated in
An operation of the download module 214 is illustrated in the flowchart 400 of
In a further embodiment, the modified web crawler 212 of
An alternative URL selection policy may specify that URLs (or human out-click URLs) will only be followed if there is some form of human area of interest within the page where the URL was found, e.g. a content element with a high enough content interest score. A further alternative URL selection policy may specify that URLs (or human out-click URLs) will only be followed if they are found within a content element with high enough content interest.
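The two alternative policies described above may be sketched as follows. The `ContentElement` record, the `links_to_follow` function, the `per_element` flag and the default threshold of 0.5 are hypothetical; in particular, what constitutes a "high enough" content interest score is not specified and is an assumption here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentElement:
    interest_score: float  # content interest score for this element
    links: List[str]       # URLs found within this element


def links_to_follow(
    elements: List[ContentElement],
    threshold: float = 0.5,    # assumed cut-off for "high enough" interest
    per_element: bool = True,
) -> List[str]:
    """Return the URLs a crawler would follow under the chosen policy.

    per_element=True:  follow only links found inside an element whose
                       score meets the threshold.
    per_element=False: follow all links on the page if at least one
                       element meets the threshold.
    """
    if per_element:
        return [
            url
            for element in elements
            if element.interest_score >= threshold
            for url in element.links
        ]
    if any(element.interest_score >= threshold for element in elements):
        return [url for element in elements for url in element.links]
    return []
```

The page-level policy is the more permissive of the two, since a single element of interest admits every link on the page.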
The URL selection policies followed by the URL selection module focus the web crawler's resources on those web pages that are actively used by humans and thus generate particular attention events. Using the selection policies may significantly increase the efficiency of the web crawler and assist in providing higher quality page ranking statistics. Furthermore, as described above, common human browsing patterns can be recognized via attention analysis for link farm pages, parked pages where the most interesting content is advertisements, and auto-generated “spam” pages. Human out-clicks on pages that have no content of interest other than ads can be ignored by the URL selection module.
The embodiments described herein provide an enhanced system and method for web crawling that avoids spending resources collecting web pages that are not useful to humans. The effect of these embodiments is to reduce or eliminate the advantages of a link farm and to remove search engine spam. At current internet growth rates, the requirement to crawl less of the internet can provide large resource savings as well as making page ranking of web pages more efficient and useful for humans. By focusing crawling on the web pages relevant to and used by humans, the ability to artificially manipulate search engine rankings is reduced.
The web crawler 12 may be embodied in hardware, software, firmware or a combination of hardware, software and/or firmware. In a hardware embodiment, components of the web crawler 12 may be embodied in a device, such as server hardware, computer, etc. For example, the URL selection module 18 may include a processor 61 operatively associated with a memory 62 as shown in
Although embodiments of the present invention have been illustrated in the accompanying drawings and described in the foregoing description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous rearrangements, modifications, and substitutions without departing from the spirit of the invention as set forth and defined by the following claims. For example, the capabilities of the invention can be performed fully and/or partially by one or more of the blocks, modules, processors or memories. Also, these capabilities may be performed in the current manner or in a distributed manner and on, or via, any device able to provide and/or receive information. Further, although depicted in a particular manner, various modules or blocks may be repositioned without departing from the scope of the current invention. Still further, although depicted in a particular manner, a greater or lesser number of modules and connections can be utilized with the present invention in order to accomplish the present invention, to provide additional known features to the present invention, and/or to make the present invention more efficient. Also, the information sent between various modules can be sent between the modules via at least one of a data network, the Internet, an Internet Protocol network, a wireless source, and a wired source and via a plurality of protocols.