Embodiments of the present invention relate generally to a system, method, and computer program product for searching for and/or gathering information on a network.
It is estimated that the Internet presently includes over ten billion visible Web pages and possibly even hundreds of billions of pages in the “deep Web” (e.g., information on the Internet not accessible directly by a hyperlink, such as information stored in databases and accessible only by specific query or by submitting information into a form on a web page). As a result, the Internet can be an enormously useful resource for finding information on almost any topic. However, because the Internet is so large and because it is ever changing and growing, there is a need for an efficient system of discovering, classifying, and presenting the information on the Internet so as to allow a user to quickly find specific and up-to-date information related to a particular topic of interest.
The Internet is a global computer network where many individual computers and computer networks are linked together over a series of communication networks. Some entities on the network (i.e., “hosts”) allow other computers on the network to access information stored in the host's computer(s) or in some other location that the host computer(s) can access. In this way, a user having a computer connected to the network may be able to retrieve the shared information from the host computer.
The World Wide Web, web browsers, and other systems and protocols have been created to standardize the information on the Internet and the way in which one computer asks for and retrieves the information from another computer. In general, the Internet has systems for identifying a host on the network. For example, each computer (or group of computers) on the Internet may have an IP address (e.g., a numerical identifier) that identifies the computer's location on the network so that information can be transferred to and from that location on the network. Users who wish to share information on the Internet can find and purchase one or more text-based domain names and then register their computer's IP address under the one or more text-based domain names or sub-domain names with a domain name registrar. In this way, other Web users can use the text-based domain name to locate and access at least portions of the host computer.
A Uniform Resource Locator (URL) is a standardized system for indicating the domain name or a sub-domain name and for identifying information of interest on the host computer associated with the indicated domain name or sub-domain name. For example, if a Web user desires to go to Move, Inc.'s homepage, the user may be able to do so by typing the URL www.move.com into the user's web browser. The domain name portion (“move.com”), the “host” portion (“www”), and any sub-domain portion of the URL are used to look up the IP address that is registered as corresponding to the indicated host for the particular domain or sub-domain. An HTTP (Hypertext Transfer Protocol) request is sent to the web server on the host computer(s) that corresponds to the IP address. Typically, the server will return documents containing data written in HTML (Hypertext Markup Language) as well as associated files (e.g., image files) to the user's web browser. The HTML file may contain content information, but will also often contain presentation information (e.g., HTML code that indicates to the web browser how the server's information should be presented to the user) as well as behavior information (e.g., instructions that describe how the user can interact with that information or with the web page itself). For example, the HTML files may also include Forms, JavaScript, Applets, Flash, AJAX, DHTML, and the like, that are not content information but allow the user to interact with the page to accomplish some task.
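By way of a minimal, illustrative sketch (using Python's standard urllib library and the example URL above), the request/response exchange just described may be carried out as follows:

```python
import urllib.request

# Resolve www.move.com to an IP address, connect to the web server
# at that address, and send an HTTP GET request for the homepage.
with urllib.request.urlopen("http://www.move.com") as response:
    html = response.read().decode("utf-8", errors="replace")

# The returned HTML carries content information along with
# presentation and behavior information (markup, scripts, forms).
print(html[:200])
```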
Each IP address may host many web pages at the same time. The term “website” is used to generally refer to collections of interrelated web pages, such as web pages that share a common domain name and/or are provided by a common host. The different web pages of a website are distinguished by each web page's URL. For example, http://www.move.com may direct a web user to Move Inc.'s homepage, while http://www.move.com/apartments/westlakevillage_california/ may be a hyperlink on the homepage that directs the user to a web page having information about apartments in Westlake Village, Calif. The two web pages share the same “move.com” domain name and may be hosted by the same server, although the unique URLs indicate separate web pages.
Web pages often contain multi-media elements (such as text, graphics, images, etc.) and also typically contain a plurality of hyperlinks. A hyperlink is some text, icon, image, or other multi-media element on the web page that is associated with another URL. The hyperlink allows the user to click on the linked element so that the user can be redirected to the corresponding URL, which may provide access to another web page from the same website or may be a web page from some other website. In this way, many of the web pages and websites on the Internet are interconnected.
With billions of pages encompassing almost every topic imaginable and with a largely standardized structure for the Web, there is a wealth of information available to anyone who can access the Internet. However, in order to effectively use the Internet, one must be able to efficiently find the most relevant web pages among the billions of irrelevant web pages. To solve this problem, search engines have been created to locate and index many of the web pages on the Internet. In this way, a search engine can allow a user to search the index in an attempt to locate the web pages that are most likely to be relevant to the topic that the user is interested in.
A typical search engine begins the search process with a list of seed pages and a “web crawler.” The seed pages are often already-known web pages that contain many hyperlinks that branch out in a wide area over the web. The web crawler is a program that “crawls” around the Web looking for and indexing web pages.
Step 1140, which involves selecting the next URL from the frontier, can vary depending on the web crawler. Typical methods used for selecting the next URL to index are: (1) the “depth-first” method, (2) the “breadth-first” method, and (3) the “PageRank” method. A depth-first method, also known as a last-in-first-out (LIFO) method, indexes a first web page and then follows a hyperlink discovered in the first web page to a second web page. The crawler then indexes the second web page and, if it discovers hyperlinks on the second web page, it follows one of these hyperlinks to a third web page, indexes the third web page, and follows a link on the third web page to a fourth web page, and so on. In contrast, a breadth-first method, also known as a first-in-first-out (FIFO) method, indexes a first level of web pages and records all of the hyperlinks in those pages. It then follows every hyperlink found in the first level of web pages to a second level of web pages and indexes every one of the second-level web pages before proceeding to any of the third level of web pages (i.e., web pages corresponding to hyperlinks found in the second-level web pages), and so on. In other words, the breadth-first method completely indexes each level of a link tree before indexing the next lower level. In contrast to the depth-first and breadth-first methods, the “PageRank” method attempts to rank the URLs by some measure of “popularity.” In order to do so, the web crawler must have a way to measure the popularity of all of the URLs prior to viewing the individual web pages. In this regard, the PageRank method ranks a particular URL based on the number of web pages that the web crawler has viewed that reference the particular URL. In other words, if the web crawler is indexing a web page and comes across a hyperlink for a URL that is already stored in the frontier, the web crawler adds a “vote” to the referenced URL. Each time the web crawler selects another URL from the frontier to index (step 1140), the web crawler selects the URL having the most votes at that point in time. In the PageRank system, the web crawler may also weight some votes more than others based on the number of votes that the referring web page has.
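The three selection methods may be illustrated with the following minimal Python sketch; the function names, the vote bookkeeping, and the default vote weight are illustrative simplifications of the methods described above:

```python
from collections import deque

frontier = deque()  # URLs waiting to be crawled

def select_depth_first():
    # LIFO: the most recently discovered URL is crawled next.
    return frontier.pop()

def select_breadth_first():
    # FIFO: each level of the link tree is fully indexed before
    # the crawler descends to the next level.
    return frontier.popleft()

votes = {}  # PageRank-style selection: URL -> accumulated "votes"

def add_vote(url, weight=1):
    # Called whenever a page referencing url is indexed; in the
    # PageRank system the weight may depend on the referrer's votes.
    votes[url] = votes.get(url, 0) + weight

def select_most_voted():
    # The URL with the most votes at this point in time is next.
    url = max(votes, key=votes.get)
    del votes[url]
    return url
```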
With regard to step 1160, typical web crawlers index a web page by recording every word that is found in the web page. The words are stored in a datastore along with every URL that corresponds to a web page in which the word was found. Some web crawlers may not index words such as “a,” “an,” and “the.” Furthermore, some web crawlers will, in addition to the URL, record other context information in the index (such as where on the web page the word was found). In addition to indexing words found on the web page, a web crawler may also index any “meta tags” that the web page may have. Meta tags are keywords that may not show up on the face of the web page itself, but are listed by the web page developer in the HTML code as keywords supposedly associated with the web page content.
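A minimal sketch of such an index follows, where each word maps to the set of URLs at which it was found; the stop-word list, example URL, and example text are hypothetical:

```python
STOP_WORDS = {"a", "an", "the"}  # words some crawlers do not index
index = {}                       # word -> set of URLs containing it

def index_page(url, text):
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index.setdefault(word, set()).add(url)

index_page("http://example.com/page1", "Apartments for rent in the city")
print(index["apartments"])  # {'http://example.com/page1'}
```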
Another common issue that arises with web crawler development is web crawling ethics, often referred to as “politeness.” Since web crawlers often take up a lot of bandwidth, too many web crawlers accessing the same server at the same time, or one web crawler accessing the same server too frequently, may decrease the performance of the server's website and hinder other web users from accessing and using the website. As a result, two main solutions have developed so that web crawlers can work in the background of the Web without causing too many problems for individual hosts. The first solution uses what is known as the “Robot Exclusion Protocol” (REP). The REP provides a means for a website developer to indicate to a web crawler whether the developer wants all or part of the host computer to be accessed by web crawlers. The second solution is an ethical solution that most web crawler developers impose on themselves. Specifically, web crawlers should be designed not to access the same server so frequently as to significantly degrade the performance of the website hosted by the server. Thus, web crawlers will typically impose some minimum amount of time (often on the order of several seconds) that a crawler must wait between sending multiple requests to access the same server.
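Both politeness mechanisms may be sketched as follows using Python's standard robot-exclusion parser; the five-second interval and the user-agent string are assumptions rather than requirements:

```python
import time
import urllib.robotparser

MIN_DELAY_SECONDS = 5.0   # assumed politeness interval
last_request_time = {}    # host -> time of the last request

def allowed_by_robots(host, url, agent="example-crawler"):
    # Honor the Robot Exclusion Protocol: fetch and parse the
    # host's robots.txt before requesting any page from it.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"http://{host}/robots.txt")
    parser.read()
    return parser.can_fetch(agent, url)

def wait_for_politeness(host):
    # Impose a minimum delay between requests to the same server.
    elapsed = time.time() - last_request_time.get(host, 0.0)
    if elapsed < MIN_DELAY_SECONDS:
        time.sleep(MIN_DELAY_SECONDS - elapsed)
    last_request_time[host] = time.time()
```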
Since general search engines are designed to provide a master index of as much of the Internet as possible, they require a tremendous amount of computing resources to continually search and re-search the entire Web and to store all of the indexed information. Furthermore, since the general search engines must have an index that covers as many areas of information as possible, a user of the search engine often receives many unrelated and unwanted web pages in response to a search. As a result, the user must browse through each search result by downloading each web page to determine the actual relevance, if any, of the web page to what the user was looking for.
In an attempt to improve the quality of search results, “focused” web crawlers have been developed that are designed to crawl the Web looking for web pages that relate only to a particular topic. Typical focused web crawlers work the same way as a general web crawler in that they execute a loop consisting of the steps of: selecting a web page from a frontier, downloading the web page, and indexing the web page, as described above. Usually the main differences between the traditional general web crawler and the focused web crawler are that the focused crawler: (1) begins with a set of seed pages that contain many hyperlinks related to the topic of interest; (2) includes some sort of algorithm for ranking URLs in the frontier according to their predicted relevancy to the topic of interest; and (3) determines the relevancy of each web page after downloading it and indexes the page accordingly. Naturally, one of the main problems with focused crawling is determining the relevancy of every web page on the Internet with a limited amount of computing resources.
Embodiments of the present invention provide a solution to this problem and other problems associated with focused web crawling and web crawling in general.
Embodiments of the present invention provide a method of searching a network for information related to a topic of interest, wherein the network comprises a plurality of documents containing information. One or more of the documents are grouped together into a collection of documents so that the network comprises a plurality of collections of documents. The method comprises exploring the contents of one or more individual documents of a collection of documents. The method further comprises making a determination of the relevancy of the one or more individual documents of the collection to the topic of interest. The method also comprises making a determination of the relevancy of the collection based at least partially on the relevancy of the one or more individual documents in the collection.
Embodiments of the present invention further provide a system for gathering information related to at least one topic of interest. The system comprises a multi-tiered system configured for searching a network for information related to at least one topic of interest. Each tier of the multi-tiered system comprises more restrictive criteria than the previous tier for locating the at least one topic of interest. In one embodiment, the network comprises a plurality of collections of documents, each collection comprising one or more documents. In such an embodiment, the system may comprise a first tier configured to obtain a list of collections of documents on the network; a second tier configured to classify each of the collections in the list as being a member of one or more of a plurality of categories; and a third tier configured to examine the documents of at least some of the collections based at least partially on the classification of collections by the second tier.
Embodiments of the present invention also provide a method for requesting web pages from a plurality of web hosts, each web host supporting a finite number of web pages. The method comprises grouping the plurality of web hosts into one or more arrays of web hosts, each array comprising a finite number of web hosts. The method further comprises submitting a first web page request to each web host in an array before submitting a second web page request to any web host in the array.
Embodiments of the present invention provide a method of ranking hyperlinks found on the Internet during a web crawling scheme. The method comprises analyzing text in the immediate vicinity of a hyperlink as the hyperlink is found, and computing a weight for that hyperlink based on the relevancy of the text to a set of interest. The method further comprises storing the hyperlink in a datastore where hyperlinks stored therein are ranked based on the relative computed weights of each hyperlink.
Embodiments of the present invention provide a system for ranking hyperlinks found in a document on the Internet. The system comprises a link weighting system for analyzing text in the immediate vicinity of a hyperlink and for computing a weight for the hyperlink based on the relevancy of the text to a set of interest. The system further comprises a datastore for storing the hyperlink with other hyperlinks in a ranked list based on the relative computed weights of each hyperlink.
Embodiments of the present invention provide a method of determining whether a collection of web pages relates to a topic of interest. The method comprises exploring a web page of the collection of web pages; determining the relevancy of the web page to the particular topic of interest; and making a determination of the relevancy of the entire collection of web pages to the topic of interest based at least partially on the determined relevancy of the web page.
Embodiments of the present invention also provide a system of using a web crawler to classify a collection of web pages. The system comprises a web crawler module configured to search a web page from the collection for links to other web pages in the collection. The web crawler module is further configured to request the web pages corresponding to the links that the web crawler finds and to examine such web pages for more links to other web pages in the collection. The system also comprises a classifier module configured to make a determination of the relevancy of each web page that the web crawler examines to a set of interest. The system also comprises a collection classifying system for making a determination of the relevancy of the collection to the set of interest based on the determined relevancy of the web pages that the web crawler examines.
Embodiments of the present invention also provide a method of gathering information related to a particular topic of interest. The method comprises searching a network for information related to at least one topic of interest. Searching the network comprises searching the network in a multi-tiered format such that each tier comprises more restrictive criteria for locating the at least one topic of interest than a previous tier. The method also comprises extracting data from the searched information relating to the at least one topic of interest searched on the network.
Embodiments of the present invention also provide for a system for gathering information related to a particular topic of interest. The system comprises a multi-tiered system configured for searching a network for information related to at least one topic of interest, wherein each tier comprises more restrictive criteria for locating the at least one topic of interest than a previous tier. The system also comprises an information extraction engine configured for extracting data from the searched information relating to the at least one topic of interest searched on the network.
Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the inventions are shown. Indeed, these inventions may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.
The computer network 1 may comprise many individual documents stored on various networked electronic devices. These documents may be associated with one or more document identifiers that allow the document to be uniquely identified and referred to on the network. In some instances, the documents on the network may be able to be grouped into collections of documents. For example, the documents may be grouped into collections based on some common aspect of the documents, such as a common host device, a common portion of the identifier, a common author, source, or origin, or any other common aspect. Preferably, there is a way to identify a collection of documents, such as a collection identifier, so that each separate collection can be uniquely identified, referred to, and/or recalled. For example, each collection of documents may contain a main or primary document. In one embodiment of the present invention, the document identifier for the main document may be used as the collection identifier.
One example of a network that one embodiment of the present invention may be configured to operate in is the Internet. As described above, the Internet is a global computer network involving hundreds of millions of interconnected devices. Some exemplary embodiments of the present invention may also be configured to operate on the Internet specifically in the World Wide Web (the “Web”). As also described above, the Web is a standardized system of communication, linking, and displaying information on the Internet. For example, a host device may have documents stored in the device that contain information written using a standardized language, such as HTML. The host device may make these documents available on the network. Remote terminals can then request the document from the host device (e.g., by making an HTTP request) and then download the document. The remote terminal may have a web browser that can interpret the standardized language and display the information in the document on the remote terminal's monitor. Such documents on the Web are often referred to as web pages. Each web page has a unique identifier, such as a URL, that can be used to refer to and request the web page on the network.
One or more web pages may be grouped into collections of web pages, often referred to herein as “websites.” As described earlier, the groupings may be based on a common host device, a common IP address, a common domain name, and/or any other identifiable aspect common to a group of web pages. For example, each collection may be uniquely identifiable and/or individually accessible on the network by a URL that accesses a home directory of the host. Often, the home directory comprises a main web page for the website. Such a web page is often referred to as a “home” page from which the other web pages in the website are often accessible via one or more hyperlinks. Some individual web pages of a website, however, may not be directly accessible via hyperlinks on a home page and instead may only share a common domain name portion of the URL. Similarly, some information and pages connected to a website may not be accessible by a hyperlink since the information or web page may be located in a data repository stored on the host device (or other device) but accessible to a remote terminal only by submitting information into a form on a web page. As described above, web pages and other information accessible through such forms represent much of what is known as the “deep” or “hidden” Web, where a large percentage of the information on the Web is located.
In some embodiments of the present invention, each website is considered to be a separate collection of documents on the Internet. It is often assumed that websites usually comprise pages having at least some common subject matter. Sometimes, however, very large and general websites may provide information on a wide variety of subjects. As such, some embodiments of the present invention may further subdivide some websites into smaller collections of web pages, such as branches or paths of a website (a “path” being a collection of nodes usually beginning with the “root” node). For example, a collection identifier according to one embodiment of the present invention may be a specific node of a URL or a specific path within a URL.
In order to find and/or classify information on a network, such as the network 1 described above, embodiments of the present invention provide systems for searching and/or classifying information on a network, such as web pages (or information found in web pages) that are located on the Internet. In particular, embodiments of the present invention are directed to efficiently finding, searching, and/or classifying documents or web pages on a network that are likely to be relevant to one or more particular topics or subtopics of interest. In some embodiments of the present invention, the system also extracts the relevant data from the relevant documents or web pages. In still further embodiments, the system is configured to compile the extracted data into a usable form (e.g., a searchable datastore for use by a network user).
According to one embodiment of the present invention, a multi-tiered cascading crawling system is provided for finding information on a network related to one or more predetermined topics or subtopics of interest. In general, embodiments of the present invention provide a system that operates in multiple “tiers,” where at least some of the output of one tier forms at least part of the input of the next tier. Each tier generally analyzes collections of documents on the network using successively more restrictive criteria about the subject matter of each collection and/or about which collections may be related to the one or more topics or subtopics. In this way, each tier may use knowledge gleaned from its predecessors to refine its understanding of a collection of documents. In general, only the final tier performs an exhaustive crawl of all of the documents of the collections that are identified by the system as being relevant to the topic or subtopic of interest. Furthermore, only the documents from collections that remain at this last tier have data extracted from them and/or are saved, indexed, or otherwise used by other systems or subsystems. In this way, embodiments of the present invention may provide a more efficient system and method of conducting a focused crawl of a network. Specifically, unlike the traditional crawlers that search for web pages on the Internet and then download, analyze, and index each web page at the time the page is found, embodiments of the present invention search for collections of documents and make an initial determination as to the content of the collection as a whole. This allows successive tiers of the system to make increasingly detailed determinations about each collection of documents and, thereby, saves computer resources for conducting a detailed analysis of only the documents contained in the most relevant collections.
The collection harvesting system 110 is configured to obtain a list of collections of documents that exist on the network. For example, the collection harvesting system 110 may be configured to find or otherwise obtain collection identifiers that can be used to identify the collections of documents. The collection identifiers provided by the harvesting system may then be used by the collection classifying system 120. The collection classifying system 120 may be configured to receive a collection identifier from the harvesting system 110 and use the collection identifier to access at least a portion of the collection on the network. The collection classifying system 120 may then analyze one or more documents from the collection to make assumptions or determinations about the content of the collection as a whole, such as whether or not, or to what extent, the collection may relate to one or more predetermined topics or subtopics. If the collection classifying system 120 determines that a collection potentially includes information related to the particular topic or subtopic, the collection identifier for that collection may be provided to the document crawler and classifying system 130. The document crawler and classifying system 130 may be configured to search for documents in the potentially relevant collections and analyze each available document to determine whether, or to what extent, each document relates to one or more topics or subtopics of interest or whether any of the documents contain some particular types of information related to the topics or subtopics of interest.
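The flow of collection identifiers through these three systems may be sketched as follows; all of the function names and the relevancy threshold are hypothetical placeholders for the systems described above:

```python
def cascade(harvester, classify_collection, crawl_and_classify,
            relevancy_threshold=0.5):
    # Tier 1: the harvester yields collection identifiers.
    for collection_id in harvester():
        # Tier 2: classify the collection as a whole from a sample
        # of its documents; drop collections below the threshold.
        if classify_collection(collection_id) >= relevancy_threshold:
            # Tier 3: exhaustively crawl only the surviving
            # collections and yield their relevant documents.
            yield from crawl_and_classify(collection_id)
```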
The information pertaining to the relevancy of one or more documents to a particular topic or subtopic may then be used for many purposes. For example, in one embodiment of the present invention, the documents or document identifiers may be stored in a datastore and indexed based on the determined relevancy of the document to a particular topic or subtopic.
The collection identifiers that are obtained by the harvesting systems are provided to one or more collection classifying systems 220. The collection identifiers may be provided directly to the classifying systems or they may be provided to a datastore where the collection identifiers are stored for the classifying systems to later fetch and use. When multiple harvesting systems are used in parallel, the individual systems may obtain at least some of the same collection identifiers. In some embodiments, duplicate identifiers are simply eliminated. In other embodiments, obtaining duplicate identifiers may signify that the identified collection is more popular or otherwise more or less significant to the focused crawl and, as such, may be weighted more or less heavily than other collection identifiers. In some embodiments of the present invention, collection identifiers that are weighted more heavily may be selected by the collection classifying system 220 before collection identifiers that are weighted less heavily.
For each category used to divide the collections there may be any number of additional collection classifying systems such as the two shown as 221 and 222. These collection classifying systems 221 and 222 can be used to further classify the collections of documents into subsets. Each collection classifying system can also make assumptions about the collections that it receives based on the collection classifying systems that came before it in the cascade. For example, the collection classifying system 221 may be configured to analyze at least some of the documents in each collection to determine whether the collection is relevant to the particular topic of interest. For instance, the collection classifying system 221 may look for keywords or phrases that are related to the topic of interest. In the example where classifying system 220 sorts collections based on language, every collection that classifying system 221 receives contains at least some information written in a particular language. As such, the keywords and phrases that the classifying system uses can be tailored to that particular language.
In the embodiment of the present invention where the collection classifying systems 221 and 222 are configured to further classify the collections based on relevancy to the topic of interest, the classifying systems 221 and 222 may be configured to analyze at least a sample of the documents in each collection and make a determination of the relevancy of the entire collection based on the determined relevancy of the sampled documents. Specifically, how embodiments of the collection classifying systems may analyze sample documents from a collection and, thereby, make a determination of the relevancy of the collection is described in detail below. Collections that are determined to have a relevancy greater than a threshold relevancy may be output to a document crawler and classifying system 230. If classifying system 221 deems a collection to be irrelevant to the topic of interest, the classifying system 221 may simply not output such irrelevant collections or, alternatively, may output their collection identifiers into a dump datastore 224 where this information could be used for diagnostic or other purposes.
As can be seen by comparing classifying systems 220 and 221, some classifying systems may allow a single collection to be placed in more than one subset, while other systems require that the collection be placed in either one subset or another, and not multiple subsets. For example, if ten documents of a collection are analyzed by a collection classifying system configured to classify collections based on language, and each document is written in a different language, the collection identifier may be output to ten different collection classifying systems, one for each language. However, where the collection classifying system is configured to make an assessment of a collection's relevancy to one particular topic, the collection must be deemed to be either relevant or not relevant (or undecided), and the collection must be either sent on to the next tier in the cascade or removed from the cascade.
Once a collection is provided to a document crawler and classifier system, the collection is crawled for documents and each document is individually analyzed and classified as either containing relevant information or not containing relevant information. This crawling and classifying of each document in a collection is conducted by the document crawler and classifier systems 230, 232, and 233. The documents that are deemed to contain relevant information may then be output to one or more data extraction systems 240 and 241. It should be appreciated that the document crawler and classifier system can make assumptions about the documents based on each document having reached this point in the cascade. As a result, the classification is always a narrowing or a focusing of the information believed to be known about the collection and/or document.
The next tier of the crawling system 300 comprises a website language classifying crawler 320. The website language classifying crawler 320 is configured to crawl each website and analyze one or more of each website's web pages to determine the language(s) or dialect(s) used to present information in the website. In addition to or as an alternative to determining languages and dialects, the website language classifying crawler 320 may be configured to determine the locale of the website, the website's creator, and/or the website's audience if such information may be relevant to the linguistic features of the website. For example, what people on the U.S. west coast refer to as a “studio” might be more commonly referred to as an “efficiency” on the U.S. east coast. As a result, in some embodiments, it may be useful to classify US/NorthEast as separate from US/WestCoast because, although the language and the dialect may be the same, the terminology and common phrases by which real estate (or whatever the topic of interest might be) is described may differ. Determining such a difference and separating the websites based on such a difference in an early stage of the cascade may allow both classification and extraction to be more precise and/or efficient.
In some embodiments, the language crawler 320 may be configured to classify a website into one or more of many languages, dialects, and/or locales. In other embodiments, the language crawler 320 may be concerned with finding websites in only one particular language, dialect, or locale and, as such, is configured to output only websites in that language, dialect, or locale to the next tier of the main crawler 300.
In the focused crawler 300, the websites output by the website language classifying crawler 320 are provided to one or more website relevancy classifying crawlers, which are configured to determine whether, or to what extent, each website relates to the topic of interest.
Any website URLs for websites that are deemed by the website relevancy classifying crawlers to be “relevant” (e.g., websites that received a relevancy score from the website relevancy classifying crawlers greater than some relevancy threshold) are passed to the next tier in the focused crawler 300.
After determining what category or categories the websites best fit into, the website URLs are then provided to web page crawler and classifier systems 330-334 that correspond to each of the categories of websites. For example, the URLs for the English websites that are related to apartment rentals are provided as input for an English apartment web page crawler and classifier 332. The web page crawler and classifier systems are configured to exhaustively crawl the websites to determine which web pages of the websites have real estate listings. Since there is a web page crawler and classifier system for each category of the topic of interest, the individual web page crawler and classifier systems can each be tailored to search for listings related to the particular category and language. For example, the typical apartment rental listing may use very different terminology and/or formats than typical home sale listings. As a result, it may be beneficial to use different web page crawler and classifier systems, each crawler and classifier system tailored to find either apartment listings or home sale listings. Furthermore, since this tier is usually the only tier in which each available web page of a website is exhaustively crawled, downloaded, and analyzed, computing resources are efficiently used since this exhaustive crawl is only conducted on a limited subset of the Internet that has been determined to be “relevant” at least to some degree.
Web pages that are determined to include listings are output from the web page crawler and classifier systems to a processing system where the listing information can be used.
Now that embodiments of a multi-tier cascading focused crawler have been described, the individual systems and subsystems that make up at least some embodiments of the crawler are described in detail below. Although the embodiments of these systems and subsystems of the present invention are described below with respect to the Web, the present application should not be considered as being limited only to the Web. It should be appreciated that embodiments of the present invention may be configured to operate in other types of networks and with other types of network protocols. Furthermore, although the below systems and subsystems are described as being used as part of the multi-tiered crawling system of the present invention, the below systems may be novel in their own right, independent of their use with the multi-tiered cascading crawler.
It should be appreciated that embodiments of the present invention provide for a modular crawling system that can be arranged, rearranged, and finely tuned to create an efficient focused crawling system for the large, complex, and often-changing networks such as the Internet. The basic building blocks for many of the systems and subsystems of some embodiments of the present invention include the crawler module and the classifier module.
In one embodiment, in addition to or as an alternative to using hyperlinks to locate new URLs, when the web crawler module 400 encounters a web page having a form, the web crawler module 400 may generate “pseudo URLs” that simulate a user entering information through the form. For example, the web crawler module 400 may be configured to evaluate a form written in JavaScript and create pseudo URLs that correspond to different possible variations of a form post. These pseudo URLs may then be saved to the datastore. Other methods of finding new URLs known in the art or made obvious by the present disclosure may also be employed by the web crawler module 400. For simplicity, the following description discusses the crawling process and system primarily in terms of finding and storing hyperlinks; however, it should be appreciated that embodiments of the present invention may be configured to find new URLs using other methods, such as by evaluating forms and generating pseudo URLs in order to crawl web pages accessible through such forms.
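One possible realization of pseudo URL generation is sketched below; the form action and the field names and values are hypothetical, and the sketch simply enumerates combinations of form inputs as GET-style query strings:

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical form with two fields, each with candidate values.
form_action = "http://example.com/search"
field_options = {
    "city": ["westlake_village", "thousand_oaks"],
    "type": ["apartment", "house"],
}

def pseudo_urls(action, options):
    # Each combination of field values simulates one form post.
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield action + "?" + urlencode(dict(zip(names, values)))

for url in pseudo_urls(form_action, field_options):
    print(url)
```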
As will be described below, the crawler may be free to crawl the entire network or the crawler may be configured to crawl only a limited segment of the network. For example, in one embodiment the crawler may be limited to crawling only web pages of a particular website (i.e., “internal” web pages). Embodiments of this crawler module 400 are used to create at least some aspects of the multi-tiered network crawling system according to some embodiments of the present invention.
After analyzing the web page, the classifier generally is configured to output data indicative of the relevancy of the web page to the set of interest. Typically, the output of a classifier is a “yes” or a “no,” although some classifiers can be configured to produce a result of “indeterminate.” For example, the classifier may be configured to output a score of 0 if the web page is not determined to be a member of the set of interest, a 1 if the web page is determined to be a member of the set of interest, and a null value if it is indeterminate whether the web page is a member of the set of interest. Other classifiers may output a numerical score within a range of scores, such as a number between 0 and 1. Some of these scoring classifiers can also output exact 0, 1, and null scores.
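These scoring conventions may be sketched as follows; the keyword list and the thresholds are purely illustrative:

```python
KEYWORDS = {"apartment", "rent", "listing"}  # illustrative only

def binary_classifier(text):
    hits = len(set(text.lower().split()) & KEYWORDS)
    if hits == 0:
        return 0      # not a member of the set of interest
    if hits >= 2:
        return 1      # a member of the set of interest
    return None       # null: indeterminate

def scoring_classifier(text):
    # Returns a score in the range 0 to 1.
    hits = len(set(text.lower().split()) & KEYWORDS)
    return hits / len(KEYWORDS)
```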
As described above, embodiments of the present invention comprise one or more collection harvesting systems configured to obtain a list of identifiers that can be used to identify and/or access collections of documents on the network. The primary goal of the harvester system is usually to generate as many collection identifiers as possible to feed into the classifying systems. Different systems for obtaining collection identifiers may be created, each system having its own strengths and weaknesses. As such, embodiments of the present invention may be configured to use more than one harvesting system to generate the list of collection identifiers.
Harvesting system 545 illustrates one harvesting system of an embodiment of the present invention. The harvesting system 545 involves purchasing, or otherwise obtaining, a predetermined list of collection identifiers. For example, in one embodiment, a list of known websites, domain names, URLs, or IP addresses may be purchased from some other source. In one embodiment, such a list may comprise identifiers for collections of documents, such as websites or website branches, that are known to generally relate to some particular topic of interest that may be similar to the topic of interest of the focused crawler system.
In one embodiment of the IP harvester, the steps of submitting a request to an identifier 569 and determining if a web server (host) responds 571 comprise requesting a robots.txt file from the identifier. The robots.txt file holds a host's robot exclusion policy. If the host returns a policy indicating that crawlers are allowed to crawl the host's home page, the host is considered a “hit” and the corresponding website is added to the collection datastore 595. If the host returns a “file not found” error, that means there is a web server listening with no robot restrictions, and the host is also considered a “hit” and the corresponding website is added to the collection datastore 595. Only if the host fails to respond at all, or if the host's robot policy forbids visiting the home page, does the IP harvester consider the host a “fail” and not add the corresponding website to the collection datastore. This embodiment of the IP harvester 565 has the added advantage of satisfying at least some of the politeness requirements by only collecting sites that are allowed to be crawled.
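The hit/fail test may be sketched as follows using Python's standard HTTP and robot-exclusion libraries; the user-agent string and timeout are assumptions:

```python
import urllib.error
import urllib.request
import urllib.robotparser

def probe_host(host, agent="example-crawler"):
    """Return True (a "hit") if a web server responds and permits
    crawling of the home page, per the logic described above."""
    try:
        with urllib.request.urlopen(f"http://{host}/robots.txt",
                                    timeout=10) as resp:
            policy = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # "File not found": a server is listening with no robot
        # restrictions, so the host is still a hit.
        return error.code == 404
    except (urllib.error.URLError, OSError):
        return False  # no response at all: fail
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(policy.splitlines())
    # Hit only if the returned policy allows visiting the home page.
    return parser.can_fetch(agent, f"http://{host}/")
```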
In some embodiments of the IP harvester, at least some failed identifiers may be kept in the collection datastore. For example, in one embodiment, all of the failed identifiers are kept in a datastore, but the failed identifiers are separated between those that failed due to a lack of response and those that failed due to a robot exclusion policy. Identifiers for which no response was received may be ignored by the crawler. For identifiers where the robot exclusion policy forbids crawlers, the crawler might recheck the host or the corresponding website periodically to see if the robot exclusion policy or the host has changed.
As described above, embodiments of the multi-tiered crawling system comprise collection classifying systems for classifying a collection of documents as belonging to some set of interest based at least partially on an analysis of one or more documents in the collection. Embodiments of the collection classifying systems described below are configured to operate on the Web to classify collections of web pages. Although these embodiments are described in terms of the Web, other embodiments may be configured to operate on other networks using other communication protocols. Furthermore, the embodiments described below are described as being configured to classify websites (i.e., web pages originating from a common host). However, as described earlier, embodiments of the present invention may be configured to classify other types of collections of web pages, such as branches of websites, web pages having a common portion of a URL, web pages having a common domain name, web pages originating from a common IP address, and the like.
Collection classifying systems generally include one or more sampler modules.
According to one embodiment of the present invention, a plurality of classifier modules may be used together in series or in parallel with each other to produce one or more scores for a given document. Thus, a sampler may comprise one or more classifiers linked together in series or in parallel. For example, a chain of classifiers may be linked together in series so that a classifier lower in the chain can make assumptions about a web page it receives based on the scores given by the classifiers higher in the chain. The chain of classifiers generally comprises a chaining rule that determines how the individual classifier scores are to be combined. Examples of chaining rules may include: (1) a higher score always replaces a lower score; (2) scores of exactly zero or exactly one cannot be altered regardless of the subsequent scores; (3) scores from a higher classifier and the proposed score of the current classifier are averaged to produce a new score; (4) two very different scores are both replaced with null (indeterminate) in the hopes that a subsequent classifier will provide additional clarity as to the appropriate score; and the like. For example, a first classifier may be configured to exclude web pages that are too large for a second classifier in a chain of classifiers to handle. The first classifier may be configured to set the web page's score to zero if the web page is determined to be too large and to a null score if the web page size is determined to be acceptable. The chaining rule may be to “replace null scores only.” The second classifier may then be configured to ignore zero-scored documents as non-null and replace the null values with the appropriate score based on the second classifier's classification criteria.
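The size-guard example and a “replace null scores only” chaining rule may be sketched as follows; the size limit and the keyword test are illustrative placeholders:

```python
MAX_PAGE_BYTES = 500_000  # assumed limit for the second classifier

def size_guard(page, score):
    # First classifier: score 0 if the page is too large for the
    # next classifier to handle, otherwise leave the score null.
    return 0 if len(page) > MAX_PAGE_BYTES else None

def keyword_classifier(page, score):
    # Chaining rule "replace null scores only": zero-scored pages
    # are passed through untouched.
    if score is not None:
        return score
    return 1 if "real estate" in page.lower() else 0

def run_chain(page, classifiers):
    score = None
    for classify in classifiers:
        score = classify(page, score)
    return score

print(run_chain("Homes and real estate for sale",
                [size_guard, keyword_classifier]))  # prints 1
```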
Embodiments of the sampler module may also include classifier modules arranged to operate in parallel in order to assign multiple scores to a single web page. The sampler module may then use each score separately to provide multiple scores for a website, or the sampler module may be configured to combine one or more of the scores to produce a new score for the web page or the website.
Just as embodiments of the sampler module may use multiple classifier modules linked together in series and/or in parallel to classify individual web pages, embodiments of the website classifying systems may use one or more samplers operating in series and/or in parallel to classify individual websites. If multiple sampler modules are used, the one or more scores of each sampler may be output by the website classifying system separately or the scores of one or more samplers may be combined to output one or more new scores for a website.
The efficiency of the multi-tiered crawler of the present invention depends upon efficiently and accurately classifying collections of web pages, such as websites. To accomplish this, embodiments of the present invention utilize one or more of the sampler modules described above to analyze a limited number of web pages of each website in order to make assumptions about the website based on the analysis of the limited number of web pages. As a result, the efficiency of the crawling system generally also depends upon how many web pages of a website need to be analyzed in order to make accurate assumptions about the website as a whole.
According to one embodiment of the present invention, a ranked link extraction system is provided to allow for accurate assumptions to be made about a website (or other collection of web pages) based on the examination of a relatively small fraction of the pages that may exist on the site.
In one embodiment of the ranked link extraction system 700, the weight for a hyperlink is computed based on how well the text in the immediate vicinity of the hyperlink relates or does not relate to some set of interest. For example, a weight score may be a measure of the strength of the correlation between the text and the type of documents that the crawling system is attempting to find. In some embodiments, the ranked link extraction system 700 is configured to measure the strength of the correlation between the text and the set of interest (and appropriately adjust the weight score) using information about the source of the URL for the web page being examined by the system. In other words, if the ranked link extraction system 700 is being used, for example, in a lower tier of a cascading focused crawling system, the historical context of the links or paths that led to the web page currently being examined can influence the weight scores of links extracted from that web page. For example, a hyperlink labeled “search” may receive an average score when found on a web page having no historical context; however, the hyperlink may score significantly higher if the link that was used to arrive at the current web page was labeled “real estate” and/or had a high weight score. Similarly, the hyperlink may score significantly lower if the link used to arrive at the current web page was labeled “yellow pages” and/or had a low weight score.
In one embodiment of the system, weight scores may be both positive and negative. For example, weights may be positive when the text indicates the desired document type, and weights may be negative when the text predicts an absence of that document type. The magnitude of the weight may indicate the strength of the positive or negative relation.
For example, in the real estate context, the word “estate” found in close vicinity to a hyperlink may be a poor indication of the likelihood that the link is related to real estate unless the word “estate” is specifically part of the phrase “real estate.” In such a scenario, the word “estate” may be given a weight of −4 and “real estate” may be given a weight of +8. In this way, “estate” without the “real” is a 4-unit penalty, but “real estate” is a net 4-unit bonus (+8 − 4 = +4), since both weights apply. Naturally, weights assigned to particular terms or phrases of text may vary depending on where in the multi-tiered crawler the ranked link extraction system is used.
In one embodiment of the ranked link extraction system, each hyperlink starts with a baseline weight score and the weights of the text in the vicinity of the hyperlink either add or subtract from this baseline score. In one exemplary embodiment, the baseline score is large enough so that the number of digits in the cumulative weight scores of each of the ranked hyperlinks is the same. Keeping the numbers of digits in each weight score constant may assist in ordering the hyperlinks by their relative weights.
Preferably the text in the vicinity of the hyperlink that is analyzed is text that has a high probability of relating to the contents of the link. For example, the text that is analyzed may be: text from the URL associated with the hyperlink; the link text of the hyperlink in question; text associated with any image files used in the hyperlink; text in the link structure; or text nearby the hyperlink in the HTML code or in the displayed web page. Where text near the hyperlink is used to weight the hyperlink, it may be preferable to not analyze text beyond a punctuation mark in the text following the link. In one embodiment, the system analyzes text within some number of words in front of the link (or up until another link or punctuation mark) to get a more complete context in which to interpret the link.
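The weighting scheme may be sketched as follows, reusing the illustrative “estate” and “real estate” weights from the example above; the baseline value and the additional “yellow pages” weight are assumptions:

```python
BASELINE = 1000  # keeps all cumulative scores the same width
PHRASE_WEIGHTS = {
    "real estate": +8,
    "estate": -4,        # penalized unless part of "real estate"
    "yellow pages": -6,  # assumed negative indicator
}

def weigh_link(nearby_text):
    # Add the weight of every matching phrase to the baseline.
    text = nearby_text.lower()
    score = BASELINE
    for phrase, weight in PHRASE_WEIGHTS.items():
        score += weight * text.count(phrase)
    return score

# "real estate" matches both "real estate" (+8) and "estate" (-4),
# for a net bonus of +4 over the baseline.
print(weigh_link("Search real estate listings"))   # 1004
print(weigh_link("Estate auctions this weekend"))  # 996
```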
In one embodiment of the sampler module 800, subsequently discovered links do not displace links that have already been classified by the sampler even if the links already classified have lower weight scores than subsequently discovered links. In such an embodiment, the number of web pages that the sampler classifies may be kept constant, but the average link weight for the web pages that are classified may be increased by the ranked link extraction system. In another embodiment of the sampler module 800, however, subsequently discovered links may displace links that have already been classified by the sampler if the links already classified have lower weight scores.
The document crawler and classifying system generally comprises a crawler module configured to receive an indication of a collection of documents and exhaustively find and retrieve every available document from the indicated collection of documents. Like the sampler module described above, the document crawler and classifying system uses one or more classifier modules to analyze documents of the collection. Unlike the sampler, however, the document crawler and classifying system is not concerned with classifying the collection and is instead configured to specifically locate the individual documents of a collection that are actually related to the main crawler's topic of interest or a subtopic of interest. In some embodiments of the present invention, the document crawler and classifying system is configured to find documents that contain a specific type of information related to the topic or subtopic of interest.
For example, to use the example described earlier of a multi-tiered crawler configured to locate information related to real estate listings, the various tiers coming before the document crawler and classifying system may have narrowed the Internet down to a plurality of websites or website branches that have been determined to at least include one or more web pages related to real estate or some subtopic of real estate, such as new home construction. Thus, the document crawler and classifying system may now be configured to exhaustively crawl each website determined to contain information relevant to new home construction and find all of the individual web pages that actually include listings for the sale of new homes. According to one embodiment of the present invention, the document crawler and classifying system may be configured to output these web pages, or the web page identifiers, to other systems, such as an indexing system that indexes the web pages, or a data extraction system that actually extracts the listing data and compiles the data extracted from each web page into a searchable datastore of real estate listings.
Many of the systems and subsystems described above involve fetching multiple documents from the same host. For example, the document crawler and classifying system 900 generally must repeatedly submit document requests to the same host as it crawls the collection of documents supported by the host device and downloads and analyzes the individual documents in that collection. Likewise, the link harvester system 550, the targeted harvester system 580, and the sampler module 600 of the collection classifying system may also have to repeatedly submit document requests to the same host. As described in the background section above, politeness often requires that multiple document requests not be made too rapidly to the same host so that the host's network performance is not significantly degraded or used up by the crawler. As also described above, being “polite” generally requires that a crawling system wait for some period of time after submitting one document request (e.g., an HTTP request for a web page) to a host before submitting another document request to the same host. Naturally, if the crawling system must sit idle, even if for a second, between submitting multiple document requests to the same host, efficiency of the crawling system is not maximized.
In order to satisfy the politeness requirements while maintaining the efficiency of the crawling system, embodiments of the present invention utilize a novel approach for fetching multiple documents from a plurality of hosts. Specifically, the hosts that are to be accessed are organized into one or more groups of hosts. The hosts are arranged in an array having a first host and a last host. Each host includes a list of documents or web pages to be requested from the host. A cyclic array of hosts is created wherein the first host follows after the last host and the last host comes before the first host. One document or web page is requested from each host in the array, one host at a time. After the first request has been submitted to each of the web hosts in the array, ending with the last host in the array, a second request can be submitted to each host in the array, beginning with the first host again and ending with the last host. This process may then continue spiraling around the cyclic array of hosts and down the list of documents to be requested from each host. If the array of hosts is large enough, the time it takes to make a request to each host in the array is greater than the politeness requirement for any host in the array. In this way, the requesting system may be able to constantly make web page requests while, at the same time, remaining polite to the individual hosts.
In order to more clearly illustrate the concept of the cyclic array of web hosts,
The process described above can continue until all of the requests for all of the web hosts have been made. However, since the hosts in a group may vary with respect to the number of web page requests to be made to each, some hosts may exhaust their lists of requests before others. As a host runs out of requests to be submitted to it, that host is skipped each time its turn arrives. As this happens, the group of hosts may become smaller and smaller. At some point, depending on the size of the group and the time required to be polite to a host, the array of hosts may become small enough that one trip around the array does not satisfy the politeness requirement. When this happens, the requesting system may be configured to monitor the time since the last request for each host and, if a request is to be made to a host before the politeness requirement is met, the system may be configured either to wait until the politeness requirement is met to submit the request or to simply skip over the host until the system comes across a host for which the politeness requirement is satisfied. Because efficiency may be reduced as the array of hosts shrinks, it may be beneficial to attempt to form groups of hosts that have similarly sized lists of requests to be made.
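The skip-or-wait behavior can be sketched as follows, under the simplifying assumption of a single fixed politeness interval; POLITENESS_S is an assumed tuning constant, not a value prescribed herein:

    # Sketch of skip-or-wait politeness as the host array shrinks;
    # POLITENESS_S is an assumed constant and the request is simulated.
    import time

    POLITENESS_S = 1.0
    last_request = {}  # host -> time of the most recent request to it

    def try_fetch(host, path, skip_if_impolite=True):
        elapsed = time.monotonic() - last_request.get(host, float("-inf"))
        if elapsed < POLITENESS_S:
            if skip_if_impolite:
                return False                    # skip until a later revolution
            time.sleep(POLITENESS_S - elapsed)  # or wait out the remainder
        last_request[host] = time.monotonic()
        print("requesting http://%s%s" % (host, path))
        return True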
In one embodiment, the host may be removed from the group instead of merely being skipped over by the system making the requests. In another embodiment, however, the host is never removed from the group and is simply skipped each time its turn arrives while the host has no more requests to be submitted to it. Such an embodiment, in which the empty host is not removed from the group, may be beneficial since crawling some other host in the system may yield new requests for the "empty" host.
In one embodiment of the present invention, hosts that do not have any requests remaining to be made may be replaced by new hosts to which requests need to be submitted. In this embodiment, the host array size may be kept constant or nearly constant so that politeness requirements are generally always met.
Note that in some embodiments of the present invention, the politeness requirements may be constant for all hosts, such as some predetermined amount of time. In other embodiments of the present invention, politeness requirements may be based on the individual host. For example, the time required to be polite to a particular host may depend on how quickly the host responded to the last request that was submitted to the host. If the host took a long time to respond to a request, the time required to be polite may be adjusted to be longer than the time required to be polite to a host that responds immediately after a request is made.
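One possible expression of such an adaptive, per-host politeness interval follows; BASE_S and SLOWDOWN_FACTOR are assumed tuning constants rather than values taken from this disclosure:

    # Sketch: scale a host's politeness interval by how slowly it
    # answered its previous request; both constants are assumptions.
    BASE_S = 1.0
    SLOWDOWN_FACTOR = 3.0

    def politeness_interval(last_response_seconds):
        # A slow host gets a proportionally longer rest; a fast host
        # never waits less than the base interval.
        return max(BASE_S, SLOWDOWN_FACTOR * last_response_seconds)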
In one embodiment of the invention, the system is configured to "re-crawl" the pages of a host in order to ensure that information is kept relatively fresh. One method of re-crawling involves making each host's list of requests in the cyclic array 980 itself cyclic. In other words, the individual columns illustrated in
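One hypothetical way to make a single host's request list cyclic, gated by an assumed re-crawl period, is sketched below; RECRAWL_S and the page paths are illustrative only:

    # Sketch of a cyclic per-host request list with a re-crawl gate;
    # RECRAWL_S and the paths are assumed values.
    import itertools
    import time

    RECRAWL_S = 24 * 3600                  # assumed re-crawl period
    pages = itertools.cycle(["/page1", "/page2", "/page3"])
    last_crawled = {}                      # path -> time it was last fetched

    def next_recrawl_candidate():
        page = next(pages)
        if time.monotonic() - last_crawled.get(page, float("-inf")) >= RECRAWL_S:
            last_crawled[page] = time.monotonic()
            return page
        return None  # re-crawl period not yet satisfied; skip this turn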
In one embodiment, a host is skipped if its next request to be made cannot be completed due to a robot exclusion policy, an unsatisfied politeness requirement, or an unsatisfied re-crawl period. In such a case, the next time a request is to be made to that host (i.e., the next time around the cyclic array) the system may proceed to the next request for that host. In another embodiment, the next time around the cyclic array, the system may try again to make the request that was skipped on the last time around the cyclic array.
In one embodiment of the present invention, individual systems or subsystems of the multi-tiered crawler are configured to make requests from a plurality of hosts using the above-described cyclic array method. In another embodiment of the present invention, the multi-tiered crawler may comprise a document fetching system configured to make requests for documents on the network for one or more of the systems and subsystems of the multi-tiered crawler. In this embodiment, whenever the one or more systems and subsystems require that a document request be made to a host on the network, the request is first sent to the document fetching system. The document fetching system can then compile the requests from the one or more systems and subsystems and make the requests using the cyclic array process described above. Another advantage of such a centralized requesting system is that the fetching system could temporarily store requested documents in a cache so that, as the collections are passed from one tier to another, if other tiers in the multi-tiered system need to analyze the same documents, the documents do not have to be requested from the host two or three times in a row.
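A centralized fetcher with a short-lived cache might be sketched as follows; the cache lifetime is an assumed tuning value, and urllib from the Python standard library stands in for whatever request machinery an actual embodiment would use:

    # Sketch of a centralized fetching service with a temporary cache,
    # so tiers re-requesting the same URL within a window do not hit
    # the host again; CACHE_TTL_S is an assumed value.
    import time
    import urllib.request

    CACHE_TTL_S = 300
    _cache = {}  # url -> (timestamp, document bytes)

    def fetch_document(url):
        hit = _cache.get(url)
        if hit and time.monotonic() - hit[0] < CACHE_TTL_S:
            return hit[1]                   # served from cache, no new request
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        _cache[url] = (time.monotonic(), body)
        return body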
According to an additional embodiment of the present invention, following web crawling and classification of various web pages, websites, or other information of interest, data extraction is employed to extract information of interest and to convert that information into a format capable of being further processed by various applications. In general, data extraction uses the information identified as relevant during crawling in order to identify specific pieces of data or entities that may be provided in an indexed format for a search engine. For example, with respect to the real estate industry, listing information relating to price, number of bedrooms, etc. can be identified and located on one or more web pages identified during the crawling and classification processes, as shown in
With reference to
Entity extraction generally encompasses both the HTML transformation/conversion module 1002 and the entity extraction engine 1004. The HTML transformation/conversion module 1002 is employed to enhance the extraction process by altering the source of the HTML files and eliminating content within each HTML document that may negatively affect the extraction process, such as errors or noise. For example, JavaScript could contain data that is not related to the content of the HTML file. In particular, the HTML transformation/conversion module 1002 may be used to remove unwanted HTML tags (e.g., <meta> and <script>) from the HTML documents (block 1016), replace HTML tags with corresponding values (e.g., <br> with a linefeed), and/or keep the structure and font-related tags (e.g., <table> and <font>) within the HTML documents obtained from the crawling process (generally depicted as block 1012). Thus, the HTML transformation/conversion module 1002 modifies the HTML documents into a format that facilitates the extraction of desirable data from the documents. However, the transformation/conversion module 1002 may not be needed for some entity extraction vendors, for example, where a specific entity extraction engine 1004 is capable of processing the current HTML format and ignoring unwanted HTML tags contained in the HTML documents.
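For illustration only, such a transformation pass could be approximated with the third-party BeautifulSoup library; the tag choices below simply mirror the examples given above and are not an exhaustive rule set:

    # Illustrative transformation in the spirit of module 1002: drop
    # noisy tags, turn <br> into a linefeed, leave structure/font tags.
    from bs4 import BeautifulSoup

    def transform_html(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(["meta", "script"]):
            tag.decompose()            # remove unwanted tags entirely
        for br in soup.find_all("br"):
            br.replace_with("\n")      # replace tag with corresponding value
        return str(soup)               # <table>, <font>, etc. pass through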
The entity extraction engine 1004 is utilized to extract various relevant pieces of data (i.e., entity information depicted as block 1014) from the HTML documents. For instance, with respect to the real estate industry, the entity extraction engine 1004 could extract information relating to the number of bathrooms, the number of bedrooms, or the price of a real estate property.
In some embodiments, the entity extraction engine 1004 extracts not only from the HTML documents themselves but also from the metadata associated with each document during the crawling and classification process described above. For example, such metadata may include, among other things, a history of the text in the vicinity of each link that was followed in order to reach the document being processed by the entity extraction engine 1004. This may allow for more complete information extraction. For example, with regard to a real estate listing, even if no single document has all of the information provided in it, the crawling system can accumulate the missing information along the path that was used to reach the current document, and such information may be stored as metadata associated with the current document. For example, with regard to the city and state in an address, the city and/or the state may be the name of a link that was used to get to a document having details about a property. However, the address in the document might only have the street address and might not mention a city or a state, since it is assumed that such information is known based on the link used to arrive at the document. By extracting or otherwise using metadata gathered during the crawling process, such omissions in the current document may be handled.
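A hypothetical sketch of such backfilling follows; the metadata layout and field names are assumptions made purely for illustration:

    # Sketch of backfilling entities from crawl-path metadata: if the
    # detail page omits the city/state, take them from the anchor text
    # of links followed to reach it. All names/values are hypothetical.
    def backfill_from_path(entities, path_metadata):
        merged = dict(entities)
        for field in ("city", "state"):
            if not merged.get(field):
                merged[field] = path_metadata.get(field)
        return merged

    page_entities = {"street": "123 Main St", "price": "$250,000", "city": None}
    path_meta = {"city": "Springfield", "state": "IL"}  # from link anchor text
    print(backfill_from_path(page_entities, path_meta))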
In order to facilitate extraction, the entity extraction engine 1004 could use techniques, such as standardization, to convert information within the HTML documents into a standard format. For example, a real estate listing may include different formats for describing bedrooms, such as bdrm, bd, br, brm, etc. The entity extraction engine 1004 recognizes these format differences and ensures that the correct data is extracted from the HTML document. The entity extraction engine 1004 could be supplied or supported by products such as those developed by ClearForest Corporation and Basis Technology Corporation.
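A standardization step of this kind might look like the following sketch, where the pattern list merely covers the bedroom variants named above and is not the actual rule set of any particular extraction engine:

    # Sketch of standardization: normalize bedroom variants to one
    # canonical token before extraction; the pattern is illustrative.
    import re

    BEDROOM_RE = re.compile(r"\b(bedrooms?|bdrm|brm|bd|br)\b\.?", re.IGNORECASE)

    def standardize(text):
        return BEDROOM_RE.sub("bedroom", text)

    print(standardize("3 bd, 2 ba ranch"))      # -> "3 bedroom, 2 ba ranch"
    print(standardize("Spacious 4 BDRM home"))  # -> "Spacious 4 bedroom home"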
However, rules that facilitate the integration of the data from the crawler with the information extraction process 1000 are necessary. Rules are utilized to define the entities of interest contained in the HTML documents. The rules may be pattern rules and/or Gazetteer rules: pattern rules correspond to a particular entity pattern, while Gazetteer rules correspond to a particular geographical aspect of an entity. With reference to
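The distinction between the two rule types can be sketched as follows; both the price pattern and the city list are hypothetical examples, not rules from any actual deployment:

    # Sketch contrasting the two rule types: a pattern rule matches an
    # entity's shape (here, a price), while a Gazetteer rule matches
    # against a list of known geographic names.
    import re

    PRICE_RULE = re.compile(r"\$\d{1,3}(?:,\d{3})*")      # pattern rule
    CITY_GAZETTEER = {"Springfield", "Westlake Village"}  # Gazetteer rule

    def apply_rules(text):
        prices = PRICE_RULE.findall(text)
        cities = [c for c in CITY_GAZETTEER if c in text]
        return {"price": prices, "city": cities}

    print(apply_rules("New home in Springfield priced at $324,900"))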
The HTML structure analyzer module 1006 is employed to convert an input HTML document into HTML structural information (block 1018), which typically comprises an HTML tree. Depending on the specific entity extraction engine 1004 used, the HTML structure analyzer module 1006 may be located before or after the entity extraction engine in the process flow path, because some entity extractors require the raw HTML to be transformed prior to extraction. The HTML tree contains HTML nodes that have information required for further processing by the data analyzer module 1008. An HTML node corresponds to an HTML tag in the document, and only tags defined in the desired HTML node set are converted. In particular, each HTML node contains summarized information about an HTML tag. For example,
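As a purely illustrative sketch of such node summarization, the summary fields and the NODE_TAGS set below are assumptions, and BeautifulSoup again stands in for an actual HTML parser:

    # Sketch of summarizing an HTML tree into nodes for a desired tag
    # set, in the spirit of module 1006; fields are assumptions.
    from bs4 import BeautifulSoup

    NODE_TAGS = ["table", "tr", "td", "div", "font"]

    def summarize(tag, depth=0):
        node = {"tag": tag.name, "depth": depth,
                "text": tag.get_text(" ", strip=True)[:40]}
        children = [summarize(c, depth + 1)
                    for c in tag.find_all(NODE_TAGS, recursive=False)]
        return node, children

    html = "<table><tr><td>3 bedroom</td><td>$324,900</td></tr></table>"
    print(summarize(BeautifulSoup(html, "html.parser").table))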
The data analyzer module 1008 is used to analyze the data (i.e., entity information 1014) provided by the entity extraction engine 1004 and the HTML structure information (block 1018) from the HTML structure analyzer module 1006 to identify specific information of interest. For example, after the information extraction process pulls desired information out of various HTML documents, the extracted pieces of information may be scattered across an entire document and not yet associated with one another. An information grouper can combine the individual pieces of information together based on grouping rules and their associated HTML document features so that the information can be more representative and informative. For example, if the entity extraction process locates two price entities, two address entities, and two bedroom entities on a real estate listing page, the entity extraction process alone is unable to determine how the prices, addresses, and bedrooms are associated with each other. Thus, using an intelligent grouper, sets of listing information, such as price, address, amenities, and bedrooms, can be identified. Each listing includes details of a plurality of grouped entities, which provides more useful information than the entities alone. After the information grouper combines all the entities into desired groups, the groups are output in a desired format, for example, XML files, that can be used by various processes, such as search engines.
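The grouping idea can be sketched with the simplest order-based pairing; an actual grouper would also use the HTML-tree positions mentioned above, and all values here are hypothetical:

    # Sketch of grouping: pair up the k-th price, address, and bedroom
    # entities found on a page into the k-th listing.
    def group_entities(prices, addresses, bedrooms):
        return [{"price": p, "address": a, "bedrooms": b}
                for p, a, b in zip(prices, addresses, bedrooms)]

    listings = group_entities(
        ["$324,900", "$289,000"],
        ["123 Main St", "456 Oak Ave"],
        ["3", "4"],
    )
    print(listings)  # two grouped listings instead of six loose entities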
Moreover, the data analyzer module 1008 could analyze various images associated with each HTML document. For example, with respect to the real estate industry, a real estate detail listing page could contain photos associated with the real estate property, as well as many other images such as banners, icons, logos, etc. Given the attributes of the image (e.g., type of file or image size) and the position of the image relative to other real estate entities, desired real estate images can be obtained. Thus, images relating to a particular real estate property can be readily identified so that the images associated with the property listing will be displayed to the user.
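One hypothetical filter over image attributes is sketched below; the size thresholds and attribute layout are assumptions for illustration only:

    # Sketch of separating listing photos from page furniture using
    # the image attributes mentioned above; thresholds are assumed.
    MIN_W, MIN_H = 200, 150  # banners, icons, and logos tend to be smaller

    def listing_photos(images):
        return [img for img in images
                if img["w"] >= MIN_W and img["h"] >= MIN_H
                and img["type"] in ("jpeg", "png")]

    imgs = [{"url": "logo.gif", "w": 88, "h": 31, "type": "gif"},
            {"url": "house1.jpg", "w": 640, "h": 480, "type": "jpeg"}]
    print(listing_photos(imgs))  # keeps only house1.jpg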
The information output and organized by the data analyzer module 1008 is provided to the data export module 1010. The data export module 1010 is responsible for exporting the data in formats suitable for indexing and further processing (block 1022). Namely, the data export module 1010 outputs the extracted data to a data aggregation module 1024, where data from several different sources may be consolidated into one or more datastores. For instance, the data may be output in a ready-to-use XML format, such that the data may be used by many different applications, such as Windows®-based applications, databases, search engines, websites, etc.
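Export of a grouped listing could be sketched with the Python standard library's ElementTree; the element names are hypothetical:

    # Sketch of exporting one grouped listing as XML so downstream
    # indexing/aggregation can consume it; element names are assumed.
    import xml.etree.ElementTree as ET

    def listing_to_xml(listing):
        root = ET.Element("listing")
        for key, value in listing.items():
            ET.SubElement(root, key).text = str(value)
        return ET.tostring(root, encoding="unicode")

    print(listing_to_xml({"price": "$324,900", "address": "123 Main St",
                          "bedrooms": 3}))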
According to one aspect of the present invention, the system generally operates under control of a computer program product. The computer program product for performing the methods of embodiments of the present invention includes a computer-readable storage medium, such as the memory device associated with a processing element, and computer-readable program code portions, such as a series of computer instructions, embodied in the computer-readable storage medium.
In this regard,
Accordingly, blocks or steps of the control flow diagrams support combinations of means for performing the specified functions, combinations of steps for performing the specified functions, and program instruction means for performing the specified functions. It will also be understood that each block or step of the control flow diagrams, and combinations of blocks or steps in the control flow diagrams, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
This application claims the benefit of U.S. Provisional Patent Application No. 60/829,453 filed Oct. 13, 2006, the contents of which are incorporated herein by reference in their entirety.