1. Field of the Invention
The present invention relates generally to document search, and more particularly to improving ranking results and retrieval effectiveness by enriching document representations.
2. Description of Related Art
One of the most distinctive characteristics of the web is its dynamic, human-generated hypertext structure. The web has allowed millions of everyday users to publish their own content. Most web pages contain one or more hyperlinks that point to other pages. These hyperlinks, referred to as anchors, may consist of a destination URL and a short piece of text. The short piece of text, which is called anchor text, typically provides a description of the destination URL. For example, the anchor text associated with hyperlinks to the page http://www.acm.org/sigir may include “sigir,” “acm sigir,” and “information retrieval.”
Anchor text is useful because it is similar in nature to queries. In the ACM SIGIR homepage example above, it is easy to see that the anchor text “sigir,” “acm sigir,” and “information retrieval” are reasonable queries that users may enter when they are searching for the page.
However, anchor text sparsity prevents anchor text from being used effectively in Internet search. Currently, many useful pages have very little, or no, anchor text. Therefore, it may be desirable to provide a system and method which may overcome the anchor text sparsity problem by enriching document representations with aggregated anchor text, especially for those documents that have little or no anchor text to begin with.
Embodiments of the present invention are described herein with reference to the accompanying drawings, similar reference numbers being used to indicate functionally similar elements.
The present invention provides a system and method for enriching document representations by augmenting documents with auxiliary anchor text that is derived by aggregating, or propagating, anchor text over the web graph. The invention may be carried out by computer-executable instructions, such as program modules. Advantages of the present invention will become apparent from the following detailed description.
The user terminal 102-1, 102-2, . . . or 102-n may be a desktop computer, a laptop computer, a personal digital assistant (PDA), a smartphone, a set top box or any other electronic device having access to the network 104. A user terminal may have a CPU, a memory, a user interface, an interface to the computer network 104, and a display. The user terminal may also have a browser application configured to receive, display and publish web pages, which may include text, graphics, multimedia, etc. The web pages may be based on, e.g., HyperText Markup Language (HTML) or Extensible Markup Language (XML). A user may include hyperlinks in a page when publishing it.
The Internet server 103-1, 103-2, . . . or 103-n may be a computer system, running a website or a blog. The website may have a number of web pages, and a web page may have a hyperlink pointing to another page within the site or outside of the site.
The network 104 may be, e.g., the Internet. Network connectivity may be wired or wireless, using one or more communications protocols, as will be known to those of ordinary skill in the art.
The search server 101 may be a computer system and may include a central processing unit (CPU) 1011 and a memory 1012, which communicate with each other and other parts in the computer system via a bus 1015. Alternatively, the search server 101 may include multiple computer systems each configured to accomplish certain tasks and coordinate with other computer systems to perform the method of the present invention.
The CPU 1011 may execute computer software modules stored in the memory 1012 to carry out a number of processes, including but not limited to the one described below with reference to
In one example, the CPU 1011 may execute a search module 1014 stored in the memory 1012 to receive a query over the network 104, identify web pages relevant to the query by searching documents enriched with the aggregated anchor text, calculate estimates of relevance of the web pages using combined weight for each line of anchor text, rank the web pages based on their estimates of relevance, and generate a search result page with the web pages being displayed as a list of search results.
The database 105 may store anchor text information of web pages, which may include, e.g., their URLs, inlinks, anchor text lines and possibly weights for the anchor text lines. A table stored in the database 105 will be described below, with reference to
A page 202 (URL: http://alldancing.com/swingdnaces.html) outside the site 200 may have a link 203 pointing to the target page 201. The anchor text of the link 203 may be, e.g., “swing dancing.” A page 204 (URL: http://dancesite.com/swing.html) outside the site 200 may have a link 205 pointing to the page 201. The anchor text of the link 205 may be, e.g., “Lindy hop.” The weights for links 203 and 205 may be, e.g., 3 and 5 respectively.
A page 206 (URL: http://dancing.com/ballrooms.html) may be within the site 200 containing the target page 201, and may have a link 207 pointing to the target page 201. The anchor text of the link 207 may be, e.g., “Lindy Hop.” A page 208 (URL: http://dancing.com/newyork.html) may be within the site 200, and may have a link 209 pointing to the target page 201. The anchor text of the link 209 may be, e.g., “Lindy Hop.” Links 207 and 209 may be called internal inlinks, since they come from within the same site containing the target page 201.
A page 210 (URL: http://ballrooms.com/savoy.html) may be outside the site 200, and may have a link 211 pointing to the page 206. The anchor text of the link 211 may be, e.g., “Savoy Ballroom.” A page 212 (URL: http://ballrooms.com) may be outside the site 200, and may have a link 213 pointing to the page 206. The anchor text of the link 213 may be, e.g., “Savoy Ballroom.” The weights for links 211 and 213 may be, e.g., 1 and 5 respectively. The anchor text for links 211 and 213 may be called external anchor text, since it originates from pages outside of the site 200.
A page 214 (URL: http://nyc.com/culture.html) may be outside the site 200, and may have a link 215 pointing to the page 208. The external anchor text of the link 215 may be, e.g., “Lindy hop.” A page 216 (URL: http://traveling.com/dances.html) may be outside the site 200, and may have a link 217 pointing to the page 208. The external anchor text of the link 217 may be, e.g., “dances in New York.” The weights for links 215 and 217 may be, e.g., 1 and 2 respectively.
In the web graph shown in
Since internal inlinks, e.g., 207 and 209, typically link related pages within a given site, and are typically created by the owner of the site, they may be authoritative, as opposed to links originating from external sites, which may not be as purposefully generated. In addition, external anchors, e.g., 211, 213, 215 and 217, are less likely to be navigational and are more likely to provide good descriptions of their destination. Because internal links connect related pages, the external anchor text of the internal links may be good descriptors, by semantic transitivity, of the target page 201. This is why the external anchor text of the internal inlinks is used as the source of auxiliary anchor text.
In one embodiment, the anchor text associated with the internal inlinks, e.g., 207 and 209, may not be used, if such anchor text is navigational in nature (e.g., “home”, “next page”, etc.).
At 301, for a given URL u, e.g., http://dancing.com/lindyhop.html for the target page 201, all pages P within the site (domain) 200 that link to u may be identified. As discussed above, these links are u's internal inlinks, since they come from within the same site 200. In the embodiment shown in
At 302, pages outside the site 200 that link to the pages P may be identified. The links from these pages are u's external anchors. In the embodiment shown in
At 303, all anchor text A of external anchors may be collected. As discussed above, such anchor text is known as external anchor text, because it originates from pages outside of the site 200 containing the target page 201. In the embodiment shown in
At 304, the external anchor text information may be stored in the database 105.
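By way of illustration only, steps 301 to 303 may be sketched in Python as follows. The URLs and anchor text are taken from the example web graph described above; the tuple layout of a link record, the helper names `site` and `collect_aggregated_anchor_text`, and the use of the host name to decide whether two pages belong to the same site are assumptions made for this sketch, not details given in the description.

```python
from urllib.parse import urlparse

# Hypothetical link records: (source URL, destination URL, anchor text).
# URLs and anchor text follow the example web graph; the layout is assumed.
LINKS = [
    ("http://alldancing.com/swingdnaces.html", "http://dancing.com/lindyhop.html", "swing dancing"),
    ("http://dancesite.com/swing.html", "http://dancing.com/lindyhop.html", "Lindy hop"),
    ("http://dancing.com/ballrooms.html", "http://dancing.com/lindyhop.html", "Lindy Hop"),
    ("http://dancing.com/newyork.html", "http://dancing.com/lindyhop.html", "Lindy Hop"),
    ("http://ballrooms.com/savoy.html", "http://dancing.com/ballrooms.html", "Savoy Ballroom"),
    ("http://ballrooms.com", "http://dancing.com/ballrooms.html", "Savoy Ballroom"),
    ("http://nyc.com/culture.html", "http://dancing.com/newyork.html", "Lindy hop"),
    ("http://traveling.com/dances.html", "http://dancing.com/newyork.html", "dances in New York"),
]


def site(url):
    """Treat the host name as the site (an assumption of this sketch)."""
    return urlparse(url).netloc


def collect_aggregated_anchor_text(u, links):
    """Return the external links whose anchor text is aggregated for URL u."""
    # Step 301: pages P within u's own site that link to u (internal inlinks).
    internal = {src for src, dst, _ in links if dst == u and site(src) == site(u)}
    # Steps 302 and 303: links into P from pages outside the site, with their anchor text.
    return [
        (src, dst, text)
        for src, dst, text in links
        if dst in internal and site(src) != site(u)
    ]


aggregated = collect_aggregated_anchor_text("http://dancing.com/lindyhop.html", LINKS)
for src, dst, text in aggregated:
    print(text, "<-", src)
```

In this sketch the anchor text of the internal inlinks themselves is never collected, which is consistent with the embodiment in which navigational internal anchor text is not used.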
A line of anchor text associated with a URL may have some weight assigned to it. As shown in
Since lines of anchor text may be aggregated from multiple sources, it is possible that the same line of aggregated anchor text may originate from multiple URLs, each with a potentially different weight. For example, the weight for “Savoy Ballroom” is 1 for the link 211 and 5 for the link 213. Since only one weight per distinct line of anchor text may be needed, the weights of lines originating from multiple sources may be combined in some way, at 305. In one embodiment, standard result set fusion techniques may be applied to combine the weights.
In one embodiment, the following weight aggregation functions may be used to weight the aggregated lines of anchor text:
where N(u) is the set of internal inlinks and wt(l,u′) is the original weight of anchor text line l for URL u′. If some line of aggregated anchor text originates from a single URL u′, then the aggregated weight will equal wt(l,u′) regardless of the aggregation function chosen. However, when a line originates from multiple URLs, each of the aggregation functions computes the weight differently.
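One plausible reading of functions (1) through (4), inferred only from their names and from the definitions above, is sketched below. The symbol N_l(u) is an assumption introduced here: it denotes the members of N(u) for which line l has an original weight, so that a line originating from a single URL u′ keeps the weight wt(l,u′) under every function. Functions (5) and (6) are not sketched, because their form cannot be inferred from the description.

```latex
\begin{align}
wt_{\mathrm{MIN}}(l,u)  &= \min_{u' \in N_l(u)} wt(l,u') \tag{1}\\
wt_{\mathrm{MAX}}(l,u)  &= \max_{u' \in N_l(u)} wt(l,u') \tag{2}\\
wt_{\mathrm{MEAN}}(l,u) &= \frac{1}{|N_l(u)|} \sum_{u' \in N_l(u)} wt(l,u') \tag{3}\\
wt_{\mathrm{SUM}}(l,u)  &= \sum_{u' \in N_l(u)} wt(l,u') \tag{4}
\end{align}
```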
In one embodiment, the MIN function (1) may be used to select the minimum weight from multiple different weights. Using the MIN function (1), the weights for the aggregated anchor text for the target page 201 may be:
In one embodiment, the weights of “Lindy hop,” including 1 for the link 215 and 5 for the link 205, may be considered as well.
In one embodiment, the MAX function (2) may be used to select the maximum weight from multiple different weights. Using the MAX function (2), the weights for the aggregated anchor text for the target page 201 may be:
In one embodiment, the MEAN function (3) may be used to calculate the mean value of multiple different weights. Using the MEAN function (3), the weights of the aggregated anchor text for the target page 201 may be:
In one embodiment, the SUM function (4) may be used to calculate the sum of multiple different weights. Using the SUM function (4), the weights for the aggregated anchor text for the target page 201 may be:
Similarly, functions (5) and (6) may be used to calculate the weights as well.
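As a hedged illustration, the MIN, MAX, MEAN and SUM functions may be applied to the example weights given above for links 211, 213, 215 and 217. The names `observations`, `AGGREGATORS` and `aggregate` are hypothetical, and the sketch combines one weight per link, following the “Savoy Ballroom” example; it does not cover functions (5) and (6).

```python
from collections import defaultdict
from statistics import mean

# (anchor text line, original weight) for links 211, 213, 215 and 217.
observations = [
    ("Savoy Ballroom", 1),
    ("Savoy Ballroom", 5),
    ("Lindy hop", 1),
    ("dances in New York", 2),
]

# Assumed mapping from function name to a Python reducer.
AGGREGATORS = {"MIN": min, "MAX": max, "MEAN": mean, "SUM": sum}


def aggregate(observations, how):
    """Combine the weights of identical anchor text lines into one weight per line."""
    grouped = defaultdict(list)
    for line, weight in observations:
        grouped[line].append(weight)
    combine = AGGREGATORS[how]
    return {line: combine(weights) for line, weights in grouped.items()}


for name in AGGREGATORS:
    print(name, aggregate(observations, name))
```

With these inputs, “Savoy Ballroom” receives a combined weight of 1, 5, 3 and 6 under MIN, MAX, MEAN and SUM respectively, while “Lindy hop” and “dances in New York” keep their single weights of 1 and 2 under every function.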
The original anchor text line weights (i.e., wt(l,u′)) may be computed differently for every search engine implementation. In one embodiment, original lines of anchor text may be weighted as follows:
where S(u) is the set of external sites that link to u, δ(l,u,s) is 1 if and only if anchor text l links to u from some page within site s, and |anchors(u,s)| is the total number of unique anchors originating from site s that link to u.
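One reading consistent with these definitions is that each external site s contributes δ(l,u,s) divided by |anchors(u,s)| to the weight of line l, and that the contributions are summed over S(u). The sketch below implements that reading only; the summed form, the interpretation of “unique anchors” as distinct anchor text lines, the record layout, and the site names in the usage lines are all assumptions.

```python
from collections import defaultdict


def original_weights(u, anchors):
    """Weight each anchor text line that links to URL u.

    anchors: iterable of (external site, destination URL, anchor text line).
    """
    # Distinct anchor text lines per external site that link to u.
    per_site = defaultdict(set)
    for s, dst, line in anchors:
        if dst == u:
            per_site[s].add(line)

    weights = defaultdict(float)
    for s, lines in per_site.items():
        for line in lines:
            # delta(l, u, s) is 1 for these lines; site s contributes 1/|anchors(u, s)|.
            weights[line] += 1.0 / len(lines)
    return dict(weights)


# Hypothetical input: two external sites linking to the same page.
example = [
    ("a.example", "http://dancing.com/lindyhop.html", "Lindy hop"),
    ("a.example", "http://dancing.com/lindyhop.html", "swing dancing"),
    ("b.example", "http://dancing.com/lindyhop.html", "Lindy hop"),
]
weights = original_weights("http://dancing.com/lindyhop.html", example)
print(weights)
```

Under this reading, a site that links with many different lines spreads its single unit of weight across them, so no one site can dominate the weight of a line.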
Thus, the input to the method may be a URL u of the target page 201, and the output may be a weighted set of aggregated anchor text lines. This may be achieved in two steps. First, the aggregated anchor text lines may be collected by 301 to 303. Then, the lines may be combined and weighted to produce the final result at 305.
The aggregated anchor text collected and weighted may be used in various ways to build enriched document representations. Aggregated anchor text-enriched document representations may be useful for various information retrieval and natural language processing tasks including, e.g., web search, content match, text classification, and summarization. The best representation will depend on the task. Four possible representations will be discussed below:
The first representation is the flat representation. As shown in
The second representation is the combined representation, which may preserve the document structure, and augment the original anchor text lines in the field 503 with the aggregated anchor text lines. The aggregated anchor text weights may also be used here, as long as the search engine's indexing architecture supports it.
One issue with the combined representation is that the original and aggregated anchor text lines may overlap, such as “Lindy hop” for the link 215 in the aggregated anchor text and “Lindy hop” for the link 205 in the original anchor text compiled by conventional systems. The aggregated anchor text lines may also add noise to a set of high quality original anchor text lines. To overcome this issue, the third representation, the backoff representation, may add aggregated anchor text only to documents that do not originally have any anchor text lines associated with them.
The fourth representation is a new field representation which adds the aggregated anchor text as a completely new field to every document, as shown in
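The four representations may be contrasted with a minimal sketch. The document layout (a dict with "title", "body" and "anchor_text" fields), the function name `enrich`, and the mode names are hypothetical. The flat branch in particular is an assumed reading, in which document structure is dropped and all text is pooled into one field.

```python
def enrich(doc, aggregated, mode):
    """Return an enriched copy of doc; the input document is left unchanged.

    doc: dict with "title", "body" and "anchor_text" (list of lines); names assumed.
    aggregated: list of aggregated anchor text lines.
    """
    if mode == "flat":
        # Assumed reading: pool every piece of text into a single unstructured field.
        parts = [doc["title"], doc["body"], *doc["anchor_text"], *aggregated]
        return {"text": " ".join(parts)}

    out = dict(doc, anchor_text=list(doc["anchor_text"]))
    if mode == "combined":
        # Keep the structure and append to the original anchor text field.
        out["anchor_text"] += aggregated
    elif mode == "backoff":
        # Add aggregated lines only when the document has no anchor text of its own.
        if not out["anchor_text"]:
            out["anchor_text"] = list(aggregated)
    elif mode == "new_field":
        # Keep original anchor text untouched and store aggregated lines separately.
        out["aggregated_anchor_text"] = list(aggregated)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


doc = {"title": "Lindy Hop", "body": "History of the dance.", "anchor_text": ["swing dancing"]}
print(enrich(doc, ["Savoy Ballroom"], "combined"))
```

Keeping the aggregated lines in a separate field, as in the last branch, lets a ranking function weight original and aggregated anchor text differently, at the cost of one more field to index.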
The enriched document representations result in significant improvements in retrieval effectiveness on a very large web test collection. During one evaluation, the method of the invention not only reduced the number of pages with no anchor text by 38%, but also added, on average, 34 lines of anchor text to every URL.
At 601, a search query may be received from a user terminal, e.g., 102-1, over the network 104.
At 602, the search server 101 may search document representations of web pages, which are enriched with the aggregated anchor text, to identify web pages relevant to the query.
At 603, the search server 101 may calculate estimates of relevance of the web pages, using the combined weight for each line of anchor text.
At 604, the search server 101 may rank the web pages based on their estimates of relevance.
At 605, the search server 101 may generate a search result page, with the web pages being displayed as a list of search results.
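Steps 602 to 604 may be sketched as follows, purely as an illustration. The description does not specify how the estimate of relevance is computed from the combined weights; the scoring rule used here (summing the combined weight of every anchor text line that shares a term with the query) is an assumption, as are the function name `search`, the index layout, and the short document keys.

```python
def search(query, documents):
    """Rank documents for a query using combined anchor text line weights.

    documents: {url: {anchor text line: combined weight}} (layout assumed).
    """
    terms = set(query.lower().split())
    scores = {}
    for url, lines in documents.items():
        # Step 603 (assumed scoring): add the weight of each line sharing a query term.
        score = sum(
            weight
            for line, weight in lines.items()
            if terms & set(line.lower().split())
        )
        # Step 602: only pages matched by the query are kept.
        if score > 0:
            scores[url] = score
    # Step 604: rank by estimated relevance, highest first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical index of three enriched documents.
index = {
    "a": {"Savoy Ballroom": 3, "Lindy hop": 1},
    "b": {"Lindy hop": 5},
    "c": {"tango": 2},
}
print(search("lindy hop", index))
```

The returned list of URLs corresponds to the order in which results would be laid out on the search result page at 605.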
Several features and aspects of the present invention have been illustrated and described in detail with reference to particular embodiments by way of example only, and not by way of limitation. For example, the aggregated anchor text may be collected and weighted in many different ways beyond the approaches described here. Also, in addition to web search, the enriched document representations may be used in a number of other ways, including estimating improved document models, developing advanced textual matching features, and even improving the quality of document classification algorithms.
Those of skill in the art will appreciate that alternative implementations and various modifications to the disclosed embodiments are within the scope and contemplation of the present disclosure. Therefore, it is intended that the invention be considered as limited only by the scope of the appended claims.