Ranking search results using feature extraction

Information

  • Patent Grant
  • 7716198
  • Patent Number
    7,716,198
  • Date Filed
    Tuesday, December 21, 2004
  • Date Issued
    Tuesday, May 11, 2010
Abstract
Methods and computer-readable media are provided for ranking search results using feature extraction data. Each of the results of a search engine query is parsed to obtain data, such as text, formatting information, metadata, and the like. The text, the formatting information and the metadata are passed through a feature extraction application to extract data that may be used to improve a ranking of the search results based on relevance of the search results to the search engine query. The feature extraction application extracts features, such as titles, found in any of the text based on formatting information applied to or associated with the text. The extracted titles, the text, the formatting information and the metadata for any given search results item are processed according to a field weighting application for determining a ranking of the given search results item. Ranked search results items may then be displayed according to ranking.
Description
FIELD OF THE INVENTION

The present invention generally relates to ranking search results according to relevance to a search request. More particularly, the present invention relates to ranking search results using feature extraction.


BACKGROUND OF THE INVENTION

In the modern computing world, users routinely enter search requests into a variety of search engines for receiving help functionality, research materials, documents related to a given task, and the like. A well-known use of search engines in modern times is obtaining one or more Uniform Resource Locators (URLs) associated with Internet-based information. For example, a user may use a search engine to search the Internet for all topics related to a given history topic or other research topic.


In response to such searches, hundreds or even thousands of search results, including URLs, documents or other resources, may be located across vast arrays of information sources that are responsive to the user's search request. Efforts have been made for ranking the search results and providing the search results to the user in the order of relevance to the given search request. Prior methods have attempted to obtain various properties from located resources and for using those properties to determine a ranking of the relevance of individual resources to a user's request. Often, metadata associated with various resources, for example, documents and URLs, is incorrect or misleading. Incorrect data used by a ranking system leads to a poor ranking of the search results. Consequently, a user may be forced to review a number of irrelevant resources before the user comes to more relevant resources located during the search.


Accordingly, there is a need for an improved method for ranking search results using a property extraction feature. It is with respect to these and other considerations that the present invention has been made.


SUMMARY OF THE INVENTION

Embodiments of the present invention solve the above and other problems by providing a method of ranking search results using feature extraction data. According to an embodiment of the present invention, upon receipt of a search request by a search engine, a database of information, such as Internet-based information sites, is searched for information responsive to the search request. Each located document or other resource is parsed to obtain data, such as text, formatting information, metadata, and the like. In accordance with an embodiment of the present invention, text and data from the located documents or resources are passed through a feature extraction application to extract properties of the text or data that may be used to improve a ranking of the search results based on relevance of the search results to the user's request.


According to a particular embodiment, while a search engine may obtain certain types of information including document or resource titles from defined sources such as metadata, the feature extraction application extracts titles of each located document or resource from document or resource content based on formatting applied to the document or resource. Text, data, and extracted properties, such as resource titles and associated statistical information, are indexed. A ranking application is then run on the indexed text, data, and extracted properties for determining a ranking value for each located resource based on its relevance to the search request. The ranking application uses the extracted properties, such as titles, to augment other data passed to the ranking application, including indexed text, data from an associated document or resource. Ranked search results may then be displayed to the user.


These and other features and advantages, which characterize the present invention, will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are exemplary and are explanatory only and are not restrictive of the invention as claimed.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram showing the architecture of a personal computer that provides an illustrative operating environment for embodiments of the present invention.



FIG. 2 illustrates a computer screen display showing an Internet browsing application user interface for displaying search results ranked according to relevance to a user's search request.



FIG. 3 is a simplified block diagram illustrating a system architecture for ranking search results using a feature extraction application according to embodiments of the present invention.



FIG. 4 is a flow diagram illustrating steps performed by a method and system of the present invention for ranking search results using feature extraction according to embodiments of the present invention.





DETAILED DESCRIPTION

As briefly described above, embodiments of the present invention are directed to methods and computer-readable media for ranking search results using feature extraction data. Each of the results of a search engine query is parsed to obtain data, such as text, formatting information, metadata, and the like. The text, the formatting information and the metadata are passed through a feature extraction application to extract data that may be used to improve a ranking of the search results based on relevance of the search results to the search engine query. According to one embodiment, the feature extraction application extracts titles from text and data contained in located documents or resources based on formatting properties applied to or associated with the document or resource. The extracted titles, the text, the formatting information and the metadata for any given search results item are processed according to a ranking algorithm application for determining a ranking of the given search results item. Ranked search results items may then be displayed according to ranking. These embodiments may be combined, other embodiments may be utilized, and structural changes may be made without departing from the spirit or scope of the present invention. The following detailed description is therefore not to be taken in a limiting sense and the scope of the present invention is defined by the appended claims and their equivalents.


Referring now to the drawings, in which like numerals refer to like elements through the several figures, aspects of the present invention and an exemplary operating environment will be described. FIG. 1 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. While the invention will be described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a personal computer, those skilled in the art will recognize that the invention may also be implemented in combination with other program modules.


Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.


Turning now to FIG. 1, an illustrative architecture for a personal computer 2 for practicing the various embodiments of the invention will be described. The computer architecture shown in FIG. 1 illustrates a conventional personal computer, including a central processing unit 4 (“CPU”), a system memory 6, including a random access memory 8 (“RAM”) and a read-only memory (“ROM”) 10, and a system bus 12 that couples the memory to the CPU 4. A basic input/output system containing the basic routines that help to transfer information between elements within the computer, such as during startup, is stored in the ROM 10. The personal computer 2 further includes a mass storage device 14 for storing an operating system 16, application programs, such as the application program 105, and data.


The mass storage device 14 is connected to the CPU 4 through a mass storage controller (not shown) connected to the bus 12. The mass storage device 14 and its associated computer-readable media, provide non-volatile storage for the personal computer 2. Although the description of computer-readable media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, it should be appreciated by those skilled in the art that computer-readable media can be any available media that can be accessed by the personal computer 2.


By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.


According to various embodiments of the invention, the personal computer 2 may operate in a networked environment using logical connections to remote computers through a TCP/IP network 18, such as the Internet. The personal computer 2 may connect to the TCP/IP network 18 through a network interface unit 20 connected to the bus 12. It should be appreciated that the network interface unit 20 may also be utilized to connect to other types of networks and remote computer systems. The personal computer 2 may also include an input/output controller 22 for receiving and processing input from a number of devices, including a keyboard or mouse (not shown). Similarly, an input/output controller 22 may provide output to a display screen, a printer, or other type of output device.


As mentioned briefly above, a number of program modules and data files may be stored in the mass storage device 14 and RAM 8 of the personal computer 2, including an operating system 16 suitable for controlling the operation of a networked personal computer, such as the WINDOWS operating systems from Microsoft Corporation of Redmond, Wash. The mass storage device 14 and RAM 8 may also store one or more application programs. In particular, the mass storage device 14 and RAM 8 may store an application program 105 for providing a variety of functionalities to a user. For instance, the application program 105 may comprise many types of programs such as a word processing application, a spreadsheet application, a desktop publishing application, and the like. According to an embodiment of the present invention, the application program 105 comprises a multiple functionality software application suite for providing functionality from a number of different software applications. Some of the individual program modules that may comprise the application suite 105 include a word processing application 125, a slide presentation application 135, a spreadsheet application 140 and a database application 145. An example of such a multiple functionality application suite 105 is OFFICE manufactured by Microsoft Corporation. Other software applications illustrated in FIG. 1 include an Extensible Markup Language (XML) parser 130.


Referring now to FIG. 2, an illustrative Internet browser user interface 200 is illustrated. According to the example user interface 200, a user has entered a search request of “history of computers.” Once the entered search request is submitted to a search engine selected by the user, the search engine searches available Internet-based resources for documents, URLs, or other resources responsive to the user's search request. According to embodiments of the present invention, the located documents, URLs, or other resources, are ranked according to their relevance to the user's entered search request. For example, as illustrated in FIG. 2, responses to the user's search request are ranked in order of 1 to 5 where the first displayed result is ranked as the most relevant search result responsive to the user's request. For example, the first ranked search result includes a title of an article “Computers—A Brief History.” As illustrated in FIG. 2, the least relevant search result includes an example article or advertisement entitled “Computers For Sale.”


As should be understood by those skilled in the art, the example search request and search results illustrated in FIG. 2 are for purposes of example only and are not restrictive of the invention as claimed. Similarly, as should be understood, embodiments of the present invention may be directed to search engines for searching a variety of information resources other than Internet-based information sources. For example, a search engine may be utilized by a company or other organization for searching documents, articles, and the like, located on a company information store or database.



FIG. 3 is a simplified block diagram illustrating a system architecture for ranking search results using a feature extraction application according to embodiments of the present invention. A search pool 305 is illustrative of an information store, database, or other location, which may be searched by a search engine for search results responsive to a user-entered search request or query. As should be understood by those skilled in the art, a typical search engine parses a user's entered search request and utilizes the terms and phrases parsed from a user's search request for searching various information stores and databases for information responsive to the search request. As understood by those skilled in the art, a search engine builds an index of information contained in a given document or resource, as described below. Then, in response to a query, the contents of the index for each associated document or resource are searched by the search engine and are ranked for display, as described herein.


Documents, URLs, and other resources searched by a search engine contain various information including text, data, and metadata. Metadata for a given resource may include such items as a title, an author's name, a date of resource creation, or information about an organization responsible for the resource. Utilizing terms and phrases parsed from a search request, a search engine locates matching terms or phrases in the text, data, or metadata of documents, resources, or URLs available in searched information sources or databases. A parser application 310 is illustrated for parsing each of a plurality of search results items into one or more portions of data and associated formatting information applied to the one or more portions of data, as well as for parsing the results of the search engine request into pieces of content and metadata. For example, parser application 310 may be in the form of a web crawler application and may parse an Internet-based resource into individual content pieces and metadata. For example, if the metadata of the resource includes a title, an author's name, identification information about the resource, and the like, the parser application 310 may parse that information into separate pieces of data or content. Content from other portions of a given resource may also be parsed by the web crawler application. For example, properties of a document may be obtained from parsing a storage file associated with a located search item, for example, a Hypertext Markup Language (HTML) header or body section. For a detailed discussion of the operation of a web crawling application, described herein, see U.S. patent application Ser. No. 10/609,315, filed Jun. 27, 2003, entitled “Normalizing Document Metadata Using Directory Services,” which is incorporated herein by reference as if fully set out herein and U.S. patent application Ser. No. 09/493,748, filed Jan. 28, 2000, entitled “Adaptive Web Crawling Using A Statistical Model,” which is incorporated herein by reference as if fully set out herein.
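
As a rough illustration of the parsing step, the following Python sketch splits an HTML resource into metadata and into text fragments tagged with the formatting applied to them. It uses only the standard-library html.parser module. The class name ResourceParser, the function parse_resource, and the choice of which tags count as formatting are assumptions made for this example and do not come from the description above.

```python
from html.parser import HTMLParser


class ResourceParser(HTMLParser):
    """Hypothetical parser: collects metadata and formatted text fragments."""

    # Assumed set of tags treated as formatting information.
    FORMAT_TAGS = {"b", "strong", "u", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.metadata = {}    # e.g. {"title": ..., "author": ...}
        self.fragments = []   # list of (text, frozenset of active format tags)
        self._open = []       # stack of currently open formatting tags
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.metadata[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag in self.FORMAT_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            return
        # Close the most recently opened matching formatting tag, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == tag:
                del self._open[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.metadata["title"] = text
        else:
            self.fragments.append((text, frozenset(self._open)))


def parse_resource(html):
    """Return (metadata, fragments) for one HTML resource."""
    parser = ResourceParser()
    parser.feed(html)
    parser.close()
    return parser.metadata, parser.fragments
```

Keeping the formatting tags alongside each text fragment preserves the information that a later title-extraction step would need, while the metadata dictionary stays separate so that it can be analyzed on its own.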


In addition, other text and data in a given resource may be parsed into individual terms or phrases by the parser application. As should be understood by those skilled in the art, for other types of search engine results, other types of parsing applications may be utilized. For example, a document parser, for example an XML parser 130, may be utilized for parsing documents from a company or entity document storage site for individual text items, phrases, and metadata in the documents that may be useful for ranking search results obtained by the search engine.


Often, metadata of individual documents or other resources can be incorrect, missing or misleading. For example, information may be located in the metadata of a given Internet-based resource that is indicative of a title of a document, but the information may be incorrect. In addition, text or data located in the general content of the document may be ambiguous, misleading, or otherwise difficult to parse and to understand in terms of its relevance to a given search request. Referring still to FIG. 3, once content and metadata are parsed from located documents, resources, or URLs, the parsed content and metadata are passed through a filter application 330 for passing the parsed content and metadata to one or more analysis plug-ins or applications for characterizing the data and for passing the data to an index center 370. For example, metadata taken from a metadata location of an Internet-based resource may be passed to a metadata analysis application 360 for comparison against known metadata items for a determination of the applicability of the metadata for use in ranking associated search results.
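
One possible reading of the metadata analysis step is sketched below in Python: parsed metadata values are compared against a list of known values, and values judged uninformative are dropped before ranking. The description does not say what the known metadata items contain, so the KNOWN_UNINFORMATIVE table, its entries, and the function name applicable_metadata are all assumptions for illustration only.

```python
# Assumed table of known metadata values that carry no ranking signal.
KNOWN_UNINFORMATIVE = {
    "title": {"untitled", "document1", "new page", "home"},
    "author": {"unknown", "admin", "user"},
}


def applicable_metadata(metadata, known_bad=KNOWN_UNINFORMATIVE):
    """Keep only metadata values that are non-empty and not known to be uninformative."""
    usable = {}
    for name, value in metadata.items():
        cleaned = value.strip()
        if not cleaned:
            continue  # missing metadata
        if cleaned.lower() in known_bad.get(name, set()):
            continue  # matches a known uninformative value
        usable[name] = cleaned
    return usable
```

With this sketch, a resource whose metadata title is "Untitled" contributes no metadata title to the index, which is the kind of case where a title extracted from the content becomes useful.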


Other content, for example, text and data, may be passed to a content analysis application 340 for similar analysis. For example, a content analysis application 340 may be utilized for characterizing certain text or data for eventual use in ranking associated search results. For example, a content analysis application 340 may compare words parsed from the text of a document or resource against a database of known words that have been previously characterized. For example, referring to the example search results contained in FIG. 2, the phrase “For Sale,” illustrated in the fifth search result may be compared against a database of terms, and a determination may be made that the term is relevant to advertising materials, for example.
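
The comparison against a database of characterized words can be pictured with a minimal Python sketch. The KNOWN_TERMS dictionary stands in for that database, and its entries, its labels, and the function name characterize are hypothetical. Simple substring matching is used here purely for brevity.

```python
# Assumed stand-in for a database of previously characterized terms.
KNOWN_TERMS = {
    "for sale": "advertising",
    "buy now": "advertising",
    "history": "reference",
}


def characterize(text, known_terms=KNOWN_TERMS):
    """Return the sorted labels of all known terms found in the text."""
    lowered = text.lower()
    return sorted({label for phrase, label in known_terms.items() if phrase in lowered})
```

Applied to the fifth example result, characterize("Computers For Sale") yields the advertising label, which a ranking step could then take into account.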


According to an embodiment of the present invention, content and metadata parsed from located documents, resources, and URLs are passed to a feature extraction application for extracting specified information from the located resources. According to a particular embodiment of the present invention, the feature extraction application 350 is utilized for extracting titles from located resources, documents, or URLs. As should be understood by those skilled in the art, a title of a given document, resource or URL, is a very valuable piece of information for ranking a given search result for its relevance to a search request. For example, referring to FIG. 2, the first ranked resource bears a title of “Computers—A Brief History.” As can be understood, such a title may have a strong relevance to the search request of “history of computers” entered by the user. Accordingly, use of extracted titles from located resources, documents and URLs is of high value to a ranking system for determining a ranking of search results relative to an associated search request or query.


According to embodiments of the present invention, an exemplary feature extraction application for extracting titles from located resources parses data contained in the resource for determining which text or data is associated with a title. For example, formatting information applied to certain text or data in a document, the location of certain text or data in a document, spacing between certain text or data in a document, and the like, may be used by a feature extraction application for extracting a title from a given resource. For example, historical data may indicate that a phrase located at the top of a document or resource, set apart from other content of the document or resource and formatted according to certain formatting properties, for example, bold formatting, underlining, and the like, is a title. For another example, words or phrases located after certain introductory words, such as “Re:,” “Subject,” “Title,” and the like, may indicate that a word or set of words or data immediately following such terms or phrases is a title. According to embodiments of the present invention, such formatting information (font size, style, alignment, line numbers, etc.) is passed to a classification algorithm of the feature extraction application that attempts to classify each fragment of text or data passed through the feature extraction application as a potential beginning and ending of a particular feature such as a title. Once a beginning point and an ending point are determined as including a given feature, such as a title, a weighting may be applied to the determined text or data to further identify the determined text or data as a particular feature, such as a title. Once a given text or data fragment is determined as a particular feature, that text or data fragment may be indexed for ranking, as described below.
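
The formatting cues described above can be illustrated with a small hand-weighted heuristic in Python. This is not the classification algorithm of the feature extraction application, whose details are not given here. It is only a sketch that assumes each fragment arrives as a (text, formats) pair in document order. The score values, the threshold, and the names title_score and extract_title are invented for the example.

```python
import re

# Introductory words that may precede a title ("Re:", "Subject:", "Title:").
PREFIX = re.compile(r"^(re|subject|title)\s*:\s*", re.IGNORECASE)


def title_score(text, formats, position):
    """Score one fragment as a title candidate using assumed weights."""
    score = 0.0
    if "b" in formats or "strong" in formats:
        score += 1.0          # bold formatting
    if "u" in formats:
        score += 1.0          # underlining
    if position == 0:
        score += 1.0          # located at the top of the document
    if PREFIX.match(text):
        score += 2.0          # follows an introductory word
    if len(text.split()) > 15:
        score -= 1.0          # long passages are unlikely to be titles
    return score


def extract_title(fragments, threshold=2.0):
    """Return the best-scoring fragment as the extracted title, or None."""
    scored = [(title_score(text, formats, i), i, text)
              for i, (text, formats) in enumerate(fragments)]
    if not scored:
        return None
    # Prefer the highest score; break ties in favor of the earlier fragment.
    score, _, text = max(scored, key=lambda s: (s[0], -s[1]))
    if score < threshold:
        return None
    return PREFIX.sub("", text)
```

A trained classifier would replace the fixed weights with values learned from historical data, but the inputs (formatting, position, and introductory words) would be the same kind of signal.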


In addition to extracting titles from data passed through the feature extraction application, statistical information is extracted for each extracted title that provides the number of appearances and the frequency of appearance of the extracted title in the data. As should be understood, other types of data in addition to titles may be extracted from a text or data selection for use in accordance with the present invention for augmenting the performance of a ranking application such as the field weighting algorithm described below.
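
A minimal Python sketch of such statistics follows. It counts how many times the extracted title occurs as a token sequence in the text and reports a frequency, defined here as occurrences per token. That definition of frequency, the whitespace tokenization, and the function name title_statistics are assumptions, since the description does not define them.

```python
def title_statistics(title, text):
    """Return the count and per-token frequency of a title within the text."""
    title_tokens = title.lower().split()
    tokens = text.lower().split()
    n = len(title_tokens)
    count = 0
    if n:
        # Slide a window of n tokens over the text and count exact matches.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == title_tokens:
                count += 1
    frequency = count / len(tokens) if tokens else 0.0
    return {"count": count, "frequency": frequency}
```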


Referring still to FIG. 3, once the parsed content and metadata obtained from search results items have been passed through the analysis applications, including the feature extraction application 350, each word, phrase, or data item, identified by the one or more analysis applications 340, 350, 360, is passed to an index center 370 for indexing in association with a given search results item. For example, according to embodiments of the present invention, any word or phrase extracted as a title from a document or resource by the feature extraction application 350 is passed to the index center 370 as an extracted title. That is, the extracted word or phrase is indexed in the same manner as would be a word or phrase that has been identified with certainty as a title associated with a given document, resource, or URL. However, as should be understood, a given document or resource may have a text item actually identified as a title of the document or resource. According to embodiments of the invention, both such “actual” titles and any extracted titles are used by the ranking algorithm for ranking search results, as described herein. The extracted titles serve to augment other information provided to the ranking algorithm.
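
To make the indexing step concrete, the Python sketch below stores term frequencies per property (field), with the extracted title kept as its own property next to the "actual" metadata title. The class name FieldIndex, the field names, and the whitespace tokenization are assumptions for this example. The description does not specify the index layout.

```python
from collections import Counter, defaultdict


class FieldIndex:
    """Hypothetical inverted index keeping per-field term frequencies."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: {field: tf}}
        self.lengths = {}                  # doc_id -> {field: token count}

    def add(self, doc_id, fields):
        """Index one search results item given as {field name: text}."""
        self.lengths[doc_id] = {}
        for field, text in fields.items():
            tokens = text.lower().split()
            self.lengths[doc_id][field] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term].setdefault(doc_id, {})[field] = tf
```

For example, indexing an item with a metadata title of "Untitled" and an extracted title of "Computers A Brief History" leaves both available to the ranking step, so the extracted title augments rather than replaces the metadata.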


Extracted titles and associated statistical information, words, phrases, data, and the like, indexed for a given search results item are passed to a field weighting application 380 for generation of a ranking value for the associated search results item based on its relevance to the search request. The following represents an example scoring or ranking algorithm for ranking a given search result based on indexed data obtained or extracted from a given search results item.










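\[
\sum_{t \in Q} \frac{\mathit{wtf}\,(k_1 + 1)}{k_1\left((1 - b) + b\,\dfrac{\mathit{wdl}}{\mathit{avwdl}}\right) + \mathit{wtf}} \times \log\left(\frac{N}{n}\right)
\]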

According to this algorithm, “Wtf” means weighted term frequency which is a sum of term frequencies of a given term multiplied by weights across all properties. According to embodiments, statistical analysis weighting may be applied to various terms for use in the algorithm. For example, common words such as “and” and “the” may be given low weights so those terms will not adversely affect ranking of a given search item. “Wdl” means weighted document length. “Avwdl” means average weighted document length. “N” means number of documents in the corpus or body of documents or resources located in the search. The lower case “n” means the number of documents containing the given query term and the sum is across all query terms. The variables “k1” and “b” are constants. As should be understood, the ranking algorithm described above is one example of such an algorithm. For a detailed description of the operation of a field weighting ranking/scoring function, as described herein, see U.S. patent application Ser. No. 10/804,326, filed Mar. 18, 2004, entitled “Field Weighting In Text Document Searching,” which is incorporated herein by reference as if fully set out herein.
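
The scoring function defined by these variables can be written out as a short Python sketch. The default values k1 = 1.2 and b = 0.75 are conventional choices for this family of ranking functions and are assumptions here, as are the function names and the dictionary-based inputs. The description states only that the two variables are constants.

```python
import math


def weighted_tf(field_tfs, weights):
    """wtf: sum of per-property term frequencies multiplied by property weights."""
    return sum(weights.get(field, 0.0) * tf for field, tf in field_tfs.items())


def score(query_terms, doc_tfs, wdl, avwdl, N, doc_freq, weights, k1=1.2, b=0.75):
    """Sum the per-term score over all query terms for one document.

    doc_tfs:  {term: {field: tf}} for this document
    doc_freq: {term: n}, the number of documents containing the term
    """
    total = 0.0
    for term in query_terms:
        wtf = weighted_tf(doc_tfs.get(term, {}), weights)
        n = doc_freq.get(term, 0)
        if wtf == 0 or n == 0:
            continue  # term absent from this document or from the corpus
        length_norm = (1 - b) + b * wdl / avwdl
        total += wtf * (k1 + 1) / (k1 * length_norm + wtf) * math.log(N / n)
    return total
```

Because an extracted title is simply one more property with a non-zero weight, a query term that appears in the extracted title raises wtf, and with it the ranking value.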


Other suitable ranking algorithms may be used for generating ranking values for search results based on content, metadata and extracted features of a given document or resource. For example, according to a particular embodiment, the algorithm cited above may be modified by removing the expression “(1−b)+b wdl/avwdl.” This expression is a length normalization factor, which in this embodiment is instead incorporated into the wtf variable. According to this embodiment, independent length normalization is used for each property. WTF becomes a weighted sum of term frequencies normalized by length. Extracted features (e.g., titles) and associated statistical information are included in the weighted sum of term frequencies, illustrated in the algorithm above, as one of the non-zero weight properties. The extracted titles and associated statistical information serve to augment the performance of the ranking algorithm for generating more accurate and reliable rankings for individual search results items. As should be understood, other types of data, in addition to titles, for example, names, dates, locations, and the like, may be extracted by a feature extraction application for augmenting the performance of the field weighting application. Once the field weighting ranking/scoring function is run using the indexed data obtained or extracted from an associated search results item, a numerical value is assigned to the associated search results item in terms of its relevance to the associated search request.
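
The modified embodiment can be sketched in Python as follows. The description says only that length normalization moves into wtf and is applied independently per property. The specific per-property form used here, each term frequency divided by (1 − b) + b × length / average length for that property, is an assumption, as are the per-property b values, the default k1, and the function names.

```python
import math


def normalized_wtf(field_tfs, field_lengths, avg_field_lengths, weights, b):
    """Weighted sum of term frequencies, each normalized by its own property length."""
    total = 0.0
    for field, tf in field_tfs.items():
        weight = weights.get(field, 0.0)
        if weight == 0.0:
            continue  # zero-weight properties do not contribute
        b_f = b.get(field, 0.75)
        norm = (1 - b_f) + b_f * field_lengths[field] / avg_field_lengths[field]
        total += weight * tf / norm
    return total


def score_variant(query_terms, doc_tfs, field_lengths, avg_field_lengths,
                  N, doc_freq, weights, b, k1=1.2):
    """Score one document with the length normalization folded into wtf."""
    total = 0.0
    for term in query_terms:
        wtf = normalized_wtf(doc_tfs.get(term, {}), field_lengths,
                             avg_field_lengths, weights, b)
        n = doc_freq.get(term, 0)
        if wtf == 0 or n == 0:
            continue
        total += wtf * (k1 + 1) / (k1 + wtf) * math.log(N / n)
    return total
```

Normalizing each property separately means that a short extracted title is not penalized for belonging to a document with a very long body.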


Referring still to FIG. 3, a user interface 390 displays the search results based on the ranking value applied to each search result. That is, the search result having the highest ranking is displayed first, the search result having the next highest ranking is displayed next, and so on.
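
The display order amounts to a descending sort on the ranking value, as in this brief Python sketch. The function name order_results and the (item, ranking value) pair format are assumptions for illustration.

```python
def order_results(ranked_items):
    """Sort (item, ranking_value) pairs so the highest ranking value comes first."""
    # sorted() is stable, so items with equal ranking values keep their input order.
    return sorted(ranked_items, key=lambda pair: pair[1], reverse=True)
```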


Having described components of a system for ranking search results using properties extracted by a feature extraction application with respect to FIG. 3, FIG. 4 is a flow diagram illustrating steps performed by a routine of the present invention for ranking search results using feature extraction data according to embodiments of the present invention. The routine 400 begins at start block 405 and proceeds to block 410 where a given URL space associated with an Internet-based information site or sites is crawled by a web crawler 310 for locating documents, resources, URLs, and the like, responsive to a user's search request. Similarly, a search engine for searching documents in a company or other organization resource storage location may be utilized.


At block 415, documents or other resources obtained by the search engine are parsed to obtain text, data, formatting information, metadata, and the like. At block 420, text, content, metadata, and other data parsed from obtained documents or resources are passed to one or more analysis applications 340, 350, 360. In particular, according to embodiments of the present invention, a feature extraction application 350 is run against text, data, and metadata parsed from a given search results item to extract titles from the associated search results item. At block 425, extracted titles and other text, data and metadata are indexed in an index center 370, described above.


At block 430, in response to a search request, the search engine searches the index for information responsive to the request. Information responsive to the request, including extracted titles and other text, data and metadata, is passed to the ranking algorithm, as described above. At block 435, ranking values are generated by the ranking algorithm application for each search results item. At block 440, the results of the user's search request are displayed according to the ranking values applied to each search results item by the field weighting application. According to embodiments, the steps described above may be performed according to different orders. For example, according to one embodiment, the index is built for a variety of documents or resources in a given information source. Then, at query time, the search engine searches the index and ranks the results using the ranking algorithm.
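
The index-then-query ordering described in the last embodiment can be sketched end to end in Python. To keep the example short, length normalization is omitted (equivalent to b = 0), and the property weights, the default k1, and the function names build_index and search are assumptions rather than details from the description.

```python
import math
from collections import Counter, defaultdict

# Assumed property weights; the extracted title is weighted like an actual title.
WEIGHTS = {"title": 2.0, "extracted_title": 2.0, "body": 1.0}


def build_index(documents):
    """Index time: map each term to {doc_id: weighted term frequency}."""
    index = defaultdict(dict)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            weight = WEIGHTS.get(field, 0.0)
            if weight == 0.0:
                continue
            for term, tf in Counter(text.lower().split()).items():
                index[term][doc_id] = index[term].get(doc_id, 0.0) + weight * tf
    return index


def search(index, num_docs, query, k1=1.2):
    """Query time: score matching documents and return ids, best first."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, wtf in postings.items():
            scores[doc_id] += wtf * (k1 + 1) / (k1 + wtf) * idf
    return sorted(scores, key=scores.get, reverse=True)
```

In this sketch the index is built once for all documents, and each query only reads from it, which mirrors the separation between blocks 410 through 425 and blocks 430 through 440.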


As described herein, methods and systems are provided for ranking search results using feature extraction data. According to a particular embodiment, titles are extracted from documents, resources, URLs, and the like located by a search engine in response to a user search request. The extracted titles are utilized along with other metadata and information parsed from a given search results item by a field weighting application for applying a ranking to the given search results item. It will be apparent to those skilled in the art that various modifications or variations may be made in the present invention without departing from the scope or spirit of the invention. Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein.

Claims
  • 1. A method, comprising: receiving on a computing device resource items generated by a search engine in response to a search request; parsing each of the resource items to obtain data, wherein the data includes: text, formatting information and metadata; passing the data for each of the resource items that includes the text, the formatting information and the metadata through a feature extraction application for determining applicability of the obtained metadata by comparing the obtained metadata against known metadata for use in ranking search results; comparing words parsed from the text that is separate from the obtained metadata against a database of known words that have been previously characterized; using formatting characteristics of the data to determine when to extract a title from the resource item; wherein the formatting characteristics that are used include: a bold formatting characteristic; an underlining formatting characteristic; wherein the feature extraction application stores statistical information for each extracted title that provides a number of and a frequency of appearance of the extracted title in the data; and extracting features from the one or more portions for each of the resource items; passing extracted features through a ranking application for generating a ranking value for each of the resource items based on a relevance of each of the resource items to the search request; and generating a list of the resource items in an order according to the ranking value for each of the resource items, whereby when the resource item has the ranking value associated with being ranked as most relevant to the search request received by the search engine is displayed first.
  • 2. The method of claim 1, whereby parsing the resource items into data further includes parsing the data to include one or more text selections and associated formatting information applied to the one or more text selections.
  • 3. The method of claim 2, whereby extracting the features from the data includes extracting statistical information for any extracted features including a frequency with which any features are found.
  • 4. The method of claim 3, whereby passing the extracted features and the data through the ranking application for generating the ranking value for each of the resource items based on the relevance of each of the resource items to a search request received by the search engine includes passing the statistical information for the extracted features through the ranking application.
  • 5. The method of claim 4, whereby extracting the features from the data for each of the resource items includes computing a score for a beginning point and an ending point for each extracted feature in a given resource item based on formatting information applied to a document.
  • 6. The method of claim 5, whereby extracting the features from the data for each of the resource items includes extracting titles from the data for each of the resource items.
  • 7. The method of claim 1, prior to passing the any extracted features and the data through the ranking application for generating the ranking value for each of the resource items, further comprising: receiving a search request at the search engine; and causing the list of the resource items to display in a user interface.
  • 8. The method of claim 1, prior to parsing each of the resource items into data, receiving the resource items from one or more information sources accessed and parsed by the search engine.
  • 9. A method, comprising: receiving on a computing device resource items generated by a search engine in response to a search request; obtaining resource items from an information source; parsing a metadata source in each of the resource items for one or more metadata items, and parsing a content portion of each of the resource items into one or more text selections and associated formatting information applied to the one or more text selections; passing the one or more metadata items and the one or more text selections and associated formatting information for each of the resource items through a feature extraction application; extracting titles from the one or more text selections and associated formatting information for each of the resources including using formatting characteristics to determine when to extract a title from the resource item; wherein the formatting characteristics that are used include: a bold formatting characteristic; an underlining formatting characteristic; wherein the feature extraction application stores statistical information for each extracted title that provides a number of and a frequency of appearance of the extracted title; determining applicability of the one or more metadata items by comparing the one or more metadata items against known metadata for use in ranking search results; comparing words parsed from the text that is separate from the one or more metadata items against a database of known words that have been previously characterized; processing the extracted titles, the one or more metadata items, the one or more text selections and the associated formatting information according to a ranking algorithm for generating a ranking value for each of the resource items based on a relevance of each of the resource items to the search request received by the search engine; and generating a list of the resource items in an order according to the ranking value for each of the resource items.
  • 10. The method of claim 9, further comprising displaying each of the resource items in a user interface in an order according to the ranking value for each of the resource items whereby the resource item having the ranking value associated with being ranked as most relevant to the search request received by the search engine is displayed first.
  • 11. The method of claim 9, whereby extracting titles from the one or more text selections and associated formatting information for each of the resource items includes extracting statistical information for any extracted titles including a frequency with which any titles are found in an associated resource item; and further comprising processing the statistical information for the any extracted titles according to a field weighting algorithm in association with processing the any extracted titles, the one or more metadata items, the one or more text selections and the associated formatting information according to the field weighting algorithm.
  • 12. The method of claim 9, prior to processing the extracted titles, the one or more metadata items, the one or more text selections and the associated formatting information according to the ranking algorithm for generating the ranking value for each of the resource items based on the relevance of each of the resource items to the search request received by the search engine, receiving the search request at the search engine.
  • 13. A computer-readable medium having stored thereon computer-executable instructions which when executed by a computer perform a method, comprising: receiving resource items generated by a search engine in response to a search request, the resource items including one or more content portions and one or more metadata portions; parsing each of the resource items into the one or more content portions and associated formatting information applied to the one or more content portions; passing the one or more content portions and associated formatting information for each of the resource items through a feature extraction application; determining applicability of the one or more metadata portions by comparing the one or more metadata portions against known metadata for use in ranking search results; comparing words parsed from the content portions that is separate from the one or more metadata portions against a database of known words that have been previously characterized; using formatting characteristics to determine when to extract a title from the resource item; wherein the formatting characteristics that are used include: a bold formatting characteristic; an underlining formatting characteristic; wherein the feature extraction application stores statistical information for each extracted title that provides a number of and a frequency of appearance of the extracted title in the data; and extracts features from the one or more content portions and associated formatting information for each of the resource items; passing the extracted features and the one or more content portions and associated formatting information through a ranking application for generating a ranking value for each of the resource items based on a relevance of each of the resource items to the search request received by the search engine; and generating a list of the resource items in an order according to the ranking value for each of the resource items.
  • 14. The computer-readable medium of claim 13, wherein parsing each of the resource items further comprises: parsing the one or more metadata sources in each of the resource items for one or more metadata items; and parsing the one or more content portions into one or more text selections and associated formatting information applied to the one or more text selections.
  • 15. The computer-readable medium of claim 14, whereby extracting features from the one or more content portions and associated formatting information and the one or more metadata portions includes extracting statistical information for any extracted features including a frequency with which any features are found in an associated resource item.
  • 16. The computer-readable medium of claim 15, whereby passing the extracted features and the one or more content portions and associated formatting information and one or more metadata portions through the ranking application for generating the ranking value for each of the resource items based on a relevance of each of the resource items to a search request received by the search engine includes passing the statistical information for the extracted features through the ranking application with the associated extracted features.
  • 17. The computer-readable medium of claim 16, whereby extracting features from the one or more content portions and associated formatting information and the one or more metadata portions for each of the resource items includes computing a score for a beginning point and an ending point for each extracted feature in a given resource item based on formatting information applied to a document.
  • 18. The computer-readable medium of claim 17, whereby extracting features from the one or more content portions and associated formatting information and the one or more metadata portions for each of the resource items includes extracting titles from the one or more content portions and associated formatting information for each of the resource items.
  • 19. The computer-readable medium of claim 13, prior to passing the any extracted features and the one or more content portions and associated formatting information through the ranking application for generating the ranking value for each of the resource items, further comprising: receiving a search request at the search engine; and causing the list of the resource items to display in a user interface in an order according to the ranking value for each of the resource items whereby the resource item having the ranking value associated with being ranked as most relevant to the search request received by the search engine is displayed first.
  • 20. The computer-readable medium of claim 13, prior to parsing each of the resource items into the one or more content portions and associated formatting information, receiving the resource items from one or more information sources accessed and parsed by the search engine.
Related Publications (1)
Number Date Country
20060136411 A1 Jun 2006 US