Direct navigation for information retrieval

Description

BACKGROUND

This invention relates to direct navigation for information retrieval.

A web page such as on the World Wide Web (“Web”) represents web content and is typically written in Hypertext Markup Language (“HTML”). HTML is a set of markup symbols or codes inserted in a file intended for display on a Web browser page. The markup symbols tell the Web browser how to display a Web page's words and images for a user.

A search engine is Web-enabled software that receives search terms, i.e., a query, from a user and identifies documents, i.e., web pages, on the Web that are associated with the search terms. The documents are typically identified by uniform resource locators (“URLs”) that are links to their content.

SUMMARY

In an aspect, the invention features a method of retrieving information including assigning concept labels to documents contained in a collection, receiving a query, converting the query to a query concept, and mapping the query concept to a concept label.

Embodiments may include one or more of the following. Assigning may include parsing the documents automatically with a grammar. The concept label may represent a general notion. The query may be a text query received from a user. Assigning may include spidering the collection, matching features contained in each of the documents to a store of concepts, and storing document location indicators for each matched concept. The documents may be Hypertext Markup Language (HTML) files. The document location indicators may be Uniform Resource Locators (URLs). Converting may include applying a store of grammar rules to the query, and the grammar rules may map text to concepts.

The method may also include generating a list of the mapping in which the list represents locations of documents. The locations may be Uniform Resource Locators (URLs).

In another aspect, the invention features a method of document retrieval including assigning concept labels to documents contained in a collection according to grammar rules, receiving a query, converting the query to a query concept using the grammar rules, and mapping the query concept to a concept label.

Embodiments may include one or more of the following. Assigning may include parsing the documents automatically with the grammar rules. The query may be received from a user.

The method may also include generating a list of the mapped query concepts, and displaying the list to the user on an input/output device.

Embodiments of the invention may have one or more of the following advantages.

Documents in a collection are pre-labeled with concepts, wherein each concept is a general notion or idea. A grammar is written around concepts. The grammar is applied to a user query and provides a direct mapping of the user query to appropriate documents in a collection of documents found on the Web.

The pre-labeling with concepts may be automatic using parsing with a grammar.

A user query is responded to using concept matching rather than direct word matching.

Using a grammar applied to a user query provides a technique for allowing many different ways of expressing something to map to a single item.

Annotating documents contained on the Web with concepts provides a robust manner of searching for Web documents.

Concept matching overcomes the limitation where a user query needs to match words and cannot find all the words in a single document.

DESCRIPTION OF DRAWINGS

The foregoing features and other aspects of the invention will be described further in detail by the accompanying drawings, in which:

FIG. 1 is a block diagram of a network.

FIG. 2 is a block diagram of a direct navigation process.

FIG. 3 is a flow diagram of a query process.

DETAILED DESCRIPTION

Referring to FIG. 1, an exemplary network 10 includes a user system 12 linked to a globally connected network of computers such as the Internet 14. The network 10 includes content servers 16, 18 and 20 linked to the Internet 14. Although only one user system 12 and three content server systems 16, 18, 20 are shown, other configurations include numerous client systems and numerous server systems. Each content server system 16, 18, 20 includes a corresponding storage device 22, 24, 26. Each storage device 22, 24, 26 includes a corresponding database of content 28, 30, 32. The network 10 also includes a direct navigation server 34.

The direct navigation server 34 includes a processor 36 and a memory 38. Memory 38 stores an operating system (“O/S”) 40, a TCP/IP stack 42 for communicating over the network 10, and machine-executable instructions 44 executed by processor 36 to perform a direct navigation process 100 described below. The direct navigation server 34 also includes a storage device 46 having a database 47.

The user system 12 includes an input/output device 48 having a Graphical User Interface (GUI) 50 for display to a user 52. The GUI 50 typically executes search engine software such as the Yahoo, AltaVista, Lycos or Google search engine through browser software, such as Netscape Communicator from AOL Corporation or Internet Explorer from Microsoft Corporation.

Referring to FIG. 2, the direct navigation process 100 includes a pre-processing stage 102 and a post-processing stage 104. The pre-processing stage 102 includes a document annotation process 106. The post-processing stage 104 includes a user query conversion process 108 and a concept mapping process 110.

The annotation process 106 assigns one or more concept labels to features of a document contained in a collection of documents. A concept label represents a concept, and a concept represents a general notion or idea. More specifically, each of the databases 28, 30, 32 contains pages (e.g., web pages) generally referred to as documents. The annotation process 106 includes a spider program to spider all the pages of content contained in the databases 28, 30, 32.

A spider is a program that visits Web sites (e.g., server 16, 18, 20) and reads their pages and other information in order to generate entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a “crawler” or a “bot.” Spiders are typically programmed to visit servers (i.e., websites) that have been submitted by their owners as new or updated. Entire Web sites or specific pages can be selectively visited and indexed. Spiders are called spiders because they usually visit many Web sites in parallel at the same time, their “legs” spanning a large area of the “web.” Spiders can crawl through a site's pages in several ways. One way is to follow all the hypertext links in each page until all the pages have been read. Hypertext is an organization of information units into connected associations that a user can choose to make. An instance of such an association is called a link or hypertext link. A link is a selectable connection from one word, picture, or information object to another. In a multimedia environment such as the World Wide Web, such objects can include sound and motion video sequences. The most common form of link is the highlighted word or picture that can be selected by the user (with a mouse or in some other fashion), resulting in the immediate delivery and view of another file. The highlighted object is referred to as an anchor. The anchor reference and the object referred to constitute a hypertext link.
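By way of illustration only, the link-following crawl described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the spider of the annotation process 106: the `PAGES` dictionary stands in for a content server, the names `LinkExtractor` and `spider` are hypothetical, and no network access is performed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> HTML. Stands in for a content server.
PAGES = {
    "http://example.com/": '<a href="/cars.html">Cars</a> <a href="/about.html">About</a>',
    "http://example.com/cars.html": '<a href="/">Home</a> <a href="reviews.html">Reviews</a>',
    "http://example.com/about.html": "<p>No links here.</p>",
    "http://example.com/reviews.html": '<a href="/cars.html">Back</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href of every anchor (<a>) tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start_url, fetch):
    """Follow hypertext links breadth-first until every reachable page is read."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not available; skip it
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links against the page URL
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


visited = spider("http://example.com/", PAGES.get)
```

The `seen` set keeps the spider from reading a page twice when pages link back to one another, which is the usual case on a real site.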

For example, a particular page of content may include text pertaining to automobiles, with their associated purchase options and pricing. This particular page can be annotated with a “review” concept and/or an “automotive review” concept. The document annotation process 106 stores a Uniform Resource Locator (URL) of the particular page along with its related concept(s) in the storage device 46 of the direct navigation server 34. The URL is an address of the particular page (also referred to as a resource). The type of resource depends on the Internet application protocol. Using the World Wide Web's protocol, the Hypertext Transfer Protocol (HTTP), the resource can be an HTML page, an image file, a program such as a common gateway interface application or Java applet, or any other file supported by HTTP. The URL has the name of the protocol required to access the resource, a domain name that identifies a specific computer on the Internet, and a hierarchical description of a page location on the computer.
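As an illustrative sketch only, the three parts of a URL named above can be separated with Python's standard `urllib.parse.urlsplit`, and the page's URL can then be stored against its concepts. The URL shown and the dictionary layout of `concept_store` are assumptions made for this example; the description does not specify how the storage device 46 organizes concept/URL pairs.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
url = "http://www.example.com/autos/reviews/sedan.html"
parts = urlsplit(url)

protocol = parts.scheme   # name of the protocol required to access the resource
domain = parts.netloc     # domain name identifying a specific computer
location = parts.path     # hierarchical description of the page location

# Concept/URL pairs as an annotation process might store them (assumed layout):
# each concept label maps to the list of URLs annotated with it.
concept_store = {}
for concept in ("review", "automotive review"):
    concept_store.setdefault(concept, []).append(url)
```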

Concepts can be generated manually and matched to the particular web page. Concepts can also be generated automatically by parsing the documents with a grammar. In another example, concepts can be generated from a review of features associated with the page being spidered.
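One possible way to generate concepts automatically is sketched below, assuming a very small grammar in which each rule is a regular expression mapped to a concept label. The rules in `GRAMMAR_RULES` and the function name `annotate` are hypothetical and chosen only for illustration; an actual grammar may be considerably richer than pattern matching.

```python
import re

# Hypothetical grammar: each rule maps a text pattern to a concept label.
GRAMMAR_RULES = [
    (re.compile(r"\b(review|rating|road test)s?\b", re.IGNORECASE), "review"),
    (re.compile(r"\b(car|automobile|sedan)s?\b", re.IGNORECASE), "automotive"),
    (re.compile(r"\b(price|pricing|msrp)\b", re.IGNORECASE), "pricing"),
]


def annotate(text):
    """Return the sorted concept labels whose patterns occur in the text."""
    return sorted({concept for pattern, concept in GRAMMAR_RULES if pattern.search(text)})


labels = annotate("Road test: the new sedan, with purchase options and pricing.")
```

Because several patterns map to the same label, differently worded pages receive the same concept.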

In the post-processing stage 104, the user query conversion process 108 receives text in a user query entered by the user 52 through search engine software executing through browser software. In another example, the process 108 receives audio input as the user query and converts the audio input into a text user query.

The user query conversion process 108 utilizes a set of grammar rules stored in the storage device 46 of the direct navigation server 34 and applies the grammar rules to the user query such that the query matches one or more concepts.

The user query may be a word or multiple words, sentence fragments, a complete sentence, and may contain punctuation. The query is normalized as pretext. Normalization includes checking the text for spelling and proper separation. A language lexicon is also consulted during normalization. The language lexicon specifies a large list of words along with their normalized forms. The normalized forms typically include word stems only, that is, the suffixes are removed from the words. For example, the word “computers” would have the normalized form “computer” with the plural suffix removed.
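The normalization step can be sketched as follows, assuming a language lexicon represented as a plain dictionary from words to their normalized forms. The lexicon entries and the function name `normalize` are hypothetical, and spelling correction is omitted from this sketch.

```python
import re

# Hypothetical language lexicon: surface word -> normalized (stem) form.
LANGUAGE_LEXICON = {
    "computers": "computer",
    "laptops": "laptop",
    "speeds": "speed",
}


def normalize(query):
    """Lower-case the query, separate it into words, and map each word to its normalized form."""
    words = re.findall(r"[a-z0-9]+", query.lower())  # drops punctuation
    return [LANGUAGE_LEXICON.get(word, word) for word in words]


normalized = normalize("Fast Computers, please!")
```

Words absent from the lexicon pass through unchanged in this sketch.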

The normalized text is parsed, converting the normalized text into fragments adapted for further processing. Fragments are produced by annotating words as putative keys and values according to a feature lexicon. The feature lexicon is a vocabulary, that is, a listing of the words in a language, or of a considerable number of them, with a definition of each, e.g., a dictionary. For example, the feature lexicon may specify that the term “Compaq” is a potential value and that “CPU speed” is a potential key. Multiple annotations are possible.
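A minimal sketch of fragment production is given below, assuming the feature lexicon is a dictionary from terms to the set of roles each term may play. The lexicon contents, the greedy longest-match strategy, and the name `to_fragments` are assumptions made for illustration and are not taken from the description.

```python
# Hypothetical feature lexicon: term -> putative roles the term may play.
FEATURE_LEXICON = {
    "compaq": {"value"},
    "cpu speed": {"key"},
    "500 mhz": {"value"},
}


def to_fragments(words, max_len=2):
    """Greedily match the longest lexicon term at each position and annotate it."""
    fragments = []
    i = 0
    while i < len(words):
        for size in range(min(max_len, len(words) - i), 0, -1):
            term = " ".join(words[i:i + size])
            if term in FEATURE_LEXICON:
                fragments.append((term, sorted(FEATURE_LEXICON[term])))
                i += size
                break
        else:
            fragments.append((words[i], []))  # word carries no annotation
            i += 1
    return fragments


fragments = to_fragments(["compaq", "500", "mhz", "cpu", "speed"])
```

Storing a set of roles per term allows a term to carry more than one annotation.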

The fragments are inflated by the context in which the text inputted by the user arrived, e.g., a previous query, if any, that was inputted and/or the content of a web page in which the user text was entered. The inflation is performed by selectively merging state information provided by a session service with a meaning representation for the current query. The selective merging is configurable based on rules that specify which pieces of state information from the session service should be merged into the current meaning representation and which pieces should be overridden or masked by the current meaning representation.
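The rule-driven merge can be sketched as follows, assuming both the state information and the current meaning representation are dictionaries of constraints. The rule names "inherit" and "mask", the contents of `MERGE_RULES`, and the function name `inflate` are hypothetical.

```python
# Hypothetical merge rules: how each piece of session state is treated.
MERGE_RULES = {
    "brand": "inherit",      # carried over unless the current query restates it
    "cpu speed": "inherit",
    "sort order": "mask",    # never carried over from the session
}


def inflate(current, session_state, rules=MERGE_RULES):
    """Merge session state into the current meaning representation per the rules."""
    merged = dict(current)
    for key, value in session_state.items():
        # A value in the current query always overrides the stored one.
        if rules.get(key) == "inherit" and key not in current:
            merged[key] = value
    return merged


previous = {"brand": "Compaq", "cpu speed": "400 MHz", "sort order": "price"}
current = {"cpu speed": "500 MHz"}
inflated = inflate(current, previous)
```

In this example the brand constraint is inherited from the previous query, the CPU speed is overridden by the current query, and the sort order is masked.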

A session service may store all of the “conversations” that occur at any given moment during all of the users' sessions. Storing state information in the session service provides a way of balancing load across additional computer configurations. Load balancing may send each user query to a different configuration of the computer system. However, since query processing requires state information, storing the state information on an individual computer system would not be compatible with load balancing. Hence, use of the session service provides easy expansion by the addition of server systems, with load sharing among the systems to support more users.

The state information includes user-specified constraints that were used in a previous query, if any. The state information may optionally include a result set, either in its entirety or in condensed form, from the previous query to speed up subsequent processing in context. The session service may reside in one computer system, or include multiple computer systems. When multiple computer systems are employed, the state information may be assigned to a single computer system or replicated across more than one computer system.
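The relationship between the session service and load-balanced query servers can be illustrated with the following sketch. The classes `SessionService` and `QueryServer`, the in-memory dictionary, and the session identifier are assumptions introduced for this example; a deployed session service would typically be a separate system or a set of replicated systems.

```python
class SessionService:
    """Keeps per-session state outside the query servers (in-memory stand-in)."""

    def __init__(self):
        self._state = {}

    def load(self, session_id):
        return dict(self._state.get(session_id, {}))

    def save(self, session_id, state):
        self._state[session_id] = dict(state)


class QueryServer:
    """A stateless worker: all context comes from the shared session service."""

    def __init__(self, name, sessions):
        self.name = name
        self.sessions = sessions

    def handle(self, session_id, constraints):
        state = self.sessions.load(session_id)
        state.update(constraints)  # current query overrides earlier constraints
        self.sessions.save(session_id, state)
        return state


sessions = SessionService()
servers = [QueryServer("a", sessions), QueryServer("b", sessions)]

# Two queries from the same session are routed to different servers.
first = servers[0].handle("user-52", {"brand": "Compaq"})
second = servers[1].handle("user-52", {"cpu speed": "500 MHz"})
```

Because neither server holds state of its own, the second server still sees the constraint recorded by the first.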

The inflated sentence fragments are converted into meaning representation by making multiple passes through a meaning resolution stage. The meaning resolution stage determines if there is a valid interpretation within the text query of a key-value grouping of the fragment. If there is a valid interpretation, the key value grouping is used. For example, if the input text, i.e., inflated sentence fragment, contains the string “500 MHz CPU speed,” which may be parsed into two fragments, “500 MHz” value and “CPU speed” key, then there is a valid grouping of key=“CPU speed” and value=“500 MHz”.

If no valid interpretation exists, a determination is made on whether the grammar rules contain a valid interpretation. If there is a valid interpretation, the key-value grouping is used. If no valid interpretation is found, a determination is made of whether previous index fields have a high confidence of uniquely containing the fragment. If so, the key-value grouping is used. If not, other information sources are searched and a valid key-value grouping is generated. If a high-confidence and valid putative key is determined through one of the information sources consulted, then the grouping of the key and value forms an atomic element that is used. To make it possible to override false interpretations, a configuration of the grammar can also specify manual groupings of keys and values that take precedence over the meaning resolution stage. Meaning-resolved fragments, representing the user query, are associated with concepts.
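The order in which the information sources are consulted can be sketched as a simple cascade. The data in `GRAMMAR_KEYS`, `INDEX_FIELDS`, and `MANUAL_GROUPINGS`, the threshold value, and the function name `resolve` are hypothetical, and the search of other information sources is left as a stub.

```python
# Hypothetical information sources for the sketch.
MANUAL_GROUPINGS = {"pentium": "cpu type"}   # manual overrides (value -> key)
GRAMMAR_KEYS = {"compaq": "brand"}           # grammar rules (value -> key)
INDEX_FIELDS = {                             # key -> {value: confidence}
    "ram": {"256 mb": 0.95},
    "disk": {"256 mb": 0.10},
}
CONFIDENCE_THRESHOLD = 0.9


def resolve(value, adjacent_key=None):
    """Return a (key, value) grouping for a value fragment, or None if unresolved."""
    if value in MANUAL_GROUPINGS:            # manual groupings take precedence
        return MANUAL_GROUPINGS[value], value
    if adjacent_key is not None:             # key stated in the query text itself
        return adjacent_key, value
    if value in GRAMMAR_KEYS:                # interpretation from the grammar rules
        return GRAMMAR_KEYS[value], value
    confident = [
        key for key, values in INDEX_FIELDS.items()
        if values.get(value, 0.0) >= CONFIDENCE_THRESHOLD
    ]
    if len(confident) == 1:                  # one index field confidently contains it
        return confident[0], value
    return None                              # other sources would be searched here
```

For the example in the text, `resolve("500 mhz", adjacent_key="cpu speed")` yields the grouping of key "cpu speed" and value "500 mhz".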

The concept mapping process 110 matches the concept associated resolved fragments to concept/URL pairs stored in the storage device 46 and loads the associated URL representing the matched concepts.

Referring to FIG. 3, a query process 200 includes loading (202) a database containing pages of content with mapped concepts. The process 200 receives (204) a user query and parses (206) the user query in conjunction with grammar rules. The process 200 associates (208) the parsed user query with a concept. The process 200 matches (210) the user query concept with a concept/URL pair and loads (212) the associated concept as directed by the URL.
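The steps of the query process 200 can be sketched end to end as follows. The concept/URL pairs, the grammar rules, and the function name `query_process` are hypothetical, and the grammar is reduced to an exact phrase lookup purely to keep the example short.

```python
# Hypothetical concept/URL pairs produced by the pre-processing stage.
CONCEPT_URLS = {
    "automotive review": ["http://www.example.com/autos/reviews.html"],
    "computer pricing": ["http://www.example.com/computers/prices.html"],
}

# Hypothetical grammar rules: many phrasings map to a single concept.
GRAMMAR_RULES = {
    "car review": "automotive review",
    "automobile rating": "automotive review",
    "how good is this car": "automotive review",
    "computer price": "computer pricing",
}


def query_process(user_query):
    """Parse a query with the grammar rules, then map its concept to stored URLs."""
    # Collapse whitespace, lower-case, and drop trailing punctuation.
    text = " ".join(user_query.lower().split()).rstrip("?!.")
    concept = GRAMMAR_RULES.get(text)              # parse (206) and associate (208)
    if concept is None:
        return None, []
    return concept, CONCEPT_URLS.get(concept, [])  # match (210)


concept, urls = query_process("How good is this car?")
```

The query shares no words with the concept label "automotive review", yet it still maps to the annotated page, which illustrates concept matching as opposed to direct word matching.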

Accordingly, other embodiments are within the scope of the following claims.

Claims

1. A computer program comprising instructions stored in a memory that, when executed by a processor, cause the processor to: perform a pre-processing stage by parsing documents contained in a collection with a grammar in order to identify one or more concepts contained therein, and assign concept labels to the documents contained in the collection according to the grammar; and perform a post-processing stage to apply the grammar to a query to convert the query to one or more concepts and map the concepts to the concept labels that match the concepts, wherein the query is normalized, the normalized query is parsed and converted into fragments according to a feature lexicon, the fragments are inflated by selectively merging state information provided by a session service with a meaning representation for the query, and the inflated fragments are converted into a meaning resolution through a meaning resolution stage that determines whether there is a valid interpretation of a key-value grouping of each of the fragments, such that the meaning resolved fragments are associated with the concepts.
2. The computer program of claim 1 further comprising instructions for causing the processor to: generate a list of the map.
3. A computer program comprising instructions stored in a memory that, when executed by a processor, cause the processor to: perform a pre-processing stage by parsing documents contained in a collection using grammar rules in order to identify one or more concepts contained therein, and assign concept labels to the documents contained in a collection according to the grammar rules; receive a query; perform a post-processing stage to apply the grammar rules to a query to convert the query to one or more concepts and map the concepts to the concept labels that match the concepts, wherein the query is normalized, the normalized query is parsed and converted into fragments according to a feature lexicon, the fragments are inflated by selectively merging state information provided by a session service with a meaning representation for the query, and the inflated fragments are converted into a meaning resolution through a meaning resolution stage that determines whether there is a valid interpretation of a key-value grouping of each of the fragments, such that the meaning resolved fragments are associated with the concepts.
4. The computer program of claim 3 further comprising instructions for causing the processor to: generate a list of the mapped query concepts; and display the list to a user on an input/output device.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of application Ser. No. 10/080,945, filed on Feb. 22, 2002, by Jane W. Chang et al., entitled DIRECT NAVIGATION FOR INFORMATION RETRIEVAL.

Related Publications (1)

	Number	Date	Country
	20080140613 A1	Jun 2008	US

Continuations (1)

	Number	Date	Country
Parent	10080945	Feb 2002	US
Child	12019586		US
