The present invention relates to a method and system for searching multifaceted information encoded by an inverted text index in an information retrieval system.
Conventional information retrieval (IR) systems combine free text search with contextual navigation to enhance the user experience. For example, a website that sells products provides a keyword search interface to search a database of documents associated with the products being sold, and the interface is combined with a browsing menu that allows users to drill down into several levels of categories of the products. In response to a user issuing a keyword query to search the database, the IR system presents the user with a set of relevant documents as a result of that query, and also changes the navigation menu to display the most relevant facets for the given query. Improvements are needed relative to the speed at which these known IR systems present the keyword search results and update the contextual navigation menu. Further, the development efforts required to combine free text search with contextual navigation are significant. Thus, there exists a need to overcome at least one of the preceding deficiencies and limitations of the related art.
The present invention provides a computer-implemented method of querying multifaceted information in an information retrieval system, comprising:
constructing, by the information retrieval (IR) system, an inverted index having a plurality of unique indexed tokens associated with a plurality of posting lists in a one-to-one correspondence, each posting list including one or more documents of a plurality of documents, wherein an indexed token of the plurality of unique indexed tokens is one of a facet token included as an annotation in a document of the plurality of documents and a path prefix of the facet token, wherein the annotation indicates a path within a tree structure representing a facet that includes the document, the tree structure including a plurality of nodes representing a category and one or more sub-categories that categorize the document;
receiving, by the IR system, a query that includes a plurality of constraints on the plurality of documents, the plurality of constraints being associated with multiple indexed tokens of the plurality of unique indexed tokens and multiple posting lists corresponding to the multiple indexed tokens; and
executing the query by the IR system, the executing including:
A system and a computer program product corresponding to the above-summarized method are also described and claimed herein.
Advantageously, the present invention provides a scalable technique that efficiently encodes facet information in an inverted index. Further, the present invention provides a runtime algorithm that efficiently evaluates queries that combine free text constraints and navigational constraints, thereby returning query results more quickly. Still further, the disclosed runtime algorithm is robust even though the indexed documents may be categorized inconsistently.
Overview
The present invention provides a scalable solution for adding multifaceted navigation capabilities to IR systems. The solution disclosed herein includes an inverted index used to encode multifaceted information and a runtime algorithm that efficiently evaluates queries that combine navigational constraints and free-text predicates (i.e., keywords). Further, the present invention provides a technique for efficiently counting the number of documents included in sub-categories of a category specified in a query constraint. Still further, a technique for computing an aggregate function relative to such sub-categories is also disclosed herein.
System for Querying Multifaceted Information
Using the multifaceted movie information organized in the facets of
Results of the search are displayed as a list including titles of movies that are English or French-language dramas (e.g., The Godfather, starring Marlon Brando and Al Pacino; The Great Escape, starring Steve McQueen; Scarface, starring Al Pacino; The French Connection, starring Gene Hackman; Breathless, starring Jean-Paul Belmondo, etc.). The numbers in parentheses indicate the number (i.e., counts) of qualifying movies within each drama sub-category and within each language sub-category. For example, (200) after crime drama indicates that there are 200 crime dramas in the database. These numbers in parentheses guide further drill-down by the user.
Continuing the example, a second drill-down is shown that now limits the search to English-language movies in the drama genre:
In this second drill-down, the counts shown for the dramas have decreased from the first drill-down because only English-language dramas are considered. Further, the list of search results is similarly shortened because French-language dramas are excluded; the remaining results include, e.g., The Godfather, starring Marlon Brando and Al Pacino; The Great Escape, starring Steve McQueen; Scarface, starring Al Pacino; The French Connection, starring Gene Hackman, etc.
Still continuing this example, “Al Pacino” is entered as a keyword search term and the resulting drill-down is shown below:
In this case, the search engine determined that war dramas and romantic dramas each had a count of zero, and therefore stopped displaying those two sub-categories as drill-down choices. In the search results list, only English-language dramas starring Al Pacino are displayed (e.g., The Godfather, starring Marlon Brando and Al Pacino and Scarface, starring Al Pacino).
Indexing
Each incoming document includes one or more facet tokens. As used herein, a facet token is defined as a document annotation that indicates a path in a facet's tree-structured taxonomy. In one embodiment, facet tokens are inserted into documents as meta-data in a general-purpose markup language (e.g., Extensible Markup Language (XML)). Hereinafter, specific facet tokens are represented by the term “facet:” followed by a path indicator (e.g., “facet:A.B.D”). It will be apparent to those skilled in the art that other representations can be used to indicate a facet token. The path indicated by a facet token usually ends in a leaf node of the facet's tree structure, but may also end in an internal node of the tree structure.
Facet 404 includes category 426 (i.e., node X) and node X's sub-categories 428 and 430 (i.e., nodes Y and Z, respectively). Facet 404 also includes document 418 in sub-category 428 and document 420 in sub-category 430.
It should be noted that a document can be included in multiple facets and be included in multiple paths within a facet. For example, document d1 is included in paths A.B.E and A.C.F of facet 402 and path X.Y of facet 404. To indicate its inclusion in paths A.B.E, A.C.F and X.Y, document d1 includes the following facet tokens: facet:A.B.E, facet:A.C.F and facet:X.Y.
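The following is a minimal Python sketch, not the claimed implementation, of how facet tokens and their path prefixes could be turned into indexed tokens with posting lists. The function names `prefix_tokens` and `build_index` are hypothetical, and the annotations assumed for documents d2 and d3 follow the fullpath example given later in this description.

```python
from collections import defaultdict


def prefix_tokens(facet_token: str) -> list[str]:
    """Expand 'facet:A.B.E' into the token itself plus every path prefix."""
    scheme, path = facet_token.split(":", 1)
    parts = path.split(".")
    return [f"{scheme}:{'.'.join(parts[:i])}" for i in range(1, len(parts) + 1)]


def build_index(docs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each indexed token to a sorted posting list of document ids."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            for indexed in prefix_tokens(token):
                postings[indexed].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


# Facet tokens per document (d2 and d3 are assumed from the fullpath example).
docs = {
    "d1": ["facet:A.B.E", "facet:A.C.F", "facet:X.Y"],
    "d2": ["facet:A.B", "facet:X.Z"],
    "d3": ["facet:A.C.F"],
}
index = build_index(docs)
```

Because every path prefix is indexed, a restriction on a category such as facet:A is answered by a single posting-list lookup, without traversing the taxonomy at query time.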
An inverted index is constructed by multifaceted search system 100 (see
In one embodiment, each item in a posting list in an inverted index includes an optional payload in which additional information about a document can be stored. Hereinafter, square brackets (i.e., [ ]) indicate a payload. For example, 0.1.0 is the payload in d3[0.1.0].
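As an illustration only, the bracket notation can be read mechanically into (document, payload) pairs. The sketch below is in Python; the helper name `parse_posting_list` and the regular expression are assumptions made for this example and are not part of the described system.

```python
import re

# A document id optionally followed by a bracketed, comma-separated payload.
ITEM = re.compile(r"(\w+)(?:\[([^\]]*)\])?")


def parse_posting_list(text: str) -> list[tuple[str, list[str]]]:
    """Parse text such as 'd3[0.1.0], d4' into (doc_id, payload values) pairs."""
    items = []
    for doc_id, payload in ITEM.findall(text):
        # An item without brackets has an empty payload.
        values = [p.strip() for p in payload.split(",")] if payload else []
        items.append((doc_id, values))
    return items


parsed = parse_posting_list("d3[0.1.0], d4")
```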
Returning to the movie database search example presented above relative to
Similarly, the aforementioned search for titles of movies that are English-language crime dramas that star Al Pacino can be provided by the following query:
In one embodiment, the query syntax also includes a function (e.g., GetCounts) that returns sub-category path names and their counts. The returned sub-category path names are the names of each sub-category under a category or sub-category specified by a facet restriction in the query. For example, the following query can be executed to return the sub-category names and counts under the genre.drama sub-category (see
It should be noted that the count function included in the query can utilize facet restrictions that are different from the query's facet restrictions. For example, using the taxonomy of
When executing a query, search engine 102 (see
Incoming documents may include dirty data (e.g., inconsistencies in the categorization of documents). For instance, document d1 in
The special exact tokens 472 indicate categories and/or sub-categories of the taxonomy of
Determining Counts of Qualifying Documents
fullpath d1[0.0.0, 0.1.0, 1.0], d2[0.0, 1.1], d3[0.1.0]
The fullpath token and posting list presented above illustrate that document d1 is included in full paths A.B.E, A.C.F, and X.Y, which correspond to the payload values of 0.0.0, 0.1.0, and 1.0, respectively; document d2 is included in full paths A.B and X.Z, which correspond to payload values 0.0 and 1.1, respectively; and document d3 is included in full path A.C.F, which corresponds to the payload value 0.1.0. It will be apparent to those skilled in the art that other encodings based on non-Dewey labeling schemes can also be used.
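The mapping between full paths and Dewey payload values can be sketched in Python as follows. The `TAXONOMY` table and the sibling ordering within it are assumptions chosen so that the encodings reproduce the payload values listed above; the function names are hypothetical.

```python
# Assumed taxonomy: each parent path maps to its children in sibling order.
# The empty string stands for the (virtual) root above all facets.
TAXONOMY = {
    "": ["A", "X"],
    "A": ["B", "C"],
    "A.B": ["E"],
    "A.C": ["F"],
    "X": ["Y", "Z"],
}


def dewey_encode(path: str) -> str:
    """Encode a path such as 'A.C.F' as sibling positions, e.g. '0.1.0'."""
    digits, parent = [], ""
    for name in path.split("."):
        digits.append(str(TAXONOMY[parent].index(name)))
        parent = f"{parent}.{name}" if parent else name
    return ".".join(digits)


def dewey_decode(code: str) -> str:
    """Decode a Dewey value such as '1.1' back to its path, e.g. 'X.Z'."""
    names, parent = [], ""
    for digit in code.split("."):
        name = TAXONOMY[parent][int(digit)]
        names.append(name)
        parent = f"{parent}.{name}" if parent else name
    return ".".join(names)


doc_paths = {"d1": ["A.B.E", "A.C.F", "X.Y"], "d2": ["A.B", "X.Z"], "d3": ["A.C.F"]}
fullpath = {doc: [dewey_encode(p) for p in paths] for doc, paths in doc_paths.items()}
```

A Dewey value is compact and lets a counter for any ancestor be located from the leading digits alone, which is what the counting step relies on.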
The counters in hierarchy 600 are used by multifaceted search system 100 (see
To support navigational operations, other counts are provided by embodiments of the present invention. In one embodiment, a query API provides a specification of whether the count function (e.g., GetCounts) of the query counts locally (i.e., only the children) or globally (i.e., the entire subtree). This specification of a local or a global mode facilitates the finding of nodes in the entire tree that have the highest counts for a given query. After the execution of a query, the navigational position of the user can be placed at the nodes that are most relevant (i.e., have the highest counts) for that query. For example, using the taxonomy of
Query Execution Algorithm
In step 704, the inverted index is utilized to identify the posting lists associated with T and F1, F2, . . . , Fn. These identified posting lists are intersected to determine a list of one or more qualifying documents. In step 706, the fullpath token is used to look up Dewey encodings E1, E2, . . . , Ek for each qualifying document determined in step 704. For each encoding Ei, the Dewey digits in Ei are used in step 708 to increment counters associated with sub-categories of the categories and/or sub-categories indicated by C1, C2, . . . , Cm. In step 710, the qualifying documents are returned (e.g., displayed) along with the counts of qualifying documents in each sub-category of C1, C2, . . . , Cm and the names of those sub-categories of C1, C2, . . . , Cm. The query execution algorithm ends at step 712.
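Steps 704 through 710 can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the claimed algorithm: posting lists are held as in-memory sets, the `POSTINGS`, `FULLPATH`, and `NAMES` tables are hand-built from the d1, d2, d3 example, the function name `execute` is hypothetical, and a document is counted at most once per sub-category. The query used in the usage lines is likewise hypothetical.

```python
from collections import Counter

# Assumed in-memory index fragments built from the d1/d2/d3 example.
POSTINGS = {
    "facet:A": {"d1", "d2", "d3"},
    "facet:A.B": {"d1", "d2"},
    "facet:X": {"d1", "d2"},
}
FULLPATH = {"d1": ["0.0.0", "0.1.0", "1.0"], "d2": ["0.0", "1.1"], "d3": ["0.1.0"]}
NAMES = {"0": "A", "0.0": "A.B", "0.1": "A.C", "1": "X", "1.0": "X.Y", "1.1": "X.Z"}
CODES = {name: code for code, name in NAMES.items()}


def execute(tokens: list[str], count_under: list[str]):
    """Return qualifying documents and per-sub-category counts."""
    # Step 704: intersect the posting lists of all query tokens.
    lists = [POSTINGS.get(t, set()) for t in tokens]
    qualifying = sorted(set.intersection(*lists)) if lists else []
    counts = Counter()
    for doc in qualifying:
        children = set()
        for enc in FULLPATH.get(doc, ()):  # Step 706: Dewey encodings
            digits = enc.split(".")
            for category in count_under:
                cat = CODES[category].split(".")
                # The digit after the category's digits names the child.
                if len(digits) > len(cat) and digits[: len(cat)] == cat:
                    children.add(".".join(digits[: len(cat) + 1]))
        for child in children:  # Step 708: increment counters
            counts[NAMES[child]] += 1
    return qualifying, dict(counts)  # Step 710: results with counts


docs_found, sub_counts = execute(["facet:A.B", "facet:X"], ["X"])
```

Here the two restrictions qualify d1 and d2, and the counters under X report one document each for X.Y and X.Z.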
As an example of applying the query execution algorithm of
In this example, the qualifying documents found by intersecting the facet tokens in step 704 are documents d1 and d2 (i.e., documents 418 and 420 of
Aggregation Function
In one embodiment, the aforementioned count function (e.g., GetCounts) included in the query syntax is supplemented with a more general function that provides aggregations over faceted data, where the aggregations are more sophisticated than simple counts of records or documents belonging to sub-categories of a certain category. Such aggregations are required in certain faceted search applications such as business intelligence (BI) applications and facilitate navigation to sub-categories of a facet.
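A generalization from counting to aggregation might look like the following Python sketch. It assumes each document carries a single numeric value and a list of sub-category paths; the function name `aggregate`, the document ids, the sub-category names, and the values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Iterable


def aggregate(
    qualifying: Iterable[str],
    doc_subcats: dict[str, list[str]],
    values: dict[str, int],
    fn: Callable[[list[int]], int] = sum,
) -> dict[str, int]:
    """Apply fn to the numeric values of qualifying documents per sub-category."""
    grouped = defaultdict(list)
    for doc in qualifying:
        for sub in doc_subcats.get(doc, ()):
            grouped[sub].append(values[doc])
    return {sub: fn(vals) for sub, vals in grouped.items()}


# Hypothetical data: three documents, two sub-categories, one numeric field.
doc_subcats = {"p1": ["US.CA"], "p2": ["US.NY"], "p3": ["US.CA"]}
values = {"p1": 100, "p2": 250, "p3": 50}

totals = aggregate(["p1", "p2", "p3"], doc_subcats, values)
largest = aggregate(["p1", "p2", "p3"], doc_subcats, values, fn=max)
```

Counting is the special case in which `fn` is `len`, so the same grouping pass can serve both the count function and richer aggregates.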
In certain data collections (e.g., enterprise data), each document has one or more numeric fields associated therewith and which are indexed in search engine 102 (see
For example, assume that each document in “project collection” has two numeric values associated therewith: contract_value and estimated_cost. Further, assume that there is a geography dimension, and that the category “US” (i.e., indicating the United States) is selected with the sub-categories being the 50 states of the United States. As described above, search engine 102 (see
Computing System
Local memory elements of memory 804 are employed during actual execution of the program code of multifaceted search system 814. Cache memory elements of memory 804 provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Further, memory 804 may include other systems not shown in
Memory 804 may comprise any known type of data storage and/or transmission media, including bulk storage, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Storage unit 812 is, for example, a magnetic disk drive or an optical disk drive that stores data. Moreover, similar to CPU 802, memory 804 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 804 can include data distributed across, for example, a LAN, WAN or storage area network (SAN) (not shown).
I/O interface 806 comprises any system for exchanging information to or from an external source. I/O devices 810 comprise any known type of external device, including a display monitor, keyboard, mouse, printer, speakers, handheld device, facsimile, etc. Bus 808 provides a communication link between each of the components in computing unit 800, and may comprise any type of transmission link, including electrical, optical, wireless, etc.
I/O interface 806 also allows computing unit 800 to store and retrieve information (e.g., program instructions or data) from an auxiliary storage device (e.g., storage unit 812). The auxiliary storage device may be a non-volatile storage device (e.g., a CD-ROM drive which receives a CD-ROM disk). Computing unit 800 can store and retrieve information from other auxiliary storage devices (not shown), which can include a direct access storage device (DASD) (e.g., hard disk or floppy diskette), a magneto-optical disk drive, a tape drive, or a wireless communication device.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code of multifaceted search system 814 for use by or in connection with a computing unit 800 or any instruction execution system to provide and facilitate the capabilities of the present invention. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, RAM 804, ROM, a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
The flow diagrams depicted herein are provided by way of example. There may be variations to these diagrams or the steps (or operations) described herein without departing from the spirit of the invention. For instance, in certain cases, the steps may be performed in differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the present invention as recited in the appended claims.
While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.
This application is a continuation application claiming priority to Ser. No. 11/564,915, filed Nov. 30, 2006, now U.S. Pat. No. 7,496,568 issued Feb. 24, 2009.
Number | Date | Country |
---|---|---|
20080222117 A1 | Sep 2008 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 11564915 | Nov 2006 | US |
Child | 12124272 | | US |