The present invention relates generally to a search and retrieval system, and more particularly, to an intelligent search and retrieval system and method.
Existing comprehensive search and retrieval systems have been principally designed to provide services for “Information Professionals”, such as professional searchers, librarians, reference desk staff, etc. These information professionals generally have a significant amount of training and experience in drafting complex focused queries for input into these information service systems and are able to understand and use the many features available in the various existing comprehensive search and retrieval systems.
However, with the explosive increase in the quantity, quality, availability, and ease-of-use of Internet-based search engines such as Google, AltaVista, Yahoo Search, Wisenut, etc., there is a new population of users familiar with these Internet-based search products who now expect similar ease-of-use, simple query requirements and comprehensive results from all search and retrieval systems. The members of this new population are most likely not information professionals with a significant amount of training and experience in using comprehensive information search and retrieval systems, and they are often referred to as “end-users.” The existing comprehensive search and retrieval systems generally place the responsibility on an end-user to define all of the search, retrieval and presentation features and principles before performing a search. This level of complexity is accessible to information professionals, but often not to end-users. Presently, end-users typically enter a few search terms and expect the search engine to deduce the best way to normalize, interpret and augment the entered query, what content to run the query against, and how to sort, organize, and navigate the search results. End-users expect search results and the corresponding document display to be based upon their limited search construction instead of the comprehensive taxonomies upon which information professionals rely when using comprehensive search engines. End-users have grown to expect simplistic queries to produce precise, comprehensive search results, and they unrealistically expect their searches to be as complete as those run by information professionals using complex queries.
Therefore, there is a need in the art for an intelligent comprehensive search and retrieval system and method capable of giving an end-user effortless access while still returning the most relevant, meaningful, up-to-date, and precise search results as quickly and efficiently as possible.
The present invention provides an intelligent search and retrieval system and method capable of giving an end-user access through simplistic queries while still returning the most relevant, meaningful, up-to-date, and precise search results as quickly and efficiently as possible.
In one embodiment of the present invention, an intelligent search and retrieval method comprises the steps of:
In another embodiment of the present invention, an intelligent search and retrieval method comprises the steps of:
Still in one embodiment of the present invention, the taxonomy database of the query profiler comprises a timing identifier for identifying a timing range, wherein the method further comprises receiving the query with a time range and identifying the source of the query term with the time range.
Further in one embodiment of the present invention, the taxonomy database of the query profiler comprises a query term ranking module, wherein the module provides a relevance score corresponding to the number of times the query term appears in documents containing the corresponding code and the number of documents for which the query term and the corresponding code appear together.
Further, in one embodiment of the present invention, an intelligent search and retrieval system comprises:
In another embodiment of the present invention, an intelligent search and retrieval system comprises:
Still in one embodiment of the present invention, the taxonomy database of the query profiler comprises a timing identifier for identifying a timing range, wherein the method further comprises receiving the query with a time range and identifying the source of the query term with the time range.
Further in one embodiment of the present invention, the taxonomy database of the query profiler comprises a query term ranking module, wherein the module provides a relevance score corresponding to the number of times the query term appears in documents containing the corresponding code and the number of documents for which the query term and the corresponding code appear together.
While multiple embodiments are disclosed, still other embodiments of the present invention will become apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the invention. As will be realized, the invention is capable of modifications in various obvious aspects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.
The present invention provides an intelligent search and retrieval system and method capable of providing an end-user access, via simplistic queries, to relevant, meaningful, up-to-date, and precise search results as quickly and efficiently as possible.
Definitions of certain terms used in the detailed descriptions are as follows:
An exemplary system architecture of one embodiment of the intelligent search and retrieval system is explained as follows. The architecture may comprise two subsystems: an IQ Digester and an IQ Profiler. The IQ Digester maps the intersection of words and phrases to the codes and produces a digest of this mapping, along with an associated set of scores. The IQ Digester is a resource-intensive subsystem which may need to scale along several dimensions (e.g. CPU, RAM and storage). The IQ Profiler accesses the digest produced by the IQ Digester and serves as an agent to convert a simple query into a fully-specified query. In one embodiment, the IQ Profiler is a lightweight component which runs at very high speed to convert queries in near-zero time. It relies primarily on RAM and advanced data structures to achieve this speed and is to be delivered as component software.
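By way of a non-limiting illustration, the profiling step could be sketched as below. The in-memory `digest` dictionary, the function name `profile_query`, the additive combination of scores and the `top_n` cutoff are all assumptions made for this sketch; the description does not prescribe them.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory digest: phrase -> [(code, score), ...].
Digest = Dict[str, List[Tuple[str, float]]]


def profile_query(query: str, digest: Digest, top_n: int = 2) -> dict:
    """Expand a simple query into a fuller specification via digest lookups.

    Assumption: scores for the same code are simply summed across terms.
    """
    terms = query.lower().split()
    totals: Dict[str, float] = {}
    for term in terms:
        for code, score in digest.get(term, []):
            totals[code] = totals.get(code, 0.0) + score
    # Sort by descending score, then by code name so the output is deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return {"terms": terms, "codes": [code for code, _ in ranked[:top_n]]}


example_digest: Digest = {
    "merger": [("CORP_ACTIONS", 0.75), ("BANKING", 0.25)],
    "bank": [("BANKING", 0.5)],
}
profiled = profile_query("Bank merger", example_digest)
```

Because the lookup touches only dictionaries held in RAM, a component of this shape stays lightweight, which matches the role described for the IQ Profiler.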
In one embodiment, the IQ Digester performs the following steps:
1. For each categorized document
2. On a scheduled basis, a digest mapping is collected. This mapping may contain the following items:
The IQ Digester uses linguistic analysis to perform “optimistic” phrase extraction. Optimistic phrase extraction is equivalent to very high recall with less emphasis on precision. This process produces a list of word sequences which are likely to be searchable phrases within some configurable confidence score. The rationale behind optimistic phrase identification is to include as many potential phrases as possible in the IQ Digest database. Although this clutters the database with word sequences that are not phrases, the IQMAP's scoring process weeds out any truly unrelated phrases, because their phrase→code scores are statistically insignificant.
A TF-IDF (Term Frequency-Inverse Document Frequency) module provides relevance ranking in full-text databases. A Phrase-Code Frequency-Inverse Phrase-Code Document Frequency (PCF-IPCDF) module in accordance with the present invention selects the codes for improving user searches. The system outputs the codes or restricts the sources of the query, and thereby improves very simply specified searches.
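As a rough, non-limiting sketch of the high-recall idea, the block below enumerates every contiguous word sequence up to a maximum length. Plain n-gram enumeration is substituted here for the linguistic analysis described above, and the function name `candidate_phrases`, the `max_len` parameter and the stop-word filter on the first and last word are assumptions for illustration.

```python
import re
from typing import AbstractSet, List


def candidate_phrases(
    text: str, max_len: int = 3, stop_words: AbstractSet[str] = frozenset()
) -> List[str]:
    """Return every 2..max_len word sequence that does not start or end with a stop word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    phrases: List[str] = []
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip sequences that begin or end with a stop word (assumed filter).
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            phrases.append(" ".join(gram))
    return phrases


found = candidate_phrases("Tax accounting of the firm", stop_words={"of", "the"})
```

Many of the sequences produced this way are not true phrases; in the approach described above, that is acceptable because the later scoring step is what discards them.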
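The exact PCF-IPCDF formula is not stated in this passage. Purely as an assumption, the sketch below mirrors the classic TF-IDF form, multiplying a phrase-code occurrence count by the logarithm of the total document count over the phrase-code document count. The function and parameter names are hypothetical.

```python
import math


def pcf_ipcdf(pcf: int, pcdf: int, total_docs: int) -> float:
    """TF-IDF-style analog (assumed form, not taken from the description).

    pcf:        occurrences of the phrase in documents carrying the code
    pcdf:       number of documents in which the phrase and code appear together
    total_docs: total number of documents counted
    """
    if pcf <= 0 or pcdf <= 0 or total_docs <= 0:
        return 0.0
    return pcf * math.log(total_docs / pcdf)


score = pcf_ipcdf(pcf=3, pcdf=10, total_docs=1000)
```

Under this assumed form, a phrase-code pair that co-occurs in every document scores zero, just as a term present in every document does under TF-IDF.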
Definitions of certain terms are as follows:
Phrase extraction via linguistic analysis may be required both at the time of document insertion and at the time of query processing. Phrase extraction in both locations produces deterministic, identical outputs for a given input. Text normalization, referred to as “tokenization,” is provided. This enables relational databases, which are generally unsophisticated and inefficient in text processing, to be both fast and deterministic.
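A minimal sketch of such a normalization step follows, assuming Unicode NFKC normalization, case folding and the replacement of punctuation runs by a single space. These specific rules and the function name `tokenize` are illustrative choices, not rules taken from the description.

```python
import re
import unicodedata


def tokenize(text: str) -> str:
    """Normalize text so that equal inputs always yield equal lookup keys."""
    # Fold compatibility characters and case (assumed normalization rules).
    text = unicodedata.normalize("NFKC", text).casefold()
    # Replace each run of non-word characters with one space.
    text = re.sub(r"[^\w]+", " ", text)
    return text.strip()


inserted_key = tokenize("  Tax-Accounting,\tRULES! ")
query_key = tokenize("tax accounting rules")
```

Because the same function runs at insertion and at query time, the database can match keys with a plain equality comparison.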
The following tables map phrases to codes, while recording phrase-code occurrence frequencies, phrase-code document frequencies and total document count.
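As a non-limiting sketch, the quantities such tables record could be accumulated in memory as follows. The class name `PhraseCodeCounts` and its attribute names are hypothetical, and a relational schema would hold the same three quantities.

```python
from collections import Counter
from typing import Iterable, List


class PhraseCodeCounts:
    """Accumulates phrase-code occurrence and document frequencies."""

    def __init__(self) -> None:
        self.occurrences: Counter = Counter()  # (phrase, code) -> total occurrences
        self.doc_freq: Counter = Counter()     # (phrase, code) -> number of documents
        self.total_docs = 0

    def add_document(self, phrases: List[str], codes: Iterable[str]) -> None:
        self.total_docs += 1
        per_phrase = Counter(phrases)
        for code in set(codes):
            for phrase, count in per_phrase.items():
                self.occurrences[(phrase, code)] += count
                self.doc_freq[(phrase, code)] += 1


counts = PhraseCodeCounts()
counts.add_document(["tax accounting", "tax accounting", "audit"], ["ACCT"])
counts.add_document(["audit"], ["ACCT", "LAW"])
```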
The IQ Database's purpose is to tie words and phrases to the most closely related metadata, so as to focus queries on areas which contain the most relevant information. To be efficient in processing documents, the IQ Database inserter may require a per-language list of stop words and stop codes. The stop word list is likely a significantly expanded superset of the typical search engine stop word list, as it eliminates many words which do not capture significant “aboutness” or information context. As opposed to traditional stop word lists, which often contain keywords of significance to the search engine (e.g. “and”, “or”), this stop word list is populated according to the frequency and diffusion of the words: words appearing most frequently and in most documents (e.g. “the”) are statistically meaningless. Use of the language-specific stop lists on the database insertion side may obviate the need to remove stop words on the query side, since such words have zero scores on lookup in the IQ Database. There are also regions of an Intelligent Indexing map which are so broad as to be meaningless, such as codes with a parent or grandparent of ROOT. For processing and query efficiency, these codes must be identified and discarded.
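The two filters just described could be sketched as below. The document-fraction threshold `max_doc_fraction`, the `parent_of` mapping and both function names are assumptions for illustration, and only diffusion across documents, not raw frequency, is measured in this simplified version.

```python
from collections import Counter
from typing import Dict, List, Set


def derive_stop_words(docs: List[List[str]], max_doc_fraction: float = 0.8) -> Set[str]:
    """Treat words found in more than max_doc_fraction of documents as stop words."""
    doc_freq: Counter = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    limit = max_doc_fraction * len(docs)
    return {word for word, df in doc_freq.items() if df > limit}


def derive_stop_codes(parent_of: Dict[str, List[str]], root: str = "ROOT") -> Set[str]:
    """Return codes whose parent or grandparent is the root element."""
    stop: Set[str] = set()
    for code, parents in parent_of.items():
        for parent in parents:
            if parent == root or root in parent_of.get(parent, []):
                stop.add(code)
    return stop


stop_words = derive_stop_words([["the", "tax"], ["the", "audit"], ["the", "tax", "law"]])
stop_codes = derive_stop_codes(
    {"BUSINESS": ["ROOT"], "ACCOUNTING": ["BUSINESS"], "TAX_ACCOUNTING": ["ACCOUNTING"]}
)
```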
Once stop words and stop codes have been eliminated, a calculation is needed to isolate the “deepest” code from each branch contained within a document. Though the indexing is defined as a “polyarchy” (meaning that one taxonomic element, a.k.a. code, can have more than one parent), it can be transformed into a tree via element cloning. That is, an element with multiple parents is cloned so that a separate single-parent copy sits beneath each of its parents. By then noting each element's ultimate parent(s) and its depth beneath that parent, the deepest code for each root element of the tree can be isolated. In cases where a code has multiple ultimate parents, both ultimate parents may need to be identified and returned in the IQMAP. This maximizes concentration of data points around single, specific taxonomic elements, and prevents diffusion, which is likely to weaken query results.
Also, the choice of which codes to use and which to discard depends on the originator of the code, that is, on the method by which the code was applied. There are numerous methods of applying codes. Some reflect a document's contextual content (natural language processing and rules-based systems), while others merely map (taxonomy-based expansion and codes provided by a document's creator). Codes added by mapping create multicollinearity in the dataset, and weaken overall results by dilution.
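One way to sketch the deepest-code calculation is shown below. Rather than physically cloning elements, it walks every parent path, which yields the same (ultimate parent, depth) pairs that cloning would. The `parent_of` mapping, the function names and the alphabetical tie-break are assumptions for this sketch, and the taxonomy is assumed to contain no cycles.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ancestry(code: str, parent_of: Dict[str, List[str]]) -> Set[Tuple[str, int]]:
    """Return (ultimate_parent, depth) for every path from code up to a root."""
    parents = parent_of.get(code, [])
    if not parents:
        return {(code, 0)}  # a code without parents is its own ultimate parent
    pairs: Set[Tuple[str, int]] = set()
    for parent in parents:
        for root, depth in ancestry(parent, parent_of):
            pairs.add((root, depth + 1))
    return pairs


def deepest_codes(doc_codes: Iterable[str], parent_of: Dict[str, List[str]]) -> Dict[str, str]:
    """For each ultimate parent, keep the document code lying deepest beneath it."""
    best: Dict[str, Tuple[int, str]] = {}
    for code in sorted(doc_codes):  # sorted so ties resolve deterministically
        for root, depth in ancestry(code, parent_of):
            if root not in best or depth > best[root][0]:
                best[root] = (depth, code)
    return {root: code for root, (_, code) in best.items()}


taxonomy = {
    "ACCOUNTING": ["BUSINESS"],
    "TAXATION": ["GOVERNMENT"],
    "TAX_ACCOUNTING": ["ACCOUNTING", "TAXATION"],
}
deepest = deepest_codes({"ACCOUNTING", "TAXATION", "TAX_ACCOUNTING"}, taxonomy)
```

In this example the hypothetical code TAX_ACCOUNTING has two ultimate parents, so it is reported under both, and the shallower codes on each branch are dropped.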
By keeping the IQ database content to a strictly limited time window and deleting data points as they fall outside the time window, the database actually tracks temporal changes in contextual meaning.
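A minimal sketch of such a sliding window is given below, assuming data points arrive in chronological order and carry a timestamp. The class name `WindowedCounts`, its methods and the 30-day window used in the example are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Tuple


class WindowedCounts:
    """Keeps phrase-code counts only for data points inside a time window."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self.events: Deque[Tuple[datetime, str, str]] = deque()  # oldest first
        self.counts: Dict[Tuple[str, str], int] = {}

    def add(self, when: datetime, phrase: str, code: str) -> None:
        # Assumption: calls are made in non-decreasing timestamp order.
        self.events.append((when, phrase, code))
        key = (phrase, code)
        self.counts[key] = self.counts.get(key, 0) + 1

    def prune(self, now: datetime) -> None:
        """Delete data points that have fallen outside the window."""
        cutoff = now - self.window
        while self.events and self.events[0][0] < cutoff:
            _, phrase, code = self.events.popleft()
            key = (phrase, code)
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]


windowed = WindowedCounts(timedelta(days=30))
windowed.add(datetime(2004, 1, 1), "sars", "HEALTH")
windowed.add(datetime(2004, 2, 15), "sars", "HEALTH")
windowed.prune(datetime(2004, 2, 20))
```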
Since related elements have an explicitly defined contextual relationship (e.g. Tax accounting is a child of Accounting, therefore they are contextually related), integer code identifiers may be assigned to codes in such a way that a clear and unambiguous spatial representation of word-code relationships can be visualized. By assigning code identifiers carefully (that is, putting sufficient empty space between unrelated code identifiers), clear visual maps can be created. For example:
By condensing identifiers for semantically-related codes and diffusing identifiers for unrelated codes, the clustering of certain words around certain concepts can be visualized using a three-dimensional graph of (p, c, s), where p is the phrase identifier, c is the code identifier and s is the modified TFIDF score.
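The identifier assignment could be sketched as follows, assuming a depth-first walk that numbers related codes consecutively and then jumps to the next multiple of a fixed gap before starting an unrelated branch. The function name `assign_ids`, the `children_of` mapping and the gap size are assumptions for illustration.

```python
from typing import Dict, List


def assign_ids(children_of: Dict[str, List[str]], roots: List[str], gap: int = 1000) -> Dict[str, int]:
    """Give related codes adjacent integers and leave empty space between branches."""
    ids: Dict[str, int] = {}
    next_id = 0

    def visit(code: str) -> None:
        nonlocal next_id
        if code in ids:  # a code with several parents keeps its first identifier
            return
        ids[code] = next_id
        next_id += 1
        for child in children_of.get(code, []):
            visit(child)

    for root in roots:
        visit(root)
        # Jump to the next multiple of gap so unrelated branches sit far apart.
        next_id = (next_id // gap + 1) * gap
    return ids


code_ids = assign_ids(
    {"BUSINESS": ["ACCOUNTING"], "ACCOUNTING": ["TAX_ACCOUNTING"], "SPORTS": ["FOOTBALL"]},
    roots=["BUSINESS", "SPORTS"],
)
```

With identifiers assigned this way, plotting (p, c, s) triples places scores for related codes in neighboring positions along the c axis, so clusters become visible.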
This calculation encodes the following principles:
One of the advantages of the present invention is that it gives end-users effortless access while still returning the most relevant, meaningful, up-to-date, and precise search results, as quickly and efficiently as possible.
Another advantage of the present invention is that an end-user is able to benefit from an experienced recommendation that is tailored to a specific industry, region, and job function, etc., relevant to the search.
Yet another advantage of the present invention is that it provides a streamlined end-user search screen interface that allows an end user to access resources easily and retrieve results from a deep archive that includes sources with a historical, global, and local perspective.
Further advantages of the present invention include simplicity, which reduces training time; easy accessibility, which increases activity; and increased relevance, which accelerates decision making.
These and other features and advantages of the present invention will become apparent to those skilled in the art from the attached detailed descriptions, wherein illustrative embodiments of the present invention are shown and described, including best modes contemplated for carrying out the invention. As will be realized, the invention is capable of modifications in various obvious aspects, all without departing from the spirit and scope of the present invention. Accordingly, the above detailed descriptions are to be regarded as illustrative in nature and not restrictive.
This application claims the benefit of U.S. Provisional application No. 60/546,658, entitled “Intelligent Search and Retrieval System And Method”, filed on Feb. 20, 2004, the subject matter of which is hereby incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
20050187923 A1 | Aug 2005 | US |
Number | Date | Country | |
---|---|---|---|
60546658 | Feb 2004 | US |