The invention relates in general to data searching and, specifically, to a system and method for formulating data search queries.
An increasingly substantial body of printed material in electronic form has evolved in large part due to the widespread adoption of the Internet and personal computing. These materials include both traditional “formal” forms of writings and publications distributed through publishers, businesses, governmental agencies, and educational institutions, such as books, manuscripts, and other published materials, and non-traditional “informal” works, such as email, personal correspondence, notes, instant messaging, and other textual and non-textual content stored in electronic form. Additionally, other materials stored in electronic form include non-traditionally authored binary and non-character-based data, such as object and various forms of program code generated by computer program compilers.
Efficient search strategies have long existed for databases, spreadsheets, object libraries, and similar structured and ordered data. In contrast, authored, non-machine originated documents, such as textual content, are unstructured collections of words that lack a regular ordering amenable to search. As a result, conventional searching tools for such content borrow from ordered data search techniques and rely on algebraic formulations using Boolean logic or query languages, such as SQL. Individual terms are combined into search queries using Boolean logic operators, such as AND for conjunction, OR for disjunction, and NOT for negation, and the search scope is specified through set complementation and union operations on the target corpus and interim search results. Matching documents, or “hits,” are presented for review or further searching.
For most users, searching using Boolean logic or query languages is non-intuitive and may provide incorrect or undesired search results. Natural language search tools attempt to insulate users from working directly with Boolean logic or query languages by providing a user-friendly front-end through which search queries can be specified as simple English language sentences or phrases. Often, a query is entered as a question or phrase, which is parsed and processed by a front-end processor. An underlying search engine then attempts to identify target documents implied by the literal and linguistic structure of the search query.
Boolean logic, query languages, and natural language search tools, though, require users to formulate and enter an express search criteria, either as a Boolean or query language expression, or as a natural language sentence or phrase. Users must concentrate on how the phrasing of the search criteria might affect the search and are forced to reevaluate the criteria when the search results are non-responsive. Searching through documents, however, does not always translate easily into readily-expressible criteria, and re-searching can be time-consuming and counter-productive. Thus, a less structured form of searching that can accommodate unstructured, preferably expressionless, search criteria is sometimes needed. For example, a user might have a general idea that a set of documents likely contains phraseology that “sort of” matches, but does not exactly match, a particular data excerpt. Conventional search tools require the user to first evaluate the data excerpt to identify potentially matching search terms and conditions, yet determining the proper terms and conditions to include or exclude in the criteria might require multiple attempts until desired results are obtained. For instance, specifying the proximity, or nearness, of matching terms within each document can relax or constrain the search scope, but knowing how far to span search term proximity generally assumes a priori knowledge of the structure of the target documents, such as word ordering and frequency.
Therefore, there is a need for an approach to facilitating searching of textual and non-textual data through a user interface that accepts unstructured data and user-adjustable search criteria parameters to specify, for example, variable term inclusion and proximity of matching search terms.
A system and method includes a user interface that allows a user to specify an unstructured search criteria for documents by providing a data excerpt, including textual or binary data, and choosing parameters indicating search term inclusion and proximity of matching terms. The documents contain data, which can be character-based or pure binary stored data, and are indexed for use in searching and other data processing activities. The user interface formulates a search query for the user and does not require the search criteria to be explicitly defined by the user. Instead, the user provides a data excerpt and adjusts inclusion and proximity controls. The data excerpt is parsed and processed to extract search terms, which become tokens in the search query. The adjustments to the inclusion control define the minimum number of search terms that must appear in each document being searched, which is always at least one matching term. The adjustments to the proximity control define the span within which at least two matching search terms must appear. For instance, two matching search terms occurring next to each other have a span equal to zero.
One embodiment provides a system and method for formulating data search queries. A user interface operable to specify an unstructured search criteria for a search query on one or more documents is provided. An input portal is exported to receive a data excerpt selected to be searched against the documents. A selectable inclusiveness control is exported to specify a granularity of inclusion of matching tokens within each document. A selectable proximity control is exported to specify a degree of nearness of the tokens within each document. Tokens derived from the data excerpt and parameters corresponding to the granularity of inclusion and the degree of nearness are compiled into the search query.
A further embodiment provides a system and method for performing a data search. A data excerpt selected to be searched against one or more documents stored in electronic form is processed into search terms. A search criteria containing the search terms and parameters indicating at least one of search term inclusion and proximity of matching search terms in the documents is built. Search results generated by execution of the search criteria on the documents are presented.
Still other embodiments will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
System
Documents stored in electronic form can be intuitively searched through a user-friendly interface that accepts unstructured data search criteria.
By way of illustration, the system 10 operates in a distributed computing environment, which includes a plurality of heterogeneous systems and document sources. A backend server 11 executes a workbench suite 31 for providing a user interface framework for automated document management and processing, which includes a document searcher 35 for searching documents 14 through an intuitive user interface, as further described below.
The document mapper 32 operates on documents retrieved from a plurality of local or remote sources. The local sources include documents 17, 20 maintained in storage devices 16, 19 respectively coupled to a local server 15 or local client 18. The local server 15 and local client 18 are interconnected to the production system 11 over an intranetwork 21. In addition, the document mapper 32 can identify and retrieve documents from remote sources via a gateway 23 or similar portal to an internetwork 22, including the Internet. The remote sources include documents 26, 29 maintained in storage devices 25, 28 respectively coupled to a remote server 24 and a remote client 27. In one embodiment, the documents 17, 20, 26, 29 include email stored in electronic message folders, such as maintained by the Outlook and Outlook Express products, licensed by Microsoft Corporation, Redmond, Wash. In a further embodiment, the document searcher 35 provides an interface to an external query engine 36 that executes search queries on either the local database 30 or a remote database 37 and provides back search results. The databases 30, 37 can be SQL-based relational databases, such as the Oracle database management system, Release 8, licensed by Oracle Corporation, Redwood Shores, Calif., or other types of structured databases. Other system environments, network configurations and topologies, and sources of documents and electronically-stored data are possible.
The individual computer systems, including backend server 11, production server 32, server 15, client 18, remote server 24, remote client 27, and remote query engine 36 are general purpose, programmed digital computing devices consisting of a central processing unit (CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive or CD ROM drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. Program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, storage, or processing.
Searching
Email is one popular form of communication that results in unstructured informal writings, and individual email messages can be treated as documents. Other forms and manner of documents are possible.
The substantive portions of each email 41-46, in particular, the message body with header and extraneous data removed, represent a collection of searchable data. For ease of discussion, pertinent words are underlined. For instance, emails 41, 42, 44, 45, and 46 all contain either “mice” or “mouse,” the root word stem of which is simply “mouse.” Similarly, emails 42 and 43 both contain “cat;” emails 41, 43 and 46 contain “man” or “men,” the root word stem of which is “man;” and email 43 contains “dog.” These words are indexed. By extension, searchable data occurring in all forms and manner of materials stored in electronic form can be identified and indexed to facilitate searching.
In a further embodiment, weights can be assigned to searchable data based on structural location within each document. For example, those words occurring in titles, headings, tables of contents, or indexes can have higher weights assigned, which cause a search to favor those terms over other terms having lower weights, either assigned or by default.
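The location-based weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the location names, the weight values in LOCATION_WEIGHTS, and the function weighted_score() are hypothetical and do not come from the described embodiments, which do not specify how weights are assigned or combined.

```python
# Hypothetical weights per structural location; the values are assumptions.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_score(doc_fields, terms):
    """Sum term matches, scaling each match by the weight of its location.

    doc_fields maps a location name to the list of words found there.
    Locations without an assigned weight fall back to a default of 1.0.
    """
    score = 0.0
    for location, words in doc_fields.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        score += weight * sum(1 for word in words if word in terms)
    return score


doc = {
    "title": ["mouse", "trap"],
    "body": ["the", "man", "saw", "a", "mouse"],
}
print(weighted_score(doc, {"man", "mouse"}))  # 3.0 (title) + 2.0 (body) = 5.0
```

A match in the title contributes three times as much as a match in the body under these assumed weights, so documents using the search terms in prominent locations would be favored.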
User Interface
Rather than constructing complex search criteria, users need only provide an excerpt of data and adjust selection controls to perform searching.
Searching is facilitated through operations performed on the user interface 50.
Conceptually, search criteria specification and search query execution are two logically separate but operationally contiguous actions, that is, once a search criteria is specified, search query execution will follow. The search criteria is specified when the data excerpt 51 is entered (operation 61), when the “Contains” control is adjusted (operation 62), or when the “Proximity” control is adjusted (operation 63). Logically, these operations occur on the “half-click,” that is, upon the initial toggle of an input key, such as a mouse or keyboard button. The search query is executed (operation 64) upon the next “half-click,” that is, upon the release of the input key. In one embodiment, this pair of half-click operations is atomic, and actual search criteria processing and query execution can both occur following input key release, although the two operations could also be performed serially following detection of each separate half-click, where supported by the input key device drivers.
The data excerpt 51 is entered through a data entry area 54 (operation 61), such as by cut-and-paste or drag-and-drop commands, or through manual entry. In addition, the data excerpt 51 can include a Uniform Resource Locator (URL), files, directories, folders, entire document, socket, data pipe, or other data stream or source. The data excerpt 51 is preprocessed into tokens for the search query, as further described below.
The user can also set search criteria parameters through selectable user-adjustable controls. The granularity by which search terms must be included within each document can be specified by adjusting the “Contains” control 52 (operation 62), as further described below.
In one embodiment, the “Contains” control 52 and “Proximity” control 53 are separate user-adjustable slider bar controls, but could be a single selectable control. When set at either extreme of the range of control permitted with the “Contains” control 52 and “Proximity” control 53, respective granularity of inclusion and degree of nearness are maximally relaxed or constrained. Other types of controls for the “Contains” control 52 and “Proximity” control 53 are possible, including separate or combined rotary or gimbal knobs, slider bars, radio buttons, and other user input mechanisms that allow continuous or discrete selection over a fixed range of rotation, movement, or selection.
In a further embodiment, the user interface 50 can be supplemented with controls to specify additional search criteria. For example, a selection control can be provided to enable a user to specify one or more required or optional search terms in the data excerpt 51, which respectively qualifies the search to always and permissibly include the terms selected. Also, the user interface 50 can include an ordering control that allows a user to specify a precedence applicable to the search terms, which causes the search to favor those search terms having higher precedence over other terms. As well, the user interface 50 can include a search scope control that enables a user to specify those documents within the corpus to be searched, which limits the field of search to the documents specified. Other forms of user interface controls and options are possible.
The search query that is used to conduct the search of the corpus of target documents is compiled following search criteria specification (operations 61, 62, 63). In one embodiment, the search query is a combination of tokens and Boolean AND, OR, set, and similar operations, which specify the search logic for inclusiveness, and natural language sentences or phrases, which specify the search logic for proximity. In a further embodiment, the search query is a combination of an unstructured search criteria entered through the user interface 50, plus an encapsulated search query, which can also be entered through the user interface 50 via the data entry area 54. The encapsulated search query is concatenated or incorporated into the compiled search query.
The search query is automatically executed following search criteria specification or when the user toggles a search button 55 (operation 64). The search query is executed against target documents stored in a data corpus. Each document in the data corpus is indexed to facilitate searching. One form of suitable indexing based on feature extraction and scoring is described in commonly-assigned U.S. patent application, Ser. No. 10/317,438, filed on Dec. 11, 2002, pending, the disclosure of which is incorporated by reference. Other types of indexing are possible.
Those documents matching the search criteria are presented as search results 56 (operation 65). The search results 56 identify the emails 41, 46 scoring equally in terms of the inclusion of the terms “man” and “mouse.” These terms are also equally proximate with both terms occurring within one word of the other. The remaining emails 42, 44, 45 in the search results are lower scoring than the emails 41 and 46, but are equally likely between themselves. Proximity is inapplicable to these single term matches. The user can review the search results and perform further searching operations, including entering a data excerpt 51 (operation 61), adjusting the “Contains” control 52 (operation 62), adjusting the “Proximity” control 53 (operation 63), or executing a search (operation 64). In a further embodiment, the search results can be processed to facilitate review, including sorting, filtering, and organizing.
Method
From a user perspective, searching requires providing a data excerpt 51 and adjusting the “Contains” and “Proximity” controls 52, 53 through the user interface 50. However, the raw user-specified search criteria must still be evaluated and executed as a search query to generate search results. Search criteria evaluation and execution can be performed as operations either as part of or independent from the user interface 50.
During each iteration, that is, search (block 81), the user interface 50 is first provided (block 82) and the data excerpt 51 and inputs to the “Contains” and “Proximity” controls 52, 53 are accepted (block 83). The search criteria is specified when the data excerpt 51 is entered, when the “Contains” control is adjusted, or when the “Proximity” control is adjusted. Logically, these operations occur on the “half-click,” that is, upon the initial toggle of an input key, such as a mouse or keyboard button. The search is initiated (block 84) upon the next “half-click,” that is, upon the release of the input key, after which the search criteria is preprocessed to form tokens (block 85), as further described below.
Preprocessing a Search
Preprocessing a search primarily converts the data excerpt 51 into an equivalent tokenized representation for use in a search query.
Searching by Nearness
The proximity control 53 selectively specifies a degree of nearness between matching search terms found in each document.
A span size and a number of search terms to combine within the span are respectively determined from the “Proximity” control 53 input (blocks 111 and 112). Both the span s to be applied and the number of search terms to combine c during searching of each document are determined in accordance with equations (1) and (2):
where N is a number of the tokens and 0.0<p<1.0 is a value representing the degree of nearness specified through the selectable “Proximity” control 53. The function MaxInt( ) ensures that a value not less than two for the matching search terms is specified. The search query is then executed on the target corpus conditioned on the span size and search terms number (block 113).
In one embodiment, the search terms are combined in the same ordering as provided in the data excerpt 51, which implicitly limits the universe of possible combinations of search terms. However, in a further embodiment, the ordering of the search terms in the data excerpt 51 is immaterial and a wider range of search term combinations can be considered.
Searching by Inclusion
The inclusiveness control selectively specifies a granularity of inclusion of search terms within each document.
The number of search terms is determined from the “Contains” control 52 input (block 121). The number of search terms h that must be matched by one or more terms or concepts in each target document is determined in accordance with equation (3):
h=int(N*p+1) (3)
where N is a total number of the tokens and 0.0≦p<1.0 is a value representing the granularity of inclusiveness specified through the “Contains” control. The search query is then executed on the target corpus conditioned on the minimum number of hits (block 122).
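As a worked illustration of equation (3), the following sketch computes the minimum number of hits h for a three-token excerpt at several control settings. The function name min_hits() and the sample values of p are chosen here for illustration only.

```python
def min_hits(num_tokens: int, p: float) -> int:
    """Equation (3): h = int(N * p + 1), where 0.0 <= p < 1.0."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0.0 <= p < 1.0")
    return int(num_tokens * p + 1)


# With N = 3 tokens, h ranges from 1 (most relaxed) to 3 (most constrained).
for p in (0.0, 0.5, 0.9):
    print(p, min_hits(3, p))
```

Because p is strictly less than 1.0, h never exceeds N, and because of the added 1, h is never less than one, consistent with the requirement that at least one search term always match.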
System Modules
In one embodiment, searching is performed by the document searcher.
The document searcher 131 includes a storage device 136 and a preprocessor 132, nearness searcher 133, and inclusiveness searcher 134. In addition, the document searcher 131 includes a query engine 135, or provides an interface to an external query engine 36.
The preprocessor 132 evaluates each data excerpt 139 as provided as an input 143 from a user interface 142 to build an initial search query 142. Based on the “Contains” control 52 inputs 144, the inclusiveness searcher 134 determines the minimum number of hits on search terms necessary for a target document in the data corpus 137 to match, which are saved as inclusiveness parameters 140. Similarly, based on the “Proximity” control 53 inputs 144, the nearness searcher 133 determines both the search span size and the number of search terms to combine in each span, which are saved as nearness parameters 140. The query engine 135 executes the search query 142 against the data corpus 137 and provides search results as outputs 146 that are presented through the user interface 143. Other forms of document searcher functionality are possible.
While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope of the invention.
In one embodiment, inclusiveness and nearness, or proximity, searching are implemented using functionality provided by Lucene, a Java-based, open source toolkit for text indexing and searching, which is available over the Internet at http://lucene.apache.org. Other information retrieval libraries provide sufficiently similar functionality.
Inclusiveness and nearness searching can be respectively defined as functions CONTAINS( ) and SPAN( ), providing functionality as follows:
Assume that the data excerpt is textual data consisting of “cats and dogs at play.” The search tokens extracted from the data excerpt would be: cat, dog, and play. The plural forms are made singular and the words “and” and “at” are removed as stop words.
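The token extraction in this example might be sketched as follows. The stop word list and the crude plural-stripping rule are assumptions made for illustration; the embodiments do not specify a particular stop word list or stemming algorithm, and a real stemmer would also be needed to handle irregular forms, such as reducing “mice” to “mouse.”

```python
import re

# Assumed stop word list; the actual list is not specified.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "or", "the", "to"}


def naive_singular(word):
    """Crude plural stripping: drop a trailing "s" (but not "ss")."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(excerpt):
    """Lowercase, split into words, drop stop words, singularize, dedupe."""
    tokens = []
    for word in re.findall(r"[a-z]+", excerpt.lower()):
        if word in STOP_WORDS:
            continue
        stem = naive_singular(word)
        if stem not in tokens:  # preserve first-seen ordering
            tokens.append(stem)
    return tokens


print(tokenize("cats and dogs at play."))  # ['cat', 'dog', 'play']
```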
CONTAINS( ) Searching
If the count input parameter is provided with a value of ‘2’ using the “Contains” control, an inclusiveness search query is compiled with the following form:
CONTAINS( [“cat”, “dog”, “play”], 2)
Thus, any documents that contain any combination of two or more of the search terms “cat,” “dog,” and “play” would be returned. The equivalent Boolean expression is:
(cat AND dog) OR (cat AND play) OR (dog AND play)
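The relationship between CONTAINS( ) and its Boolean equivalent can be sketched as below. The Python functions contains() and contains_as_boolean() are hypothetical stand-ins written for illustration and are not the Lucene-based implementation; the document is modeled simply as a set of already-stemmed words.

```python
from itertools import combinations


def contains(doc_tokens, terms, count):
    """True if at least `count` of the distinct `terms` occur in the document."""
    matched = sum(1 for term in terms if term in doc_tokens)
    return matched >= count


def contains_as_boolean(terms, count):
    """Expand CONTAINS(terms, count) into an OR of AND-groups of size `count`."""
    groups = [" AND ".join(combo) for combo in combinations(terms, count)]
    return " OR ".join(f"({group})" for group in groups)


terms = ["cat", "dog", "play"]
print(contains_as_boolean(terms, 2))
# (cat AND dog) OR (cat AND play) OR (dog AND play)
print(contains({"cat", "play", "ball"}, terms, 2))  # True
print(contains({"cat", "ball"}, terms, 2))          # False
```

Counting matches directly avoids enumerating every combination, which matters because the number of AND-groups in the Boolean form grows combinatorially with the number of tokens.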
SPAN( ) Searching
The input provided using the “Proximity” control modifies two values, which are the size of the span, s, and the number of terms to combine, c, respectively determined per equations (1) and (2), described above. Using a parameter value of p=0.25 yields c=2, as at least two terms are required, and s=15. A nearness search query is then compiled with the following form, using the SPAN( ) function in conjunction with Boolean operators:
SPAN([“cat”, “dog”], 15) OR SPAN([“cat”, “play”], 15) OR SPAN([“dog”, “play”], 15)
Thus, any documents that contain any combination of two or more of the search terms “cat,” “dog,” and “play” occurring within 15 terms of each other would be returned.
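A minimal sketch of this nearness query follows. It assumes that the span is measured as the number of other words lying inside the smallest window that holds one occurrence of each term, so that two adjacent terms have a span of zero; this reading is an interpretation, and the functions span() and span_query() are hypothetical illustrations rather than the Lucene-based implementation. The values of c and s are passed in directly and are not computed from equations (1) and (2).

```python
from itertools import combinations, product


def span(doc_words, terms, max_span):
    """True if every term occurs and some set of occurrences fits in max_span.

    The span of a set of occurrences is the window length minus the number
    of terms, i.e. the count of other words inside the window (assumption).
    """
    positions = [[i for i, w in enumerate(doc_words) if w == t] for t in terms]
    if any(not plist for plist in positions):
        return False  # at least one term is missing from the document
    for combo in product(*positions):
        extra_words = (max(combo) - min(combo) + 1) - len(terms)
        if extra_words <= max_span:
            return True
    return False


def span_query(doc_words, tokens, c, s):
    """OR together SPAN(combo, s) over every combination of c tokens."""
    return any(span(doc_words, combo, s) for combo in combinations(tokens, c))


doc = "the cat sat near the dog".split()
print(span(doc, ["cat", "dog"], 15))                   # True, 3 words between
print(span(doc, ["cat", "dog"], 2))                    # False
print(span_query(doc, ["cat", "dog", "play"], 2, 15))  # True
```

The brute-force search over all occurrence combinations is adequate for short documents; a positional index would be the more practical choice for a large corpus.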