Search engines may permit different types of searches. A full-text search permits search terms to be located in one or more available documents, regardless of where the search terms may be located in the documents. A search of queryable fields permits a user to specify one or more fields of a document that may contain the search terms.
Search engines typically make use of data structures known as search indexes to improve the efficiency and speed of searches. However, a full-text search typically requires a different search index than a queryable search. Requiring multiple search indexes increases the memory storage requirements for a search engine and increases the overhead of searches.
Embodiments of the disclosure are directed to a method implemented on a computing device for creating a search index. A plurality of words found in one or more documents is identified. For each word of the plurality of words, one or more fields of the one or more documents in which the word can be found are identified. Using the computing device, a search index is created for each word of the plurality of words. The search index for each word of the plurality of words provides a mapping between the word and each occurrence of the word in each field of the one or more documents in which the word is found.
This Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in any way to limit the scope of the claimed subject matter.
The present application is directed to systems and methods for using a single search index to implement both full text search and search of queryable fields. Full text searching refers to searching for words within a preconfigured set of fields of documents. Search of queryable fields refers to searching for words within specific fields of documents. The systems and methods provide for organizing the single search index by words and by fields within words. By organizing the single search index in this manner, searches can be performed quickly and efficiently without unnecessary duplication of system resources.
The example server computer 106 includes a search processing module 108. Server computer 106 may be part of a server farm of multiple server computers. An example of a server computer that may be part of a server farm is the Microsoft SharePoint® Server 2010 collaboration server from Microsoft Corporation of Redmond, Wash.
The example database 110 stores one or more documents that may be accessed via client computers 102, 104. The database 110 may be part of one or more server computers, for example server computer 106. In other embodiments the one or more server computers may store the one or more documents in lieu of database 110.
Client computers 102, 104 may access server computer 106 over a corporate Intranet or over the Internet. In examples, client computers 102, 104 may be part of a shared document management system such as the Microsoft SharePoint® document management system. In the shared document management system, one or more documents stored on server computer 106 or database 110 may be accessible by a user on client computer 102 or client computer 104.
When a user on client computer 102 needs to perform a search on the document management system, the user typically initiates an application including a user interface on client computer 102 and enters a search term in a query field of the user interface. The search term may be a word or a phrase that may be included in the one or more documents stored in the document management system. In examples, the user may request a full text search or the user may specify one or more fields in a document in which the search term may be located. For a full text search, the search term may be located anywhere in the document.
Documents may be structured to include identifiable parts or sections known as fields. Example fields are titles, paragraph headings, sections of a document such as Abstract, Claims, Detailed Description, the full body of a document, etc. Other example fields are possible and other examples of sections are possible.
The example search processing module 108 receives search queries from client computers 102, 104 and performs a search of the document management system for documents containing the search queries. As shown in
During a search for a word or group of words, the word or group of words may be found in a plurality of documents. In examples, the ranking module 204 may rank search results in relation to a number of occurrences of the word or group of words in a document: the more hits per document, the higher the rank. Similarly, when searching for a word or group of words in a particular field, the ranking module 204 may rank search results in relation to a number of occurrences of the word or group of words in the field in a document: the more occurrences of the word in the field of a document, the higher the rank. Other ways in which the ranking module 204 may rank documents include determining how close search terms are to each other in a document: the closer the search terms in the document, the higher the rank.
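The two ranking criteria described above can be illustrated with a minimal sketch. It is not the ranking module 204 itself. The function names are hypothetical, and positions are assumed to be integer word offsets, which the disclosure does not require.

```python
from itertools import product


def rank_by_frequency(hits):
    """Order document IDs by hit count, highest first.

    `hits` maps a document ID to the number of occurrences of the search
    term (in the whole document or in one field). Ties are broken by
    document ID so the ordering is deterministic.
    """
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of term A and any of term B."""
    return min(abs(a - b) for a, b in product(positions_a, positions_b))


def rank_by_proximity(term_positions):
    """Order document IDs so that documents with closer terms come first.

    `term_positions` maps a document ID to a pair of position lists,
    one list per search term.
    """
    return sorted(
        term_positions,
        key=lambda doc_id: (min_distance(*term_positions[doc_id]), doc_id),
    )


by_frequency = rank_by_frequency({"doc1": 2, "doc2": 5, "doc3": 1})
by_proximity = rank_by_proximity(
    {"doc1": ([1, 40], [10]), "doc2": ([3], [4, 90])}
)
```

In this sketch "doc2" ranks first under both criteria: it has the most hits, and its two terms are adjacent.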
The word dictionary 302 includes index information for each word of the one or more words stored in the word dictionary 302. The index information provides mappings between each word of the one or more words and each field in which the word may occur for each document stored in the document management system. The mappings indicate the position in each document of each word in each field. In examples, instead of indicating the position in each document of each word in each field, the mappings may indicate only the frequency of occurrence for each word in each field of each document for which there is an occurrence of the word in the field of the document. In other examples, both types of mappings may be provided. One group of mappings may be provided to indicate the position of each word in each field and another group of mappings may be provided to indicate the frequency of occurrence for each word in each field of each document for which there is an occurrence of the word in the field of the document.
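As a rough illustration of the two kinds of mappings, the sketch below models the index information as nested dictionaries keyed by word, then field, then document ID. The nesting, the sample data and the helper name are assumptions made for illustration. The storage layout the disclosure actually describes uses pointers into separate storage areas.

```python
# Position mappings: word -> field -> document ID -> positions of the word.
# Positions are assumed here to be integer word offsets.
position_index = {
    "index": {
        "title": {1: [0], 2: [3, 7]},
        "body": {1: [12, 48, 90]},
    },
}


def frequencies_from_positions(positions):
    """Derive frequency mappings (word -> field -> document ID -> count)."""
    return {
        word: {
            field: {doc_id: len(offsets) for doc_id, offsets in docs.items()}
            for field, docs in fields.items()
        }
        for word, fields in positions.items()
    }


# Frequency mappings cover only documents in which the word occurs in the field.
frequency_index = frequencies_from_positions(position_index)
```

Keeping both groups of mappings trades storage for speed: frequency-only ranking never has to touch the larger position lists.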
The word dictionary 302 orders each field sequentially per word. Thus, as shown in
By organizing the search index by fields within a word, a full text search and a search for a word in a specific field may be performed via a single disk read of the search index. This improves search performance and also avoids the need to provide separate search indexes for full text searching and for searching by specific fields. In addition, organizing the search index by fields within a word provides a degree of search schema flexibility. Rather than being confined to use a limited number of fields, as is common with search indexes, the word dictionary 302 may include a large number of fields and provide the ability to select a subset of this large number of fields dynamically during a search.
Schema flexibility also offers improvements in multi-tenant environments. A multi-tenant environment is one in which more than one customer uses the same search index. Because the word dictionary 302 includes a larger number of fields than is typically the case, individual customers can choose fields that are useful to them when doing a search.
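The following sketch suggests how one word-keyed index might serve both kinds of search. A single dictionary lookup stands in for the single disk read. The index contents, the field names and the `search` function are hypothetical.

```python
# word -> field -> document ID -> frequency (illustrative data only)
INDEX = {
    "search": {
        "title": {1: 1},
        "abstract": {1: 2, 2: 1},
        "body": {2: 4, 3: 1},
    },
    "index": {"title": {3: 1}, "body": {1: 2}},
}

# Assumed preconfigured set of fields consulted by a full text search.
FULL_TEXT_FIELDS = ("title", "abstract", "body")


def search(index, word, fields=FULL_TEXT_FIELDS):
    """Return document ID -> total occurrences of `word` in `fields`.

    The word's entry is fetched once; the requested fields are then
    selected from that entry, so a full text search and a field search
    differ only in the subset of fields passed in.
    """
    entry = index.get(word, {})  # one lookup per word
    totals = {}
    for field in fields:
        for doc_id, frequency in entry.get(field, {}).items():
            totals[doc_id] = totals.get(doc_id, 0) + frequency
    return totals


full_text_hits = search(INDEX, "search")
title_hits = search(INDEX, "search", fields=("title",))
```

Because the subset of fields is an argument rather than part of the index layout, it can be chosen per query.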
As shown in
The first word in the example word dictionary 302 is word 0 (304). The example word 0 (304) is found in a plurality of different fields on server computer 106, the first field being designated field 0 (306) and the last field being designated field 5. Only fields 0 and 5 are shown in
Field 0 (306) includes an example location start field 308, an example location length field 310, an example position start field 312 and an example position length field 314. The location start field 308 stores a pointer to a start of the location data storage area 328.
For each field for which a word occurs in a document, the location data storage area 328 includes a doc ID field, a frequency field and a field length field. For example, the location start field 308 points to the example doc ID field 330. The example doc ID field 330 provides an identifier for a document that includes one or more occurrences of word 0 in field 0.
The example frequency field 332 provides a number representing a number of occurrences of word 0 in field 0 for the document identified by the doc ID field 330. For example, if field 0 represents the title of the document, and word 0 occurs two times in the title, the example frequency field 332 has a value of 2. The example field length field 334 represents the length of field 0 for the document, for example the length of the title of the document identified by the doc ID field 330.
Similarly, the doc ID field 336 provides an identifier for another document that includes one or more occurrences of word 0 in field 0. The frequency field 338 provides a number representing a number of occurrences of word 0 in the document identified by the doc ID field 336 and the field length field 340 represents the length of field 0 in the document identified by doc ID field 336. Each of the example fields 330-340 typically occurs sequentially in memory so that memory offsets may be used to locate each of the example fields 330-340.
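The sequential layout of the location data can be sketched as a flat list addressed by a start offset and a length. This is a simplified model under stated assumptions: the length is taken to count stored values, each record is taken to hold exactly three values, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class FieldEntry:
    """Per-field dictionary entry, loosely modelled on fields 308 and 310."""
    location_start: int   # offset of the first value for this word and field
    location_length: int  # number of values stored for this word and field


RECORD_SIZE = 3  # doc ID, frequency, field length

# Flat location data storage area: records are laid out back to back.
location_data = [
    7, 2, 5,     # document 7: word occurs twice in field 0, field length 5
    9, 1, 8,     # document 9: word occurs once in field 0, field length 8
    7, 4, 120,   # document 7: word occurs four times in field 5
]

# Entries for one word, keyed by field number.
word0 = {0: FieldEntry(0, 6), 5: FieldEntry(6, 3)}


def read_locations(storage, entry):
    """Return the (doc ID, frequency, field length) records for one field."""
    end = entry.location_start + entry.location_length
    return [
        tuple(storage[offset:offset + RECORD_SIZE])
        for offset in range(entry.location_start, end, RECORD_SIZE)
    ]


field0_records = read_locations(location_data, word0[0])
field5_records = read_locations(location_data, word0[5])
```

Because the records are fixed-size and contiguous, the start offset and length are enough to locate every record for a word and field without scanning the rest of the storage area.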
The example location length field 310 contains a value representing a length of the location data fields in the location data storage area 328 for the occurrences of word 0 in field 0. In the example shown in
The example position data storage area 354 is an area of memory on server computer 106 or database 110 that stores position information for each occurrence of a word in a field in the one or more documents stored in the document management system. For each occurrence of the word in the field for a document, the position data storage area 354 includes information identifying the document, information identifying the position of the word in the document and information identifying the length of the field. For example, the doc ID field 356 provides an identifier for a document for which there is an occurrence of word 0 in field 0. In examples, the doc ID field 356 may identify the same document as the doc ID field 330. In other examples, the doc ID field 356 may identify a different document.
The position field 358 provides the position of word 0 in the document identified by the doc ID field 356 for a first occurrence of word 0 in field 0 in the document identified by the doc ID field 356. In examples, the position may be represented by a line number and a cursor position on the line corresponding to the line number. The field length field 360 represents the length of field 0 in the document identified by the doc ID field 356. Similarly, the doc ID field 362 provides an identifier for a document in which there is another occurrence of word 0 in field 0. If there is more than one occurrence of word 0 in field 0 for the document identified by the doc ID field 356, the doc ID field 362 may identify the same document as the doc ID field 356. The position field 364 provides the position of word 0 in the document identified by the doc ID field 362 for the occurrence of word 0 in field 0 in the document identified by the doc ID field 362. When there are multiple occurrences of word 0 in field 0 in a document, the position field 364 may represent a position of a second occurrence of word 0 in field 0 for the document identified by the doc ID field 356. The field length field 366 represents a length of field 0 in the document identified by the doc ID field 362.
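As a small illustration, the position data for one word in one field can be modelled as a list of (doc ID, position, field length) records, one per occurrence, so that a document with several occurrences appears several times. The sample values and the grouping helper are hypothetical, and positions are simplified to integer offsets rather than line and cursor pairs.

```python
from collections import defaultdict

# One record per occurrence of the word in the field.
position_data = [
    (7, 0, 5),  # first occurrence in document 7
    (7, 3, 5),  # second occurrence in the same document
    (9, 6, 8),  # single occurrence in document 9
]


def positions_by_document(records):
    """Group occurrence positions by document ID, preserving record order."""
    grouped = defaultdict(list)
    for doc_id, position, _field_length in records:
        grouped[doc_id].append(position)
    return dict(grouped)


grouped_positions = positions_by_document(position_data)
```

Grouping the records this way recovers the per-document occurrence count as the length of each position list.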
The example position length field 314 contains a value representing a length of the position data fields in the position data storage area 354 for occurrences of word 0 in field 0. In the example shown in
In a similar manner, the word dictionary 302 includes location start, location length, position start and position length information for each document field for which word 0 occurs in the document field. Thus, for document field 5, the word dictionary 302 includes the location start field 318, the location length field 320, the position start field 322 and the position length field 324. The location start field 318 is a pointer to a start of location data in the location data storage area 328 for occurrences of word 0 in field 5 for the one or more documents stored on server computer 106 or on database 110. The location length field 320 contains a value representing a length of location data fields for occurrences of word 0 in field 5 for the one or more documents stored on server computer 106 or on database 110. As shown in
The position start field 322 is a pointer to a start of position data in the position data storage area 354 for occurrences of word 0 in field 5 for the one or more documents stored on server computer 106 or on database 110. In the example shown in
The position length field 324 contains a value representing a length of the position data fields for occurrences of word 0 in field 5 for the one or more documents stored on server computer 106 or on database 110. In the example shown in
As discussed, the word dictionary 302 includes search index data for all words for which there is an occurrence of the word in one or more fields in the one or more documents stored on server computer 106 or database 110. However, because of space considerations, index data for only word 0 is shown in
At operation 404, for each word of the plurality of words, one or more fields are identified in the one or more documents in which the word is found. A field is an identifiable part of a document, for example a title, a heading, a paragraph, or the entire document. Other examples of fields are possible.
At operation 406, for each field in which a word is found, a mapping is generated between the word and a position of the word in each document in which the word is found in the field. The mapping provides an index that permits the word to be located for each occurrence of the word in the field in the one or more documents.
At operation 408, for each field in which a word is found, a mapping is generated between the word and a frequency of occurrence of the word in the field for each of the one or more documents in which the word is found in the field. The frequency of occurrence represents the number of times the word appears in the field for each of the one or more documents.
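Operations 404 through 408 can be sketched end to end as follows. This is a minimal illustration, not the claimed method: the tokenizer, the lowercasing, the use of word offsets within a field as positions, and the `build_index` name are all assumptions.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Build position and frequency mappings from fielded documents.

    `documents` maps a document ID to a dict of field name -> text.
    Returns two nested mappings, both keyed word -> field -> document ID.
    """
    positions = defaultdict(lambda: defaultdict(dict))
    frequencies = defaultdict(lambda: defaultdict(dict))

    for doc_id, fields in documents.items():
        for field, text in fields.items():
            # Identify the words and the field each one is found in.
            words = re.findall(r"\w+", text.lower())
            for offset, word in enumerate(words):
                # Map the word to its position for this field and document.
                positions[word][field].setdefault(doc_id, []).append(offset)

    # Map the word to its frequency of occurrence per field and document.
    for word, fields in positions.items():
        for field, docs in fields.items():
            for doc_id, offsets in docs.items():
                frequencies[word][field][doc_id] = len(offsets)

    return positions, frequencies


sample_documents = {
    1: {"title": "Search index", "body": "A search index maps a word to documents"},
    2: {"title": "Index of indexes", "body": "search search"},
}
positions, frequencies = build_index(sample_documents)
```

The sketch derives frequencies from the position lists for brevity; an implementation that stores only frequencies could count occurrences directly instead.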
With reference to
In a basic configuration, server computer 106 typically includes at least one processing unit 502 and system memory 504. Depending on the exact configuration and type of computing device, the system memory 504 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. System memory 504 typically includes an operating system 506 suitable for controlling the operation of a server, such as the Microsoft SharePoint® Server 2010 collaboration server, from Microsoft Corporation of Redmond, Wash. The system memory 504 may also include one or more software applications 508 and may include program data.
The server computer 106 may have additional features or functionality. For example, server computer 106 may also include computer readable media. Computer readable media can include both computer readable storage media and communication media.
Computer readable storage media is physical media, such as data storage devices (removable and/or non-removable) including magnetic disks, optical disks, or tape. Such additional storage is illustrated in
The server computer 106 may also contain communication connections 518 that allow the device to communicate with other computing devices 520, such as over a network in a distributed computing environment, for example, an intranet or the Internet. Communication connections 518 are one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
The various embodiments described above are provided by way of illustration only and should not be construed as limiting. Various modifications and changes may be made to the embodiments described above without departing from the true spirit and scope of the disclosure.