A significant amount of content is stored in document repositories. Frequently, document repositories are structured, e.g., using hierarchically organized folders. A document search query may require the identification of documents based on their locations in a structured document repository.
Specific embodiments of the technology will now be described in detail with reference to the accompanying figures. In the following detailed description of embodiments of the technology, numerous specific details are set forth in order to provide a more thorough understanding of the technology. However, it will be apparent to one of ordinary skill in the art that the technology may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
In the following description of
In general, embodiments of the technology relate to a method and system for document searches in structured document repositories. Documents, in one or more embodiments of the technology, are stored in a document repository that is structured by, e.g., including identifiable locations in the document repository. The structure of the repository may be provided by a file system with folders, a database that stores documents in different locations, or any other form of organization that enables the storage of documents in different locations. In one or more embodiments of the technology, a search for a document may include location constraints. In other words, a user may request to receive only matching documents that are stored at a particular location in the structured document repository. For example, the user may request to receive documents that match the search query only if these documents are stored in folders A, B, and C, but not in any other folder of the structured document repository.
In one embodiment of the technology, a client system (110) corresponds to any computing system (see e.g.,
Continuing with the discussion of
The document management service, in accordance with one or more embodiments of the technology, includes a document repository query engine (122), and a document indexing engine (128).
The document repository query engine (122) includes a document search engine (124) and a descendant filter (126). The document search engine (124), in accordance with one or more embodiments of the technology, identifies documents, in the structured document repository (132), that match a user-specified search query. The document search engine (124) may further perform additional functions such as determining whether the requesting user is authorized to access the identified documents. The search being performed by the document search engine (124), in one or more embodiments of the technology, is an indexed document search. Accordingly, the document search engine (124) may access the document search index (136) when performing a search for documents in the structured document repository (132).
The descendant filter (126), in accordance with one or more embodiments of the technology, determines, for each document identified by the document search engine, whether the document meets previously specified document location constraints. More specifically, as the document search engine (124) identifies a document, the descendant filter determines, based on a comparison of the document location, specified in the document search index, and the location constraints, whether the document should be reported to the requesting user. A description of the document search, performed by the document search engine (124), and of the document filtering, performed by the descendant filter (126) is provided below with reference to
The document indexing engine (128), in accordance with one or more embodiments of the technology, indexes documents (134.1-134.N) that are stored in the structured document repository (132). The document indexing engine (128) generates document search index entries (138) that include at least a document identifier (140), indexing terms (142) and a document location (144). One document search index entry may be generated per indexed document, and some or all documents (134) in the structured document repository (132) may be indexed by the document indexing engine. Each of the index entries, in accordance with an embodiment of the technology, includes information that characterizes a document in the structured document repository. The generation and use of document identifiers (140), indexing terms (142) and document locations (144) are further described below.
Continuing with the discussion of
The structured document repository (132) and/or the document search index (136) may be implemented using any format suitable for the storage of the corresponding entries in these repositories. One or more of these repositories may be, for example, a collection of text or binary files, spreadsheets, SQL databases etc., or any other type of hierarchical, relational and/or object oriented collection of data.
The structured document repository (132), in accordance with an embodiment of the technology, hosts a collection of documents (134.1-134.N) that may be searched upon request, e.g., by a user. The documents in the structured document repository may include any type of content and may be text documents encoded in various formats, or hybrid documents including text content in combination with other, non-text content.
In one embodiment of the technology, the structured document repository (132) is a file system, and a document (134) may be stored as a file in the file system in a folder or directory. The file system may be hierarchical and may include any number of hierarchies. Alternatively, the structured document repository (132) may be any other form of structured storage, e.g., a database of any type, or a file in which different locations are distinguishable. In one or more embodiments of the technology, the documents (134) in the structured document repository (132) are indexed to facilitate and/or accelerate the search for documents. The resulting indexing data may be stored in the document search index (136), as subsequently described.
The document search index (136), in accordance with one or more embodiments of the technology, includes the indexing information for at least some of the documents (134.1-134.N) in the structured document repository (132). The indexing information for a document (134), in accordance with an embodiment of the technology, is stored in an index entry (138). The document search index (136) may be a file system, a file, or any kind of database that accommodates the index entries (138.1-138.N).
Each index entry (138) corresponds to a single document (134) in the structured document repository (132) and characterizes the document. Index entries may exist for at least some of the documents in the structured document repository. The generation of index entries is described below, with reference to
An index entry (138), in accordance with an embodiment of the technology, includes a document identifier (140), indexing terms (142) and a document location (144). The document identifier (140), in accordance with an embodiment of the technology, is used to associate the index entry (138), including the subsequently described indexing terms (142) and the document location (144), with the corresponding document (134) in the structured document repository (132). The document identifier (140) may be any type of identifier based on which a corresponding document (134) in the structured document repository (132) can be identified. The document identifier may be, for example, the name of the corresponding document.
The indexing terms (142), in accordance with an embodiment of the technology, are expressions, e.g., words or phrases, that characterize the corresponding document. They are obtained from the corresponding document (134) when indexing the document, as further described in
A document location (144), in accordance with one or more embodiments of the technology, describes where the corresponding document (134) is located in the structured document repository. The document location may thus specify, e.g., a path, a folder, a directory, a database location, or any other location where the corresponding document is located.
Because the document location (144) of an index entry (138) is stored directly in the document search index (136), the location of the document (134) in the structured document repository (132) can be obtained by reading the index entry (138) in the document search index (136). At least part of the document search index is thus implemented as a covering index, which, in contrast to a conventional index that merely points to a location from which information can be obtained, contains the actual information itself.
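The index entry layout described above can be illustrated with a minimal sketch. The class name `IndexEntry`, the field names, the use of a dictionary as the index, and the sample documents are all assumptions made for illustration; the description does not prescribe any particular data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One entry of the document search index (hypothetical layout)."""
    document_id: str           # e.g., the name of the document
    indexing_terms: frozenset  # terms that characterize the document
    document_location: str     # location copied into the index itself


# The index is modeled as a plain dict keyed by document identifier.
# The entries below are invented sample data.
search_index = {
    "report.txt": IndexEntry(
        "report.txt", frozenset({"budget", "forecast"}), "/finance/2015"
    ),
    "notes.txt": IndexEntry(
        "notes.txt", frozenset({"meeting", "budget"}), "/admin"
    ),
}


def location_of(index, document_id):
    """Return the stored location by reading only the index entry.

    The document repository is never accessed, which is the point of
    keeping the location inside the index (covering index behavior).
    """
    return index[document_id].document_location


print(location_of(search_index, "report.txt"))
```

Because the location is part of the entry, a location check during a search needs no additional lookup in the repository.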
The technology is not limited to the architecture of the system (100) shown in
Turning to
In Step 202, indexing terms are obtained for the document that was added to the structured document repository. Indexing terms may be obtained by parsing the document and by identifying the most frequently occurring terms. Identifying indexing terms may require additional steps such as, for example, the removal of stop words from the document content. Stop words may be frequently occurring words such as, for example, “the”, “a”, “to”, etc. that may not serve as meaningful keywords for representing the document content in a document content identifier. Further, the document content may be stemmed, i.e., words or terms in the document may be replaced by the corresponding word stems. For example, the words “fishing”, “fished” and “fisher” may be reduced to the root word “fish”. Alternatively, lemmatization may be used to obtain the word stems.
Frequently occurring terms in the document may be considered indexing terms which may be obtained, for example, by generating a sorted list of the word stems. The list may only include terms that occur with at least a certain frequency.
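As a rough illustration of Step 202, the following sketch removes stop words, applies a deliberately naive suffix-stripping stemmer, and keeps the stems that occur with at least a given frequency. The stop word list, the suffix rules, the function names, and the `min_count` threshold are assumptions chosen for the example; a production system would more likely use an established stemmer or lemmatizer.

```python
import re
from collections import Counter

# Illustrative subset of stop words (an assumption, not a complete list).
STOP_WORDS = {"the", "a", "to", "and", "of", "in", "is"}


def naive_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_indexing_terms(text, min_count=2):
    """Return (stem, count) pairs, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(w) for w in words if w not in STOP_WORDS]
    counts = Counter(stems)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Keep only terms that occur with at least the required frequency.
    return [(term, n) for term, n in ranked if n >= min_count]


sample = "The fisher fished while fishing. The boat is in the harbor and the boat is old."
print(extract_indexing_terms(sample))
```

The per-term counts returned here could also serve as the frequency information that accompanies the indexing terms in an index entry.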
In Step 204, the indexing terms are stored in the document search index.
The indexing terms may be stored in an index entry that is specific to the document from which the indexing terms were obtained, and that is uniquely identified using a document identifier. The indexing terms may be accompanied by other information such as, for example, a cardinality determined for each of the indexing terms. The cardinality may be based on how frequently the indexing term occurs in the document.
In Step 206, the location in which the document was stored in Step 200 is stored in the document search index, as part of the index entry that corresponds to the document. Depending on the organization of the structured document repository, a particular format may be used for the document location. For example, the document location may be a path that includes folders or directories, or any other location specifier that allows identification of the location in the structured document repository, where the document can be found.
Turning to
In Step 302, location constraints are obtained. The location constraints may be obtained, for example, from the requesting user or from the requesting software module. A location constraint may be a path, a folder, a directory, or any other location, without departing from the technology. Further, a location constraint may specify groups of locations to be considered or to be excluded by the search to be performed. For example, a location constraint may specify that a folder and any folders that are hierarchically organized below that folder are to be considered.
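One possible way to evaluate such location constraints is sketched below, assuming that document locations are POSIX-style folder paths and that a constraint consists of folders to include and folders to exclude, each covering its descendants. The function names and the include/exclude representation are hypothetical.

```python
from pathlib import PurePosixPath


def is_within(location, folder):
    """True if location is the folder itself or hierarchically below it."""
    loc, root = PurePosixPath(location), PurePosixPath(folder)
    return loc == root or root in loc.parents


def meets_constraints(location, include=(), exclude=()):
    """Check a document location against include and exclude folder lists.

    An empty include list is treated as "any location is acceptable";
    this default is an assumption of the sketch.
    """
    if any(is_within(location, folder) for folder in exclude):
        return False
    return not include or any(is_within(location, folder) for folder in include)


print(meets_constraints("/A/sub/deep", include=["/A"]))
print(meets_constraints("/A/private/x", include=["/A"], exclude=["/A/private"]))
```

Comparing path components rather than raw string prefixes avoids treating a folder such as `/AB` as a descendant of `/A`.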
In Step 304, documents that match the document search query and that meet the location constraints are identified in the structured document repository, as further described in
In Step 306, the documents identified in Step 304 are reported to the requesting user or to the requesting software module.
Turning to
In Step 402, a determination is made about whether the selected index entry corresponds to a matching document, based on the indexing terms in the index entry. The determination may be made by a comparison of the search terms specified in the search query and the indexing terms in the index entry. If a sufficient match is found (e.g., based on at least a specified number of search terms being found among the indexing terms), a determination is made that a match has been identified, and the method may proceed to Step 404. If a match has not been identified, the method may return to Step 400 to identify a different index entry.
In Step 404, a determination is made about whether the selected index entry corresponds to a matching document, based on the document location. The determination may be made by verifying that the document location, stored in the index entry, does not violate the location constraints. If a match is found, the method may proceed to Step 406, where the document that corresponds to the index entry is flagged as a matching document, i.e., as a document for which both the search term requirement and the location constraints are met. If no match was found, the method may return to Step 400 to identify a different index entry.
In Step 408, a determination is made about whether additional index entries need to be considered. This may be necessary, for example, if the search is to be performed across all index entries, while not all index entries have yet been examined. Alternatively, additional index entries may need to be considered if the search query specifies a certain number of matching documents to be returned, while the number of identified matching documents is still below this number. If a determination is made that additional index entries are to be considered, the method may return to Step 400.
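The flow of Steps 400 through 408 can be sketched as a single pass over the index entries. The dictionary-based entry layout, the parameter names, the prefix-based folder check, and the sample data are assumptions for illustration only; the point is that both the term match and the location check read from the index entry alone.

```python
def search(index_entries, search_terms, allowed_folders,
           min_matching_terms=1, max_results=None):
    """Return identifiers of documents matching both terms and location."""
    wanted = set(search_terms)
    results = []
    for entry in index_entries:  # Step 400: select an index entry
        # Step 402: enough search terms among the indexing terms?
        if len(wanted & set(entry["terms"])) < min_matching_terms:
            continue
        # Step 404: location taken from the index entry, not the repository.
        # Trailing slashes make "/A" match "/A/sub" but not "/AB".
        location = entry["location"].rstrip("/") + "/"
        if not any(location.startswith(folder.rstrip("/") + "/")
                   for folder in allowed_folders):
            continue
        results.append(entry["id"])  # Step 406: flag as matching document
        # Step 408: stop once the requested number of results is reached.
        if max_results is not None and len(results) >= max_results:
            break
    return results


# Invented sample index entries.
index = [
    {"id": "d1", "terms": ["budget", "forecast"], "location": "/A"},
    {"id": "d2", "terms": ["budget"], "location": "/D"},
    {"id": "d3", "terms": ["travel"], "location": "/A"},
    {"id": "d4", "terms": ["budget"], "location": "/B/sub"},
]
print(search(index, ["budget"], ["/A", "/B", "/C"]))
```

In this sample, `d2` matches the search term but is rejected because of its location, and `d3` is in an allowed folder but does not match the search term.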
Embodiments of the technology may enable a system to perform indexed searches for documents while explicitly considering the storage location of the documents in a structured document repository. Embodiments of the technology are based on a search index that is organized in the form of a covering index. The search index includes a record of the locations of the documents in the structured document repository. The search for documents is performed in an efficient manner using a single-step search approach in which, for each document that matches a search query, a confirmation can be obtained regarding whether the document is to be reported based on the document's location in the structured document repository. When a document is assessed, both the indexing terms and the document location are obtained directly from the document search index. Embodiments of the technology thus avoid an input/output operation-intensive verification of the document location that would require scanning the structured document repository for the document's location. Embodiments of the technology, by reducing the number of file system operations, thus increase search performance and/or reduce computing system load.
Embodiments of the technology may be implemented on a computing system. Any combination of mobile, desktop, server, embedded, or other types of hardware may be used. For example, as shown in
Software instructions in the form of computer readable program code to perform embodiments of the technology may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that, when executed by one or more processors, is configured to perform embodiments of the technology.
Further, one or more elements of the aforementioned computing system (500) may be located at a remote location and connected to the other elements over a network (512). Further, embodiments of the technology may be implemented on a distributed system having a plurality of nodes, where each portion of the technology may be located on a different node within the distributed system. In one embodiment of the technology, the node corresponds to a distinct computing device. Alternatively, the node may correspond to a computer processor with associated physical memory. The node may alternatively correspond to a computer processor or micro-core of a computer processor with shared memory and/or resources.
While the technology has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the technology as disclosed herein. Accordingly, the scope of the technology should be limited only by the attached claims.