The present invention relates generally to index build technology, and more particularly, to the generation of indexes for document searching in situations requiring semantic analysis as part of the search.
Indexing of documents is often used to reduce search times for document searches. Technology for index building generally aims for even distribution of the indexed data within a system. While distributed computation of the search is powerful, it tends to break up the semantics of the data by assuming that the data is homogeneous. Homogeneity is a good assumption for the general text search problem. However, it presents a problem when semantic aggregation of data is needed for analysis, for instance, when specific data collections are relevant to a search. In general there are two solutions: collect specific indices relevant to those collections, or develop complex aggregation and filtering data joiners. Both require increasingly complex queries and rely on intrinsically generated structured metadata.
Particular data stores may use specific indexing techniques that are typically very close to the structure of the data being stored. General data storage systems may organize information based on common interest domains, that is, based on what the information of interest looks like and what the information is intended to model or represent. Often there is a mismatch between how data is queried to generate the searches and how it is stored in data sources.
In one general embodiment, a method is disclosed for a system to build a distributed reverse semantic index. The method includes receiving a plurality of documents, with each document having at least one defined rule/semantic, distributing the plurality of documents among a plurality of nodes of a system, and processing the documents in a generally parallel fashion. Processing the documents comprises processing text data of each document, and breaking each document into fields to index the text data to create index data by deferring on how to categorize the text data based upon the at least one defined rule/semantic. The indexed data is then combined back together to create an indexer-agnostic semantic index including a plurality of semantic index shards. The method further includes semantically classifying the documents based on the index shards into groups based on document type to create the distributed reverse semantic index.
In another embodiment, a system is disclosed that is configured to build a distributed reverse semantic index. The system builds a distributed reverse semantic index that includes semantic index shards, with each semantic index shard including documents of a similar document type. To build the distributed reverse semantic index, a plurality of documents, each document having at least one defined rule/semantic, is received. The plurality of documents is then distributed among a plurality of nodes of the system. The plurality of documents is then processed in a generally parallel fashion, where processing of the plurality of documents includes processing text data of each document of the plurality of documents and breaking each document into fields to index the text data to create index data by deferring on how to categorize the text data based upon the at least one defined rule/semantic. The system then recombines the indexed data to create an indexer-agnostic semantic index that includes a plurality of the semantic index shards. The system then semantically classifies the documents based on the index shards into groups based on document type, to create the distributed reverse semantic index that includes the indexer-agnostic index and the groups organized as the index shards.
In another embodiment, a computer program product is disclosed that comprises a computer readable medium having an embodiment of a computer usable program code. The computer usable program code is configured to receive a plurality of documents, with each document having at least one defined rule/semantic, and to distribute the plurality of documents among a plurality of nodes of a system. The computer usable program code is also configured to process the plurality of documents by the plurality of nodes in a generally parallel fashion, including processing text data of each document of the plurality of documents and breaking each document into fields to index the text data to create index data by deferring on how to categorize the text data based upon the at least one defined rule/semantic. The computer usable program code is further configured to recombine the indexed data to create an indexer-agnostic semantic index including a plurality of semantic index shards. Finally, the computer usable program code is configured to semantically classify the documents based on the index shards into groups based on document type to create the distributed reverse semantic index.
For a fuller understanding of the nature and advantages of the invention, as well as a preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings.
The following description is made for the purpose of illustrating the general principles of the invention and is not meant to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations. Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java®, Smalltalk™, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The workstation 10 shown in
The workstation 10 may have resident thereon an operating system capable of running various programs. It will be appreciated that a preferred embodiment may also be implemented on any suitable platform or operating system. A preferred embodiment may be written using JAVA, XML, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology. Object oriented programming (OOP), which has become increasingly used to develop complex applications, may be used.
In one embodiment, the system 100 begins with a receiver 102 receiving documents 200 from an arbitrary data source 104. The specific nature of the data source 104 is not relevant. A distribution 106 then distributes the documents 200 among different nodes 300-1 to 300-N in the system 100, providing a generally balanced load 202-i for each node 300-i, where i ranges from 1 to N. The distributed documents 200-1 to 200-N are then processed in a distributed, parallel fashion, with each individual document 200 being processed to create indexes 250.
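As an illustration only, the balanced distribution described above might be sketched as follows. The description does not specify a balancing policy; this sketch assumes a greedy "largest document to least-loaded node" heuristic and uses text length as a stand-in for load. The function name `distribute` and the sample documents are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def distribute(documents: Dict[str, str], node_count: int) -> List[List[str]]:
    """Assign document ids to nodes so that per-node load is roughly even.

    Assumption: load is approximated by the character length of each document.
    """
    # Min-heap of (current load, node index); the least-loaded node pops first.
    heap: List[Tuple[int, int]] = [(0, i) for i in range(node_count)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(node_count)]

    # Place the largest documents first; the id breaks ties deterministically.
    ordered = sorted(documents.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for doc_id, text in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(doc_id)
        heapq.heappush(heap, (load + len(text), node))
    return assignment


docs = {"a": "x" * 90, "b": "x" * 60, "c": "x" * 50, "d": "x" * 40}
print(distribute(docs, 2))  # [['a', 'd'], ['b', 'c']]
```

Other policies, such as hashing on a document identifier, would equally satisfy the "generally balanced" requirement; the heuristic here is chosen only because it is short and deterministic.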
For simplicity of discourse, consider an example of operating the node 300-1. A document 200-1 is processed along with the full text of the document 200-1 to create indexes 250-1. The document 200-1 is broken into fields 204-1, to handle different data types 206-1 appropriately. Additionally, semantic rules 202 are given special emphasis in order to enable the consistent classification of the data in the documents 200.
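A minimal sketch of the per-node step described above, assuming a document arrives as a mapping from field names to values. The field names, the tokenizer, and the rule that non-text values are indexed as a single exact-match term are all assumptions made for illustration, not details taken from the description.

```python
import re
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Optional, Set, Tuple

# Inverted index keyed by (field name, term), holding the ids of matching documents.
FieldIndex = DefaultDict[Tuple[str, str], Set[str]]


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(doc_id: str, fields: Dict[str, Any],
                   index: Optional[FieldIndex] = None) -> FieldIndex:
    """Add one document to the index, treating each field according to its data type."""
    if index is None:
        index = defaultdict(set)
    for name, value in fields.items():
        if isinstance(value, str):
            terms = tokenize(value)   # free text: one posting per token
        else:
            terms = [str(value)]      # numbers and similar values: one exact-match term
        for term in terms:
            index[(name, term)].add(doc_id)
    return index


idx = index_document("200-1", {"title": "Quarterly Report", "year": 2011})
print(sorted(idx))
```

Keeping the field name in the key is one way to let different data types be searched differently later without re-reading the original document.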
Referring to
In one general embodiment, the system 100 may include the receiver 102 for receiving a plurality of the documents 200, with each document 200 having at least one defined rule/semantic 202. The system 100 may also include the distribution 106 distributing the plurality of the documents 200-1 to 200-N among a plurality of nodes 300-1 . . . to 300-N.
The system 100 may also include a processor, such as the central processing unit 12 described in
The system 100 may also include a combiner 114 for combining the indexed data 250-1 back together to create an indexer-agnostic semantic index 510-1 that includes a plurality of the semantic index shards 502-1 . . . to 502-M. In an alternative embodiment, the system 100 may semantically classify the documents 200-1 . . . to 200-N based on the index shards 502-1 into groups based on document type 204 to create the distributed reverse semantic index 500.
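The combining and grouping step might look like the following sketch. It assumes, hypothetically, that each node emits records of the form (document type, term, document id); the description does not define the shape of the per-node index data, so the `Posting` type and the `combine` function are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical per-node output record: (document type, term, document id).
Posting = Tuple[str, str, str]
# One shard per document type, each mapping a term to the ids of matching documents.
Shards = Dict[str, Dict[str, Set[str]]]


def combine(node_outputs: Iterable[Iterable[Posting]]) -> Shards:
    """Merge the index data produced by every node into one shard per document type."""
    shards = defaultdict(lambda: defaultdict(set))
    for output in node_outputs:
        for doc_type, term, doc_id in output:
            shards[doc_type][term].add(doc_id)
    # Convert to plain dicts so that later lookups cannot create empty entries.
    return {doc_type: dict(terms) for doc_type, terms in shards.items()}


node_1 = [("invoice", "total", "d1"), ("memo", "meeting", "d2")]
node_2 = [("invoice", "total", "d3")]
shards = combine([node_1, node_2])
print(sorted(shards))  # ['invoice', 'memo']
```

Because grouping is keyed only on the document type carried with each record, the combiner in this sketch needs no knowledge of which indexer produced the records.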
In one embodiment, the DFS 108 may include the processor 12 that may embody at least part of the method, and may comprise the receiver 102 for receiving the documents 200 from the data source 104. The processor 12 may also comprise the distribution 106 for distributing the documents 200.
In one embodiment, at least one of the nodes 300-1 to 300-N, such as the node 300-N, for example, may include a processor 12-A. The processor 12-A may comprise a processor 12, such as the central processing unit 12 described in
In one embodiment, the distribution engine 110 may include another processor 12-B. The processor 12-B may comprise a processor 12, such as the central processing unit 12 described in
In one embodiment, the index builder 112 may also include a processor 12-C. The processor 12-C may comprise a processor 12, such as the central processing unit 12 described in
Referring still to
The indexed data 250-1 is combined back together based on these groups, with each group being used to build a semantic index shard 490 or set 492 of semantic index shards, as illustrated in Prior Art
Based on the knowledge acquired from the data source 104, the semantic index 502, or collections 500 of these semantic indexes 502, may be created based on the semantics 202 contained within the data source 104 itself. In this way, user queries can be optimized by quickly zeroing in on the data that is semantically relevant to the query at hand. The semantic index 502, or collection 500 of the semantic indexes 502, can be produced around the classification and relationships of concepts and terms of specific domains of interest, allowing search operations to be exhaustive within the realm of applicability, when appropriate, rather than across an entire corpus of disassociated documents. It must be noted that this does not prevent a corpus-wide search from being used or traditional ranking algorithms from being applied; rather, it enhances the indexing system's ability to leverage the semantics contained within the query itself.
Returning to
By indexing data using a semantic aggregation, much of the index 250-1 can be disregarded before the search takes place. This provides a level of accuracy comparable to searching the entire index, even though only a subset of the indexed data is actually searched. This can provide a strong performance improvement to the system 100 performing searches using the reverse semantic index 500.
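To illustrate how shards can be disregarded before a search, the following sketch restricts a lookup to the shard for one document type and falls back to a corpus-wide search when no type is given. The shard layout (document type, then term, then document ids) and the `search` function are assumptions for this example.

```python
from typing import Dict, Optional, Set

# Assumed layout: document type -> term -> ids of matching documents.
Shards = Dict[str, Dict[str, Set[str]]]


def search(shards: Shards, term: str, doc_type: Optional[str] = None) -> Set[str]:
    """Look up a term in one shard when a document type is known, else in all shards."""
    if doc_type is not None:
        selected = [shards.get(doc_type, {})]   # every other shard is skipped entirely
    else:
        selected = list(shards.values())        # corpus-wide search remains available
    hits: Set[str] = set()
    for shard in selected:
        hits |= shard.get(term, set())
    return hits


shards = {
    "invoice": {"total": {"d1", "d3"}},
    "memo": {"total": {"d2"}, "meeting": {"d2"}},
}
print(sorted(search(shards, "total", doc_type="invoice")))  # ['d1', 'd3']
print(sorted(search(shards, "total")))                      # ['d1', 'd2', 'd3']
```

How a query is mapped to a document type is left open here; in this sketch the caller supplies it directly.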
The system 100 is independent of the index builder 112 and of any specific indexing process, procedure and/or rule base. The main logic of the system 100 is external to the actual indexing process, allowing any number of different index builders 112 to be used. Different indexing applications provide added functionality or other advantages over one another, so the ability to use the same process to build a semantic index in a distributed fashion with different indexers is beneficial. Additionally, the indexer-agnostic design of the system 100 allows it to be leveraged to test the performance of competing indexers.
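One way to picture the indexer-agnostic design is to pass the index builder in as a parameter, so that the grouping logic never depends on what the builder returns. In the sketch below, both builders and the `build_semantic_index` function are hypothetical stand-ins and do not represent any particular indexing product.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Any callable that turns document text into some index object can act as a builder.
IndexBuilder = Callable[[str], object]


def term_set_builder(text: str) -> frozenset:
    """Hypothetical indexer A: records only which terms occur."""
    return frozenset(text.lower().split())


def term_count_builder(text: str) -> Counter:
    """Hypothetical indexer B: records how often each term occurs."""
    return Counter(text.lower().split())


def build_semantic_index(documents: Iterable[Tuple[str, str, str]],
                         builder: IndexBuilder) -> Dict[str, Dict[str, object]]:
    """Group (doc id, doc type, text) records by type, indexing each with the given builder."""
    shards = defaultdict(dict)
    for doc_id, doc_type, text in documents:
        shards[doc_type][doc_id] = builder(text)   # the only indexer-specific call
    return dict(shards)


docs = [("d1", "memo", "team meeting meeting"), ("d2", "invoice", "total due")]
for builder in (term_set_builder, term_count_builder):
    print(builder.__name__, build_semantic_index(docs, builder)["memo"]["d1"])
```

Running the same documents through both builders, as in the loop above, is also the basic shape of a side-by-side comparison of competing indexers.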
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.