The present invention concerns a method for improving search engine efficiency with respect to accessing, searching and retrieving information in the form of documents stored in document or content repositories, wherein an indexing subsystem of the search engine crawls the stored documents and generates an index thereof, wherein applying a user search query to the index returns a result set of at least some query-matching documents to the user, and wherein the search engine comprises an array of search nodes hosted on one or more servers.
Particularly the invention discloses how to build a new framework for index distribution on a search engine, and even more particularly on an enterprise search engine.
Building a search engine is challenging for several reasons: the index must accommodate very large and growing document collections, queries must be answered with low latency even under high query loads, and the system must remain available when individual hosts fail. To satisfy these three requirements, search engines use sophisticated methods for distributing their indices across a possibly large cluster of hosts.
An overview and discussion of the prior art relevant to the present invention shall now be given. All literature references are identified by abbreviations in parentheses at the appropriate locations in the following. A full bibliography is given in an appendix at the end of the description.
In order to improve the efficiency of search systems there has recently been much research on distribution of search engine indices. Early work concerned how to distribute posting lists and explored the trade-off between distributing posting lists based on index terms (herein also called keywords) versus documents [Bad01, MMR00, RNB98, TGM93, CKE+90, MWZ06]. The present invention takes as its point of departure the insight that making a global choice between these two alternatives is suboptimal because the statistical properties of keywords and documents vary in a typical search environment, as exemplified below.
For an unpopular keyword κ that appears in only a few queries, replicating its posting list wastes resources, since there is little opportunity for parallelism in executing the queries and thus not many queries ever read κ's posting list in parallel from different hosts. The posting list of a popular keyword κ′, however, is accessed by many queries and should thus be replicated to enable parallelism.
In order to better understand the prior art, a brief discussion of a search engine architecture as known in the art and currently used shall be given with reference to the appended drawing figure.
A search engine 100 as known in the art, and in which the present invention can be applied, comprises various subsystems 101-107. The search engine can access document or content repositories located in a content domain or space, wherefrom content can either be actively pushed into the search engine or pulled into the search engine using a data connector. Typical repositories include databases, sources made available via ETL (Extract-Transform-Load) tools such as Informatica, any XML-formatted repository, files from file servers, files from web servers, document management systems, content management systems, email systems, communication systems, collaboration systems, and rich media such as audio, images and video. Retrieved documents are submitted to the search engine 100 via a content API (Application Programming Interface) 102. Subsequently, documents are analyzed in a content analysis stage 103, also termed a content pre-processing subsystem, in order to prepare the content for improved search and discovery operations. Typically, the output of this content analysis stage 103 is an XML representation of the input document.

The output of the content analysis is used to feed the core search engine 101. The core search engine 101 can typically be deployed across a farm of servers in a distributed manner in order to allow for large sets of documents and high query loads to be processed. The core search engine 101 accepts user requests and produces lists of matching documents. The document ordering is usually determined according to a relevance model that measures the likely importance of a given document relative to the query. In addition, the core search engine 101 can produce additional metadata about the result set, such as summary information for document attributes. The core search engine 101 in itself comprises further subsystems, namely an indexing subsystem 101a for crawling and indexing content documents and a search subsystem 101b for carrying out search and retrieval proper. Alternatively, the output of the content analysis stage 103 can be fed into an optional alert engine 104. The alert engine 104 will have stored a set of queries and can determine which of these queries would have been satisfied by the given document input.

A search engine can be accessed from many different clients or applications, typically mobile and computer-based client applications; other clients include PDAs and game devices. These clients, located in a client space or domain, submit requests to a search engine query or client API 107. The search engine 100 will typically possess a further subsystem in the form of a query analysis stage 105 to analyze and refine the query in order to construct a derived query that can extract more meaningful information. Finally, the output from the core search engine 101 is typically further analyzed in another subsystem, namely a result analysis stage 106, in order to produce information or visualizations that are used by the clients. Both stages 105 and 106 are connected between the core search engine 101 and the client API 107, and in case the alert engine 104 is present, it is connected in parallel to the core search engine 101, i.e. between the content analysis stage 103 and the query and result analysis stages 105; 106.
In order to improve the search speed of a search engine, International published application WO00/68834 proposes a search engine with a two-dimensional linearly scalable parallel architecture for searching a collection of text documents D, wherein the documents can be divided into a number of partitions d1, d2, . . . , dn, wherein the collection of documents D is pre-processed in a text filtration system such that a pre-processed document collection Dp and corresponding pre-processed partitions dp1, dp2, . . . , dpn are obtained, wherein an index I can be generated from the document collection D such that for each pre-processed partition dp1, dp2, . . . , dpn a corresponding index i1, i2, . . . , in is obtained, wherein searching a partition d of the document collection D takes place with a partition-dependent data set dp,k, where 1 ≤ k ≤ n, and wherein the search engine comprises data processing units that form sets of nodes connected in a network. A first set of nodes comprises dispatch nodes Nα, a second set of nodes search nodes Nβ, and a third set of nodes indexing nodes Nγ. The search nodes Nβ are grouped in columns which via the network are connected in parallel between the dispatch nodes Nα and an indexing node Nγ. The dispatch nodes Nα are adapted for processing search queries and search answers, the search nodes Nβ are adapted to contain search software, and the indexing nodes Nγ are adapted for generating indexes I for the search software. Optionally, acquisition nodes Nδ are provided in a fourth set of nodes and adapted for processing the search answers, thus relieving the dispatch nodes of this task. The two-dimensional scaling takes place with a scaling of the data volume and a scaling of the search engine performance, respectively, through a corresponding adaptation of the architecture.
The schematic layout of this scalable search engine architecture is shown in the appended drawing figure. Although this architecture is scalable in two dimensions, the same partitioning and replication scheme is applied uniformly to all keywords and documents, so that differences in their statistical properties cannot be exploited.
Further, U.S. Pat. No. 7,293,016 B1 (Shakib & al., assigned to Microsoft Corporation) discloses how to arrange indexed documents in an index according to a static ranking and partition them according to that ranking. The index partitions are scanned progressively, starting with the partition containing the documents with the highest static rank, in order to locate documents containing a search word, and a score is computed on the basis of the set of documents located thus far in the search and of the range of static ranks in the next partition to be scanned. The next partition is scanned to locate the documents containing a search word when the computed score is above a target score. A search can be stopped when no more relevant results will be found in the next partition.
US published patent application No. 2008/033943 A1 (Richards & al., assigned to BEA Systems, Inc.) concerns a distributed search system with a central queue of document-based records wherein a group of nodes is assigned to different partitions, indexes for a group of documents are stored in each partition, and the nodes in the same partition independently process document-based records from the central queue in order to construct the indexes.
Existing prior art does not provide a design framework built on the general notions of keyword and query distribution properties and thus does not achieve the flexibility of this design, with resulting performance improvements and reduction in resource requirements.
A specific concern has been growth of the index, and several specific techniques that gracefully handle on-line index construction have been developed [BCL06]. These techniques are orthogonal to the framework resulting from applying the method according to the present invention, as shall be apparent from a detailed description thereof.
The present invention does not take specific ranking algorithms into account since it is assumed that the user always wants all query results. However, these ideas can be extended in a straightforward manner to some of the recently developed ranking algorithms [RPB06, AM06, LLQ+07] and algorithms for novel query models [CPD06, LT1T07, ZS07, DEFS06, TKT06, JRMG06, YJ06, KCMK06]. Algorithms for finding the best matching query results when combining matching functions have also been the focus of much research [PZSD96, Fag99, MYL02]. These techniques are, however, orthogonal to an index distribution framework as realized by the method of the present invention, and they can also be incorporated easily.
The techniques employed by the present invention for query processing with partitioned posting lists are based on fundamental ideas drawn from parallel database systems [DGG+86]; however, parallel database systems were developed for database management systems that store structured data, whereas the focus of the present invention is on enterprise and Internet search where search queries are executed over collections of often unstructured or semi-structured documents.
There is also prior art concerning text query processing in peer-to-peer systems, where the goal is to coordinate loosely coupled hosts with an emphasis on finding query results without broadcasting a query to all hosts in the network [RV03, LLH+03, ODODg02, SMwW+03, CAN02, KRo02, SL02, TXM03, TXD03, BJR03, TD04]. The main assumption of these prior art publications concerns the degree of coupling between the hosts, and this is different from the initial conception of the present invention, which assumes that all hosts are tightly coupled and under the control of a single entity, for example in a cluster in an enterprise data center, which is the dominant architecture today. The conceptual framework on which the present invention builds maps directly onto this architecture by assuming a tightly coupled set of hosts.
In view of the shortcomings and disadvantages of the above-mentioned prior art, it is a major object of the present invention to provide a method that significantly enhances the performance of a search engine.
Another object of the present invention is to configure the index of a search engine, and specifically an enterprise search engine, on the basis of recognizing that keywords and documents will differ with regard to intrinsic as well as extrinsic properties, for instance as given by modalities in search and access patterns.
Finally, it is an object of the present invention to optimize an index configuration with regard to inherent features of the search system itself as well as to its operating environment.
The above-mentioned objects as well as further features and advantages are realized according to the present invention with a method that is characterized by configuring the index of the search engine on the basis of one or more document properties, and at least one of a fault-tolerance level, a required search performance, document meta-properties, and an optimal resource utilization;
partitioning the index; replicating the index; distributing the thus partitioned and replicated index over the array of search nodes, such that index partitions and replicas thereof are assigned to said one or more servers hosting the array of search nodes, and processing search queries on the basis of the distributed index.
Further features and advantages of the present invention shall be apparent from the appended dependent claims.
The present invention shall be better understood when the following detailed discussion of its general background and actual embodiments is read in conjunction with the appended drawing figures.
In order to describe the present invention in full, some assumptions and preliminaries shall be discussed. Then the new framework for index distribution enabled by the method according to the present invention is discussed.
For the present invention, a simplified model of a search engine is introduced. The notation used is summarized in Table 1.
One has a set of keywords K={κ1, . . . , κn} and a set of documents D={d1, . . . , dm}. Each document d is a list of keywords, and is identified by a unique identifier called a URL. An occurrence is a tuple (κ, u) which indicates that the document associated with the URL u contains the keyword κ. A document record is a tuple (u, date) that indicates that the document associated with the URL u was created at a given date.
In practice, an occurrence contains other data, for example the position of the keyword in the document or data that are useful for determining the ranking of the document in the output of a query. Also, a document has other associated metadata besides the document record, for example an access control list. Neither of these issues is important for the aspects of the index which are the focus of the following discussion.
The index of a search engine consists of sets of occurrences and a set of document records. There is one set of occurrences for each keyword κ, hereinafter called the posting set of keyword κ. The posting set of keyword κ contains all occurrences of keyword κ, and it contains only occurrences of keyword κ. To be consistent with the prior art, posting sets are presumed to be ordered in a fixed order (for example, lexicographically by URL), and the ordered posting set of a keyword κ will be referred to as the posting list PL(κ) of keyword κ in the following disclosure. The set of document records contains one document record for each document, and it contains only document records.
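By way of a non-limiting illustration, the index model introduced above can be sketched in a few lines of Python; all identifiers (build_index, posting_list and so on) and the example creation date are illustrative only and form no part of the claimed subject matter.

```python
from collections import defaultdict

def build_index(documents):
    """documents: dict mapping a URL to the list of keywords in that document."""
    posting_sets = defaultdict(set)   # keyword kappa -> its posting set of occurrences
    document_records = []             # one (URL, date) document record per document
    for url, keywords in documents.items():
        document_records.append((url, "2007-12-14"))  # illustrative creation date
        for kappa in keywords:
            posting_sets[kappa].add((kappa, url))     # an occurrence (kappa, u)
    return posting_sets, document_records

def posting_list(posting_sets, kappa):
    """PL(kappa): the posting set of kappa in a fixed order, here by URL."""
    return sorted(posting_sets[kappa], key=lambda occurrence: occurrence[1])
```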
Now search queries and query processing shall be discussed in some detail. Users issue queries, and a query q consists of a set of keywords q={κ1, . . . , κl} ⊂ K. The present invention adopts a model for a query in which a user would like to find every document that contains all the keywords in the query. One can assume that the arrival time of each query q follows an exponential distribution and thus can be characterized by a single parameter λq, the interarrival rate of query q. Note that this probabilistic model of queries implies that queries are independent. A query workload ω is a function that associates with each query q ∈ 2^K an arrival rate λω(q). From a query workload one can compute the arrival rate λω(κ) of each keyword κ by summing over all the queries that contain κ, formally

λω(κ) = Σ_(q ∈ 2^K : κ ∈ q) λω(q).
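By way of example only, this computation can be sketched as follows, assuming the workload is represented as a mapping from queries (frozen sets of keywords) to arrival rates; the representation and the name keyword_arrival_rate are illustrative.

```python
def keyword_arrival_rate(workload, kappa):
    """lambda_w(kappa): the sum of the arrival rates of all queries containing kappa."""
    return sum(rate for query, rate in workload.items() if kappa in query)

# Example: with query {a, b} arriving at rate 2.0 and query {b} at rate 1.0,
# keyword b has arrival rate 3.0.
workload = {frozenset({"a", "b"}): 2.0, frozenset({"b"}): 1.0}
assert keyword_arrival_rate(workload, "b") == 3.0
```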
The following simplified way of logically processing a query q={κ1, . . . , κl} shall be assumed. For each keyword κi, its posting list PL(κi) is retrieved for i ∈ {1, . . . , l}, and the intersection of their URL fields is computed. Formally, the following relational algebra expression computes the query result QueryResult(q) for query q={κ1, . . . , κl}:

QueryResult(q) = πURL PL(κ1) ∩ . . . ∩ πURL PL(κl)
There are more sophisticated ways of defining QueryResult(q); for example, the user may only want to see a subset of QueryResult(q), and also may want to see this subset in ranked order.
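A minimal sketch of the basic definition above (not of these more sophisticated variants) is the following; posting_sets is the structure from the illustrative index model given earlier.

```python
from functools import reduce

def query_result(posting_sets, query):
    """QueryResult(q): the URLs occurring in the posting list of every keyword of q."""
    url_sets = [{url for (_, url) in posting_sets[kappa]} for kappa in query]
    return reduce(set.intersection, url_sets) if url_sets else set()
```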
Physical Setup
The present invention assumes a cluster of workstations modelled as a set of hosts H={h1, . . . , ho} [ACPtNt95]. Further, each host h is assumed to have a single disk with a fixed amount of storage space of DiskSize units. Note that for ease of explanation, the translation of the abstract unit of storage into a concrete unit such as bytes has been omitted in the model. Each occurrence is assumed to have a fixed size of 1 unit. For a keyword κ and its posting list PL(κ) the size of the posting list |PL(κ)|, is defined as the number of occurrences in PL(κ).
Each host h is assumed to be capable of an associated overall performance that allows it to retrieve buc(h) units of storage within latencyBound milliseconds; this number is an aggregated unit that incorporates CPU speed, the amount of main memory available, and the latency and transfer rate of the disk of the host. Further, in the following, all hosts are assumed to have identical performance, and thus the dependency of buc(h) on h can be dropped and reference just be made to buc as the number of units that any host can retrieve within latencyBound milliseconds.
A framework for index distribution shall now be discussed. Specifically, the framework or architecture as realized according to the method of the present invention encompasses three aspects, viz. partitioning, replication and host assignment, as set out below.
Partitioning
For each keyword, its posting list is partitioned into one or more components. This partitioning of the posting lists into components is done in order to be able to distribute the posting lists across multiple hosts such that all components can be retrieved in parallel.
Replication
For each keyword, each of its components is replicated a certain number of times, resulting in several component-replicas for each component. Component-replicas are created for several reasons. The first reason for replication is fault tolerance; in case a host that stores a component fails, the component can be read from another host. The second reason for replication is improved performance, because queries can retrieve a component from any one of the hosts on which the component is replicated and thus the load can be balanced.
Host Assignment
After partitioning and replication, each component-replica of a posting list is assigned to a host, with the assignment subject to the restriction that no two component-replicas of the same component are assigned to the same host. The host assignment enables the location of components to be optimized globally across keywords. One could for example co-locate components of keywords that commonly appear together in queries to reduce the cost of query processing.
Now the corresponding three parts of the index distribution framework according to the method of the present invention shall be introduced.
For the first part, select a function numPartitions(·) that takes as input a keyword κ and returns the number of components into which posting list PL(κ) is partitioned; the resulting components are C1(κ), C2(κ), . . . , CnumPartitions(κ)(κ). Also select a function occLoc(·) that takes as input an occurrence and outputs the number of the component in which this occurrence is located. Thus, if occLoc((κ, u)) = i, then (κ, u) ∈ Ci(κ). Note that if (κ, u) ∈ PL(κ), then 1 ≤ occLoc((κ, u)) ≤ numPartitions(κ) holds.
For the second part, select a function numReplicas(·) that takes as input a keyword κ and returns the number of component-replicas of the partitions of the posting list of κ. The original component is included in the number of component-replicas. Thus for a keyword κ, there exist numReplicas(κ)·numPartitions(κ) component-replicas. If the right numPartitions(κ) components are combined, then they together comprise PL(κ); for any component Ci(κ) one can find numReplicas(κ) identical component-replicas. In particular, if in workload ω keyword κ has arrival rate λω(κ), and one uniformly balances the load between the numReplicas(κ) component-replicas, then the arrival rate for this keyword for each of the component-replicas will be

λω(κ) / numReplicas(κ).
For the third part, select a function hostAssign(κ, i, j) that takes as input a keyword κ, a replica number i and a component number j and returns the host that stores component-replica i of component j of the posting list PL(κ). Note that two identical component-replicas (that are replicas of each other) must be mapped to different hosts. Formally, hostAssign(κ, i1, j) ≠ hostAssign(κ, i2, j) must hold for j ∈ {1, . . . , numPartitions(κ)} and i1, i2 ∈ {1, . . . , numReplicas(κ)} with i1 ≠ i2.
An instantiation of the three functions numPartitions(·), numReplicas(·) and hostAssign(·, ·, ·) shall be called a search engine index configuration.
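By way of a non-limiting sketch, a search engine index configuration can be represented as a triple of callables, and the above restriction that identical component-replicas reside on different hosts can be checked programmatically; the class name IndexConfiguration and the method check_valid are illustrative only.

```python
class IndexConfiguration:
    """An instantiation of numPartitions, numReplicas and hostAssign (a sketch)."""

    def __init__(self, num_partitions, num_replicas, host_assign):
        self.num_partitions = num_partitions  # keyword -> number of components
        self.num_replicas = num_replicas      # keyword -> replicas per component
        self.host_assign = host_assign        # (keyword, replica i, component j) -> host

    def check_valid(self, keywords):
        """Verify that no two replicas of the same component share a host."""
        for kappa in keywords:
            for j in range(1, self.num_partitions(kappa) + 1):
                hosts = [self.host_assign(kappa, i, j)
                         for i in range(1, self.num_replicas(kappa) + 1)]
                if len(hosts) != len(set(hosts)):
                    raise ValueError(f"replica collision for {kappa!r}, component {j}")
```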
Given the framework as disclosed above, the physical model for processing a query q can now be introduced. Processing a query q involves three steps: first, selecting for each keyword of q a set of hosts that together store a complete set of components of its posting list; second, retrieving the stored component-replicas locally on each selected host; and third, intersecting the retrieved posting lists to compute QueryResult(q).
Now these three steps shall in turn be addressed.
For the first step, note that the function hostAssign(κ, i, j) encodes for each keyword κ the set of hosts where all the component-replicas of the posting list of κ are stored.
For the second step, each host involved in processing query q (as selected in the first step) retrieves all its local component-replicas for all keywords involved in the query.
For the third step, each host will first intersect the local component-replicas of all the keywords. Then the results of the local intersections are processed further to complete the computation of QueryResult(q).
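A minimal sketch of these three physical steps is given below, building on the illustrative IndexConfiguration above. The mapping posting_store from (host, keyword) to locally stored occurrences is an assumption made for the illustration, and the final combine step is only valid here when all keywords of the query are partitioned identically (as in the Rows-and-Columns instantiations described next); otherwise repartitioning would be needed.

```python
from collections import defaultdict

def process_query(config, posting_store, query):
    """Physical query processing in three steps (illustrative sketch only)."""
    # Step 1: select hosts; here simply replica number 1 of every component.
    selected = {(config.host_assign(kappa, 1, j), kappa)
                for kappa in query
                for j in range(1, config.num_partitions(kappa) + 1)}
    # Step 2: each selected host retrieves its local component-replicas.
    local = defaultdict(dict)
    for host, kappa in selected:
        urls = {url for (_, url) in posting_store.get((host, kappa), set())}
        local[host][kappa] = urls
    # Step 3: intersect locally per host, then union the partial results.
    result = set()
    for host, per_keyword in local.items():
        if len(per_keyword) == len(query):   # host holds all keywords of q locally
            result |= set.intersection(*per_keyword.values())
    return result
```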
Now the problem of index design can be defined as follows. A set of hosts that has associated storage space DiskSize and performance buc is given. Also given is a set of keywords with posting lists PL(κ1), . . . , PL(κn) that have sizes |PL(κ1)|, . . . , |PL(κn)|, as well as a query workload ω.
For the index design problem one needs to find functions numPartitions(·), numReplicas(·), and hostAssign(·, ·, ·) such that the expected latency of answering a query q is below latencyBound, where the expectation is taken over the set of all possible query sequences.
In the following a discussion of some embodiments shall be given by specific and exemplary instantiations thereof.
1. AllTheWeb Rows and Columns
The AllTheWeb Rows and Columns architecture (named in homage to the AllTheWeb search system as described in the introduction hereinabove) is a trivial instantiation of the framework: r·c hosts are arranged in a matrix with r rows and c columns.
Using a hash function on URLs which is independent of the keyword, the postings of any keyword are approximately evenly partitioned into c components. Each component is then replicated within the column, one component-replica for each row, resulting in r component-replicas. To reconstruct the posting list of a keyword, one host from each column needs to be accessed, but it is not necessary to select these hosts all from the same row, and this flexibility simplifies query load balancing between hosts and improves fault tolerance.
To make the connection to the notation of the framework as realized by the method of the present invention, the three above-mentioned functions must be instantiated. First, due to the row and column schema, one has that for all keywords κ ∈ K, numPartitions(κ) = c and numReplicas(κ) = r hold, and for all URLs u and κ1, κ2 ∈ K the following must hold:
occLoc((κ1, u))=occLoc((κ2, u)),
i.e. for a URL u, the function occLoc((κ, u)) is independent of the keyword κ. The function hostAssign is also quite simple: let hostAssign(κ, i, j) = (i, j), where i is the row of the host and j is the column of the host in the r×c matrix. Note that if the number of columns c is suitably chosen, then all the component-replicas of any single keyword κ can be read in parallel within the latencyBound. The smallest such number c is

c = ⌈max_(κ ∈ K) |PL(κ)| / buc⌉.
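By way of a non-limiting sketch, this instantiation can be written down as follows; the md5 hash stands in for the keyword-independent URL hash, hosts are identified by (row, column) pairs, and smallest_c evaluates the formula above.

```python
import hashlib
from math import ceil

def make_alltheweb_config(r, c):
    """AllTheWeb Rows and Columns: c partitions and r replicas for every keyword."""
    def occ_loc(occurrence):          # component of (kappa, u), independent of kappa
        _, url = occurrence
        return int(hashlib.md5(url.encode()).hexdigest(), 16) % c + 1

    num_partitions = lambda kappa: c
    num_replicas = lambda kappa: r
    host_assign = lambda kappa, i, j: (i, j)   # host at row i, column j
    return num_partitions, num_replicas, occ_loc, host_assign

def smallest_c(posting_list_sizes, buc):
    """The smallest column count meeting the latency bound (cf. the formula above)."""
    return ceil(max(posting_list_sizes.values()) / buc)
```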
When performing query processing in the AllTheWeb Rows and Columns architecture, one host is selected from each of the c columns; together, such a set of hosts stores one component-replica of every component of every posting list, so that the posting lists of all keywords of a query can be retrieved and intersected in parallel across the columns.
However, AllTheWeb Rows and Columns has several disadvantages. Firstly, the number of hosts accessed for a keyword κ is independent of the length of κ's posting list; c hosts must always be accessed, even for keywords with very short posting lists. Secondly, AllTheWeb Rows and Columns does not take keyword popularity in the query workload into account; every component is replicated r times even if the associated keyword is accessed only quite infrequently. Thirdly, changes in the physical setup of AllTheWeb Rows and Columns are constrained to additions of hosts in multiples of c or r at a time, resulting in an additional row or an additional column in the architecture.
Addition of a new (c host) row is relatively straightforward; addition of a new (r host) column, however, is non-trivial. To illustrate this point, consider an instance of AllTheWeb Rows and Columns with r rows and c columns which uses an associated function occLoc(·) with range {1, . . . , c}. When adding another column, a new function occLoc′(·) with range {1, . . . , c+1} must be selected, and in general
occLoc((κ, u))≠occLoc′((κ, u)),
will hold, so all posting lists need to be repartitioned according to occLoc′(·), which basically results in re-building of the whole index.
2. Fully Adaptive Rows and Columns
Now, a solution according to the present invention that takes into account both the difference in the sizes of posting lists and the difference in popularity of keywords in the query shall be described. The essence of this novel solution is that one instantiates AllTheWeb Rows and Columns differently for each keyword: Each keyword may have a different number of rows and columns. In other words, applying the method of the present invention shall provide a solution with fully adaptive rows and columns.
Consider a keyword κ. Start with an instantiation of numPartitions(κ). Since each host can only retrieve buc units while satisfying the global query latency requirement of latencyBound, PL(κ) is partitioned into

numPartitions(κ) = ⌈|PL(κ)| / buc⌉
components. Thus each component is sized such that it can be read from a single host within the query latency requirement. Note that for keywords having very short posting lists, only one (or very few) components are created, whereas for keywords having very long posting lists, many components are created.
The question now is how many component-replicas should be created for a keyword κ. Recall that component-replicas are created for fault tolerance and in order to distribute the query workload across hosts. To tolerate f unavailable hosts, numReplicas(κ) ≥ f is enforced. To balance the query workload, posting lists of popular keywords (in the query workload) are replicated more often than posting lists of rare keywords. So the number of replicas is made proportional to the arrival rate of the keyword in the workload.
By making numPartitions(κ) and numReplicas(κ) different for each keyword κ, one obtains a number of rows and columns that is specific to each keyword. The number of columns still indicates the number of partitions, and the number of rows indicates the number of replicas for each partition. However, keywords with long posting lists have many columns, and keywords with short posting lists have few columns. Popular keywords have many rows, unpopular keywords have few rows. Compared to AllTheWeb Rows and Columns, Fully Adaptive Rows and Columns results in less imbalance in the sizes of the components for different keywords. Each component-replica is thus normalized in the sense that it has approximately the same size (up to a difference of buc) and approximately the same arrival rate.
There are many different ways of assigning component-replicas to hosts. Given o hosts, one possibility would be to hash each keyword κ onto one of the numbers from 1 to o, and then to assign the components sequentially (mod o) to hosts. Conceptually, for a keyword κ this embeds κ's specific matrix with numPartitions(κ) columns and numReplicas(κ) rows sequentially into the o hosts. Formally, the assignment function has the following form. Let keywHash(·) be a function from K to {1, . . . , o} with the property that Pr[keywHash(κ) = i] = 1/o for i ∈ {1, . . . , o} and κ ∈ K, i.e. keywords are hashed uniformly onto the hosts. Then one can lay out the submatrix for keyword κ in H row by row as follows:
hostAssign(κ, i, j)=(keywHash(κ)+(i−1)·numPartitions(κ)+(j−1)) mod o,
where i ∈ {1, . . . , numReplicas(κ)} and j ∈ {1, . . . , numPartitions(κ)}.
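The following sketch instantiates Fully Adaptive Rows and Columns under the stated assumptions; the parameter alpha, which converts an arrival rate into a replica count, is an illustrative choice not specified hereinabove, and md5 again stands in for the uniform keyword hash.

```python
import hashlib
from math import ceil

def make_fully_adaptive_config(o, buc, pl_size, arrival_rate, f, alpha=1.0):
    """o hosts; posting-list sizes via pl_size(kappa), popularity via arrival_rate(kappa)."""
    def keyw_hash(kappa):                  # uniform hash from K to {1, ..., o}
        return int(hashlib.md5(kappa.encode()).hexdigest(), 16) % o + 1

    def num_partitions(kappa):             # ceil(|PL(kappa)| / buc)
        return ceil(pl_size(kappa) / buc)

    def num_replicas(kappa):               # at least f, growing with popularity
        return max(f, ceil(alpha * arrival_rate(kappa)))

    def host_assign(kappa, i, j):          # lay out kappa's matrix row by row, mod o
        return (keyw_hash(kappa) + (i - 1) * num_partitions(kappa) + (j - 1)) % o

    return num_partitions, num_replicas, host_assign
```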
With this instantiation of hostAssign, the question now is how many component-replicas will be assigned to a host. With Fully Adaptive Rows and Columns the following simple theorem shows that there will not be much imbalance between two hosts with respect to the number of component-replicas.
Theorem 1
Let s be the total number of component-replicas created over all keywords κ, formally

s = Σ_(κ ∈ K) numPartitions(κ) · numReplicas(κ).
Let o be the number of hosts, and assume hostAssign is defined as in the previous paragraph, and assume that s=Ω(o). Then the maximum number of component-replicas at each host h ∈ H is Θ(s/o), i.e. the maximum number of component-replicas assigned to any host is on the order of the mean number of component-replicas assigned to any host.
Proof
Follows from bounds on balls into bins [MR95].
To retrieve the posting list of a keyword κ in processing a search query, select any one host from each of the numPartitions(κ) “virtual columns” of κ's matrix; thus the number of different possibilities for choosing this set of hosts is numReplicas(κ)^numPartitions(κ).
Processing queries with multiple keywords in Fully Adaptive Rows and Columns is much more expensive than in AllTheWeb Rows and Columns. For example, consider a keyword query q={κ1, κ2} where numPartitions(κ1) ≠ numPartitions(κ2). Since the keywords κ1 and κ2 are partitioned differently, the posting list of, say, κ1 must be repartitioned to match the partitioning of κ2, an expensive operation. In addition, there is no guarantee that any components of κ1 and κ2 are co-located at the same host.
3. Two-Class Rows and Columns
A third instantiation of the framework as realized by the method of the present invention is a special case of Fully Adaptive Rows and Columns that results in much simpler (and cheaper) query processing. As in AllTheWeb Rows and Columns, it is assumed that r·c hosts are arranged in the usual matrix of hosts.
For Two-Class Rows and Columns, the keywords are classified along two axes. The first axis is the size of the posting list, where keywords are partitioned into short and long keywords based on the size of their posting lists. The second axis is the arrival rate of the keywords in the query workload, where keywords are partitioned into popular and unpopular keywords based on their arrival rate. This results in four different classes of keywords: short-unpopular (SU), long-unpopular (LU), short-popular (SP) and long-popular (LP) keywords.
The instantiation of the two functions numPartitions(·) and numReplicas(·) from the present framework for the different classes of keywords is shown in Table 2: short keywords have a single partition while long keywords have c partitions, and unpopular keywords have f replicas while popular keywords have r replicas. Note that compared to Fully Adaptive Rows and Columns, in Two-Class Rows and Columns there are only four different types of matrices, one for each class of keywords, as depicted in the appended figure.
Let keywHash(·) be as defined and disclosed in connection with the discussion of Fully Adaptive Rows and Columns hereinabove. Let rowHash(·,·) be a function from K×{1, . . . , f} to {1, . . . , r} such that rowHash(κ, i1) ≠ rowHash(κ, i2) for i1, i2 ∈ {1, . . . , f} with i1 ≠ i2. How the component-replicas of a keyword are assigned to hosts depends on the class of the keyword.
hostAssign(κSU, i, 1) = (rowHash(κSU, i), (keywHash(κSU) mod c) + 1),
hostAssign(κLU, i, j) = (rowHash(κLU, i), j),
hostAssign(κSP, i, 1) = (i, (keywHash(κSP) mod c) + 1),
hostAssign(κLP, i, j) = (i, j).
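A sketch of this class-dependent host assignment is given below; the predicates is_long and is_popular (e.g. thresholds on posting-list size and arrival rate) are assumed to be supplied by the caller, and the particular rowHash shown is merely one illustrative choice satisfying the distinctness requirement when f ≤ r.

```python
import hashlib

def make_two_class_config(r, c, f, is_long, is_popular):
    """Two-Class Rows and Columns on an r x c matrix of hosts (illustrative sketch)."""
    def keyw_hash(kappa):
        return int(hashlib.md5(kappa.encode()).hexdigest(), 16)

    def col(kappa):                        # fixed column for short keywords
        return keyw_hash(kappa) % c + 1

    def row_hash(kappa, i):                # f distinct rows out of r (requires f <= r)
        return (keyw_hash(kappa) + i - 1) % r + 1

    def host_assign(kappa, i, j):
        if is_long(kappa) and is_popular(kappa):   # LP: full matrix
            return (i, j)
        if is_long(kappa):                         # LU: f hashed rows, all columns
            return (row_hash(kappa, i), j)
        if is_popular(kappa):                      # SP: all rows, one hashed column
            return (i, col(kappa))
        return (row_hash(kappa, i), col(kappa))    # SU: f hashed rows, one column

    num_partitions = lambda kappa: c if is_long(kappa) else 1
    num_replicas = lambda kappa: r if is_popular(kappa) else f
    return num_partitions, num_replicas, host_assign
```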
Similar to the previous section, the following theorem can be proved.
Theorem 2
Let s be the total number of component-replicas created over all keywords κ (independent of the class of κ), formally

s = Σ_(κ ∈ K) numPartitions(κ) · numReplicas(κ).
Let o be the number of hosts, and assume hostAssign is defined as in the previous paragraph. Then the maximum number of component-replicas at each host h ∈ H is Θ(s/o), i.e. the maximum number of component-replicas assigned to any host is on the order of the mean number of component-replicas assigned to any host.
Proof
Follows from bounds on balls into bins [MR95].
Given the functions hostAssign(κ, ·, ·) for the different classes of keywords, query processing in Two-Class Rows and Columns is not very complicated, and there are usually many choices of how to process a query. Table 3 describes how to process a query q={κ1, κ2} with two keywords; query processing for queries with more than two keywords is analogous.
The set of possible sets of hosts to choose for query processing is straightforward given this discussion.
Based on the instantiation framework as provided by the method of the present invention and discussed hereinabove, a few practical considerations will be outlined in the following.
Firstly, applying the method of the present invention allows an extension of the search system framework that permits each keyword to have more than a single row-and-column instance. This shall be described immediately below.
In the above-mentioned embodiments it has so far been assumed that for each keyword κ there are two functions numPartitions(κ) and numReplicas(κ). However, for performance reasons, one may partition the posting list of a keyword κ in more than one way and possibly have a different number of replicas for the different partitionings. For example, in Two-Class Rows and Columns, the posting list for an SP keyword κSP is replicated across one column. In addition to this replication, the posting list of κSP might also be partitioned across one row, because κSP often co-occurs with a second, LP keyword κLP that is partitioned across all rows.
Applying the method of the invention, this extension can be characterized by associating sets of functions from the resulting framework with each keyword; for example, a keyword κ could have two sets of functions {numPartitions1(κ), numReplicas1(κ)} and {numPartitions2(κ), numReplicas2(κ)}. The number of sets could be keyword-dependent. This greatly increases the possible choices for query processing. However, this extension shall not be introduced formally herein, since it is conceptually straightforward.
Secondly, a person skilled in the art shall realize that applying the method of the invention to creating frameworks for real search systems, including enterprise search systems, may allow for various optimizations thereof. Such optimizations shall take the method of the present invention as their point of departure, but their reduction to practice is considered to lie outside the scope of the present invention, and they shall hence not be elaborated further herein.
The method of the present invention realizes a framework for distributing the index of a search engine across several hosts in a computing cluster. The framework as disclosed distinguishes three orthogonal mechanisms for distributing a search index: Index partitioning, index replication, and assignment of replicas to hosts. Instantiations of these mechanisms yield different ways of distributing the index of a search engine, including popular methods from the literature and novel methods that by far outperform the prior art in terms of resource usage and performance while achieving the same level of fault tolerance.
Further the method of the present invention for the first time recognizes that different keywords and different documents in a search engine might have different properties (such as length or frequency of access). The framework realized by applying the method of the present invention creates a configuration of the index of a search engine according to these properties. The framework also serves to outline how to process queries for the space of configurations made possible by its realizations.
Instantiations of this framework moreover lead to existing index configurations disclosed in prior art as well as novel index configurations that are not possible in the prior art.
Priority is claimed from U.S. provisional application No. 61013705, filed December 2007.