System and method for retrieving documents based on mixture models

Information

  • Patent Application
  • 20050228778
  • Publication Number
    20050228778
  • Date Filed
    April 05, 2004
  • Date Published
    October 13, 2005
Abstract
A system and method implemented by a computer for performing query-based electronic document retrieval, implementing a Markov process model adapted for determining a relationship or relevance between documents. The system ranks documents for retrieval based on their relevance measure. The model calculates a measure of the relevance of a document from a given database to a given query. The method learns the Markov model's mixture coefficients from the document database so as to maximize the relevance measure of the documents being retrieved. The method requires only that a similarity measure, D(d,d′), between two documents be specified. Any existing method may be used for generating a model that is at least as good as the chosen similarity measure.
Description
BACKGROUND OF THE INVENTION

1. Field of the Invention


The present invention relates generally to document retrieval systems, and particularly to query-driven systems for the automated retrieval of electronic documents.


2. Description of the Prior Art


Document retrieval systems typically comprise document databases and the ability to determine the relevance of each document to any given query. Typically, queries consist of a few words, but in principle they are themselves documents. Thus the core problem of building a document retrieval system is the task of estimating the relevance of two documents when such relevance judgments are generally subjective.


Recent approaches developed to solve this task are based on the “physical” properties of the documents (i.e., on the words they contain and how frequently they appear; see, for example, the reference to Baeza-Yates, R. and Ribeiro-Neto, B. entitled “Modern Information Retrieval”, Addison Wesley, 1999, and the reference to van Rijsbergen, C. J. entitled “Information Retrieval”, Butterworths, 1979).


While such word-based approaches are useful, it would be further desirable to provide a system and method for retrieving electronic (stored) documents that does not require explicit use of semantic meaning (e.g., a thesaurus).


SUMMARY OF THE INVENTION

It is an object of the present invention to apply a discrete time Markov process for determining the relationship or relevance between electronic documents in a document search and retrieval system and method. Basically, the current invention provides a method, computer program product and system for constructing a probabilistic model for ranking documents in a database based on their probability of relevance to a given query. The method is based on a mixture model trained using various techniques.


According to one aspect of the invention, there is provided a document retrieval approach that models document-document relevance using a discrete-time, static, Markov process (see, for example, the reference to Jelinek, F. entitled “Statistical Aspects of Speech Recognition”, MIT Press, 1998) in the document space. A discrete-time, static, Markov process is produced by a system that changes state at regular time steps according to fixed state-to-state transition probabilities. The states in the inventive model correspond to each document in a given database.


A Markov process over document states leads then to a sequence of documents that can link a query document with any document in a given database. Given a query, the probability of reaching a given document after “n” time steps is estimated by summing over all possible paths to the state corresponding to the document. For different tasks, databases and queries, the relevance of each time step in the Markov process will vary. The method of the invention models these variations from the data. In addition, the inventive model specifies the state-to-state transition probabilities and the initial state probabilities.
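
By way of a simple, non-limiting illustration (the numbers below are hypothetical and not taken from the invention), this path summation is equivalent to repeated multiplication by the transition matrix:

```python
import numpy as np

# Toy illustration (an assumption for exposition): with a column-stochastic
# transition matrix T, where T[i, j] is the probability of stepping from
# document j to document i, the state distribution after n steps is
# p_n = T^n p_0, which implicitly sums over every possible n-step path.
T = np.array([[0.7, 0.3, 0.2],
              [0.2, 0.4, 0.3],
              [0.1, 0.3, 0.5]])            # 3-document example; each column sums to 1
p0 = np.array([1.0, 0.0, 0.0])             # all probability starts on the query state

p3 = np.linalg.matrix_power(T, 3) @ p0     # distribution over documents after 3 steps
print(p3, p3.sum())                        # still sums to 1
```

Computing the nth matrix power accumulates the contribution of every possible n-step path, which is the summation described above.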




BRIEF DESCRIPTION OF THE DRAWINGS

Further features, aspects and advantages of the apparatus and methods of the present invention will become better understood with regard to the following description, appended claims, and the accompanying drawings where:



FIG. 1 is a general block diagram depicting the underlying system architecture 100 for employing the document retrieval technique implementing a discrete time, static Markov process in the document space according to the present invention; and,



FIG. 2 is a trellis diagram that is built for each query according to the principles of the invention.




DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The probabilistic model for retrieving documents according to the invention is capable of being implemented in a computer or networked computer system as described in greater detail hereinbelow with respect to FIG. 1. According to the invention, the probabilistic model is based on the Markov process represented by the trellis diagram 60 shown in FIG. 2. Such a trellis is built for each query, with each node 62, 64, etc., in a column corresponding to a document, and each column 72, 74, etc., corresponding to a step in the Markov process. Thus, the node 80 at row “D”, column “T” stores the probability that the Dth document is relevant after T time steps. Each line connecting one node to another, in essence, represents the transition probability from document “i” to document “j” when the line goes from node i to node j. Thus, the diagram represents the Markov process model in that, given some initial state with its associated probabilities, multiplying those probabilities by the Markov transition matrix elements (as will be described herein) yields the probabilities associated with the next state. Only a subset of the possible transitions from document states at time t=0 to t=1 (one column to the next column in FIG. 2) is shown; however, it is understood that, according to the Markov process, each step simply requires the probability vector of a column to be multiplied by the Markov transition matrix.
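
By way of illustration only, the following sketch shows one way such a trellis might be filled column by column; the function name and array conventions are assumptions made for exposition and are not prescribed by the invention:

```python
import numpy as np

# Sketch of the trellis in FIG. 2 (illustrative assumption, not code from the
# patent): column t holds P(d | t, q) for every document d, and each column is
# obtained from the previous one by one application of the transition matrix.
def build_trellis(T, p0, n_steps):
    """T[i, j] = T(d_i | d_j); p0[d] = P(d | t=0, q); returns an (n_docs, n_steps) array."""
    n_docs = len(p0)
    trellis = np.zeros((n_docs, n_steps))
    trellis[:, 0] = p0
    for t in range(1, n_steps):
        # Equation (2): P(d | t, q) = sum over d' of T(d | d') P(d' | t-1, q)
        trellis[:, t] = T @ trellis[:, t - 1]
    return trellis
```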


In order to make this model explicit, P(d|q) represents the probability that document d is relevant to query q. This may be represented by the equation (1) as follows:
P(d|q) = ∑_t P(d|t,q) P(t|q).   (1)

where P(d|t,q) is the probability that “d” is relevant to “q” at time step “t” and P(t|q) is the probability that the tth time step is relevant to “q”. Assuming that the Markov property holds, there is obtained equation (2) as follows:
P(d|t,q) = ∑_{d′} T(d|d′) P(d′|t−1,q).   (2)

where T(d|d′) is the transition probability from state d to d′. Equation (2) defines the tth order semantic relevance, P(d|t,q), as a Markov process where T(d|d′) is the Markov state-to-state transition matrix. Equation (1) asserts that the true relevance probability can be modeled as a weighted mixture of these tth order semantics. Solving the system of equations represented by Equation (2) gives:
P(d|t,q) = ∑_{d′} P(d|d′,t) P(d′|t=0,q)   (3)

where P(d|t=0,q) is the initial state of document relevance and P(d|d′,t) is the Markov transition matrix T(d|d′) raised to the “t” power. For example,
P(d|d′,2) = ∑_{d″} T(d|d″) T(d″|d′).   (4)


Combining Equations (1) and (3) gives the general form of the Markov model:
P(d|q) = ∑_{t,d′} P(d|d′,t) P(t|q) P(d′|t=0,q).   (5)


Without loss of generality, equation (6) defines T(d|d′), i.e., the transition probability from state d to d′:
T(d|d′) = e^{−βD(d,d′)} / ∑_{d″} e^{−βD(d″,d′)}   (6)

where β is a tuning parameter and D(d,d′) is a function measuring the relevance match between d and d′; the numerator thus takes a value between 0 and 1 (i.e., it is the exponential of the negative of a quantity ranging between 0 and infinity). The denominator term is simply the sum that normalizes it into a probability. It is assumed that the distance measure D(d,d′) does not correspond exactly to human relevance ranking; otherwise, it would not be necessary to proceed further. Instead, assuming that the distance measure D(d,d′) is imperfect, the Markov model is used to improve upon it. It is further assumed that P(d|t=0,q) obeys the same relationship. This assumption is reasonable given the interpretation that the system begins with all of its probability localized on state q; thus P(d|t=0,q)=T(d|q) is the state of the system after the first time step.
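
As a non-limiting sketch of equation (6) (assuming a precomputed document-document distance matrix D and the column-wise normalization described above), the transition matrix may be computed as a per-column softmax over the negated, scaled distances:

```python
import numpy as np

# Illustrative sketch of equation (6); the choice of D is left open by the
# invention (e.g., an LSA-based measure), so D here is simply assumed given.
def transition_matrix(D, beta=1.0):
    """D[i, j] = D(d_i, d_j); returns T with T[i, j] = T(d_i | d_j)."""
    logits = -beta * D
    logits -= logits.max(axis=0, keepdims=True)       # for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=0, keepdims=True)     # normalize each column
```

Each column of the returned matrix sums to one, so it can serve directly as the state-to-state transition matrix of the Markov process.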


Thus, the problem of estimating P(d|q) is transformed into the sub-problems of estimating D(d,d′) and P(t|q), the latter herein referred to as a mixture coefficient representing the probability that the tth time step is relevant to a query document q. That is, the purpose of the mixture coefficients is to find the correct mixture of Markov probability time steps.


The system runs by initializing the first column of the trellis with the probabilities P(d|t=0,q), using T(d|d′) to fill the subsequent columns, and using P(t|q) to combine the column probabilities into the final relevance probabilities. It is understood that, as Markov transition matrices can be very large, the system may reduce them, i.e., make them “sparse matrices”, by removing all element values below a threshold (i.e., small values that have little or no impact on retrieval accuracy). This enables a host of sparse matrix algorithms, known to those skilled in the art, to be used to accelerate the run-time of the algorithm, thus making the system more useful. A Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) measure may then be used to obtain the value D(d,d′); however, it is understood that any suitable measure can be used. All that is left is to specify P(t|q), the mixture coefficient.
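
The following sketch illustrates, under assumed conventions, one way the run-time loop might be realized, including the thresholding of the transition matrix into a sparse matrix; the helper names and the use of scipy.sparse are illustrative choices rather than requirements of the invention:

```python
import numpy as np
from scipy import sparse

# Illustrative end-to-end scoring loop (an assumption about how the pieces fit
# together, not verbatim from the patent): threshold the dense transition
# matrix into a sparse one, propagate the trellis columns, and mix them with
# the coefficients P(t).
def rank_documents(T_dense, p0, p_t, threshold=1e-4):
    """T_dense: column-stochastic transition matrix; p0: P(d | t=0, q);
    p_t: mixture coefficients P(t), one per time step."""
    T_sparse = sparse.csr_matrix(np.where(T_dense >= threshold, T_dense, 0.0))
    column = p0
    relevance = p_t[0] * column
    for t in range(1, len(p_t)):
        column = T_sparse @ column          # fill the next trellis column
        relevance += p_t[t] * column        # equation (1): weighted mixture over t
    return np.argsort(-relevance), relevance
```

Note that thresholding slightly perturbs the column normalization; the assumption, as noted above, is that such small entries have little or no impact on retrieval accuracy.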


Preferably, Maximum Likelihood (ML) or Minimum Square Error (MSE) estimates may be implemented; however, it is understood that other cost minimization methods may be used, as would be known to skilled artisans familiar with machine learning. It is additionally understood that these methods may result in non-probabilistic measures of relevance, and hence the methods may generate what is alternately referred to herein as a relevance measure(s). The step of estimating the mixture coefficients P(t|q) is now described. Use may first be made of the simplifying assumption that P(t|q)=P(t), i.e., “t” is independent of “q”. This assumption makes the learning algorithms more tractable at the expense of weakening the model. However, using methods known to skilled artisans, the mixture coefficients may be estimated without this simplifying assumption.


The ML estimate is given by equation (7) as follows:
P(t) = argmax_{P(t)} ∏_q ∑_d P(d|q) R(d|q)   (7)

where R(d|q) is a binary, hand-labeled relevance judgment: if document d is relevant to query q then R(d|q)≡1; otherwise R(d|q)≡0. Note that the ML estimate attempts to maximize
∑_d R(d|q) P(d|q)   (8)

which is the total probability of the documents relevant to q. Using the EM algorithm (with the constraint that ∑_t P(t)=1) gives the following update rule:
P_{τ+1}(t) = (1/Q) ∑_{q=1}^{Q} [ ∑_{d,d′} P(d|d′,t) P(d′|t=0,q) R(d|q) P_τ(t) ] / [ ∑_{t′,d,d′} P(d|d′,t′) P(d′|t=0,q) R(d|q) P_τ(t′) ]   (9)
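
By way of exposition only, one EM iteration of equation (9) might be coded as follows; the array conventions and the binary label matrix R are assumptions for illustration:

```python
import numpy as np

def em_update(p_t, T, p0_per_query, R):
    """One EM update of the mixture coefficients P(t), equation (9).
    p_t: current estimate of P(t); T: column-stochastic transition matrix;
    p0_per_query[q]: the initial column P(d | t=0, q); R[q, d]: binary relevance labels.
    Each query is assumed to have at least one labeled relevant document."""
    n_steps = len(p_t)
    Q = len(p0_per_query)
    new_p_t = np.zeros(n_steps)
    for q in range(Q):
        column = p0_per_query[q]
        relevant_mass = np.empty(n_steps)      # sum over d of P(d|t,q) R(d|q), per step t
        for t in range(n_steps):
            if t > 0:
                column = T @ column            # next trellis column, equation (2)
            relevant_mass[t] = R[q] @ column
        resp = p_t * relevant_mass             # numerator of equation (9), per t
        new_p_t += resp / resp.sum()           # normalize over t (the denominator)
    return new_p_t / Q                         # average over the Q queries
```

Iterating this update until the mixture coefficients converge yields the ML estimate under the simplifying assumption P(t|q)=P(t).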


The MSE estimate of P(t) is obtained by solving equation (10) as follows:
∂/∂P(t) { ∑_{d,q} [P(d|q) − R(d|q)]² − 2λ [ ∑_t P(t) − 1 ] } = 0   (10)

where λ is a Lagrange multiplier. The above equation leads to equation (11) as follows:
∑_t P(t) ∑_{d,q} P(d|t,q) P(d|k,q) − ∑_{d,q} R(d|q) P(d|k,q) = λ.   (11)


Next, a matrix “B” is defined such that B_{tk} ≡ ∑_{d,q} P(d|t,q) P(d|k,q), a vector “A” is defined such that A_k ≡ ∑_{d,q} R(d|q) P(d|k,q), a vector “P” is defined such that P_t = P(t), and a vector “e” is defined such that e_i = 1. Thus, equation (11) may be rewritten as:

BP−A+λe=0.   (12)


Solving for P, utilizing the fact that e^T P = 1, results in equation (13) as follows:
P = B^{−1}A − [(e^T B^{−1}A − 1)/(e^T B^{−1}e)] B^{−1}e.   (13)
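
For exposition, the closed-form solution of equation (13) might be computed as follows, assuming the trellis columns P(d|t,q) have already been computed for each query:

```python
import numpy as np

# Minimal sketch of the MSE solution (13); the trellis arrays and label matrix
# are assumed inputs, consistent with the conventions used in the earlier sketches.
def mse_mixture(trellises, R):
    """trellises[q]: (n_docs, n_steps) array of P(d | t, q); R[q, d]: relevance labels."""
    n_steps = trellises[0].shape[1]
    B = np.zeros((n_steps, n_steps))
    A = np.zeros(n_steps)
    for q, tr in enumerate(trellises):
        B += tr.T @ tr        # B[t, k] = sum over d of P(d|t,q) P(d|k,q), accumulated over q
        A += tr.T @ R[q]      # A[k]    = sum over d of R(d|q)  P(d|k,q), accumulated over q
    e = np.ones(n_steps)
    B_inv_A = np.linalg.solve(B, A)
    B_inv_e = np.linalg.solve(B, e)
    lam = (e @ B_inv_A - 1.0) / (e @ B_inv_e)   # Lagrange multiplier enforcing e.T P = 1
    return B_inv_A - lam * B_inv_e              # equation (13); sums to 1 by construction
```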


For both of these algorithms, it is necessary to truncate P(t) at some value of “t” by forcing P(t)=0 for all “t” above some time step. The threshold may be set, for example, to ten (10) time steps; in several experiments with larger truncation times, little or no relevance measure was found to exist above that point.


Preferably, the system and method of the invention include a step of learning the Markov model's mixture coefficients from a document database so as to maximize the relevance measure of the relevant documents being retrieved. As described herein, however, the method requires only that a similarity measure, D(d,d′), between two documents be specified. It is understood that the ML and MSE methods described herein embody examples of training the system parameters. Furthermore, the system may be designed to adapt, in real time, to changes in the document database. For example, as documents are removed from or added to the database, the algorithms may be iterated, or additional algorithms may be used, to enable the system parameters to adapt to the changes in the document database in real time.


It is further understood that, before performing document retrieval, all documents may or may not be subject to one or more of the following text preprocessing methods including, but not limited to: punctuation removal, stopping, stemming and vectorization. Each step discards some information assumed to be nearly irrelevant for retrieval purposes. The goal of the preprocessing is to remove content from the documents that can be considered “noise” and thereby improve retrieval performance. “Punctuation removal” is self-explanatory. “Stopping” involves creating a list of “stop” words, all occurrences of which are removed from each of the documents. The stop list eliminates words that are viewed as playing no semantic role (see, for instance, the reference to Frakes, W. B. and Baeza-Yates, R., entitled “Information Retrieval, Data Structures and Algorithms”, Prentice Hall, 1992). Words removed include, but are not limited to: articles, prepositions, verbs of common use (e.g., “to be”, “to have”, etc.) and other functional words. While it is not necessary, topic-dependent stop lists may be used. “Stemming” comprises the step of replacing all morphological variations of a single word (e.g., “interest”, “interests”, “interesting”, “interested”, etc.) with a single term carrying the same semantic information. “Vectorization” takes the resulting documents and converts them into vectors, each element of which corresponds to a specific term appearing in a given document. It is understood that other pre-processing methods exist that could be implemented with the invention.
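
Purely by way of example (the stop list, the crude stemming rule, and the vocabulary handling below are illustrative placeholders, not requirements of the invention), such a preprocessing chain might be sketched as:

```python
import re
from collections import Counter

# Illustrative preprocessing sketch; the stop word set and the suffix-stripping
# "stemmer" are stand-ins for whatever stop list and stemmer are actually chosen.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "be", "have", "in"}

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text.lower())                 # punctuation removal
    tokens = [w for w in text.split() if w not in STOP_WORDS]    # stopping
    tokens = [w.rstrip("s") for w in tokens]                     # crude stand-in for stemming
    return tokens

def vectorize(tokens, vocabulary):
    """Vectorization: one element per vocabulary term, holding its raw count."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]
```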


All the resulting vectors are organized to form the columns of the term-by-document matrix, “M”. The element Mij accounts for the role played by term “i” in modeling the content of document “j” and can be written according to equation (14) as follows:

Mij=G(i)L(i,j)   (14)

where G(i) is called the global weight and L(i,j) is called the local weight. The local weight is the number of times term “i” appears in document “j”.


The global weight emphasizes the terms that allow a better discrimination between documents. While the literature presents several weighting schemes (see, for instance, the reference to Dumais, S. entitled “Improving the Retrieval of Information from External Sources”, Behavior Research Methods, Instruments and Computers, Vol. 23, pp. 229-236 (1991)), the following weighting schemes may be used: term frequency (tf), inverse data frequency (idf), and entropy weighting (ent).


When term frequency is used, G(i) is equal to 1 and the element Mij corresponds to the number of times the term appears in the document. This is based on the hypothesis that when a term appears many times it is more representative of the document content. One problem with using term frequency is that a term appearing in many documents, and thus providing little discrimination ability, can be weighted more heavily than terms that appear fewer times but, because they occur in fewer documents, are more discriminative. This is especially a problem if no stop list is used.


The inverse data frequency approach attempts to avoid this problem by using a global weight defined according to equation (15) as:

G(i) = 1 + log₂(N/df(i))   (15)

where “N” is the total number of documents in the database and df(i) is the number of documents including the term “i”. With this weighting, the terms that appear in few documents are enhanced with respect to other terms. This is desirable because terms that appear in few documents are a good basis for discrimination. The problem with this weighting scheme is that a term appearing only once in a document can be significantly emphasized even if it is not really representative of the document content.


To overcome the problem of over-emphasis in inverse data frequency weighting, one can use entropy weighting where the global weight is defined according to equation (16) to be:
G(i) = 1 − ( ∑_j −p(i,j) log p(i,j) ) / log(N)   (16)

where p(i,j) is the probability of finding term “i” in document “j” and the log(N) term corresponds to the entropy of the least discriminative distribution possible (i.e., a uniform p(i,j)). This weighting scheme gives more emphasis to the terms appearing many times in few documents. The second term in equation (16) is the relative entropy of the distribution of a term across the documents (i.e., entropy weighting). The relative entropy reaches its maximum (corresponding to 1) when a term is uniformly distributed across the documents; in this case the weight of the term is zero. Conversely, the relative entropy is close to zero (and the corresponding weight is close to 1) when the term appears in few documents with a high frequency.
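
As an illustrative sketch (assuming a raw term-count matrix L as the local weights and the logarithm bases used in equations (15) and (16) above), the three global weighting schemes and the weighted term-by-document matrix of equation (14) might be computed as follows:

```python
import numpy as np

def global_weights(L, scheme="entropy"):
    """L[i, j] = raw count of term i in document j (the local weight); every
    vocabulary term is assumed to appear in at least one document."""
    n_docs = L.shape[1]
    if scheme == "tf":
        return np.ones(L.shape[0])                    # G(i) = 1: raw term frequency only
    if scheme == "idf":
        df = np.count_nonzero(L, axis=1)              # number of documents containing term i
        return 1.0 + np.log2(n_docs / df)             # equation (15)
    # entropy weighting, equation (16)
    p = L / L.sum(axis=1, keepdims=True)              # p(i, j)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    return 1.0 + plogp.sum(axis=1) / np.log(n_docs)   # equals 1 minus the relative entropy

def term_by_document(L, scheme="entropy"):
    return global_weights(L, scheme)[:, None] * L     # M_ij = G(i) L(i, j), equation (14)
```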



FIG. 1 is a general block diagram depicting the underlying system architecture 100 for employing the document retrieval technique of the present invention. In its most general form, the architecture may comprise a client-server system in which multiple users (clients) 40 may search a database 20 of electronic documents. The clients submit their queries across a network 35, such as the Internet or like public or private network, to a server device 5 employing software for executing the technique of the invention. For example, the server device 5 may be a Web server such as employed by Yahoo® or Google™ for performing searches for clients over the Internet and may include a Central Processing Unit (CPU) 10, and disk 15 and main memory 30 systems where the method is executed. The CPU is the component in which most of the calculations are performed. The documents to be searched via the techniques of the invention may additionally be stored on a medium such as a hard drive disk 15 or other high-density magnetic or optical media storage device. In addition, a main memory 30 is available in order to perform the calculations in the CPU. Optionally, a cache 25 may be provided in order to speed up the calculations for this system. After processing the documents employing the technique of the invention, including the generation of a trellis for each query, the document search results are returned across the network to the one or more client(s) 40. It should be understood by skilled artisans that this is not the only system architecture which may support such a system, though it is the most general one. Another possible architecture could include a desktop personal computer system with a keyboard, screen and user interface, which may be more suitable in a single-client environment.


While the invention has been particularly shown and described with respect to illustrative and preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention which should be limited only by the scope of the appended claims.

Claims
  • 1. A method of implementing a computer for retrieving documents in a database comprising a plurality of electronic documents, said method comprising the steps of: a) receiving a query document, b) applying a Markov Process for calculating a relevance measure of each document in said database to the query document; and c) outputting results of those database documents most relevant to the query document from said applied Markov process.
  • 2. The method of implementing a computer as claimed in claim 1, wherein said step b) of implementing a Markov Process includes a step of: determining a state-to-state Markov Transition matrix having elements, whereby the measure of a document, d, being relevant at time t is equal to the Markov transition matrix multiplied by the probability of a document being relevant at time t−1, wherein the measure of a document being relevant to a query document is a function of time.
  • 3. The method of implementing a computer as claimed in claim 2, wherein for any time t, the additional step of obtaining a different set of relevance measures for each said documents, said method including determining a best mixture of those relevance measures at each time step.
  • 4. The method of implementing a computer as claimed in claim 3, wherein said step of determining a best mixture of those relevance measures at each time step comprises: calculating a set of weighted averages, a relevance measure being multiplied by a weight and then summed according to the following: P(d|q) = ∑_{t,d′} P(d|d′,t) P(t|q) P(d′|t=0,q), where P(d|q) is the probability that document d is relevant to given query q independent of time, P(t|q) is a mixture coefficient representing the probability that a tth time step is relevant to a query document q, P(d|d′,t) is the probability of a transition from document state d′ to a state d as dictated by said Markov Transition matrix at the tth time step, and P(d′|t=0,q) is the initial state of document relevance.
  • 5. The method of implementing a computer as claimed in claim 4, wherein the Markov Transition matrix is defined as
  • 6. The method of implementing a computer as claimed in claim 5, further including the step of: providing a similarity relevance measure between two documents d, d′ for said distance function D.
  • 7. The method of implementing a computer as claimed in claim 6, wherein the similarity relevance measure includes one of: implementing a Maximum Likelihood estimate, or a Minimum Square Error estimate, or like cost minimization estimate.
  • 8. The method of implementing a computer as claimed in claim 6, further including the steps of learning the Markov models mixture coefficients from the document database so as to maximize the relevance measure of the documents being retrieved.
  • 9. The method of implementing a computer as claimed in claim 5, further including the step of training the parameters of the system.
  • 10. The method of implementing a computer as claimed in claim 5, further including the step of adapting system parameters in real time.
  • 11. The method of implementing a computer as claimed in claim 4, further including the step of applying document pre-processing to remove noise content from the document for improving document retrieval performance.
  • 12. The method of implementing a computer as claimed in claim 4, wherein said step of determining a best mixture of those relevance measures further includes the step of reducing said Markov Transition matrix into a sparse matrix by removing all element values below a threshold; and, implementing a sparse matrix method to accelerate the determination.
  • 13. A system for retrieving documents in a database comprising a plurality of electronic documents, said system comprising: means for receiving a query document, means implementing a Markov Process for calculating the relevance measure of each document in said database to the query document; and means for outputting results of those database documents most relevant to the query document resulting from said applied Markov process.
  • 14. The system as claimed in claim 13, wherein said means for implementing a Markov Process includes functionality adapted to: determine a state-to-state Markov Transition matrix having elements, whereby the measure of a document, d, being relevant at time t is equal to the Markov transition matrix multiplied by the probability of a document being relevant at time t−1, wherein the measure of a document being relevant to a query document is a function of time.
  • 15. The system as claimed in claim 14, wherein said means for implementing a Markov Process includes functionality adapted to: obtain, for any time t, a different set of relevance measures for each said documents, and determine a best mixture of those relevance measures at each time step.
  • 16. The system as claimed in claim 15, wherein said means for implementing a Markov Process for determining a best mixture of those relevance measures at each time step further includes functionality adapted to: calculate a set of weighted averages, a relevance measure being multiplied by a weight and then summed according to the following:
  • 17. The system as claimed in claim 16, wherein said Markov Transition matrix is defined as
  • 18. The system as claimed in claim 17, further including means for providing a similarity relevance measure between two documents d, d′ for said distance function D.
  • 19. The system as claimed in claim 18, wherein the similarity relevance measure includes one of: implementing a Maximum Likelihood estimate, or a Minimum Square Error estimate, or like cost minimization estimate.
  • 20. The system as claimed in claim 18, wherein said means for implementing a Markov Process includes functionality adapted to: learn the Markov models mixture coefficients from the document database so as to maximize the relevance measure of the documents being retrieved.
  • 21. The system as claimed in claim 17, further including: means for training the parameters of the system.
  • 22. The system as claimed in claim 17, further including: means for adapting system parameters in real time.
  • 23. The system as claimed in claim 13, further including means for applying document pre-processing to remove noise content from the document for improving document retrieval performance.
  • 24. The system as claimed in claim 16, wherein said means for implementing a Markov Process further comprises means for reducing said Markov Transition matrix into a sparse matrix by removing all element values below a threshold; and, means for implementing a sparse matrix method to accelerate the determination of said best mixture of those relevance measures.
  • 25. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for retrieving documents in a database comprising a plurality of electronic documents, said method steps comprising: a) receiving a query document, b) applying a Markov Process for calculating the relevance measure of each document in said database to the query document; and c) outputting results of those database documents most relevant to the query document from said applied Markov process.
  • 26. The program storage device readable by a machine as claimed in claim 25, wherein said method step b) of implementing a Markov Process includes a step of: determining a state-to-state Markov Transition matrix having elements, whereby the relevance measure of a document, d, being relevant at time t is equal to the Markov transition matrix multiplied by the probability of a document being relevant at time t−1, wherein the measure of a document being relevant to a query document is a function of time.
  • 27. The program storage device readable by a machine as claimed in claim 26, wherein for any time t, the step of obtaining a different set of relevance measures for each said documents, said method steps further including a step of determining a best mixture of those relevance measures at each time step.
  • 28. The program storage device readable by a machine as claimed in claim 27, wherein said step of determining a best mixture of those relevance measures at each time step comprises: calculating a set of weighted averages, a relevance measure being multiplied by a weight and then summed according to the following:
  • 29. The program storage device readable by a machine as claimed in claim 28, wherein the Markov Transition matrix is defined as
  • 30. The program storage device readable by a machine as claimed in claim 29, further including the step of: providing a similarity relevance measure between two documents d, d′ for said distance function D.
  • 31. The program storage device readable by a machine as claimed in claim 30, wherein the similarity measure includes one of implementing a Maximum Likelihood estimate, or a Minimum Square Error estimate, or like cost minimization estimate.
  • 32. The program storage device readable by a machine as claimed in claim 30, further including the steps of learning the Markov models mixture coefficients from the document database so as to maximize the relevance measure of the documents being retrieved.
  • 33. The program storage device readable by a machine as claimed in claim 28, wherein said step of determining a best mixture of those relevance measures further includes the step of reducing said Markov Transition matrix into a sparse matrix by removing all element values below a threshold; and, implementing a sparse matrix method to accelerate the determination.