Claims
- 1. A method of indexing a database of documents, comprising:
providing a vocabulary of n terms; indexing the database in the form of a non-negative n×m index matrix V, wherein:
m is equal to the number of documents in the database; n is equal to the number of terms used to represent the database; and the value of each element vij of index matrix V is a function of the number of occurrences of the ith vocabulary term in the jth document; factoring out non-negative matrix factors T and D such that V≈TD; and wherein T is an n×r term matrix, D is an r×m document matrix, and r<nm/(n+m).
- 2. The method of claim 1 further comprising deleting said index matrix V.
- 3. The method of claim 2 further comprising deleting said term matrix T.
- 4. The method of claim 1 wherein r is at least one order of magnitude smaller than n.
- 5. The method of claim 1 wherein r is from two to three orders of magnitude smaller than n.
- 6. The method of claim 1 wherein entries of said document matrix D falling below a predetermined threshold value t are set to zero.
- 7. The method of claim 2 wherein r is at least one order of magnitude smaller than n.
- 8. The method of claim 2 wherein r is from two to three orders of magnitude smaller than n.
- 9. The method of claim 2 wherein entries of said document matrix D falling below a predetermined threshold value t are set to zero.
- 10. The method of claim 3 wherein r is at least one order of magnitude smaller than n.
- 11. The method of claim 3 wherein r is from two to three orders of magnitude smaller than n.
- 12. The method of claim 3 wherein entries of said document matrix D falling below a predetermined threshold value t are set to zero.
- 13. The method of claim 1 wherein said factoring out of non-negative matrix factors T and D further comprises:
selecting a cost function and associated update rules from the group:
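cost function

$$F=\sum_{i=1}^{n}\sum_{j=1}^{m}\left[V_{ij}\log (TD)_{ij}-(TD)_{ij}\right]$$

associated with update rules

$$T_{ik}\leftarrow T_{ik}\sum_{j}\frac{V_{ij}}{(TD)_{ij}}D_{kj},\qquad T_{ik}\leftarrow \frac{T_{ik}}{\sum_{l}T_{lk}},\qquad\text{and}\qquad D_{kj}\leftarrow D_{kj}\sum_{i}T_{ik}\frac{V_{ij}}{(TD)_{ij}},$$

and cost function

$$F=\sum_{i=1}^{n}\sum_{j=1}^{m}\left[V_{ij}\log\frac{V_{ij}}{(TD)_{ij}}-V_{ij}+(TD)_{ij}\right]$$

associated with update rules

$$D_{kj}\leftarrow D_{kj}\frac{\sum_{i}T_{ik}\,V_{ij}/(TD)_{ij}}{\sum_{l}T_{lk}}\qquad\text{and}\qquad T_{ik}\leftarrow T_{ik}\frac{\sum_{j}D_{kj}\,V_{ij}/(TD)_{ij}}{\sum_{h}D_{kh}},$$

and cost function

$$\lVert V-TD\rVert^{2}=\sum_{i=1}^{n}\sum_{j=1}^{m}\left(V_{ij}-(TD)_{ij}\right)^{2}$$

associated with update rules

$$D_{kj}\leftarrow D_{kj}\frac{(T^{T}V)_{kj}}{(T^{T}TD)_{kj}}\qquad\text{and}\qquad T_{ik}\leftarrow T_{ik}\frac{(VD^{T})_{ik}}{(TDD^{T})_{ik}};$$

and iteratively calculating said update rules so as to converge said cost function toward a limit until the distance between V and TD is reduced to or beyond a desired value.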
- 14. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for indexing a database of documents, said method steps comprising:
providing a vocabulary of n terms; indexing the database in the form of a non-negative n×m index matrix V, wherein:
m is equal to the number of documents in the database; n is equal to the number of terms used to represent the database; and the value of each element vij of index matrix V is a function of the number of occurrences of the ith vocabulary term in the jth document; factoring out non-negative matrix factors T and D such that V≈TD; and wherein T is an n×r term matrix, D is an r×m document matrix, and r<nm/(n+m).
- 15. A database index, comprising:
an r×m document matrix D, such that V≈TD wherein T is an n×r term matrix; V is a non-negative n×m index matrix, wherein each of its m columns represents a jth document having n entries containing the value of a function of the number of occurrences of an ith term appearing in said jth document; and wherein T and D are non-negative matrix factors of V and r<nm/(n+m); and wherein each of the m columns of said document matrix D corresponds to said jth document.
- 16. A method of information retrieval, comprising:
providing a query comprising a plurality of search terms; providing a vocabulary of n terms; performing a first pass retrieval through a first database representation and scoring m retrieved documents according to relevance to said query; executing a second pass retrieval through a second database representation and scoring documents retrieved from said first pass retrieval so as to generate a final relevancy score for each document; and wherein said second database representation comprises an r×m document matrix D, such that V≈TD wherein T is an n×r term matrix; V is a non-negative n×m index matrix, wherein each of its m columns represents a jth document having n entries containing the value of a function of the number of occurrences of an ith term of said vocabulary appearing in said jth document; and wherein T and D are non-negative matrix factors of V and r<nm/(n+m); and wherein each of the m columns of said document matrix D corresponds to said jth document.
- 17. The method of claim 16 wherein said final relevancy score for any jth document is a function of said jth document's corresponding entry in said document matrix D and the corresponding entries in said document matrix D of the Γ top-scoring documents from said first pass retrieval.
- 18. The method of claim 17 wherein said relevancy score function for said jth document is proportional to a sum of cosine distances between said jth document's corresponding entry in said document matrix D and each of said corresponding entries in said document matrix D of the Γ top-scoring documents from said first pass retrieval.
- 19. The method of claim 16 wherein r is at least one order of magnitude smaller than n.
- 20. The method of claim 16 wherein r is from two to three orders of magnitude smaller than n.
- 21. The method of claim 16 wherein entries of said document matrix D falling below a predetermined threshold value t are set to zero.
- 22. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for information retrieval, said method steps comprising:
providing a query comprising a plurality of search terms; providing a vocabulary of n terms; performing a first pass retrieval through a first database representation and scoring m retrieved documents according to relevance to said query; executing a second pass retrieval through a second database representation and scoring documents retrieved from said first pass retrieval so as to generate a final relevancy score for each document; and wherein said second database representation comprises an r×m document matrix D, such that V≈TD wherein T is an n×r term matrix; V is a non-negative n×m index matrix, wherein each of its m columns represents a jth document having n entries containing the value of a function of the number of occurrences of an ith term of said vocabulary appearing in said jth document; and wherein T and D are non-negative matrix factors of V and r<nm/(n+m); and wherein each of the m columns of said document matrix D corresponds to said jth document.
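By way of illustration only, and forming no part of the claims, the following Python sketch shows one way the recited steps might be exercised: the rank bound r<nm/(n+m) of claim 1, the squared-error cost function and its multiplicative update rules from claim 13, the thresholding of claim 6, and the second-pass scoring of claims 17 and 18. All function names, the random initialization, the fixed iteration count used in place of a stopping test, and the small constant `eps` guarding against division by zero are assumptions made for this example. It also assumes that the "cosine distance" of claim 18 denotes the cosine of the angle between two columns of D, so that larger sums indicate greater relevance.

```python
import math
import random


def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf_euclidean(V, r, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix V into T (n x r) and D (r x m), V ~ TD."""
    n, m = len(V), len(V[0])
    if not r < n * m / (n + m):
        raise ValueError("r must satisfy r < nm/(n+m)")
    rng = random.Random(seed)  # assumed: strictly positive random start
    T = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    D = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        # D_kj <- D_kj (T^T V)_kj / (T^T T D)_kj
        Tt = transpose(T)
        num, den = matmul(Tt, V), matmul(matmul(Tt, T), D)
        D = [[D[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(r)]
        # T_ik <- T_ik (V D^T)_ik / (T D D^T)_ik
        Dt = transpose(D)
        num, den = matmul(V, Dt), matmul(T, matmul(D, Dt))
        T = [[T[i][k] * num[i][k] / (den[i][k] + eps) for k in range(r)]
             for i in range(n)]
    return T, D


def squared_error(V, T, D):
    """Cost ||V - TD||^2 summed over all entries."""
    TD = matmul(T, D)
    return sum((v - a) ** 2 for rv, ra in zip(V, TD) for v, a in zip(rv, ra))


def threshold_document_matrix(D, t):
    """Set entries of D that fall below the threshold t to zero."""
    return [[x if x >= t else 0.0 for x in row] for row in D]


def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def second_pass_scores(D, first_pass_ranking, gamma):
    """Score each first-pass document by its summed cosine with the top gamma documents.

    first_pass_ranking lists document (column) indices, best first.
    """
    cols = transpose(D)  # cols[j] is the r-entry column of D for document j
    top = first_pass_ranking[:gamma]
    return {j: sum(cosine(cols[j], cols[t]) for t in top)
            for j in first_pass_ranking}
```

In this sketch only D is consulted during the second pass, which is consistent with claims 2 and 3 permitting V and T to be deleted once the factorization is complete. A practical embodiment would likely replace the fixed iteration count with a test on the value of the cost function.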
Government Interests
[0001] This work was supported under a DARPA government contract, SPAWAR contract No. N66001-99-2-8916.