Claims
- 1. A method of retrieving a document from a database, comprising the steps of:
extracting a plurality of words contained in a query document received;
collecting a plurality of words contained in a plurality of documents registered previously in the database for thereby creating retrieving indexes on the basis of numbers of times said plural words as collected occur in said previously registered documents, respectively, said retrieving indexes being held in a memory;
calculating weights of said plural words, respectively, acquired in said extracting step through comparison with the words included in said retrieving indexes;
selecting a plurality of words on the basis of weight values of said plural words as the condition for selection; and
calculating degrees of similarity of said plural documents registered previously to said query document on the basis of said plurality of selected words.
- 2. A document retrieving method according to claim 1, further comprising a step of:
extracting a predetermined number of words of greater weight for selecting said plural words.
- 3. A document retrieving method according to claim 1, further comprising a step of:
excluding words of less significance for selecting said plural words.
- 4. A document retrieving method according to claim 3, further comprising a step of:
selecting a plurality of words contained in said previously registered documents on a per-language basis for creating said retrieving indexes.
- 5. A document retrieving method according to claim 2, further comprising a step of:
selecting a plurality of words contained in said previously registered documents on a per-language basis for creating said retrieving indexes.
- 6. A document retrieving entity composed of computer-readable codes constituting a program designed to run on a document retrieving system in which said program is installed, comprising:
a code representing a step of extracting a plurality of words contained in a query document received;
a code representing a step of collecting a plurality of words contained in a plurality of documents registered previously for thereby creating retrieving indexes on the basis of numbers of times said plural words as collected occur in said previously registered documents, respectively, said retrieving indexes being held in a memory;
a code representing a step of calculating respective weights of said plural words acquired in said extracting step through comparison with the words included in said retrieving indexes;
a code representing a step of selecting a plurality of words on the basis of weight values of said plural words as the condition for selection; and
a code representing a step of calculating degrees of similarity of said plural documents registered previously to said query document on the basis of said plurality of selected words.
- 7. A document retrieving entity according to claim 6, further comprising a step of:
extracting a predetermined number of words of greater weight for selecting said plural words.
- 8. A document retrieving entity according to claim 6, further comprising a step of:
excluding words of less significance for selecting said plural words.
- 9. A document retrieving entity according to claim 8, further comprising a step of:
selecting a plurality of words contained in said previously registered documents on a per-language basis for creating said retrieving indexes.
- 10. A document retrieving entity according to claim 7, further comprising a step of:
selecting a plurality of words contained in said previously registered documents on a per-language basis for creating said retrieving indexes.
- 11. A system for retrieving a document from a database, comprising:
selector means for extracting a plurality of words contained in a query document received;
collector for collecting a plurality of words contained in a plurality of documents registered previously in the database for thereby creating retrieving indexes on the basis of numbers of times said plural words as collected occur in said previously registered documents, respectively, said retrieving indexes being held in a memory;
calculator for calculating respective weights of said plural words acquired by said selector through comparison with the words included in said retrieving indexes;
another selector for selecting a plurality of words on the basis of weight values of said plural words as the condition for selection; and
another calculator for calculating degrees of similarity of said plural documents registered previously to said query document on the basis of said plurality of selected words.
- 12. A document retrieving system according to claim 11,
wherein a predetermined number of words of greater weight are extracted for selecting said plural words.
- 13. A document retrieving system according to claim 11,
wherein words of less significance are excluded for selecting said plural words.
- 14. A document retrieving system according to claim 13,
wherein a plurality of words contained in said previously registered documents are selected on a per-language basis for creating said retrieving indexes.
- 15. A document retrieving system according to claim 12,
wherein a plurality of words contained in said previously registered documents are selected on a per-language basis for creating said retrieving indexes.
- 16. A similar document retrieving method for retrieving documents bearing similarity to a designated query document by using a computer, comprising the steps of:
collecting statistical information concerning retrieval-subjected documents on a per-language basis upon registration thereof;
extracting words from said query document to thereby calculate degrees of importance of the extracted words by referencing said per-language statistical information in dependence on the languages of said extracted words, respectively; and
calculating the degrees of similarity of said registered documents to said query document on the basis of said calculated degrees of importance of the words.
- 17. A similar document retrieving method according to claim 16,
wherein numbers of documents registered on a per-language basis are employed as said per-language statistical information collected upon registration of the documents.
- 18. A similar document retrieving method according to claim 16,
wherein the word whose degree of importance meets a predetermined condition is selected as a feature word representing a feature of a query document concerned, and wherein the degree of importance of the feature word is calculated on the basis of said feature word and statistical information of all the registration-subjected documents collected upon registration thereof.
- 19. A similar document retrieving method according to claim 18,
wherein said predetermined condition prescribes that the degree of importance of said word is not smaller than a predetermined value.
- 20. A similar document retrieving method according to claim 18,
wherein said predetermined condition prescribes that a predetermined number of words are extracted in a descending order of the importance degrees of the words.
- 21. A similar document retrieving method according to claim 18, wherein the feature words are selected on a per-language basis.
- 22. A similar document retrieving method according to claim 16,
wherein the per-language statistical information of said registration-subjected documents is stored on a per-language basis.
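The steps recited in claim 1 (word extraction, index creation from occurrence counts, weighting, selection, similarity calculation) can be illustrated with a minimal Python sketch. The claims do not fix any formula, so everything concrete here is an assumption made for illustration: the weight is taken to be query frequency times log(N/df), the similarity is taken to be a sum of weight times occurrence count, and the function names, the `top_k` and `min_weight` parameters, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def extract_words(text):
    """Lowercase word extraction; a simple stand-in for the extracting step."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to {doc_id: occurrence count} over registered documents."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word, count in Counter(extract_words(text)).items():
            index[word][doc_id] = count
    return index


def weigh_query_words(query, index, n_docs):
    """Assumed weighting: query frequency times log(N / document frequency)."""
    weights = {}
    for word, tf in Counter(extract_words(query)).items():
        df = len(index.get(word, {}))
        if df == 0:
            continue  # word is absent from the index, so it cannot be compared
        weights[word] = tf * math.log(n_docs / df)
    return weights


def select_words(weights, top_k, min_weight=0.0):
    """Drop words at or below min_weight, then keep the top_k heaviest."""
    kept = [(w, v) for w, v in weights.items() if v > min_weight]
    kept.sort(key=lambda kv: (-kv[1], kv[0]))
    return dict(kept[:top_k])


def similarities(selected, index):
    """Assumed similarity: sum of weight * occurrence count per document."""
    scores = defaultdict(float)
    for word, weight in selected.items():
        for doc_id, count in index[word].items():
            scores[doc_id] += weight * count
    return dict(scores)


# Hypothetical registered documents and query document.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock prices rose on the news",
}
index = build_index(documents)
weights = weigh_query_words("the cat and the dog", index, len(documents))
selected = select_words(weights, top_k=2)
scores = similarities(selected, index)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this sketch the word "the" occurs in every registered document, so its assumed weight is zero and it is excluded, which loosely mirrors the exclusion of less significant words in claim 3; `top_k` loosely mirrors the predetermined number of words in claim 2. A per-language variant in the spirit of claims 4 and 16 could keep one such index and document count per language, but that is not shown here.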
Priority Claims (1)
Number | Date | Country | Kind
2001-363568 | Nov 2001 | JP |
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application relates to U.S. patent application Ser. No. 09/320,558 filed by T. Matsubayashi et al. on May 27, 1999 under the title “METHOD AND SYSTEM FOR EXTRACTING CHARACTERISTIC STRING, METHOD AND SYSTEM FOR SEARCHING FOR RELEVANT DOCUMENT USING THE SAME, STORAGE MEDIUM FOR STORING CHARACTERISTIC STRING EXTRACTION PROGRAM, AND STORAGE MEDIUM FOR STORING RELEVANT DOCUMENT SEARCHING PROGRAM”.