Extraction of Compounds

Information

  • Patent Application
    20070225968
  • Publication Number
    20070225968
  • Date Filed
    March 26, 2007
  • Date Published
    September 27, 2007
Abstract
A system for extracting a compound from a plurality of texts is provided. The system includes an obtaining section that analyzes a plurality of first texts and obtains a compound candidate based on analysis of the plurality of first texts, a calculation section that searches a plurality of second texts for each word included in the compound candidate and calculates appearing frequencies of each word included in the compound candidate in the plurality of second texts, and a selection section that selects whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of each word included in the compound candidate synchronize with one another when the appearing frequencies of each word included in the compound candidate are arranged as time series data.
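The abstract does not say how "synchronization" between frequency changes is measured. The following is a minimal sketch under one possible reading: appearing frequencies are counted per publication date, day-over-day changes are computed, and two words are treated as synchronized when the Pearson correlation of their changes meets a threshold. The function names, the 0.7 threshold, and the sample numbers are illustrative assumptions and are not taken from the application.

```python
from collections import Counter
from datetime import date
from math import sqrt


def frequency_series(texts, word, days):
    """Count, for each day, the dated texts whose body contains `word`.

    `texts` is a list of (publication_date, body) pairs (hypothetical format).
    """
    counts = Counter(d for d, body in texts if word in body.lower().split())
    return [counts[d] for d in days]


def changes(series):
    """Day-over-day changes in appearing frequency."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def is_synchronized(series_by_word, threshold=0.7):
    """True when every pair of words has correlated frequency changes.

    Assumption: correlation of first differences stands in for the
    "synchronize with one another" test; the threshold is arbitrary.
    """
    words = list(series_by_word)
    deltas = {w: changes(series_by_word[w]) for w in words}
    return all(
        pearson(deltas[a], deltas[b]) >= threshold
        for i, a in enumerate(words)
        for b in words[i + 1:]
    )


# Invented sample frequencies, one value per day (not the data of FIGS. 3-5).
bird = [2, 3, 10, 12, 4, 2]
flu = [1, 2, 9, 11, 3, 1]
problem = [5, 6, 5, 4, 6, 5]

print(is_synchronized({"bird": bird, "flu": flu}))                      # True
print(is_synchronized({"bird": bird, "flu": flu, "problem": problem}))  # False
```

In this sketch "bird flu" would be extracted as a compound, while "bird flu problem" would not, because the frequency of "problem" does not rise and fall together with the other two words.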
Description

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and the advantage thereof, reference is now made to the following description taken in conjunction with the accompanying drawings.



FIG. 1 shows an information processing system according to an embodiment of the present invention.



FIG. 2 is a flowchart of processing steps performed by a compound extraction device to extract a compound according to an embodiment of the present invention.



FIG. 3 shows sample appearing frequencies of the word “bird” as time series data.



FIG. 4 shows sample appearing frequencies of the word “flu” as time series data.



FIG. 5 shows sample appearing frequencies of the word “problem” as time series data.



FIG. 6 shows sample appearing frequencies of the phrase “train explosion accident” as time series data.



FIG. 7 shows sample appearing frequencies of the word “train” as time series data.



FIG. 8 shows sample appearing frequencies of the word “explosion” as time series data.



FIG. 9 shows sample appearing frequencies of the word “accident” as time series data.



FIG. 10 is a flowchart of processing steps performed by a text retrieval device to retrieve texts according to an embodiment of the present invention.



FIG. 11 shows a sample display for retrieval results outputted by a search section according to an embodiment of the present invention.



FIG. 12 shows an information processing device according to an embodiment of the present invention.


Claims
  • 1. A system for extracting a compound from a plurality of texts, the system comprising: an obtaining section that analyzes a plurality of first texts and obtains a compound candidate based on analysis of the plurality of first texts; a calculation section that searches a plurality of second texts for each word included in the compound candidate and calculates appearing frequencies of each word included in the compound candidate in the plurality of second texts; and a selection section that selects whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of each word included in the compound candidate synchronize with one another when the appearing frequencies of each word included in the compound candidate are arranged as time series data in which the appearing frequencies of each word included in the compound candidate are in chronological order based on publication dates of the plurality of second texts.
  • 2. The system of claim 1, wherein the obtaining section further obtains a plurality of compound candidates based on analysis of the plurality of first texts, wherein, for each of the plurality of compound candidates, the calculation section further searches the plurality of second texts for each word included in the corresponding compound candidate and calculates appearing frequencies of each word included in the corresponding compound candidate in the plurality of second texts, and the selection section further calculates a score based on whether or not changes in the appearing frequencies of each word included in the corresponding compound candidate synchronize with one another when the appearing frequencies of each word included in the corresponding compound candidate are arranged as time series data in which the appearing frequencies of each word included in the corresponding compound candidate are in chronological order based on publication dates of the plurality of second texts, and wherein the selection section further selects to extract one of the plurality of compound candidates as a compound based on the score of the one compound candidate.
  • 3. The system of claim 1, wherein, responsive to the compound candidate including a previously specified word, the selection section selects to extract the compound candidate as a compound on the condition that changes in the appearing frequencies of the previously specified word synchronize with changes in the appearing frequencies of a different word included in the compound candidate.
  • 4. The system of claim 1, wherein, responsive to the compound candidate including a medium frequency word that has appearing frequencies under a predetermined upper limit and above a predetermined lower limit, the selection section selects to extract the compound candidate as a compound on the condition that changes in the appearing frequencies of the medium frequency word synchronize with changes in the appearing frequencies of a different word included in the compound candidate.
  • 5. The system of claim 4, wherein the different word is a modifier on the medium frequency word.
  • 6. The system of claim 1, wherein, responsive to the compound candidate not including a previously specified word, the calculation section searches the plurality of second texts for the compound candidate and calculates appearing frequencies of the compound candidate in the plurality of second texts, and the selection section selects whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of the compound candidate synchronize with changes in the appearing frequencies of each word included in the compound candidate when the appearing frequencies of the compound candidate and the appearing frequencies of each word included in the compound candidate are arranged as time series data in which the appearing frequencies are in chronological order based on publication dates of the plurality of second texts.
  • 7. The system of claim 1, wherein the selection section divides the time series data corresponding to each word included in the compound candidate into a plurality of data pieces, each data piece corresponding to a certain time period, the selection section determines changes in the appearing frequencies of each word in the certain time period using the data piece corresponding to the certain time period for the word, and the selection section selects whether to extract the compound candidate as a compound on the basis of whether or not the changes in the appearing frequencies of each word in the certain time period synchronize with one another.
  • 8. The system of claim 1, further comprising: a storing section that stores a third text that includes a plurality of title words previously set; an input section that receives an input of a keyword; and a search section that reads the third text from the storing section responsive to the keyword being one of the plurality of title words, wherein the plurality of title words are previously set by the selection section as the words of the compound selected by the selection section.
  • 9. The system of claim 8, further comprising: an output section that outputs to the storing section the compound selected by the selection section.
  • 10. The system of claim 1, further comprising: an input section that receives an input of a plurality of keywords; and a search section that searches a plurality of target third texts and retrieves a third text that includes the plurality of keywords, wherein, responsive to the compound selected by the selection section including the plurality of keywords, the search section further searches the plurality of target third texts and retrieves another third text that includes the compound.
  • 11. The system of claim 10, wherein the search section further outputs the third text that includes the plurality of keywords and the other third text that includes the compound.
  • 12. The system of claim 1, further comprising: an output section that outputs the compound selected by the selection section to a text retrieval device, the text retrieval device comprising: an input section that receives an input of a plurality of keywords, the plurality of keywords being included in the compound selected by the selection section; and a search section that searches a plurality of target third texts and retrieves a third text that includes each of the plurality of keywords and another third text that includes the compound selected by the selection section.
  • 13. The system of claim 1, wherein the obtaining section analyzes the syntax of each of the plurality of first texts to determine the word class of each word in the respective first text and obtains a plurality of successively appearing nouns as the compound candidate.
  • 14. A system for extracting a compound from a plurality of texts, the system comprising: an obtaining section that analyzes a plurality of first texts and obtains a compound candidate based on analysis of the plurality of first texts; a calculation section that searches a plurality of second texts for the compound candidate and each word included in the compound candidate and calculates appearing frequencies of the compound candidate and each word included in the compound candidate in the plurality of second texts; and a selection section that selects whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of the compound candidate synchronize with changes in the appearing frequencies of each word included in the compound candidate when the appearing frequencies of the compound candidate and the appearing frequencies of each word included in the compound candidate are arranged as time series data in which the appearing frequencies are in chronological order based on publication dates of the plurality of second texts.
  • 15. The system of claim 14, wherein the obtaining section further obtains a plurality of compound candidates based on analysis of the plurality of first texts, wherein, for each of the plurality of compound candidates, the calculation section further searches the plurality of second texts for the corresponding compound candidate and each word included in the corresponding compound candidate and calculates appearing frequencies of the corresponding compound candidate and each word included in the corresponding compound candidate in the plurality of second texts, and the selection section further calculates a score based on whether or not changes in the appearing frequencies of the corresponding compound candidate synchronize with changes in the appearing frequencies of each word included in the corresponding compound candidate when the appearing frequencies of the corresponding compound candidate and the appearing frequencies of each word included in the corresponding compound candidate are arranged as time series data in which the appearing frequencies are in chronological order based on publication dates of the plurality of second texts, and wherein the selection section further selects to extract one of the plurality of compound candidates as a compound based on the score of the one compound candidate.
  • 16. The system of claim 14, wherein the compound candidate does not include a previously specified word.
  • 17. The system according to claim 14, wherein the compound candidate does not include a medium frequency word that has appearing frequencies under a predetermined upper limit and above a predetermined lower limit.
  • 18. A method for extracting a compound from a plurality of texts, the method comprising: analyzing a plurality of first texts; obtaining a compound candidate based on analysis of the plurality of first texts; searching a plurality of second texts for each word included in the compound candidate; calculating appearing frequencies of each word included in the compound candidate in the plurality of second texts; and selecting whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of each word included in the compound candidate synchronize with one another when the appearing frequencies of each word included in the compound candidate are arranged as time series data in which the appearing frequencies of each word included in the compound candidate are in chronological order based on publication dates of the plurality of second texts.
  • 19. A computer program that causes an information processing device to function as a system for extracting a compound from a plurality of texts, the computer program causing the information processing device to function as: an obtaining section that analyzes a plurality of first texts and obtains a compound candidate based on analysis of the plurality of first texts; a calculation section that searches a plurality of second texts for each word included in the compound candidate and calculates appearing frequencies of each word included in the compound candidate in the plurality of second texts; and a selection section that selects whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of each word included in the compound candidate synchronize with one another when the appearing frequencies of each word included in the compound candidate are arranged as time series data in which the appearing frequencies of each word included in the compound candidate are in chronological order based on publication dates of the plurality of second texts.
  • 20. A computer program product comprising a computer readable medium, the computer readable medium including a computer readable program for extracting a compound from a plurality of texts, wherein the computer readable program when executed on a computer causes the computer to: analyze a plurality of first texts; obtain a compound candidate based on analysis of the plurality of first texts; search a plurality of second texts for each word included in the compound candidate; calculate appearing frequencies of each word included in the compound candidate in the plurality of second texts; and select whether to extract the compound candidate as a compound on the basis of whether or not changes in the appearing frequencies of each word included in the compound candidate synchronize with one another when the appearing frequencies of each word included in the compound candidate are arranged as time series data in which the appearing frequencies of each word included in the compound candidate are in chronological order based on publication dates of the plurality of second texts.
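As a rough illustration of claims 2 and 7 taken together, the sketch below divides each word's time series into fixed-length periods, records whether the frequency rose, fell, or stayed flat within each period, and scores a compound candidate by the fraction of periods in which all of its words moved in the same direction. The claims do not define the period length, the score formula, or the selection threshold, so `window`, `sync_score`, `min_score`, and the sample numbers are all assumptions made for the example.

```python
def window_directions(series, window):
    """Split a series into consecutive periods and return each period's
    direction of change: 1 (rise), -1 (fall), or 0 (flat).

    A trailing partial period is ignored in this sketch.
    """
    dirs = []
    for start in range(0, len(series) - window + 1, window):
        piece = series[start:start + window]
        delta = piece[-1] - piece[0]
        dirs.append((delta > 0) - (delta < 0))
    return dirs


def sync_score(series_by_word, window=3):
    """Fraction of periods in which every word changes in the same direction.

    Assumption: periods where all words stay flat also count as agreement.
    """
    per_word = [window_directions(s, window) for s in series_by_word.values()]
    periods = list(zip(*per_word))
    if not periods:
        return 0.0
    agree = sum(1 for dirs in periods if len(set(dirs)) == 1)
    return agree / len(periods)


def select_compound(candidates, window=3, min_score=0.5):
    """Return the best-scoring candidate, or None if none reaches min_score."""
    scored = {name: sync_score(series, window) for name, series in candidates.items()}
    best = max(scored, key=scored.get, default=None)
    if best is None or scored[best] < min_score:
        return None
    return best


# Invented sample frequencies, one value per day.
candidates = {
    "bird flu": {"bird": [2, 3, 10, 12, 4, 2], "flu": [1, 2, 9, 11, 3, 1]},
    "flu problem": {"flu": [1, 2, 9, 11, 3, 1], "problem": [5, 6, 5, 4, 6, 5]},
}

print(select_compound(candidates))  # bird flu
```

Comparing only the direction of change within each period keeps the check insensitive to differences in absolute frequency between words; a correlation-based score over each period would be an equally plausible reading.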
Priority Claims (1)
Number      Date      Country  Kind
2006-82026  Mar 2006  JP       national