Information
-
Patent Grant
-
6173251
-
Patent Number
6,173,251
-
Date Filed
Tuesday, July 28, 1998
-
Date Issued
Tuesday, January 9, 2001
-
Inventors
-
Original Assignees
-
Examiners
- Hudspeth; David R.
- Abebe; Daniel
-
CPC
-
US Classifications
Field of Search
US
- 704 7
- 704 9
- 704 10
- 704 260
- 707 3
-
International Classifications
-
Abstract
Disclosed is a keyword extraction apparatus and method capable of overcoming a problem in the conventional automatic keyword extraction wherein character strings in a sentence to be processed are employed, as they are, to assign a document with an index in terms of keywords; hence words having the similar meaning but different expressions in written language cannot be retrieved. The keyword extraction apparatus comprises technical term storage means for storing technical terms with proper expressions and different expressions thereof, and basic word storage means for storing general basic words of high frequency. Technical-term segmentation point setting means cuts out a range of any of the technical terms stored in technical term storage means from an input sentence. When the cut-out technical term is written in a different expression, the different expression is replaced by a corresponding proper expression in proper expression replacing means. Character-type segmentation point setting means detects a difference in character type in the input sentence. Basic-word segmentation point setting means cuts out, from the input sentence, a range of any of the basic words stored in the basic word storage means. Partial character string cutting means cuts out, as keywords, all relevant partial character strings based on segmentation points set by the technical-term segmentation point setting means, the character-type segmentation point setting means and the basic-word segmentation point setting means.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a keyword extraction apparatus, a keyword extraction method and a computer readable recording medium storing a keyword extraction program, which are used in a system for retrieving a document written in natural language to automatically extract keywords from the document beforehand for creating an index of the document in terms of keywords and, at the time of retrieval, to extract a keyword from an input sentence for retrieving the document through collation of the keyword.
2. Description of the Related Art
As a method of retrieving documents in electronic form, it has been hitherto known to previously assign keywords to a document in the form of an index and, at the time of retrieval, to search the document by collating a designated keyword with the keywords assigned to the document. This method has problems in that manually assigning keywords to a document requires a lot of time and labor, and the retrieval cannot work if the keywords assigned by a person who has engaged in creating the index differ from keywords designated by persons who are going to perform retrieval.
For lessening time and labor required to assign keywords, methods of automatically extracting keywords from documents in electronic form have been proposed.
FIG. 64 is a block diagram showing a conventional keyword extraction system disclosed in, for example, Japanese Unexamined Patent Publication No. 8-30627. In FIG. 64, denoted by 6401 is a character type discriminating portion for discriminating types of individual characters in an input text and then transferring the discriminated types to character type storage means 6402. The character type storage means 6402 stores the types and corresponding positions of the individual characters in the input text which have been discriminated by the character type discriminating portion 6401. Denoted by 6403 is an effective-character-type character string cutting portion for cutting out all effective-character-type character strings, each of which extends for as long as characters of any of four effective character types, i.e., katakana (the angular Japanese syllabary corresponding to hiragana), kanji (Chinese characters), alphabets and numerals, continue, based on the information stored in the character type storage means 6402.
Denoted by 6406 is a character-type boundary discriminating portion for discriminating all boundary positions between different character types of all the effective-character-type character strings based on the information stored in the character type storage means 6402, and then transferring the discriminated positions to character-type segmentation point storage means 6407. The character-type segmentation point storage means 6407 stores every boundary position, at which the character type changes from one to another, discriminated by the character-type boundary discriminating portion 6406.
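The character type discrimination, effective-character-type cutting and boundary detection described above can be sketched roughly as follows. This is only an illustration: the function names, the Unicode ranges used to recognize each character type, and the sample string (an assumed rendering of "oekaki mohdo") are assumptions and are not taken from the cited publication.

```python
# Assumed set of effective character types, following the four types named above.
EFFECTIVE = {"katakana", "kanji", "alphabet", "numeral"}

def char_type(ch):
    """Classify one character by Unicode range (ranges are an assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "other"

def effective_runs(text):
    """Return (start, end) spans of maximal runs of effective-type characters."""
    runs, start = [], None
    for i, ch in enumerate(text):
        if char_type(ch) in EFFECTIVE:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(text)))
    return runs

def type_boundaries(text, start, end):
    """Positions inside text[start:end] where the character type changes."""
    return [i for i in range(start + 1, end)
            if char_type(text[i]) != char_type(text[i - 1])]

sample = "お絵描きモード"  # assumed rendering of "oekaki mohdo"
runs = effective_runs(sample)
```

For the assumed sample string, the runs are the two-kanji span and the katakana span, and neither contains a character-type boundary.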
Denoted by 6409 is affix storage means for storing affixes of high frequency. 6410 is an affix discriminating portion for discriminating all affixes in a character string and then transferring the discriminated affixes to affix segmentation point storage means 6411. The affix segmentation point storage means 6411 stores, as affix segmentation points, positions before and behind all the affixes discriminated by the affix discriminating portion 6410.
Denoted by 6413 is basic word storage means for storing, as basic words, nouns of high frequency. 6414 is a basic-word discriminating portion for discriminating all basic words in a character string and then transferring the discriminated basic words to basic-word segmentation point storage means 6415. The basic-word segmentation point storage means 6415 stores, as basic-word segmentation points, positions before and behind all the basic words discriminated by the basic-word discriminating portion 6414.
Denoted by 6412 is a partial-character-string cutting portion for cutting out partial character strings based on the character-type segmentation points stored in the character-type segmentation point storage means 6407, the affix segmentation points stored in the affix segmentation point storage means 6411, or the basic-word segmentation points stored in the basic-word segmentation point storage means 6415.
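One possible reading of the partial-character-string cutting is sketched below. The text does not state exactly which substrings are cut out, so the rule used here (every substring whose two ends both lie on a segmentation point or on an end of the string) is an assumption, as is the function name.

```python
from itertools import combinations

def cut_partial_strings(run, points):
    """Cut every substring bounded by two segmentation points.

    `run` is one effective-character-type character string and `points`
    are offsets inside it collected from the character-type, affix and
    basic-word segmentation point stores (assumed representation).
    """
    cuts = sorted({0, len(run), *(p for p in points if 0 < p < len(run))})
    return [run[a:b] for a, b in combinations(cuts, 2)]

parts = cut_partial_strings("ABC123", [3])
```

With a single character-type boundary at offset 3, the sketch yields the two halves and the whole string.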
Denoted by 6404 is a noun discriminating portion which, when a character succeeding each of the effective-character-type character strings cut out by the effective-character-type character string cutting portion 6403 is hiragana, compares the hiragana with hiragana character strings stored in noun-succeeding-hiragana storage means 6405, and then deletes the effective-character-type character string when a head portion of the hiragana succeeding that effective-character-type character string does not match any of the hiragana character strings stored in the noun-succeeding-hiragana storage means 6405.
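The noun discrimination rule can be illustrated with the following minimal sketch. The function names, the contents of the noun-succeeding-hiragana set and the sample strings are assumptions made for illustration only.

```python
def is_hiragana(ch):
    """True when ch lies in the Unicode hiragana range (assumed test)."""
    return 0x3041 <= ord(ch) <= 0x309F

def keep_as_noun(text, end, noun_followers):
    """Decide whether the effective run ending at offset `end` is kept.

    The run is kept when no hiragana follows it, or when the head of the
    following hiragana matches one of the stored strings.
    """
    j = end
    while j < len(text) and is_hiragana(text[j]):
        j += 1
    following = text[end:j]
    if not following:
        return True
    return any(following.startswith(h) for h in noun_followers)

followers = {"の", "を", "は", "が"}   # assumed stored hiragana strings
sample = "お絵描きモード"              # assumed rendering of "oekaki mohdo"
```

In the assumed sample, the kanji run is followed by a hiragana that is not in the stored set and is therefore deleted, while the katakana run has no succeeding hiragana and is kept.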
Denoted by 6416 is a basic-word deleting portion for deleting the partial character string which matches with any of the basic words stored in the basic word storage means 6413.
Denoted by 6417 is a necessary keyword storage means for storing keyword character strings designated beforehand. 6418 is a necessary keyword cutting portion which, when character strings matching with the character strings stored in the necessary keyword storage means 6417 appear in a text, cuts out all those character strings and adds them to keywords.
The operation of the conventional keyword extraction system will be described below. The description will be made on the case of entering a text “ (oekaki mohdo=painting mode)”, for example.
First, the character type discriminating portion 6401 discriminates types of individual characters in an input text, and the character type storage means 6402 stores the types and corresponding positions of the individual characters in such a way that the first character is hiragana, the second character is kanji, the third character is kanji, the fourth character is hiragana, and so on.
Next, the effective-character-type character string cutting portion 6403 cuts out “” and “”. Since there are no differences in character type within the partial character strings of “” and “”, character-type segmentation points are not stored in the character-type segmentation point storage means 6407. Also, since no affixes are included in the partial character strings of “” and “”, affix segmentation points are not stored in the affix segmentation point storage means 6411. Further, since no basic words are included in the partial character strings of “” and “”, basic-word segmentation points are not stored in the basic-word segmentation point storage means 6415.
Then, since “” and “” do not include any of the character-type segmentation point, the affix segmentation point and the basic-word segmentation point, the partial-character-string cutting portion 6412 eventually cuts out two partial character strings of “” and “”.
Subsequently, since hiragana “” succeeding to “” is not registered in the noun-succeeding-hiragana storage means 6405, the noun discriminating portion 6404 deletes “”. On the other hand, since there is no hiragana succeeding to “”, “” is not deleted in the noun discriminating portion 6404. The basic-word deleting portion 6416 then deletes the basic word which matches with any of those stored in the basic word storage means 6413. If “” is assumed here not to be a basic word, “” would not be deleted.
Next, the necessary keyword cutting portion 6418 cuts out “” from the text “” stored in the necessary keyword storage means 6417 and adds it to keywords. Finally, “” and “” are output.
When “” or “” is designated as a retrieval key at the time of retrieval, the document including the original text “” is retrieved.
In retrieval with the thus-constructed keyword extraction system disclosed in Japanese Unexamined Patent Publication No. 8-30627, the retrieval is hit only when the character string designated as a keyword completely matches with any of the keywords assigned to a document. In retrieval, however, words having the similar meaning and pronunciation but different expressions (in written language) must be often taken into account. For example, “ (oekaki=painting)” may be entered as a retrieval key rather than “” at the time of retrieval. Thus the keyword extraction system disclosed in Japanese Unexamined Patent Publication No. 8-30627 has a problem that retrieval cannot be effected unless there is a complete match between character strings.
To cope with the problem caused by words having the similar meaning and pronunciation but different expressions, a document retrieval method and apparatus are proposed in Japanese Unexamined Patent Publication No. 8-137892. In the document retrieval method and apparatus proposed in Japanese Unexamined Patent Publication No. 8-137892, when a character string designated upon retrieval is a compound word, the compound word is divided into individual words composing it and synonym expressions for the compound word are created in combinations of synonyms for each of the divided words by using a synonym dictionary.
FIG. 65 is a block diagram of the conventional document retrieval method and apparatus disclosed in Japanese Unexamined Patent Publication No. 8-137892. In FIG. 65, denoted by 6501 is a control unit comprised of a CPU and memory, 6502 is an input unit such as a keyboard or mouse through which the user enters a retrieval keyword and performs retrieval operation, 6503 is a display unit for displaying the retrieval keyword entered through the input unit 6502, the retrieval operation instructed by the user, and retrieved results, 6504 is an external storage unit for storing data to be retrieved, 6505 is a synonym dictionary in which synonym information for retrieved keywords is stored, and 6506 is a segmentation dictionary in which the retrieved keywords are stored. A character string designated for retrieval is segmented based on words registered in the segmentation dictionary 6506.
The operation of the conventional document retrieval method will be described below.
FIG. 66 is a flowchart illustrating a flow of processing disclosed in Japanese Unexamined Patent Publication No. 8-137892. The following description will be made on the case of designating, for example, “ (bunsho kensaku=document retrieval)* (wahku sutehshon=work station)” (where “*” indicates logical product) as a retrieval formula. It is assumed that “” and “” are registered in the segmentation dictionary. Also, the synonym dictionary is assumed to store such information that “” and “ (tekisuto=text)” are synonyms, “” and “ (sahchi=search)” are synonyms, and “” and “WS” are synonyms.
In step 6612, a value in a synonym-dictionary usage flag buffer, which sets whether or not to use the synonym dictionary, is checked. Assuming here that the buffer value is set to “1” indicating the use of the synonym dictionary, the processing follows the path indicated by Y.
Next, in step 6613, the retrieval formula is segmented into a character string to be retrieved and a logical formula. Then, in step 6614, the character string to be retrieved is compared with words in the segmentation dictionary for segmentation of a keyword. Subsequently, in step 6615, synonyms which correspond to each of the segmented keywords are extracted from the synonym dictionary.
It is determined in step 6616 whether or not the processing for all keywords has been completed, and the processing of steps 6614 and 6615 is repeated until all keywords are processed.
Next, in step 6617, the synonyms corresponding to the segmented keywords are combined with each other to create retrieval keywords.
Subsequently, in step 6618, the created retrieval keywords are joined by putting a logical sum (“+”) between each adjacent two. As a result, for “”, a retrieval formula “(+++)” is created in step 6619.
It is then checked in step 6620 whether or not a logical formula storage buffer is empty. The processing now returns to step 6614 to repeat the similar processing as explained above for the next character string to be retrieved, i.e., “”.
For “”, a retrieval formula “(+WS)” is created in step 6619.
Although it is checked in step 6620 whether or not the logical formula storage buffer is empty, the processing now follows the path indicated by Y because there is no more retrieved character string to be processed. As a result, for the designated retrieval formula “*”, “(+++)*(+WS)” is created as a retrieval formula for use in actual retrieval.
However, the document retrieval method and apparatus disclosed in Japanese Unexamined Patent Publication No. 8-137892 are designed to perform retrieval for character strings created by all possible combinations of different expressions, and hence have a problem that a longer time is required for retrieval as the number of combinations increases.
As another related art for creation of different expressions, Japanese Unexamined Patent Publication No. 3-15980 discloses a different expression and synonym developing method.
FIG. 67 is a block diagram of the different expression and synonym developing method for retrieval of character strings which is disclosed in Japanese Unexamined Patent Publication No. 3-15980. In FIG. 67, denoted by 6711 and 6713 are conversion rule tables for storing conversion rules which instruct a relevant character string in an input character string to be replaced by another character string, and 6712 is a synonym dictionary in which words having the similar meaning but different expressions are collected. Denoted by 6700 is a keyboard, 6701 and 6703 are different expression developing processes for developing a character string into character strings having the similar pronunciation and meaning but different expressions, and 6702 is a synonym developing process for developing a character string into character strings having the similar meaning by using a synonym dictionary 6712.
FIG. 68 shows an outline of the different expression and synonym developing process. A character string 6801 designated by the user is once subjected to different expression development, and a synonym development is then performed on a group of developed character strings 6802 by using the synonym dictionary 6712. After that, another different expression development is performed on a group of character strings 6803 resulted from the synonym development, whereby a group of character strings 6804 is obtained as a final development result. An example of FIG. 68 represents the case where the user designates a character string “ (takujougata intafohn=desktop interphone)” on condition that each of the conversion tables stores rules for converting “(foh)” into “ (ho)” and “ (gata)” into “ (gata)”, and the synonym dictionary stores information that “” and “” are synonyms.
Thus, the method disclosed in Japanese Unexamined Patent Publication No. 3-15980 is designed to avoid a retrieval omission by developing various representations of different expressions and synonyms. However, because the disclosed method creates all possible different expressions, it is required to collate an input character string with all the different expressions created by the above-mentioned processing in order to determine whether or not there occurs a match for each word.
The conventional keyword extraction methods for use in retrieval of documents have had problems below because of their constructions described above.
First, in such a conventional automatic keyword extraction process as disclosed in Japanese Unexamined Patent Publication No. 8-30627, character strings appearing in a sentence to be processed are cut out, as they are, to be used as keywords which are assigned in the form of an index to a document. The conventional automatic keyword extraction process cannot therefore perform retrieval for words having the similar meaning and pronunciation but different expressions.
Although techniques to permit retrieval for words having similar meaning and pronunciations but different expressions are disclosed in Japanese Unexamined Patent Publication No. 8-137892 and No. 3-15980, those techniques require a word designated for retrieval to be collated with all possible combinations of individual words composing the designated word which have the similar pronunciation and meaning but different expressions. Thus, there has been a problem that a long time is required for retrieval processing.
Assuming, for example, that words having the similar meaning and pronunciation but different expressions are “ (sahbah=server)” for “ (sahba=server)” and “”, “”, “” for “” (each kirikae=switching), a total of eight keywords, i.e., “”, “”, “”, “”, “”, “”, “”, and “” have been created and collated for a keyword “”.
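The count of eight in this example follows from multiplying the number of written forms of each component word (two for "server", four for "switching"). A small sketch, using romanized stand-ins and placeholder labels for the four written forms that share the reading "kirikae" (the labels are assumptions, since the Japanese forms are not reproduced here):

```python
from itertools import product

def all_written_forms(*variant_lists):
    """Every combination of one written form per component word."""
    return ["".join(combo) for combo in product(*variant_lists)]

server = ["sahba", "sahbah"]
# Placeholder labels for the four written forms read as "kirikae".
switching = ["kirikae_a", "kirikae_b", "kirikae_c", "kirikae_d"]
forms = all_written_forms(server, switching)
```

Each additional component word with several written forms multiplies the number of keys that must be collated.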
Secondly, where a keyword contains a word which succeeds to a prefix and has different expressions, it has been required to create all combinations of the presence/absence of the prefix and the different expressions of the word succeeding to the prefix, and then collate an input keyword with all those combinations.
Assuming, for example, that there are three words having the similar meaning and pronunciation but different expressions, i.e., “”, “” and “”, for “” (each kirikae=switching), a total of eight keywords, i.e., “”, “”, “”, “”, “”, “”, “”, and “” “” have been created and collated for a keyword “ (zenkirikae=full switching)”. Thus, the necessity of collating an input keyword with all of the created keywords has raised a problem that a long time is required for retrieval processing.
Thirdly, where a keyword contains a word which precedes a suffix and has different expressions, it has been required to create all combinations of the presence/absence of the suffix and the different expressions of the word preceding the suffix, and then collate an input keyword with all those combinations.
Assuming, for example, that there are three words having the similar meaning and pronunciation but different expressions, i.e., “”, “” and “”, for “” (each kirikae=switching), a total of eight keywords, i.e., “”, “”, “”, “”, “”, “”, “”, and “” have been created and collated for a keyword “ (kirikaego=after switching)”. Thus, the necessity of collating an input keyword with all of the created keywords has raised a problem that a long time is required for retrieval processing.
Fourthly, the conventional automatic keyword extraction process as disclosed in Japanese Unexamined Patent Publication No. 8-30627 is designed to set a limit on the length of keywords and to delete the keywords which have a length beyond the limit. However, such a design may cause a problem of uneven keyword extraction in that, for keywords which have the similar meaning but different expressions and which are different in length, some keywords are extracted, but other keywords are deleted.
Assuming, for example, that “ (konpyuhta=computer)” and “ (konpyuhtah=computer)” are registered as words having the similar meaning and pronunciation but different expressions, and a limit of the keyword length is set to be less than 15 characters, “ (konpyuhta ahkitekuchah=computer architecture)” is extracted, but “ (konpyuhtah ahkitekuchah=computer architecture)” is deleted.
Stated otherwise, when combinations of a compound word are created in accordance with the method disclosed in Japanese Unexamined Patent Publication No. 8-137892 to cope with retrieval for words having the similar meaning and pronunciation but different expressions, there has been a problem of uneven keyword extraction that, even upon the same retrieval key being designated, documents containing “” are retrieved, but documents containing “” are not retrieved.
Fifthly, with the conventional keyword extraction process disclosed in Japanese Unexamined Patent Publication No. 8-30627, because character strings appearing in a sentence to be processed are cut out, as they are, to be used as keywords, words having the similar meaning and pronunciation but different expressions are extracted as separate words. Accordingly, there has been a problem that precise frequency totalization which is necessary for, e.g., a keyword weighting process, cannot be achieved for the words having the similar meaning and pronunciation but different expressions.
Sixthly, in compound words such as “. (yuza intafehsu=user interface)”, for example, symbolic characters such as “•” and “/” may be put between individual words composing the compound word; e.g., “. ” and “. ”, in addition to different expressions for each of the individual words composing the compound word; i.e., “” and “”. It is therefore required to unify the expression format for compound words.
The conventional keyword extraction process disclosed in Japanese Unexamined Patent Publication No. 8-30627 includes a method of deleting “•” and “/” to unify the expression format for compound words, but it cannot deal with different expressions for each word which have the similar meaning and pronunciation, as described above. Also, Japanese Unexamined Patent Publication No. 8-137892 and No. 3-15980 disclose methods of creating combinations of different expressions for each word which have the similar meaning and pronunciation, but cannot deal with a process needed to unify the expression format for compound words. Accordingly, even if the above conventional techniques are combined with each other, an input keyword must be collated with all possible combinations of different expressions of individual words composing a compound word; hence a problem of requiring a long time for retrieval processing still remains.
Assuming, for example, that “ (yuhza=user)” has a different expression “ (yuhzah=user)” which has the similar meaning and pronunciation, and “ (intafehsu=interface)” has a different expression of “ (intafeisu=interface)”, four expressions “”, “”,“”, “”, and “” would be produced for “. ” even if the above conventional techniques are combined with each other. Accordingly, a problem of requiring collation with all those different expressions is encountered.
Seventhly, in the methods disclosed in Japanese Unexamined Patent Publication No. 3-15980 and No. 8-137892, different expressions of a retrieval key, which have the similar meaning and pronunciation, are created at the time of retrieval in combinations of different expressions for each word and character string. As a result, a large number of retrieval keys to be collated are produced and a retrieval speed is reduced.
Furthermore, the methods disclosed in Japanese Unexamined Patent Publication No. 3-15980 and No. 8-137892 have a risk that an improper retrieval key may be produced when replacing a short word, in particular. For example, because the method disclosed in Japanese Unexamined Patent Publication No. 3-15980 holds a rule that “ (tah)” is a different expression of “ (ta)”, “ (intahfohn=interphone)” is created as a different expression of “ (intafohn=interphone)” in the step of creating a different expression of “ (intafohn=interphone)”. However, the rule that “ (tah)” is a different expression of “ (ta)” can be applied to “”, but not to “ (takushih=taxi)”, for example. It is therefore demanded to avoid a short word and store a relatively long word, such as a compound word, as information in a different expression dictionary used for replacement of one to another of different expressions. Hitherto, there have been no techniques to assist construction of a different expression dictionary responding to such a demand. As a result, a number of retrieval keys are produced and a problem that a keyword extraction method for realizing a high-speed document retrieval cannot be achieved has been encountered.
SUMMARY OF THE INVENTION
The present invention has been made to solve the problems as set forth above, and its object is to realize keyword extraction for high-speed document retrieval without increasing the number of combinations of different expressions of words serving as retrieval keys unlike the conventional document retrieval methods intended to cope with the problem caused by words having the similar meaning but different expressions, wherein in a keyword extraction process for creating an index assigned to the document, technical term storage means for storing technical terms along with different expressions thereof are referred to for assigning a Japanese document with keywords for technical terms appearing in the document after conversion of their different expressions into respective proper expressions, and at the time of retrieval, a different expression of an input word is converted into a corresponding proper expression with reference to the technical term storage means, followed by collation using the proper expression.
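The idea of converting different expressions into a proper expression both when the index is created and when a retrieval key is entered can be illustrated as follows. This is a minimal sketch under assumed names and an assumed dictionary layout, with romanized stand-ins for the Japanese words; it is not the apparatus defined in the aspects below.

```python
def to_proper(word, table):
    """Map a different expression to its proper expression, if registered."""
    return table.get(word, word)

def build_index(docs, table):
    """Index each document under the proper expression of its keywords."""
    index = {}
    for doc_id, keywords in docs.items():
        for kw in keywords:
            index.setdefault(to_proper(kw, table), set()).add(doc_id)
    return index

def search(index, query, table):
    """Normalize the retrieval key the same way, then collate once."""
    return index.get(to_proper(query, table), set())

table = {"sahbah": "sahba"}              # different expression -> proper expression
docs = {1: ["sahba"], 2: ["sahbah"]}     # keywords as written in each document
index = build_index(docs, table)
hits = search(index, "sahbah", table)
```

Because both sides are normalized, a single collation retrieves both documents and no combinations of different expressions need to be generated at retrieval time.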
Another object is to realize keyword extraction for high-speed document retrieval without increasing the number of combinations of different expressions of words serving as retrieval keys regardless of the presence/absence of a prefix and different expressions of a technical term succeeding to the prefix, wherein when the technical term succeeding to the prefix is written in a different expression, the different expression of the technical term is replaced by the corresponding proper expression before assigning the technical term as a keyword to a document, and at the time of retrieval, a different expression of an input word is converted into a corresponding proper expression, followed by collation using the proper expression.
Still another object is to realize keyword extraction for high-speed document retrieval without increasing the number of combinations of different expressions of words serving as retrieval keys regardless of the presence/absence of a suffix and different expressions of a technical term preceding the suffix, wherein when the technical term preceding the suffix is written in a different expression, the different expression of the technical term is replaced by the corresponding proper expression before assigning the technical term as a keyword to a document, and at the time of retrieval, a different expression of an input word is converted into a corresponding proper expression, followed by collation using the proper expression.
Still another object is to realize keyword extraction wherein when a length of the extracted keyword is limited, the number of characters is counted based on the word after converting its different expression into a corresponding proper expression, thereby avoiding such an uneven extraction of keywords that some words are registered, but other words are deleted depending on difference in number of characters between different expressions of even those words which have the similar meaning.
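Counting the keyword length after conversion into the proper expression can be sketched as follows. The katakana strings are assumed renderings of the romanized "konpyuhta", "konpyuhtah" and "ahkitekuchah" used in the background example, and the function names and the choice of proper expression are likewise assumptions.

```python
# Different expression -> proper expression (assumed entries).
PROPER = {"コンピューター": "コンピュータ"}

def normalize(keyword, proper=PROPER):
    """Replace each registered different expression by its proper expression."""
    for variant in sorted(proper, key=len, reverse=True):
        keyword = keyword.replace(variant, proper[variant])
    return keyword

def within_limit(keyword, limit=15, normalize_first=True):
    """True when the keyword is shorter than `limit` characters."""
    if normalize_first:
        keyword = normalize(keyword)
    return len(keyword) < limit

short_form = "コンピュータアーキテクチャー"    # 14 characters
long_form = "コンピューターアーキテクチャー"   # 15 characters
```

Without normalization only the shorter form passes the limit; with normalization both forms are counted as the same 14-character keyword and are treated alike.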
Still another object is to realize keyword extraction wherein since keywords are extracted after replacing their different expressions by corresponding proper expressions, the words having the similar meaning but different expressions are avoided from being determined as separate words, and the keywords can be given with respective precise values of appearance frequency.
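The effect on appearance frequency can be illustrated with a short sketch. The function name and the romanized sample data are assumptions.

```python
from collections import Counter

def keyword_frequencies(keywords, proper):
    """Count keywords after replacing different expressions by proper ones."""
    return Counter(proper.get(k, k) for k in keywords)

proper = {"sahbah": "sahba"}  # different expression -> proper expression
freq = keyword_frequencies(["sahba", "sahbah", "sahba", "kirikae"], proper)
```

Here the two written forms of "server" are totalized under one entry with a count of three, instead of being counted separately as two and one.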
Still another object is to realize keyword extraction for high-speed document retrieval without increasing the number of combinations of different expressions of compound words serving as retrieval keys, wherein in a process of dealing with different expressions of a compound word, “•” and “/” appearing between words composing the compound word are deleted and a word resulted from replacing a different expression of each of the words composing the compound word by a corresponding proper expression is assigned as a keyword to a document, while at the time of retrieval, the similar processing is executed for an input compound word so that different expressions in the form of a compound word and different expressions for each of words composing the compound word can be dealt with in a unified manner.
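A unified treatment of compound words might look like the following sketch, which deletes the separator symbols and then replaces each registered different expression by its proper expression using a longest-match scan. The scan strategy, the function name and the romanized dictionary entries are assumptions made for illustration.

```python
def unify_compound(term, proper, separators="•/"):
    """Delete separator symbols, then replace different expressions.

    `proper` maps each different expression to its proper expression.
    Longer entries are tried first at every position (assumed strategy).
    """
    text = "".join(ch for ch in term if ch not in separators)
    variants = sorted(proper, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for v in variants:
            if text.startswith(v, i):
                out.append(proper[v])
                i += len(v)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

proper = {"yuhzah": "yuhza", "intafeisu": "intafehsu"}
unified = {unify_compound(t, proper)
           for t in ["yuhzah•intafeisu", "yuhza/intafehsu", "yuhzahintafehsu"]}
```

All three written forms collapse to one keyword, so the same processing applied to an input compound word at retrieval time needs only one collation.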
Still another object is to realize keyword extraction for high-speed document retrieval without increasing the number of combinations of different expressions of compound words serving as retrieval keys, wherein for adding words to be registered in the technical term storage means used in the keyword extraction method according to the present invention, a set of words are created by combining different expressions of each of individual words composing a compound word based on both different expressions of general words of high frequency and different expressions of the technical terms registered in the technical term storage means, one in the created set of the words having different expressions is determined to be a proper expression, and pairs of each headword and the proper expression are registered in the technical term storage means, thereby assisting the operation of additionally registering words, which are necessary as technical terms, in the technical term storage means.
A keyword extraction apparatus according to a first aspect of the present invention comprises technical term storage means for storing technical terms with proper expressions and different expressions thereof; basic word storage means for storing general basic words of high frequency; input means through which a sentence is input; technical-term segmentation point setting means for, when any of the technical terms stored in the technical term storage means exists in the sentence input through the input means, cutting out a range of that technical term from the input sentence; proper-expression replacing means for, when the technical term cut out by the technical-term segmentation point setting means is written in a different expression, replacing the different expression by a corresponding proper expression; character-type segmentation point setting means for detecting a difference in character type in the input sentence; basic-word segmentation point setting means for cutting out, from the input sentence, a range of any of the basic words stored in the basic word storage means; partial character string cutting means for cutting out partial character strings based on segmentation points set by the technical-term segmentation point setting means, the character-type segmentation point setting means and the basic-word segmentation point setting means; and output means for outputting, as keywords, the partial character strings cut out by the partial character string cutting means.
A keyword extraction method according to a second aspect of the present invention includes an input step for inputting a sentence; a technical-term segmentation point setting step for, when any of technical terms in technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in the input step, cutting out a range of that technical term from the input sentence; a proper-expression replacing step for, when the technical term cut out in the technical-term segmentation point setting step is written in a different expression, replacing a range of the technical term in the input sentence by a corresponding proper expression; a character-type segmentation point setting step for detecting a difference in character type in the input sentence; a basic-word segmentation point setting step for, when any of basic words in basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence; and a partial character string cutting step for cutting out, as keywords, partial character strings based on segmentation points set in the technical-term segmentation point setting step, the character-type segmentation point setting step and the basic-word segmentation point setting step.
A keyword extraction method according to a third aspect of the present invention further includes, when the sentence input in the input step is written in Japanese, a prefix segmentation point setting step for cutting out a range of any of prefixes in the Japanese input sentence by referring to prefix storage means for storing the prefixes, wherein the partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step, and the prefix segmentation point setting step.
A keyword extraction method according to a fourth aspect of the present invention further includes, when the sentence input in the input step is written in Japanese, a suffix segmentation point setting step for cutting out a range of any of suffixes in the Japanese input sentence by referring to suffix storage means for storing the suffixes, wherein the partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step, the prefix segmentation point setting step, and the suffix segmentation point setting step.
A keyword extraction method according to a fifth aspect of the present invention further includes a number-of-characters limiting step for deleting those ones of the keywords extracted in the partial character string cutting step which have a character string length outside a predetermined range, thereby providing redetermined keywords.
A keyword extraction method according to a sixth aspect of the present invention further includes a frequency totalizing step for counting appearance frequency of each of the keywords or the redetermined keywords extracted in the partial character string cutting step or the number-of-characters limiting step.
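The number-of-characters limiting step and the frequency totalizing step of the fifth and sixth aspects can be illustrated by the following minimal sketch. The function names and the inclusive length bounds are assumptions made for illustration only and are not taken from the embodiments.

```python
from collections import Counter

def limit_length(keywords, min_len, max_len):
    """Delete keywords whose length lies outside [min_len, max_len].

    The inclusive bounds are an assumption; the text only speaks of a
    predetermined range.
    """
    return [k for k in keywords if min_len <= len(k) <= max_len]

def totalize_frequency(keywords):
    """Count how often each (redetermined) keyword appears."""
    return Counter(keywords)

extracted = ["server", "switching", "server", "a", "communicationtest"]
redetermined = limit_length(extracted, 2, 10)
frequency = totalize_frequency(redetermined)
```

In this sketch the one-character keyword and the over-long keyword are deleted before counting, so the totals cover the redetermined keywords only.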
A keyword extraction method according to a seventh aspect of the present invention further comprises a symbolic-character segmentation point setting step for, when any of prescribed symbolic characters appears in the input sentence, cutting out that symbolic character, and a symbolic character deleting step for deleting the symbolic character cut out in the symbolic-character segmentation point setting step when the symbolic character is contained as one character in any of the keywords or the redetermined keywords extracted in the partial character string cutting step or the number-of-characters limiting step.
In a keyword extraction method according to an eighth aspect of the present invention, the technical term storage means stores technical terms which are created in a different expression adding step with the aid of different expressions registered in non-technical-term different expression storage means for storing different expressions of general words of high frequency and different expressions of the technical terms registered in the technical term storage means, the different expression adding step comprising a word dividing step for, when a technical term in the input sentence is a compound word, dividing the compound word into partial character strings composing the compound word, a different expression developing step for combining different expressions of the partial character strings with each other to create different expressions of the compound word, and a registering step for creating pairs of each of the created different expressions and a proper expression of the compound word, and registering the pairs in the technical term storage means.
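The different expression developing step and the registering step of the eighth aspect may be sketched as follows. This is only an illustrative sketch: the function names, the English sample spellings, and the rule that the first generated combination is taken as the proper expression are all assumptions, since the manner of choosing the proper expression is not fixed by the passage above.

```python
from itertools import product

def develop_different_expressions(parts, variants):
    """Combine the different expressions of each part of a compound word.

    parts:    partial character strings composing the compound word
    variants: hypothetical mapping from a part to its different expressions
    """
    options = [[p] + [v for v in variants.get(p, []) if v != p] for p in parts]
    return ["".join(combo) for combo in product(*options)]

def register_pairs(parts, variants, table):
    """Register (headword, proper expression) pairs in a dict-based table."""
    expressions = develop_different_expressions(parts, variants)
    proper = expressions[0]  # assumption: the spelling as given is the proper one
    for expression in expressions:
        # A blank proper expression marks the headword itself as proper.
        table[expression] = "" if expression == proper else proper
    return table

variants = {"color": ["colour"], "center": ["centre"]}
table = register_pairs(["color", "center"], variants, {})
```

With two parts having two spellings each, four headwords are registered, three of which point to the single proper expression.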
A computer readable recording medium according to a ninth aspect of the present invention stores a keyword extraction program which includes an input sequence for inputting a sentence; a technical-term segmentation point setting sequence for, when any of technical terms in technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in the input sequence, cutting out a range of that technical term from the input sentence; a proper-expression replacing sequence for, when the technical term cut out in the technical-term segmentation point setting sequence is written in a different expression, replacing a range of the technical term in the input sentence by a corresponding proper expression; a character-type segmentation point setting sequence for detecting a difference in character type in the input sentence; a basic-word segmentation point setting sequence for, when any of basic words in basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence; and a partial character string cutting sequence for cutting out, as keywords, all relevant partial character strings based on segmentation points set in the technical-term segmentation point setting sequence, the character-type segmentation point setting sequence and the basic-word segmentation point setting sequence.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1
is an overall block diagram of a keyword extraction apparatus according to Embodiment 1 of the present invention.
FIG. 2
is a representation showing one example of technical term storage means used in the present invention.
FIG. 3
is a representation showing one example of basic word storage means used in the present invention.
FIG. 4
is a representation showing one example of effective-part-of-speech succeeding hiragana-character-string storage means used in the present invention.
FIG. 5
is a flowchart showing a flow of data in a keyword extraction method according to Embodiment 1 of the present invention following successive steps.
FIG. 6
is a flowchart showing the operation of the keyword extraction method according to Embodiment 1 of the present invention.
FIG. 7
is a flowchart showing the operation of processing to set technical-term segmentation points in the present invention.
FIG. 8
is a representation showing successive states of an example of character string to be processed in the processing to set the technical-term segmentation points in the present invention.
FIG. 9
is a representation showing an intermediate state of the processing made on the example of character string to be processed in the present invention.
FIG. 10
is a representation showing successive states of an example of character string to be processed in the processing to set the technical-term segmentation points in the present invention.
FIG. 11
is a representation showing an intermediate state of the processing made on the example of character string to be processed in the present invention.
FIG. 12
is a flowchart showing the operation of processing to take out effective character strings in the present invention.
FIG. 13
is a flowchart showing the operation of processing to set a character-type segmentation point.
FIG. 14
is a representation showing an intermediate state of the processing made on the example of character string to be processed in the present invention.
FIG. 15
is a flowchart showing the operation of processing to set basic-word segmentation points in the present invention.
FIG. 16
is a flowchart showing the operation of taking out a segment range, which contains no technical term, from an effective character string in the present invention.
FIG. 17
is a flowchart showing the operation of processing to determine an effective part-of-speech in the present invention.
FIG. 18
is a representation showing an intermediate state of the processing made on the example of character string to be processed in the present invention.
FIG. 19
is a flowchart showing the operation of processing to take out a keyword candidate in the present invention.
FIG. 20
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 21
is a representation showing successive states of the example of character string to be processed in the processing to set the basic-word segmentation points in the present invention.
FIG. 22
is a representation showing successive states of the example of character string to be processed in the processing to set the basic-word segmentation points in the present invention.
FIG. 23
is a representation showing an intermediate state of the processing made on the example of character string to be processed in the present invention.
FIG. 24
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 1 of the present invention in relation to the successive steps.
FIG. 25
is an overall block diagram of a keyword extraction method according to Embodiment 2 of the present invention.
FIG. 26
is a flowchart showing the operation of the keyword extraction method according to Embodiment 2 of the present invention.
FIG. 27
is a flowchart showing the operation of processing to delete a basic word in the present invention.
FIG. 28
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 2 of the present invention in relation to the successive steps.
FIG. 29
is an overall block diagram of a keyword extraction method according to Embodiment 3 of the present invention.
FIG. 30
is a representation showing one example of data registered in prefix storage means used in the present invention.
FIG. 31
is a flowchart showing the operation of the keyword extraction method according to Embodiment 3 of the present invention.
FIG. 32
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 33
is a flowchart showing the operation of processing to set prefix segmentation points in the present invention.
FIG. 34
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 35
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 36
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 3 of the present invention in relation to the successive steps.
FIG. 37
is an overall block diagram of a keyword extraction method according to Embodiment 4 of the present invention.
FIG. 38
is a representation showing one example of data registered in suffix storage means used in the present invention.
FIG. 39
is a flowchart showing the operation of the keyword extraction method according to Embodiment 4 of the present invention.
FIG. 40
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 41
is a flowchart showing the operation of processing to set suffix segmentation points in the present invention.
FIG. 42
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 43
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 44
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 4 of the present invention in relation to the successive steps.
FIG. 45
is an overall block diagram of a keyword extraction method according to Embodiment 5 of the present invention.
FIG. 46
is a flowchart showing the operation of the keyword extraction method according to Embodiment 5 of the present invention.
FIG. 47
is a flowchart showing the operation of a number-of-characters limiting process in the present invention.
FIG. 48
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 5 of the present invention in relation to the successive steps.
FIG. 49
is an overall block diagram of a keyword extraction method according to Embodiment 6 of the present invention.
FIG. 50
is a flowchart showing the operation of the keyword extraction method according to Embodiment 6 of the present invention.
FIG. 51
is a flowchart showing the operation of a frequency totalizing process in the present invention.
FIG. 52
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 6 of the present invention in relation to the successive steps.
FIG. 53
is an overall block diagram of a keyword extraction method according to Embodiment 7 of the present invention.
FIG. 54
is a flowchart showing the operation of the keyword extraction method according to Embodiment 7 of the present invention.
FIG. 55
is a flowchart showing the operation of processing to set symbolic-character segmentation points in the present invention.
FIG. 56
is a representation showing an intermediate state of the processing made on an example of character string to be processed in the present invention.
FIG. 57
is a flowchart showing the operation of processing to delete a symbolic character in the present invention.
FIG. 58
is a block diagram showing an example of data flow in the keyword extraction method according to Embodiment 7 of the present invention in relation to the successive steps.
FIG. 59
is a block diagram showing correlation between a different expression adding step and the keyword extraction method in the present invention.
FIG. 60
is a representation showing one example of non-technical-term different expression storage means used in the present invention.
FIG. 61
is a block diagram showing the configuration of the different expression adding step in the present invention.
FIG. 62
is a flowchart showing the operation of the different expression adding step in the present invention.
FIG. 63
is a block diagram showing an example of data flow in the different expression adding step in the present invention.
FIG. 64
is a block diagram showing a conventional keyword extraction system.
FIG. 65
is a block diagram of a conventional document retrieval method.
FIG. 66
is a flowchart showing part of a processing flow in the conventional document retrieval method.
FIG. 67
is a block diagram of a conventional different expression and synonym developing method for retrieval of character strings.
FIG. 68
is an outline of a conventional different expression and synonym developing process.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Embodiment 1.
Embodiment 1 of the present invention will be described hereunder, taking a sentence in Japanese as an example.
FIG. 1
is a block diagram showing one embodiment according to a first aspect of the present invention. In
FIG. 1
, denoted by
1
is technical term storage means for storing technical terms which are intimately related to the field of interest. As seen from
FIG. 2
which shows one example of the technical term storage means
1
, the storage means
1
is made up of two fields, i.e., one field of headword and the other field of proper expression corresponding to the headword. A blank proper expression field means that the headword itself is a proper expression. Also, headwords which have the same proper expression are words having similar meaning and pronunciation but different expressions (in written language); i.e., they are different expressions of each other. In
FIG. 2
, for example, the headword “ (kirikae=switching)” is a different expression of the proper expression “ (kirikae=switching)”. Also, “”, “”, “” and “” are in relation of different expression to each other.
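The two-field structure described above can be modeled, as a rough sketch, by a mapping from headword to proper expression in which a blank value marks a headword that is itself proper. The romanized entries below are hypothetical placeholders and are not the contents of FIG. 2; the function names are likewise assumptions.

```python
# Hypothetical table: headword -> proper expression ("" means the headword
# is itself the proper expression). Entries are illustrative only.
TECHNICAL_TERMS = {
    "kirikae-a": "kirikae",
    "kirikae-b": "kirikae",
    "kirikae": "",
    "sahbah": "",
}

def proper_expression(headword, table=TECHNICAL_TERMS):
    """Return the proper expression of a registered headword, else None."""
    if headword not in table:
        return None
    return table[headword] or headword

def are_different_expressions(a, b, table=TECHNICAL_TERMS):
    """Registered headwords sharing a proper expression are variants."""
    pa, pb = proper_expression(a, table), proper_expression(b, table)
    return pa is not None and pa == pb
```

Keeping the blank-field convention in the table, rather than storing every proper expression explicitly, mirrors the layout of the storage means as described.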
Denoted by
2
is basic word storage means for storing general basic words of high frequency. As seen from
FIG. 3
which shows one example of the basic word storage means
2
, the storage means
2
is made up of one field of headword alone. Denoted by
3
is effective-part-of-speech succeeding hiragana-character-string storage means for storing hiragana character strings succeeding to parts of speech which can serve as keywords (i.e., effective parts-of-speech), such as the stems of nouns, -column declinable nouns, and adjective verbs. As seen from
FIG. 4
which shows one example of the storage means
3
, the storage means
3
is made up of one field of headword alone.
Denoted by
104
is input means through which a Japanese sentence to be subjected to the keyword extraction process is input to a control unit
115
. The control unit
115
includes technical-term-storage-means managing means
105
, technical-term segmentation point setting means
106
, proper-expression replacing means
107
, effective character-string cutting means
108
, character-type segmentation point setting means
109
, basic-word-storage-means managing means
110
, basic-word segmentation point setting means
111
, effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing means
112
, effective part-of-speech determining means
113
, and partial character string cutting means
114
. The control unit
115
executes later-described data processing in accordance with control programs stored in ROM, RAM, etc. Denoted by
116
is output means through which keywords extracted by the control unit
115
are output to a file, display or any other suitable means.
FIG. 5
is a flowchart representing a keyword extraction method of the present invention in accordance with successive steps corresponding to the various means in
FIG. 1
, and showing a flow of data from entry of a sentence to extraction of a keyword following the steps.
In
FIG. 5
, denoted by
4
is an input step in which a Japanese sentence is entered through the input means
104
;
5
is a technical-term-storage-means managing step in which the technical-term-storage-means managing means
105
searches the technical term storage means
1
and takes out a technical term; and
6
is a technical-term segmentation point setting step in which the technical-term segmentation point setting means
106
extracts a character string, which matches with the technical term searched in the technical-term-storage-means managing step
5
, from the input sentence and sets segmentation points before and behind the extracted character string. Denoted by
7
is a proper-expression replacing step in which, when the technical term searched in the technical-term-storage-means managing step
5
is a different expression of another word, the proper-expression replacing means
107
replaces the technical term in the input sentence by a proper expression.
Denoted by
8
is an effective character-string cutting step in which the effective character-string cutting means
108
cuts out, from the input sentence, character strings which can serve as keywords, i.e., strings of effective character types such as kanji (Chinese characters), katakana (the angular Japanese syllabary), alphabetic characters and numerals, as well as technical terms. Denoted by
9
is a character-type segmentation point setting step in which the character-type segmentation point setting means
109
sets a character-type segmentation point for the character string cut out in the effective character-string cutting step
8
, which is not itself a technical term, based on difference in character types such as kanji and hiragana. Denoted by
10
is a basic-word-storage-means managing step in which the basic-word-storage-means managing means
110
searches the basic word storage means
2
and takes out basic words. Denoted by
11
is a basic-word segmentation point setting step in which the basic-word segmentation point setting means
111
extracts a character string, which matches with the basic word searched in the basic-word-storage-means managing step
10
, from the character strings cut out in the effective character-string cutting step
8
except technical terms and sets segmentation points before and behind the extracted character string.
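The notion of effective character types and of a character-type segmentation point may be sketched as below. The classification by Unicode code point ranges, the function names and the sample string are assumptions for illustration; the sketch also ignores the condition that technical terms are not split by this step.

```python
def char_type(ch):
    """Classify one character by Unicode block (an assumed classification)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs
        return "kanji"
    if 0x30A0 <= code <= 0x30FF:      # Katakana block, incl. long sound mark
        return "katakana"
    if 0x3040 <= code <= 0x309F:      # Hiragana block
        return "hiragana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "other"

# Character types which can serve as keywords, as listed in the text.
EFFECTIVE_TYPES = {"kanji", "katakana", "alphabet", "numeral"}

def character_type_points(text):
    """Return each index i where text[i-1] and text[i] differ in type."""
    return [i for i in range(1, len(text))
            if char_type(text[i - 1]) != char_type(text[i])]

points = character_type_points("通信テスト")  # kanji run, then katakana run
```

Because the long sound mark falls inside the katakana range used here, a katakana word containing it is not split.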
Denoted by
12
is an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step in which the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing means
112
searches the effective part-of-speech succeeding hiragana-character-string storage means
3
. Denoted by
13
is an effective part-of-speech determining step in which the effective part-of-speech determining means
113
compares the character string succeeding each of the character strings cut out in the effective character-string cutting step
8
with the hiragana character string searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and when a head portion of the succeeding hiragana does not match with any of hiragana character strings stored in the effective part-of-speech succeeding hiragana-character-string storage means
3
and the last word in the effective character string is not a technical term, sets information that the last word in the effective character string cannot serve as a keyword.
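The determination made in the effective part-of-speech determining step can be expressed, under simplifying assumptions, as a prefix test on the hiragana that follows an effective character string. The function name and the stored hiragana strings below are hypothetical and are not the contents of FIG. 4; the handling of an empty succeeding string is not specified above and is left as it falls out of the test.

```python
def last_word_can_be_keyword(following_hiragana, stored_strings,
                             last_is_technical_term):
    """Decide whether the last word of an effective string may be a keyword.

    The word is rejected only when it is not a technical term and the head
    of the succeeding hiragana matches none of the stored strings.
    """
    if last_is_technical_term:
        return True
    return any(following_hiragana.startswith(s) for s in stored_strings)

# Hypothetical stored strings, for illustration only.
STORED = {"を", "の", "する", "な"}
```

For example, under these assumed entries a string followed by "を行う" is accepted, whereas one followed by "られる" is rejected unless its last word is a technical term.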
Denoted by
14
is a partial character string cutting step in which the partial character string cutting means
114
cuts out a character string, which can serve as a keyword, based on the segmentation points set in the technical-term segmentation point setting step
6
, the effective character-string cutting step
8
, the character-type segmentation point setting step
9
, and the basic-word segmentation point setting step
11
.
The flow of data from entry of a sentence to extraction of a keyword will now be described following the successive steps.
In the technical-term-storage-means managing step
5
, the technical term storage means
1
is searched and a searched technical term
501
is passed to the technical-term segmentation point setting step
6
, whereas the technical term and its proper expression
502
are passed to the proper-expression replacing step
7
. In the basic-word-storage-means managing step
10
, the basic word storage means
2
is searched and a searched basic word
503
is passed to the basic-word segmentation point setting step
11
. In the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched and a hiragana character string
504
succeeding to the effective part-of-speech is passed to the effective part-of-speech determining step
13
.
In the input step
4
, an input sentence
505
is passed to the technical-term segmentation point setting step
6
. The technical-term segmentation point setting step
6
receives both the input sentence
505
and the technical term
501
, and outputs a sentence
506
resulting from setting, as technical-term segmentation points, a start-of-technical-term segmentation point and an end-of-technical-term segmentation point in the input sentence
505
. The proper-expression replacing step
7
receives both the sentence
506
and the technical term and its proper expression
502
, and outputs a sentence
507
resulting from, when the technical term contained in the sentence
506
is written in a different expression, replacing the different expression by a proper expression.
The effective character-string cutting step
8
outputs a sentence
508
in which a start-of-effective-character-string point and an end-of-effective-character-string point are set to mark, as a character string which can serve as a keyword (i.e., effective character string), a character string range of effective character types in the sentence
507
and of technical term set in the sentence
507
.
The character-type segmentation point setting step
9
receives the sentence
508
and outputs a sentence
509
resulting from setting, in the sentence
508
, a character-type segmentation point for the character string range of the effective character string which is not itself a technical term.
The basic-word segmentation point setting step
11
receives both the sentence
509
and the basic word
503
and outputs a sentence
510
in which a start-of-basic-word segmentation point and an end-of-basic-word segmentation point are set as basic-word segmentation points at a position, where the basic word
503
appears in the sentence
509
, for the character string range of the effective character string which contains no technical term.
The effective part-of-speech determining step
13
receives, as inputs, both the sentence
510
and the hiragana character string
504
registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
, and outputs a sentence
511
for which each character string in the sentence
510
, which cannot serve as a keyword, has been determined.
The partial character string cutting step
14
receives the sentence
511
and extracts and outputs keywords
512
in the input sentence based on the technical-term segmentation points set in the technical-term segmentation point setting step
6
, the effective character strings set in the effective character-string cutting step
8
, the character-type segmentation points set in the character-type segmentation point setting step
9
, the basic-word segmentation points set in the basic-word segmentation point setting step
11
, and the determination made in the effective part-of-speech determining step
13
on the character strings which cannot serve as keywords.
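One possible reading of cutting out partial character strings based on segmentation points is to enumerate every substring of an effective character string that both starts and ends on a segmentation point. The following sketch illustrates that reading only; the function name is an assumption, and the sketch does not model the special treatment of technical terms, basic words, or words determined not to serve as keywords.

```python
def partial_strings(effective, points):
    """Return all substrings that start and end on a boundary.

    effective: one effective character string
    points:    segmentation point indices inside the string (assumed valid)
    """
    bounds = sorted({0, len(effective), *points})
    return [effective[a:b]
            for i, a in enumerate(bounds)
            for b in bounds[i + 1:]]

candidates = partial_strings("abcdef", [2, 4])
```

With two inner segmentation points the string yields six candidates: the three single segments, the two pairs of adjacent segments, and the whole string.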
FIG. 6
is a flowchart showing the operation of one embodiment according to the first aspect of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (sahbah kirikae niyoru tsuushin tesuto wo okonau.=A server is switched over to perform a communication test.)”. First, in step
601
, the Japanese sentence is input through a keyboard or file. Then, in step
602
, technical-term segmentation points are set in the input sentence.
FIG. 7
is a flowchart showing a flow of the processing to set the technical-term segmentation points in step
602
. In step
701
, a character string or segment up to the first punctuation point in the input sentence is taken out. In the illustrated example, the step
701
finds a full stop “。” and takes out the whole of the input sentence “”.
Then, in step
702
, the head and tail of the segment are marked by pointers. In the illustrated example, a pointer ph is set to the head character “” of the segment and a pointer pt is set to the tail character “” of the segment.
Subsequently, in step
703
, the technical term storage means
1
is searched by using the character string from ph to pt as a retrieval key. In the illustrated example, the input sentence “” is used as a retrieval key as it is. It is then checked whether or not the same word as the key exists in the technical term storage means
1
. Assuming that a technical term “” is not registered in the technical term storage means
1
, the processing follows the path indicated by N and goes to step
708
where pt is shifted one character toward the head. As a result, pt now points “”. Next, it is checked in step
709
whether or not ph is positioned nearer to the head than pt. In this case, since ph is positioned nearer to the head than pt, the processing follows the path indicated by Y and returns to step
703
for searching the technical term storage means
1
again with the character string from ph to pt used as a retrieval key. The retrieval key at this time is given by “”.
By repeating the above operation, the characters composing the segment are deleted one by one from the tail, as shown in FIG.
8
. It is assumed that upon the retrieval key being given by “”, the same word as the retrieval key is found in the technical term storage means
1
. In this case, therefore, the processing follows the path indicated by Y from step
704
and goes to step
705
for checking whether or not the retrieval key is a different expression of another word. On condition that the words shown in
FIG. 2
are registered in the technical term storage means
1
, since there is a proper expression “” for “”, the processing follows the path indicated by Y from step
705
and goes to step
707
where the different expression of a character string portion in the sentence corresponding to the technical term is replaced by the proper expression, and the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively to start and end points of the replaced proper expression. The result of the processing made on the input character string so far is shown in FIG.
9
.
After that, in step
711
, ph is set to the character subsequent to pt and pt is set to the character at the tail of the segment demarcated by the punctuation point. In the illustrated example, ph is set to the position of “” and pt is set to the position of “”. It is then checked in step
712
whether ph is in the segment range demarcated by the punctuation point. In this case, since ph is in the segment range, the processing follows the path indicated by Y and returns to step
703
for searching the technical term storage means
1
again with the character string from ph to pt used as a retrieval key.
As with the processing for the first input character string, the characters composing the character string are deleted one by one from the tail, as shown in FIG.
10
. Assuming that the same word as the retrieval key is found in the technical term storage means
1
upon the retrieval key being given by “”, the processing follows the path indicated by Y from step
704
and goes to step
705
for checking whether or not the retrieval key, i.e., “”, is a different expression of another word. On condition that the words shown in
FIG. 2
are registered in the technical term storage means
1
, since “” is itself a proper expression, the processing follows the path indicated by N from step
705
and goes to step
706
where the start-of-technical-term segmentation point is set before the character pointed by ph and the end-of-technical-term segmentation point is set behind the character pointed by pt. The result of the processing made on the remaining character string so far is shown in FIG.
11
.
After that, for “”, the technical term storage means
1
is likewise searched while the characters composing the segment demarcated by the punctuation point are deleted one by one from the tail. If no matching technical term is found by the time pt has been shifted back to the position of ph, then the processing goes to step
710
where ph is shifted one character toward the tail and pt is set to the tail of the segment, followed by searching the technical term storage means
1
.
It is assumed that, as a result of repeating similar processing, none of the character strings registered in the technical term storage means
1
is found in the remaining character string. In this case, upon ph being shifted outside the segment range demarcated by the punctuation point, the determination in step
712
is responded by NO (indicated by N), and no more segment demarcated by the punctuation point remains. Accordingly, the determination in step
713
is responded by NO, thereby completing the technical-term segmentation point setting process shown in FIG.
7
.
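The tail-deleting search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary layout (surface form mapped to proper expression) and the sample strings are assumptions chosen to match the romanized reading "sahbah kirikae niyoru tsuushin tesuto wo okonau".

```python
# Hypothetical technical term storage: surface form -> proper expression.
# A proper expression maps to itself; a different expression maps to its proper form.
TECH_TERMS = {
    "サーバー": "サーバー",
    "サーバ": "サーバー",      # assumed different expression
    "切り替え": "切り替え",
}

def set_technical_term_points(segment, tech_terms):
    """Return (rewritten_segment, spans); spans are (start, end) with end exclusive."""
    pieces, spans = [], []
    out_len = 0                      # length of the rewritten segment so far
    ph = 0                           # head pointer
    while ph < len(segment):
        match_end = None
        # pt starts at the tail and moves toward the head (tail characters are deleted)
        for pt in range(len(segment), ph, -1):
            if segment[ph:pt] in tech_terms:
                match_end = pt
                break
        if match_end is None:
            # no term starts at ph: keep the character and shift ph one toward the tail
            pieces.append(segment[ph])
            out_len += 1
            ph += 1
        else:
            proper = tech_terms[segment[ph:match_end]]   # replace a different expression
            spans.append((out_len, out_len + len(proper)))
            pieces.append(proper)
            out_len += len(proper)
            ph = match_end           # continue with the character subsequent to pt
    return "".join(pieces), spans
```

Because the longest key is tried first at each head position, a shorter registered term is only matched when no longer one starts at the same character.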
Next, in step
603
of
FIG. 6
, effective character strings are taken out one by one from the head of the input sentence.
FIG. 12
shows a flow of processing to take out the effective character strings.
The character string to be processed is “” shown in FIG.
11
. First, in step
1201
, one character is taken out from the character string. In this case, “” is taken out, followed by checking in step
1202
whether or not “” is one of the effective character types or lies in the range between the technical-term segmentation points. The effective character types include kanji, katakana, alphabets and numerals. Since “” is katakana, i.e., an effective character type, and is positioned between the start-of-technical-term segmentation point and the end-of-technical-term segmentation point, the start point of the effective character string is set before “” in step
1203
. Then, the next character “” is taken out in step
1204
. It is then checked in step
1205
whether or not “” is the effective character type or it is in the range between the technical-term segmentation points. At this time, since a long sound “” subsequent to katakana is also regarded as katakana and “” is positioned between the technical-term segmentation points, the processing follows the path indicated by Y and takes out the next character “” in step
1204
.
Repeating the similar processing as described above, at “” of “”, the determination in step
1205
is responded by NO and a position behind “” is set in step
1206
as the end point of the effective character string. As a result of the above processing, the first effective character string “” is taken out.
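The scan for effective character strings may be pictured with the sketch below. It assumes that character types can be decided from Unicode ranges, which the specification does not state; the function names and the sample sentence are likewise illustrative assumptions.

```python
def char_type(c):
    """Classify one character; the Unicode ranges are an assumption of this sketch."""
    if "\u30a0" <= c <= "\u30ff":      # katakana block, includes the long sound mark
        return "katakana"
    if "\u4e00" <= c <= "\u9fff":      # CJK unified ideographs
        return "kanji"
    if "\u3041" <= c <= "\u309f":
        return "hiragana"
    if c.isascii() and c.isalpha():
        return "alphabet"
    if c.isascii() and c.isdigit():
        return "numeral"
    return "other"

EFFECTIVE_TYPES = {"kanji", "katakana", "alphabet", "numeral"}

def effective_strings(text, tech_spans):
    """Return (start, end) ranges of effective character strings, end exclusive."""
    in_term = set()
    for s, e in tech_spans:
        in_term.update(range(s, e))
    ranges, start = [], None
    for i, c in enumerate(text):
        ok = char_type(c) in EFFECTIVE_TYPES or i in in_term
        if ok and start is None:
            start = i                   # start point set before this character
        elif not ok and start is not None:
            ranges.append((start, i))   # end point set behind the previous character
            start = None
    if start is not None:
        ranges.append((start, len(text)))
    return ranges
```

A hiragana character inside a technical-term range stays part of the effective character string, which is why the range test is combined with the character-type test.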
After that, a character-type segmentation point is set in step
604
of FIG.
6
.
FIG. 13
is a flowchart showing a flow of processing to set the character-type segmentation point. A character string to be processed is the effective character string “” in the illustrated example. First, in step
1301
, “”, i.e., the head character in the effective character string, is assigned to p_moji and “”, i.e., the second character in the segment, is assigned to moji. It is then checked in step
1302
whether or not p_moji and moji are positioned between the start and end segmentation points for the same technical term. In the illustrated example, since both p_moji and moji are positioned in the range of the same technical term “”, the processing follows the path indicated by Y from step
1302
.
Next, it is checked in step
1305
whether or not moji is the last character in the effective character string. In this case, the processing follows the path indicated by N and goes to step
1306
where the positions of p_moji and moji are shifted one character rearward. Subsequently, the processing returns to step
1302
for checking again whether or not both p_moji and moji are positioned in the range of the same technical term.
Repeating the similar processing as described above, upon p_moji indicating “” and moji indicating “”, the determination in step
1302
is responded by NO and the processing goes to step
1303
for checking whether or not the character types of p_moji and moji are the same. In this case, since the character type of “” is katakana and the character type of “” is kanji, the processing follows the path indicated by N from step
1303
. Then, in step
1304
, a character-type segmentation point is set between p_moji and moji.
Repeating the similar processing as described above for the segment example of “”, no more character-type segmentation point is set, and upon moji indicating the last character in step
1305
, the processing follows the path indicated by Y from step
1305
and goes out of the processing routine of FIG.
13
. As a result, the character-type segmentation point is set between “” and “”, as shown in FIG.
14
.
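The p_moji/moji comparison of FIG. 13 can be sketched as a single pass over adjacent character pairs. The Unicode-range classifier and the sample strings are assumptions of this sketch, not part of the specification.

```python
def char_type(c):
    """Classify one character; the Unicode ranges are an assumption of this sketch."""
    if "\u30a0" <= c <= "\u30ff":
        return "katakana"              # the long sound mark falls in this block
    if "\u4e00" <= c <= "\u9fff":
        return "kanji"
    if "\u3041" <= c <= "\u309f":
        return "hiragana"
    return "other"

def char_type_points(text, start, end, tech_spans):
    """Return positions i such that a segmentation point lies between i-1 and i."""
    points = []
    for i in range(start + 1, end):
        p_moji, moji = text[i - 1], text[i]
        # skip the pair when both characters lie inside the same technical term
        if any(s <= i - 1 and i < e for s, e in tech_spans):
            continue
        if char_type(p_moji) != char_type(moji):
            points.append(i)
    return points
```

With two adjacent technical terms, the only pair tested is the one straddling the boundary between them, so at most one point is set there.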
Thereafter, the basic-word segmentation points are set in step
605
of FIG.
6
.
FIG. 15
is a flowchart showing a flow of processing to set the basic-word segmentation points. A character string to be processed is the effective character string “” in the illustrated example.
First, in step
1501
, a segment range containing no technical terms is taken out from the effective character string. Details of processing in step
1501
are shown in a flowchart of FIG.
16
.
In step
1601
of
FIG. 16
, one character is taken out. In this case, “” is taken out. It is then checked in step
1602
whether or not “” is outside the range of effective character string. Since “” is in the range of effective character string, the processing follows the path indicated by N from step
1602
. Next, it is checked in step
1603
whether or not “” is outside the range of technical term. Since “” is in the range of technical term, the processing follows the path indicated by N and returns to step
1601
for taking out the next character “”.
Repeating the similar processing as described above, since all characters of “” are in the range of technical term, the character finally taken out in step
1601
is outside the range of effective character string, whereupon the processing follows the path indicated by Y from step
1602
. The processing routine of
FIG. 16
is thus completed without taking out the segment which contains no technical term, followed by returning to step
1502
in FIG.
15
.
It is then checked in step
1502
of
FIG. 15
whether or not there is a segment which contains no technical term. Since the processing routine of
FIG. 16
has determined that there is not a segment which contains no technical term, the processing follows the path indicated by N from step
1502
and goes out of the processing routine of
FIG. 15
without setting the basic-word segmentation points.
Next, in step
606
of
FIG. 6
, the character string succeeding to a keyword candidate is checked to determine whether or not the keyword candidate is an effective part-of-speech.
FIG. 17
is a flowchart showing a flow of processing to determine the effective part-of-speech. In step
1701
, it is checked whether or not the last character in the effective character string belongs to a technical term. In this case, since the end-of-technical-term segmentation point is set behind “” in “”, the determination in step
1701
is responded by YES (indicated by Y) and the processing goes out of the processing routine of
FIG. 17
, followed by returning to step
607
of FIG.
6
.
As a result of the processing executed so far, the segmentation points are set in the first effective character string, as shown in FIG.
18
.
Subsequently, in step
607
of
FIG. 6
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech.
FIG. 19
is a flowchart showing a flow of processing to take out the keyword candidates. First, in step
1901
, a keyword start-enable point is taken out one by one from the head of the effective character string.
In this embodiment, the keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point. Further, it is assumed that the position for which a keyword end-disable point has been set by the effective part-of-speech determining process cannot serve as the keyword end-enable point.
In the illustrated example, the start-of-technical-term segmentation point and the start point of the effective character string both set before “”, shown in
FIG. 18
, are taken out as the keyword start-enable point in step
1901
. Next, in step
1902
, the keyword end-enable point rearward of “” is taken out. Since the keyword end-enable point is given by the end-of-technical-term segmentation point and the character-type segmentation point between “” and “”, the character string “” from the keyword start-enable point to the keyword end-enable point is copied as a keyword candidate into a buffer in step
1903
.
Subsequently, it is checked in step
1904
whether or not any keyword end-enable point still remains rearward of the keyword start-enable point. In this case, the processing follows the path indicated by Y and returns to step
1902
where the end-of-technical-term segmentation point and the end point of the effective character string both set behind “” are taken out as a keyword end-enable point. Then, in step
1903
, the character string “” from the keyword start-enable point to the keyword end-enable point is copied as a keyword candidate into the buffer.
Since there is no keyword end-enable point rearward of “”, the determination in step
1904
is responded by N and the processing goes to step
1905
for checking the presence of a next keyword start-enable point. In this case, since the start-of-technical-term segmentation point and the character-type segmentation point are set between “” and “”, the processing follows the path indicated by Y and returns to step
1901
where the position between “” and “” is taken out as a keyword start-enable point. Next, in step
1902
, the end-of-technical-term segmentation point behind “” is taken out as a keyword end-enable point. Then, in step
1903
, the character string “” from the keyword start-enable point to the keyword end-enable point is copied as a keyword candidate into the buffer.
Further, since there is neither keyword end-enable point nor keyword start-enable point rearward of “”, the determinations in steps
1904
and
1905
are both responded by N and the processing goes out of the processing routine of
FIG. 19
, followed by returning to step
608
of FIG.
6
. As a result of the above routine for the keyword candidate extraction process, three keyword candidates, i.e., “”, “” and “”, are taken out.
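The pairing of start-enable and end-enable points walked through above can be sketched as two nested loops. The function name, the integer-offset representation of segmentation points and the sample string are assumptions made for illustration.

```python
def keyword_candidates(text, start_points, end_points, disabled_ends=()):
    """Pair every start-enable point with every later end-enable point.

    Points are character offsets into `text`; an offset listed in
    `disabled_ends` (a keyword end-disable point) cannot end a keyword.
    """
    starts = sorted(set(start_points))
    ends = sorted(set(end_points) - set(disabled_ends))
    buffer = []
    for s in starts:                 # take start-enable points from the head
        for e in ends:               # take end-enable points rearward of s
            if e > s:
                buffer.append(text[s:e])
    return buffer

# Start points: start of string, start-of-technical-term points, character-type point.
# End points: end-of-technical-term points, character-type point, end of string.
candidates = keyword_candidates("サーバー切り替え", [0, 0, 4, 4], [4, 4, 8, 8])
```

Several kinds of segmentation point may coincide at one position, so the offsets are de-duplicated before pairing.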
Subsequently, it is checked in step
608
of
FIG. 6
whether or not any effective character string still remains in the input sentence. In this case, the processing follows the path indicated by Y and returns to step
603
for taking out a next effective character string. The characters following “” are checked one by one in accordance with the flowchart of
FIG. 12
to determine whether or not each character is an effective character type or lies in the range between the technical-term segmentation points. As a result, “ (tsuushin tesuto=communication test)” is taken out as the next effective character string.
After that, in step
604
of
FIG. 6
, the character-type segmentation point is set.
FIG. 13
is the flowchart showing the flow of processing to set the character-type segmentation point. A character string to be now processed is “”. First, in step
1301
, “”, i.e., the head character of “”, is assigned to p_moji and “”, i.e., the second character of “”, is assigned to moji. It is then checked in step
1302
whether or not p_moji and moji are positioned between the start and end segmentation points for the same technical term. In this case, since there is no technical term in the effective character string, the processing follows the path indicated by N. It is then checked in step
1303
whether or not the character types of p_moji and moji are the same. Since the character types of p_moji and moji are both kanji, the processing follows the path indicated by Y from step
1303
.
Next, it is checked in step
1305
whether or not moji is the last character in the effective character string. In this case, the processing follows the path indicated by N and goes to step
1306
where the positions of p_moji and moji are shifted one character rearward. Subsequently, the processing returns to step
1302
for checking again whether or not both p_moji and moji are positioned in the range of the same technical term. The determination in step
1302
is now responded by NO and the processing goes to step
1303
. Since the character type of “” indicated by p_moji is kanji and the character type of “” indicated by moji is katakana, the determination in step
1303
is now responded by NO. Accordingly, in step
1304
, a character-type segmentation point is set between p_moji and moji.
As a result of continuing similar processing for the effective character string “” until moji points to the last character in the effective character string, the character-type segmentation point is set between “” and “”, as shown in FIG.
20
.
Thereafter, the basic-word segmentation points are set for “” in step
605
of FIG.
6
.
FIG. 15
is the flowchart showing the flow of processing to set the basic-word segmentation points.
First, in step
1501
, a segment range containing no technical terms is taken out from the effective character string. As with the above-mentioned “”, the processing in step
1501
is executed in accordance with the flowchart of FIG.
16
. In step
1601
, one character “” is taken out. Since “” is in the range of effective character string, the processing follows the path indicated by N from step
1602
. Further, since “” is outside the range of technical term, the processing follows the path indicated by Y from step
1603
. Next, in step
1604
, the start point of the segment range containing no technical terms is set before “”. Subsequently, in step
1605
, one character “” is taken out. Since “” is in the range of effective character string, the processing follows the path indicated by Y from step
1606
. Further, since “” is outside the range of technical term, the processing follows the path indicated by Y from step
1607
, followed by taking out one character in step
1605
again.
Repeating the similar processing as described above, upon exceeding “” of “”, the taken-out character is positioned outside the range of the effective character string, whereupon the determination in step
1606
is responded by YES and the end point of the segment range containing no technical terms is set behind “” in step
1608
.
Returning to
FIG. 15
again, it is then checked in step
1502
whether or not there is a segment which contains no technical term. In this case, since “” is present as a segment range which contains no technical term, the processing follows the path indicated by Y from step
1502
.
Then, in step
1503
, a pointer ph is assigned to the head character “” of the segment range which contains no technical term, and a pointer pt is assigned to the tail character “” of the segment range. Subsequently, in step
1504
, the basic word storage means
2
is searched by using the character string from ph to pt as a retrieval key. In this case, the retrieval key is given by “”. Assuming that a basic word “” is not registered in the basic word storage means
2
, the processing follows the path indicated by N from step
1505
and goes to step
1507
where pt is shifted one character toward the head so as to point “”. It is then checked in step
1508
whether or not ph is positioned nearer to the head than pt. In this case, the processing follows the path indicated by Y and returns to step
1504
for searching the basic word storage means
2
again with “” now used as a retrieval key.
Searching the basic word storage means
2
is repeated while using, as the retrieval key, a character string which is given by deleting characters of the segment range one by one from the tail, as shown in FIG.
21
. Assuming that a word “” is registered in the basic word storage means
2
, as shown in
FIG. 3
, the processing follows the path indicated by Y from step
1505
upon pt pointing “”. In step
1506
, the start-of-basic-word segmentation point is set before “” and the end-of-basic-word segmentation point is set behind “”.
If pt points a position before the segment range containing no technical terms as a result of shifting pt toward the head side by one character in step
1507
, then the processing follows the path indicated by N from step
1508
and goes to step
1509
where ph is shifted one character toward the tail of the segment range and pt is set to the last character in the segment range containing no technical term. Thus, ph is assigned to “” and pt is assigned to “”. As with the processing for “”, the basic word storage means
2
is searched for “” while deleting characters thereof one by one from the tail, as shown in FIG.
22
.
Assuming that, of partial character strings of “”, only the character string “” is registered in the basic word storage means
2
, the basic-word segmentation points are set for “”, as shown in FIG.
23
. After that, if ph points to a position behind the segment range containing no technical terms as a result of shifting ph rearward one character at a time, then the determination in step
1510
is responded by NO. The processing returns to step
1501
for executing the process of taking out a next segment range containing no technical terms from “”. In this case, since the next segment range containing no technical terms is not present, the determination in step
1502
is responded by NO and the processing goes out of the processing routine of FIG.
15
.
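The basic-word search over a segment range containing no technical terms can be sketched as below. It assumes that the search continues after a hit, so that every registered partial character string is marked; the description leaves this open, and the function name and sample data are hypothetical.

```python
def basic_word_points(segment, basic_words):
    """Return (start, end) ranges of registered basic words, end exclusive."""
    spans = []
    for ph in range(len(segment)):                 # ph shifts toward the tail
        for pt in range(len(segment), ph, -1):     # pt shifts toward the head
            if segment[ph:pt] in basic_words:
                # start-of-basic-word point before ph, end-of-basic-word point behind pt
                spans.append((ph, pt))
    return spans

# Hypothetical basic word storage containing a single high-frequency word.
spans = basic_word_points("通信テスト", {"通信"})
```

The resulting start and end offsets feed the keyword start-enable and end-enable points alongside the technical-term and character-type segmentation points.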
Next, in step
606
of
FIG. 6
, the hiragana character string succeeding to the effective character string is checked to determine whether or not the effective character string is an effective part-of-speech. In step
1701
of
FIG. 17
, it is checked whether or not the last character in the effective character string belongs to a technical term. In this case, since the last character in the effective character string does not belong to any technical term, the processing follows the path indicated by N and goes to step
1702
for checking whether or not the character string succeeding to the effective character string matches with any character string registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
. In this case, the hiragana character string succeeding to “” is “” and, as shown in
FIG. 4
, “” is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
. Accordingly, the determination in step
1702
is responded by YES, followed by going out of the processing routine of FIG.
17
.
Subsequently, in step
607
of
FIG. 6
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. By executing processing similar to that for “” in accordance with the flowchart of
FIG. 19
, three keyword candidates, i.e., “”, “” and “”, are taken out by the routine for the keyword candidate extraction process.
After that, it is checked in step
608
of
FIG. 6
whether or not any effective character string still remains in the input sentence. In this case, since there still remains an effective character string, the processing follows the path indicated by Y and returns to step
603
for taking out a next effective character string. In accordance with the flowchart of
FIG. 12
, “” is taken out as the next effective character string. The processing goes to step
604
for setting a character-type segmentation point. In this case, since the effective character string includes no difference in character type, the processing goes to step
605
without setting the character-type segmentation point. Then, basic-word segmentation points are set in step
605
. Assuming now that “” is not registered in the basic word storage means
2
, the processing goes to step
606
without setting the basic-word segmentation points.
In step
1701
of
FIG. 17
, it is checked whether or not the last character in the effective character string belongs to a technical term. In this case, since the last character in the effective character string does not belong to any technical term, the processing follows the path indicated by N and goes to step
1702
for checking whether or not the character string succeeding to the effective character string matches with any character string registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
. In this case, the hiragana character string succeeding to “” is “”. Assuming that “” is not registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
, a keyword end-disable point is set behind “” in step
1703
.
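The effective part-of-speech determination of FIG. 17 can be sketched as follows. Two points are assumptions of this sketch rather than statements of the specification: the succeeding hiragana string is compared by prefix match, and an effective character string with no succeeding hiragana is accepted. The registered strings are hypothetical.

```python
def end_disable_point(text, start, end, tech_spans, registered_hiragana):
    """Return `end` if a keyword must not terminate there, otherwise None."""
    # Step 1701: the last character belongs to a technical term, so no check is needed.
    if any(s <= end - 1 < e for s, e in tech_spans):
        return None
    # Collect the hiragana run that follows the effective character string.
    j = end
    while j < len(text) and "\u3041" <= text[j] <= "\u309f":
        j += 1
    following = text[end:j]
    if not following:
        return None                    # assumption: nothing follows, accept
    # Step 1702: look the succeeding string up in the storage means (prefix match assumed).
    if any(following.startswith(h) for h in registered_hiragana):
        return None
    return end                         # step 1703: keyword end-disable point

text = "サーバー切り替えによる通信テストを行う"
spans = [(0, 4), (4, 8)]
registered = {"を", "の", "が"}        # hypothetical storage contents
```

Only the end of the effective character string is ever disabled, so candidates ending at an interior segmentation point are unaffected.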
Subsequently, keyword candidates are taken out in step
607
of FIG.
6
. Although this step is executed in accordance with the flowchart of
FIG. 19
, there is no keyword to be taken out because of the absence of keyword end-enable point.
The processing then goes to step
608
, but no effective character string remains in the input sentence. Accordingly, the determination in step
608
is responded by NO, thereby completing the processing.
As a result, six keywords, i.e., “”, “”, “”, “”, “”, “” and “”, are extracted.
FIG. 24
is a block diagram showing an example of data flow in the present invention in relation to the steps according to a second aspect of the present invention.
Referring to
FIG. 24
, a Japanese input sentence “ (sahbah kirikae niyoru tsuushin tesuto wo okonau.=A server is switched over to perform a communication test.)”
2405
is entered in the input step
4
. In the technical-term-storage-means managing step
5
, words “” and “”
2401
are retrieved from the technical term storage means
1
. In the technical-term segmentation point setting step
6
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set in respective positions where “” and “” appear in the input sentence, as shown in block
2406
.
Then, the information that the proper expression of the word “” is “” is passed from the technical-term-storage-means managing step
5
to the proper-expression replacing step
7
. As a result, the character string “” in block
2406
is replaced by the proper expression, i.e., “”.
Next, in the effective character-string cutting step
8
, a range of character string consisting of the effective character types, such as kanji, katakana, alphabets and numerals, or a technical term is taken out. As a result, “”, “” and “” are taken out as effective character strings, as shown in block
2408
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. As a result, the character-type segmentation points are set between “” and “” and between “” and “”, as shown in block
2409
.
After that, the basic-word segmentation points are set in the basic-word segmentation point setting step
11
. To this end, in the basic-word-storage-means managing step
10
, the basic word storage means
2
is searched and the information that a word “”
2403
is a basic word is passed to the basic-word segmentation point setting step
11
. As a result, the start-of-basic-word segmentation point and the end-of-basic-word segmentation point are set respectively before and behind “”, as shown in block
2410
.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “” and “” are found, but “” is not found as indicated at
2404
, the keyword end-disable point is set behind “” as shown in block
2411
.
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point, and which does not terminate at the keyword end-disable point. As a result of the above processing, “”, “”, “”, “”, “”, and “” are extracted as keywords from the input sentence, as shown in block
2412
.
It is to be noted that a program for executing the above-described operation in computers may be stored in a recording medium which is readable by computers, e.g., a floppy disk, and the above-described operation may be executed by computers using such a recording medium.
Also, while the segmentation points are set in Embodiment 1 in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, and the basic-word segmentation point setting step, the order of those processing steps may be optionally selected.
With Embodiment 1, as described above, in the keyword extraction process for assigning an index to a document, a technical term appearing in a Japanese sentence is assigned to the document as a keyword after any different expression of the term is replaced by its proper expression by referring to the technical term storage means, in which technical terms are stored along with their different expressions. When the technical term, now in its proper expression, is contiguous with a character string cut out from the input sentence because of a difference in character type and the presence of a basic word, a keyword in the form of a compound word is also extracted, so that keyword extraction is performed comprehensively. Because a different expression of a technical term is likewise converted into the corresponding proper expression with the same technical term storage means before retrieval starts, a keyword extraction apparatus suited to high-speed document retrieval can be achieved. Unlike conventional document retrieval that copes with words having similar meaning and pronunciation but different expressions, the number of different expressions serving as retrieval keys does not grow combinatorially.
Embodiment 2.
FIG. 25
is an overall block diagram of a keyword extraction method according to Embodiment 2 of the present invention. In
FIG. 25
, reference numerals
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
in FIG.
5
. Denoted by
4101
is a basic word deleting step for deleting those ones of keyword candidates extracted in the partial character string cutting step
14
which are present in the basic word storage means
2
.
FIG. 26
is a flowchart showing the operation of another embodiment of the second aspect, i.e., Embodiment 2, of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (sahbah kirikae niyoru tsuushin tesuto wo okonau=A server is switched over to perform a communication test)”.
The operation from step
4201
to step
4208
is exactly the same as in Embodiment 1. First, in step
4201
, the Japanese sentence is input through a keyboard or file. Then, in step
4202
, technical-term segmentation points are set in the input sentence.
Assuming that the words shown in
FIG. 2
are registered in the technical term storage means
1
, “” and “” are taken out as technical terms from the input sentence and “” is replaced by its proper expression, i.e., “”, in accordance with the flowchart of FIG.
7
. Also, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind each of “” and “”.
Next, in step
4203
, effective character strings are taken out one by one from the head of the input sentence. In accordance with the flowchart of
FIG. 12
, “” is taken out as the first effective character string.
Subsequently, a character-type segmentation point is set in step
4204
. In accordance with the flowchart of
FIG. 13
, the character-type segmentation point is set between “” and “”.
After that, basic-word segmentation points are set in step
4205
. It is assumed that no partial character string of “” is registered in the basic word storage means
2
. In accordance with the flowchart of
FIG. 15
, the processing goes to step
4206
without setting the basic-word segmentation points for that effective character string.
The character string succeeding to a keyword candidate is then checked in step
4206
to determine whether the keyword candidate is an effective part-of-speech. In accordance with the flowchart of
FIG. 17
, the part-of-speech determining routine is skipped because “” is a technical term.
Thereafter, in step
4207
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In this embodiment, a keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point. It is further assumed that the position where a keyword end-disable point is set in the effective part-of-speech determining process cannot serve as the keyword end-enable point.
In accordance with the flowchart of
FIG. 19
, “”, “” and “” are extracted as keywords from “”.
Next, it is checked in step
4208
whether or not any effective character string still remains in the input sentence. In this case, the processing follows the path indicated by Y and returns to step
4203
for taking out a next effective character string “”.
Subsequently, a character-type segmentation point is set in step
4204
. In accordance with the flowchart of
FIG. 13
, the character-type segmentation point is set between “” and “”.
After that, basic-word segmentation points are set in step
4205
. Assuming that “” is registered as a basic word in the basic word storage means
2
, the start-of-basic-word segmentation point and the end-of-basic-word segmentation point are set before and behind “”, respectively, in accordance with the flowchart of FIG.
15
.
The character string succeeding to a keyword candidate is then checked in step
4206
to determine whether the keyword candidate is an effective part-of-speech. In this case, since the character succeeding to “” is “” that is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
, the processing goes out of the part-of-speech determining routine without setting the keyword end-disable point in accordance with the flowchart of FIG.
17
.
Thereafter, in step
4207
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In accordance with the flowchart of
FIG. 19
, “”, “” and “” are extracted as keywords.
Further, the processing from step
4203
to step
4207
is executed for the next effective character string “”. Assuming that the character-type segmentation point does not exist, “” is not present in the basic word storage means and the prefix storage means, and “” succeeding to “” is not present in the effective part-of-speech succeeding hiragana-character-string storage means
3
, no keywords are extracted from this segment as with the processing executed in Embodiment 1 for “”.
When all the effective character strings to be processed are taken out, the processing follows the path indicated by N from step
4208
and goes to step
4209
.
In step
4209
, those ones of the extracted keyword candidates which are present in the basic word storage means are discarded. This processing is executed in accordance with a flowchart shown in FIG.
27
.
It is assumed that the keyword candidates, i.e., “”, “”, “”, “”, “” and “” are stored in a buffer. First, one of the keyword candidates is taken out from the buffer in step
4301
. It is then checked in step
4303
whether or not the same word as the taken-out keyword candidate is present in the basic word storage means
2
. If step
4304
determines that the same word is present, then the taken-out keyword candidate is deleted in step
4305
. This processing is repeated for all the keyword candidates stored in the buffer, and is completed upon the determination in step
4302
being responded by NO.
As a result of the above processing, since “” is present in the basic word storage means, “” is deleted from the buffer. Thus, “”, “”, “”, “” and “” are finally extracted as keywords, thereby completing the processing sequence.
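The basic-word deletion loop of FIG. 27 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `drop_basic_words`, the use of a Python list for the buffer and a set for the basic word storage means, and the sample strings are all assumptions.

```python
def drop_basic_words(candidates, basic_words):
    """Return the keyword candidates that are not registered as basic words.

    candidates  -- list standing in for the candidate buffer (assumption)
    basic_words -- set standing in for the basic word storage means (assumption)
    """
    kept = []
    for cand in candidates:        # step 4301: take out one candidate
        if cand in basic_words:    # steps 4303/4304: is the same word registered?
            continue               # step 4305: delete the candidate
        kept.append(cand)
    return kept                    # step 4302 answers NO once the buffer is exhausted


# Hypothetical sample data; the strings are illustrative only.
buffer = ["サーバ", "サーバ切替", "切替", "通信", "通信テスト", "テスト"]
print(drop_basic_words(buffer, {"テスト"}))
```

Six candidates with one registered basic word leave five keywords, the same count as in the walk-through above.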
FIG. 28
is a block diagram showing an example of data flow in the present invention in relation to the steps according to the second aspect of the present invention.
Referring to
FIG. 28
, a Japanese input sentence “ (sahbah kirikae niyoru tsuushin tesuto wo okonau.=A server is switched over to perform a communication test.)”
4405
is entered in the input step
4
. In the technical-term-storage-means managing step
5
, words
4401
, i.e., “” and “”, are retrieved from the technical term storage means
1
. In the technical-term segmentation point setting step
6
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set in respective positions where “” and “” appear in the input sentence, as shown in block
4406
.
Then, the information that the proper expression of the word “” is “” is passed from the technical-term-storage-means managing step
5
to the proper-expression replacing step
7
. As a result, the character string “” in block
4406
is replaced by the proper expression, i.e., “”.
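One possible way to combine technical-term detection with proper-expression replacement is sketched below. The dictionary layout (every registered expression mapped to its proper expression), the longest-match preference, the function name `replace_variants` and the sample strings are assumptions made for illustration; the actual layout of the technical term storage means is defined by FIG. 2.

```python
def replace_variants(sentence, term_table):
    """Replace each registered expression by its proper expression.

    term_table maps every registered expression (proper or different) to
    the proper expression. Returns the rewritten sentence and the
    (start, end) positions of each technical term in that rewritten
    sentence, which play the role of the start-of-technical-term and
    end-of-technical-term segmentation points.
    """
    heads = sorted(term_table, key=len, reverse=True)  # assumption: longest match first
    out, spans = [], []
    i = pos = 0            # i scans the input, pos tracks the output length
    while i < len(sentence):
        for head in heads:
            if sentence.startswith(head, i):
                proper = term_table[head]
                spans.append((pos, pos + len(proper)))
                out.append(proper)
                pos += len(proper)
                i += len(head)
                break
        else:              # no registered expression starts here
            out.append(sentence[i])
            pos += 1
            i += 1
    return "".join(out), spans


# Hypothetical entries, not taken from FIG. 2.
table = {"サーバー": "サーバ", "サーバ": "サーバ", "切り替え": "切替"}
print(replace_variants("サーバー切り替えテスト", table))
```

Recording the spans against the rewritten sentence lets the later steps work on proper expressions only.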
Next, in the effective character-string cutting step
8
, a range of character string consisting of the effective character types, such as kanji, katakana, alphabets and numerals, or a technical term is taken out. As a result, “”, “” and “” are taken out as effective character strings, as shown in block
4408
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. As a result, the character-type segmentation points are set between “” and “” and between “” and “”, as shown in block
4409
.
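A character-type segmentation point setter might look like the sketch below. Classifying characters by their Unicode names is an assumption of this example (the patent does not say how character types are detected), as are the names `char_type` and `char_type_points`. Positions strictly inside a technical term are skipped, in line with the rule that the range which is itself a technical term is not split.

```python
import unicodedata


def char_type(ch):
    """Classify one character by its Unicode name (illustrative heuristic)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("KATAKANA"):      # also covers the prolonged sound mark
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isdigit():
        return "numeral"
    return "other"


def char_type_points(s, term_spans=()):
    """Return every index i where s[i-1] and s[i] differ in character type.

    term_spans holds (start, end) ranges of technical terms; an index
    strictly inside such a range is never reported.
    """
    points = []
    for i in range(1, len(s)):
        if any(a < i < b for a, b in term_spans):
            continue
        if char_type(s[i - 1]) != char_type(s[i]):
            points.append(i)
    return points


# Hypothetical strings: a kanji-to-katakana boundary, and a technical
# term occupying positions 1..4 of the second string.
print(char_type_points("通信テスト"))
print(char_type_points("各サーバ", [(1, 4)]))
```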
After that, the basic-word segmentation points are set in the basic-word segmentation point setting step
11
. To this end, in the basic-word-storage-means managing step
10
, the basic word storage means
2
is searched and the information that a word “”
4403
is a basic word is passed to the basic-word segmentation point setting step
11
. As a result, the start-of-basic-word segmentation point and the end-of-basic-word segmentation point are set respectively before and behind “”, as shown in block
4410
.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “” and “” are found, but “” is not found as indicated at
4404
, the keyword end-disable point is set behind “” as shown in block
4411
.
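The effective part-of-speech check can be pictured as a lookup on the text that follows an effective character string. In this sketch the function name `end_disable_point`, the use of a set for the effective part-of-speech succeeding hiragana-character-string storage means, and the sample sentence are assumptions; the handling of technical terms, for which the routine is skipped, is left to the caller.

```python
def end_disable_point(sentence, end, allowed_followers):
    """Decide whether position `end` becomes a keyword end-disable point.

    `end` is the index just past an effective character string. If the
    text from `end` onward starts with one of the registered hiragana
    strings, no point is set and None is returned; otherwise `end` is
    returned as the end-disable point.
    """
    tail = sentence[end:]
    if any(tail.startswith(h) for h in allowed_followers):
        return None
    return end


# Hypothetical data: two particles are registered, a verb ending is not.
followers = {"を", "による"}
sentence = "通信テストを行う"
print(end_disable_point(sentence, 5, followers))  # string followed by a registered particle
print(end_disable_point(sentence, 7, followers))  # string followed by an unregistered ending
```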
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point, and which does not terminate at the keyword end-disable point. As a result of the above processing, “”, “”, “”, “”, “” and “” are extracted as keywords from the input sentence, as shown in block
4412
.
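The cutting rule of step 14 amounts to pairing every keyword start-enable point with every later keyword end-enable point that is not an end-disable point. A minimal sketch follows, assuming that segmentation points are stored as character indices; the function name `cut_candidates` and the sample string are illustrative.

```python
def cut_candidates(s, starts, ends, end_disabled=()):
    """Enumerate every substring s[a:b] allowed by the segmentation points.

    starts       -- indices usable as keyword start-enable points
    ends         -- indices usable as keyword end-enable points
    end_disabled -- indices where a keyword end-disable point was set
    """
    blocked = set(end_disabled)
    out = []
    for a in sorted(set(starts)):
        for b in sorted(set(ends)):
            if a < b and b not in blocked:
                out.append(s[a:b])
    return out


# Hypothetical effective character string: a three-character technical
# term followed by two kanji, with a character-type point at index 3.
print(cut_candidates("サーバ切替", starts={0, 3}, ends={3, 5}))
```

With two start points and two end points the sketch yields three candidates: the technical term, the trailing word, and their compound.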
Thereafter, those ones of the keyword candidates which are the same as the basic words registered in the basic word storage means
2
are deleted from the buffer in the basic word deleting step
4101
. As a result of this processing, the keywords finally extracted from the input sentence are given by “”, “”, “”, “” and “”.
It is to be noted that while the segmentation points are set in Embodiment 2 in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, and the basic-word segmentation point setting step, the order of those processing steps may be optionally selected.
With Embodiment 2, as described above, the keyword extraction is carried out after replacing a different expression of the headword by a corresponding proper expression for technical terms registered in the technical term storage means, and when the technical term having the replaced proper expression is in continuity with the character string cut out from the input sentence because of difference in character type and the presence of a basic word, a keyword in the form of a compound word is also extracted so that the keyword extraction can be performed comprehensively. Since collation of words is made using their proper expressions at the time of both registration and retrieval of sentences, the number of different expressions of words, which serve as retrieval keys, is prevented from increasing combinatorially, and a high-speed keyword extraction apparatus can be achieved. Moreover, with the provision of the basic word deleting step, words which are not necessary as keywords for identifying a document can be deleted, and highly accurate keyword extraction can be realized with less retrieval waste.
Embodiment 3.
FIG. 29
is an overall block diagram of a keyword extraction method according to one embodiment of a third aspect, i.e., Embodiment 3, of the present invention. In
FIG. 29
, reference numerals 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 in FIG.
5
. Denoted by
2501
is prefix storage means which is made up of one field of headword alone as shown in
FIG. 30
, for example. Denoted by
2502
is a prefix-storage-means managing step for searching the prefix storage means
2501
to take out prefixes, and
2503
is a prefix segmentation point setting step for setting prefix segmentation points before and behind a character string which is in match with any prefix taken out in the prefix-storage-means managing step
2502
.
FIG. 31
is a flowchart showing the operation of the embodiment of the third aspect, i.e., Embodiment 3, of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (kakusahbah no saikakunin wo okonau.=Each server is reconfirmed.)”. First, in step
2701
, the Japanese sentence is input through a keyboard or file. Then, in step
2702
, technical-term segmentation points are set in the input sentence.
Assuming that the words shown in
FIG. 2
are registered in the technical term storage means
1
, similarly to the processing in Embodiment 1, “” is taken out as a technical term from the input sentence and is replaced by its proper expression, i.e., “”, in accordance with the flowchart of FIG.
7
. Also, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind “”, as shown in FIG.
32
.
Next, in step
2703
, effective character strings are taken out one by one from the head of the input sentence. In accordance with the flowchart of
FIG. 12
, similarly to the processing in Embodiment 1, “” is taken out as the first effective character string.
Subsequently, a character-type segmentation point is set in step
2704
. In accordance with the flowchart of
FIG. 13
, similarly to the processing in Embodiment 1, the character-type segmentation point is set between “” and “”.
After that, basic-word segmentation points are set in step
2705
. It is assumed that any partial character string of “” is not registered in the basic word storage means
2
. In accordance with the flowchart of
FIG. 15
, similarly to the processing in Embodiment 1, the processing goes to step
2706
without setting the basic-word segmentation points for that effective character string.
Prefix segmentation points are then set in step
2706
.
FIG. 33
shows a flow of processing to set the prefix segmentation points. First, in
2901
, a segment range containing no technical terms is taken out from the effective character string. In accordance with the flowchart of
FIG. 16
, similarly to the processing in Embodiment 1, “” is taken out as a segment of the effective character string containing no technical term.
Since there is such a segment to be processed, the determination in step
2902
is responded by YES. Then, in step
2903
, a pointer ph is assigned to the head character “” of the segment of effective character string which contains no technical term.
Subsequently, prefixes registered in the prefix storage means
2501
are taken out one by one in step
2904
, and the length of the taken-out prefix is assigned to a variable len in step
2906
. It is then checked in step
2907
whether or not the character string in length len starting from its head pointed by ph matches with the prefix taken out from the prefix storage means
2501
.
Assuming that “” is registered in the prefix storage means
2501
as shown in
FIG. 30
, the determination in step
2907
is responded by YES upon “” being taken out in step
2904
. After that, in step
2908
, a start-of-prefix segmentation point and an end-of-prefix segmentation point are set respectively before and behind “” in the character string to be processed. When the prefixes registered in the prefix storage means
2501
are all taken out in step
2904
, the determination in step
2905
is responded by NO and the processing goes to step
2909
.
In step
2909
, ph is shifted one character toward the tail of the segment. So long as ph is still in the segment, the prefix is taken out from the prefix storage means
2501
to repeat the similar processing as mentioned above.
In this case, since the character succeeding to “” is outside the range of effective character string containing no technical term, the processing follows the path indicated by N from step
2910
. For “”, there is no other segment of effective character string containing no technical term. Accordingly, the processing follows the path indicated by N from step
2902
, thereby going out of the routine of FIG.
33
.
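The routine of FIG. 33 can be sketched as a scan that tests every registered prefix at every position of the segment. The function name `prefix_points`, the list standing in for the prefix storage means, and the sample strings are assumptions; the variable `ph` mirrors the pointer of the flowchart, and registered prefixes are assumed to be non-empty.

```python
def prefix_points(segment, prefixes):
    """Return (start, end) index pairs for each prefix match in segment.

    segment  -- a range of the effective character string that contains
                no technical term
    prefixes -- iterable standing in for the prefix storage means
    """
    points = []
    ph = 0                                        # step 2903: point at the head character
    while ph < len(segment):                      # step 2910: is ph still in the segment?
        for prefix in prefixes:                   # steps 2904/2905: take prefixes one by one
            length = len(prefix)                  # step 2906
            if segment[ph:ph + length] == prefix: # step 2907
                points.append((ph, ph + length))  # step 2908: start/end-of-prefix points
        ph += 1                                   # step 2909: shift toward the tail
    return points


# Hypothetical prefix entries and segments.
print(prefix_points("各", ["各", "再"]))
print(prefix_points("再確認", ["各", "再"]))
```

The suffix routine of FIG. 41 in Embodiment 4 follows the same pattern with a suffix list in place of the prefix list.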
The character string succeeding to a keyword candidate is then checked in step
2707
in
FIG. 31
to determine whether the keyword candidate is an effective part-of-speech. In accordance with the flowchart of
FIG. 17
, similarly to the processing in Embodiment 1, the part-of-speech determining routine is skipped because “” is a technical term.
As a result of the above processing, the segmentation points are set in the first effective character string, as shown in FIG.
34
.
Thereafter, in step
2708
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In this embodiment, a keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, the character-type segmentation point, the start-of-prefix segmentation point, and the end-of-prefix segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point. It is further assumed that the position where a keyword end-disable point is set in the effective part-of-speech determining process cannot serve as the keyword end-enable point. In addition, it is assumed that the end-of-prefix segmentation point serves as the keyword end-disable point only and cannot serve as the keyword end-enable point.
In accordance with the flowchart of
FIG. 19
, similarly to the processing in Embodiment 1, “” and “” are extracted as keywords from “”.
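The prefix-aware cutting rule of this embodiment can be illustrated as follows: both prefix segmentation points may start a keyword, but the end-of-prefix point may never end one, so a bare prefix is not extracted. The function name `cut_with_prefixes`, the index-based representation of segmentation points, and the sample strings are assumptions of this sketch.

```python
def cut_with_prefixes(s, starts, ends, prefix_spans, end_disabled=()):
    """Enumerate keyword candidates under the Embodiment 3 rule.

    starts, ends -- start-enable and end-enable indices from the
                    technical-term, effective-string, basic-word and
                    character-type points
    prefix_spans -- (start, end) index pairs of matched prefixes
    end_disabled -- end-disable indices from the part-of-speech check
    """
    starts = set(starts)
    blocked = set(end_disabled)
    for p_start, p_end in prefix_spans:
        starts.update((p_start, p_end))  # both prefix points may start a keyword
        blocked.add(p_end)               # a keyword may not end right after a prefix
    return [s[a:b]
            for a in sorted(starts)
            for b in sorted(set(ends))
            if a < b and b not in blocked]


# Hypothetical string: a one-character prefix followed by a technical
# term at positions 1..4; index 1 is also a character-type point.
print(cut_with_prefixes("各サーバ", starts={0, 1}, ends={1, 4}, prefix_spans=[(0, 1)]))
```

For this stand-in string the sketch yields two candidates, the prefixed form and the bare technical term, which matches the number of keywords in the walk-through.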
Next, it is checked in step
2709
whether or not any effective character string still remains in the input sentence. In this case, the processing follows the path indicated by Y and returns to step
2703
for taking out a next effective character string “”.
Subsequently, a character-type segmentation point is set in step
2704
. This processing is executed in accordance with the flowchart of
FIG. 13
, but the processing goes out of the routine of
FIG. 13
without setting the character-type segmentation point because there is no difference in character type in the character string “”.
After that, basic-word segmentation points are set in step
2705
. This processing is executed in accordance with the flowchart of
FIG. 15
, but the processing goes out of the routine of
FIG. 15
without setting the basic-word segmentation points on an assumption that any partial character string of “” is not registered in the basic word storage means
2
.
Subsequently, prefix segmentation points are set in step
2706
. This processing is executed in accordance with the flowchart of FIG.
33
. Assuming that “” is registered in the prefix storage means
2501
, the start-of-prefix segmentation point and the end-of-prefix segmentation point are set before and behind “” of “”, respectively.
The character string succeeding to a keyword candidate is then checked in step
2707
to determine whether the keyword candidate is an effective part-of-speech. In this case, since the character succeeding to “” is “” that is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
, the processing goes out of the part-of-speech determining routine without setting the keyword end-disable point in accordance with the flowchart of FIG.
17
.
As a result of the above processing, the segmentation points are set in “”, as shown in FIG.
35
.
Thereafter, in step
2708
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In accordance with the flowchart of
FIG. 19
, “” and “” are extracted as keywords.
Further, the processing from step
2703
to step
2708
is executed for the next effective character string “”. Assuming that the character-type segmentation point does not exist, “” is not present in the basic word storage means
2
and the prefix storage means
2501
, and “” succeeding to “” is not present in the effective part-of-speech succeeding hiragana-character-string storage means
3
, no keywords are extracted from this segment as with the processing executed in Embodiment 1 for “”.
When all the effective character strings to be processed are taken out, the processing follows the path indicated by N from step
2709
, thereby going out of the processing sequence of FIG.
31
.
FIG. 36
is a block diagram showing an example of data flow in the present invention in relation to the steps according to the third aspect of the present invention.
Referring to
FIG. 36
, a Japanese input sentence “ (kakusahbah no saikakunin wo okonau.=Each server is reconfirmed.)”
3205
is entered in the input step
4
. In the technical-term-storage-means managing step
5
, a word
3201
, i.e., “”, is retrieved from the technical term storage means
1
. In the technical-term segmentation point setting step
6
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set in respective positions where “” appears in the input sentence, as shown in block
3206
.
Then, the information that the proper expression of the word “” is “” is passed from the technical-term-storage-means managing step
5
to the proper-expression replacing step
7
. As a result, the character string “” in block
3206
is replaced by the proper expression, i.e., “”.
Next, in the effective character-string cutting step
8
, a range of character string consisting of the effective character types, such as kanji, katakana, alphabets and numerals, or a technical term is taken out. As a result, “”, “” and “” are taken out as effective character strings, as shown in block
3208
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. As a result, the character-type segmentation point is set between “” and “”, as shown in block
3209
.
After that, the basic-word segmentation points are set in the basic-word segmentation point setting step
11
. In this case, the basic-word segmentation points are not set as shown in block
3210
.
In the prefix-storage-means managing step
2502
, the prefix storage means
2501
is searched and the information that words “” and “”
3203
are prefixes is passed to the prefix segmentation point setting step
2503
. As a result, the start-of-prefix segmentation point and the end-of-prefix segmentation point are set before and behind each of “” and “”, respectively, as shown in block
3211
.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “” and “” are found, but “” is not found as indicated at
3204
, the keyword end-disable point is set behind “” as shown in block
3212
.
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, the character-type segmentation point, the start-of-prefix segmentation point, and the end-of-prefix segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point, and which does not terminate at the keyword end-disable point. As a result of the above processing, “”, “”, “” and “” are extracted as keywords from the input sentence, as shown in block
3213
.
It is to be noted that while the segmentation points are set in Embodiment 3 in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step, and the prefix segmentation point setting step, the order of those processing steps may be optionally selected.
Also, quantity prefixes preceding character strings for quantity expressions, such as “” of “ (yaku ichiman en=about ten thousand yen)” and “” of “ (dai 30 kai=30-th)”, may be selected as prefixes which are registered in the prefix storage means, enabling the keyword extraction process to be executed for those prefixes in a similar manner as described above.
With Embodiment 3, as described above, when keywords are extracted in consideration of the correlation between prefixes, which are registered in the prefix storage means, and technical terms succeeding to the prefixes, a different expression of the headword is replaced by a corresponding proper expression for the technical term, and collation of words is made using the proper expressions at the time of both registration and retrieval of documents. Accordingly, a keyword extraction method adapted for high-speed document retrieval can be realized, while the number of different expressions of words serving as retrieval keys is kept from increasing combinatorially due to the presence or absence of a prefix and the different expressions of a technical term succeeding to the prefix.
Embodiment 4.
FIG. 37
is an overall block diagram of a keyword extraction method according to one embodiment of a fourth aspect, i.e., Embodiment 4, of the present invention. In
FIG. 37
, reference numerals 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 in FIG.
5
. Denoted by
3301
is suffix storage means which is made up of one field of headword alone as shown in
FIG. 38
, for example. Denoted by
3302
is a suffix-storage-means managing step for searching the suffix storage means
3301
to take out suffixes, and
3303
is a suffix segmentation point setting step for setting suffix segmentation points before and behind a character string which is in match with any suffix taken out in the suffix-storage-means managing step
3302
.
FIG. 39
is a flowchart showing the operation of the embodiment of the fourth aspect, i.e., Embodiment 4, of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (sahbahgawa wo kakuninchuu tosuru.=Assume server side to be under confirmation.)”. First, in step
3501
, the Japanese sentence is input through a keyboard or file. Then, in step
3502
, technical-term segmentation points are set in the input sentence.
Assuming that the words shown in
FIG. 2
are registered in the technical term storage means
1
, similarly to the processing in Embodiment 1, “” is taken out as a technical term from the input sentence and is replaced by its proper expression, i.e., “”, in accordance with the flowchart of FIG.
7
. Also, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind “”, as shown in FIG.
40
.
Next, in step
3503
, effective character strings are taken out one by one from the head of the input sentence. In accordance with the flowchart of
FIG. 12
, similarly to the processing in Embodiment 1, “” is taken out as the first effective character string.
Subsequently, a character-type segmentation point is set in step
3504
. In accordance with the flowchart of
FIG. 13
, similarly to the processing in Embodiment 1, the character-type segmentation point is set between “” and “”.
After that, basic-word segmentation points are set in step
3505
. It is assumed that any partial character string of “” is not registered in the basic word storage means
2
. In accordance with the flowchart of
FIG. 15
, similarly to the processing in Embodiment 1, the processing goes to step
3506
without setting the basic-word segmentation points for that effective character string.
Suffix segmentation points are then set in step
3506
.
FIG. 41
shows a flow of processing to set the suffix segmentation points. First, in
3701
, a segment range containing no technical terms is taken out from the effective character string. In accordance with the flowchart of
FIG. 16
, similarly to the processing in Embodiment 1, “” is taken out as a segment of the effective character string containing no technical term.
Since there is such a segment to be processed, the determination in step
3702
is responded by YES. Then, in step
3703
, a pointer ph is assigned to the head character “” of the segment of effective character string which contains no technical term.
Subsequently, suffixes registered in the suffix storage means
3301
are taken out one by one in step
3704
, and the length of the taken-out suffix is assigned to a variable len in step
3706
. It is then checked in step
3707
whether or not the character string in length len starting from its head pointed by ph matches with the suffix taken out from the suffix storage means
3301
.
Assuming that “” is registered in the suffix storage means
3301
as shown in
FIG. 38
, the determination in step
3707
is responded by YES upon “” being taken out in step
3704
. After that, in step
3708
, a start-of-suffix segmentation point and an end-of-suffix segmentation point are set respectively before and behind “” in the character string to be processed. When the suffixes registered in the suffix storage means
3301
are all taken out in step
3704
, the determination in step
3705
is responded by NO and the processing goes to step
3709
.
In step
3709
, ph is shifted one character toward the tail of the segment. So long as ph is still in the segment, the suffix is taken out from the suffix storage means
3301
to repeat the similar processing as mentioned above.
In this case, since the character succeeding to “” is outside the range of effective character string containing no technical term, the processing follows the path indicated by N from step
3710
. For “”, there is no other segment of effective character string containing no technical term. Accordingly, the processing follows the path indicated by N from step
3702
, thereby going out of the routine of FIG.
41
.
The character string succeeding to a keyword candidate is then checked in step
3507
in
FIG. 39
to determine whether the keyword candidate is an effective part-of-speech. In accordance with the flowchart of
FIG. 17
, similarly to the processing in Embodiment 1, the processing goes out of the part-of-speech determining routine without setting the keyword end-disable point because the character succeeding to “” is “” that is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
in this case.
As a result of the above processing, the segmentation points are set in the first effective character string, as shown in FIG.
42
.
Thereafter, in step
3508
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In this embodiment, a keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, the character-type segmentation point, the start-of-suffix segmentation point, and the end-of-suffix segmentation point. It is further assumed that the position where a keyword end-disable point is set in the effective part-of-speech determining process cannot serve as the keyword end-enable point. In addition, it is assumed that the start-of-suffix segmentation point serves as the keyword start-disable point only and cannot serve as the keyword start-enable point.
In accordance with the flowchart of
FIG. 19
, similarly to the processing in Embodiment 1, “” and “” are extracted as keywords from “”.
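The suffix-aware cutting rule mirrors the prefix case: both suffix segmentation points may end a keyword, but the start-of-suffix point may never start one, so a bare suffix is not extracted. The function name `cut_with_suffixes`, the index-based representation of segmentation points, and the sample strings are assumptions of this sketch.

```python
def cut_with_suffixes(s, starts, ends, suffix_spans, end_disabled=()):
    """Enumerate keyword candidates under the Embodiment 4 rule.

    starts, ends -- start-enable and end-enable indices from the
                    technical-term, effective-string, basic-word and
                    character-type points
    suffix_spans -- (start, end) index pairs of matched suffixes
    end_disabled -- end-disable indices from the part-of-speech check
    """
    ends = set(ends)
    no_start = set()
    for s_start, s_end in suffix_spans:
        ends.update((s_start, s_end))  # both suffix points may end a keyword
        no_start.add(s_start)          # a keyword may not start at a suffix
    blocked = set(end_disabled)
    return [s[a:b]
            for a in sorted(set(starts) - no_start)
            for b in sorted(ends)
            if a < b and b not in blocked]


# Hypothetical strings: a technical term at positions 0..3 followed by a
# one-character suffix, and a two-kanji word followed by a suffix.
print(cut_with_suffixes("サーバ側", starts={0, 3}, ends={3, 4}, suffix_spans=[(3, 4)]))
print(cut_with_suffixes("確認中", starts={0}, ends={3}, suffix_spans=[(2, 3)]))
```

Each stand-in string yields two candidates, the word alone and the word with its suffix.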
Next, it is checked in step
3509
whether or not any effective character string still remains in the input sentence. In this case, the processing follows the path indicated by Y and returns to step
3503
for taking out a next effective character string “”.
Subsequently, a character-type segmentation point is set in step
3504
. This processing is executed in accordance with the flowchart of
FIG. 13
, but the processing goes out of the routine of
FIG. 13
without setting the character-type segmentation point because there is no difference in character type in the character string “”.
After that, basic-word segmentation points are set in step
3505
. This processing is executed in accordance with the flowchart of
FIG. 15
, but the processing goes out of the routine of
FIG. 15
without setting the basic-word segmentation points on an assumption that any partial character string of “” is not registered in the basic word storage means
2
.
Subsequently, suffix segmentation points are set in step
3506
. This processing is executed in accordance with the flowchart of FIG.
41
. Assuming that “” is registered in the suffix storage means
3301
, the start-of-suffix segmentation point and the end-of-suffix segmentation point are set before and behind “” of “”, respectively.
The character string succeeding to a keyword candidate is then checked in step
3507
to determine whether the keyword candidate is an effective part-of-speech. In this case, since the character succeeding to “” is “” that is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
, the processing goes out of the part-of-speech determining routine without setting the keyword end-disable point in accordance with the flowchart of FIG.
17
.
As a result of the above processing, the segmentation points are set in “”, as shown in FIG.
43
.
Thereafter, in step
3508
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In accordance with the flowchart of
FIG. 19
, “” and “” are extracted as keywords.
It is then checked in step
3509
whether or not any segment of effective character string remains in the input sentence. In this case, since there remains no such a segment, the processing sequence of
FIG. 39
is completed.
FIG. 44
is a block diagram showing an example of data flow in the present invention in relation to the steps according to the fourth aspect of the present invention.
Referring to
FIG. 44
, a Japanese input sentence “ (sahbahgawa wo kakuninchuu tosuru.=Assume server side to be under confirmation.)”
4005
is entered in the input step
4
. In the technical-term-storage-means managing step
5
, a word “”
4001
is retrieved from the technical term storage means
1
. In the technical-term segmentation point setting step
6
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set in respective positions where “” appears in the input sentence, as shown in block
4006
.
Then, the information that the proper expression of the word “” is “” is passed from the technical-term-storage-means managing step
5
to the proper-expression replacing step
7
. As a result, the character string “” in block
4006
is replaced by the proper expression, i.e., “”.
Next, in the effective character-string cutting step
8
, a character string range consisting of the effective character types, such as kanji, katakana, alphabetic characters and numerals, or of a technical term, is taken out. As a result, “” and “” are taken out as effective character strings, as shown in block
4008
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. As a result, the character-type segmentation point is set between “” and “”, as shown in block
4009
.
After that, the basic-word segmentation points are set in the basic-word segmentation point setting step
11
. In this case, the basic-word segmentation points are not set as shown in block
4010
.
In the suffix-storage-means managing step
3302
, the suffix storage means
3301
is searched and the information that words “” and “”
4003
are suffixes is passed to the suffix segmentation point setting step
3303
. As a result, the start-of-suffix segmentation point and the end-of-suffix segmentation point are set before and behind each of “” and “”, respectively, as shown in block
4011
.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “” and “” are found, the keyword end-disable point is not set as shown in block
4004
.
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, the character-type segmentation point, the start-of-suffix segmentation point, and the end-of-suffix segmentation point, and which does not start from the start-of-suffix segmentation point and does not terminate at the keyword end-disable point. As a result of the above processing, “”, “” and “” are extracted as keywords from the input sentence, as shown in block
4013
.
It is to be noted that while this embodiment has been described in connection with suffixes, the keyword extraction process can also be performed for an infix, for example, “” of “ (nihon tai amerika=Japan versus America)”, by setting segmentation points before and behind the infix through processing similar to that described above.
Also, quantity suffixes succeeding to character strings for quantity expressions, such as “” of “ (yaku ichiman en=about ten thousand yen)” and “” of “ (dai 30 kai=30th)”, may be selected as suffixes which are registered in the suffix storage means, enabling the keyword extraction process to be executed for those suffixes in a manner similar to that described above. Further, while the segmentation points are set in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step, and the suffix segmentation point setting step, the order of those processing steps may be optionally selected.
With Embodiment 4, as described above, when keywords are extracted in consideration of the correlation between suffixes, which are registered in the suffix storage means, and technical terms preceding the suffixes, a different expression of the headword is replaced by a corresponding proper expression for the technical term, and collation of words is made using the proper expressions at the time of both registration and retrieval of documents. Accordingly, a keyword extraction method adapted for high-speed document retrieval can be realized, while the number of different expressions of words serving as retrieval keys is prevented from growing combinatorially with the presence or absence of a suffix and the different expressions of the technical term preceding the suffix.
Embodiment 5.
FIG. 45
is an overall block diagram of a keyword extraction method according to one embodiment of a fifth aspect, i.e., Embodiment 5, of the present invention. In
FIG. 45
, reference numerals
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
in FIG.
5
. Denoted by
4501
is a number-of-characters limiting step for deleting those of the keyword candidates extracted in the partial character string cutting step
14
, whose number of characters exceeds a certain value.
FIG. 46
is a flowchart showing the operation of the embodiment of the fifth aspect, i.e., Embodiment 5, of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (yuza intafehsu kirikae wo okonau=A user interface is switched over)”. First, in step
4601
, the Japanese sentence is input through a keyboard or file. Then, in step
4602
, technical-term segmentation points are set in the input sentence.
Assuming that the words shown in
FIG. 2
are registered in the technical term storage means
1
, “” is taken out as a technical term from the input sentence, and the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind “” in accordance with the flowchart of FIG.
7
.
Next, in step
4603
, effective character strings are taken out one by one from the head of the input sentence. In accordance with the flowchart of
FIG. 12
, “” is taken out as the first effective character string.
Subsequently, a character-type segmentation point is set in step
4604
. In accordance with the flowchart of
FIG. 13
, the character-type segmentation point is set between “” and “”.
After that, basic-word segmentation points are set in step
4605
. It is assumed that any partial character string of “” is not registered in the basic word storage means
2
. In accordance with the flowchart of
FIG. 15
, the processing goes to step
4606
without setting the basic-word segmentation points for that effective character string.
The character string succeeding to a keyword candidate is then checked in step
4606
to determine whether the keyword candidate is an effective part-of-speech. In accordance with the flowchart of
FIG. 17
, the part-of-speech determining routine is skipped because “” is a technical term.
Thereafter, in step
4607
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In this embodiment, a keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point. It is further assumed that the position where a keyword end-disable point is set in the effective part-of-speech determining process cannot serve as the keyword end-enable point.
In accordance with the flowchart of
FIG. 19
, “”, “” and “” are extracted as keywords from “”.
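The candidate enumeration described above can be sketched as follows. This is a minimal illustration rather than the flowchart of FIG. 19 itself: the function name `cut_partial_strings`, the representation of segmentation points as character offsets, and the ASCII sample string are all assumptions made for the example.

```python
def cut_partial_strings(text, start_points, end_points, end_disabled=frozenset()):
    """Return every substring that starts at a keyword start-enable point and
    ends at a keyword end-enable point that is not an end-disable point.

    Points are offsets between characters: 0 is the head of the effective
    character string and len(text) is its end.
    """
    keywords = []
    for s in sorted(start_points):
        for e in sorted(end_points):
            if s < e and e not in end_disabled:
                keywords.append(text[s:e])
    return keywords


# Hypothetical effective character string: a 4-character technical term
# ("ABCD") followed by a 2-character remainder ("xy").
text = "ABCDxy"
starts = {0, 4}   # start of string / start of term, boundary behind the term
ends = {4, 6}     # end of term, end of string
print(cut_partial_strings(text, starts, ends))
# An end-disable point at the end of the string suppresses candidates ending there.
print(cut_partial_strings(text, starts, ends, end_disabled={6}))
```

With one technical term and one trailing segment, the sketch yields three candidates (the term, the whole string, and the trailing segment), which has the same shape as the three keywords extracted in the example above.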
Next, it is checked in step
4608
whether or not any effective character string still remains in the input sentence.
Further, the processing from step
4603
to step
4607
is executed for the next effective character string “”. Assuming that there is no character-type segmentation point in the effective character string, “” is not present in the basic word storage means and the prefix storage means, and “” succeeding to “” is not present in the effective part-of-speech succeeding hiragana-character-string storage means
3
, no keywords are extracted from this segment as with the processing executed in Embodiment 1 for “”.
When all the effective character strings to be processed are taken out, the processing follows the path indicated by N from step
4608
and goes to step
4609
.
In step
4609
, those of the extracted keyword candidates whose number of characters exceeds a certain value are deleted. This processing is executed in accordance with a flowchart shown in FIG.
47
. In this embodiment, the number of characters is assumed to be limited within 12 characters.
It is assumed that the keyword candidates “”, “” and “” are stored in a buffer. First, one of the keyword candidates is taken out from the buffer in step
4701
. It is then checked in step
4703
whether or not the number of characters of the taken-out keyword is equal to or less than 12. If the number of characters exceeds 12, then that word is deleted in step
4704
. This processing is repeated for all the keyword candidates stored in the buffer, and is completed upon the determination in step
4702
being responded by NO.
As a result of the above processing, since the number of characters of “” exceeds 12, it is deleted from the buffer. Thus, “” and “” are finally extracted as keywords, thereby completing the processing sequence.
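The number-of-characters limiting process may be sketched as below, assuming the buffer is an ordinary list of strings. The function name `limit_keyword_length` and the sample candidates are hypothetical; the step numbers in the comments refer to FIG. 47 as described above.

```python
def limit_keyword_length(candidates, max_chars=12):
    """Keep only the keyword candidates of at most max_chars characters."""
    kept = []
    for word in candidates:            # step 4701: take out one candidate
        if len(word) <= max_chars:     # step 4703: is the length within the limit?
            kept.append(word)
        # otherwise the candidate is dropped (step 4704)
    return kept                        # the loop ends when no candidate remains (step 4702)


# Hypothetical candidates of 12, 13 and 1 characters.
print(limit_keyword_length(["a" * 12, "b" * 13, "c"]))
```

Because the count is taken after the proper-expression replacement, two spellings of the same term are either both kept or both dropped.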
FIG. 48
is a block diagram showing an example of data flow in the present invention in relation to the steps according to the fifth aspect of the present invention.
Referring to
FIG. 48
, a Japanese input sentence “ (yuza intafehsu kirikae wo okonau=A user interface is switched over)”
4805
is entered in the input step
4
. In the technical-term-storage-means managing step
5
, a word “”
4801
is retrieved from the technical term storage means
1
. In the technical-term segmentation point setting step
6
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set in respective positions where “” appears in the input sentence, as shown in block
4806
.
Then, a different expression of the technical term is replaced by a corresponding proper expression in the proper-expression replacing step
7
. In this case, since the input sentence contains no technical term written in a different expression, the proper-expression replacing step
7
is skipped.
Next, in the effective character-string cutting step
8
, a character string range consisting of the effective character types, such as kanji, katakana, alphabetic characters and numerals, or of a technical term, is taken out. As a result, “” and “” are taken out as effective character strings, as shown in block
4808
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. As a result, the character-type segmentation point is set between “” and “”, as shown in block
4809
.
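One way to detect the positions where the character type changes is sketched below. The classification by Unicode code-point range and the names `char_type` and `char_type_points` are assumptions made for illustration, and the sample strings are not taken from the figures. The sketch also ignores the rule that ranges which are themselves technical terms are excluded.

```python
def char_type(ch):
    """Classify one character by Unicode block (a simplifying assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:      # includes the prolonged sound mark
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "other"


def char_type_points(text):
    """Return the offsets at which the character type changes."""
    return [i for i in range(1, len(text))
            if char_type(text[i - 1]) != char_type(text[i])]


# Hypothetical examples: katakana followed by kanji, letters followed by digits.
print(char_type_points("データ処理"))
print(char_type_points("ABC123"))
```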
After that, the basic-word segmentation points are set in the basic-word segmentation point setting step
11
. In this case, since the input sentence contains no basic words, the basic-word segmentation point setting step
11
is skipped.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “” is found, but “” is not found as indicated at
4802
, the keyword end-disable point is set behind “” as shown in block
4811
.
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point, and which does not terminate at the keyword end-disable point. As a result of the above processing, “”, “” and “” are extracted as keyword candidates from the input sentence, as shown in block
4812
.
Thereafter, in the number-of-characters limiting step
4501
, those ones of the extracted keyword candidates which have the number of characters in excess of 12 are deleted. As a result of this processing, “” and “” are finally extracted as keywords from the input sentence, as shown in block
4813
.
It is to be noted that while the segmentation points are set in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, and the basic-word segmentation point setting step in Embodiment 5, the order of those processing steps may be optionally selected.
With Embodiment 5, as described above, the number of characters of the extracted keyword is limited so as to fall in a certain range. To this end, for the technical terms registered in the technical term storage means, the keyword is extracted after replacing a different expression of the headword by a corresponding proper expression, and the number of characters is then counted for the keyword having the proper expression. Accordingly, a keyword extraction method is realized which avoids uneven keyword extraction in which, among words having the same meaning, some expressions are registered while others are deleted merely because the different expressions differ in the number of characters.
Embodiment 6.
FIG. 49
is an overall block diagram of a keyword extraction method according to one embodiment of a sixth aspect, i.e., Embodiment 6, of the present invention. In
FIG. 49
, reference numerals
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
in FIG.
5
. Denoted by
4901
is a frequency totalizing step for totalizing the appearance frequency for each extracted keyword.
FIG. 50
is a flowchart showing the operation of the embodiment of the sixth aspect, i.e., Embodiment 6, of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (tanmatsuno kirikae to kaisenno kirikae wo okonau=Terminal switching and line switching are made)”. First, in step
5001
, the Japanese sentence is input through a keyboard or file. Then, in step
5002
, technical-term segmentation points are set in the input sentence.
Assuming that the words shown in
FIG. 2
are registered in the technical term storage means
1
, “” and “” are taken out as technical terms from the input sentence in accordance with the flowchart of FIG.
7
. Also, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind each of “” and “”. Since “” is a different expression, it is replaced by a corresponding proper expression, i.e., “”.
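The combination of technical-term lookup, proper-expression replacement and segmentation-point setting can be sketched as a single left-to-right scan. The dictionary layout (every registered expression mapped to its proper expression), the longest-match preference, and the English sample spellings are assumptions made for this example and are not taken from FIG. 2 or FIG. 7.

```python
def normalize_terms(text, proper_of):
    """Replace registered expressions by their proper expressions.

    proper_of maps every registered expression (proper or different) to its
    proper expression. Returns the rewritten text and a list of (start, end)
    offsets marking each technical term in the rewritten text.
    """
    terms = sorted(proper_of, key=len, reverse=True)   # prefer the longest match
    pieces, spans, length, i = [], [], 0, 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                proper = proper_of[term]
                spans.append((length, length + len(proper)))
                pieces.append(proper)
                length += len(proper)
                i += len(term)
                break
        else:                                          # no term starts here
            pieces.append(text[i])
            length += 1
            i += 1
    return "".join(pieces), spans


# Hypothetical dictionary: "colour" is a different expression of "color".
proper_of = {"color": "color", "colour": "color"}
print(normalize_terms("a colour b", proper_of))
```

Recording the offsets against the rewritten text keeps the segmentation points valid even when the proper expression and the different expression differ in length.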
Next, in step
5003
, effective character strings are taken out one by one from the head of the input sentence. In accordance with the flowchart of
FIG. 12
, “” is taken out as the first effective character string.
Subsequently, a character-type segmentation point is set in step
5004
. This processing is executed in accordance with the flowchart of
FIG. 13
similarly to the processing in Embodiment 1. In this case, however, since there is no difference in character type, the processing goes to next step
5005
without setting the character-type segmentation point.
Basic-word segmentation points are set in next step
5005
. It is assumed that any partial character string of “” is not registered in the basic word storage means
2
. This processing is executed in accordance with the flowchart of
FIG. 15
similarly to the processing in Embodiment 1. In this case, however, the processing goes to step
5006
without setting the basic-word segmentation points for that effective character string.
The character string succeeding to a keyword candidate is then checked in step
5006
to determine whether the keyword candidate is an effective part-of-speech. In this case, since the character succeeding to “” is “” that is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
, the processing goes out of the part-of-speech determining routine without setting the keyword end-disable point in accordance with the flowchart of FIG.
17
.
Thereafter, in step
5007
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In this embodiment, a keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point. It is further assumed that the position where a keyword end-disable point is set in the effective part-of-speech determining process cannot serve as the keyword end-enable point.
In accordance with the flowchart of
FIG. 19
, “” is extracted as a keyword from “”.
Next, it is checked in step
5008
whether or not any effective character string still remains in the input sentence. Repeating the above processing, character strings “”, “”, “” and “” are taken out successively as effective character strings. For “”, since neither character-type segmentation points nor basic-word segmentation points are set in the range of a technical term, “” is allowed to serve as a keyword candidate as it is. Assuming that there is no difference in character type in the character string “” and that no partial character string of “” is registered in the basic word storage means
2
, “” is also allowed to serve as a keyword candidate as it is. As with Embodiment 1, no keywords are extracted from “”.
As a result, until the determination in step
5008
is responded by NO, four words “”, “”, “” and “” are extracted as keyword candidates.
In step
5009
, the appearance frequency is totalized for each of the extracted candidates. This processing is executed in accordance with a flowchart shown in FIG.
51
.
It is assumed that the keyword candidates “”, “”, “” and “” are stored in a buffer A. Also, a buffer B is assumed to be empty. First, one of the keyword candidates is taken out from the buffer A in step
5101
. It is then checked in step
5103
whether or not the taken-out keyword is present in the buffer B. If the taken-out keyword is present in the buffer B, then the frequency value of that keyword in the buffer B is incremented by one in step
5104
. If the taken-out keyword is not present in the buffer B, then it is copied into the buffer B in step
5105
with a frequency value of 1. This processing is repeated for all the keyword candidates stored in the buffer A, and is completed upon the determination in step
5102
being responded by NO. The finally extracted keywords are those stored in the buffer B.
In the above processing, “”, “” appearing first in the input sentence, and “” are copied into the buffer B in step
5105
with frequency values all given 1. For “” appearing second in the input sentence, the processing to count up the frequency value of “” in the buffer B by one is executed in step
5104
. As a result, “”, “” and “” are finally extracted as keywords with frequency values given 1, 2 and 1, respectively. The processing in step
5009
is thus completed.
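The frequency totalizing process may be sketched as below, with buffer A modelled as a list and buffer B as a dictionary. The function name `totalize_frequency` and the English sample candidates are hypothetical; the step numbers in the comments refer to FIG. 51 as described above.

```python
def totalize_frequency(buffer_a):
    """Count how often each keyword candidate appears in buffer A."""
    buffer_b = {}
    for keyword in buffer_a:           # step 5101: take out one candidate
        if keyword in buffer_b:        # step 5103: already present in buffer B?
            buffer_b[keyword] += 1     # step 5104: increment its frequency
        else:
            buffer_b[keyword] = 1      # step 5105: copy it with frequency 1
    return buffer_b                    # buffer B holds the final keywords


# Hypothetical candidates in order of appearance; the second one appears twice.
print(totalize_frequency(["terminal", "switching", "line", "switching"]))
```

In Python the same tally could also be obtained with `collections.Counter`; the explicit loop is kept here only to mirror the two-buffer description.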
FIG. 52
is a block diagram showing an example of data flow in the present invention in relation to the steps according to the sixth aspect of the present invention.
Referring to
FIG. 52
, a Japanese input sentence “ (tanmatsuno kirikae to kaisenno kirikae wo okonau=Terminal switching and line switching are made)”
5205
is entered in the input step
4
. In the technical-term-storage-means managing step
5
, words “” and “”
5201
are retrieved from the technical term storage means
1
. In the technical-term segmentation point setting step
6
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set in respective positions where “” and “” appear in the input sentence, as shown in block
5206
.
Then, a different expression of the technical term is replaced by a corresponding proper expression in the proper-expression replacing step
7
. In this case, “” is replaced by “”, followed by going to the next step.
In the next effective character-string cutting step
8
, a character string range consisting of the effective character types, such as kanji, katakana, alphabetic characters and numerals, or of a technical term, is taken out. As a result, “”, “”, “”, “” and “” are taken out as effective character strings, as shown in block
5208
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. In this case, since there is no point meeting the condition, the processing goes to the next step without setting the character-type segmentation point.
The basic-word segmentation points are set in the next basic-word segmentation point setting step
11
. In this case, the basic-word segmentation point is not set as shown in block
5210
.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “”, “” and “” are found, but “” is not found as indicated at
5203
, the keyword end-disable point is set behind “” as shown in block
5211
.
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, and the character-type segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, and the character-type segmentation point, and which does not terminate at the keyword end-disable point. As a result of the above processing, “”, “”, “” and “” are extracted as keyword candidates from the input sentence, as shown in block
5212
.
Thereafter, in the frequency totalizing step
4901
, the appearance frequency is totalized for each of the extracted keywords. As a result of this processing, “”, “” and “” are finally extracted as keywords with frequency values given 1, 2 and 1, respectively.
It is to be noted that while the segmentation points are set in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, and the basic-word segmentation point setting step in Embodiment 6, the order of those processing steps may be optionally selected.
With Embodiment 6, as described above, for the technical terms registered in the technical term storage means, keyword extraction is performed after replacing a different expression of the headword by a corresponding proper expression. Accordingly, a keyword extraction method is realized which prevents words having the same meaning but different expressions from being counted as separate words, and which can give each keyword a precise value of appearance frequency.
Embodiment 7.
FIG. 53
is an overall block diagram of a keyword extraction method according to one embodiment of a seventh aspect, i.e., Embodiment 7, of the present invention. In
FIG. 53
, reference numerals
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
and
14
in FIG.
5
. Denoted by
5301
is a symbolic-character segmentation point setting step for setting symbolic-character segmentation points before and behind each of prescribed symbolic characters, such as “•” and “/”. Denoted by
5302
is a symbolic character deleting step for removing the prescribed symbolic characters, such as “•” and “/”, from extracted keywords.
FIG. 54
is a flowchart showing the operation of the embodiment of the seventh aspect, i.e., Embodiment 7, of the present invention. The following description will be made on processing of, for example, a Japanese sentence “ (yuhzah intafeisu no settei wo okonau=Setting of a user interface is made)”. First, in step
5401
, the Japanese sentence is input through a keyboard or file. Then, in step
5402
, technical-term segmentation points are set in the input sentence.
The technical-term segmentation points are set in accordance with the flowchart of FIG.
7
. It is assumed here that “” and “” are technical terms, “” is a proper expression for “”, and “” is a proper expression for “”. On this assumption, in the input character string, “” is replaced by “”, “” is replaced by “”, and the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind each of “” and “”.
Next, in step
5403
, effective character strings are taken out one by one from the head of the input sentence. In accordance with the flowchart of
FIG. 12
, “” is taken out as the first effective character string.
Subsequently, a character-type segmentation point is set in step
5404
. This processing is executed in accordance with the flowchart of FIG.
13
. In this case, since there is no difference in character type in the character string “”, the processing goes to step
5405
without setting the character-type segmentation point. Note that symbolic characters such as “•” are assumed not to be regarded as a different character type in the step of setting the character-type segmentation points.
Basic-word segmentation points are set in step
5405
. Assuming that any partial character string of “” is not registered in the basic word storage means
2
, the processing goes to next step
5406
without setting the basic-word segmentation points in accordance with the flowchart of FIG.
15
.
Symbolic-character segmentation points are set in the step
5406
.
FIG. 55
shows the flow of processing for setting the symbolic-character segmentation points. First, in step
5501
, a segment range containing no technical terms is taken out from the effective character string. In accordance with the flowchart of
FIG. 16
, “•” is taken out as a segment of the effective character string containing no technical terms.
Since there is such a segment to be processed, the determination in step
5502
is responded by YES. Then, in step
3503
, a pointer ph is assigned to the head character “•” of the segment of effective character string which contains no technical terms.
Subsequently, it is checked in step
5504
whether or not ph is pointing to the prescribed symbolic character. It is assumed that “•” is the prescribed symbolic character in this embodiment. The determination in step
5504
is therefore responded by YES, followed by going to step
5505
.
In step
5505
, a start-of-symbolic-character segmentation point and an end-of-symbolic-character segmentation point are set respectively before and behind “•” in the character string to be processed.
Then, in step
5506
, ph is shifted one character toward the tail of the segment. The range of effective character string containing no technical terms is thereby exceeded; hence the determination in step
5507
is responded by NO, returning to step
5501
. Since there is no other segment of effective character string containing no technical terms, the processing follows the path indicated by N from step
5502
, thereby going out of the routine of FIG.
55
.
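The scan performed by the pointer ph over a segment containing no technical terms can be sketched as follows. The function name `set_symbol_points`, the offset convention, and the sample segment are assumptions made for illustration; only the symbols “•” and “/” come from the description above.

```python
PRESCRIBED_SYMBOLS = {"•", "/"}


def set_symbol_points(segment, offset=0, symbols=PRESCRIBED_SYMBOLS):
    """Return (start points, end points) around each prescribed symbol.

    offset is the position of the segment inside the effective character
    string, so that the returned points refer to the whole string.
    """
    starts, ends = [], []
    for ph, ch in enumerate(segment):      # ph moves toward the tail of the segment
        if ch in symbols:
            starts.append(offset + ph)     # start-of-symbolic-character point, before the symbol
            ends.append(offset + ph + 1)   # end-of-symbolic-character point, behind the symbol
    return starts, ends


# Hypothetical segment with one prescribed symbol in the middle.
print(set_symbol_points("ab•cd"))
```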
The character string succeeding to a keyword candidate is then checked in step
5407
of
FIG. 54
to determine whether the keyword candidate is an effective part-of-speech. On condition that “” is registered in the effective part-of-speech succeeding hiragana-character-string storage means
3
as shown in
FIG. 4
, since the character succeeding to “” is “”, the processing goes out of the part-of-speech determining routine without setting the keyword end-disable point in accordance with the flowchart of FIG.
17
.
As a result of the above processing, the segmentation points are set in the first effective character string, as shown in FIG.
56
.
Thereafter, in step
5408
, keyword candidates are taken out based on the segmentation points and the effective part-of-speech. In this embodiment, a keyword start-enable point is assumed to be given by any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, the character-type segmentation point, and the end-of-symbolic-character segmentation point. Also, a keyword end-enable point is assumed to be given by any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, the character-type segmentation point, and the start-of-symbolic-character segmentation point. It is further assumed that the position where a keyword end-disable point is set in the effective part-of-speech determining process cannot serve as the keyword end-enable point.
In accordance with the flowchart of
FIG. 19
, “”, “” and “” are extracted as keyword candidates from “”. These keyword candidates are assumed to be stored in a buffer.
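As a rough sketch of how candidates follow from the start-enable and end-enable points, the snippet below pairs every start-enable offset with every later end-enable offset that is not an end-disable point. The function name `cut_candidates`, the offset representation, and the English stand-in string are assumptions; the flowchart of FIG. 19 may apply further conditions that are not modeled here.

```python
def cut_candidates(text, start_enable, end_enable, end_disable=frozenset()):
    """Cut out every substring that starts at a start-enable point and
    ends at an end-enable point that is not also an end-disable point."""
    ends = sorted(set(end_enable) - set(end_disable))
    candidates = []
    for s in sorted(set(start_enable)):
        for e in ends:
            if e > s:
                candidates.append(text[s:e])
    return candidates


# Hypothetical stand-in: technical terms "user" [0, 4) and "interface" [5, 14),
# with the symbolic character at offset 4.
text = "user•interface"
start_enable = {0, 5}    # term starts, string start, end-of-symbol point
end_enable = {4, 14}     # term ends, string end, start-of-symbol point
print(cut_candidates(text, start_enable, end_enable))
```

With these assumed points the sketch produces three candidates: the first term, the whole string including the symbol, and the second term, which mirrors the three candidates of the walk-through.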
Next, in step
5409
, a symbolic character appearing in the keyword candidate is deleted. This processing is executed in accordance with a flowchart shown in FIG.
57
. First, one of the keyword candidates is taken out from the buffer in step
5701
. It is then checked in step
5703
whether or not “•” exists in the character string of the taken-out keyword. If so, then “•” is deleted in step
5704
. This processing is repeated for all the keyword candidates stored in the buffer, and is completed upon the determination in step
5702
being responded by NO.
In this embodiment, since “•” exists in the character string “”, this symbolic character “•” is deleted to obtain “” as a keyword candidate. As a result, “”, “” and “” are extracted as keyword candidates.
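The deletion loop of FIG. 57 amounts to removing the prescribed symbols from each buffered candidate. A minimal sketch, assuming the buffer is a Python list and using hypothetical English stand-ins for the candidates:

```python
def delete_symbols(candidates, symbols=("•", "/")):
    """Remove every prescribed symbolic character from each candidate."""
    cleaned = []
    for keyword in candidates:          # take out one candidate at a time
        for sym in symbols:
            keyword = keyword.replace(sym, "")
        cleaned.append(keyword)
    return cleaned


buffer = ["user", "user•interface", "interface"]
print(delete_symbols(buffer))
```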
Next, it is checked in step
5410
whether or not any effective character string still remains in the input sentence. The segment taken out next is “”. Assuming that there is no difference in character type in the character string “” and no partial character string of “” is stored in the basic word storage means, “” is extracted as a keyword candidate as it is. Further, “” is the effective character string taken out next, but no keywords are extracted from “” as with Embodiment 1.
As a result, “”, “”, “” and “” are finally extracted as keywords.
FIG. 58
is a block diagram showing an example of data flow in the present invention in relation to the steps according to the seventh aspect of the present invention.
Referring to
FIG. 58
, a Japanese input sentence “ (yuhzah intafeisu no settei wo okonau=Setting of a user interface is made)”
5805
is entered in the input step
4
. Assuming that “” and “” are registered in the technical term storage means
1
, the start-of-technical-term segmentation point and the end-of-technical-term segmentation point are set respectively before and behind each of “” and “”, as shown in block
5806
.
Then, a different expression of the technical term is replaced by a corresponding proper expression in the proper-expression replacing step
7
. Assuming that the proper expression of “” is “” and the proper expression of “” is “”, different expressions “” and “” are replaced respectively by the proper expressions “” and “”, as shown in block
5807
.
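The proper-expression replacement can be pictured as a table lookup over the technical terms found in the sentence. The sketch below is an assumption-laden illustration: the table maps each headword to its proper expression, matching is longest-first (a choice made here, not stated in the text), and the romanized variants are invented stand-ins for the original Japanese strings.

```python
def replace_with_proper(sentence, term_table):
    """Replace each registered technical term by its proper expression.

    `term_table` maps every headword (proper or different expression)
    to the proper expression. Longest headwords are tried first.
    """
    heads = sorted(term_table, key=len, reverse=True)
    out, i = [], 0
    while i < len(sentence):
        for head in heads:
            if sentence.startswith(head, i):
                out.append(term_table[head])
                i += len(head)
                break
        else:
            out.append(sentence[i])   # not part of a technical term
            i += 1
    return "".join(out)


# Hypothetical variants; the real entries of FIG. 2 are not reproduced here.
table = {
    "yuuza": "yuhzah", "yuhzah": "yuhzah",
    "intaafeesu": "intafeisu", "intafeisu": "intafeisu",
}
print(replace_with_proper("yuuza•intaafeesu", table))
```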
Next, in the effective character-string cutting step
8
, a range of character string consisting of the effective character types or a technical term is taken out. As a result, “”, “” and “” are taken out as effective character strings, as shown in block
5808
.
Subsequently, in the character-type segmentation point setting step
9
, the position where the character type changes from one to another is set as a character-type segmentation point for the character string range of the effective character string which is not itself a technical term. In this case, since there is no difference in character type in the range of effective character string, no character-type segmentation point is set, as shown in block
5809
.
The basic-word segmentation points are set in the next basic-word segmentation point setting step
11
. In this case, the basic-word segmentation point is not set as shown in block
5810
.
After that, in the symbolic-character segmentation point setting step
5301
, the start-of-symbolic-character segmentation point and the end-of-symbolic-character segmentation point are set respectively before and behind “•” in the character string under processing.
Then, the effective part-of-speech succeeding hiragana-character-string storage means
3
is searched in the effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step
12
, and the character string succeeding to each effective character string is checked in the effective part-of-speech determining step
13
. Assuming that “” and “” are found, but “” is not found as indicated at
5803
, the keyword end-disable point is set behind “” as shown in block
5812
.
Next, the partial character string cutting step
14
cuts out, from the effective character string, the range of character string which starts from any of the start-of-technical-term segmentation point, the start point of the effective character string, the start-of-basic-word segmentation point, the character-type segmentation point, and the end-of-symbolic-character segmentation point, which terminates at any of the end-of-technical-term segmentation point, the end point of the effective character string, the end-of-basic-word segmentation point, the character-type segmentation point, and the start-of-symbolic-character segmentation point, and which does not terminate at the keyword end-disable point. As a result of the above processing, “”, “”, “” and “” are extracted as keyword candidates, as shown in block
5813
.
Thereafter, in the symbolic character deleting step
5302
, “•” contained in the character strings of the keyword candidates is deleted. As a result, “” turns to “”; hence “”, “”, “” and “” are finally extracted as keywords.
It is to be noted that while the segmentation points are set in the order of the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step and the symbolic-character segmentation point setting step in Embodiment 7, the order of those processing steps may be optionally selected.
With Embodiment 7, as described above, in a process of dealing with different expressions of a compound word, “•” and “/” appearing between words composing the compound word are deleted, and a word resulting from replacing a different expression of each of the technical terms, which are registered in the technical term storage means, by a corresponding proper expression is assigned as a keyword to a document. By executing similar processing for an input word at the time of retrieval, different expressions in the form of a compound word and different expressions for each of the words composing the compound word can be dealt with in a unified manner. Also, a keyword extraction method adapted for high-speed document retrieval can be realized without inviting an increase in the number of retrieval keys due to combinations of words composing the compound word.
Embodiment 8.
FIG. 59 is an overall block diagram of a keyword extraction method according to one embodiment of an eighth aspect, i.e., Embodiment 8, of the present invention. In FIG. 59, reference numerals 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 denote respectively technical term storage means, basic word storage means, effective part-of-speech succeeding hiragana-character-string storage means, an input step, a technical-term-storage-means managing step, a technical-term segmentation point setting step, a proper-expression replacing step, an effective character-string cutting step, a character-type segmentation point setting step, a basic-word-storage-means managing step, a basic-word segmentation point setting step, an effective-part-of-speech-succeeding-hiragana-character-string-storage-means managing step, an effective part-of-speech determining step, and a partial character string cutting step which are similar to those denoted by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 in FIG. 5. Denoted by 5901 is a non-technical-term different expression storage means for storing proper expressions of general words of high frequency and different expressions thereof in corresponding relation. The non-technical-term different expression storage means 5901 is made up of each proper expression and a set of one or more different expressions corresponding to the proper expression as shown in FIG. 60, for example. Denoted by 5902 is a different expression adding step for, when a technical term is a compound word, searching the technical term storage means 1 and the non-technical-term different expression storage means 5901, and combining different expressions of words composing the compound word with each other to create different expressions of the compound word.
FIG. 61
is a block diagram showing sub-steps of the different expression adding step
5902
. Denoted by
6101
is a non-technical-term-different-expression-storage-means managing step for searching the non-technical-term different expression storage means
5901
to take out different expression information. Denoted by
6102
is a technical-term different expression managing step for searching the technical term storage means to take out different expression information. Denoted by
6103
is a word dividing step for, when a word to be processed is a compound word consisting of individual words which are searched in the non-technical-term-different-expression-storage-means managing step
6101
and the technical-term different expression managing step
6102
, dividing the compound word into the individual words. Denoted by
6104
is a different expression developing step for creating different expressions of the compound word based on combinations of different expressions for each of the individual words divided in the word dividing step
6103
. Denoted by
6105
is a registering step for determining one in a set of the different expressions created in the different expression developing step
6104
to be a proper expression, creating pairs of each headword and the proper expression, and registering those pairs in the technical term storage means.
FIG. 62
is a flowchart showing the operation of one embodiment of the eighth aspect, i.e., Embodiment 8, of the present invention. The following description will be made on processing of, for example, a Japanese word “ (kirikae botan=switching button)”. First, in step
6201
, the word “” is taken out. Then, in step
6202
, a pointer ph is assigned to the head character “” of the word and a pointer pt is assigned to the character “” one before the tail of the word.
Subsequently, in step
6203
, the technical term storage means
1
and the non-technical-term different expression storage means
5901
are searched by using the character string “” from ph to pt as a retrieval key. If “” is not found in the technical term storage means
1
and the non-technical-term different expression storage means
5901
, then pt is shifted one character toward the head of the word in step
6205
. At this time, since ph is still positioned nearer to the head than pt, the determination in step
6206
is responded by YES and the processing returns to step
6203
for searching the technical term storage means
1
and the non-technical-term different expression storage means
5901
again with “” now used as a retrieval key.
Assuming that “” is registered as one headword in the technical term storage means
1
, upon the character string from ph to pt being given by “”, the determination in step
6204
is responded by YES and the processing goes to step
6208
. In step
6208
, “” of “” is replaced by all different expressions of “” which are registered in the technical term storage means
1
. Assuming now that “” and “” are registered as different expressions of “”, character strings created in step
6208
are “”, “” and “”.
Then, in step
6209
, ph is assigned to “” and pt is assigned to “”. Since ph is still in the word range, the processing follows the path indicated by Y from step
6210
and returns to step
6203
to search for “” in the dictionary. Assuming that “” is found in the non-technical-term different expression storage means
5901
, the determination in step
6204
is responded by YES and the processing goes to step
6208
. In step
6208
, “” in each of “”, “” and “” is replaced by all different expressions of “” which are registered in the non-technical-term different expression storage means
5901
. Assuming now that “” (kanji of “”) is registered as a different expression of “”, character strings now created in step
6208
are “”, “”, “”, “”, “” and “”.
Next, pt is set to the character succeeding to ph in step
6209
. However, since pt now points to a position outside the word range, the determination in step
6210
is responded by NO and the processing goes to step
6211
. In step
6211
, one of the created character strings “”, “”, “”, “”, “” and “” is determined as a proper expression to create a pair of a headword and the proper expression. Assuming that the proper expression for a group of “”, “” and “” is “” and the proper expression for a group of “” and “” is “”, “” which is a combination of both the proper expressions is determined as a proper expression for the group of those compound words.
To make a match with the format used in the technical term storage means
1
shown in
FIG. 2
, the proper expression “” is registered in the technical term storage means as it is, whereas the other different expressions “”, “”, “”, “” and “” are registered in the technical term storage means in pair with the proper expression “”. The processing routine of
FIG. 62
is thus completed.
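The routine of FIG. 62 combines a head-anchored dictionary search with a combination of the variants found for each part. The following is a simplified sketch under several assumptions: both storage means are merged into one list of variant groups whose first entry is taken as the proper expression, the search starts from the longest remaining substring (the text starts pt one character before the tail), and the romanized variants are invented stand-ins.

```python
from itertools import product


def expand_compound(word, groups):
    """Divide `word` into registered parts and combine their variants.

    `groups` is a list of variant groups; by assumption the first item of
    each group is its proper expression. Returns (proper, pairs) where
    `pairs` holds (different expression, proper expression) tuples, or
    None when some part of the word is not registered.
    """
    lookup = {variant: group for group in groups for variant in group}
    parts, ph = [], 0
    while ph < len(word):
        # shrink pt toward the head until word[ph:pt] is found
        for pt in range(len(word), ph, -1):
            if word[ph:pt] in lookup:
                parts.append(lookup[word[ph:pt]])
                ph = pt
                break
        else:
            return None
    variants = ["".join(combo) for combo in product(*parts)]
    proper = "".join(group[0] for group in parts)
    pairs = [(v, proper) for v in variants if v != proper]
    return proper, pairs


groups = [["kirikae", "kirikahe", "kiri-kae"], ["botan", "button"]]
proper, pairs = expand_compound("kirikaebotan", groups)
print(proper, len(pairs))
```

With three variants for the first part and two for the second, six strings are created; one is kept as the proper expression and the remaining five are paired with it, which matches the counts in the walk-through.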
FIG. 63
is a block diagram showing an example of data flow in the different expression adding step
5902
according to the eighth aspect of the present invention in relation to the sub-steps constituting the different expression adding step
5902
.
Referring to
FIG. 63
, a Japanese word “ (kirikae botan=switching button)”
6301
to be processed is passed to the word dividing step
6103
. Assuming that a word “”
6303
and a word “”
6304
are found respectively in the technical-term different expression managing step
6102
and the non-technical-term-different-expression-storage-means managing step
6101
, “” is divided into “” and “” as indicated at
6305
.
Next, assuming that “”, “” and “” are found as a group of different expressions for “” as indicated at
6306
and “” and “” are found as a group of different expressions for “”, those different expressions are combined with each other in the different expression developing step
6104
to create a set
6308
of combinations of the different expressions. Note that the underline in block
6308
represents the proper expression of each of individual words composing the compound word.
Subsequently, in the registering step
6105
, “”, which is a combination of the proper expressions of both the individual words, is determined as a proper expression for the group of the related compound words. Also, to make a match with the format used in the technical term storage means
1
shown in
FIG. 2
, the created compound words are each paired with the proper expression. At this time, since “” is the proper expression, it is left alone. As a result, “” and the created pairs are registered in the technical term storage means
1
in the format shown in block
6309
.
Incidentally, there may be added a step of prompting an operator to determine, before registering the words created in the registering step
6105
in the technical term storage means, whether or not those words are to be registered.
With Embodiment 8, as described above, a set of words is created by combining different expressions of each of the individual words composing a compound word, one in the created set of words having different expressions is determined to be a proper expression, and pairs of each headword and the proper expression are registered in the technical term storage means. Accordingly, it is possible to assist the operation of registering the words, which are necessary as technical terms, in the technical term storage means and to realize a keyword extraction method capable of achieving high-speed retrieval without generating a large number of retrieval keys.
It is to be noted that, in the first to ninth aspects of the present invention, words having the similar meaning and pronunciation but different expressions may be synonyms, i.e., words having the similar meaning but different pronunciations and expressions.
As fully described above, according to the first aspect of the present invention, there is provided a keyword extraction apparatus comprising technical term storage means for storing technical terms with proper expressions and different expressions thereof; basic word storage means for storing general basic words of high frequency; input means through which a sentence is input; technical-term segmentation point setting means for, when any of the technical terms stored in the technical term storage means exists in the sentence input through the input means, cutting out a range of that technical term from the input sentence; proper-expression replacing means for, when the technical term cut out by the technical-term segmentation point setting means is written in a different expression, replacing the different expression by a corresponding proper expression; character-type segmentation point setting means for detecting a difference in character type in the input sentence; basic-word segmentation point setting means for cutting out, from the input sentence, a range of any of the basic words stored in the basic word storage means; partial character string cutting means for cutting out partial character strings based on segmentation points set by the technical-term segmentation point setting means, the character-type segmentation point setting means and the basic-word segmentation point setting means; and output means for outputting, as keywords, the partial character strings cut out by the partial character string cutting means.
With the above feature, in the keyword extraction process for assigning an index to a document, a keyword of a technical term appearing in the document is assigned to the document after a different expression of the technical term is replaced by a proper expression thereof by referring to the technical term storage means in which technical terms are stored along with their different expressions. At this time, when the technical term having the replaced proper expression is in continuity with a character string cut out from the input sentence because of a difference in character type and the presence of a basic word, a keyword in the form of a compound word is also extracted so that the keyword extraction can be performed comprehensively. By converting a different expression of the technical term into a corresponding proper expression with the same technical term storage means before starting retrieval, a keyword extraction apparatus adaptable for high-speed document retrieval can be achieved, because the number of different expressions of words serving as retrieval keys is kept from growing combinatorially, unlike the conventional document retrieval intended to cope with the problem caused by words which have the similar meaning and pronunciation but different expressions.
According to the second aspect of the present invention, there is provided a keyword extraction method comprising an input step for inputting a sentence; a technical-term segmentation point setting step for, when any of technical terms in technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in the input step, cutting out a range of that technical term from the input sentence; a proper-expression replacing step for, when the technical term cut out in the technical-term segmentation point setting step is written in a different expression, replacing a range of the technical term in the input sentence by a corresponding proper expression; a character-type segmentation point setting step for detecting a difference in character type in the input sentence; a basic-word segmentation point setting step for, when any of basic words in basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence; and a partial character string cutting step for cutting out, as keywords, partial character strings based on segmentation points set in the technical-term segmentation point setting step, the character-type segmentation point setting step and the basic-word segmentation point setting step.
With the above feature, it is possible to achieve a high-speed keyword extraction method which can realize the operation of the keyword extraction apparatus according to the first aspect of the present invention.
Moreover, if the basic word deleting step is additionally provided, the words which are not necessary as keywords used to identify a document can be deleted and a highly accurate keyword extraction can be realized with less retrieval waste.
According to the third aspect of the present invention there is provided a keyword extraction method further comprising, in addition to the steps of the keyword extraction method according to the second aspect, when the sentence input in the input step is written in Japanese, a prefix segmentation point setting step for cutting out a range of any of prefixes in the Japanese input sentence by referring to prefix storage means for storing the prefixes, wherein the partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step, and the prefix segmentation point setting step.
With the above feature, a keyword extraction method for high-speed document retrieval can be achieved without increasing the number of combinations of different expressions of words serving as retrieval keys regardless of the presence/absence of a prefix and different expressions of a technical term succeeding to the prefix.
According to the fourth aspect of the present invention, there is provided a keyword extraction method further comprising, in addition to the steps of the keyword extraction method according to the third aspect, when the sentence input in the input step is written in Japanese, a suffix segmentation point setting step for cutting out a range of any of suffixes in the Japanese input sentence by referring to suffix storage means for storing the suffixes, wherein the partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in the technical-term segmentation point setting step, the character-type segmentation point setting step, the basic-word segmentation point setting step, the prefix segmentation point setting step, and the suffix segmentation point setting step.
With the above feature, a keyword extraction method for high-speed document retrieval can be achieved without increasing the number of combinations of different expressions of words serving as retrieval keys regardless of the presence/absence of a suffix and different expressions of a technical term preceding the suffix.
According to the fifth aspect of the present invention, there is provided a keyword extraction method further comprising, in addition to the steps of the keyword extraction method according to the second aspect, a number-of-characters limiting step for deleting those ones of the keywords extracted in the partial character string cutting step which have a character string length outside a predetermined range, thereby providing redetermined keywords.
With the above feature, the number of characters of each of the extracted keywords can be limited within a certain range. Further, since the number of characters is counted based on the word after converting its different expression into a corresponding proper expression, it is possible to achieve a keyword extraction capable of avoiding such an uneven extraction of keywords that some words are registered, but other words are deleted depending on difference in number of characters between different expressions of even those words which have the similar meaning.
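The number-of-characters limiting step can be illustrated as a simple length filter applied after proper-expression replacement. The bounds `min_len` and `max_len` and the sample keywords are hypothetical; the text only states that the range is predetermined.

```python
def limit_length(keywords, min_len=2, max_len=10):
    """Keep only keywords whose length lies within the given range.

    Lengths are counted on the proper expressions, so variants of the
    same word are kept or dropped together.
    """
    return [k for k in keywords if min_len <= len(k) <= max_len]


print(limit_length(["a", "color", "x" * 11]))
```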
According to the sixth aspect of the present invention, there is provided a keyword extraction method further comprising, in addition to the steps of the keyword extraction method according to the fifth aspect, a frequency totalizing step for counting appearance frequency of each of the keywords or the redetermined keywords extracted in the partial character string cutting step or the number-of-characters limiting step.
With the above feature, since keywords are extracted after replacing their different expressions by corresponding proper expressions, words having the similar meaning but different expressions are not counted as separate words, and each keyword can be given a precise value of appearance frequency.
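The effect on frequency counting can be shown with a small sketch. Here the mapping `proper_of` and the sample words are assumptions, and normalization is applied at counting time for brevity, whereas the method described above replaces different expressions before the keywords are cut out.

```python
from collections import Counter


def keyword_frequencies(keywords, proper_of):
    """Count keywords after mapping each one to its proper expression."""
    return Counter(proper_of.get(k, k) for k in keywords)


# "colour" and "color" are hypothetical variants of one word.
freq = keyword_frequencies(["colour", "color", "index"], {"colour": "color"})
print(freq["color"], freq["index"])
```

Without the mapping the two spellings would each be counted once; with it they are merged into a single entry with a count of two.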
According to the seventh aspect of the present invention, there is provided a keyword extraction method further comprising, in addition to the steps of the keyword extraction method according to the fifth aspect, a symbolic-character segmentation point setting step for, when any of prescribed symbolic characters appears in the input sentence, cutting out that symbolic character, and a symbolic character deleting step for deleting the symbolic character cut out in the symbolic-character segmentation point setting step when the symbolic character is contained as one character in any of the keywords or the redetermined keywords extracted in the partial character string cutting step or the number-of-characters limiting step.
With the above feature, in a process of dealing with different expressions of a compound word, “•” and “/” appearing between words composing the compound word are deleted, and a word resulting from replacing a different expression of each of the words composing the compound word by a corresponding proper expression can be assigned as a keyword to a document. By executing similar processing for an input compound word at the time of retrieval, different expressions in the form of a compound word and different expressions for each of the words composing the compound word can be dealt with in a unified manner. Also, it is possible to achieve a keyword extraction method for high-speed document retrieval without inviting an increase in the number of retrieval keys due to combinations of words composing the compound word.
According to the eighth aspect of the present invention, there is provided a keyword extraction method wherein, in addition to the steps of the keyword extraction method according to the second aspect, the technical term storage means stores technical terms which are created in a different expression adding step with the aid of different expressions registered in non-technical-term different expression storage means for storing different expressions of general words of high frequency and different expressions of the technical terms registered in the technical term storage means, the different expression adding step comprising a word dividing step for, when a technical term in the input sentence is a compound word, dividing the compound word into partial character strings composing the compound word, a different expression developing step for combining different expressions of the partial character strings with each other to create different expressions of the compound word, and a registering step for creating pairs of each of the created different expressions and a proper expression of the compound word, and registering the pairs in the technical term storage means.
With the above feature, a set of words is created by combining different expressions of each of the individual words composing a compound word, one in the created set of words having different expressions is determined to be a proper expression, and pairs of each headword and the proper expression are registered in the technical term storage means. As a result, it is possible to achieve a keyword extraction method adaptable for high-speed document retrieval without generating a large number of retrieval keys while assisting the operation of additionally registering words, which are necessary as technical terms, in the technical term storage means.
According to the ninth aspect of the present invention, there is provided a computer readable recording medium storing a keyword extraction program which comprises an input sequence for inputting a sentence; a technical-term segmentation point setting sequence for, when any of technical terms in technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in the input sequence, cutting out a range of that technical term from the input sentence; a proper-expression replacing sequence for, when the technical term cut out in the technical-term segmentation point setting sequence is written in a different expression, replacing a range of the technical term in the input sentence by a corresponding proper expression; a character-type segmentation point setting sequence for detecting a difference in character type in the input sentence; a basic-word segmentation point setting sequence for, when any of basic words in basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence; and a partial character string cutting sequence for cutting out, as keywords, all relevant partial character strings based on segmentation points set in the technical-term segmentation point setting sequence, the character-type segmentation point setting sequence and the basic-word segmentation point setting sequence.
With the above feature, it is possible to achieve a computer readable recording medium storing a program which represents the keyword extraction method according to the second aspect, and which enables a computer to execute a keyword extraction process adaptable for high-speed document retrieval.
Claims
- 1. A keyword extraction apparatus comprising:a technical term storage means for storing technical terms with proper expressions and different expressions thereof, a basic word storage means for storing general basic words of high frequency, an input means through which a sentence is input, a technical-term segmentation point setting means for, when any of the technical terms stored in said technical term storage means exists in the sentence input through said input means, cutting out a range of that technical term from the input sentence, a proper-expression replacing means for, when the technical term cut out by said technical-term segmentation point setting means is written in a different expression, replacing the different expression by a corresponding proper expression, a character-type segmentation point setting means for detecting a difference in character type in the input sentence, a basic-word segmentation point setting means for cutting out, from the input sentence, a range of any of the basic words stored in said basic word storage means, a partial character string cutting means for cutting out partial character strings based on segmentation points set by said technical-term segmentation point setting means, said character-type segmentation point setting means and said basic-word segmentation point setting means, and an output means for outputting, as keywords, the partial character strings cut out by said partial character string cutting means.
- 2. A keyword extraction method comprising: an input step for inputting a sentence, a technical-term segmentation point setting step for, when any of technical terms in a technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in said input step, cutting out a range of that technical term from the input sentence, a proper-expression replacing step for, when the technical term cut out in said technical-term segmentation point setting step is written in a different expression, replacing a range of said technical term in the input sentence with a corresponding proper expression, a character-type segmentation point setting step for detecting a difference in character type in the input sentence, a basic-word segmentation point setting step for, when any of basic words in a basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence, and a partial character string cutting step for cutting out, as keywords, partial character strings based on segmentation points set in said technical-term segmentation point setting step, said character-type segmentation point setting step and said basic-word segmentation point setting step.
- 3. A keyword extraction method according to claim 2, further comprising, when the sentence input in said input step is written in Japanese: a prefix segmentation point setting step for cutting out a range of any of prefixes in the Japanese input sentence by referring to a prefix storage means for storing the prefixes, wherein said partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in said technical-term segmentation point setting step, said character-type segmentation point setting step, said basic-word segmentation point setting step, and said prefix segmentation point setting step.
- 4. A keyword extraction method according to claim 3, further comprising, when the sentence input in said input step is written in Japanese: a suffix segmentation point setting step for cutting out a range of any of suffixes in the Japanese input sentence by referring to a suffix storage means for storing the suffixes, wherein said partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in said technical-term segmentation point setting step, said character-type segmentation point setting step, said basic-word segmentation point setting step, said prefix segmentation point setting step, and said suffix segmentation point setting step.
- 5. A keyword extraction method according to claim 2, further comprising a number-of-characters limiting step for deleting the keywords extracted in said partial character string cutting step which have a character string length outside a predetermined range, thereby providing redetermined keywords.
- 6. A keyword extraction method according to claim 5, further comprising a frequency totalizing step for counting an appearance frequency of each of the keywords or the redetermined keywords extracted in said partial character string cutting step or said number-of-characters limiting step.
- 7. A keyword extraction method according to claim 5, further comprising a symbolic-character segmentation point setting step for, when any of prescribed symbolic characters appears in the input sentence, cutting out the symbolic character, and a symbolic character deleting step for deleting the symbolic character cut out in said symbolic-character segmentation point setting step when said symbolic character is contained as one character in any of the keywords or the redetermined keywords extracted in said partial character string cutting step or said number-of-characters limiting step.
- 8. A keyword extraction method according to claim 2, wherein said technical term storage means stores technical terms which are created in a different expression adding step with the aid of different expressions registered in non-technical-term different expression storage means for storing different expressions of general words of high frequency and different expressions of the technical terms registered in said technical term storage means, said different expression adding step comprising: a word dividing step for, when a technical term in the input sentence is a compound word, dividing the compound word into partial character strings composing said compound word, a different expression developing step for combining different expressions of said partial character strings with each other to create different expressions of said compound word, and a registering step for creating pairs of each of said created different expressions and a proper expression of said compound word, and registering the pairs in said technical term storage means.
- 9. A computer readable recording medium storing a program which enables a keyword extraction process to be executed in a computer, said keyword extraction process comprising: an input sequence for inputting a sentence, a technical-term segmentation point setting sequence for, when any of technical terms in technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in said input step, cutting out a range of that technical term from the input sentence, a proper-expression replacing sequence for, when the technical term cut out in said technical-term segmentation point setting step is written in a different expression, replacing a range of said technical term in the input sentence by a corresponding proper expression, a character-type segmentation point setting sequence for detecting a difference in character type in the input sentence, a basic-word segmentation point setting sequence for, when any of basic words in basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence, and a partial character string cutting sequence for cutting out, as keywords, all relevant partial character strings based on segmentation points set in said technical-term segmentation point setting sequence, said character-type segmentation point setting sequence and said basic-word segmentation point setting sequence.
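The different expression adding step recited in claim 8 (word dividing, different expression developing, registering) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the variant table `PART_VARIANTS` and the functions `develop_variants` and `register` are hypothetical, the compound word is assumed to be whitespace-delimited (Japanese text would need dictionary-based division), and a single table stands in for both the non-technical-term different expression storage means and the technical term storage means.

```python
from itertools import product

# Hypothetical table of different expressions for component words.
PART_VARIANTS = {"fiber": ["fibre"], "analyzer": ["analyser"]}


def develop_variants(compound):
    """Create different expressions of a compound word from its parts."""
    # Word dividing step (assumption: parts are separated by whitespace).
    parts = compound.split()
    # Each part may appear as itself or as any of its different expressions.
    options = [[part] + PART_VARIANTS.get(part, []) for part in parts]
    # Different expression developing step: combine the options of all parts.
    combos = {" ".join(combo) for combo in product(*options)}
    combos.discard(compound)  # the proper expression is not a variant of itself
    return combos


def register(compound, term_store):
    """Registering step: pair each created variant with the proper expression."""
    for variant in develop_variants(compound):
        term_store[variant] = compound
    return term_store
```

Calling `register("fiber analyzer", {})` yields three pairs, mapping "fibre analyzer", "fiber analyser" and "fibre analyser" to the proper expression "fiber analyzer". The number of created variants grows as the product of the option counts of the parts, which is why the variants are generated once at registration time in this sketch rather than at retrieval time.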
Priority Claims (1)
Number | Date | Country | Kind
9-210252 | Aug 1997 | JP |
US Referenced Citations (4)