This invention relates to a method and system of knowledge extraction, particularly to a method and system of knowledge extraction based on sentence groups, which involves the field of digital data processing technology.
Knowledge extraction is a research focus shared by many fields, such as natural language processing, the semantic Web, machine learning, knowledge engineering, knowledge discovery, knowledge management and text mining. Knowledge extraction means extracting knowledge from text information, i.e., parsing and processing the content of documents to extract, item by item, the knowledge they contain. It is one kind of knowledge acquisition and a refinement and deepening of information extraction. Currently, plenty of knowledge resources are available in the form of digital publication resources; however, knowledge resources in the form of sentence groups are scarce. Sentence groups are speech communication units formed by consecutive sentences that are closely associated in sense or structure, and are considered an effective representation form of knowledge. Sentence groups are extracted from articles in books (articles being a traditional knowledge organization form). Through knowledge extraction based on sentence groups, the granularity of document processing may be decreased to the level of sentence groups, so that the traditional manner of knowledge organization and management may be changed completely.
In the process of knowledge extraction, the following method is commonly adopted in the prior art: performing knowledge extraction on the basis of individual sentences and then combining the individual sentences obtained through extraction for output. This method ignores the coherence of consecutive sentences, so the extracted knowledge information lacks logical coherence and is difficult to understand.
In order to solve the problem in the prior art that extracted knowledge information lacks logical coherence and is difficult to understand, the present invention provides a knowledge extraction method and system capable of guaranteeing logical coherence in extracted knowledge information.
In order to solve the above problem, the following technical solutions are provided in this invention.
According to an aspect of this invention, a knowledge extraction method is provided, comprising the following steps: acquiring an initial sentence group, the sentence group including one or more sentences; expanding the initial sentence group in which the length of the initial sentence group is compared with an expected length to determine the initial sentence group to be expanded according to the comparison result; extracting knowledge in which the sentence group that is finally obtained after expansion is outputted to realize knowledge extraction.
Optionally, the step of expanding the initial sentence group comprises: setting a weight threshold in which a weight threshold is set for the initial sentence group according to the result of comparing the length of the initial sentence group with the expected length; expanding the sentence group in which weights of sentences to be expanded are compared with the weight threshold, and expanding the initial sentence groups according to the comparison result.
Optionally, the step of acquiring an initial sentence group comprises: dividing text into sentences; forming an initial sentence group by I consecutive sentences, wherein I is an integer greater than or equal to 1. Optionally, I=3.
According to another aspect of this invention, a knowledge extraction system is further provided comprising: an initial sentence group acquisition module for acquiring an initial sentence group, the initial sentence group including one or more sentences; initial sentence group expansion module for comparing the length of the initial sentence group with an expected length to determine an initial sentence group to be expanded according to the comparison result; a knowledge extraction module for outputting sentence groups that are finally obtained after the expansion of the initial sentence group expansion module to realize knowledge extraction.
Optionally, the initial sentence group expansion module comprises: a weight threshold setting unit for setting a weight threshold for the initial sentence group according to the result of comparing the length of the initial sentence group with the expected length; a sentence group expansion unit for, in the expansion of the initial sentence group, comparing weights of sentences to be expanded with the weight threshold and expanding the initial sentence group according to the comparison result.
Optionally, the initial sentence group acquisition module comprises: a sentence dividing unit for dividing text into sentences; an extraction unit for forming an initial sentence group by I consecutive sentences, wherein I is an integer greater than or equal to 1.
Optionally, the sentence dividing unit forms the initial sentence group by 3 consecutive sentences.
According to still another aspect of this invention, there is also provided one or more computer readable medium having stored thereon computer-executable instructions that when executed by a computer perform a knowledge extraction method, the method comprising: acquiring an initial sentence group, the initial sentence group including one or more sentences; expanding the initial sentence group in which the length of the initial sentence group is compared with an expected length to determine an initial sentence group to be expanded according to the comparison result; extracting knowledge in which the sentence groups that are finally obtained after expansion are outputted to realize knowledge extraction.
With the knowledge extraction method and system in this disclosure, knowledge extraction is realized through acquiring initial sentence groups each including one or more sentences, and then comparing lengths of the initial sentence groups with an expected length to determine an initial sentence group to be expanded according to the comparison result. Since the sentence groups are formed by consecutive sentences, it may be guaranteed that the sentence groups themselves have good coherence in logic, so that the final sentence groups obtained through expanding the initial sentence groups have good coherence in logic correspondingly. Thus, this disclosure may overcome the drawback of lacking logical coherence in extracted knowledge information in the prior art.
Furthermore, according to the knowledge extraction method and system in this disclosure, the final sentence groups are obtained through left expansion and/or right expansion of the initial sentence groups. Good coherence in logic may therefore be guaranteed for the extracted sentence groups that are finally obtained, so that they do not read as abrupt. Meanwhile, through left expansion and/or right expansion of the initial sentence groups, sentences to be extracted may be prevented from being omitted, resulting in more comprehensive content in the extracted knowledge information.
For a complete understanding of this invention, a description will be given with reference to the accompanying drawings, wherein:
1 initial sentence group acquisition module, 2 initial sentence group expansion module, 3 knowledge extraction module, 4 property set module, 11 sentence dividing unit, 12 extraction unit, 21 weight threshold setting unit, 22 sentence group expansion unit, 31 final sentence group deduplicating and outputting unit, 32 final sentence group removing and outputting unit, 33 final sentence group sorting and outputting unit, 211 comparison result determination subunit, 211a redundant value setting device, 212 weight threshold determination subunit, 212a threshold adjustment factor setting device, 212b property weight density acquisition device, 212c weight threshold acquisition device, 221 initial sentence group selection subunit, 222 sentence weight acquisition subunit, 222a first weight acquisition device, 222b second weight acquisition device, 223 comparison subunit, 224 new sentence group acquisition subunit, 225 loop expansion subunit, 226 threshold setting subunit, 227a first counting subunit, 227b second counting subunit, 228a sentence group weight acquisition subunit, 228b sentence group length acquisition subunit, 228c weight density acquisition subunit
A knowledge extraction method is described in this embodiment, as shown in
S102: acquiring an initial sentence group, the initial sentence group including one or more sentences;
S104: expanding the initial sentence group in which the length of the initial sentence group is compared with an expected length to determine an initial sentence group to be expanded according to the comparison result;
S106: extracting knowledge in which the sentence group that is finally obtained after expansion is outputted to realize knowledge extraction.
In this embodiment, knowledge extraction is realized through acquiring initial sentence groups each including one or more sentences, and then comparing lengths of the initial sentence groups with an expected length to determine an initial sentence group to be expanded according to the comparison result. Since the sentence groups are formed by consecutive sentences, it may be guaranteed that the sentence groups themselves have good coherence in logic, so that the final sentence groups obtained through expanding the initial sentence groups have good coherence in logic correspondingly. Thus, this disclosure may overcome the drawback of lacking logical coherence in extracted knowledge information in the prior art.
As a preferred embodiment, in the knowledge extraction method of this embodiment, the step of acquiring an initial sentence group comprises: dividing text into sentences; forming an initial sentence group by I consecutive sentences, wherein I is an integer greater than or equal to 1. As a preferred embodiment, I=3.
In this embodiment, text is divided into sentences and initial sentence groups are formed by three consecutive sentences. A better output result is obtained in this embodiment when I=3, guaranteeing that each final sentence group extracted includes at least three sentences. Because three consecutive sentences are drawn out from the text to form each initial sentence group, the initial sentence groups themselves have good logical relationships; further, because the final sentence groups are obtained through expanding the initial sentence groups, the final sentence groups obtained through extraction have good logical relationships and do not read as abrupt.
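The acquisition step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this embodiment: the function names `split_sentences` and `initial_groups` are hypothetical, the splitter only recognizes ASCII sentence-ending punctuation followed by whitespace, and every run of I consecutive sentences is taken as one initial sentence group.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '?' or '!' when followed by whitespace (assumed rule)."""
    parts = re.split(r"(?<=[.?!])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def initial_groups(sentences: list[str], size: int = 3) -> list[tuple[int, int]]:
    """Return inclusive (start, end) index pairs for each window of `size` sentences."""
    if size < 1:
        raise ValueError("size must be an integer greater than or equal to 1")
    return [(i, i + size - 1) for i in range(len(sentences) - size + 1)]
```

Groups are returned as index pairs rather than joined strings so that a later expansion step can still address the neighbouring sentences.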
In the knowledge extraction method of this embodiment, the step of expanding the initial sentence group comprises: setting a weight threshold in which a weight threshold is set for the initial sentence group according to the result of comparing the length of the initial sentence group with the expected length; expanding the sentence group in which weights of sentences to be expanded are compared with the weight threshold, and expanding the initial sentence group according to the comparison result.
As another alternative embodiment, in the knowledge extraction method of this embodiment, the step of expanding the initial sentence group may comprise: comparing the length of the initial sentence group and an expected length; if a length of an initial sentence group does not reach the expected length, expanding the initial sentence group; if a length of an initial sentence group reaches or exceeds the expected length, terminating the expansion.
In this embodiment, no matter in which manner the initial sentence groups are expanded, the relationship between the lengths of the initial sentence groups and the expected length is considered, so that the lengths of the finally extracted sentence groups closely approach the expected length.
The expected length in this embodiment is familiar to those skilled in the art. For example, the length of the abstract of a patent description is limited to no more than 300 words. In the case of extracting relevant sentences from text to form an abstract of a patent application, the expected length is 300 words. If there is no specific requirement on the expected length, it may be selected based on practical demands.
The expected length, lengths of initial sentence groups and lengths of sentences in this embodiment and subsequent embodiments are all counted in the number of characters.
On the basis of embodiment 1, in the knowledge extraction method of this embodiment, as shown in
In this embodiment, according to the result of comparison between lengths of the initial sentence groups and the expected length, a weight threshold is set for the initial sentence groups, wherein the comparison result F=the expected length/(the length of an initial sentence group+a redundant value). The weight threshold is set as a function of the comparison result F: when F is greater than or equal to 1, the weight threshold=(K/F)/G; when F is less than 1, the weight threshold=(K/F)*G. Thus, the smaller the comparison result F is, i.e., the closer the length of the initial sentence group approaches the expected length or the further it goes beyond the expected length, the larger the weight threshold is. In other words, the weight threshold may be adjusted dynamically according to the result of the comparison between the lengths of the initial sentence groups and the expected length. Compared with the prior art, in which a fixed criterion is adopted, this embodiment provides a criterion that may be adjusted dynamically based on practical situations, so as to guarantee that the extracted knowledge information is closer to the expected length.
As a preferred embodiment, the threshold adjustment factor G is in a range 5≦G≦30. As demonstrated by experiments, the best effect of knowledge extraction may be obtained when the threshold adjustment factor G is set in this range.
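The threshold rule above can be written as a short sketch. The function names `comparison_result` and `weight_threshold` are hypothetical; K is taken to be the property weight density and G the threshold adjustment factor, and the preferred range 5≦G≦30 is not enforced here because it is stated only as a preference.

```python
def comparison_result(expected_length: float, group_length: float,
                      redundant_value: float) -> float:
    """F = expected length / (length of the initial sentence group + redundant value)."""
    return expected_length / (group_length + redundant_value)


def weight_threshold(k: float, f: float, g: float) -> float:
    """Return (K/F)/G when F >= 1 and (K/F)*G when F < 1."""
    if f <= 0 or g <= 0:
        raise ValueError("F and G must be positive")
    return (k / f) / g if f >= 1 else (k / f) * g
```

With this rule the threshold grows as F shrinks and jumps by a factor of G squared when F crosses below 1, which is what makes expansion hard once the group would exceed the expected length.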
As an alternative embodiment, the knowledge extraction method of this embodiment further comprises the following steps:
The property name of property parameter αi is a keyword predetermined according to knowledge information to be extracted and is represented by a character string corresponding to the property name. Determining whether property parameter αi is contained in a sentence is to determine whether the sentence includes a character string representing property parameter αi. Weight vi corresponding to property parameter αi may be determined according to the importance degree of property parameter αi, i.e., the more important the property parameter αi is, the larger value the corresponding weight vi is assigned, and vice versa.
In addition to the equation K=Σvi/N, the property weight density K may also be specified by users according to practical demands.
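As a small illustration of the equation K=Σvi/N, the sketch below assumes that N is the number of property parameters in the predetermined set and that the weights are held in a dictionary keyed by property name. The name `property_weight_density` is hypothetical.

```python
def property_weight_density(weights: dict[str, float]) -> float:
    """K = (sum of all property weights v_i) / N, with N assumed to be the set size."""
    if not weights:
        raise ValueError("at least one property is required")
    return sum(weights.values()) / len(weights)
```

A user-specified K, as the text allows, would simply bypass this function.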
On the basis of embodiment 1 and embodiment 2, in the knowledge extraction method of this embodiment, as shown in
In this embodiment, the expansion of the initial sentence group comprises left expansion, right expansion or left-right expansion, in which:
In the knowledge extraction method of this embodiment, in the step of obtaining a weight of a left sentence and a weight of a right sentence:
After the above determination is performed on the left and right sentences, for example, if it is determined that the left sentence includes property parameters α1 and α2, the weight of the left sentence is WL=v1+v2; if it is determined that the right sentence includes property parameters α3 and α4, the weight of the right sentence is WR=v3+v4. Herein, when the same property αi occurs several times, the corresponding weight vi may be accumulated once or multiple times. In general, in order to obtain a result that better meets users' demands, the weight vi is accumulated as many times as the property αi occurs.
As an alternative solution, the sentence weight may be calculated as Σβivi, wherein βivi is the value contributed by property αi occurring in a sentence and βi is a field feature weight of property αi. The field feature weight of property αi may be obtained through training using field documents. When every βi is 1, it becomes the scheme adopted in this embodiment. This embodiment only provides one method of obtaining a weight WL of a left sentence and/or a weight WR of a right sentence adjacent to the initial sentence group. Other methods of calculating sentence weight existing in the prior art may be adopted, so long as the same method is used throughout for the calculation of all sentence weight values.
In the knowledge extraction method of this embodiment, according to the result of the comparison between lengths of initial sentence groups and the expected length, a weight threshold is set for the initial sentence groups. The comparison result F=expected length/(the length of an initial sentence group+a redundant value), and the weight threshold is set as a function of the comparison result F. The smaller the comparison result F is, i.e., the closer the length of the initial sentence group approaches the expected length or the further it goes beyond the expected length, the larger the weight threshold is. The weight WL of the left sentence and/or the weight WR of the right sentence adjacent to the initial sentence group is compared with the weight threshold. Only if the weight WL of the left sentence and/or the weight WR of the right sentence is greater than or equal to the weight threshold is the left sentence and/or the right sentence expanded into the initial sentence group to form a new sentence group; otherwise, no expansion is performed on the initial sentence group. Thus, the weight threshold may be adjusted dynamically according to the result of the comparison between the lengths of the initial sentence groups and the expected length. For example, if the length of an initial sentence group is far less than the expected length, the weight threshold becomes very small, so that the weight WL of the left sentence and the weight WR of the right sentence are likely to be greater than the weight threshold and the left sentence and/or the right sentence is likely to be expanded into the initial sentence group; otherwise, the weight threshold becomes very large, and the left sentence and/or the right sentence may be expanded into the initial sentence group only if it includes many property parameters αi.
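The sentence weight calculation described above can be sketched as follows. This is an illustration under assumptions: the name `sentence_weight` and its parameters are hypothetical, property detection is plain substring matching on the property name, and overlapping property names are not treated specially.

```python
from typing import Mapping, Optional


def sentence_weight(sentence: str,
                    weights: Mapping[str, float],
                    field_weights: Optional[Mapping[str, float]] = None,
                    count_repeats: bool = True) -> float:
    """Sum beta_i * v_i over the properties whose name occurs in the sentence.

    With field_weights=None every beta_i is 1, which is the basic scheme.
    With count_repeats=True a weight is accumulated once per occurrence.
    """
    total = 0.0
    for name, v in weights.items():
        occurrences = sentence.count(name)  # non-overlapping substring matches
        if occurrences == 0:
            continue
        if not count_repeats:
            occurrences = 1
        beta = 1.0 if field_weights is None else field_weights.get(name, 1.0)
        total += beta * v * occurrences
    return total
```

Whichever options are chosen, the same call should be used for every sentence, in line with the requirement that one method be applied throughout.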
In this manner, the length of the initial sentence group may be controlled effectively to obtain a final sentence group having a length approaching the expected length.
In the knowledge extraction method of this embodiment, in the step of determining the comparison result F, in the case of left expansion of the initial sentence group, the redundant value is set to half of the length of the left sentence adjacent to the initial sentence group; in the case of right expansion of the initial sentence group, the redundant value is set to half of the length of the right sentence adjacent to the initial sentence group.
In practical applications, in left expansion, the redundant value may be selected as m times the length of the left sentence adjacent to the initial sentence group; in right expansion, the redundant value may be selected as m times the length of the right sentence adjacent to the initial sentence group; preferably, m is a value less than 1. When m is 0.5, it becomes the scheme provided in this embodiment. According to statistics, with the redundant value of this embodiment the final sentence group gets close enough to the expected length.
On the basis of any of embodiment 1 to embodiment 3, as shown in
In the step of left expanding and/or right expanding the initial sentence group to obtain a final sentence group, when the number of sentences for left expansion of the initial sentence group is greater than the left-expansion sentence number threshold L, no left expansion is performed on the initial sentence group anymore; when the number of sentences for right expansion of the initial sentence group is greater than the right-expansion sentence number threshold R, no right expansion is performed on the initial sentence group anymore.
Through limiting the number of sentences for left and/or right expansion of an initial sentence group, left and/or right expansion of the initial sentence group may be further controlled in a reasonable range, making it convenient to check and understand the sentence group finally extracted.
As a preferred embodiment, in the step of setting a sentence number threshold for left and/or right expansion in the knowledge extraction method of this embodiment, in the case of left and right expanding the initial sentence group, the left-expansion sentence number threshold L is set to 6 and the right-expansion sentence number threshold R is set to 6; in the case of only left expanding the initial sentence group, the left-expansion sentence number threshold L is set to 12 and the right-expansion sentence number threshold R is set to 0; in the case of only right expanding the initial sentence group, the left-expansion sentence number threshold L is set to 0 and the right-expansion sentence number threshold R is set to 12.
As demonstrated by experiments, through setting the left-expansion sentence number threshold and right-expansion sentence number threshold to the above values, the best effect may be obtained in terms of not only sentence coherence in the result of knowledge extraction, but also length control of the final sentence group.
On the basis of any of embodiment 1 to embodiment 4, the knowledge extraction method of this embodiment further comprises the following steps:
Note that, in the calculation of the final sentence group weight density K′, it is also possible to divide final sentence group weight by the number of sentences in the final sentence group, so long as the same criterion is adopted for each final sentence group in the calculation of the final sentence group weight density K′.
From the above determinations, for example, if it is determined that a final sentence group includes property parameters α1, α3 and α5, adding the weights V1, V3 and V5 together gives the final sentence group weight V1+V3+V5; if the length of the final sentence group is 300 characters, the final sentence group weight density K′=(V1+V3+V5)/300. If a property parameter αi occurs more than once in the final sentence group, whether in one sentence or in different sentences, its corresponding weight may be added once or several times. In general, for a result that better meets the demands of users, the weight Vi is added as many times as the property parameter αi occurs.
Alternatively, the sentence group weight may be calculated as Σβivi, wherein βivi is the value contributed by property αi present in sentences in the sentence group and βi is a field feature weight of property αi. The field feature weight of property αi may be obtained through training using field documents. When all βi are 1, it becomes the scheme used in the present embodiment. This embodiment only provides one method of obtaining the final sentence group weight. Other methods of calculating sentence weight existing in the prior art may be adopted, so long as the same method is used to calculate weights for all sentences in the sentence group.
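The weight density K′ can be sketched as follows, assuming the sentence weights have already been computed and that length is counted in characters excluding whitespace. The name `group_weight_density` and the `per_sentence` switch (for the variant that divides by the number of sentences) are hypothetical.

```python
def group_weight_density(sentences: list[str],
                         sentence_weights: list[float],
                         per_sentence: bool = False) -> float:
    """K' = group weight / group length, or / sentence count if per_sentence."""
    if not sentences or len(sentences) != len(sentence_weights):
        raise ValueError("need one weight per sentence and at least one sentence")
    total = sum(sentence_weights)
    if per_sentence:
        return total / len(sentences)
    # Character count with all whitespace removed (assumed counting rule).
    length = sum(len("".join(s.split())) for s in sentences)
    if length == 0:
        raise ValueError("sentence group has no characters")
    return total / length
```

The same value of `per_sentence` has to be used for every final sentence group, otherwise the densities are not comparable.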
According to the knowledge extraction method of this embodiment, the step of extracting knowledge further comprises: deduplicating and outputting final sentence groups in which final sentence groups are deduplicated and then outputted.
According to the knowledge extraction method of this embodiment, the step of extracting knowledge further comprises: removing and outputting final sentence groups, in which a minimum length is set for final sentence groups and those final sentence groups having a length less than the minimum length are removed.
According to the knowledge extraction method of this embodiment, the step of extracting knowledge further comprises: sorting and outputting final sentence groups, in which final sentence groups are sorted and then outputted according to the weight density K′ of each final sentence group.
According to the knowledge extraction method of this embodiment, through deduplicating all final sentence groups, the output of duplicate knowledge information is avoided so that a waste of time due to reading duplicate contents may be prevented; through setting a minimum length for final sentence groups and removing those final sentence groups having a length less than the minimum length, more knowledge information is contained in each final sentence group that is outputted, thereby satisfying the requirement of consulting by users; through sorting and outputting final sentence groups according to the weight density K′ of each final sentence group, users may selectively read final sentence groups that are extracted. For example, according to weight densities K′, final sentence groups are sorted in descending order and then outputted. Users only need to read the first few final sentence groups to obtain desired knowledge information, so that time for querying by users may be reduced.
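The three output steps (deduplicating, removing short groups, sorting by K′) can be combined in one pass, as in the sketch below. The `FinalGroup` record and the `postprocess` function are hypothetical, and two final sentence groups are assumed to be duplicates when they cover the same span of sentences.

```python
from typing import NamedTuple


class FinalGroup(NamedTuple):
    start: int       # index of the first sentence in the group
    end: int         # index of the last sentence in the group
    length: int      # group length in characters
    density: float   # weight density K'


def postprocess(groups: list[FinalGroup], min_length: int) -> list[FinalGroup]:
    """Drop duplicate spans and short groups, then sort by K' in descending order."""
    seen = set()
    kept = []
    for grp in groups:
        key = (grp.start, grp.end)
        if key in seen or grp.length < min_length:
            continue
        seen.add(key)
        kept.append(grp)
    return sorted(kept, key=lambda grp: grp.density, reverse=True)
```

Because the result is in descending order of K′, a reader can stop after the first few groups.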
A particular example of knowledge extraction is further provided in this embodiment, with the following text:
There are totally 68 properties in the above set of properties. The sum of weights corresponding to those properties is 1, thus the property weight density K=1/68=0.1470588.
The above text is segmented at punctuation marks that end a complete sentence, such as periods, question marks and exclamation marks, and a total of 40 sentences are obtained after the segmentation. For simplicity of the description below, a label is provided for each sentence. In this embodiment, these 40 sentences are labeled J1 to J40. These labels are provided for the purpose of facilitating the understanding of this technical solution. In the operation of a practical system, these labels are not actually present in the text.
Initial sentence groups are formed by any three consecutive sentences, and the initial sentence groups obtained in such a manner are shown in a table below.
After the above initial sentence groups are obtained, expansion is performed for each initial sentence group. Below, an initial sentence group of three sentences J5-J7 is taken as an example to describe how to expand sentence groups in the process of knowledge extraction.
In this process of sentence group expansion, the expected sentence group length is set to 300. In left expansion of the sentence group, the redundant value is set to half of the length of the left adjacent sentence and L=6; in right expansion of the sentence group, the redundant value is set to half of the length of the right adjacent sentence and R=6. Both left expansion and right expansion are performed, and the description below performs left expansion before right expansion. Alternatively, right expansion before left expansion is also possible, or left expansion and right expansion may be performed alternately.
Parameters of the sentence group and a left sentence adjacent to the sentence group are obtained as follows.
The length of the sentence group J5-J7 is 155, counted in the characters contained in the sentence group (excluding spaces); this criterion is used throughout this embodiment for counting characters. The left sentence adjacent to the sentence group is J4, whose length is 23 and which includes the properties “” and “”. Thereby, the weight of J4 is the sum of the weight 0.045021438780371605 corresponding to “” and the weight 0.115054787994283 corresponding to “”, which is 0.160076226774654605.
The weight threshold is obtained as follows:
because F>1, the weight threshold is selected as (K/F)/G=0.004069142;
because the weight of J4 is larger than the weight threshold and the number of sentences that have been left expanded is less than 6, J4 may be expanded into the sentence group to form a new sentence group J4-J7.
Left expansion continues, taking the new sentence group J4-J7 as the initial sentence group. The length of the new sentence group is 155+23=178; the left sentence adjacent to it is J3, whose length is 41 and which includes the properties “” and “”. Thereby, the weight of J3 is the sum of the weights corresponding to these two properties: 0.01643639828489757+0.115054787994283=0.13149118627918057;
F=300/(178+41/2)=1.51133501;
Because F>1, the weight threshold is selected as (K/F)/G=0.0048774502;
Because the weight of J3 is larger than the weight threshold and the number of sentences that have been left expanded is less than 6, J3 may be expanded into the sentence group to form a new sentence group J3-J7.
Similarly, determinations are performed on J2 and J1 in turn through the same steps, which will not be described in detail. After these determinations, both J2 and J1 are determined as meeting the criterion for being expanded into the sentence group. Because J1 is the first sentence at the left side, left expansion of the sentence group terminates automatically once J1 has been expanded into the sentence group, and a new initial sentence group J1-J7 is obtained after left expansion.
Right expansion is performed on the initial sentence group J1-J7. The length of the initial sentence group is 267 and the right sentence adjacent to it is J8. The length of J8 is 64 and it includes the properties “”, “” and “”, wherein “” appears twice; thereby the weight of J8 is the sum of the weight of “”, the weight of “” and the weight of “” multiplied by 2, as follows: 0.02763220581229150+0.11505478799428300+0.06955693187232010*2=0.2818008575512147.
F=300/(267+64/2)=1.0033444816
Because F>1, a weight threshold (K/F)/G=0.0073284302 is selected.
Because the weight of J8 is greater than the weight threshold and the number of sentences that have been right expanded is less than 6, J8 is expanded into the initial sentence group to form a new sentence group J1-J8.
Right expansion continues while taking the sentence group J1-J8 as a new initial sentence group.
The length of the initial sentence group is 331 and the right sentence adjacent to it is J9. The length of J9 is 38 and it includes the properties “” and “”. Thereby, its weight is calculated as follows: 0.11505478799428300+0.02096236303001420=0.1360171510242972.
F=300/(331+38/2)=0.857142857
Because F<1, a weight threshold (K/F)*G=3.431372 is selected.
Although the number of sentences that have been right expanded is less than 6, since the weight of J9 is less than the weight threshold, J9 cannot be expanded into the sentence group and sentence group expansion terminates. Thus, if the length of the sentence group is greater than the expected length, the weight threshold will become very large, so that it is difficult for sentences having a moderate weight to be expanded into the sentence group.
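The two right-expansion decisions above can be checked numerically with the sketch below. K=0.1470588 and the lengths and weights are the values stated in the example, with the length of J1-J8 taken as 331 characters (267+64). The value G=20 is an assumption: the example does not state G, and 20 is simply the value that reproduces the two right-expansion thresholds given above. The helper name `threshold` is hypothetical.

```python
K = 0.1470588    # property weight density as stated in the example
G = 20           # ASSUMPTION: not stated in the text, inferred from the thresholds
EXPECTED = 300   # expected sentence group length


def threshold(group_length: int, neighbour_length: int) -> tuple[float, float]:
    """Return (F, weight threshold) with redundant value = half the neighbour length."""
    f = EXPECTED / (group_length + neighbour_length / 2)
    t = (K / f) / G if f >= 1 else (K / f) * G
    return f, t


f8, t8 = threshold(267, 64)   # J1-J7 with right neighbour J8
f9, t9 = threshold(331, 38)   # J1-J8 with right neighbour J9

w8 = 0.2818008575512147       # weight of J8
w9 = 0.1360171510242972       # weight of J9

expand_j8 = w8 >= t8          # J8 joins the group
expand_j9 = w9 >= t9          # J9 is rejected, expansion stops
```

The check shows how sharply the threshold rises once F drops below 1: J9 has about half the weight of J8, yet its threshold is several hundred times larger.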
In a similar manner, expansion is performed on the other initial sentence groups. For those skilled in the art, all initial sentence groups in a whole document may be expanded according to the process described above, which will not be further described herein.
After all final sentence groups are obtained, duplicate sentence groups are removed and the sentence groups are sorted according to their weight densities. Weight density K′=the weight of a final sentence group/the length of the final sentence group, the length of the final sentence group being the number of characters contained in the final sentence group and the weight of the final sentence group being the sum of the weights of the sentences in the final sentence group. The weight of each sentence is calculated by the method above, i.e., by adding together the weights of all properties appearing in the sentence.
With respect to the above input text, 20 final sentence groups are obtained, which are sorted by weight densities and outputted as follows:
J1-J8; J3-J9; J6-J10; J7-J11; J2-J8; J7-J12; J8-J13; J22-J26; J26-J30; J15-J19; J14-J18; J22-J27; J15-J20; J29-J34; J34-J40; J13-J17; J33-J40; J16-J22; J12-J17; J17-J22.
This embodiment provides a knowledge extraction system, as shown in
In this embodiment, knowledge extraction is realized through acquiring initial sentence groups each including one or more sentences by the initial sentence group acquisition module 1, and then comparing lengths of the initial sentence groups with an expected length by the initial sentence group expansion module 2 to determine initial sentence groups to be expanded according to the comparison result. Since the sentence groups are formed by consecutive sentences, it may be guaranteed that the sentence groups themselves have good coherence in logic, so that the final sentence groups obtained through expanding the initial sentence groups correspondingly have good coherence in logic. Thus, this disclosure may overcome the drawback of lacking logical coherence in extracted knowledge information in the prior art.
As a preferred embodiment, in the knowledge extraction method of this embodiment, the step of acquiring initial sentence groups comprises: dividing text into sentences; forming initial sentence groups by I consecutive sentences, wherein I is an integer greater than or equal to 1. As a preferred embodiment, I=3.
In this embodiment, in the knowledge extraction system of this embodiment, as shown in
In this embodiment, the text document is divided into sentences by the sentence dividing unit 11 to form initial sentence groups of three consecutive sentences. A better output result is obtained in this embodiment when I=3, guaranteeing that each final sentence group extracted includes at least three sentences. In this embodiment, three consecutive sentences are drawn from the text to form the initial sentence groups, so that the initial sentence groups themselves have good logical relationships; further, because the final sentence groups are obtained through expanding the initial sentence groups, the final sentence groups obtained through extraction also have good logical relationships and do not read as abrupt.
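By way of a non-limiting illustration, the acquisition of initial sentence groups may be sketched in Python as follows. The sketch makes two assumptions that are not stated in this disclosure: sentences end at the punctuation marks listed in the regular expression, and initial sentence groups are taken as overlapping windows that advance one sentence at a time. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Divide text into sentences at an assumed set of end marks."""
    pieces = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    return [p.strip() for p in pieces if p.strip()]


def initial_groups(sentences, i=3):
    """Return (start, end) index pairs, inclusive and 0-based,
    for every run of i consecutive sentences."""
    if i < 1:
        raise ValueError("I must be an integer greater than or equal to 1")
    return [(s, s + i - 1) for s in range(len(sentences) - i + 1)]


sentences = split_sentences("First one. Second one! Third? Fourth. Fifth.")
groups = initial_groups(sentences, i=3)
print(groups)
```

With five sentences and I=3, the sketch yields three initial sentence groups, each of which may then be expanded independently.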
In the knowledge extraction system of this embodiment, the initial sentence group expansion module 2 comprises a weight threshold setting unit 21 for setting a weight threshold for initial sentence groups according to the result of comparing lengths of the initial sentence groups with the expected length; a sentence group expansion unit 22 for, in expansion of the initial sentence groups, comparing weights of sentences to be expanded with the weight threshold, and expanding the initial sentence groups according to the comparison result.
In this embodiment, the relationship between the lengths of initial sentence groups and an expected length is taken into account, so that the lengths of the extracted final sentence groups closely approach the expected length.
The expected length in this embodiment is familiar to those skilled in the art. For example, the abstract of a patent description is limited to a length not exceeding 300 words. In the case of extracting relevant sentences from text to form an abstract of a patent application, the expected length is 300 words. If there is no specific requirement on the expected length, it may be selected based on practical demands.
The expected length, lengths of initial sentence groups and lengths of sentences in this embodiment and subsequent embodiments are all counted in the number of characters.
On the basis of embodiment 6, in the knowledge extraction system of this embodiment, as shown in
In the knowledge extraction system of this embodiment, the weight threshold determination subunit 212 comprises a threshold adjustment factor setting device 212a for setting and outputting a threshold adjustment factor G, wherein G is a value greater than 1; a property weight density acquisition device 212b for obtaining and outputting a property weight density K; a weight threshold acquisition device 212c for obtaining and outputting a weight threshold according to outputs of the threshold adjustment factor setting device 212a, the property weight density acquisition device 212b and the comparison result determination unit 211; when F is greater than or equal to 1, the weight threshold=(K/F)/G; when F is less than 1, the weight threshold=(K/F)*G, wherein G is the threshold adjustment factor and K is the property weight density.
In this embodiment, the weight threshold setting unit 21 sets a weight threshold according to the result of the comparison between the lengths of initial sentence groups and an expected length; the comparison result determination subunit 211 determines a comparison result F=the expected length/(the length of an initial sentence group+a redundant value); the weight threshold acquisition device 212c determines a weight threshold=(K/F)/G when F is greater than or equal to 1, and a weight threshold=(K/F)*G when F is less than 1. Thus, the smaller the comparison result F is, i.e., the closer the length of the initial sentence group approaches the expected length or the more the length of the initial sentence group exceeds the expected length, the larger the weight threshold is; that is, the weight threshold may be adjusted dynamically according to the result of the comparison between the lengths of the initial sentence groups and the expected length. Compared with the prior art, in which a fixed criterion is adopted, this embodiment provides a criterion that may be adjusted dynamically based on practical situations, so as to guarantee that the length of the extracted knowledge information is closer to the expected length.
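By way of a non-limiting illustration, the piecewise weight threshold described above may be sketched in Python as follows. The function name and the sample values of K and G are hypothetical and are chosen only to show how the threshold grows as F falls below 1.

```python
def weight_threshold(k, f, g):
    """Weight threshold as a function of the comparison result F.

    k: property weight density K
    f: comparison result F (expected length / (group length + redundant value))
    g: threshold adjustment factor G, a value greater than 1
    """
    if g <= 1:
        raise ValueError("G must be greater than 1")
    if f <= 0:
        raise ValueError("F must be positive")
    base = k / f
    # F >= 1: the group is still short, so the threshold is lowered.
    # F < 1: the group is already long, so the threshold is raised.
    return base / g if f >= 1 else base * g


# Hypothetical values of K and G, for illustration only.
short_group = weight_threshold(k=1.0, f=2.0, g=10)
long_group = weight_threshold(k=1.0, f=0.5, g=10)
print(short_group, long_group)
```

In this sketch the two branches differ by a factor of G squared for the same K/F, which is what makes expansion easy below the expected length and hard above it.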
As a preferred embodiment, in the knowledge extraction system of this embodiment, the threshold adjustment factor setting device 212a sets the threshold adjustment factor G in a range 5≦G≦30.
As demonstrated by experiments, the best effect of knowledge extraction may be obtained when the threshold adjustment factor G is set in this range.
As an alternative embodiment, the knowledge extraction system of this embodiment further comprises:
The property name of property parameter αi is a keyword predetermined according to knowledge information to be extracted and is represented by a character string corresponding to the property name. Determining whether property parameter αi is contained in a sentence is to determine whether the sentence includes a character string representing property parameter αi. Weight vi corresponding to property parameter αi may be determined according to the importance degree of property parameter αi, i.e., the more important the property parameter αi is, the larger value the corresponding weight vi is assigned, and vice versa.
In addition to the equation K=Σvi/N, the property weight density K may also be specified by users according to practical demands.
On the basis of embodiment 6 or embodiment 7, in the knowledge extraction system of this embodiment, as shown in
In this embodiment, in the case of only left expansion of the initial sentence group, if the weight WL of the left sentence adjacent to the initial sentence group is greater than or equal to the weight threshold, the new sentence group acquisition subunit 224 expands the left sentence into the initial sentence group to form a new sentence group and outputs it to the sentence weight acquisition subunit 222 as an initial sentence group, until no expansion is performed on the initial sentence group anymore, so as to obtain a final sentence group, the final sentence group being outputted to the knowledge extraction module 3.
In the case of only right expansion of the initial sentence group, if the weight WR of the right sentence adjacent to the initial sentence group is greater than or equal to the weight threshold, the new sentence group acquisition subunit 224 expands the right sentence into the initial sentence group to form a new sentence group and outputs it to the sentence weight acquisition subunit 222 as an initial sentence group, until no expansion is performed on the initial sentence group anymore, so as to obtain a final sentence group, the final sentence group being outputted to the knowledge extraction module 3.
In the case of both left and right expansion of the initial sentence group, if the weight WL of the left sentence adjacent to the initial sentence group and the weight WR of the right sentence adjacent to the initial sentence group are greater than or equal to the weight threshold, the new sentence group acquisition subunit 224 expands the left and right sentences into the initial sentence group to form a new sentence group and outputs it to the sentence weight acquisition subunit 222 as an initial sentence group, until no expansion is performed on the initial sentence group anymore, so as to obtain a final sentence group, the final sentence group being outputted to the knowledge extraction module 3.
In the knowledge extraction system of this embodiment, the sentence weight acquisition subunit 222 comprises: a first weight acquisition device 222a for adding together the weights vi corresponding to all property parameters αi contained in the left sentence adjacent to the initial sentence group to obtain a weight WL of the left sentence; a second weight acquisition device 222b for adding together the weights vi corresponding to all property parameters αi contained in the right sentence adjacent to the initial sentence group to obtain a weight WR of the right sentence. The above determination is performed on the left and right sentences; for example, if it is determined that the left sentence includes property parameters α1 and α2, the weight of the left sentence is WL=v1+v2; if it is determined that the right sentence includes property parameters α3 and α4, the weight of the right sentence is WR=v3+v4. Herein, when the same property αi occurs several times, the corresponding weight vi may be accumulated once or multiple times. In general, in order to obtain a result that better meets users' demands, the weight vi may be accumulated once for each occurrence of the property αi.
As an alternative solution, the sentence weight may be calculated as Σβivi, wherein βivi is the value contributed by a property αi occurring in a sentence and βi is a field feature weight of property αi. The field feature weight of property αi may be obtained through training using field documents. When βi is 1, this becomes the scheme adopted in this embodiment. This embodiment only provides one method of obtaining a weight WL of a left sentence and/or a weight WR of a right sentence adjacent to the initial sentence group. Other methods of calculating sentence weight existing in the prior art may be adopted, so long as the same method is used throughout for the calculation of all sentence weight values.
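By way of a non-limiting illustration, the sentence weight calculation may be sketched in Python as follows. The sketch assumes that a property is detected by plain substring matching of its property name, as described for the property parameters above; the function name, the parameter names and the sample property names are hypothetical.

```python
def sentence_weight(sentence, property_weights, field_weights=None,
                    count_occurrences=True):
    """Sum beta_i * v_i over the properties found in a sentence.

    property_weights: mapping of property name -> weight v_i
    field_weights: optional mapping of property name -> beta_i
                   (a missing entry is treated as 1)
    count_occurrences: if True, v_i is accumulated once per occurrence;
                       if False, once per property
    """
    total = 0.0
    for name, v in property_weights.items():
        n = sentence.count(name)  # non-overlapping substring matches
        if n == 0:
            continue
        if not count_occurrences:
            n = 1
        beta = 1.0 if field_weights is None else field_weights.get(name, 1.0)
        total += beta * v * n
    return total


# Hypothetical property names and weights, for illustration only.
weights = {"alpha": 0.2, "beta": 0.1, "gamma": 0.5}
w_each = sentence_weight("alpha beta alpha", weights)
w_once = sentence_weight("alpha beta alpha", weights, count_occurrences=False)
print(w_each, w_once)
```

Leaving `field_weights` unset corresponds to the case in which every βi is 1.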
In the knowledge extraction system of this embodiment, a weight threshold is set for the initial sentence groups according to the result of comparing the lengths of the initial sentence groups with the expected length. The comparison result F=expected length/(the length of an initial sentence group+a redundant value), and the weight threshold is set as a function of the comparison result F. The smaller the comparison result F is, i.e., the closer the length of the initial sentence group approaches the expected length or the more it exceeds the expected length, the larger the weight threshold is. The weight WL of the left sentence and/or the weight WR of the right sentence adjacent to the initial sentence group is compared with the weight threshold; only if that weight is greater than or equal to the weight threshold is the left sentence and/or the right sentence expanded into the initial sentence group to form a new sentence group; otherwise, no expansion is performed on the initial sentence group. Thus, the weight threshold may be adjusted dynamically according to the result of the comparison between the lengths of the initial sentence groups and the expected length. For example, if the length of an initial sentence group is far less than the expected length, the weight threshold becomes very small, so that the weight WL of the left sentence and the weight WR of the right sentence are likely to exceed the weight threshold and the left sentence and/or the right sentence is readily expanded into the initial sentence group; conversely, if the length of the initial sentence group approaches or exceeds the expected length, the weight threshold becomes very large, and the left sentence and/or the right sentence may be expanded into the initial sentence group only if it includes many property parameters αi.
In this manner, the length of the initial sentence group may be controlled effectively to obtain a final sentence group having a length approaching the expected length.
In the knowledge extraction system of this embodiment, the comparison result determination unit 211 comprises: a redundant value setting device 211a for setting a redundant value, wherein in the case of left expansion of the initial sentence group, the redundant value is set to half of the length of the left sentence adjacent to the initial sentence group; in the case of right expansion of the initial sentence group, the redundant value is set to half of the length of the right sentence adjacent to the initial sentence group.
In practical applications, in left expansion, the redundant value may be selected as m times the length of the left sentence adjacent to the initial sentence group; in right expansion, the redundant value may be selected as m times the length of the right sentence adjacent to the initial sentence group; preferably, m is a value less than 1. When m is 0.5, this becomes the scheme provided in this embodiment. According to statistics, with the redundant value of this embodiment the length of the final sentence group comes sufficiently close to the expected length.
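By way of a non-limiting illustration, the effect of the multiplier m on the comparison result F may be sketched in Python as follows. The function and parameter names are hypothetical; the sample lengths are used only to show the trend.

```python
def comparison_result(expected_length, group_length, adjacent_length, m=0.5):
    """F = expected length / (group length + redundant value),
    where the redundant value is m times the length of the
    adjacent sentence being considered for expansion."""
    redundant = m * adjacent_length
    return expected_length / (group_length + redundant)


# A larger m makes the candidate sentence count more heavily against
# the expected length, which lowers F and so raises the threshold.
for m in (0.0, 0.5, 1.0):
    print(m, comparison_result(300, 331, 38, m=m))
```

With m=0.5, a group length of 331 and an adjacent sentence of 38 characters, the denominator is 350.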
On the basis of any of embodiment 6 to embodiment 8, as shown in
Through limiting the number of sentences for left and/or right expansion of an initial sentence group, left and/or right expansion of the initial sentence group may be further controlled in a reasonable range, making it convenient to check and understand the sentence group finally extracted.
As a preferred embodiment, in the knowledge extraction system of this embodiment, in the case of both left and right expanding the initial sentence group, the left-expansion sentence number threshold L is set to 6 and the right-expansion sentence number threshold R is set to 6; in the case of only left expanding the initial sentence group, the left-expansion sentence number threshold L is set to 12 and the right-expansion sentence number threshold R is set to 0; in the case of only right expanding the initial sentence group, the left-expansion sentence number threshold L is set to 0 and the right-expansion sentence number threshold R is set to 12.
As demonstrated by experiments, through setting the left-expansion sentence number threshold and right-expansion sentence number threshold to the above values, the best effect may be obtained in terms of not only sentence coherence in the result of knowledge extraction, but also length control of the final sentence group.
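By way of a non-limiting illustration, right-only expansion under a right-expansion sentence number threshold R may be sketched in Python as follows. The sketch assumes that sentence weights and sentence lengths (in characters) have already been computed, and that the redundant value is m times the length of the candidate sentence; the function name, parameter names and sample values are hypothetical.

```python
def expand_right(weights, lengths, start, end, expected_length, k, g,
                 max_right=12, m=0.5):
    """Expand the group [start, end] (inclusive indices) to the right.

    Stops when the next sentence's weight falls below the weight
    threshold, when max_right sentences have been added, or when
    the document ends.
    """
    added = 0
    while end + 1 < len(weights) and added < max_right:
        nxt = end + 1
        group_length = sum(lengths[start:end + 1])
        f = expected_length / (group_length + m * lengths[nxt])
        threshold = (k / f) / g if f >= 1 else (k / f) * g
        if weights[nxt] < threshold:
            break
        end = nxt
        added += 1
    return start, end


# Hypothetical document: ten sentences of 40 characters, weight 0.1 each.
lengths = [40] * 10
weights = [0.1] * 10
result = expand_right(weights, lengths, 0, 2, expected_length=300,
                      k=0.01, g=10)
print(result)
```

In this hypothetical run the group keeps growing while its length is below the expected length and stops as soon as F drops below 1, because the threshold then exceeds the sentence weight.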
On the basis of any of embodiment 6 to embodiment 9, in the knowledge extraction system of this embodiment, as shown in
Note that, in the calculation of the final sentence group weight density K′, it is also possible to divide final sentence group weight by the number of sentences in the final sentence group, so long as the same criterion is adopted for each final sentence group in the calculation of the final sentence group weight density K′.
From the above determinations, for example, if it is determined that a final sentence group includes property parameters α1, α3 and α5, a weight=V1+V3+V5 is obtained for the final sentence group by adding the weights V1, V3 and V5 together; if the length of the final sentence group is 300 characters, the final sentence group weight density K′=(V1+V3+V5)/300. If the same property parameter αi occurs more than once, in one sentence or in different sentences of the final sentence group, its corresponding weight Vi may be added once or several times. In general, for a result that better meets the demands of users, the weight Vi may be added once for each occurrence of property parameter αi.
Alternatively, the sentence group weight may be calculated as Σβivi, wherein βivi is the value contributed by a property αi present in the sentences of the sentence group and βi is a field feature weight of property αi. The field feature weight of property αi may be obtained through training using field documents. When all βi are 1, this becomes the scheme used in the present embodiment. This embodiment only provides one method of obtaining the final sentence group weight. Other methods of calculating sentence weight existing in the prior art may be adopted, so long as the same method is used to calculate the weights of all sentences in the sentence group.
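By way of a non-limiting illustration, the final sentence group weight density K′ may be sketched in Python as follows, covering both division by the number of characters and the alternative division by the number of sentences. The function name, parameter names and sample values are hypothetical.

```python
def group_weight_density(sentence_weights, sentence_lengths,
                         per_sentence=False):
    """K' = group weight / group length.

    The group weight is the sum of the sentence weights. The group
    length is the total number of characters, or the number of
    sentences when per_sentence is True.
    """
    total_weight = sum(sentence_weights)
    if per_sentence:
        denominator = len(sentence_weights)
    else:
        denominator = sum(sentence_lengths)
    if denominator == 0:
        raise ValueError("the sentence group must not be empty")
    return total_weight / denominator


# Hypothetical group: three sentences of 100 characters each.
by_chars = group_weight_density([0.3, 0.2, 0.1], [100, 100, 100])
by_count = group_weight_density([0.3, 0.2, 0.1], [100, 100, 100],
                                per_sentence=True)
print(by_chars, by_count)
```

Whichever denominator is chosen, it should be applied to every final sentence group so that the resulting densities remain comparable.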
In the knowledge extraction system of this embodiment, the knowledge extraction module 3 comprises: a final sentence group deduplicating and outputting unit 31 for deduplicating the final sentence groups and then outputting the final sentence groups.
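In the knowledge extraction system of this embodiment, the knowledge extraction module 3 further comprises: a final sentence group removing and outputting unit 32 for setting a minimum length for final sentence groups and removing those final sentence groups having a length less than the minimum length.
In the knowledge extraction system of this embodiment, the knowledge extraction module 3 further comprises: a final sentence group sorting and outputting unit 33 for sorting and outputting final sentence groups according to the weight density K′ of each final sentence group.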
In the knowledge extraction system of this embodiment, the final sentence group deduplicating and outputting unit 31 deduplicates all of the obtained final sentence groups, so that the output of duplicate knowledge information is avoided and users do not waste time reading duplicate contents; the final sentence group removing and outputting unit 32 sets a minimum length for final sentence groups and removes those final sentence groups having a length less than the minimum length, so that each outputted final sentence group contains more knowledge information, thereby satisfying the consulting requirements of users; the final sentence group sorting and outputting unit 33 sorts and outputs the final sentence groups according to the weight density K′ of each final sentence group, so that users may selectively read the extracted final sentence groups. For example, final sentence groups are sorted in descending order of weight density K′ and then outputted. Users only need to read the first few final sentence groups to obtain the desired knowledge information, so that the time users spend on querying may be reduced.
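By way of a non-limiting illustration, the deduplicating, removing and sorting operations may be sketched together in Python as follows. The sketch assumes that two final sentence groups are duplicates when they span the same sentences, and that each group already carries its length in characters and its weight density K′; the type name, function name and sample values are hypothetical.

```python
from typing import NamedTuple


class FinalGroup(NamedTuple):
    start: int      # index of the first sentence
    end: int        # index of the last sentence
    length: int     # number of characters in the group
    density: float  # weight density K'


def postprocess(groups, min_length=0):
    """Deduplicate, drop groups shorter than min_length, then sort
    by weight density in descending order."""
    seen = set()
    unique = []
    for group in groups:
        key = (group.start, group.end)
        if key in seen:
            continue
        seen.add(key)
        unique.append(group)
    kept = [group for group in unique if group.length >= min_length]
    return sorted(kept, key=lambda group: group.density, reverse=True)


# Hypothetical final sentence groups, for illustration only.
candidates = [
    FinalGroup(0, 7, 331, 0.004),
    FinalGroup(2, 8, 300, 0.006),
    FinalGroup(0, 7, 331, 0.004),  # duplicate of the first group
    FinalGroup(5, 6, 50, 0.009),   # shorter than the minimum length
]
ordered = postprocess(candidates, min_length=100)
print([(group.start, group.end) for group in ordered])
```

Because the output is in descending order of K′, a reader may stop after the first few groups.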
This disclosure also provides one or more computer readable media having stored thereon computer-executable instructions that, when executed by a computer, perform a knowledge extraction method comprising: acquiring initial sentence groups, each initial sentence group including one or more sentences; expanding the initial sentence groups, in which lengths of the initial sentence groups are compared with an expected length to determine an initial sentence group to be expanded according to the comparison result; and extracting knowledge, in which the sentence groups that are finally obtained after expansion are outputted to realize knowledge extraction.
Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system or a computer program product. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application can take the form of a computer program product implemented on one or more storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing program code that can be executed by computers.
This application is described with reference to the method, equipment (system) and the flow charts and/or block diagrams of computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowchart and/or block diagrams, as well as combinations of flows and/or blocks in the flowchart and/or block diagram, can be achieved through computer program commands. Such computer program commands can be provided to general computers, special-purpose computers, embedded processors or any other processors of programmable data processing equipment so as to generate a machine, so that a device for realizing one or multiple flows in the flow diagram and/or the functions specified in one block or multiple blocks of the block diagram is generated by the commands to be executed by computers or any other processors of the programmable data processing equipment.
Such computer program commands can also be stored in a computer readable memory that can direct computers or other programmable data processing equipment to work in a specific manner, so that the commands stored in the computer readable memory generate a product including a command device; such a command device can achieve one or multiple flows in the flowchart and/or the functions specified in one or multiple blocks of the block diagram.
Such computer program commands can also be loaded on computers or other programmable data processing equipment so as to carry out a series of operation steps on computers or other programmable equipment to generate the process to be achieved by computers, so that the commands to be executed by computers or other programmable equipment achieve the one or multiple flows in the flowchart and/or the functions specified in one block or multiple blocks of the block diagram.
Although preferred embodiments of this application have been described, those skilled in the art can make additional modifications and alterations to these embodiments once they understand the basic inventive concept. Therefore, the appended claims are intended to be interpreted as encompassing the preferred embodiments and all modifications and alterations within the scope of this application.
Number | Date | Country | Kind |
---|---|---|---|
201310456958.7 | Sep 2013 | CN | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2013/088777 | 12/6/2013 | WO | 00 |