The invention relates to a method and a system for semantic relation extraction, in particular from biomedical data.
The rapid growth of published literature in many fields of technology such as the biomedical domain renders automated information extraction tools indispensable for researchers to make use of this immense source of knowledge.
The past decade has seen an unprecedented increase of biomedical data in published literature. Progress in computational and biomedical methods has increased the pace of biomedical research. High throughput experiments, such as micro-arrays, produce large quantities of high-quality data, which consequently leads to an increase of new findings and results. This development has caused an explosion of scientific literature published in this technical field. The overwhelming amount of textual information makes it necessary to use automated text information extraction tools to efficiently use the enormous amount of knowledge contained in biomedical literature stored in databases. Text mining applications are provided to transfer unstructured information, such as unstructured text information, into structured form. Some text mining applications can only identify named entities. Possible entities in the biomedical field are genes, diseases, drugs, compounds, proteins etc. More important than identifying entities in an unstructured information base is the identification of associations and relations between these entities. Relation extraction (RE) is the task of finding associations and roles between entities in an unstructured information base such as text phrases. These text phrases are usually but not necessarily formed by a sentence.
The conventional semantic relation extraction methods comprise two consecutive steps. In a first step the entities are identified by means of named entity recognition (NER). In a second step a relation type is predicted for each pair of entities.
After the feature definition and the generation of the annotated training corpus in the preprocessing phase, a feature set is provided for the annotated training corpus and weights are calculated on the basis of a feature label distribution in a training phase.
In a further step an input query is input by a user to extract a semantic relation. A possible example is the sentence “Inactivating TP53 mutations were found in 55% of lethal metastatic pancreatic neoplasms”. The input query is tokenized into a sequence of tokens.
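The tokenization of the example query can be sketched as follows. This is a minimal illustration only: the splitting rule (word tokens with optional internal hyphens, every other symbol as its own token) and the function name `tokenize` are assumptions, since the tokenization rule actually used is not specified in the text.

```python
import re

def tokenize(text):
    # Assumed rule: alphanumeric runs (with internal hyphens, e.g. "COX-2")
    # form one token; any other non-space character is a token of its own.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

query = ("Inactivating TP53 mutations were found in 55% of lethal "
         "metastatic pancreatic neoplasms")
tokens = tokenize(query)
for position, token in enumerate(tokens):
    print(position, token)
```

Under this assumed rule the gene name remains a single token, while "55%" is split into a number token and a symbol token.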
The table of
After completion of the entity detection a second step for relation extraction is performed in the conventional method as shown in the flow-chart of
As can be seen from the given example, relation extraction in conventional methods is performed in a two-step manner, i.e. first the participating entities are identified and then the relations between the entities are extracted. All pairs of entities are enumerated for a given text phrase and for each pair a prediction is made whether there is a relation or not.
However, this conventional method for relation extraction as shown in the flow-chart of
It is an object of the present invention to provide a method and a system for overcoming the disadvantages of the conventional method for semantic relation extraction as shown in
The invention provides a method and a system for semantic relation extraction on the basis of an annotated training corpus having tokens with associated relation labels each indicating a relation between the respective token and a selectable key entity wherein semantic relations between the key entity and other entities are directly extracted from unstructured text using a probabilistic extraction model.
In an embodiment of the system according to the present invention the probabilistic extraction model is a conditional random field (CRF).
In a preprocessing phase a feature definition is performed in step S1 and the training corpus is generated in step S2. An example for an annotated training corpus generated in step S2 is shown in
During a training phase consisting of step S3, S4 as shown in
The features used by the method according to the present invention comprise a set of standard recognition features and additional relation recognition features. The standard recognition features can comprise orthographic features, word shape features, n-gram features, dictionary features or context features.
Biomedical entities often exhibit certain orthographic characteristics. In many cases, biomedical entities consist of capitalized letters, include some numbers or are composed of combinations of both. Accordingly, orthographic features can help to distinguish various types of biomedical entities. Another recognition feature is a word shape feature.
Some words belonging to the same class of entities have the same word shape. For instance, disease abbreviations commonly consist of letters without any numbers, whereas for genes/proteins a co-occurrence of numbers and letters is typical.
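Orthographic and word shape features of this kind can be sketched as follows. The concrete feature names and the shape alphabet ("A" for uppercase, "a" for lowercase, "0" for digits, with repeated symbols collapsed) are assumptions chosen for illustration and are not taken from the text.

```python
import re

def orthographic_features(token):
    # Hypothetical binary orthographic features for a single token.
    return {
        "all_caps": token.isalpha() and token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        # Uppercase letters combined with digits, e.g. "TP53" or "COX-2".
        "caps_and_digits": bool(
            re.fullmatch(r"(?=.*[A-Z])(?=.*\d)[A-Z\d-]+", token)),
        "init_cap": token[:1].isupper(),
    }

def word_shape(token):
    # Map character classes to symbols, then collapse repeated symbols.
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"\d", "0", shape)
    return re.sub(r"(.)\1+", r"\1", shape)

for token in ["TP53", "BRCA2", "COX-2", "cancer"]:
    print(token, word_shape(token), orthographic_features(token))
```

With this shape alphabet, gene names such as "TP53" and "BRCA2" share one shape, which differs from the shape of an ordinary lowercase word.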
As a further recognition feature, the method according to the present invention uses character n-gram features for 2≦n≦4. This recognition feature helps to recognize informative sub-strings like “ASE” or “HOMEO”, especially for words not seen in training.
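Character n-gram extraction for 2≦n≦4 can be sketched as follows. Converting the token to upper case before extraction is an assumption made here so that sub-strings match irrespective of case; the function name `char_ngrams` is likewise hypothetical.

```python
def char_ngrams(token, n_min=2, n_max=4):
    # Collect every contiguous character sub-string of length n_min..n_max.
    token = token.upper()  # assumption: case-insensitive n-grams
    return {
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    }

print(sorted(char_ngrams("kinase")))
```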
A further group of recognition features are dictionary features. For example, a disease dictionary can be used and is constructed by taking all names and synonyms of concepts covered by the disease branch (C) of the MeSH ontology. Furthermore, as a possible embodiment keyword dictionaries are used for different relation types such as altered expression, genetic variation, regulatory modification and unrelated. For example, a genetic variation dictionary can contain words like “mutation” and “polymorphism”. A dictionary feature is on, if the token matches with at least one keyword in the corresponding dictionary. Note that the presence of a certain keyword in a sentence is indicative, but not imperative for a specific relation. This is handled by the method according to the present invention because of its probabilistic nature.
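A keyword dictionary feature of this kind can be sketched as follows. Only "mutation" and "polymorphism" are named in the text; all other dictionary entries, the lower-casing of tokens and the identifier names are assumptions for illustration.

```python
# Hypothetical keyword dictionaries, one per relation type.
RELATION_DICTIONARIES = {
    "genetic_variation": {"mutation", "mutations", "polymorphism"},
    "altered_expression": {"expression", "overexpression"},
    "regulatory_modification": {"methylated", "phosphorylation"},
    "unrelated": {"unrelated"},
}

def dictionary_features(token):
    # A feature is on if the token matches at least one keyword
    # of the corresponding dictionary.
    word = token.lower()
    return {name: word in keywords
            for name, keywords in RELATION_DICTIONARIES.items()}

print(dictionary_features("mutations"))
```

Because the model is probabilistic, such a feature only contributes evidence for a relation type through its learned weight and does not decide the label on its own.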
A further group of recognition features are context features. These context features consider the properties of preceding or following tokens for a current token xi in order to determine its category. Context features are important for several reasons. Consider the case of nested entities such as “breast cancer 2 protein is expressed . . . ”. In this text phrase one does not want to extract a disease entity. Thus, when determining the correct label y for the token “breast”, it is important that one of the following word features will be “protein”, indicating that “breast” refers to a gene/protein entity and not to a disease. In a possible embodiment a window size is set to three. Context features are not only important in case of nested entities but also for relation extraction.
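The context features for the nested entity example can be sketched as follows. The sketch assumes that a window size of three means up to three tokens on each side of the current token; the feature naming scheme is hypothetical.

```python
def context_features(tokens, i, window=3):
    # One feature per neighbouring token within the window, keyed by
    # its signed offset relative to the current position i.
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            features[f"word[{offset:+d}]={tokens[j].lower()}"] = True
    return features

tokens = ["breast", "cancer", "2", "protein", "is", "expressed"]
print(context_features(tokens, 0))
```

For the token "breast" at position 0, the word "protein" appears at offset +3 and is therefore still visible within the assumed window.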
In the method and system according to the present invention besides the recognition features further relation recognition features are provided. These additional relation recognition features comprise for example a dictionary window feature, a key entity neighborhood feature, a start window feature and a negation feature.
For each of the relation type dictionaries mentioned above, i.e. the altered expression dictionary, the genetic variation dictionary, the regulatory modification dictionary and the unrelated dictionary, a dictionary window feature is defined to be on if at least one keyword from the corresponding dictionary matches a word in a window size of N, i.e.
tokens away from the current token. In an embodiment N=20.
Furthermore, as a key entity neighborhood feature for each of the relation type dictionaries a feature is defined to be on if at least one keyword matches a word in a window size of M, i.e.
tokens away from the key entity token. In a possible embodiment M=6.
As a start window feature for each of the relation type dictionaries it is defined that the feature is on if at least one keyword matches a word in the first L tokens of a sentence. In a possible embodiment L=3. With this feature the fact is addressed that for many sentences important properties of a gene-disease-relation are mentioned at the beginning of a sentence.
A negation feature is defined such that this feature is on, if none of the three above-mentioned relation recognition features matches a dictionary keyword.
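The four relation recognition features can be sketched together as follows. Several points are assumptions: the exact extent of the windows is not recoverable from the text, so the sketch treats N and M as the number of tokens on each side of the anchor token; the negation feature is computed over all dictionaries at once; and the dictionary entries other than "mutation" and "polymorphism" are illustrative.

```python
def keyword_in(tokens, start, stop, keywords):
    # True if any token in tokens[start:stop] (clipped to the sentence)
    # matches a keyword.
    lo, hi = max(0, start), min(len(tokens), stop)
    return any(tokens[j].lower() in keywords for j in range(lo, hi))

def relation_features(tokens, i, key_index, dictionaries, n=20, m=6, l=3):
    features = {}
    for name, kw in dictionaries.items():
        # Dictionary window: keyword near the current token.
        features[f"dict_window:{name}"] = keyword_in(
            tokens, i - n, i + n + 1, kw)
        # Key entity neighborhood: keyword near the key entity token.
        features[f"key_neighborhood:{name}"] = keyword_in(
            tokens, key_index - m, key_index + m + 1, kw)
        # Start window: keyword among the first l tokens of the sentence.
        features[f"start_window:{name}"] = keyword_in(tokens, 0, l, kw)
    # Negation: on only if none of the features above fired.
    features["negation"] = not any(features.values())
    return features

dictionaries = {
    "genetic_variation": {"mutation", "mutations", "polymorphism"},
    "altered_expression": {"expression"},
}
tokens = ("Inactivating TP53 mutations were found in 55% of lethal "
          "metastatic pancreatic neoplasms").split()
features = relation_features(tokens, i=11, key_index=1,
                             dictionaries=dictionaries)
print(features)
```

In this example the keyword "mutations" lies within all three windows for the token "neoplasms", so the three genetic variation features are on and the negation feature is off.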
In an embodiment relation type features are based solely on dictionary information. In alternative embodiments, further information is integrated as relation type features such as word shape or n-gram features.
In step S3 of the flow-chart of
A conditional probability p of a label or state sequence y for a given input sequence x is defined as:

$$p(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$$

wherein Zx is a normalization factor, fk(yi−1, yi, x, i) is an arbitrary feature function and λk is a calculated weight for a feature function ranging between −∞ and +∞.
Each feature function fk specifies an association between a token x at a certain position and a label y for that position. Therefore, with each feature f one can express some characteristics of an empirical distribution of training data that should also be true for a model distribution.
The corresponding feature weight λk specifies whether the association should be favored or disfavored. Higher values of λ indicate that their corresponding label transitions are more likely. In general, a weight λ for each feature f is high if the feature f tends to be on for the correct labeling. The weight λ is negative if the feature tends to be off for the correct labeling and should be around zero if it is uninformative. The weights λ are learned in a possible embodiment from labeled training data of the training corpus by a maximum likelihood estimation (MLE) algorithm.
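For reference, maximum likelihood estimation for a model of this form can be written out as follows. This is the standard textbook formulation for a linear chain CRF and is not a statement of the specific training procedure of the embodiment; the training set notation (x^(j), y^(j)) is introduced here as an assumption, and any regularization term is omitted.

```latex
\mathcal{L}(\lambda)
  = \sum_{j} \log p\bigl(y^{(j)} \mid x^{(j)}\bigr)
  = \sum_{j} \Bigl[ \sum_{i} \sum_{k} \lambda_k
      f_k\bigl(y^{(j)}_{i-1}, y^{(j)}_{i}, x^{(j)}, i\bigr)
      - \log Z_{x^{(j)}} \Bigr]

\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{j} \sum_{i} f_k\bigl(y^{(j)}_{i-1}, y^{(j)}_{i}, x^{(j)}, i\bigr)
  - \sum_{j} \sum_{y} p\bigl(y \mid x^{(j)}\bigr)
      \sum_{i} f_k\bigl(y_{i-1}, y_{i}, x^{(j)}, i\bigr)
```

The gradient vanishes when the empirical count of each feature equals its expected count under the model, which corresponds to the statement that a feature expresses a characteristic of the empirical distribution that should also be true for the model distribution.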
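The normalization factor Zx is the sum over all possible state or label sequences S^N, where N is the length of the input sequence:

$$Z_x = \sum_{y \in S^N} \exp\left( \sum_{i=1}^{N} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$$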
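The role of the normalization factor can be illustrated with a brute-force computation on a toy model. Everything specific in this sketch is an assumption: the two labels, the feature functions, the weights and the "START" pseudo-label used as y0. Enumerating all label sequences is only feasible for very short inputs; practical implementations use dynamic programming instead.

```python
import itertools
import math

LABELS = ["O", "REL"]  # hypothetical label set S

# Hypothetical weights lambda_k, keyed by feature name.
WEIGHTS = {
    "emit:REL:upper=True": 1.5,
    "emit:O:upper=False": 1.0,
    "trans:REL->REL": 0.5,
}

def features(prev, cur, x, i):
    # Names of the binary feature functions f_k that are on at position i.
    return [f"emit:{cur}:upper={x[i].isupper()}", f"trans:{prev}->{cur}"]

def score(y, x):
    # sum_i sum_k lambda_k * f_k(y_{i-1}, y_i, x, i)
    total, prev = 0.0, "START"
    for i, cur in enumerate(y):
        total += sum(WEIGHTS.get(name, 0.0)
                     for name in features(prev, cur, x, i))
        prev = cur
    return total

def partition(x):
    # Z_x: sum over all |S|^N label sequences.
    return sum(math.exp(score(y, x))
               for y in itertools.product(LABELS, repeat=len(x)))

def probability(y, x):
    return math.exp(score(y, x)) / partition(x)

x = ["TP53", "is", "mutated"]
sequences = list(itertools.product(LABELS, repeat=len(x)))
best = max(sequences, key=lambda y: probability(y, x))
print(best, probability(best, x))
```

Dividing by Zx guarantees that the probabilities of all label sequences for a given input sum to one.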
After the training phase the user can input a query via the keyboard 5 to perform a semantic relation extraction in the extraction phase as shown in
In a further step S6 the query Q is tokenized, i.e. a token sequence x1, x2, . . . xm is generated as illustrated by
As can be seen from the table in
In a step S7 the token sequence of the input query Q is labeled by means of a Viterbi algorithm to find a most likely label sequence as shown in
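The Viterbi labeling of step S7 can be sketched as follows. The function `local_score` stands for the weighted feature sum at one position; the toy scoring function, the label names and the "START" pseudo-label are assumptions for illustration and do not reproduce a trained model.

```python
def viterbi(x, labels, local_score):
    # local_score(prev, cur, x, i) returns sum_k lambda_k * f_k(prev, cur, x, i).
    if not x:
        return []
    # best[i][y] = (score of the best path ending in y at i, previous label)
    best = [{y: (local_score("START", y, x, 0), None) for y in labels}]
    for i in range(1, len(x)):
        column = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[i - 1][p][0]
                       + local_score(p, cur, x, i))
            column[cur] = (best[i - 1][prev][0]
                           + local_score(prev, cur, x, i), prev)
        best.append(column)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y][0])]
    for i in range(len(x) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

DISEASE_WORDS = {"pancreatic", "neoplasms"}  # hypothetical lexical cue

def toy_score(prev, cur, x, i):
    s = 0.0
    if cur == "genetic_variation" and x[i] in DISEASE_WORDS:
        s += 2.0
    if cur == "O" and x[i] not in DISEASE_WORDS:
        s += 1.0
    if prev == cur == "genetic_variation":
        s += 0.5
    return s

tokens = ["TP53", "mutations", "in", "pancreatic", "neoplasms"]
labels = viterbi(tokens, ["O", "genetic_variation"], toy_score)
print(list(zip(tokens, labels)))
```

The algorithm finds the highest scoring label sequence in time linear in the sequence length instead of enumerating all possible sequences.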
In the present invention the investigated text phrase refers to a key entity KE such as “TP53” so that all other entities in the text phrase state a kind of relation to the key entity KE.
For example, a biographical text usually gives information about an entity such as “Tony Blair” and all other entities in the text are involved in a certain relation with the entity (for example his family). Thus, with the present invention it is possible to predict a kind of relation holding between the key entity KE and all other secondary entities. With the method and system according to the present invention relation extraction is treated as a sequence labeling task. Accordingly, with the present invention a named entity recognition NER and a relation extraction step are merged together.
Accordingly, with the method and system according to the present invention the entities' label y encodes a relation to the key entity KE and there is no initial labeling of the named entities.
Gene RIF-sentences represent a similar style of text in the biomedical domain as biographical text. Gene RIF-sentences describe the function of a gene/protein, the key entity KE, as a concise phrase. As a consequence, gene RIF-sentences are an adequate source for transferring relation extraction to a sequence labeling problem.
For example, the following gene RIF sentence is linked to a gene COX-2:
“COX-2 expression is significantly more common in endometrical adenocarcinoma and ovarian serous cystadenocarcinoma, but not in cervical squamous carcinoma, compared with normal tissue.”
This sentence states three disease relations with COX-2 (the key entity), namely two altered expression relations (expression of COX-2 relates to endometrical adenocarcinoma and ovarian serous cystadenocarcinoma) and one unrelated relation (cervical squamous carcinoma).
Relation extraction RE is treated by the method according to the present invention as a tagging task such as NER or part of speech POS tagging. Accordingly, for each secondary entity the method of the present invention predicts the type of relation it has to the key entity KE. Each word in a sentence is regarded as a token x. Each token x is associated with a tag or label y which indicates the type of the token x. In the given example sentence about COX-2, the label “unrelated” is assigned to the tokens “cervical”, “squamous”, “carcinoma”, as they are evidently not related to the key entity gene, whereas the tokens “endometrical”, “adenocarcinoma”, “ovarian”, “serous”, “cystadenocarcinoma” are each labeled as a disease related to the altered expression behaviour of the gene, thus, “altered expression”. These are the words representing diseases in the sentence. The other tokens x are labeled as not forming part of an entity. Two random variables X and Y are used to denote any input token sequences with associated label sequences. In the method according to the present invention a correct label sequence y1, y2, . . . , yn is assigned to the given token sequence x1, x2, . . . , xn.
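The labeling of the COX-2 example sentence and the read-out of relations from the label sequence can be sketched as follows. The label name "O" for tokens outside any entity, the assignment of "O" to the key entity token itself and the pre-separated punctuation are assumptions for illustration.

```python
from itertools import groupby

sentence = ("COX-2 expression is significantly more common in endometrical "
            "adenocarcinoma and ovarian serous cystadenocarcinoma , but not "
            "in cervical squamous carcinoma , compared with normal tissue .")
tokens = sentence.split()

# One label per token; "O" marks tokens that are not part of an entity.
labels = ["O"] * len(tokens)
for i in (7, 8, 10, 11, 12):
    labels[i] = "altered_expression"
for i in (17, 18, 19):
    labels[i] = "unrelated"

def extract_relations(tokens, labels, outside="O"):
    # Group consecutive tokens carrying the same label into one entity.
    relations = []
    for label, group in groupby(zip(tokens, labels), key=lambda p: p[1]):
        words = [token for token, _ in group]
        if label != outside:
            relations.append((" ".join(words), label))
    return relations

for entity, relation in extract_relations(tokens, labels):
    print("COX-2", relation, entity)
```

Because the label itself encodes the relation to the key entity, no separate pairwise classification step is needed. Note that this simple grouping would merge two directly adjacent entities carrying the same label.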
The method of relation extraction according to the present invention is based on a one-step probabilistic extraction model, such as a linear chain conditional random field CRF. For example, the method according to the present invention extracts relations between genes and diseases from Gene RIF (Gene Reference Into Function) sentences. Gene RIFs are sentences which refer to a particular gene in the Entrez gene database and describe its function in a concise phrase. The semantic relations extracted by the method and system according to the present invention can comprise different relations such as “altered expression”, “other genetic variation”, “regulatory modification”, “a general relation” or “an existing relation” between two entities. For example, gene-disease-relations are categorized based on whether a gene is causing a disease state, is a predisposition factor or is just associated with the disease. In an embodiment of the method according to the present invention, the gene-disease-relation categories are based on the observed state of a gene or protein, e.g. transcription level or mutation, associated with the disease state. In addition, a class for sentences reporting evidence of no association between a gene state and a disease and a neutral class for sentences in which no specific observed state is given are provided.
The “altered expression” level of a gene/protein is reported to be associated with a certain disease or state of a disease. For example “low expression of BRCA-1 was associated with colorectal cancer”.
As a further semantic relation, the “genetic variation” relates to a mutational event which is reported to be related with a disease. For example, “Inactivating TP53 mutations were found in 55% of lethal metastatic pancreatic neoplasms”.
A further semantic relation “regulatory modification” states a modification of the gene/protein through methylation or phosphorylation. For example “e-cadherin and P16INK4A are commonly methylated in non-small cell lung cancer”.
The semantic relation “any” is given when relation between a gene and a disease is reported without any further information regarding the gene's state. For example: “e-cadherin has a role in preventing peritoneal dissemination in gastric cancer”.
As a further semantic relation, the relation “unrelated” indicates that a sentence provides evidence of independence between a gene and a certain disease. For example “variations in TP53NBAX alleles are unrelated to the development of pemphigus foleaceus”. The method and system according to the present invention has in comparison to conventional methods a high recall, precision and f-score value.
The recall, precision and f-score of the method and system according to the present invention are evaluated on a manually annotated data set of gene RIFs. The recall and precision depend on the numbers of true positives TP, false negatives FN and false positives FP as follows:

$$\text{recall} = \frac{TP}{TP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}$$
A true positive TP is a label sequence for a certain entity which exactly matches the label sequence for this entity from the standard. For example, in the sentence “BRCA2 is mutated in stage II breast cancer” a human annotator labels “stage II breast cancer” as a disease related via genetic variation. Under the assumption that the method and system according to the present invention only recognizes “breast cancer” as a disease entity and categorizes the relation to the gene “BRCA2” as a “genetic variation”, the system is assigned one false negative (FN) for not recognizing the whole sequence as well as one false positive (FP). In general, since this is a strict matching criterion, in many situations a more lenient criterion of correctness can be used.
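The exact matching criterion and the BRCA2 example can be sketched as follows. Representing an entity as a tuple of start index, end index (exclusive) and relation label is an assumption of this sketch, as is the use of the balanced F-score (harmonic mean of precision and recall), since the text does not state which F-score variant is used.

```python
def evaluate(gold, predicted):
    # Entities are (start, end, label) tuples; only exact matches count.
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}

tokens = "BRCA2 is mutated in stage II breast cancer".split()
gold = [(4, 8, "genetic_variation")]       # "stage II breast cancer"
predicted = [(6, 8, "genetic_variation")]  # only "breast cancer"
result = evaluate(gold, predicted)
print(result)
```

Although the predicted relation type is correct, the partial span yields no true positive under exact matching and is counted once as a false negative and once as a false positive.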
Table 1 shows text corpus statistics for an annotated data set of 5.469 gene RIFs.
Table 2 shows the results of a relation extraction RE as performed using the method and system according to the present invention.
Table 2 lists accuracy measures for each of the predefined relation types. For the any, altered expression and genetic variation relations, the method and system according to the present invention exceeds an F-measure of 74. Averaged over all relation types, the method and system according to the present invention achieves an overall accuracy of 73.84 F-measure for the given data set.
Table 3 shows a comparison of different methods of semantic relation extraction. The first two models are based on a conventional two-step approach according to the state of the art consisting of an NER-step and a successive RE-step. In a first baseline model (dictionary plus rule-based) the NER-step is done via a dictionary longest matching approach, while in the CRF plus rule-based model the NER-step is tackled via a disease NER CRF.
As can be seen from table 3, the method and system according to the present invention clearly outperforms the two conventional baseline approaches. The difference between the two-step approach according to the prior art methods, with a disease CRF tagger plus additional successive rules for RE, and the method according to the present invention is 4.16 F-measure. This result indicates that the unified CRF performed by the method according to the present invention is able to learn additional patterns from the empirical distribution which are important for inferring the type of relation holding between gene and disease pairs.
The method and system according to the present invention allows in a possible embodiment the identification of semantic gene disease relations based on a probabilistic extraction model. As can be seen from table 3, the overall performance of the method and system according to the present invention is better than conventional methods employing a two-step approach.
Although the method and system according to the present invention are discussed mostly with respect to biomedical data, it is emphasized that the method and system according to the present invention can be used for semantic relation extraction for any kind of unstructured text.
Further, the method and system according to the present invention can be used for semantic relation extraction for any unstructured text written in any language and any alphabet. The method and system according to the present invention allow entities and their relations to be detected at the same time. The method and system according to the present invention have a higher performance, i.e. sensitivity and F-score, than conventional methods. The method and system according to the present invention not only allow for a detection of a relation but also for a characterization of its nature as far as mentioned in the unstructured text.
In a possible embodiment the method according to the present invention is performed by a computer program on a computer. In a possible embodiment this computer program comprises instructions to perform the method and is stored on a data carrier.
Number | Date | Country | Kind |
---|---|---|---|
EP07013828 | Jul 2007 | EP | regional |