This application is based on and hereby claims priority to European Patent Application No. 08010815 filed on Jun. 13, 2008, the contents of which are hereby incorporated by reference.
Described below are a method and an apparatus for processing semantic data resources of a domain and in particular data resources such as ontology, terminology and classifications in the medical domain.
Owing to advanced technologies in clinical care and research, especially the rapid progress in imaging technologies, more and more medical imaging data and patient text data are generated by hospitals, pharmaceutical companies and medical research institutes. Because of the plurality of available data, which is provided by a number of different data sources, it is difficult to identify potential queries reflecting different perspectives that can be used by clinicians and radiologists to find patient-specific sets of relevant images.
Described below is a method for processing at least one semantic data resource of a domain, including calculating relevance scores for terms which occur in domain corpora and weighting the semantic data resources depending on the calculated relevance scores of the terms.
In an embodiment the semantic data resource includes domain-specific terms and relations.
In an embodiment the semantic data resources include a domain ontology, a domain terminology and a domain classification.
In an embodiment the domain ontology includes a domain-specific-hierarchy of terms assigned to nodes which are connected by edges.
In an embodiment the domain terminology includes a lexicon having domain-specific terms, relations and synonyms.
In an embodiment the domain classification includes codes classifying domain-specific terms.
In an embodiment the relevance scores are chi-square-scores which are calculated depending on a frequency of a term in the domain corpora and an expected frequency of the term.
In an embodiment the expected frequency of the term is derived from a reference corpus.
In an embodiment the domain corpora are formed by text corpora.
In an embodiment the domain ontology is encoded in a web ontology language (OWL).
In an embodiment the domain corpora are provided in an XML (Extensible Markup Language) format.
In an embodiment the reference corpus is formed by the British National Corpus.
In an embodiment for the domain corpora a list of relevant terms is generated.
In an embodiment the list of terms is filtered according to a predetermined filter criterion.
In an embodiment each term includes one or more words.
In an embodiment a relevance score for a multi-word term is calculated on the basis of the chi-square-score for each noun or adjective in the multi-word term which are summed and normalized over the length of the multi-word term.
In an embodiment each term is marked by a part of speech information.
Described below is an apparatus for processing a semantic data resource of a domain that includes a memory storing the semantic data resource and a calculation unit calculating relevance scores for terms which occur in domain corpora and weighting the semantic data resource depending on the calculated relevance scores of the terms.
In an embodiment the apparatus includes a network interface for receiving the domain corpora from a network.
In an embodiment the network interface is provided for receiving domain corpora from the world wide web.
In an embodiment the apparatus includes a user interface for outputting the weighted semantic data resources.
In an embodiment the calculation unit includes a microprocessor for executing a computer program for calculating relevance scores for terms and weighting the semantic data resources depending on the calculated relevance scores.
Also described below is a computer-readable storage medium encoded with a computer program having commands for executing a method for processing a semantic data resource of a domain including calculating relevance scores for terms which occur in domain corpora and weighting the semantic data resource depending on the calculated relevance scores of the terms.
These and other aspects and advantages will become more apparent and more readily appreciated from the following description of the exemplary embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
As can be seen from
An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. Common components of an ontology include individuals such as instances or objects, classes, attributes, relations, function terms, restrictions, rules, actions and events. Individuals or instances are the basic ground-level components of the domain ontology. Individuals in the domain ontology may include concrete objects of the domain as well as abstract individuals such as numbers and words. Classes, also called types, sorts, categories and kinds, are abstract groups, sets or collections of objects. Classes may contain individuals, other classes or a combination of both. A class of a domain ontology can include other classes, which are also called subclasses. Objects in the ontology can be described by assigning attributes to them. Each attribute within the domain ontology has at least a name and a value and can be used to store information that is specific to the object to which the attribute is attached. With the use of attributes it is possible to describe relationships between objects in the ontology. In the ontology a hierarchical taxonomy can be provided which indicates how objects relate to one another.
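The components named above can be pictured with a small data structure. The following Python sketch is purely illustrative: the class names `Individual` and `OntologyClass`, their fields and the helper `all_individuals` are hypothetical and are not part of the described method or of any ontology language.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """A ground-level object of the domain, described by name/value attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class OntologyClass:
    """A class that may contain individuals, subclasses, or both."""
    name: str
    individuals: list = field(default_factory=list)
    subclasses: list = field(default_factory=list)

    def all_individuals(self):
        """Collect the individuals of this class and of all its subclasses."""
        found = list(self.individuals)
        for sub in self.subclasses:
            found.extend(sub.all_individuals())
        return found


# Hypothetical two-level taxonomy: a class containing one subclass.
node_class = OntologyClass(
    "Lymph node",
    individuals=[Individual("example node", {"side": "left"})],
)
organ_class = OntologyClass("Organ", subclasses=[node_class])
```

Walking the subclass links in this way corresponds to following the hierarchical taxonomy from a general class down to its individuals.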
The ontology forms a semantic data resource in a specific domain such as the medical domain. In a possible embodiment the main ontology is generated by merging other domain ontologies into a more general representation. Different ontologies in the same domain can arise due to different perceptions of the domain based on the background, education or representation languages. The main ontology can be encoded by a formal language such as OWL, RDF or RDFS. Other ontology languages can be used as well.
In a possible embodiment the domain-specific ontology is from the medical domain. For example the Foundational Model of Anatomy (FMA) ontology can be used as a knowledge-base data resource of the medical domain. The FMA-ontology specifies an anatomy taxonomy and corresponding relationships. The FMA-ontology covers a plurality of anatomical concepts and a large number of relation instances from many relation types. The complex terminological structure of the FMA-ontology provides a linguistically attractive semantic data resource. For example a common structure of the FMA-terminology is the following:
Moreover, the terms in the FMA-ontology can form cascaded structures in which one term occurs within another term, such as in:
The FMA-ontology is a machine readable anatomy data resource in the medical domain.
Further, the data resource processed by the method can be formed by a domain terminology. This domain terminology can include a lexicon including a plurality of domain-specific terms, relations and synonyms. An example of a domain terminology in the medical domain is the radiology lexicon (RadLex), which is a data resource for obtaining image-relevant information. The radiology lexicon is an open source controlled vocabulary for the purpose of uniform indexing and retrieval of radiology information. The radiology lexicon includes several thousand anatomic and pathological terms, including terms about imaging techniques, difficulties and diagnostic image qualities. The radiology lexicon is a unified lexicon to capture cross-vocabulary radiology information, and besides domain-specific knowledge it also contains lexical relationships such as synonyms.
A further type of semantic data resource is a domain classification. A domain classification includes for example codes classifying domain-specific terms. In an embodiment a domain classification as a data resource is formed by the International Classification of Diseases (ICD). The ICD is a collection of codes classifying diseases, signs, symptoms, abnormal findings etc. provided by a database of the World Health Organization. The ICD classifies diseases under codes which can include several digits. For example the ICD classifies lymph nodes of head, face and neck under neoplasms (140-239), meaning that any disease that is coded with a number between 140 and 239 is a neoplasm. The entry for lymph nodes of head, face and neck has the code 196.0 and forms a subcategory of secondary and unspecified malignant neoplasm of lymph nodes, which has the code 196.
In the embodiment shown in
The apparatus 1 shown in the embodiment of
In a possible embodiment the calculation unit 5 of the apparatus 1 includes a microprocessor for executing a computer program. This computer program can be stored in a program memory. In a possible embodiment the computer program is read from a data carrier storing the computer program.
The calculation unit 5 is further connected to a user interface 6 of the apparatus 1, such as a display, for outputting the weighted semantic data resources. In a possible embodiment the user interface 6 is formed by a display for displaying tables indicating lists of terms which are weighted according to the calculated relevance scores for the terms.
As can be seen from
In
The domain corpora with the text segments in XML-format are written back to the memory 2 of the apparatus 1 and part-of-speech (POS) tagging is performed at S2. In a possible embodiment text sections of each domain corpus stored in the memory 2 are run through a TNT part-of-speech parser to extract all nouns in the domain corpus. In a possible embodiment each term of the domain corpus is marked with part-of-speech (POS) information which indicates for example whether the respective term is an adjective, a noun or a plural noun. The tagged domain corpus is written back to the memory 2 as shown in
At S3 a term recognition is performed. This is done on the basis of a domain term data base which is provided in a possible embodiment also in the memory 2 of the apparatus 1. The domain term database stores at least one semantic data resource of the domain such as the medical domain. These semantic data resources include domain ontologies, domain terminologies and domain classifications wherein the domain ontologies can be encoded by the web ontology languages OWL or RDFS. At S3 it is identified which terms from which data resource occur in the corresponding context corpus, i.e. in the different domain corpora such as the anatomy corpus, the radiology corpus and the disease corpus.
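The term recognition at S3 can be sketched as a lookup of the terms of each semantic data resource in the text of a domain corpus. The following Python fragment is a minimal illustration under stated assumptions: the function name `recognize_terms`, the dictionary layout and the case-insensitive whole-word matching are hypothetical choices, and the description does not prescribe how the matching is carried out.

```python
import re


def recognize_terms(corpus_text, resources):
    """Count occurrences of each resource's terms in a corpus.

    resources maps a resource name (e.g. an ontology or a lexicon) to an
    iterable of terms. Returns {resource name: {term: occurrence count}},
    listing only the terms that occur at least once.
    """
    text = corpus_text.lower()
    hits = {}
    for name, terms in resources.items():
        found = {}
        for term in terms:
            # Assumption: whole-word, case-insensitive matching.
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            count = len(re.findall(pattern, text))
            if count:
                found[term] = count
        hits[name] = found
    return hits


# Illustrative input only; the sample text and term lists are made up.
sample_text = ("The lymph node near the femoral head was imaged. "
               "Lymph node biopsy followed.")
sample_resources = {
    "FMA": ["lymph node", "femoral head", "ear"],
    "RadLex": ["biopsy"],
}
result = recognize_terms(sample_text, sample_resources)
```

Whole-word matching keeps the term "ear" from being counted inside the word "near"; a production system would more likely match on the tokens produced by the part-of-speech tagging.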
Each identified term is written back into the memory 2 along with the part-of-speech tags, and relevance scores for those terms which occur in the domain corpora are calculated by the calculation unit 5 at S4. Then the semantic data resources are weighted by the calculation unit 5 depending on the calculated relevance scores of the identified terms. In a possible embodiment the relevance scores are chi-square scores which are calculated depending on a frequency of a term in a domain corpus and depending on an expected frequency of this term. The expected frequency of the term is derived in a possible embodiment from a reference corpus. This reference corpus can be formed for example by the British National Corpus (BNC), which is a collection of samples of written and spoken language documents from a wide range of sources designed to represent a wide cross-section of British English. This reference corpus is stored in a possible embodiment also in the memory 2 of the apparatus 1. In an alternative embodiment the reference corpus is downloaded via the network interface 3 from the world wide web 4.
In a possible embodiment chi-square scores are calculated according to the following equation:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where
O_i = an observed frequency;
E_i = an expected frequency; and
n = the number of possible outcomes of each event.
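As a minimal sketch of the chi-square scoring, the following Python fragment compares the observed count of a term in a domain corpus with an expected count derived from a reference corpus. The two-outcome setup (term occurs or does not occur), the scaling of the reference frequency to the domain corpus size, the add-one smoothing and the function names are all assumptions made for illustration; the description does not specify them.

```python
def chi_square(observed, expected):
    """Pearson chi-square: sum over outcomes of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def term_relevance(term_count, domain_size, ref_count, ref_size):
    """Chi-square score of one term against a reference corpus.

    term_count / domain_size: occurrences of the term and total token
    count in the domain corpus; ref_count / ref_size: the same for the
    reference corpus.
    """
    # Assumption: add-one smoothing so that the expected count is never zero.
    p_ref = (ref_count + 1) / (ref_size + 1)
    # Assumption: expected count = reference relative frequency scaled
    # to the size of the domain corpus.
    expected_hit = p_ref * domain_size
    expected_miss = domain_size - expected_hit
    observed = [term_count, domain_size - term_count]
    return chi_square(observed, [expected_hit, expected_miss])


# Hypothetical counts: a term that is far more frequent in the domain
# corpus than in the reference corpus receives a high score.
domain_specific = term_relevance(50, 1000, 9, 99999)
everyday_word = term_relevance(1, 1000, 99, 99999)
```

A term whose observed count matches the expectation from the reference corpus scores close to zero, whereas a term that is over-represented in the domain corpus scores high, which is what makes the score usable as a relevance weight.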
Each term weighted at S4 can include one or more words. The relevance score for a multi-word term is calculated on the basis of the chi-square score for each noun or adjective in the multi-word term which are summed and normalized over the length of the multi-word term. Weighted terms are written back to the memory 2. Further, at S5 the weighted semantic data resources such as weighted domain ontologies are output by the apparatus 1 via the user interface 6.
In a possible embodiment an FMA-ontology is used to identify the human anatomy relevant terms and relationships from different text corpora. First, the concepts and relationships are extracted, yielding in a specific example a list of more than one hundred thousand (e.g. 124769) entries. This list can include very generic terms such as “anatomical structure” as well as very specific terms such as “Anastomotic branch of right inferior cerebellar artery with right superior cerebellar artery”. These very generic terms and very specific terms are filtered out according to a filter criterion. For example only terms consisting of up to three words are kept in the list. In the specific example the list of terms resulting after such filtering consists of a lower number of terms, such as 19337 terms, including terms such as “abdominal lymph node”, “femoral head”, “jugular lymphatic trunk” etc. The statistically most relevant terms of this ontology are identified on the basis of the chi-square scores computed for nouns of each text corpus. Single-word terms that are in the FMA-ontology and occur in the text corpus of the domain correspond directly to the noun that the term is built up of (e.g. the noun “ear” corresponding to the FMA-term “ear”). In this case the statistical relevance of the term is the chi-square score of the corresponding noun.
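The word-count filter criterion can be sketched in a few lines. In the following Python fragment the function name `filter_terms` and the whitespace-based word count are assumptions; a separate criterion for removing very generic terms is not specified in the description and is therefore not shown.

```python
def filter_terms(terms, max_words=3):
    """Keep only the terms that consist of at most max_words words."""
    # Assumption: words are separated by whitespace.
    return [term for term in terms if len(term.split()) <= max_words]


# Illustrative term list; the long entry has more than three words.
candidates = [
    "femoral head",
    "jugular lymphatic trunk",
    "anastomotic branch of right inferior cerebellar artery",
    "ear",
]
kept = filter_terms(candidates)
```

Only the entry with more than three words is dropped here; the threshold is a parameter because other terminologies may call for a different cut-off.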
In the case of multi-word terms occurring in the corpus the statistical relevance is computed on the basis of the chi-square scores of each constituent noun and/or adjective in the term, which are summed and normalized over the length of the term. For example the relevance value or relevance score for “lymph node” is the sum of the chi-square scores for “lymph” and “node” divided by two. In order to take frequency into account the summed relevance score is multiplied by the frequency of the term. This ensures that only frequently occurring terms are judged to be relevant. The FMA-ontology is very complex from a terminological perspective and therefore rich in lexical information. In order to capture this lexical information each term is additionally marked with part-of-speech information. The same approach can be adapted for other terminologies.
A selection from the resulting list of the most relevant FMA-terms in different medical domain corpora is shown in the tables of
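The multi-word scoring described in this paragraph can be sketched as follows. The function name `multiword_relevance`, the tag names "NOUN" and "ADJ", the treatment of words without a score as zero and the sample scores are assumptions made for illustration only.

```python
def multiword_relevance(tagged_term, chi_scores, term_frequency):
    """Relevance of a multi-word term.

    tagged_term: list of (word, part-of-speech tag) pairs.
    chi_scores: chi-square score per single word.
    term_frequency: number of occurrences of the whole term in the corpus.
    """
    if not tagged_term:
        raise ValueError("term must contain at least one word")
    # Sum the chi-square scores of the nouns and adjectives only.
    # Assumption: words without a known score contribute 0.
    summed = sum(
        chi_scores.get(word, 0.0)
        for word, tag in tagged_term
        if tag in {"NOUN", "ADJ"}
    )
    # Normalize over the length of the term (its number of words).
    normalized = summed / len(tagged_term)
    # Multiply by the term frequency so that frequent terms rank higher.
    return normalized * term_frequency


# Hypothetical scores: (8.0 + 4.0) / 2 words, times 3 occurrences.
scores = {"lymph": 8.0, "node": 4.0}
lymph_node = multiword_relevance(
    [("lymph", "NOUN"), ("node", "NOUN")], scores, term_frequency=3
)
```

For a single-word term with a frequency of one, the function reduces to the chi-square score of the corresponding noun, matching the single-word case.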
As can be seen from
In the same manner terms of the radiology lexicon can be used to identify the most relevant radiology terms in different corpora of the medical domain. In a specific example a list of terms that consists of 13156 entries is extracted from the RadLex controlled vocabulary by parsing the version downloaded from the website. After duplicates are removed by filtering, the list can be reduced to, e.g., 12055 entries. In contrast to the FMA-ontology, very specific terms, e.g. terms including more than three words, can also be kept in the resulting term list because there are only few terms including more than three words. The most relevant RadLex terms in the given example are shown in
In a similar way an ICD-subset terminology that corresponds to RadLex terms can be analysed in the corpora. In a specific example a subset term list can consist of 3193 entries where for each entry its ICD-9 CM code and the corresponding RadLex ID are encoded. After searching for these terms in three text corpora of the medical domain the results as shown in the tables of
Comparing the tables in
In order to obtain a joint view reflecting different semantic knowledge data resources and terminologies covering different perspectives on the basis of joint data sets, in a possible embodiment the terminologies of the FMA-ontology, the RadLex lexicon and the ICD-9 CM classification of disease codes are used as the data basis. A common view is presented in the tables of
In the given example an ontology of human anatomy, a controlled vocabulary for radiology and the international classification of disease codes are used as knowledge resources in deriving significant concepts and relations. These concepts and relations extracted by the method described herein can be used to generate potential query patterns. These query patterns form the basis for actual queries that clinicians pose to a semantic search engine to find patient-specific sets of relevant images and textual data.
The system also includes permanent or removable storage, such as magnetic and optical discs, RAM, ROM, etc. on which the process and data structures of the present invention can be stored and distributed. The processes can also be distributed via, for example, downloading over a network such as the Internet. The system can output the results to a display device, printer, readily accessible memory or another computer on a network.
A description has been provided with particular reference to exemplary embodiments thereof and examples, but it will be understood that variations and modifications can be effected within the spirit and scope of the claims, which may include the phrase “at least one of A, B and C” as an alternative expression that means one or more of A, B and C may be used, contrary to the holding in Superguide v. DIRECTV, 358 F.3d 870, 69 USPQ2d 1865 (Fed. Cir. 2004).
Number | Date | Country | Kind |
---|---|---|---|
08010815 | Jun 2008 | EP | regional |