Claims
- 1. A method of constructing a network of conceptual entities and their relationships, in which data processing means is used to scan at least one database of text documents referring to concepts so as to identify concepts by scanning for concept abbreviations, to note co-occurrences of concepts in the same document, to record information representing the co-occurrences and the number of documents in which they appear, and to provide a network representing relationships between concepts, wherein concepts are also identified in documents by means of scanning for concept names, and wherein at least some abbreviations are only accepted as representing a reference to a concept in a document if the scan reveals that at least one word from one name of the concept also appears in a suitable part of the document.
- 2. A method as claimed in claim 1, wherein a filtering procedure is carried out when mapping occurrences of abbreviations into valid occurrences of concepts, in which if an abbreviation is longer than L1 characters and is found in fewer than K documents, then all occurrences of the abbreviation are treated as valid.
- 3. A method as claimed in claim 2, wherein L1 is 4 and K is 10.
- 4. A method as claimed in claim 2, wherein if the abbreviation is more than L2 characters long, an occurrence of the abbreviation will be treated as valid if there are at least W words from at least one of the concept names.
- 5. A method as claimed in claim 4, wherein if the abbreviation is no more than L2 characters long, then there must be more than W words from at least one of the concept names.
- 6. A method as claimed in claim 5, wherein L2 is 2 and W is 1.
- 7. A method as claimed in claim 6, wherein L2 is less than or equal to L1.
- 8. A method of compiling a hierarchical dictionary in the form of a database of concepts identified by a unique identifier, wherein each concept has a list of associated names of which one of the names is selected as a primary name and the remaining names are considered as synonyms, comprising the steps of:
inputting one or more lists of names together with a list of unique identifiers or a rule to determine unique identifiers from the list of names together with a rule to determine the list of primary names to correspond with the unique identifiers; and identifying concept name families.
- 9. A text data processing method for indexing logical concepts in text, comprising the steps of:
creating a hierarchical dictionary for names of concepts in a domain by the steps of compiling the dictionary in the form of a database of concepts identified by a unique identifier, wherein each concept has a list of associated names of which one of the names is selected as a primary name and the remaining names are considered as synonyms; inputting one or more lists of names together with a list of unique identifiers or a rule to determine unique identifiers from the list of names together with a rule to determine the list of primary names to correspond with the unique identifiers, and identifying concept name families; the method further comprising the steps of:
inputting the dictionary into a data processor; inputting documents into the data processor; detecting names of concepts in said documents; organizing occurrences of names into occurrences of concepts, by means of filtering occurrences of names before accepting these as valid occurrences of the associated concepts, wherein the length of the name and the frequency of occurrence within all, or a subset of, documents is taken into consideration; and mapping occurrences of names into occurrences of corresponding concepts to which these names are associated, including resolution of ambiguities of polysemic names.
- 10. A method as claimed in claim 9, wherein the dictionary is pre-generated.
- 11. A method as claimed in claim 9, wherein the units of text are scientific abstracts.
- 12. A method as claimed in claim 9, wherein the units of text are logical subunits of text-documents.
- 13. A method as claimed in claim 12, wherein the units of text are paragraphs or sentences.
- 14. A method as claimed in claim 9, wherein the method includes the filtering of names from a list of names which are expected to be ambiguous, comprising the further step of generating a list of names shared by several concepts.
- 15. A method as claimed in claim 14, wherein the list of ambiguous names is pre-generated and input by the user.
- 16. A method as claimed in claim 9, wherein the analysis of co-occurrences includes the annotation of co-occurrences by co-indexing ancillary concepts, comprising the steps of indexing ancillary concepts, and linking each pair of co-occurring target concepts with occurrences of ancillary concepts.
- 17. A method of validating relationships indicated by co-occurrence where co-occurrences of concepts are checked against an external list of relationships between the concepts, comprising the steps of:
assessing the precision of the co-occurrences as the number of pairs of concepts that are listed in the said external list of relationships versus the number of pairs that are listed as co-occurring; and assessing the recall of the co-occurrences as the number of pairs of concepts that are listed as co-occurring versus the number of pairs that are listed in the said external list of relationships.
- 18. A method as claimed in claim 9, including the steps of utilizing relationships between concepts extracted from text documents in data analysis of external data associated with the concepts, comprising the further steps of:
scoring networks of concepts defined by the relationships extracted from text documents with the external data; sorting the scored networks; and outputting the optimal scored networks with the scores.
- 19. A method as claimed in claim 9, for utilizing relationships between concepts extracted from text documents in data analysis of external data associated with the concepts, comprising the steps of:
scoring networks of concepts defined by the relationships extracted from text documents with the external data; sorting the scored networks; and computing the most relevant ancillary concepts for each subset of concepts defined by each of the networks, wherein the most relevant concepts are those that co-occurred most frequently with the target concepts in the said subset of concepts.
- 20. A method as claimed in claim 9, comprising the steps of:
scoring networks and subsets of target concepts defined by the relationships with ancillary concepts as extracted from text documents; and sorting the scored networks and subsets of target concepts.
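By way of illustration only, the abbreviation filtering of claims 2 to 6, the co-occurrence counting of claim 1 and the precision and recall of claim 17 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the tokenisation, and the use of the whole document as the "suitable part" are all hypothetical, and the sketch reads "versus" in claim 17 as a ratio whose numerator counts only pairs present in both lists, which is the conventional definition rather than something the claim states.

```python
import re
from collections import Counter
from itertools import combinations

L1, K = 4, 10  # values recited in claim 3
L2, W = 2, 1   # values recited in claim 6


def _words(text):
    """Lower-cased alphanumeric tokens (an assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_valid_occurrence(abbrev, names, document, doc_frequency,
                        l1=L1, k=K, l2=L2, w=W):
    """Decide whether an abbreviation occurrence counts as a concept occurrence.

    doc_frequency is the number of documents containing the abbreviation.
    Assumption: the whole document is the "suitable part" that is scanned.
    """
    # Claim 2: long and rare abbreviations are accepted outright.
    if len(abbrev) > l1 and doc_frequency < k:
        return True
    doc_words = _words(document)
    # Largest number of words from any single concept name found in the text.
    best = max((len(_words(n) & doc_words) for n in names), default=0)
    if len(abbrev) > l2:
        return best >= w   # claim 4: at least W name words
    return best > w        # claim 5: more than W name words


def cooccurrence_counts(doc_concepts):
    """Count, per concept pair, the documents in which both concepts occur."""
    counts = Counter()
    for concepts in doc_concepts:
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts


def precision_recall(cooccurring_pairs, known_pairs):
    """Precision and recall of co-occurring pairs against an external list.

    Pairs are treated as unordered (an assumption).
    """
    found = {frozenset(p) for p in cooccurring_pairs}
    known = {frozenset(p) for p in known_pairs}
    hits = len(found & known)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(known) if known else 0.0
    return precision, recall
```

For example, under these assumptions a three-letter abbreviation is accepted when a single word of one of its concept names appears in the document, whereas a two-letter abbreviation needs at least two such words.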
RELATED APPLICATION
[0001] This non-provisional application is related to and claims priority from provisional application No. 60/342,682, entitled “A Method and System for Analysing Occurrences of Logical Concepts in Text Documents”, filed on Dec. 21, 2001, which is hereby fully incorporated by reference.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60342682 | Dec 2001 | US |