Claims
- 1. A method in a computer system for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, comprising:
receiving a designation of a plurality of entity tags; and for each sentence,
parsing the sentence to generate a parse structure having a plurality of syntactic elements; determining from the parse structure a set of syntactic elements that correspond to the designated entity tags; and storing in an enhanced data representation data structure a representation of each association between a syntactic element of the determined set of syntactic elements and the type of the entity tag that corresponds to the syntactic element, the syntactic element representing the value of the corresponding entity tag, such that the sentence is represented in the data structure by at least one entity tag.
- 2. The method of claim 1 wherein the enhanced data representation stores a plurality of entity tags to represent a sentence.
- 3. The method of claim 1 wherein each stored representation includes an identifier of an entity tag type from the designated tags and an associated value that is the corresponding syntactic element from the set of syntactic elements.
- 4. The method of claim 1, each sentence having a plurality of clauses, wherein each clause of a parsed sentence is represented in the enhanced data representation data structure by at least one entity tag.
- 5. The method of claim 1 wherein the stored entity tag type indicates that the associated syntactic element specifies a name.
- 6. The method of claim 5 wherein the name is at least one of a person name, a company name, an organization name, a genetic name, a pharmaceutical name, and a document name.
- 7. The method of claim 1 wherein the stored entity tag type indicates that the syntactic element specifies a location.
- 8. The method of claim 7 wherein the location name specifies at least one of a physical location, a geographic location, a country, a city, a region, a state, a locality, a phone number, and an address.
- 9. The method of claim 1 wherein the stored entity tag type is a date specification.
- 10. The method of claim 1 wherein the determining from the parse structure the set of syntactic elements that correspond to the designated entity tags is performed by evaluating a tag value specification associated with each designated entity tag type.
- 11. The method of claim 1 wherein the designated entity tags are stored in a look-up table and wherein the determining from the parse structure the set of corresponding syntactic elements is performed by looking up the syntactic elements in the stored look-up table.
- 12. The method of claim 1 wherein the document is part of a corpus of heterogeneous documents.
- 13. The method of claim 1 wherein the enhanced data representation data structure is used to index the corpus of documents.
- 14. The method of claim 1 wherein the enhanced data representation data structure is used to execute a query against objects in a corpus of documents.
- 15. The method of claim 14 wherein the enhanced data representation corresponds to the query and results are returned that satisfy the query when an object in the corpus contains the same entity tag values and corresponding entity tag types as those stored in the enhanced data representation.
- 16. The method of claim 14 wherein the enhanced data representation corresponds to the query and results are returned that satisfy the query when an object in the corpus contains a similar entity tag type to an entity tag type stored in the enhanced data representation.
- 17. The method of claim 16 wherein the objects in the corpus are sentences and indications of sentences that satisfy the query are returned.
- 18. The method of claim 17 wherein the returned indications of sentences indicate paragraphs.
- 19. The method of claim 17 wherein the returned indications of sentences indicate documents.
- 20. The method of claim 16, further comprising:
receiving an indication of at least one sentence of the document; and returning indications of documents in the corpus that contain similar terms to terms of the indicated at least one sentence.
- 21. The method of claim 20 wherein the determination of indications of documents that contain similar terms is performed by determining the documents in the corpus that contain similar entity tag values and corresponding entity tag types to those found in the enhanced data representation that corresponds to the at least one indicated sentence.
- 22. The method of claim 20 wherein the determination of indications of documents that contain similar terms is performed using latent semantic regression techniques.
- 23. The method of claim 20 wherein the at least one indicated sentence is at least one of an indicated paragraph and an indicated document.
- 24. The method of claim 23 wherein the determination of indications of documents that contain similar terms is performed using latent semantic regression techniques.
- 25. A computer-readable memory medium containing instructions for controlling a computer processor to transform a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, by:
receiving a designation of a plurality of entity tags; and for each sentence,
parsing the sentence to generate a parse structure having a plurality of syntactic elements; determining from the parse structure a set of syntactic elements that correspond to the designated entity tags; and storing in an enhanced data representation data structure a representation of each association between a syntactic element of the determined set of syntactic elements and the type of the entity tag that corresponds to the syntactic element, the syntactic element representing the value of the corresponding entity tag, such that the sentence is represented in the data structure by at least one entity tag.
- 26. The computer-readable memory medium of claim 25 wherein the enhanced data representation stores a plurality of entity tags to represent a sentence.
- 27. The computer-readable memory medium of claim 25 wherein each stored representation includes an identifier of an entity tag type from the designated tags and an associated value that is the corresponding syntactic element from the set of syntactic elements.
- 28. The computer-readable memory medium of claim 25, each sentence having a plurality of clauses, wherein each clause of a parsed sentence is represented in the enhanced data representation data structure by at least one entity tag.
- 29. The computer-readable memory medium of claim 25 wherein the stored entity tag type indicates that the associated syntactic element specifies a name.
- 30. The computer-readable memory medium of claim 29 wherein the name is at least one of a person name, a company name, an organization name, a genetic name, a pharmaceutical name, and a document name.
- 31. The computer-readable memory medium of claim 25 wherein the stored entity tag type indicates that the syntactic element specifies a location.
- 32. The computer-readable memory medium of claim 31 wherein the location name specifies at least one of a physical location, a geographic location, a country, a city, a region, a state, a locality, a phone number, and an address.
- 33. The computer-readable memory medium of claim 25 wherein the stored entity tag type is a date specification.
- 34. The computer-readable memory medium of claim 25 wherein the determining from the parse structure the set of syntactic elements that correspond to the designated entity tags is performed by evaluating a tag value specification associated with each designated entity tag type.
- 35. The computer-readable memory medium of claim 25 wherein the designated entity tags are stored in a look-up table and wherein the determining from the parse structure the set of corresponding syntactic elements is performed by looking up the syntactic elements in the stored look-up table.
- 36. The computer-readable memory medium of claim 25 wherein the document is part of a corpus of heterogeneous documents.
- 37. The computer-readable memory medium of claim 25 wherein the enhanced data representation data structure is used to index the corpus of documents.
- 38. The computer-readable memory medium of claim 25 wherein the enhanced data representation data structure is used to execute a query against objects in a corpus of documents.
- 39. The computer-readable memory medium of claim 38 wherein the enhanced data representation corresponds to the query and results are returned that satisfy the query when an object in the corpus contains the same entity tag values and corresponding entity tag types as those stored in the enhanced data representation.
- 40. The computer-readable memory medium of claim 38 wherein the enhanced data representation corresponds to the query and results are returned that satisfy the query when an object in the corpus contains a similar entity tag type to an entity tag type stored in the enhanced data representation.
- 41. The computer-readable memory medium of claim 40 wherein the objects in the corpus are sentences and indications of sentences that satisfy the query are returned.
- 42. The computer-readable memory medium of claim 41 wherein the returned indications of sentences indicate paragraphs.
- 43. The computer-readable memory medium of claim 41 wherein the returned indications of sentences indicate documents.
- 44. The computer-readable memory medium of claim 40, wherein the instructions further control the computer processor by:
receiving an indication of at least one sentence of the document; and returning indications of documents in the corpus that contain similar terms to terms of the indicated at least one sentence.
- 45. The computer-readable memory medium of claim 44 wherein the determination of indications of documents that contain similar terms is performed by determining the documents in the corpus that contain similar entity tag values and corresponding entity tag types to those found in the enhanced data representation that corresponds to the at least one indicated sentence.
- 46. The computer-readable memory medium of claim 44 wherein the determination of indications of documents that contain similar terms is performed using latent semantic regression techniques.
- 47. The computer-readable memory medium of claim 44 wherein the at least one indicated sentence is at least one of an indicated paragraph and an indicated document.
- 48. The computer-readable memory medium of claim 47 wherein the determination of indications of documents that contain similar terms is performed using latent semantic regression techniques.
- 49. A syntactic query engine for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, comprising:
parser that is structured to
receive a designation of a plurality of entity tags; and decompose each sentence to generate a parse structure for the sentence having a plurality of syntactic elements; determine from the structure of the parse structure a set of syntactic elements that correspond to the designated entity tags; and store, in an enhanced data representation data structure, a representation of each association between a syntactic element of the determined set of syntactic elements and the corresponding entity tag type, such that the sentence is represented in the data structure by at least one entity tag.
- 50. The query engine of claim 49 wherein the enhanced data representation stores a plurality of entity tags to represent a sentence.
- 51. The query engine of claim 49 wherein each stored representation includes an identifier of an entity tag type from the designated tags and an associated value that is the corresponding syntactic element from the set of syntactic elements.
- 52. The query engine of claim 49, each sentence having a plurality of clauses, wherein each clause of a parsed sentence is represented in the enhanced data representation data structure by at least one entity tag.
- 53. The query engine of claim 49 wherein the stored entity tag type indicates that the associated syntactic element specifies a name.
- 54. The query engine of claim 53 wherein the name is at least one of a person name, a company name, an organization name, a genetic name, a pharmaceutical name, and a document name.
- 55. The query engine of claim 49 wherein the stored entity tag type indicates that the syntactic element specifies a location.
- 56. The query engine of claim 55 wherein the location name specifies at least one of a physical location, a geographic location, a country, a city, a region, a state, a locality, a phone number, and an address.
- 57. The query engine of claim 49 wherein the stored entity tag type is a date specification.
- 58. The query engine of claim 49 wherein the determining from the parse structure the set of syntactic elements that correspond to the designated entity tags is performed by evaluating a tag value specification associated with each designated entity tag type.
- 59. The query engine of claim 49 wherein the designated entity tags are stored in a look-up table and wherein the determining from the parse structure the set of corresponding syntactic elements is performed by looking up the syntactic elements in the stored look-up table.
- 60. The query engine of claim 49 wherein the document is part of a corpus of heterogeneous documents.
- 61. The query engine of claim 49 wherein the enhanced data representation data structure is used to index the corpus of documents.
- 62. The query engine of claim 49 wherein the enhanced data representation data structure is used to execute a query against objects in a corpus of documents.
- 63. The query engine of claim 62 wherein the enhanced data representation corresponds to the query and results are returned that satisfy the query when an object in the corpus contains the same entity tag values and corresponding entity tag types as those stored in the enhanced data representation.
- 64. The query engine of claim 62 wherein the enhanced data representation corresponds to the query and results are returned that satisfy the query when an object in the corpus contains a similar entity tag type to an entity tag type stored in the enhanced data representation.
- 65. The query engine of claim 64 wherein the objects in the corpus are sentences and indications of sentences that satisfy the query are returned.
- 66. The query engine of claim 65 wherein the returned indications of sentences indicate paragraphs.
- 67. The query engine of claim 65 wherein the returned indications of sentences indicate documents.
- 68. The query engine of claim 64, further comprising:
search mechanism that is structured to,
receive an indication of at least one sentence of the document; and return indications of documents in the corpus that contain similar terms to terms of the indicated at least one sentence.
- 69. The query engine of claim 68 wherein the determination of indications of documents that contain similar terms is performed by determining the documents in the corpus that contain similar entity tag values and corresponding entity tag types to those found in the enhanced data representation that corresponds to the at least one indicated sentence.
- 70. The query engine of claim 68 wherein the determination of indications of documents that contain similar terms is performed using latent semantic regression techniques.
- 71. The query engine of claim 68 wherein the at least one indicated sentence is at least one of an indicated paragraph and an indicated document.
- 72. The query engine of claim 71 wherein the determination of indications of documents that contain similar terms is performed using latent semantic regression techniques.
- 73. A method in a computer system for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, each sentence having a plurality of terms, comprising:
receiving a designation of a plurality of entity tags and a designation of at least one grammatical role; and for each sentence,
parsing the sentence to generate a parse structure having a plurality of syntactic elements; determining a set of meaningful terms of the sentence from these syntactic elements; determining from the structure of the parse structure and the syntactic elements a grammatical role for each meaningful term; determining which meaningful terms correspond to the designated entity tags and which meaningful terms correspond to the designated grammatical role; and storing in an enhanced data representation data structure a representation of an association between the meaningful term that corresponds to the designated grammatical role and an association between a meaningful term and the type of a corresponding designated entity tag, the meaningful term associated with the entity tag type representing the value of the entity tag, such that the sentence is represented by at least one entity tag and one meaningful term having a grammatical role.
- 74. The method of claim 73 wherein the designated grammatical role is a governing verb and the enhanced data representation stores a meaningful term that corresponds to a governing verb of the sentence and at least one entity tag.
- 75. The method of claim 74, each sentence comprising at least one clause, wherein there is a governing verb associated with each clause of each sentence and the stored enhanced data representation, for each clause of each sentence, stores the meaningful term that corresponds to the associated governing verb.
- 76. The method of claim 73 wherein the enhanced data representation for each sentence stores a plurality of meaningful terms and their corresponding grammatical roles and at least one entity tag.
- 77. The method of claim 73 wherein the enhanced data representation for each sentence stores an association between each meaningful term and its corresponding grammatical role and an association between a meaningful term and a type of a corresponding designated entity tag.
- 78. The method of claim 76 wherein the meaningful term associated with the corresponding designated entity tag is the same as the meaningful term associated with the corresponding grammatical role.
- 79. The method of claim 73 wherein heuristics are used to determine the grammatical role for at least one of the meaningful terms.
- 80. The method of claim 73 wherein the determining of a grammatical role for each meaningful term includes determining whether the term is at least one of a subject, object, verb, part of a prepositional phrase, noun modifier, and verb modifier.
- 81. The method of claim 73 wherein the document is part of a corpus of heterogeneous documents.
- 82. The method of claim 73 wherein the enhanced data representation data structure is used to index a corpus of documents.
- 83. The method of claim 73 wherein the enhanced data representation data structure is used to execute a query against objects in a corpus of documents.
- 84. The method of claim 83 wherein results are returned that satisfy the query when an object in the corpus contains a term associated with similar grammatical role to the term and its associated role as stored in the enhanced data representation and contains a similar entity tag type to an entity tag type stored in the enhanced data representation.
- 85. The method of claim 84 wherein the objects in the corpus are sentences and indications of sentences that satisfy the query are returned.
- 86. A computer-readable memory medium containing instructions for controlling a computer processor to transform a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, each sentence having a plurality of terms, by:
receiving a designation of a plurality of entity tags and a designation of at least one grammatical role; and for each sentence,
parsing the sentence to generate a parse structure having a plurality of syntactic elements; determining a set of meaningful terms of the sentence from these syntactic elements; determining from the structure of the parse structure and the syntactic elements a grammatical role for each meaningful term; determining which meaningful terms correspond to the designated entity tags and which meaningful terms correspond to the designated grammatical role; and storing in an enhanced data representation data structure a representation of an association between the meaningful term that corresponds to the designated grammatical role and an association between a meaningful term and the type of a corresponding designated entity tag, the meaningful term associated with the entity tag type representing the value of the entity tag, such that the sentence is represented by at least one entity tag and one meaningful term having a grammatical role.
- 87. A syntactic query engine for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, each sentence having a plurality of terms, comprising:
parser that is structured to
receive a designation of a plurality of entity tags and a designation of at least one grammatical role; decompose each sentence to generate a parse structure for the sentence having a plurality of syntactic elements; determine a set of meaningful terms of the sentence from the syntactic elements; determine from the structure of the parse structure and the syntactic elements a grammatical role for each meaningful term; determine which meaningful terms correspond to the designated entity tags and which meaningful terms correspond to the designated grammatical role; and store, in an enhanced data representation data structure, a representation of an association between the meaningful term that corresponds to the designated grammatical role and an association between a meaningful term and the type of a corresponding designated entity tag, the meaningful term associated with the entity tag type representing the value of the entity tag, such that the sentence is represented by at least one entity tag and one meaningful term having a grammatical role.
- 88. The query engine of claim 87 wherein the designated grammatical role is a governing verb and the enhanced data representation stores a meaningful term that corresponds to a governing verb of the sentence and at least one entity tag.
- 89. A data processing system comprising a computer processor and a memory, the memory containing structured data that stores a normalized representation of sentence data, the structured data being manipulated by the computer processor under the control of program code and stored in the memory as:
an entity table having a set of entity tag pairs, each pair having a term that is a value of a corresponding entity tag and an indication of an entity tag type of the corresponding entity tag.
- 90. A computer-readable memory medium containing instructions for controlling a computer processor to store a normalized data structure representing a document of a data set, the document having a plurality of sentences, comprising:
for each sentence,
determining a set of terms of the sentence that correspond to a designated set of entity tags; and storing sets of relationships between each determined term and its corresponding entity tag type in the normalized data structure so as to represent the entire sentence as entity tags.
- 91. A computer system for storing a normalized data structure representing a document of a data set, the document having a plurality of sentences, each sentence having a plurality of terms, comprising:
enhanced parsing mechanism that determines a set of terms of the sentence that correspond to a designated set of entity tags; and storage mechanism structured to store sets of relationships between each determined term and its corresponding entity tag type in the normalized data structure so as to represent the entire sentence as entity tags.
- 92. The system of claim 91, the storage mechanism further structured to store terms that correspond to a document level attribute.
- 93. A method in a computer system for searching a corpus of documents, each document having a plurality of sentences, the corpus having an index of the plurality of sentences for the documents, comprising:
receiving an indication of a plurality of consecutive sentences; parsing the indicated plurality of consecutive sentences to generate a plurality of search terms for searching the document corpus; determining a plurality of result sentences in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the sentences in the corpus of documents; and returning indications of the determined result sentences.
- 94. The method of claim 93 wherein the indications of the determined result sentences indicate documents that contain the result sentences.
- 95. The method of claim 93 wherein the plurality of consecutive sentences comprise a paragraph.
- 96. The method of claim 93 wherein the plurality of consecutive sentences comprise an entire document.
- 97. The method of claim 93 wherein the index of the plurality of sentences for the documents is a term-sentence index.
- 98. A computer-readable memory medium containing instructions for controlling a computer processor to search a corpus of documents, each document having a plurality of sentences, the corpus having an index of the plurality of sentences for the documents, by:
receiving an indication of a plurality of consecutive sentences; parsing the indicated plurality of consecutive sentences to generate a plurality of search terms for searching the document corpus; determining a plurality of result sentences in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the sentences in the corpus of documents; and returning indications of the determined result sentences.
- 99. The computer-readable memory medium of claim 98 wherein the indications of the determined result sentences indicate documents that contain the result sentences.
- 100. The computer-readable memory medium of claim 98 wherein the plurality of consecutive sentences comprise a paragraph.
- 101. The computer-readable memory medium of claim 98 wherein the plurality of consecutive sentences comprise an entire document.
- 102. The computer-readable memory medium of claim 98 wherein the index of the plurality of sentences for the documents is a term-sentence index.
- 103. A query engine for searching a corpus of documents, each having a plurality of sentences, the corpus having an index of the plurality of sentences for the documents, comprising:
parser that is structured to
receive an indication of a plurality of consecutive sentences; and decompose the indicated plurality of consecutive sentences to generate a plurality of search terms for searching the document corpus; and postprocessor that is structured to
determine a plurality of result sentences in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the sentences in the corpus of documents; and return indications of the determined result sentences.
- 104. The engine of claim 103 wherein the indications of the determined result sentences indicate documents that contain the result sentences.
- 105. The engine of claim 103 wherein the plurality of consecutive sentences comprise a paragraph.
- 106. The engine of claim 103 wherein the plurality of consecutive sentences comprise an entire document.
- 107. The engine of claim 103 wherein the index of the plurality of sentences for the documents is a term-sentence index.
- 108. A method in a networked computer environment for searching a corpus of documents, comprising:
receiving an indication of a plurality of consecutive sentences; forwarding to a search engine the indicated plurality of consecutive sentences; and receiving from the search engine indications of a plurality of result sentences from the document corpus that correlate to the indicated plurality of consecutive sentences based upon a latent semantic regression analysis used by the search engine to determine the similarity of terms in the consecutive sentences to terms in the sentences of documents in the corpus.
- 109. The method of claim 108, further comprising outputting an indication of the result sentences.
- 110. The method of claim 108 wherein the indication of the plurality of consecutive sentences comprises a paragraph.
- 111. The method of claim 108 wherein the indication of the plurality of consecutive sentences comprises a document.
- 112. The method of claim 108 wherein the indications of the plurality of result sentences are indications of documents that contain the sentences that correlate to the indicated plurality of consecutive sentences.
- 113. A method in a computer system for searching a corpus of objects each object having a plurality of units, the corpus having an index of the plurality of units for the objects, comprising:
receiving an indication of a plurality of consecutive units; decomposing the indicated plurality of consecutive units to generate a plurality of search terms for searching the object corpus; determining a plurality of result units in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the units in the corpus of objects; and returning indications of the determined result units.
- 114. The method of claim 113 wherein the indications of the determined result units indicate objects that contain the result units.
- 115. The method of claim 113 wherein the index of the plurality of units for the objects indexes the terms for each unit of each object.
- 116. A computer-readable memory medium containing instructions for controlling a computer processor to search a corpus of objects each object having a plurality of units, the corpus having an index of the plurality of units for the objects, by:
receiving an indication of a plurality of consecutive units; decomposing the indicated plurality of consecutive units to generate a plurality of search terms for searching the object corpus; determining a plurality of result units in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the units in the corpus of objects; and returning indications of the determined result units.
- 117. The computer-readable memory medium of claim 116 wherein the indications of the determined result units indicate objects that contain the result units.
- 118. The computer-readable memory medium of claim 116 wherein the index of the plurality of units for the objects indexes the terms for each unit of each object.
- 119. A search engine for searching a corpus of objects each having a plurality of units, the corpus having an index of the plurality of units for the objects, comprising:
parser that is structured to
receive an indication of a plurality of consecutive units; and decompose the indicated plurality of consecutive units to generate a plurality of search terms for searching the object corpus; and postprocessor that is structured to
determine a plurality of result units in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the units in the corpus of objects; and return indications of the determined result units.
- 120. The search engine of claim 119 wherein the indications of the determined result units indicate objects that contain the result units.
- 121. The search engine of claim 119 wherein the index of the plurality of units for the objects indexes the terms for each unit of each object.
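By way of illustration only, the storing of entity tag type and value pairs recited in claims 1, 11, 15, and 89 can be sketched as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the names `ENTITY_LOOKUP`, `EnhancedDataRepresentation`, `parse`, `transform`, and `matching_sentences` are hypothetical, the look-up table entries are invented for the example, and a whitespace tokenizer stands in for the parser that generates a parse structure. Sentences in which no term matches a designated entity tag are simply omitted from the table in this sketch.

```python
from dataclasses import dataclass, field

# Hypothetical look-up table mapping a term to an entity tag type (cf. claim 11).
# The entries are illustrative assumptions, not taken from the specification.
ENTITY_LOOKUP = {
    "paris": "location",
    "france": "location",
    "acme": "company_name",
}


@dataclass
class EnhancedDataRepresentation:
    """Entity table: a list of (entity tag type, value) pairs per sentence."""
    rows: dict = field(default_factory=dict)

    def store(self, sentence_id, tag_type, value):
        self.rows.setdefault(sentence_id, []).append((tag_type, value))


def parse(sentence):
    """Stand-in parser: whitespace tokens act as the syntactic elements."""
    return [token.strip(".,;:!?") for token in sentence.split()]


def transform(document, edr=None):
    """Store an association for each element that matches a designated tag."""
    if edr is None:
        edr = EnhancedDataRepresentation()
    for sentence_id, sentence in enumerate(document):
        for element in parse(sentence):
            tag_type = ENTITY_LOOKUP.get(element.lower())
            if tag_type is not None:
                edr.store(sentence_id, tag_type, element)
    return edr


def matching_sentences(edr, query_pairs):
    """Return ids of sentences containing every (type, value) pair queried."""
    wanted = set(query_pairs)
    return [sid for sid, pairs in edr.rows.items() if wanted <= set(pairs)]


document = ["Acme opened an office in Paris.", "The weather was mild."]
edr = transform(document)
print(edr.rows)
print(matching_sentences(edr, [("location", "Paris")]))
```

In this sketch a query is itself expressed as entity tag type and value pairs, and a sentence satisfies the query only when it contains the same pairs, which loosely mirrors the exact-match condition of claim 15. Similarity-based matching, such as the latent semantic regression techniques recited elsewhere in the claims, is not shown.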
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0001] This invention was made with government support under Contract No. DAAH01-00-C-R168 awarded by the Defense Advanced Research Projects Agency. The government has or may have certain rights in this invention.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60312385 | Aug 2001 | US |
Continuation in Parts (1)
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 10007299 | Nov 2001 | US |
| Child | 10371399 | Feb 2003 | US |