This application claims the benefit of priority to Russian patent application No. RU2014147623, filed Nov. 26, 2014; disclosure of which is incorporated herein by reference in its entirety.
The present disclosure is generally related to computer systems, and is more specifically related to systems and methods for natural language processing.
Interpreting unstructured information represented by a natural language text may be hindered by polysemy, which is an intrinsic feature of natural languages. Identifying and comparing semantically similar language constructs, and determining their degree of similarity, may facilitate the task of interpreting natural language texts.
In accordance with one or more aspects of the present disclosure, an example method may comprise: receiving a plurality of semantic structures associated with a text corpus; identifying, by a processing device, a first semantic structure and a second semantic structure, wherein the first semantic structure comprises a first substructure and a second substructure, wherein the second semantic structure comprises a third substructure and a fourth substructure, and wherein the first substructure is similar to the third substructure in view of a first similarity criterion; and responsive to determining that the second substructure is similar to the fourth substructure in view of a second similarity criterion, associating, with a certain concept of an ontology associated with the text corpus, objects represented by the second substructure and the fourth substructure.
In accordance with one or more aspects of the present disclosure, an example system may comprise: a memory; and a processor, coupled to the memory, the processor configured to: receive a plurality of semantic structures associated with a text corpus; identify a first semantic structure and a second semantic structure, wherein the first semantic structure comprises a first substructure and a second substructure, wherein the second semantic structure comprises a third substructure and a fourth substructure, and wherein the first substructure is similar to the third substructure in view of a first similarity criterion; and responsive to determining that the second substructure is similar to the fourth substructure in view of a second similarity criterion, associate, with a certain concept of an ontology associated with the text corpus, objects represented by the second substructure and the fourth substructure.
In accordance with one or more aspects of the present disclosure, an example computer-readable non-transitory storage medium may comprise executable instructions that, when executed by a computing device, cause the computing device to perform operations comprising: receiving a plurality of semantic structures associated with a text corpus; identifying, by a processing device, a first semantic structure and a second semantic structure, wherein the first semantic structure comprises a first substructure and a second substructure, wherein the second semantic structure comprises a third substructure and a fourth substructure, and wherein the first substructure is similar to the third substructure in view of a first similarity criterion; and responsive to determining that the second substructure is similar to the fourth substructure in view of a second similarity criterion, associating, with a certain concept of an ontology associated with the text corpus, objects represented by the second substructure and the fourth substructure.
The present disclosure is illustrated by way of examples, and not by way of limitation, and may be more fully understood with references to the following detailed description when considered in connection with the figures, in which:
Described herein are methods and systems for creating ontologies by analyzing natural language texts.
“Ontology” herein shall refer to a model representing objects pertaining to a certain branch of knowledge (subject area) and relationships among such objects. An ontology may comprise definitions of a plurality of classes, such that each class corresponds to a concept of the subject area. Each class definition may comprise definitions of one or more objects associated with the class. Following the generally accepted terminology, an ontology class may also be referred to as concept, and an object belonging to a class may also be referred to as an instance of the concept.
In an illustrative example, class “Person” may be associated with one or more objects corresponding to certain persons. Each class definition may further comprise one or more relationship definitions describing the types of relationships that may be associated with the objects of the class. Each class definition may further comprise one or more restrictions defining certain properties of the objects of the class. In certain implementations, a class may be an ancestor or a descendant of another class.
An object definition may represent a real life material object (such as a person or a thing) or a certain notion associated with one or more real life objects (such as a number or a word). In certain implementations, an object may be associated with two or more classes. An ontology may be an ancestor or/and a descendant of another ontology, in which case concepts and properties of the ancestor ontology would also pertain to the descendant ontology.
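In an illustrative, non-limiting sketch, the ontology model described above (concepts, instances, and ancestor/descendant relationships among concepts) may be represented in memory as shown below. The names Concept, Ontology, add_concept, add_instance, and concepts_of, as well as the concepts “Agent” and “Employee,” are assumptions introduced for illustration only and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """An ontology class (concept); it may have an ancestor concept."""
    name: str
    parent: Optional["Concept"] = None
    instances: set = field(default_factory=set)

    def ancestors(self):
        """Yield ancestor concepts, nearest first."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


@dataclass
class Ontology:
    """A collection of concepts keyed by name (hypothetical layout)."""
    concepts: dict = field(default_factory=dict)

    def add_concept(self, name, parent=None):
        concept = Concept(name, self.concepts[parent] if parent else None)
        self.concepts[name] = concept
        return concept

    def add_instance(self, concept_name, obj):
        self.concepts[concept_name].instances.add(obj)

    def concepts_of(self, obj):
        """Names of concepts the object belongs to, directly or through ancestors."""
        names = set()
        for concept in self.concepts.values():
            if obj in concept.instances:
                names.add(concept.name)
                names.update(a.name for a in concept.ancestors())
        return sorted(names)


onto = Ontology()
onto.add_concept("Agent")
onto.add_concept("Person", parent="Agent")
onto.add_concept("Employee", parent="Person")
# An object may be associated with two or more classes.
onto.add_instance("Person", "John Smith")
onto.add_instance("Employee", "John Smith")
print(onto.concepts_of("John Smith"))
```

In this sketch an instance of a descendant concept is also reported under each ancestor concept, which is one possible reading of the ancestor/descendant relationship.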
The present disclosure provides systems and methods for identifying, by a computing device, alternative semantic structures representing similar or identical objects, facts, features, or phenomena, and for associating the identified semantic structures with the corresponding classes and objects of an ontology that is associated with the natural language text field being analyzed.
“Computing device” herein shall refer to a data processing device having a general purpose processor, a memory, and at least one communication interface. Examples of computing devices that may employ the methods described herein include, without limitation, desktop computers, notebook computers, tablet computers, and smart phones.
In accordance with one or more aspects of the present disclosure, the computing device implementing the method may perform syntactic and semantic analysis of a plurality of natural language texts belonging to a certain text corpus, to produce a plurality of language independent semantic structures.
The computing device may then identify, within the plurality of semantic structures, a first semantic structure and a second semantic structure, such that the first semantic structure comprises a first substructure, which is similar, in view of a certain similarity criterion, to a second substructure comprised by the second semantic structure. The similarity criterion may represent at least partial equivalence of the two substructures. Thus, in various illustrative examples, the two similar substructures may be considered equivalent. In an illustrative example, each of the similar substructures may comprise two parts (referred to as “left context” and “right context” to indicate that they are surrounding the respective remaining substructures of the first semantic structure and the second semantic structure).
Responsive to identifying the two semantic structures comprising similar substructures, the computing device may assert a hypothesis of similarity of the respective interior contexts of the first semantic structure and second semantic structure (wherein each interior context is surrounded by the respective left and right contexts). The hypothesis may then be tested, e.g., by identifying, within the same text corpus, two semantic structures that are different from the first semantic structure and the second semantic structure, and include substructures the semantic similarity or equivalency of which is being tested (i.e., the third substructure and the fourth substructure representing the respective interior contexts), while the remaining parts of the newly identified semantic structures are similar (e.g., in view of the same similarity criterion that was employed for establishing the similarity of the first substructure and the second substructure). Upon confirming the hypothesis, the computing device may define the objects represented by the respective interior contexts of the two semantic structures (i.e., the third substructure and the fourth substructure) as instances of a certain concept of an ontology associated with a certain knowledge domain.
Various aspects of the above referenced methods and systems are described in detail herein below by way of examples, rather than by way of limitation.
At block 120, the computing device implementing the method may perform a semantico-syntactic analysis of an input text corpus 110 to produce a plurality of language independent semantic structures, as described in more detail herein below.
At block 130, the computing device may create an index of the plurality of semantic structures, as described in more detail herein below. The index may be employed for identifying certain elements within the semantic structures, and thus may facilitate identifying semantic structures that are related in a certain manner (e.g., similar in view of certain similarity criteria).
At block 140, the computing device may identify two semantic structures, such that the first identified semantic structure comprises a first substructure, which is similar, in view of a certain similarity criterion, to a second substructure comprised by the second identified semantic structure. The similarity criterion may represent at least partial equivalence of the two substructures, as described in more detail herein below.
In an illustrative example, each of the identified similar substructures may comprise two parts (referred to as “left context” and “right context” to indicate that they are surrounding the respective remaining substructures of the first semantic structure and the second semantic structure).
At block 150, the computing device may ascertain that the respective interior contexts of the first semantic structure and second semantic structure (wherein each interior context is surrounded by the respective left and right contexts) are similar in view of a certain similarity criterion.
At block 160, the computing device may designate the words or word combinations corresponding to the interior contexts of the two semantic structures as being semantically similar or equivalent.
At block 170, the computing device may define the objects represented by the respective interior contexts of the two semantic structures (i.e., the third substructure and the fourth substructure) as instances of a certain concept of an ontology associated with the text corpus, as described in more detail herein below, and the method may loop back to block 140.
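In an illustrative, non-limiting sketch, the operations of blocks 140 through 160 may be approximated as shown below. The sketch rests on several simplifying assumptions that are not part of the disclosure: each semantic structure is flattened into a tuple of language-independent labels, the left and right contexts each consist of a single label, exact equality stands in for both similarity criteria, and a hypothesis is treated as confirmed when the same pair of interior contexts is observed within at least two distinct pairs of surrounding contexts. The function names and the labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_contexts(structure, left_len=1, right_len=1):
    """Split a flattened structure into (left, interior, right) tuples."""
    end = len(structure) - right_len
    return structure[:left_len], structure[left_len:end], structure[end:]


def similar_interiors(structures, min_support=2):
    """Return interior pairs seen inside at least min_support shared contexts."""
    by_context = defaultdict(set)
    for structure in structures:
        if len(structure) < 3:
            continue  # no interior context to compare
        left, interior, right = split_contexts(structure)
        by_context[(left, right)].add(interior)

    # Each shared (left, right) context supports the hypothesis that
    # the interiors it surrounds are semantically similar.
    support = defaultdict(set)
    for context, interiors in by_context.items():
        for a, b in combinations(sorted(interiors), 2):
            support[(a, b)].add(context)

    return {pair for pair, ctxs in support.items() if len(ctxs) >= min_support}


structures = [
    ("EMPLOYER", "DISMISSAL", "EMPLOYEE"),
    ("EMPLOYER", "DISCHARGE", "EMPLOYEE"),
    ("EMPLOYER", "PROMOTION", "EMPLOYEE"),
    ("CONTRACT", "DISMISSAL", "NOTICE"),
    ("CONTRACT", "DISCHARGE", "NOTICE"),
]
print(similar_interiors(structures))
```

In this example only the pair of interiors DISCHARGE and DISMISSAL is supported by two distinct contexts, so only that pair would be designated as semantically similar and passed on to block 170.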
At block 214, the computing device implementing the method may perform lexico-morphological analysis of sentence 212 to identify morphological meanings of the words comprised by the sentence. “Morphological meaning” of a word herein shall refer to one or more lemmas (i.e., canonical or dictionary forms) corresponding to the word and a corresponding set of values of grammatical attributes defining the grammatical value of the word. Such grammatical attributes may include the lexical category (part of speech) of the word and one or more morphological and/or grammatical attributes (e.g., grammatical case, gender, number, conjugation type, etc.). Due to homonymy and/or coinciding grammatical forms corresponding to different lexico-morphological meanings of a certain word, two or more morphological meanings may be identified for a given word. An illustrative example of performing lexico-morphological analysis of a sentence is described in more detail herein below with references to
At block 215, the computing device may perform a rough syntactic analysis of sentence 212. The rough syntactic analysis may include applying one or more syntactic models which may be associated with items of the sentence 212, followed by identification of the surface (i.e., syntactic) associations within sentence 212, in order to produce a graph of generalized constituents. “Constituent” herein shall refer to a contiguous group of words of the original sentence, which behaves as a single grammatical entity. A constituent comprises a core represented by one or more words, and may further comprise one or more child constituents at lower levels. A child constituent is a dependent constituent and may be associated with one or more parent constituents.
At block 216, the computing device may perform a precise syntactic analysis of sentence 212, to produce one or more syntactic trees of the sentence. The plurality of possible syntactic trees corresponding to a given original sentence may stem from homonymy and/or coinciding grammatical forms corresponding to different lexico-morphological meanings of one or more words within the original sentence. Among the multiple syntactic trees, one or more best syntactic trees corresponding to sentence 212 may be selected, based on a certain rating function taking into account compatibility of lexical meanings of the original sentence words, surface relationships, deep relationships, etc.
At block 217, the computing device may process the syntactic trees to produce a semantic structure 218 corresponding to sentence 212. Semantic structure 218 may comprise a plurality of nodes corresponding to semantic classes, and may further comprise a plurality of edges corresponding to semantic relationships, as described in more detail herein below.
In an illustrative example, a certain lexical meaning of lexical descriptions 203 may be associated with one or more surface models of syntactic descriptions 202 corresponding to this lexical meaning. A certain surface model of syntactic descriptions 202 may be associated with a deep model of semantic descriptions 204.
Word inflexion descriptions 310 describe the forms of a given word depending upon its grammatical categories (e.g., grammatical case, grammatical gender, grammatical number, grammatical tense, etc.), and broadly include or describe various possible forms of the word. Word formation description 330 describes which new words may be constructed based on a given word (e.g., compound words).
According to one aspect of the present disclosure, syntactic relationships among the elements of the original sentence may be established using a constituent model. A constituent may comprise a group of neighboring words in a sentence that behaves as a single entity. A constituent has a word at its core and may comprise child constituents at lower levels. A child constituent is a dependent constituent and may be associated with other constituents (such as parent constituents) for building the syntactic descriptions 102 of the original sentence.
Surface models 410 may be represented as aggregates of one or more syntactic forms (“syntforms” 412) employed to describe possible syntactic structures of the sentences that are comprised by syntactic description 102. In general, the lexical meaning of a natural language word may be linked to surface (syntactic) models 410. A surface model may represent constituents which are viable when the lexical meaning functions as the “core.” A surface model may include a set of surface slots of the child elements, a description of the linear order, and/or diatheses. “Diathesis” herein shall refer to a certain relationship between surface slots and their semantic roles expressed by means of corresponding deep slots.
A constituent model may utilize a plurality of surface slots 415 of the child constituents and their linear order descriptions 416 to describe grammatical values 414 of possible fillers of these surface slots. Diatheses 417 represent relationships between surface slots 415 and deep slots 514 (as shown in
Linear order description 416 may be represented by linear order expressions reflecting the sequence in which various surface slots 415 may appear in the sentence. The linear order expressions may include names of variables, names of surface slots, parentheses, grammemes, the “or” operator, etc. In an illustrative example, a linear order description of a simple sentence of “Boys play football” may be represented as “Subject Core Object_Direct,” where Subject, Core, and Object_Direct are the names of surface slots 415 corresponding to the word order.
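In an illustrative, non-limiting sketch, the simplest form of a linear order expression (a flat sequence of surface slot names) may be applied to a sentence as shown below. The function name assign_slots is hypothetical, and the sketch deliberately does not handle variables, parentheses, grammemes, or the “or” operator, the syntax of which is not specified herein.

```python
def assign_slots(linear_order, words):
    """Pair each word with the surface slot at the same position.

    Only a flat, space-separated sequence of slot names is handled.
    Returns None when the sentence does not fit the linear order.
    """
    slots = linear_order.split()
    if len(slots) != len(words):
        return None
    return dict(zip(slots, words))


print(assign_slots("Subject Core Object_Direct", ["Boys", "play", "football"]))
```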
Communicative descriptions 480 may describe a word order in a syntform 412 from the point of view of communicative acts that are represented as communicative order expressions, which are similar to linear order expressions. The control and concord description 440 may comprise rules and restrictions which are associated with grammatical values of the related constituents and may be used in performing syntactic analysis.
Non-tree syntax descriptions 450 may be created to reflect various linguistic phenomena, such as ellipsis and coordination, and may be used in syntactic structures transformations which are generated at various stages of the analysis according to one or more aspects of the present disclosure. Non-tree syntax descriptions 450 may include ellipsis description 452, coordination description 454, as well as referential and structural control description 430, among others.
Analysis rules 460 may generally describe properties of a specific language and may be used in performing semantic analysis 150. Analysis rules 460 may comprise rules of identifying semantemes 462 and normalization rules 464. Normalization rules 464 may be used for describing language-dependent transformations of semantic structures.
The core of the semantic descriptions may be represented by semantic hierarchy 510 which may comprise semantic notions (semantic entities) which are also referred to as semantic classes. The latter may be arranged into a hierarchical structure reflecting parent-child relationships. In general, a child semantic class may inherit one or more properties of its direct parent and other ancestor semantic classes. In an illustrative example, semantic class SUBSTANCE is a child of semantic class ENTITY and the parent of semantic classes GAS, LIQUID, METAL, WOOD_MATERIAL, etc.
Each semantic class in semantic hierarchy 510 may be associated with a corresponding deep model 512. Deep model 512 of a semantic class may comprise a plurality of deep slots 514 which may reflect semantic roles of child constituents in various sentences that include objects of the semantic class as the core of the parent constituent. Deep model 512 may further comprise possible semantic classes acting as fillers of the deep slots. Deep slots 514 may express semantic relationships, including, for example, “agent,” “addressee,” “instrument,” “quantity,” etc. A child semantic class may inherit and further expand the deep model of its direct parent semantic class.
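In an illustrative, non-limiting sketch, the inheritance and expansion of deep models along the semantic hierarchy may be represented as shown below. The class name SemanticClass, its methods, and the filler classes NUMBER and CONTAINER are assumptions introduced for illustration; the deep slot names and the semantic classes ENTITY, SUBSTANCE, and LIQUID are taken from the examples above.

```python
class SemanticClass:
    """A node of the semantic hierarchy carrying its own deep slots."""

    def __init__(self, name, parent=None, deep_slots=None):
        self.name = name
        self.parent = parent
        # Maps a deep slot name to the set of semantic classes that may fill it.
        self.own_deep_slots = dict(deep_slots or {})

    def deep_model(self):
        """Deep slots inherited from ancestors, expanded by this class's own."""
        model = self.parent.deep_model() if self.parent else {}
        model.update(self.own_deep_slots)
        return model

    def is_a(self, other):
        """True if this class is `other` or one of its descendants."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


entity = SemanticClass("ENTITY", deep_slots={"quantity": {"NUMBER"}})
substance = SemanticClass("SUBSTANCE", parent=entity)
liquid = SemanticClass("LIQUID", parent=substance,
                       deep_slots={"instrument": {"CONTAINER"}})

print(sorted(liquid.deep_model()))
print(liquid.is_a(entity))
```

Here LIQUID inherits the “quantity” slot through SUBSTANCE from ENTITY and adds a slot of its own, while the deep model of ENTITY itself is left unchanged.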
Deep slots descriptions 520 reflect semantic roles of child constituents in deep models 512 and may be used to describe general properties of deep slots 514. Deep slots descriptions 520 may also comprise grammatical and semantic restrictions associated with the fillers of deep slots 514. Properties and restrictions associated with deep slots 514 and their possible fillers in various languages may be substantially similar and often identical. Thus, deep slots 514 are language-independent.
Set of semantemes 530 may represent a plurality of semantic categories and semantemes which represent meanings of the semantic categories. In an illustrative example, a semantic category “DegreeOfComparison” may be used to describe the degree of comparison of adjectives and may comprise the following semantemes: “Positive,” “ComparativeHigherDegree,” and “SuperlativeHighestDegree,” among others. In another illustrative example, a semantic category “RelationToReferencePoint” may be used to describe an order (spatial or temporal in a broad sense of the words being analyzed), such as before or after a reference point or an event, and may comprise the semantemes “Previous” and “Subsequent.” In yet another illustrative example, a semantic category “EvaluationObjective” can be used to describe an objective assessment, such as “Bad,” “Good,” etc.
Set of semantemes 530 may include language-independent semantic attributes which may express not only semantic properties but also stylistic, pragmatic and communicative properties. Certain semantemes may be used to express an atomic meaning which corresponds to a regular grammatical and/or lexical expression in a natural language. By their intended purpose and usage, sets of semantemes may be categorized, e.g., as grammatical semantemes 532, lexical semantemes 534, and classifying grammatical (differentiating) semantemes 536.
Grammatical semantemes 532 may be used to describe grammatical properties of the constituents when transforming a syntactic tree into a semantic structure. Lexical semantemes 534 may describe specific properties of objects (e.g., “being flat” or “being liquid”) and may be used in deep slot descriptions 520 as restrictions associated with the deep slot fillers (e.g., for the verbs “face (with)” and “flood,” respectively). Classifying grammatical (differentiating) semantemes 536 may express the differentiating properties of objects within a single semantic class. In an illustrative example, in the semantic class of HAIRDRESSER, the semanteme of <<RelatedToMen>> is associated with the lexical meaning of “barber,” to differentiate from other lexical meanings which also belong to this class, such as “hairdresser,” “hairstylist,” etc. These language-independent semantic properties, which may be expressed by elements of semantic descriptions, including semantic classes, deep slots, and semantemes, may be employed for extracting the semantic information, in accordance with one or more aspects of the present invention.
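In an illustrative, non-limiting sketch, the use of a differentiating semanteme to distinguish lexical meanings within a single semantic class may be represented as shown below. The names LexicalMeaning and select_meanings are hypothetical; the semantic class HAIRDRESSER, the semanteme RelatedToMen, and the three lexical meanings are taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalMeaning:
    lemma: str
    semantic_class: str
    semantemes: frozenset = frozenset()


def select_meanings(meanings, semantic_class, required=frozenset()):
    """Lemmas of the given class that carry every required semanteme."""
    return [
        m.lemma
        for m in meanings
        if m.semantic_class == semantic_class and set(required) <= m.semantemes
    ]


meanings = [
    LexicalMeaning("barber", "HAIRDRESSER", frozenset({"RelatedToMen"})),
    LexicalMeaning("hairdresser", "HAIRDRESSER"),
    LexicalMeaning("hairstylist", "HAIRDRESSER"),
]

print(select_meanings(meanings, "HAIRDRESSER"))
print(select_meanings(meanings, "HAIRDRESSER", {"RelatedToMen"}))
```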
Pragmatic descriptions 540 allow associating a certain theme, style or genre to texts and objects of semantic hierarchy 510 (e.g., “Economic Policy,” “Foreign Policy,” “Justice,” “Legislation,” “Trade,” “Finance,” etc.). Pragmatic properties may also be expressed by semantemes. In an illustrative example, the pragmatic context may be taken into consideration during the semantic analysis phase.
A lexical meaning 612 in the lexical-semantic hierarchy 510 may be associated with a surface model 410 which, in turn, may be associated, by one or more diatheses 417, with a corresponding deep model 512. A lexical meaning 612 may inherit the semantic class of its parent, and may further specify its deep model 512.
A surface model 410 of a lexical meaning may comprise one or more syntforms 412. A syntform 412 of a surface model 410 may comprise one or more surface slots 415, including their respective linear order descriptions 416, one or more grammatical values 414 expressed as a set of grammatical categories (grammemes), one or more semantic restrictions associated with surface slot fillers, and one or more of the diatheses 417. Semantic restrictions associated with a certain surface slot filler may be represented by one or more semantic classes, whose objects can fill the surface slot.
At block 215, the computing device may perform a rough syntactic analysis of original sentence 212, in order to produce a graph of generalized constituents 732 of
Graph of generalized constituents 732 may be represented by an acyclic graph comprising a plurality of nodes corresponding to the generalized constituents of original sentence 212, and further comprising a plurality of edges corresponding to the surface (syntactic) slots, which may express various types of relationship among the generalized lexical meanings. The method may apply a plurality of potentially applicable syntactic models for each element of a plurality of elements of the lexico-morphological structure of original sentence 212 in order to produce a set of constituents of original sentence 212. Then, the method may consider a plurality of the constituents of original sentence 212 in order to produce graph of generalized constituents 732 based on a set of constituents. Graph of generalized constituents 732 at the level of the surface model may reflect a plurality of relationships among the words of original sentence 212. As the number of viable syntactic structures may be relatively large, graph of generalized constituents 732 may generally comprise redundant information, including relatively large quantity of lexical meanings for certain nodes and/or surface slots for certain edges of the graph.
Graph of generalized constituents 732 may be initially built as a tree, starting with the terminal nodes (leaves) and moving towards the root, by adding child components to fill surface slots 415 of a plurality of parent constituents in order to cover all lexical units of original sentence 212.
In certain implementations, the root of graph of generalized constituents 732 represents a predicate. In the course of the above described process, the tree may become a graph, as certain constituents of a lower level may be included into one or more constituents of an upper level. A plurality of constituents that represent certain elements of the lexico-morphological structure may then be generalized to produce generalized constituents. The constituents may be generalized based on their lexical meanings or grammatical values 414, e.g., based on part of speech and their relationships.
At block 216, the computing device may perform a precise syntactic analysis of sentence 212, to produce one or more syntactic trees 742 of
In the course of producing the syntactic structure 746 based on the selected syntactic tree, the computing device may establish one or more non-tree links (e.g., by establishing an additional link between at least two nodes of the graph). If that process fails, the computing device may select a syntactic tree having a suboptimal rating closest to the optimal rating, and may attempt to establish one or more non-tree relationships within that tree. Finally, the precise syntactic analysis produces a syntactic structure 746 which represents the best syntactic structure corresponding to original sentence 212. In fact, selecting the best syntactic structure 746 also produces the best lexical values 240 for items of original sentence 212.
At block 217, the computing device may process the syntactic trees to produce a semantic structure 218 corresponding to sentence 212. Semantic structure 218 may reflect, in language-independent terms, the semantics conveyed by the original sentence. Semantic structure 218 may be represented by an acyclic graph (e.g., a tree may be complemented by one or more non-tree links, such as an edge of the graph between two nodes of the graph). The original words of the source sentence are represented by the nodes corresponding to language-independent semantic classes of semantic hierarchy 510. The edges of the graph represent deep (semantic) relationships between items of the sentence. The transfer to semantic structure 218 may be produced based on analysis rules 460, and may involve associating one or more attributes (reflecting lexical, syntactic, and/or semantic properties of the words of original sentence 212) with each semantic class.
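In an illustrative, non-limiting sketch, the selection of a syntactic tree with fallback to the next-best rated tree may be expressed as shown below. The function name best_syntactic_structure is hypothetical, and the rating function and the procedure for establishing non-tree links are supplied by the caller as stand-ins, since their details are outside the scope of this sketch.

```python
def best_syntactic_structure(trees, rating, establish_non_tree_links):
    """Try trees from the highest rating down.

    `establish_non_tree_links` is expected to return a syntactic structure,
    or None when non-tree links cannot be established for the given tree.
    """
    for tree in sorted(trees, key=rating, reverse=True):
        structure = establish_non_tree_links(tree)
        if structure is not None:
            return structure
    return None  # no candidate tree yielded a valid structure


# Toy stand-ins: ratings are looked up in a table, and link
# establishment fails for the top-rated tree only.
ratings = {"tree_a": 0.9, "tree_b": 0.7, "tree_c": 0.4}


def toy_links(tree):
    return None if tree == "tree_a" else {"tree": tree, "non_tree_links": []}


result = best_syntactic_structure(list(ratings), ratings.get, toy_links)
print(result)
```

In this toy run the top-rated tree is rejected, so the structure built from the tree with the rating closest to the optimal one is returned.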
As noted herein above, an ontology may be provided by a model representing objects pertaining to a certain branch of knowledge (subject area) and relationships among such objects. Thus, an ontology is different from the semantic hierarchy, despite the fact that it may be associated with elements of a semantic hierarchy by certain relationships (also referred to as “anchors”). An ontology may comprise definitions of a plurality of classes, such that each class corresponds to a concept of the subject area. Each class definition may comprise definitions of one or more objects associated with the class. Following the generally accepted terminology, an ontology class may also be referred to as concept, and an object belonging to a class may also be referred to as an instance of the concept.
In accordance with one or more aspects of the present disclosure, the computing device implementing the methods described herein may index one or more parameters yielded by the semantico-syntactic analysis. Thus, the methods described herein allow considering not only the plurality of words comprised by the original text corpus, but also pluralities of lexical meanings of those words, by storing and indexing all syntactic and semantic information produced in the course of syntactic and semantic analysis of each sentence of the original text corpus. Such information may further comprise the data produced in the course of intermediate stages of the analysis, the results of lexical selection, including the results produced in the course of resolving the ambiguities caused by homonymy and/or coinciding grammatical forms corresponding to different lexico-morphological meanings of certain words of the original language.
One or more indexes may be produced for each text, text corpus, or text corpora. An index may be represented by a memory data structure, such as a table, comprising a plurality of entries. Each entry may represent a mapping of a certain element or parameter of descriptions (e.g., one or more words, a lexical meaning, a syntactic relationship, a morphological, lexical, syntactic or semantic property, or a syntactic or semantic structure) to one or more identifiers (or addresses) of occurrences of the semantic structure element within the original text.
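In an illustrative, non-limiting sketch, an index mapping elements of the descriptions to the addresses of their occurrences may be built as shown below. The function name build_index, the layout of an analyzed token as a dictionary with the keys "word", "lexical_meaning", and "semantic_class", the (sentence number, position) address format, and all lexical meaning and semantic class values are assumptions introduced for illustration.

```python
from collections import defaultdict

INDEXED_KINDS = ("word", "lexical_meaning", "semantic_class")


def build_index(analyzed_sentences):
    """Map (kind, value) to a list of (sentence number, position) occurrences."""
    index = defaultdict(list)
    for sent_id, tokens in enumerate(analyzed_sentences):
        for pos, token in enumerate(tokens):
            for kind in INDEXED_KINDS:
                index[(kind, token[kind])].append((sent_id, pos))
    return index


corpus = [
    [
        {"word": "Boys", "lexical_meaning": "boy", "semantic_class": "BOY"},
        {"word": "play", "lexical_meaning": "play", "semantic_class": "TO_PLAY"},
        {"word": "football", "lexical_meaning": "football", "semantic_class": "GAME"},
    ],
    [
        {"word": "Girls", "lexical_meaning": "girl", "semantic_class": "GIRL"},
        {"word": "played", "lexical_meaning": "play", "semantic_class": "TO_PLAY"},
        {"word": "chess", "lexical_meaning": "chess", "semantic_class": "GAME"},
    ],
]

index = build_index(corpus)
print(index[("semantic_class", "GAME")])
print(index[("lexical_meaning", "play")])
```

Because lexical meanings and semantic classes are indexed alongside surface words, a lookup by the lexical meaning “play” returns both “play” and “played,” which a word-only index would keep apart.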
In certain implementations, an index may comprise one or more values of morphological, syntactic, lexical, and/or semantic parameters. These values may be produced in the course of the two-stage semantic analysis, as described in more detail herein. The index may be employed in various natural language processing tasks, including the task of performing semantic search.
The computing device implementing the method may extract a wide spectrum of lexical, grammatical, syntactic, pragmatic, and/or semantic characteristics in the course of performing the syntactico-semantic analysis and producing semantic structures. In an illustrative example, the system may extract and store certain lexical information, associations of certain lexical units with semantic classes, information regarding grammatical forms and linear order, information regarding syntactic relationships and surface slots, information regarding the usage of certain forms, aspects, tonality (e.g., positive and negative), deep slots, non-tree links, semantemes, etc.
The computing device implementing the methods described herein may produce, by performing one or more text analysis methods described herein, and index any one or more parameters of the language descriptions, including lexical meanings, semantic classes, grammemes, semantemes, etc. Semantic class indexing may be employed in various natural language processing tasks, including semantic search, classification, clustering, text filtering, etc. Indexing lexical meanings (rather than indexing words) allows searching not only words and forms of words, but also lexical meanings, i.e., words having certain lexical meanings. The computing device implementing the methods described herein may also store, index and search the syntactic and semantic structures produced by one or more text analysis methods described herein, for employing those structures and/or indexes in semantic search, classification, clustering, and document filtering.
In various implementations, the computing device implementing the methods described herein may employ indexes comprising one or more integers for indexing various syntactic, semantic, and other parameters. In an illustrative example, surface or deep slots may be indexed using two-integer combinations, in which the integers identify occurrences of the pairs of words corresponding to a certain slot. For example, for the example semantic structure of
Similar methods may be employed for indexing not only words, but also their lexical meanings, semantic classes, syntactic and semantic relationships, and/or other elements of syntactic and semantic structures employed and/or produced by the methods described herein. The indexes may facilitate searching for and identifying contexts specified not only by keywords, but also by certain lexical meanings, meanings associated with certain semantic classes, syntactic and/or semantic properties, morphological properties, or combinations thereof.
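In an illustrative, non-limiting sketch, a slot index of the kind described above may be modeled as a mapping from a slot name to a list of two-integer combinations. The sketch assumes that each integer is the position of a word in the sentence; the slot names, the example sentence, and the function name `build_slot_index` are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


def build_slot_index(links):
    """Map each slot name to the (parent_pos, child_pos) pairs that fill it.

    `links` is an iterable of (slot_name, parent_pos, child_pos) triples,
    where the positions are assumed to be word offsets in the sentence.
    """
    index = defaultdict(list)
    for slot, parent_pos, child_pos in links:
        index[slot].append((parent_pos, child_pos))
    return dict(index)


# Hypothetical analysis of "the employer terminated the agreement":
# positions: 0 the, 1 employer, 2 terminated, 3 the, 4 agreement
links = [("Agent", 2, 1), ("Object", 2, 4)]
slot_index = build_slot_index(links)

# All word pairs linked by the "Object" slot.
object_pairs = slot_index.get("Object", [])
```

The same mapping may be keyed by lexical meanings or semantic classes instead of slot names, which is one way to support lookups that go beyond keywords.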
The computing device implementing the methods described herein may also search for certain fragments of syntactic or semantic structures. Such searches may yield sentences, paragraphs, or other textual fragments, as specified by the search parameters.
The computing device implementing the methods described herein may analyze a plurality of sentences comprised by the original text corpus, and may store the results of the syntactic and semantic analysis of those sentences. Hence, the computing device may be programmed to compare the syntactic and semantic structures, as well as perform their classification, clustering, and/or other processing, including producing their respective visual representations using a graphical user interface (GUI) device.
Referring again to
In various implementations, the computing device implementing the method may employ various indexes to identify similar semantic structures. In an illustrative example, the computing device may employ indexes of lexical values, indexes of surface slots, and/or indexes of deep slots. In another illustrative example, the computing device may employ N-gram indexes, i.e., indexes of N element sequences, the elements of which may be represented by lexical meanings, surface slots, etc.
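In an illustrative, non-limiting sketch, an N-gram index may be used to shortlist pairs of structures that are likely to be similar. The sketch assumes that each structure has been flattened into a sequence of element labels (e.g., lexical meanings); the labels, the identifiers `s1` to `s3`, and the `min_shared` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_ngram_index(structures, n=2):
    """Map each N-element sequence to the ids of the structures containing it."""
    index = defaultdict(set)
    for sid, elements in structures.items():
        for i in range(len(elements) - n + 1):
            index[tuple(elements[i:i + n])].add(sid)
    return index


def candidate_pairs(index, min_shared=2):
    """Return pairs of structure ids that share at least `min_shared` N-grams."""
    shared = Counter()
    for sids in index.values():
        for pair in combinations(sorted(sids), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


# Hypothetical element sequences standing in for analyzed sentences.
structures = {
    "s1": ["BURDEN", "PROOF", "LEGAL_CAUSE", "DISCHARGE"],
    "s2": ["BURDEN", "PROOF", "LEGAL_CAUSE", "TERMINATION"],
    "s3": ["LAWSUIT", "REINSTATE", "EMPLOYMENT"],
}
pairs = candidate_pairs(build_ngram_index(structures))
```

Only the shortlisted pairs would then need the more expensive structure-level comparison.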
In an illustrative example, the original text corpus may comprise a plurality of legal documents. Such documents usually comprise a relatively large share of sentences having similar semantic structure. An illustrative example is described herein involving creating an ontology concept and/or adding concept instances related to various modifications of the term “employment termination” in various contexts, including “employer initiated termination,” “voluntary separation,” “discharge,” “removal from office,” as well as semantically similar expressions such as “dismissal,” “employment contract termination,” etc. The computing device implementing the method may select certain classes of structures, e.g., structures describing a noun group, structures describing a fact (including a subject, a predicate, and an object), structures comprising a certain deep slot or a certain semantic class, etc.
The term “employment termination” may be represented by a corresponding ontology class, as schematically illustrated by
The method of
In an illustrative example, the method may process the following two sentences:
(a) In a lawsuit to reinstate employment of a person whose employment agreement has been terminated by the employer, the burden of proof of a legal cause of the discharge rests upon the employer; and
(b) In a lawsuit to reinstate employment of a person whose employment agreement has been terminated by the employer, the burden of proof of a legal cause of the employment agreement termination rests upon the employer.
As schematically illustrated by
To minimize the number of iterations, the plurality of semantico-syntactic structures representing the original text corpus may be preliminarily classified, clustered, and/or filtered (e.g., based on certain semantic classes). To further minimize the number of iterations, the semantico-syntactic structures comprised by the resulting subsets (e.g., classes or clusters) may be compared pairwise. In certain implementations, the computing device may be configured to identify two or more structures that have equivalent substructures comprising left and right contexts, such as the above referenced sentences (a) and (b). Such substructures may not be textually equivalent, but may have equivalent semantic structures. Two semantic structures may be considered equivalent, for example, if they comprise equivalent sets of semantic classes represented by their respective nodes, and further comprise equivalent semantemes associated with those nodes and equivalent deep slots associated with those nodes. The set of equivalent semantemes may be preliminarily limited to a certain set, e.g., a set of differentiating semantemes. Thus, the deep analysis technology allows comparing semantic meanings of sentences or parts of sentences irrespective of their syntactic representation.
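In an illustrative, non-limiting sketch, the equivalence test described above may be expressed as a comparison of node signatures. The sketch assumes that each node carries a semantic class, a deep slot, and a set of semantemes, and that node multisets are compared regardless of node order; the `Node` type and the semanteme names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    semantic_class: str
    deep_slot: str
    semantemes: frozenset = frozenset()


def signature(nodes, differentiating=None):
    """Count (class, slot, semantemes) triples, ignoring node order.

    If `differentiating` is given, only those semantemes are compared.
    """
    counts = Counter()
    for node in nodes:
        semantemes = node.semantemes
        if differentiating is not None:
            semantemes = semantemes & differentiating
        counts[(node.semantic_class, node.deep_slot, semantemes)] += 1
    return counts


def equivalent(nodes_a, nodes_b, differentiating=None):
    """True if both structures have the same node signature."""
    return signature(nodes_a, differentiating) == signature(nodes_b, differentiating)


# Hypothetical substructures that differ only in a non-differentiating semanteme.
a = [Node("EMPLOYER", "Agent", frozenset({"Singular"})),
     Node("TO_TERMINATE", "Predicate", frozenset({"Past"}))]
b = [Node("TO_TERMINATE", "Predicate", frozenset({"Past", "Passive"})),
     Node("EMPLOYER", "Agent", frozenset({"Singular"}))]

strict = equivalent(a, b)
limited = equivalent(a, b, differentiating=frozenset({"Past", "Singular"}))
```

Restricting the comparison to differentiating semantemes makes the two hypothetical substructures equivalent even though their full semanteme sets differ. A fuller implementation would also compare the links between nodes, which this sketch omits.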
The computing device implementing the method may then assert a hypothesis that the parts of the sentences remaining after excluding the identified equivalent substructures (e.g., respective left and right contexts) are semantically similar or equivalent. In the illustrative example of
The computing device implementing the method may then test the asserted hypothesis, e.g., using the same or similar text corpus after excluding the two sentences being analyzed. In certain implementations, the hypothesis may be tested by identifying, within the same or similar text corpus, other sentences comprising the terms the semantic similarity or equivalency of which is being tested (e.g., “discharge” and “employment agreement termination”). In an illustrative example, the computing device implementing the method may identify two similar or equivalent semantic structures that are different from the two previously identified semantic structures, and include substructures representing the terms the semantic similarity or equivalency of which is being tested, while the remaining parts of the newly identified structures are semantically similar or equivalent (e.g., in view of the same similarity criterion that was employed for establishing the similarity of the first two semantic structures).
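In an illustrative, non-limiting sketch, the hypothesis test described above may be approximated by searching the corpus for further sentence pairs in which the two terms appear in identical surroundings. The sketch uses token sequences as a stand-in for semantic structures and exact token equality as a stand-in for the similarity criterion; the corpus sentences and function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def context_around(tokens, term):
    """Return the (left, right) contexts of the first occurrence of `term`."""
    n = len(term)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term:
            return tuple(tokens[:i]), tuple(tokens[i + n:])
    return None


def supporting_pairs(corpus, term_a, term_b, exclude=()):
    """Find sentence pairs that differ only by term_a versus term_b."""
    term_a, term_b = tokenize(term_a), tokenize(term_b)
    contexts_a, contexts_b = {}, {}
    for sid, text in corpus.items():
        if sid in exclude:  # e.g., the sentences that produced the hypothesis
            continue
        tokens = tokenize(text)
        ctx = context_around(tokens, term_a)
        if ctx is not None:
            contexts_a[sid] = ctx
        ctx = context_around(tokens, term_b)
        if ctx is not None:
            contexts_b[sid] = ctx
    return sorted(
        (sa, sb)
        for sa, ca in contexts_a.items()
        for sb, cb in contexts_b.items()
        if sa != sb and ca == cb
    )


# Hypothetical corpus sentences.
corpus = {
    "s1": "The discharge must be documented in writing.",
    "s2": "The employment agreement termination must be documented in writing.",
    "s3": "The employer shall notify the employee.",
}
support = supporting_pairs(corpus, "discharge", "employment agreement termination")
```

Each supporting pair found outside the originally analyzed sentences may be treated as evidence in favor of the asserted hypothesis.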
In certain implementations, the requirement of the equivalency of the left and right contexts surrounding a candidate substructure may be relaxed, such that the left and right contexts may be required to be similar in view of a certain similarity criterion, wherein the similarity metric value should exceed a certain threshold value.
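In an illustrative, non-limiting sketch, the relaxed requirement may be expressed as a threshold on a context similarity score. The sketch substitutes the token-level ratio of Python's `difflib.SequenceMatcher` for the similarity metric, which the disclosure does not prescribe; the threshold value of 0.8 and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def context_similarity(ctx_a, ctx_b):
    """Similarity in [0, 1] between two token lists."""
    return SequenceMatcher(None, ctx_a, ctx_b, autojunk=False).ratio()


def contexts_match(sentence_a, term_a, sentence_b, term_b, threshold=0.8):
    """True if the left and right contexts of both terms are similar enough.

    Terms are expected in lowercase, since the sentences are lowercased.
    """
    left_a, found_a, right_a = sentence_a.lower().partition(term_a)
    left_b, found_b, right_b = sentence_b.lower().partition(term_b)
    if not found_a or not found_b:
        return False
    left = context_similarity(left_a.split(), left_b.split())
    right = context_similarity(right_a.split(), right_b.split())
    # Both contexts must meet the threshold value.
    return min(left, right) >= threshold


common = ("to reinstate employment of a person whose employment agreement "
          "has been terminated by the employer, the burden of proof of a "
          "legal cause of the ")
sentence_a = "In a lawsuit " + common + "discharge rests upon the employer"
sentence_c = ("In legal proceedings " + common
              + "employment agreement termination rests upon the employer")

relaxed_match = contexts_match(
    sentence_a, "discharge", sentence_c, "employment agreement termination"
)
```

In this sketch, the left contexts differ only in their opening words (“a lawsuit” versus “legal proceedings”), so they meet the threshold although they are not identical.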
In an illustrative example, the method may process the following two sentences:
(a) In a lawsuit to reinstate employment of a person whose employment agreement has been terminated by the employer, the burden of proof of a legal cause of the discharge rests upon the employer; and
(c) In legal proceedings to reinstate employment of a person whose employment agreement has been terminated by the employer, the burden of proof of a legal cause of the employment agreement termination rests upon the employer.
The semantic structures corresponding to sentences (a) and (c) are schematically illustrated by
The similarity of semantic structures may be evaluated using an integral similarity metric. Depending upon the requirements for accuracy and/or the computational complexity involved, the metric may take into account various factors, including: structural similarity of the semantic structures; presence of the same deep slots or slots associated with the same semantic class; presence of the same lexical or semantic classes associated with the nodes of the semantic structures; presence of a parent-child relationship between certain nodes of the semantic structures, such that the parent and the child are separated by no more than a certain number of semantic hierarchy levels; and presence of a common ancestor for certain semantic classes and the distance between the nodes representing those classes. If certain semantic classes are found equivalent or substantially similar, the metric may further take into account the presence or absence of certain differentiating semantemes and/or other factors.
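In an illustrative, non-limiting sketch, an integral similarity metric may be composed as a weighted sum of a deep slot overlap score and a semantic class proximity score. The sketch assumes that each structure is reduced to a mapping from deep slot to semantic class and that the semantic hierarchy is a tree given as a child-to-parent mapping; the weights, the `max_levels` limit, and the class names are hypothetical.

```python
def ancestors(cls, parent):
    """Return the chain from `cls` up to the root of the hierarchy."""
    chain = [cls]
    while cls in parent:
        cls = parent[cls]
        chain.append(cls)
    return chain


def class_distance(a, b, parent):
    """Number of hierarchy levels between a and b via their common ancestor."""
    depth_b = {cls: i for i, cls in enumerate(ancestors(b, parent))}
    for i, cls in enumerate(ancestors(a, parent)):
        if cls in depth_b:
            return i + depth_b[cls]
    return None  # no common ancestor


def integral_similarity(nodes_a, nodes_b, parent,
                        w_slot=0.5, w_class=0.5, max_levels=2):
    """Weighted sum of slot overlap and class proximity, in [0, 1]."""
    slots = set(nodes_a) | set(nodes_b)
    if not slots:
        return 1.0
    shared = set(nodes_a) & set(nodes_b)
    slot_score = len(shared) / len(slots)
    class_score = 0.0
    for slot in shared:
        distance = class_distance(nodes_a[slot], nodes_b[slot], parent)
        if distance is not None and distance <= max_levels:
            class_score += 1.0 / (1 + distance)
    class_score /= len(slots)
    return w_slot * slot_score + w_class * class_score


# Hypothetical fragment of a semantic hierarchy (child -> parent).
parent = {"LAWSUIT": "LEGAL_PROCEEDINGS", "LEGAL_PROCEEDINGS": "PROCESS"}
a = {"Object": "LAWSUIT", "Agent": "EMPLOYER"}
c = {"Object": "LEGAL_PROCEEDINGS", "Agent": "EMPLOYER"}
score = integral_similarity(a, c, parent)
```

Further factors named above, such as differentiating semantemes, may be added as additional weighted terms; the weights would need to be tuned against the accuracy and complexity requirements.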
In certain implementations, a partial order relationship of semantic structures may be defined. In an illustrative example, sentences and their respective semantic structures may be ordered by the degree of abstractness, e.g., starting from less abstract (more specific) and moving to more abstract (less specific) statements. In certain implementations, each semantic structure may be associated with a certain score reflecting the partial order relationship.
Referring again to
At block 160, the computing device may designate the words or word combinations corresponding to the interior contexts of the two semantic structures as being semantically similar or equivalent.
At block 170, the computing device may define the objects represented by the respective interior contexts of the two semantic structures (i.e., the third substructure and the fourth substructure) as instances of a certain concept of an ontology associated with the text corpus or field.
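In an illustrative, non-limiting sketch, the operations of blocks 160 and 170 may be modeled with a mapping from ontology concepts to their instances. The `Ontology` class and its methods are hypothetical and do not represent a particular ontology storage format.

```python
from collections import defaultdict


class Ontology:
    """Minimal concept-to-instances store."""

    def __init__(self):
        self.instances = defaultdict(set)

    def add_instances(self, concept, *terms):
        """Register the given terms as instances of `concept`."""
        self.instances[concept].update(terms)

    def concept_of(self, term):
        """Return the concept that `term` is an instance of, if any."""
        for concept, terms in self.instances.items():
            if term in terms:
                return concept
        return None


ontology = Ontology()

# Block 160: the interior contexts were designated as semantically similar.
similar_terms = ("discharge", "employment agreement termination")

# Block 170: both terms become instances of the same ontology concept.
ontology.add_instances("employment termination", *similar_terms)
```

After this step, a lookup such as `ontology.concept_of("discharge")` resolves both expressions to the same concept.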
In other implementations of the method of
An important advantage of the method of
Another important advantage of the method of
Exemplary computing device 1000 includes a processor 502, a main memory 504 (e.g., read-only memory (ROM) or dynamic random access memory (DRAM)), and a data storage device 518, which communicate with each other via a bus 530.
Processor 502 may be represented by one or more general-purpose computing devices such as a microprocessor, central processing unit, or the like. More particularly, processor 502 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. Processor 502 may also be one or more special-purpose computing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. Processor 502 is configured to execute instructions 526 for performing the operations and functions discussed herein.
Computing device 1000 may further include a network interface device 522, a video display unit 510, a character input device 512 (e.g., a keyboard), and a touch screen input device 514.
Data storage device 518 may include a computer-readable storage medium 524 on which is stored one or more sets of instructions 526 embodying any one or more of the methodologies or functions described herein. Instructions 526 may also reside, completely or at least partially, within main memory 504 and/or within processor 502 during execution thereof by computing device 1000, main memory 504 and processor 502 also constituting computer-readable storage media. Instructions 526 may further be transmitted or received over network 516 via network interface device 522.
In certain implementations, instructions 526 may include instructions of method 800 for creating ontologies by analyzing natural language texts. While computer-readable storage medium 524 is shown in the example of
The methods, components, and features described herein may be implemented by discrete hardware components or may be integrated in the functionality of other hardware components such as ASICs, FPGAs, DSPs or similar devices. In addition, the methods, components, and features may be implemented by firmware modules or functional circuitry within hardware devices. Further, the methods, components, and features may be implemented in any combination of hardware devices and software components, or only in software.
In the foregoing description, numerous details are set forth. It will be apparent, however, to one of ordinary skill in the art having the benefit of this disclosure, that the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present disclosure.
Some portions of the detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “determining,” “computing,” “calculating,” “obtaining,” “identifying,” “modifying” or the like, refer to the actions and processes of a computing device, or similar electronic computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device memories or registers or other such information storage, transmission or display devices.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Various other implementations will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
Number | Date | Country | Kind |
---|---|---|---|
2014147623 | Nov 2014 | RU | national |
Number | Date | Country | |
---|---|---|---|
20160147736 A1 | May 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14559078 | Dec 2014 | US |
Child | 14588644 | US |