This application claims the benefit of priority to Russian patent application No. 2015109665, filed Mar. 19, 2015; disclosure of which is hereby incorporated by reference in its entirety.
The present disclosure pertains to devices, systems, methods and computer programs in the field of automatic processing of text data in natural languages (Natural Language Processing).
One of the major problems at present in the field of automatic processing of text information presented in natural languages is the synthesis of text based on information objects extracted from text data. One of the applied problems of text synthesis based on extracted information is automatic text annotation.
Automatic annotation is a text data processing routine for subsequent extraction of basic information from the data and its further processing. At present, the existing methods for automatic annotation can be divided into two types. In the first type, known as “extraction-based summarization”, the annotation text consists of sentences taken from the source text. In the second type, “abstraction-based summarization”, the annotation text is synthesized on the basis of the content of the source text. Given the technical complexity of automatic text synthesis and of the extraction of information from text, the main methods of annotation are of the “extraction-based summarization” type. Examples of such methods are TextRank, the method of annotation based on terminology and semantics, and the method of annotation based on latent semantic analysis.
The TextRank annotation method is an extremely simple algorithm for automatic annotation which presents the source text in the form of a graph whose nodes are sentences, while its graph edges are the “relation” between two sentences. The relation is defined by the number of identical words in the given sentences. Each edge in the graph has a weight, while each vertex is assigned a rating, computed on the basis of 2 criteria:
The nodes with the highest rating contain sentences which will be used in the annotation text. The chief defect of this method of annotation is the fact that it makes practically no allowance for the text semantics, and therefore the annotation is not always true and accurate.
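By way of illustration only, the graph-based ranking described above can be sketched as follows. The edge weight is the number of identical words in two sentences, as stated above; the rating formula (a weighted PageRank-style iteration with a damping factor) and the names rank_sentences, damping and iterations are assumptions of this sketch and are not taken from the present disclosure.

```python
import re


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Rate sentences with a weighted PageRank-style iteration (assumed scheme)."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weight: number of identical words shared by two sentences.
    weight = [
        [len(words[i] & words[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weight]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weight[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices fell sharply today.",
]
scores = rank_sentences(sentences)
```

In this sketch the third sentence shares no words with the other two and therefore receives the lowest rating, regardless of its meaning, which illustrates the lack of allowance for semantics noted above.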
The annotation algorithm based on terminology and semantics ranks the sentences of the source text by using metrics based on terms extracted from the text. With the aid of ontology, a correlation is established between each term from the text and the terms from the heading, and on this basis the weight of each term is computed. The weight of a sentence is computed as the sum of the weights of all the terms used therein.
The method based on latent semantic analysis is also based on a ranking of sentences with the aid of terms. The foundation of the method is the principle of selection of sentences having maximum importance in terms of a particular topic. However, this method as well has drawbacks. Since the sentences are selected by the principle that the importance of the sentence is a maximum in at least one topic, this means that a sentence whose importance is good in all topics, but not a maximum in any of them, will not make it into the annotation. Besides this, topics of slight importance are not filtered out, so that the size of the annotation may be larger than is needed.
The specification discloses a method of automatic annotation of text data of the “abstraction-based summarization” type, which remedies the deficiencies of the existing methods and enables a text synthesis with high accuracy based on extracted data—information objects—from the text.
Disclosed are systems, methods, and computer programs for synthesis of natural-language text.
In one aspect, an example method of synthesis of natural-language text comprises: receiving by a hardware processor a plurality of received information objects; selecting by the hardware processor among the plurality of received information objects at least one selected information object and, for each selected information object, an associated synthesis template in a template library, wherein the library includes at least one synthesis template, and wherein each synthesis template includes a template semantic-syntactic tree; generating by the hardware processor for each selected information object a synthesis semantic-syntactic tree based on the template semantic-syntactic tree of the associated synthesis template selected for the selected information object; and generating by the hardware processor natural language text based on each generated synthesis semantic-syntactic tree.
In another aspect, an example system for synthesis of natural-language text comprises an information object receiving module configured to receive a plurality of received information objects; an information object selection module configured to select among the plurality of received information objects at least one selected information object and, for each selected information object, an associated synthesis template in a template library, wherein the library includes at least one synthesis template, and wherein each synthesis template includes a template semantic-syntactic tree; a synthesis semantic-syntactic tree generation module configured to generate for each selected information object a synthesis semantic-syntactic tree based on the template semantic-syntactic tree of the associated synthesis template selected for the selected information object; and a natural text generation module configured to generate natural language text based on each generated synthesis semantic-syntactic tree.
In yet another aspect, an example computer program product stored on a non-transitory computer-readable storage medium, the computer program product comprising computer-executable instructions for synthesis of natural-language text, comprising instructions for: receiving by a hardware processor a plurality of received information objects; selecting by the hardware processor among the plurality of received information objects at least one selected information object and, for each selected information object, an associated synthesis template in a template library, wherein the library includes at least one synthesis template, and wherein each synthesis template includes a template semantic-syntactic tree; generating by the hardware processor for each selected information object a synthesis semantic-syntactic tree based on the template semantic-syntactic tree of the associated synthesis template selected for the selected information object; and generating by the hardware processor natural language text based on each generated synthesis semantic-syntactic tree.
In some aspects, each received information object is associated with an ontological object and has a set of filled properties, each filled property having a value; each synthesis template is associated with an ontological object, each synthesis template includes a set of required properties; each synthesis template includes a set of optional properties; each synthesis template includes a validation script; the selecting of at least one selected information object and an associated synthesis template comprises, for each received information object, selecting in the template library synthesis templates associated with the same ontological object as the received information object; then, if any synthesis template is selected, selecting among the selected synthesis templates synthesis templates for each of which a set of required properties is contained in the set of filled properties of the received information object; then, if any synthesis template is selected, selecting among the selected synthesis templates synthesis templates with the largest set of required properties; then, if any synthesis template is selected, selecting among the selected synthesis templates synthesis templates for each of which the validation script validates the received information object; then, if any synthesis template is selected, selecting among the selected synthesis templates synthesis templates with the largest intersection of the set of optional properties with the set of filled properties of the received information object; and then, if any synthesis template is selected, selecting the received information object and associating one of the selected synthesis templates with the selected information object. 
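The sequence of selection steps recited above can be illustrated by the following minimal sketch. The class names InfoObject and Template, the function select_template, the representation of the validation script as a Python callable, and the choice of the first remaining template at the final step are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InfoObject:
    concept: str       # associated ontological object, e.g. "Occupation"
    properties: dict   # filled properties: name -> value


@dataclass
class Template:
    name: str
    concept: str
    required: frozenset
    optional: frozenset = frozenset()
    # Validation script modeled as a callable (assumption of this sketch).
    validate: Callable = lambda obj: True


def select_template(obj, library):
    filled = set(obj.properties)
    # Templates associated with the same ontological object.
    candidates = [t for t in library if t.concept == obj.concept]
    # Templates whose required properties are all filled.
    candidates = [t for t in candidates if t.required <= filled]
    # Templates with the largest set of required properties.
    if candidates:
        most = max(len(t.required) for t in candidates)
        candidates = [t for t in candidates if len(t.required) == most]
    # Templates whose validation script validates the object.
    candidates = [t for t in candidates if t.validate(obj)]
    # Templates with the largest intersection of optional and filled properties.
    if candidates:
        best = max(len(t.optional & filled) for t in candidates)
        candidates = [t for t in candidates if len(t.optional & filled) == best]
    # Associate one of the remaining templates (here, the first) or none.
    return candidates[0] if candidates else None


library = [
    Template("short", "Occupation", frozenset({"employee"})),
    Template("full", "Occupation", frozenset({"employee", "position"}),
             frozenset({"employer"})),
    Template("person", "Person", frozenset({"label"})),
]
fact = InfoObject("Occupation", {"employee": "Alexander", "position": "programmer"})
chosen = select_template(fact, library)
unmatched = select_template(InfoObject("Occupation", {"employer": "ABBYY"}), library)
```

In this sketch an information object for which no template survives all steps is simply not selected for synthesis.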
In some aspects, each selected information object has a set of filled properties, each filled property having a natural-language string value; each template semantic-syntactic tree comprises template nodes; each synthesis template comprises for each of at least some of the template nodes forming a substitution set of nodes a corresponding filled property; generating for each selected information object a synthesis semantic-syntactic tree comprises, for each template node of the associated synthesis template, beginning with a root node of the template semantic-syntactic tree: if the template node is not in the substitution set of nodes, generating in the synthesis semantic-syntactic tree an identical node; if the template node is in the substitution set of nodes and if the property corresponding to the template node is a filled property of the selected information object, generating in the synthesis semantic-syntactic tree a node or a sub-tree based on analysis of the natural-language string value of the filled property of the selected information object corresponding to the template node; and repeating the prior two steps for each child node of the template semantic-syntactic tree. In some aspects, generating for each selected information object a synthesis semantic-syntactic tree further comprises, if the template node is in the substitution set of nodes, if the property corresponding to the template node is a filled property of the selected information object, and if the filled property of the selected information object has more than one natural-language string value, for each natural-language string value, generating in the synthesis semantic-syntactic tree a node or a sub-tree based on the natural-language string value corresponding to the template node; and connecting the generated nodes with a coordinating link. In some aspects, the plurality of received information objects forms an RDF graph. 
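The generation of a synthesis semantic-syntactic tree from a template tree, as recited above, can be sketched as follows. The Node class, the analyze stand-in for semantic-syntactic analysis of a string value, and the omission of a node whose property is not filled are assumptions of this sketch; an actual analyzer would produce a node or a sub-tree with grammatical and semantic information.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    text: str
    prop: Optional[str] = None   # set only for nodes in the substitution set
    children: list = field(default_factory=list)
    coordinated: list = field(default_factory=list)  # nodes joined by a coordinating link


def analyze(value):
    """Stand-in for semantic-syntactic analysis of a string value (assumption)."""
    return Node(value)


def synthesize(template_node, properties):
    if template_node.prop is None:
        # Not in the substitution set: generate an identical node.
        new = Node(template_node.text)
    else:
        values = properties.get(template_node.prop)
        if not values:
            return None  # property not filled: this sketch omits the node
        # One node per string value, connected with a coordinating link.
        first, *rest = [analyze(v) for v in values]
        first.coordinated = rest
        new = first
    # Repeat for each child node of the template tree.
    for child in template_node.children:
        built = synthesize(child, properties)
        if built is not None:
            new.children.append(built)
    return new


# Template sentence: "Alexander works as a programmer at ABBYY".
template = Node("works", children=[
    Node("Alexander", prop="employee"),
    Node("programmer", prop="position"),
    Node("ABBYY", prop="employer"),
])
fact = {"employee": ["Pavel Durov"], "position": ["founder", "director"]}
tree = synthesize(template, fact)
```

Here the property "position" has two string values, so the second value is attached to the first through the coordinating link, and the unfilled property "employer" yields no node.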
Some aspects further comprise: forming at least one group of selected information objects associated with the same synthesis template; and generating for the at least one group a synthesis semantic-syntactic tree based on the template semantic-syntactic tree of the associated synthesis template.
The above simplified summary of example aspects serves to provide a basic understanding of the present disclosure. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects of the present disclosure. Its sole purpose is to present one or more aspects in a simplified form as a prelude to the more detailed description of the disclosure that follows. To the accomplishment of the foregoing, the one or more aspects of the present disclosure include the features described and particularly pointed out in the claims.
Example aspects are described herein in the context of a system, method and computer program product for text synthesis based on extracted information in the form of an RDF graph making use of templates. Those of ordinary skill in the art will realize that the following description is illustrative only and is not intended to be in any way limiting. Other aspects will readily suggest themselves to those skilled in the art having the benefit of this disclosure. Reference will now be made in detail to implementations of the example aspects as illustrated in the accompanying drawings. The same reference indicators will be used to the extent possible throughout the drawings and the following description to refer to the same or like items.
The present specification presents a method and a system enabling a text synthesis on the basis of an RDF graph making use of templates. The proposed text synthesis method is able to create annotations which include brief information on the most important facts mentioned in the text. However, the text synthesis based on an RDF graph using templates is not limited to application solely in the field of annotation.
In step 110, text data is input to the system by the text data receiving module 10. This text data can either be previously prepared, i.e., tagged, or not (not tagged). Next, the text data in step 120 is subjected to semantic-syntactic analysis by the syntactic-semantic analysis module 20. The primary principles of the semantic-syntactic analysis based on linguistic descriptions have been specified in the U.S. Pat. No. 8,078,450, incorporated herein by reference in its entirety, and these principles lie at the core of the present disclosure. Since the semantic-syntactic analysis is based on the use of language-independent semantic units, the present disclosure is likewise independent of language and can function with one or several natural languages.
The semantic-syntactic text analyzer is a module able to analyze text data: an individual sentence, a text or a collection of texts; and obtain for the text data a forest of semantic-syntactic structures, each of which constitutes a graph, in particular a tree. The nodes and edges of the graph are supplemented with grammatical and semantic information used afterwards to identify objects, their attributes and relations, and also for synthesis of sentences.
Deep analysis includes lexical-morphological, syntactic and semantic analysis of each sentence of the text (corpus of texts), as a result of which language-independent semantic structures are constructed for the sentences in which each word is assigned to a corresponding lexical and/or semantic class (SC) in a universal Semantic Hierarchy (SH).
The Semantic Hierarchy (SH) constitutes a lexical-semantic dictionary containing the entire language lexicon needed for the text analysis and synthesis. The Semantic Hierarchy is organized in the form of a tree of parent-like relations, at the nodes of which are found the Semantic Classes (SC), which are universal for all languages and reflect a certain conceptual content, and the lexical classes (LC), which are specific to a language, being the descendants of a certain semantic class. The totality of lexical classes of one Semantic Class determines a semantic field—the lexical expression of the conceptual content of the Semantic Class. The most widespread concepts are found at the upper levels of the hierarchy.
A child semantic class in the Semantic Hierarchy inherits the majority of attributes of its direct parent and all ancestor semantic classes. For example, the semantic class SUBSTANCE is a child semantic class of the class ENTITY and the parent semantic class for the classes GAS, LIQUID, METAL, WOOD_MATERIAL, and so on.
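Inheritance along the Semantic Hierarchy can be illustrated by the following sketch. The class SemanticClass and the attribute names physical, countable and state are hypothetical; for simplicity the sketch lets a child inherit every attribute of its ancestors unless the child overrides it, whereas the hierarchy described above inherits the majority of attributes.

```python
class SemanticClass:
    def __init__(self, name, parent=None, **own):
        self.name = name
        self.parent = parent
        self.own = own  # attributes declared on this class itself

    def ancestors(self):
        """Names of all ancestor classes, nearest parent first."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def attributes(self):
        """Attributes inherited from ancestors, overridden by own attributes."""
        inherited = self.parent.attributes() if self.parent else {}
        return {**inherited, **self.own}


entity = SemanticClass("ENTITY", physical=True)
substance = SemanticClass("SUBSTANCE", entity, countable=False)
gas = SemanticClass("GAS", substance, state="gaseous")
```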
Let us turn to
The morphological model of the semantic-syntactic analyzer exists outside of the semantic hierarchy. For each language there exists a list of lexemes and their paradigms. Within the semantic hierarchy, each lexeme can be attached to one or more lexical classes. A lexical class usually links together several lexemes.
Each node of the obtained semantic-syntactic tree is attached to a particular lexical class of the semantic hierarchy, which presupposes an elimination of the ambiguity of words in the analysis process. Each node also holds the grammatical and semantic information which determines its role in the text, namely, a set of grammemes and semantemes.
Each arc of the semantic-syntactic tree holds a surface slot (i.e., the syntactic function of the dependent node, such as $Subject or $Object_Direct) and a deep slot (i.e., the semantic role of the dependent node, such as Agent or Experiencer). The set of deep slots is universal and does not depend on the language, unlike the set of surface slots, which differs from one language to another.
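A possible in-memory representation of such an arc is sketched below. The Arc class and the helper dependents_in_deep_slot are assumptions of this sketch, the slot assignments for the sample sentence "Alexander wrote a program" are illustrative, and the deep slot name "Object" is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    head: str
    dependent: str
    surface_slot: str  # language-specific syntactic function
    deep_slot: str     # language-independent semantic role


arcs = [
    Arc("wrote", "Alexander", "$Subject", "Agent"),
    Arc("wrote", "program", "$Object_Direct", "Object"),
]


def dependents_in_deep_slot(arcs, deep_slot):
    """Dependent nodes filling the given semantic role."""
    return [arc.dependent for arc in arcs if arc.deep_slot == deep_slot]


agents = dependents_in_deep_slot(arcs, "Agent")
```

Because the deep slots do not depend on the language, a lookup by deep slot such as the one above remains the same when the surface slots change from one language to another.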
In this disclosure, the semantic-syntactic analyzer is used both for the deep analysis of sentences in a text presented to the system, by a user for example, and in the process of creation of templates which will then be used for the text synthesis. This routine will be described below.
Let us return to
The information extraction process is controlled by a system of production rules. There are two types of production rules: rules for interpretation of fragments of the semantic-syntactic trees and rules for identification of information objects.
The rules of interpretation let us describe fragments of the semantic-syntactic trees, upon the discovery of which certain sets of logic statements enter into play. One rule constitutes a production whose left part is a standard fragment of the semantic-syntactic tree, while its right part is a set of expressions describing the logic statements.
A pattern of a semantic-syntactic tree (or tree template) constitutes a formula whose atomic elements are checks for various attributes of the nodes of the semantic-syntactic trees (whether or not a particular grammeme/semanteme is present, what lexical/semantic class does it belong to, is it found in a certain surface/deep slot, and much more).
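By way of illustration, a pattern composed of atomic attribute checks and a rule of interpretation built on it can be sketched as follows. The representation of a tree node as a dictionary, the helper names, the semantic class "HUMAN" and the emitted statement are all hypothetical; the actual rule language is not reproduced here.

```python
def has_semantic_class(name):
    return lambda node: node.get("semantic_class") == name


def in_deep_slot(slot):
    return lambda node: node.get("deep_slot") == slot


def has_grammeme(grammeme):
    return lambda node: grammeme in node.get("grammemes", ())


def all_of(*checks):
    """Formula that holds when every atomic check holds."""
    return lambda node: all(check(node) for check in checks)


# Left part of the production: a pattern over node attributes.
pattern = all_of(has_semantic_class("HUMAN"), in_deep_slot("Agent"))


def interpret(node):
    """Right part of the production: logic statements emitted on a match."""
    if pattern(node):
        return [(node["id"], "type", "Person")]
    return []


matched = interpret({"id": "n1", "semantic_class": "HUMAN", "deep_slot": "Agent"})
skipped = interpret({"id": "n2", "semantic_class": "HUMAN", "deep_slot": "Experiencer"})
```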
The rules of identification are used in those situations when it is necessary to merge (combine) already extracted information objects. A rule of identification constitutes a production whose left part describes the limits to be placed on two information objects, upon the fulfillment of which the information objects are deemed to be congruent. The right part of all rules of identification is deemed to be identical (it is a statement about the identity of the two objects) and is not written down.
The method of extracting information with the use of production rules is illustrated in
According to the conceptualization of the RDF (Resource Description Framework), which is a data presentation model, each information object extracted from the text data in the information extraction process described above is assigned a unique identifier. Specifically, all of the extracted information is presented in the form of a set of triplets <s,p,o>, where s is the identifier of the information object, p is the identifier of its attribute (predicate), and o is the value of the given attribute.
An example of an actual RDF graph is as follows:
</BasicEntity:firstname>
</Basic:label>
</BasicFact:position>
B ABBYY
All of the RDF data extracted from the texts is coordinated with a model of the subject field (ontology) in which the information extraction module is functioning. The ontology specifies which attributes the information objects can have and what object relations can exist between them. An ontology is a formal explicit description of a certain subject field. The primary components of an ontology are concepts (or in other words, classes), instances, relations, and attributes. The concepts of an ontology constitute a formally described nominative set of instances which are generalized in terms of a certain characteristic. An example of a concept is the set of all people unified into the concept “Person”. The concepts in an ontology are combined into a taxonomy, i.e., a hierarchical structure. An instance is a specific object or phenomenon of the subject field which belongs to a concept. For example, the instance Yury_Gagarin belongs to the concept “Person”. The relations are formal descriptions between concepts which determine what links can be established between the instances of given concepts.
The congruence of the data generated by the information extraction module with the model of the subject field is automatically ensured. On the one hand, this is made possible by the syntax of the language of the information extraction rules. On the other hand, the system has special validation mechanisms built into it, which do not allow ontologically incorrect data to occur.
Besides the actual RDF graph, coordinated with an OWL ontology, information is kept on the link between the identified information objects and the source text (annotation or “highlighting” of objects). The RDF graph along with information on the annotations of the information objects shall be termed hereinafter the annotated RDF graph.
Let us return to
The text synthesis module is responsible for creating text on the basis of the extracted information presented in the form of the RDF graph.
The architecture of the text synthesis module enables universal use thereof. Specifically, the module does not encode an explicit dependency on any particular natural language or fact, which makes it possible to synthesize text without modification of the text synthesis module itself in the event of expansion of the ontology, such as by adding a user ontology, or adding a new language.
Moreover, the text synthesis module has a built-in filter for the facts being synthesized, allowing us not to synthesize text for certain extracted facts, such as improperly extracted facts. Furthermore, the module performs a ranking of the output, so that more important generated facts are placed on top, less important ones lower down.
For this, the ontologies being used are supplemented with new ontological objects, or synthesis templates 145 (
The synthesis of text based on the information objects extracted in step 130 (in a specific example, based precisely on extracted facts) is done with the use of compiled templates 145. Templates are created by the user for each type of fact, wherein it is possible to create several templates for each fact. An illustrative description of a template is presented below. Two example computer interfaces allowing users to create templates are shown in
In one aspect, the template includes the following components:
The sentence (hereinafter, “template sentence”) in one of the natural languages is the foundation of the template. It is the template sentence to which the list of substitutions refers. The template sentence is used afterwards in constructing the text synthesis tree.
Let us consider the fact “Occupation”, which was extracted from the text by means of the information extraction module. This fact corresponds to the kind of employment. In the majority of cases, the fact “Occupation” can be formulated in general as: “So and so works somewhere as so and so”. For example, one can use as the template sentence for the fact “Occupation” the sentence: “Alexander works as a programmer at ABBYY”. The list of substitutions for this template is as follows:
In addition to this, the template should allow for those properties from the list of substitutions which must be fulfilled in the extracted information object. Required properties impose the following condition on the use of the template: if one of the required properties of the template is not fulfilled for the extracted information object (fact), the given template will not be used for the text synthesis. The list of required properties of the template from the above-given example includes two out of three possible properties, namely, the properties “position” and “employee”. These properties must be fulfilled in the extracted fact in order to use the aforementioned template “Alexander works as a programmer at ABBYY”. An optional property of the extracted fact can remain unfulfilled and the template will still be used. For example, if the property “employer” is not fulfilled in the extracted fact (this property does not enter into the list of mandatory properties of the above-given example), this template can still be used as before during the text synthesis.
The validation script imposes certain limitations which serve as a check on the properties of the extracted facts. The validation script to a certain degree takes part in the filtering of facts in order to produce the synthesis. As the validation script here, we can use a script (condition) to check whether the property “employee” in the extracted fact is nominative, i.e., a proper name. This lets us filter out “garbage” (mistakenly identified from the text) facts. For example, if the validation script does not impose the condition that the property “employee” of the extracted fact is nominative, the text synthesis might produce the following sentence: “Programmer works as a programmer”, which is absurd.
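The check described above can be sketched as follows. The representation of a fact as a dictionary and the flag is_proper_name, assumed here to be supplied by the preceding analysis, are hypothetical; the actual validation scripts use the syntax of the information extraction rules, which is not reproduced here.

```python
def employee_is_proper_name(fact):
    """Validation condition: the property "employee" must be a proper name."""
    employee = fact.get("employee")
    return bool(employee) and employee.get("is_proper_name", False)


good_fact = {
    "employee": {"label": "Alexander", "is_proper_name": True},
    "position": {"label": "programmer"},
}
garbage_fact = {
    "employee": {"label": "Programmer", "is_proper_name": False},
    "position": {"label": "programmer"},
}

accepted = employee_is_proper_name(good_fact)
rejected = employee_is_proper_name(garbage_fact)
```

With this condition the second fact is filtered out before synthesis, so the sentence “Programmer works as a programmer” is not produced.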
In order to use a sentence in a natural language as a template, the template compilation routine is launched.
After the semantic-syntactic tree has been constructed for the template sentence, the nodes of the semantic-syntactic tree are compared with the properties from the list of substitutions of the template 5020. The comparison is done automatically. In the semantic-syntactic tree a search is made for components (nodes) corresponding to the words given in the list of substitutions for the template. These components (nodes) of the semantic-syntactic tree are coordinated with the properties given in the list of substitutions. For example, the node “Alexander” in the semantic-syntactic tree in
The results of the compilation of the created template, namely, the semantic-syntactic tree, the list of substitutions in which the properties are coordinated with the position in the deep structure of the template sentence, the list of required properties of the template, the language of the template sentence and the validation script, are saved 5030 in the primary ontological model as an object of a particular type—a “compiled template” or a synthesis template.
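A minimal sketch of the compilation result is given below. The representation of the tree as nested dictionaries, the encoding of a position in the deep structure as a tuple of child indices, and the function name compile_template are assumptions of this sketch; the validation script is omitted for brevity.

```python
def compile_template(tree, word_to_property, required, language):
    """Coordinate tree nodes with properties and collect the compiled template."""
    substitutions = {}

    def walk(node, path):
        # Search for nodes corresponding to words in the list of substitutions.
        prop = word_to_property.get(node["word"])
        if prop is not None:
            substitutions[path] = prop  # "position in the deep structure" -> "property"
        for index, child in enumerate(node.get("children", [])):
            walk(child, path + (index,))

    walk(tree, ())
    return {
        "tree": tree,
        "substitutions": substitutions,
        "required": set(required),
        "language": language,
    }


# Simplified tree for "Alexander works as a programmer at ABBYY".
tree = {"word": "works", "children": [
    {"word": "Alexander"},
    {"word": "programmer"},
    {"word": "ABBYY"},
]}
compiled = compile_template(
    tree,
    {"Alexander": "employee", "programmer": "position", "ABBYY": "employer"},
    required=["employee", "position"],
    language="en",
)
```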
A certain set of compiled templates is attached to an existing concept of the ontology, such as the concept “BasicFactOccupation”. Thus, the concept in the ontology to which the templates are attached saves a reference to the file of templates existing therefor, which can be used for the given concept during the text synthesis. This is useful in order to be able to determine the set of templates pertaining to an information object (in the given specification, to a fact) which can be used for the text synthesis on the basis of this fact. Then, from this set of templates, the only suitable template for the text synthesis is selected.
The information objects can be of different types, for example, an information object can be a fact, a person, or a location. The type of the information object refers to the corresponding concept from the ontology: “BasicFact”; “Person”; “Location”. In the information extraction process, a “bag of statements” is created: a set of not mutually contradictory logic statements about the information objects and their properties. The end result of the working of the information extraction module can be an RDF graph. In accordance with the conceptualization of the RDF (Resource Description Framework), which is a data presentation model, each information object is assigned a unique identifier. Specifically, all of the extracted information is presented in the form of a set of triplets <s,p,o>, where s is the identifier of the information object, p is the identifier of its attribute (predicate), and o is the value of the given attribute.
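The triplet representation can be illustrated by the following sketch. The identifiers fact1, person1 and org1, the predicate names, and the helpers properties_of and label_of are hypothetical and serve only to show how the properties of an information object may be read from a set of triplets <s,p,o>.

```python
from collections import defaultdict

# Each triplet is (s, p, o): object identifier, attribute identifier, value.
triples = [
    ("fact1", "type", "Occupation"),
    ("fact1", "employee", "person1"),
    ("fact1", "position", "programmer"),
    ("fact1", "employer", "org1"),
    ("person1", "type", "Person"),
    ("person1", "label", "Alexander"),
    ("org1", "type", "Organization"),
    ("org1", "label", "ABBYY"),
]


def properties_of(subject, triples):
    """Collect the attributes of one information object."""
    props = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            props[p].append(o)
    return dict(props)


def label_of(value, triples):
    """Readable string for a value: the label of an object, or the string itself."""
    labels = properties_of(value, triples).get("label")
    return labels[0] if labels else value


fact_properties = properties_of("fact1", triples)
employer_label = label_of("org1", triples)
```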
As already mentioned above, within the information extraction module a set of properties and values of the given properties exists for each information object extracted in the course of the text analysis. Within the task of text synthesis making use of the RDF graph, the values of the properties of the extracted information object (fact) are examined and used in the template(s) for the given fact.
The properties can be conventionally divided into two types. The first type includes properties which can be explicitly presented in the template. Examples of such properties are: the name of a person, the title of a position, the name of an organization, and so on. Thus, the value of the property “position” is always represented by a text string, and therefore it will appear explicitly in the template.
The second type includes properties which do not appear explicitly in the template. Such properties can be: the degree of trust in the extracted information object (fact), the degree of completion of an action, and so on. These properties are included in the list of required properties and their presence in the extracted information object (fact) is checked by the validation script.
During the text synthesis both types of properties are processed. The values of the properties of the first type refer to the string type. If the property is an information object, all such objects will have the property “label”. For example, the extracted fact “Occupation” has the property “employer”, the value of this property being an information object with the concept “Organization”, in whose name (and concurrently in the label) is indicated “ABBYY”.
In this property the system places some short readable information about the information object in the form of a string, which is sufficient during the synthesis. Examples of such information objects will be: “Pavel Durov” for a person, “ABBYY” for an organization, and so on.
After referring the value of the properties to the string type, the string is subjected to semantic-syntactic analysis, and it is incorporated into the deep structure (or in other words, into the semantic-syntactic tree) of the sentence being synthesized. In order to understand where the analyzed string (values) of the property needs to be placed in the deep structure of the sentence being synthesized, the list of substitutions is used. The format of the list of substitutions is: “position in the deep structure”—“property”.
In the illustrative example of the template sentence “Alexander works as a programmer at ABBYY”, given above, the node “Alexander” will be substituted with the property “employee”, the node “programmer” with the property “position”, and the node “ABBYY” with the property “employer”. As a result of this, during the text synthesis on the basis of the new extracted fact “Occupation” of this template, the template will be filled with the values of the properties of the already extracted fact and a new sentence will be synthesized. The synthesis procedure on the basis of templates is described in greater detail below.
If there are no values for the property from the list of substitutions of the template for the extracted object (fact), i.e., the indicated property remains empty, then the word corresponding to this property is removed from the tree of the sentence being synthesized. However, if not a single property is filled, the meaningless phrase “works” is synthesized. To prevent this from happening, the templates have lists of required properties, as indicated above. If even one property from the list of required properties is missing, this template cannot be used for the text synthesis.
The properties of the second type are not explicitly inserted into the sentence itself during the synthesis, but they often alter its structure or an individual word. These lack a readable parameter “label”. One of the possible variants for processing is to write a separate template for each value of the property and their groups, since there are not many properties of the second category for the fact (1-5), or values of such properties (not more than 4). An example of such a sentence is “Alexander finished working at ABBYY in 2010”; in the property for “degree of completion of the action” there will stand “finished”. However, the problem arises of selecting the appropriate template for the fact, since the number of properties that are required and fulfilled for the template might be the same, and the only difference will be the value of the specific property. The validation scripts are also used precisely to handle this problem.
The validation scripts are a powerful instrument used to assign conditions for verification of extracted facts, thereby creating as many accurate templates as desired. They inherit the syntax of the rules of extraction from the information extraction module and have access to all the properties of the incoming information object extracted from the text. They can determine the type and value of a property, and in the event that the property is also an information object, obtain access to the properties of that object. After being launched, the validation script indicates whether the template is suitable for the synthesis or not.
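As a rough illustration of what a validation script decides, the following Python sketch models each script as a predicate over the extracted fact. The actual scripts use the syntax of the extraction rules, which is not reproduced here; the function names and the property names “completion”, “employer” and “type” are hypothetical.

```python
from typing import Any, Callable, Dict, List

Fact = Dict[str, Any]
ValidationScript = Callable[[Fact], bool]


def finished_only(fact: Fact) -> bool:
    """Accept the fact only if the action is marked as finished."""
    return fact.get("completion") == "finished"


def employer_is_organization(fact: Fact) -> bool:
    """Inspect a property that is itself an information object."""
    employer = fact.get("employer")
    return isinstance(employer, dict) and employer.get("type") == "Organization"


def template_is_suitable(scripts: List[ValidationScript], fact: Fact) -> bool:
    """A template is suitable only if every one of its scripts accepts the fact."""
    return all(script(fact) for script in scripts)


fact = {
    "employee": "Alexander",
    "employer": {"type": "Organization", "name": "ABBYY"},
    "completion": "finished",
}
print(template_is_suitable([finished_only, employer_is_organization], fact))
```

A template for “finished working at” sentences would carry `finished_only`, so it is rejected for facts whose action is still ongoing even when the same properties are filled.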
After conducting the semantic-syntactic text analysis (120,
The text synthesis module 40, as shown in
From the obtained RDF graph those information objects are identified for which it is possible to perform the synthesis, i.e., those information objects extracted from the text for which at least one template exists that is suitable for the synthesis. The creation and compilation of templates for each type of fact has been described above.
Thus, in step 501 a set of templates is formed, which were created for the given type of extracted information object (fact). Since the extracted information object is a concept or instance of an ontology, these templates can be saved (5030,
Next, in step 503, the list of required properties of each template from the set formed in step 501 is checked. In particular, it is checked whether the extracted object (fact) fills every required property of the template. At this stage, those templates are excluded from further consideration whose lists of required properties have even one property not filled for the information object extracted from the text.
From the remaining templates, only the templates with the longest list of required properties are selected. This stage is needed to select the most accurate templates for the extracted information object (fact), i.e., the more properties are labeled as required in the template, the more accurately the template will synthesize the fact.
In step 505, additional stages are initiated in the checking of the templates by means of the validation script, and those templates which do not pass this check stage are eliminated from further consideration. As already described above, the validation script imposes certain conditions on the checking of the extracted information object (fact), the properties of this fact, and so on.
The remaining templates are compared (507) in terms of the number of non-empty properties which appear in the list of substitutions, and among these templates one selects those templates whose list of non-empty properties is the largest. If as a result of this analysis more than one template remains, a random template (511) will be selected from among them. If no templates remain, the information object extracted from the text is not suitable for the synthesis (513).
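The selection steps 501 to 513 can be summarized in the following Python sketch. It is a simplified reading of the procedure, not the actual implementation: the `Template` class, its fields and the `select_template` function are hypothetical, and the intermediate stage is read here as keeping the templates with the longest list of required properties, in line with the rationale given for that stage.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Fact = Dict[str, Any]


@dataclass
class Template:
    name: str
    fact_type: str
    required: List[str]                      # list of required properties
    substitutions: List[str]                 # properties in the list of substitutions
    validate: Optional[Callable[[Fact], bool]] = None  # validation script, if any


def is_filled(fact: Fact, prop: str) -> bool:
    return fact.get(prop) not in (None, "")


def select_template(fact: Fact, templates: List[Template], rng=random) -> Optional[Template]:
    # Step 501: templates created for this type of fact.
    candidates = [t for t in templates if t.fact_type == fact.get("type")]
    # Step 503: drop templates with an unfilled required property.
    candidates = [t for t in candidates if all(is_filled(fact, p) for p in t.required)]
    if not candidates:
        return None                          # step 513: fact not suitable
    # Keep the templates with the longest list of required properties.
    longest = max(len(t.required) for t in candidates)
    candidates = [t for t in candidates if len(t.required) == longest]
    # Step 505: validation scripts.
    candidates = [t for t in candidates if t.validate is None or t.validate(fact)]
    if not candidates:
        return None                          # step 513
    # Step 507: most non-empty properties in the list of substitutions.
    def non_empty(t: Template) -> int:
        return sum(is_filled(fact, p) for p in t.substitutions)
    best = max(non_empty(t) for t in candidates)
    candidates = [t for t in candidates if non_empty(t) == best]
    # Step 511: a random template among the remaining ones.
    return rng.choice(candidates)
```

For an “Occupation” fact with “employee”, “employer” and “position” filled, a template whose validation script requires a finished action is eliminated in step 505, and the plain “works as” template is returned.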
The procedure described in
After determining all extracted information objects (facts) that are suitable for the text synthesis, and also selecting for each information object a suitable template, a separate synthesis tree is generated on the basis of the (semantic-syntactic) tree of the template. The procedure is illustrated in
According to one example aspect, in step 601 the semantic-syntactic tree that was constructed for the template sentence is entered. This is the basis for the deep structure of the sentence being synthesized, and it enters the synthesis module.
In step 603, one moves along the depth of the semantic-syntactic tree of the template sentence (from the root to the leaves). In parallel with this movement, a synthesis semantic-syntactic tree is created. In step 605, each node of the semantic-syntactic tree of the template sentence is checked for its presence in the list of substitutions of the given template.
If the node in the semantic-syntactic tree of the template corresponding to a word in the sentence is not present in the list of substitutions, a full analogue of this node is created in the synthesis tree (609). Then the child nodes of this node in the semantic-syntactic tree of the template are analyzed (617).
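The traversal of steps 603 to 617 can be sketched as a recursive copy of the template tree. The sketch below is a simplification under stated assumptions: nodes carry only a word (no semantic class, lexeme or slots), the list of substitutions is keyed by word, and the `Node` class and `build_synthesis_tree` function are hypothetical. The handling of substituted nodes follows the filling rule described earlier (replace with the property value, remove if the property is empty); dropping the whole subtree of a removed node is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    word: str
    children: List["Node"] = field(default_factory=list)


def build_synthesis_tree(node: Node, substitutions: Dict[str, str],
                         fact: Dict[str, str]) -> Optional[Node]:
    """Walk the template tree from root to leaves and build the synthesis tree."""
    prop = substitutions.get(node.word)        # step 605: is the node substituted?
    if prop is None:
        new_node = Node(node.word)             # step 609: full analogue of the node
    else:
        value = fact.get(prop)
        if not value:
            return None                        # empty property: node is removed
        new_node = Node(value)                 # node filled from the fact
    for child in node.children:                # step 617: analyze the child nodes
        new_child = build_synthesis_tree(child, substitutions, fact)
        if new_child is not None:
            new_node.children.append(new_child)
    return new_node


template_tree = Node("works", [Node("Alexander"), Node("programmer"), Node("ABBYY")])
subs = {"Alexander": "employee", "programmer": "position", "ABBYY": "employer"}
tree = build_synthesis_tree(template_tree, subs, {"employee": "Nikolai", "employer": "ABBYY"})
print([child.word for child in tree.children])
```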
Let us return to
Let us return to
After the synthesis semantic-syntactic tree has been constructed on the basis of the template semantic-syntactic tree, the text generation or synthesis takes place, as is described in detail in applications describing machine translation: US Patent Application Publication No. US 2008/0091405 and US Patent Application Publication No. US 2008/0086298, as well as U.S. Pat. No. 8,195,447 and U.S. Pat. No. 8,214,199, each of which is incorporated herein by reference in its entirety. The input of this module receives the language (output language) in which the text synthesis is to be done and the semantic-syntactic tree, in which every node is assigned its semantic class, lexeme, semantemes, pro-form and syntactic paradigm, and the edges indicate the surface and deep slots. Besides the semantic-syntactic tree, it is possible to use any treelike outcome of analysis of the sentence. The synthesizer then constructs the sentence according to the specified tree, on the basis of knowledge about the particular language which is contained in morphology dictionaries.
Homogeneous facts are often encountered during text analysis. If a separate sentence is synthesized for each fact extracted from the text, a large number of sentences for identical facts will be generated in the synthesized text. For example, the following sentences might be synthesized in this way: “Alexander works as a programmer at ABBYY” and “Nikolai works as a programmer at ABBYY”. It is optimal to combine these sentences and synthesize a single sentence which will include both facts in it. This reduces the size of the synthesized text and improves its quality.
Several conditions must be met for facts to be combined. Firstly, the facts need to be homogeneous. Facts are homogeneous if they pertain to the same concept in an ontology and have the identical properties filled from the standpoint of the template. If the facts are not homogeneous, then a distortion of one of these facts will occur after the synthesis, or there will be a loss of information.
Secondly, one cannot combine too many facts, or else the sentence will be overloaded. For this, a certain threshold is established when combining homogeneous facts.
Thirdly, not more than one property should be different in the objects being combined. If there are two or more distinguishing properties, it will be hard to determine in the resulting sentence which property refers to which object. Otherwise, the following sentence might be obtained in the text synthesis: “Alexander and Nikolai work as programmer and designer at ABBYY and Yandex”.
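The three conditions can be expressed as a single check, sketched below in Python. The function names, the flat dictionary representation of a fact and the threshold value of 3 are assumptions made for illustration; the disclosure does not state the threshold.

```python
from typing import Dict, List


def differing_properties(facts: List[Dict[str, str]], properties: List[str]) -> List[str]:
    """Properties whose values are not the same across all facts."""
    return [p for p in properties if len({f.get(p) for f in facts}) > 1]


def can_combine(facts: List[Dict[str, str]], template_properties: List[str],
                max_facts: int = 3) -> bool:
    # Secondly: not too many facts in one sentence (threshold is an assumed value).
    if len(facts) < 2 or len(facts) > max_facts:
        return False
    # Firstly: same concept and the same properties filled for the template.
    if len({f.get("type") for f in facts}) != 1:
        return False
    filled_sets = {frozenset(p for p in template_properties if f.get(p)) for f in facts}
    if len(filled_sets) != 1:
        return False
    # Thirdly: not more than one distinguishing property.
    return len(differing_properties(facts, template_properties)) <= 1


props = ["employee", "position", "employer"]
a = {"type": "Occupation", "employee": "Alexander", "position": "programmer", "employer": "ABBYY"}
b = {"type": "Occupation", "employee": "Nikolai", "position": "programmer", "employer": "ABBYY"}
c = {"type": "Occupation", "employee": "Nikolai", "position": "designer", "employer": "Yandex"}
print(can_combine([a, b], props), can_combine([a, c], props))
```

The pair (a, b) differs only in “employee” and may be combined, while (a, c) differs in three properties and would lead to an ambiguous sentence of the kind quoted above.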
In step 803, a comparison is made between the extracted facts of the templates.
To synthesize homogeneous facts, once templates have been obtained for all the objects (facts), the facts must be grouped (805) by template, so that identical facts with identical templates fall into the same group.
Next (807), a processing is done for the group of facts, this processing being illustrated in
After performing these modifications of the algorithm, the synthesis takes place with combining of the homogeneous facts.
Let us consider the example of the synthesis of homogeneous facts. We shall make a semantic-syntactic analysis of the following sentences: “Nikolai works as a designer at ABBYY. Vasily has found a job at ABBYY in the position of designer”. In each of the sentences, we identify the facts with the help of the information extraction module. For each of the facts we find a template, and on this basis we construct a synthesis template.
The synthesis module will put out this kind of sentence as its response: “Nikolai and Vasily are designers at ABBYY”.
As can be seen, two facts have been combined in one sentence here, since there is only one distinguishing property, namely the property “employee”.
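The grouping of stage 805 and the merging of a group can be illustrated on the Nikolai and Vasily example. In this Python sketch the template is identified by a hypothetical name, facts are flat dictionaries, and the merged fact simply stores the list of values for the distinguishing property; generating the plural form “are designers” is left to the synthesizer and is not modeled.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Fact = Dict[str, Any]


def group_by_template(pairs: List[Tuple[Fact, str]]) -> Dict[str, List[Fact]]:
    """Stage 805: put facts that received the same template into one group."""
    groups: Dict[str, List[Fact]] = defaultdict(list)
    for fact, template_name in pairs:
        groups[template_name].append(fact)
    return dict(groups)


def merge_group(facts: List[Fact], properties: List[str]) -> Optional[Fact]:
    """Merge a group into one fact if at most one property differs."""
    differing = [p for p in properties if len({f.get(p) for f in facts}) > 1]
    if len(differing) > 1:
        return None                      # ambiguous: synthesize separately
    merged = dict(facts[0])
    for prop in differing:
        merged[prop] = [f.get(prop) for f in facts]
    return merged


pairs = [
    ({"employee": "Nikolai", "position": "designer", "employer": "ABBYY"}, "works_as"),
    ({"employee": "Vasily", "position": "designer", "employer": "ABBYY"}, "works_as"),
]
groups = group_by_template(pairs)
print(merge_group(groups["works_as"], ["employee", "position", "employer"]))
```

The merged fact keeps “designer” and “ABBYY” once and carries both employees, which is the input needed to synthesize a single sentence for the two facts.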
The hardware (900) as a rule has a certain number of inputs and outputs for transmitting information to and receiving information from the outside. The user or operator interface of the hardware (900) can include one or more user entry devices (906), such as a keyboard, mouse, imaging device, etc., and also one or more output devices (908), such as a liquid crystal or other display and sound reproduction devices (speakers).
To obtain additional volume for data storage, one uses data storage devices (910) such as diskettes or other removable disks, hard disks, direct access storage devices (DASD), optical drives (compact disks etc.), DVD drives, magnetic tape storages, and so on. The hardware (900) can also include a network connection interface (912), such as LAN, WAN, Wi-Fi, Internet and others, for communicating with other computers located in the network. In particular, one can use a local-area network (LAN) or a wireless Wi-Fi network that is not connected to the Internet. It should be noted that the hardware (900) also includes various analog and digital interfaces for connection of the processor (902) and other components of the system (904, 906, 908, 910 and 912).
The hardware (900) runs under the control of an Operating System (OS) (914), which launches the various applications, components, programs, objects, modules, etc., in order to carry out the process described here. The application software should include an application to identify semantic ambiguity of language. One can also include a client dictionary, an application for automated translation, and other installed applications for imaging of text and graphic content (text processor etc.). Besides this, the applications, components, programs and other objects, collectively denoted by the symbol 916 in
All the routine operations in the use of the implementations can be executed by the operating system or by separate applications, components, programs, objects, modules or sequences of instructions, generically termed “computer programs”. The computer programs usually constitute a series of instructions stored at different times in different data storage and memory devices on the computer. After reading and executing the instructions, the processors perform the operations needed to execute the elements of the described implementation. Several variants of implementations have been described in the context of fully functioning computers and computer systems. Those skilled in the art will appreciate that the implementations can also be distributed as program products on any type of information media. Examples of such media are volatile and non-volatile memory devices, such as diskettes and other removable disks, hard disks, optical disks (such as CD-ROM and DVD), flash disks and many others. Such a program package can be downloaded via the Internet.
In the specification presented above, many specific details have been presented solely for explanation. It is obvious to the specialists in this field that these specific details are merely examples. In other cases, structures and devices have been shown only in the form of a block diagram to avoid ambiguity of interpretations.
The references in this specification to “one variant implementation/realization” or “variant implementation/realization” mean that the specific feature, structure or characteristic described for the variant realization is a component of at least one variant realization. The use of the phrase “in one variant realization” in different parts of the specification does not mean that the specifications pertain to the identical variant realization or that these specifications pertain to different or alternative, mutually exclusive variants of realization. Furthermore, different specifications of characteristics may pertain to certain variants of realization, but not pertain to other variants of realization. Different specifications of requirements may pertain to certain variants of realization and not pertain to other variants of realization.
Certain specimens of variants of realization have been specified and shown in the appended drawings. However, it must be understood that such variants of realization are simply examples, but not limitations of the specified variants of realization, and that these variants of realization are not limited to the specific indicated and described designs and devices, since specialists in this field of technology on the basis of the presented materials can create their own variants of realization. In the field of technology to which the present disclosure pertains, it is difficult to foresee the rapid development and further accomplishments, and so the specified variants of realization can easily be replaced in the device and its parts thanks to the development of technology, while maintaining the principles of the present specified disclosure.
In various aspects, the systems and methods described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the methods may be stored as one or more instructions or code on a non-transitory computer-readable medium. Computer-readable medium includes data storage. By way of example, and not limitation, such computer-readable medium can comprise RAM, ROM, EEPROM, CD-ROM, Flash memory or other types of electric, magnetic, or optical storage medium, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processor of a general purpose computer.
In the interest of clarity, not all of the routine features of the aspects are disclosed herein. It will be appreciated that in the development of any actual implementation of the present disclosure, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, and that these specific goals will vary for different implementations and different developers. It will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of this disclosure.
Furthermore, it is to be understood that the phraseology or terminology used herein is for the purpose of description and not of restriction, such that the terminology or phraseology of the present specification is to be interpreted by the skilled in the art in light of the teachings and guidance presented herein, in combination with the knowledge of the skilled in the relevant art(s). Moreover, it is not intended for any term in the specification or claims to be ascribed an uncommon or special meaning unless explicitly set forth as such.
The various aspects disclosed herein encompass present and future known equivalents to the known modules referred to herein by way of illustration. Moreover, while aspects and applications have been shown and described, it would be apparent to those skilled in the art having the benefit of this disclosure that many more modifications than mentioned above are possible without departing from the concepts disclosed herein.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2015109665 | Mar 2015 | RU | national |