The present application is based upon and claims priority to Chinese Patent Application No. 201910105197.8, filed on Feb. 1, 2019, the entire contents of which are incorporated herein by reference.
Embodiments of the present disclosure relate to the field of computer technology, and more particularly, to a method and an apparatus for constructing a data model, and a computer readable storage medium.
A knowledge graph, also called a knowledge base, is used to describe entities and concepts existing in the real world, the relations between the entities and the concepts, and the attributes of the respective entities and concepts. Presently, knowledge graphs are widely used in fields such as search, artificial intelligence, and deep learning. In a knowledge graph, a schema is used to describe a data model in a certain field, and the data model includes an entity type and attributes associated with the entity type in the field. For example, taking an entity type “character” as an example, the attributes of the entity type “character” may include height, weight, age, etc. The “attribute” described herein may also be called a “predicate.”
In order to construct a schema for a certain field, one may need to mine and generalize attributes associated with the entity type using data coming from different sources. However, due to the multi-source heterogeneity of the data and the diversity of expressions, the expressions from which the attributes are mined may be complex and diverse.
According to exemplary embodiments of the present disclosure, a technical solution for constructing a data model is provided.
In embodiments of the present disclosure, there is provided a method for constructing a data model. The method includes obtaining a first attribute set associated with an entity type. The method further includes aligning a plurality of attributes having the same semantics in the first attribute set to a same attribute, to generate a second attribute set associated with the entity type, attributes in the second attribute set having different semantics. The method further includes constructing the data model associated with the entity type based on the entity type and the second attribute set.
In embodiments of the present disclosure, there is provided an apparatus for constructing a data model. The apparatus includes: one or more processors; and a memory storing instructions executable by the one or more processors. The one or more processors are configured to: obtain a first attribute set associated with an entity type; align a plurality of attributes having the same semantics in the first attribute set to a same attribute, to generate a second attribute set associated with the entity type, attributes in the second attribute set having different semantics; and construct the data model associated with the entity type based on the entity type and the second attribute set.
In embodiments of the present disclosure, there is provided a computer readable storage medium having computer programs stored thereon. When the computer programs are executed by a processor, a method for constructing a data model according to embodiments of the present disclosure is implemented. The method includes obtaining a first attribute set associated with an entity type; aligning a plurality of attributes having the same semantics in the first attribute set to a same attribute, to generate a second attribute set associated with the entity type, attributes in the second attribute set having different semantics; and constructing the data model associated with the entity type based on the entity type and the second attribute set.
It should be understood that the descriptions in this Summary are not intended to identify essential or important features of embodiments of the present disclosure, nor are they to be construed as limiting the scope of the present disclosure. Other features of the present disclosure will be easily understood from the following descriptions.
The above and other features, advantages and aspects of respective embodiments of the present disclosure will become more apparent with reference to the accompanying drawings and the following detailed description. In the accompanying drawings, the same or similar reference numerals represent the same or similar elements, in which:
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Some embodiments of the present disclosure are illustrated in the accompanying drawings. It should be understood that embodiments of the present disclosure may be implemented in various ways and should not be construed as being limited to the embodiments set forth herein. On the contrary, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the accompanying drawings and embodiments of the present disclosure are merely for exemplary purposes, and are not meant to limit the protection scope of the present disclosure.
In the description of embodiments of the present disclosure, the term “includes” and its equivalents should be understood as open-ended inclusion, that is, “includes but is not limited to”. The term “based on” should be understood as “based at least in part on”. The term “an embodiment” or “the embodiment” should be understood as “at least one embodiment”. The terms “first”, “second” and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
As described above, in order to construct a knowledge graph of a certain field, a schema needs to be constructed for the field. Therefore, there is a need to mine and generalize attributes associated with an entity type from data coming from different sources. However, due to the multi-source heterogeneity of the data and the diversity of expressions, the expressions from which the attributes are mined may be complex and diverse.
Some conventional solutions manually edit the attributes associated with the entity type in the schema, thereby constructing the schema. These conventional solutions have low efficiency and cannot adapt to situations with large amounts of data and diverse expressions. Other conventional solutions may mine and refine an attribute set associated with the entity type from large-scale data by utilizing a machine learning model. However, the characteristics used in such conventional solutions are relatively limited, causing poor robustness and low accuracy.
According to embodiments of the present disclosure, a technical solution for constructing a data model is provided. The technical solution identifies synonymous attributes with different expressions in data coming from different sources by utilizing a machine learning model. Since the procedure for determining the synonymous attributes utilizes rich characteristics in various dimensions, the technical solution may achieve higher robustness and higher accuracy. By aligning the synonymous attributes automatically, the technical solution may construct the data model efficiently while effectively reducing labor costs.
Detailed descriptions will be made to embodiments of the present disclosure with reference to the accompanying drawings.
The model construction apparatus 120 may obtain input data 110 associated with an entity type 111 from a plurality of data sources. As illustrated in
Although the input data 110 in
As illustrated in
At block 210, the model construction apparatus 120 obtains a first attribute set associated with the entity type.
In some embodiments, the first attribute set may be, for example, the original attribute set 112 illustrated in
In some embodiments, the model construction apparatus 120 may perform clustering on the original attribute set 112, to divide the original attribute set 112 into a plurality of subsets. For example, the model construction apparatus 120 may perform the clustering on the original attribute set 112 by utilizing a graph clustering algorithm based on the Markov Cluster (MCL) algorithm. Compared with a conventional text clustering algorithm, the graph clustering algorithm utilizes similarity characteristics of more dimensions, and thus better handles the clustering of short character strings. Additionally or alternatively, in some embodiments, the model construction apparatus 120 may perform the clustering on the original attribute set 112 by utilizing a hierarchical clustering algorithm. The above merely lists a few examples of clustering algorithms that may be used by the model construction apparatus 120. It should be understood that the model construction apparatus 120 may divide the original attribute set 112 into the plurality of subsets by utilizing any method known or to be developed, and is not limited to the methods illustrated above.
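By way of illustration only, the following Python sketch shows what such a graph-based clustering step might look like. It implements a minimal Markov Cluster (MCL) iteration over a similarity graph built from attribute strings; the character-bigram Jaccard similarity, the inflation parameter, and the sample attributes are assumptions made for this example and are not taken from the disclosure.

```python
import numpy as np

def char_bigrams(s):
    # Character bigrams; a crude, assumed proxy for attribute similarity.
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}

def jaccard(a, b):
    sa, sb = char_bigrams(a), char_bigrams(b)
    return len(sa & sb) / len(sa | sb)

def mcl_subsets(attributes, inflation=2.0, iterations=50):
    n = len(attributes)
    # Build a similarity graph with self-loops and make it column-stochastic.
    m = np.array([[jaccard(a, b) for b in attributes] for a in attributes])
    np.fill_diagonal(m, 1.0)
    m /= m.sum(axis=0)
    for _ in range(iterations):
        m = m @ m                   # expansion: spread flow along paths
        m = np.power(m, inflation)  # inflation: strengthen intra-cluster flow
        m /= m.sum(axis=0)
    # Non-empty rows of the converged matrix define the subsets.
    subsets = []
    for i in range(n):
        members = {attributes[j] for j in range(n) if m[i, j] > 1e-6}
        if members and members not in subsets:
            subsets.append(members)
    return subsets

# Hypothetical original attribute set for the entity type "character";
# the resulting grouping depends entirely on the toy similarity above.
print(mcl_subsets(["height", "heights", "weight", "body weight", "age"]))
```

In practice, the edge weights would combine richer similarity signals than raw character overlap, as the similarity characteristics described below illustrate.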
At block 220, the model construction apparatus 120 aligns a plurality of attributes having the same semantics in the first attribute set to a same attribute, to generate a second attribute set associated with the entity type. Attributes in the second attribute set have different semantics.
In some embodiments, the model construction apparatus 120 may combine the entity type with each attribute in the first attribute set, to generate a plurality of type-attribute pairs. Taking the entity type “character” as an example, examples of the generated type-attribute pairs may be “character-height”, “character-stature”, “character-weight”, “character-kilogram”, and the like. For any two type-attribute pairs (also called “a first type-attribute pair” and “a second type-attribute pair”, such as “character-weight” and “character-kilogram”) in the plurality of type-attribute pairs, the model construction apparatus 120 may determine whether the first type-attribute pair has the same semantics as the second type-attribute pair.
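For illustration, the candidate pairs might be enumerated as in the following sketch; the attribute names come from the example above, and the comparison itself is the classification procedure described below.

```python
from itertools import combinations

entity_type = "character"
attributes = ["height", "stature", "weight", "kilogram"]  # example attributes

# Combine the entity type with each attribute to form type-attribute pairs.
type_attribute_pairs = [(entity_type, attribute) for attribute in attributes]

# Enumerate every candidate (first pair, second pair) combination; each
# candidate is then passed to the semantic-equivalence decision described below.
for first_pair, second_pair in combinations(type_attribute_pairs, 2):
    print(first_pair, "vs", second_pair)
```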
In some embodiments, the characteristic extraction unit 310 may obtain a first type-attribute pair 301-1 and a second type-attribute pair 301-2, and obtain a first group of knowledge items 302-1 associated with the first type-attribute pair 301-1 and a second group of knowledge items 302-2 associated with the second type-attribute pair 301-2 from a knowledge item 113 with an SPO structure illustrated in
In some embodiments, the text similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 may be measured by utilizing a Jaccard similarity coefficient between the first type-attribute pair 301-1 and the second type-attribute pair 301-2. For example, the larger the Jaccard similarity coefficient, the higher the similarity between the two type-attribute pairs. In some embodiments, the second similarity characteristic 303-2 may, for example, indicate whether the first type-attribute pair 301-1 and the second type-attribute pair 301-2 are synonyms in one or more semantic dictionaries (such as a WordNet dictionary).
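As a concrete illustration, the first two similarity characteristics might be computed along the following lines; the character-level Jaccard measure and the toy synonym table are assumptions made for this example (a real system might consult WordNet or a domain dictionary instead).

```python
def jaccard_similarity(attr_a: str, attr_b: str) -> float:
    # First similarity characteristic: Jaccard coefficient over character sets.
    set_a, set_b = set(attr_a), set(attr_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Second similarity characteristic: dictionary-based synonym lookup.
# This toy table is an assumption; a real system might query WordNet
# or a domain-specific semantic dictionary instead.
SYNONYM_PAIRS = {frozenset({"height", "stature"}),
                 frozenset({"weight", "body weight"})}

def are_dictionary_synonyms(attr_a: str, attr_b: str) -> bool:
    return frozenset({attr_a, attr_b}) in SYNONYM_PAIRS

print(jaccard_similarity("height", "stature"))       # low text similarity
print(are_dictionary_synonyms("height", "stature"))  # True despite low overlap
```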
The semantic similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 may be measured in a plurality of ways. In some embodiments, the characteristic extraction unit 310 may determine a query similarity between the first attribute in the first type-attribute pair 301-1 and the second attribute in the second type-attribute pair 301-2 as the third similarity characteristic 303-3 measuring the semantic similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2. For example, the characteristic extraction unit 310 may use the first attribute and the second attribute as query keywords, and determine the query similarity between the first attribute and the second attribute by determining a similarity between the query results of the first attribute and the second attribute. Additionally or alternatively, in some embodiments, the characteristic extraction unit 310 may transform the first type-attribute pair 301-1 and the second type-attribute pair 301-2 into two vectors by utilizing a bag-of-words (BoW) model, and determine the semantic similarity by calculating a cosine distance between the two vectors. Additionally or alternatively, in some embodiments, the characteristic extraction unit 310 may transform the first type-attribute pair 301-1 and the second type-attribute pair 301-2 into two vectors by utilizing a generalized regression neural network (GRNN) model, and determine the semantic similarity by calculating a cosine distance between the two vectors. Additionally or alternatively, in some embodiments, the characteristic extraction unit 310 may transform the first type-attribute pair 301-1 and the second type-attribute pair 301-2 into two vectors based on a query clicking characteristic associated with the first attribute in the first type-attribute pair 301-1 and a query clicking characteristic associated with the second attribute in the second type-attribute pair 301-2, and determine the semantic similarity by calculating a cosine distance between the two vectors. Additionally or alternatively, in some embodiments, the characteristic extraction unit 310 may determine the semantic similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 by utilizing a semantic classification model trained based on a supervised learning method. Additionally or alternatively, in some embodiments, the characteristic extraction unit 310 may transform the first type-attribute pair 301-1 and the second type-attribute pair 301-2 into two vectors by utilizing a skip-gram model, and determine the semantic similarity between the two type-attribute pairs by calculating a cosine distance between the two vectors.
The above merely lists a few examples for determining the third similarity characteristic 303-3 indicating the semantic similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2. It should be understood that the characteristic extraction unit 310 may utilize any method known or to be developed to determine the third similarity characteristic 303-3 indicating the semantic similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2, and is not limited to the methods illustrated above.
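One of the embedding-based variants listed above might be sketched as follows, assuming a gensim skip-gram model (sg=1) trained on a toy corpus; the corpus, the vector size, and the token-averaging scheme are illustrative assumptions rather than details from the disclosure.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for the real training data (an assumption).
corpus = [["character", "height", "stature", "tall"],
          ["character", "weight", "kilogram", "heavy"]]

# sg=1 selects the skip-gram architecture mentioned in the text.
model = Word2Vec(corpus, vector_size=32, sg=1, min_count=1, seed=0)

def pair_vector(type_attribute_pair):
    # Represent a type-attribute pair as the mean of its token vectors.
    return np.mean([model.wv[token] for token in type_attribute_pair], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = pair_vector(("character", "height"))
v2 = pair_vector(("character", "stature"))
print(cosine(v1, v2))  # semantic similarity score in [-1, 1]
```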
In some embodiments, the characteristic extraction unit 310 may also obtain a fourth similarity characteristic between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 by performing a statistical analysis on the first group of knowledge items 302-1 associated with the first type-attribute pair 301-1 and the second group of knowledge items 302-2 associated with the second type-attribute pair 301-2. For example, the characteristic extraction unit 310 may determine various types of statistical information based on the first group of knowledge items 302-1 and the second group of knowledge items 302-2. The statistical information may include, for example, subject-object co-occurrence information. The subject-object co-occurrence information described herein indicates that the subjects in two SPO structures are the same and the objects in the two SPO structures are the same, for example, “the wife of SanZhang is SiLi” and “the love of SanZhang is SiLi”. The subject-object co-occurrence information may indicate that there is a higher probability that the two predicates (such as “wife” and “love”) in the two subject-predicate-object structures have the same semantics. Additionally or alternatively, the statistical information may also include information of an object type. The object type described herein refers to a superordinate word of the object in the SPO structure. When the two object types in the two SPO structures are the same, it may be indicated that there is a higher probability that the two predicates in the two SPO structures have the same semantics. Additionally or alternatively, the statistical information may also include information of a subject keyword, that is, a result obtained by comparing the subjects without their superordinate words in the two SPO structures. Additionally or alternatively, the statistical information may also include homology information. For example, when the two SPO structures come from a same data source and relate to a same entity, the statistical information may indicate that there is a higher probability that the two predicates (P) in the two SPO structures have different semantics. The model construction apparatus 120 may determine the fourth similarity characteristic 303-4 between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 based on the statistical information.
The above merely lists a few examples for determining the fourth similarity characteristic 303-4 indicating an SPO statistical similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 based on the statistical information. It should be understood that the model construction apparatus 120 may utilize any method known or to be developed to determine the fourth similarity characteristic 303-4 indicating the SPO statistical similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2, and is not limited to the methods illustrated above.
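As an illustration of the subject-object co-occurrence statistic, a minimal sketch follows, assuming the knowledge items are available as (subject, predicate, object) tuples; the sample triples and the overlap-ratio scoring rule are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical SPO knowledge items from different sources.
triples = [("SanZhang", "wife", "SiLi"),
           ("SanZhang", "love", "SiLi"),
           ("SanZhang", "height", "180cm")]

# Index (subject, object) pairs by predicate.
so_pairs = defaultdict(set)
for s, p, o in triples:
    so_pairs[p].add((s, o))

def so_cooccurrence(pred_a: str, pred_b: str) -> float:
    # Fraction of shared (subject, object) pairs; a high value suggests
    # the two predicates are likely to have the same semantics.
    a, b = so_pairs[pred_a], so_pairs[pred_b]
    return len(a & b) / len(a | b) if a | b else 0.0

print(so_cooccurrence("wife", "love"))    # 1.0 -> likely synonymous
print(so_cooccurrence("wife", "height"))  # 0.0 -> likely unrelated
```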
In some embodiments, the plurality of extracted similarity characteristics 303 between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 may be provided to a classification model 320, to determine whether the first type-attribute pair 301-1 has the same semantics as the second type-attribute pair 301-2. For example, the classification model 320 may be a trained support vector machine (SVM) model.
In some embodiments, the SVM model 320 for determining whether the first type-attribute pair 301-1 has the same semantics as the second type-attribute pair 301-2 may be trained in advance and provided to the model construction apparatus 120. The training data set for training the SVM model may be obtained by a combination of clustering and manual annotation. For example, type-attribute pairs of a plurality of specific entity types (such as a character, an appliance, a place, and the like) may be used as the training data for training the SVM model. Before the manual annotation is performed on these type-attribute pairs, the clustering may be performed on these type-attribute pairs by utilizing the clustering algorithms described above. When the manual annotation is performed, the clustered training data set may be provided to a plurality of annotation personnel to mark the type-attribute pairs with the same semantics in the clustered training data set. In this way, the accuracy of the annotation may be ensured by synthesizing the annotation results from the plurality of annotation personnel. When the SVM model is trained, the selected characteristics may be any of the similarity characteristics described above, including but not limited to: a text similarity characteristic; a semantic similarity characteristic (including a query similarity, a BoW similarity, a GRNN similarity, a query clicking similarity, a semantic similarity obtained by a semantic classification model, a skip-gram similarity, etc.); a statistical similarity (obtained by performing the statistical analysis on the SPO data); and the like.
In this way, the trained classification model 320 may determine whether the first type-attribute pair 301-1 has the same semantics as the second type-attribute pair 301-2 based on the plurality of similarity characteristics 303 between the first type-attribute pair 301-1 and the second type-attribute pair 301-2, which is illustrated by a classification result 304 in
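The training and classification steps described above might be sketched with scikit-learn as follows; the four-dimensional feature vectors and labels are fabricated placeholders standing in for the annotated training data, not real data.

```python
import numpy as np
from sklearn.svm import SVC

# Each row holds the similarity characteristics of one candidate pair:
# [text_similarity, dictionary_synonym, semantic_similarity, spo_statistical].
# The values and labels below are fabricated placeholders, not real annotations.
X_train = np.array([[0.8, 1.0, 0.9, 0.7],
                    [0.1, 0.0, 0.2, 0.1],
                    [0.2, 1.0, 0.8, 0.9],
                    [0.3, 0.0, 0.1, 0.0]])
y_train = np.array([1, 0, 1, 0])  # 1 = same semantics, 0 = different semantics

classifier = SVC(kernel="rbf")
classifier.fit(X_train, y_train)

# Classify a new candidate pair from its extracted similarity characteristics.
features = np.array([[0.15, 1.0, 0.85, 0.8]])
print(classifier.predict(features))            # predicted class label
print(classifier.decision_function(features))  # signed distance to the margin
```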
Additionally or alternatively, in some embodiments, the model construction apparatus 120 may further optimize the classification result 304 of the classification model 320 based on a preset rule. For example, when the classification model 320 determines that the first type-attribute pair 301-1 has the same semantics as the second type-attribute pair 301-2, the model construction apparatus 120 may further determine whether a score of the semantic similarity (such as a score indicated by the third similarity characteristic described above) between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 exceeds a preset threshold. When the score of the semantic similarity between the first type-attribute pair 301-1 and the second type-attribute pair 301-2 exceeds the preset threshold, the model construction apparatus 120 may determine that the first type-attribute pair 301-1 has the same semantics as the second type-attribute pair 301-2. In some embodiments, the model construction apparatus 120 may filter the classification result 304 based on a combination of one or more preset rules, thus further improving the accuracy of the classification result. Additionally or alternatively, in some embodiments, the model construction apparatus 120 may provide the classification result 304 to the user for verification, and optimize the classification result 304 based on a verification result fed back by the user, thus further improving the accuracy of the classification result.
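A minimal sketch of this rule-based refinement follows, continuing the SVM example above and reusing its assumed feature layout; the threshold value is an assumption.

```python
SEMANTIC_SIMILARITY_THRESHOLD = 0.7  # assumed preset threshold

def refined_decision(classifier, features) -> bool:
    # Accept a positive classification only if the semantic-similarity score
    # (the third characteristic in the layout above) also clears the threshold.
    if classifier.predict([features])[0] != 1:
        return False
    return features[2] >= SEMANTIC_SIMILARITY_THRESHOLD

print(refined_decision(classifier, [0.15, 1.0, 0.85, 0.8]))
```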
In some embodiments, when it is determined that the first type-attribute pair has the same semantics as the second type-attribute pair, the model construction apparatus 120 may align a first attribute (i.e., “height”) in the first type-attribute pair (such as “character-height”) and a second attribute (i.e., “stature”) in the second type-attribute pair (such as “character-stature”) to a same attribute. For example, the model construction apparatus 120 may align the first attribute and the second attribute, which have the same semantics, to one of the first attribute and the second attribute. Alternatively, the model construction apparatus 120 may align the first attribute and the second attribute to another attribute, which may be different from both the first attribute and the second attribute. In this way, the model construction apparatus 120 may generate a second attribute set (such as the attribute set 131 illustrated in
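The alignment step itself might be implemented with a union-find structure, as sketched below; choosing the shortest member of each group as the canonical attribute is an assumption made for illustration.

```python
class AttributeAligner:
    """Union-find over attributes; synonymous attributes share one root."""

    def __init__(self, attributes):
        self.parent = {a: a for a in attributes}

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def align(self, a, b):
        # Merge two attributes judged to have the same semantics.
        self.parent[self.find(a)] = self.find(b)

    def second_attribute_set(self):
        groups = {}
        for a in self.parent:
            groups.setdefault(self.find(a), set()).add(a)
        # Pick one canonical name per group (assumed rule: the shortest).
        return {min(members, key=len) for members in groups.values()}

aligner = AttributeAligner(["height", "stature", "weight", "kilogram"])
aligner.align("height", "stature")     # classifier said: same semantics
aligner.align("weight", "kilogram")
print(aligner.second_attribute_set())  # e.g. {'height', 'weight'}
```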
Returning to
It can be seen from the above description that embodiments of the present disclosure use a machine learning model to identify synonymous attributes with different expressions in data from different sources. Since the procedure for determining the synonymous attributes uses rich characteristics of various dimensions, embodiments of the present disclosure may achieve high accuracy and high robustness. By aligning attributes with the same semantics automatically, embodiments of the present disclosure may construct the data model efficiently while reducing labor costs effectively.
In some embodiments, the attribute obtaining module 410 includes an attribute obtaining unit, a subset dividing unit and a first determining unit. The attribute obtaining unit is configured to obtain a third attribute set associated with the entity type. The subset dividing unit is configured to divide the third attribute set into a plurality of subsets based on an attribute similarity. The first determining unit is configured to determine one of the plurality of subsets as the first attribute set.
In some embodiments, the subset dividing unit is further configured to perform clustering on the third attribute set, to divide the third attribute set into the plurality of subsets.
In some embodiments, the attribute aligning module 420 includes a first combining unit, a second combining unit, a second determining unit and an attribute aligning unit. The first combining unit is configured to combine the entity type with a first attribute in the first attribute set, to obtain a first type-attribute pair. The second combining unit is configured to combine the entity type with a second attribute different from the first attribute in the first attribute set, to obtain a second type-attribute pair. The second determining unit is configured to determine whether the first type-attribute pair has the same semantics as the second type-attribute pair. The attribute aligning unit is configured to align the first attribute to the second attribute in response to determining that the first type-attribute pair has the same semantics as the second type-attribute pair.
In some embodiments, the second determining unit is further configured to: extract a plurality of similarity characteristics between the first type-attribute pair and the second type-attribute pair; and determine whether the first type-attribute pair has the same semantics as the second type-attribute pair based on the plurality of similarity characteristics.
In some embodiments, the plurality of similarity characteristics include at least one of: a first similarity characteristic indicating a text similarity between the first type-attribute pair and the second type-attribute pair; a second similarity characteristic indicating whether the first type-attribute pair and the second type-attribute pair are synonyms in a semantic dictionary; a third similarity characteristic indicating a semantic similarity between the first type-attribute pair and the second type-attribute pair; and a fourth similarity characteristic obtained by performing a statistical analysis on a first group of knowledge items associated with the first type-attribute pair and a second group of knowledge items associated with the second type-attribute pair.
In some embodiments, the second determining unit is further configured to utilize a trained classification model to determine whether the first type-attribute pair has the same semantics as the second type-attribute pair.
In some embodiments, the classification model is a trained support vector machine (SVM) model.
It should be understood that each module in the apparatus 400 corresponds to a respective action at each block of the method 200 illustrated in
In addition, these modules and/or units illustrated in
These modules and/or units illustrated in
A plurality of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and loudspeakers; a storage unit 508, such as a magnetic disk or an optical disk; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The processing unit 501 executes the above-mentioned methods and processes, such as the method 200. For example, in some implementations, the method 200 may be implemented as a computer software program. The computer software program is tangibly contained in a machine-readable medium, such as the storage unit 508. In some implementations, a part or all of the computer programs may be loaded and/or installed on the device 500 through the ROM 502 and/or the communication unit 509. When the computer programs are loaded into the RAM 503 and executed by the CPU 501, one or more blocks of the method 200 described above may be executed. Alternatively, in other embodiments, the CPU 501 may be configured to execute the method 200 in other appropriate ways (such as by means of hardware).
The functions described herein may be executed at least partially by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), and the like.
Program codes for implementing the method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or a controller of a general purpose computer, a special purpose computer, or other programmable data processing device, such that the functions/operations specified in the flowcharts and/or the block diagrams are implemented when the program codes are executed by the processor or the controller. The program codes may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or a server.
In the context of the present disclosure, the machine-readable medium may be a tangible medium that may contain or store a program to be used by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium may include electrical connections based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage, a magnetic storage device, or any suitable combination of the foregoing.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be executed in the particular order illustrated or in a sequential order, or that all illustrated operations be executed, to achieve the desired result. Multitasking and parallel processing may be advantageous in certain circumstances. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation may also be implemented in a plurality of implementations, either individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not limited to the specific features or acts described above. Instead, the specific features and acts described above are merely exemplary forms of implementing the claims.