The present invention relates to the field of industrial software, and particularly relates to a semantic model instantiation method, system and apparatus.
Many industries, including social networking, e-commerce and manufacturing, have started to provide knowledge-based intelligent functions and services to clients, and these require an extensible knowledge database as a basis. A domain semantic model or schema may be established by a domain expert; however, it is not easy to fill a knowledge database with data according to a semantic model.
For example, instantiating a semantic model, that is, filling it with data instances or data individuals, still depends mainly on manual work. Typically, engineers manually identify and extract the data instances. Alternatively, the data must first be processed into predefined data formats or intermediate forms so that a customized program can fill the knowledge database with them. Both approaches require a high degree of manual participation and are therefore costly and slow. In many industrial fields, the original data are of different classes, so a customized data extraction process is hard to apply to other cases. Customers therefore lack tools for automatically extracting data instances from domain files based on a defined domain semantic model.
Two solutions are provided in the prior art. One solution is form analysis and retrieval, which targets the correlation between customer questions and form contents. When a customer poses a question, a form analysis and retrieval algorithm searches the form data to determine one or more forms that can potentially answer it. Retrieval methods include the character string similarity algorithm BM25, cell data similarity computing and the like. A system may include apparatuses for semantic parsing, form format analysis, form-question similarity comparison, form retrieval and the like. However, such solutions only address how to match customer inquiries with form contents.
The other solution is ontology matching, which aims to find correlations between the entities of two ontologies, including classes, parameters and instances. Ontology matching includes two basic steps: similarity computing and queue extracting. In these steps, the two ontologies are compared from both linguistic and structural perspectives, with the purpose of transferring data from one ontology model to the other. However, such solutions do not take forms as input. Some similar methods have also tried to extract web form information based on ontology information, but these solutions are mainly based on heuristic rules and are hard to extend to forms with arbitrary layouts.
Moreover, existing software tools of the industrial field cannot automatically identify a correlation between any semi-structured file (form) and a domain semantic model to extract relevant data instances.
According to a first aspect, the present invention provides a semantic model instantiation method, including the following steps: S1, receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a characteristic vector set, where the characteristic vectors represent the classes and attributes of an ontology and a relation between the attributes; S3, importing a semi-structured file, and converting the semi-structured file into a key word vector based on a semantic vector of the semantic model; and S4, comparing a correlation between the semantic vector and the key word vector, and identifying a key word vector corresponding to the semantic vector.
Further, the method also includes the following step between step S1 and step S3: S2, matching a near-synonym of a word of the semantic vector based on the semantic vector of the semantic model, where step S3 also includes the following step: converting the semi-structured file into a key word vector based on the semantic vector of the semantic model and the near-synonym thereof. Further, the method also includes the following step after step S4: extracting, to a database, instance data of the semi-structured file for the key word vector corresponding to the semantic vector. Further, the ontology includes classes, attributes and a relation between the attributes.
Further, step S3 also includes the following step when the semi-structured file is a form file: determining a header position of the form file, and identifying a data division of the form file. Further, step S4 also includes the following steps: executing multiple correlation computing methods based on the semantic vector, a synonym lexicon and the key word vector to obtain multiple correlation values to compare a correlation of the semantic vector and the key word vector, weighting the correlation values to construct a correlation matrix and screening out parameter mapping to identify a key word vector corresponding to the semantic vector, where the parameter mapping shows a matched key word vector and semantic vector.
Further, the correlation matrix is constructed according to the following algorithm:
$$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$$
where Mij is a correlation, O is a semantic vector, K is a key word vector, wq is a weight, Simq is a correlation algorithm, and i, j, q are natural numbers.
According to a second aspect, the present invention provides a semantic model instantiation system, including a processor; and a memory coupled with the processor, where the memory has instructions stored therein, the instructions enable an electronic device to execute actions when being executed by the processor, and the actions include: S1, receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a characteristic vector set, where the characteristic vectors represent the classes and attributes of an ontology and a relation between the attributes; S3, importing a semi-structured file, and converting the semi-structured file into a key word vector based on a semantic vector of the semantic model; and S4, comparing a correlation between the semantic vector and the key word vector, and identifying a key word vector corresponding to the semantic vector.
Further, the following action is also included between action S1 and action S3: S2, matching a near-synonym of a word of the semantic vector based on the semantic vector of the semantic model, where action S3 also includes: converting the semi-structured file into a key word vector based on the semantic vector of the semantic model and the near-synonym thereof.
Further, the following action is included after action S4: extracting instance data of the semi-structured file of the key word vector corresponding to the semantic vector to a database. Further, the ontology includes classes, attributes and a relation between the attributes.
Further, action S3 also includes the following action when the semi-structured file is a form file: determining a header position of the form file, and identifying a data division of the form file. Further, action S4 also includes: executing multiple correlation computing methods based on the semantic vector, a synonym lexicon and the key word vector to obtain multiple correlation values to compare a correlation of the semantic vector and the key word vector, weighting the correlation values to construct a correlation matrix and screening out parameter mapping to identify a key word vector corresponding to the semantic vector, where the parameter mapping shows a matched key word vector and semantic vector.
Further, the correlation matrix is constructed according to the following algorithm:
$$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$$
where Mij is a correlation, O is a semantic vector, K is a key word vector, wq is a weight, Simq is a correlation algorithm, and i, j, q are natural numbers.
According to a third aspect, the present invention provides a semantic model instantiation apparatus, including a first converting apparatus, for receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a characteristic vector set, where the characteristic vectors represent the classes and attributes of an ontology and a relation between the attributes; a second converting apparatus, for importing a semi-structured file, and converting the semi-structured file into a key word vector based on a semantic vector of the semantic model; and a comparing and identifying apparatus, for comparing a correlation of the semantic vector and the key word vector, and identifying a key word vector corresponding to the semantic vector.
According to a fourth aspect, the present invention provides a computer program product that is tangibly stored on a computer readable medium and includes a computer executable instruction which, when executed, enables at least one processor to execute the method according to the first aspect of the present invention.
According to a fifth aspect, the present invention provides a computer readable medium that stores a computer executable instruction which, when executed, enables at least one processor to execute the method according to the first aspect of the present invention.
Innovations of the present invention lie in that a semantic model is converted into semantic vectors, including class vectors and correlation vectors, synonyms are computed, and a synonym lexicon is constructed for each semantic vector. Each separate semantic vector acts as guidance for information extraction. As a result, any semantic model may be decomposed into many retrieval formulae for data retrieval, which is conducive to automatic matching and to the data retrieval process described by the semantic model.
Innovation of the present invention also lies in that useful header data from any semi-structured file are organized and converted into key word vectors, which includes identifying the key word parameter division and the data division of form files, and the key word parameters are extracted to obtain a tree structure. As a result, a form may be converted into vectors, and the vectors may be used for further comparison and computation for data extraction. Innovation of the present invention further lies in that a correlation mapping between any semantic vector and a key word vector is extracted, and relevant information is extracted from a semi-structured file. This serves to compute the distinction between the semantic vector and the key word vector and to match the parameter mapping. According to the present invention, a model-based, rapid and automatic mode of estimating and matching data is realized. The present invention can greatly reduce the workload and expense of constructing a knowledge graph, and thus accelerates knowledge-based services.
Specific implementations of the present invention will be described below with reference to the accompanying drawings.
The present invention provides a semantic model instantiation mechanism, and the semantic model instantiation mechanism is capable of extracting data instances based on an abstract model, and utilizes corresponding semi-structured data and a semantic model. According to the present invention, useful data instances are rapidly determined and extracted to a knowledge database by automatically screening and executing domain semi-structured files based on semantic definition with reasonable accuracy, so as to automatically extract data from the semi-structured file based on any semantic model.
As shown in
According to a first aspect, the present invention provides a semantic model instantiation method, including the following steps: Firstly, step S1 is executed. The first converting apparatus 110 receives an ontology-based semantic model A, parses the semantic model A and converts the semantic model A into a characteristic vector set, and the characteristic vectors represent the classes and attributes of an ontology and a relation between the attributes. That is, the first converting apparatus 110 resolves the semantic model A into a concept of classes and subclasses, and describes classes and subclasses with characteristic vectors.
The ontology includes classes, attributes and a relation between the attributes. The classes also include subclasses of the classes. According to the present invention, an ontology base may be established in advance, and the ontology base is constantly updated in the process of executing the present invention. For example, classes of the ontology base include: devices, products, manpower, materials, technologies, maintenance and the like. The above-mentioned classes are interrelated.
For example, as shown in
Therefore, the output of the first converting apparatus 110 is a set of characteristic vectors and of relations among multiple vectors, where the characteristic vectors include semantic vectors and characteristic vectors, and the characteristic vectors are specifically vectors of the ontology classes. Specifically, each vector includes a class name, vector names and the relations therebetween. By way of example, the format of one of the semantic vectors is: (class name, vector 1, vector 2 . . . vector N, relation 1, relation 2 . . . relation M)
where for example, semantic vectors are “a worker operates a machine C,” “a worker produces products” and “a machine has a fault”, where “operate”, “produce” and “has” are relations therebetween.
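By way of a non-limiting sketch, the vector format described above may be illustrated in Python as follows. The names SemanticVector and build_semantic_vectors, the input layout (a dictionary of classes plus subject-relation-object triples) and the attribute names are assumptions introduced only for illustration and are not part of the described apparatus.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticVector:
    """One semantic vector: a class name, its attribute entries and its relations."""
    class_name: str
    attributes: list = field(default_factory=list)  # "vector 1 ... vector N"
    relations: list = field(default_factory=list)   # (relation, target class) pairs

    def as_tuple(self):
        # Mirrors the layout (class name, vector 1..N, relation 1..M).
        return (self.class_name, *self.attributes, *self.relations)


def build_semantic_vectors(classes, triples):
    """classes: {class name: [attribute, ...]}; triples: (subject, relation, object)."""
    vectors = {name: SemanticVector(name, list(attrs)) for name, attrs in classes.items()}
    for subject, relation, obj in triples:
        # Create the subject class on demand if it only appears in a relation.
        vectors.setdefault(subject, SemanticVector(subject))
        vectors[subject].relations.append((relation, obj))
    return vectors


# Toy ontology; the attribute names are illustrative assumptions.
classes = {"worker": ["name", "shift"], "machine": ["serial number"]}
triples = [
    ("worker", "operates", "machine"),
    ("worker", "produces", "product"),
    ("machine", "has", "fault"),
]
vectors = build_semantic_vectors(classes, triples)
print(vectors["worker"].as_tuple())
```

In this sketch each relation is kept together with its target class, so that a vector can later serve as a retrieval formula for both the class attributes and the related classes.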
Then, step S3 is executed. The second converting apparatus 120 imports a semi-structured file B, and converts the semi-structured file B into a key word vector based on the semantic vector of the semantic model A. Specifically, the second converting apparatus 120 extracts header data from any semi-structured file B and reorganizes these header data according to a certain logic for subsequent processing, where the semi-structured file B is a form file. As shown in
When the semi-structured file is a form file, step S3 also includes the following step: determining a header position of the form file, and identifying a data division of the form file.
In substep S31, the preprocessing apparatus 1201 executes basic conversion and cleaning for an input form file. For example, the preprocessing apparatus 1201 is capable of converting an Excel form file into an HTML form, because the HTML form includes richer and clearer header data.
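A minimal sketch of such cleaning and conversion is given below, assuming the form has already been read into a list of rows. The functions clean_rows and rows_to_html and the header_rows parameter are hypothetical; reading an actual Excel file is deliberately left out, and only the Python standard library is used.

```python
from html import escape


def clean_rows(rows):
    """Strip whitespace, turn None into '' and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def rows_to_html(rows, header_rows=1):
    """Render rows as an HTML table; the first header_rows rows become <th> cells.

    header_rows is an assumed input; in the described method the header
    position is only determined in the later substep S32.
    """
    lines = ["<table>"]
    for index, row in enumerate(rows):
        tag = "th" if index < header_rows else "td"
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


raw = [
    [" Machine ", "Fault code", None],
    [None, None, None],
    ["C-01", " E<12> ", "bearing"],
]
html_form = rows_to_html(clean_rows(raw))
print(html_form)
```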
Then, in substep S32, the identifying apparatus 1202 reads the form preprocessed by the preprocessing apparatus 1201 to identify the attribute of data content in the form file. Specifically, according to the present invention, four key divisions ULC, RH, CH and Data are defined for any form file, and then these key divisions are determined.
Specifically, referring to
Then, when RH=h1 and CH=h2 are not met, a judgement is then made as to whether RH<h1 or CH<h2, and when RH<h1 or CH<h2 is met, a correlation between the semantic vectors and the key word vectors is then computed, C3 is identified and a potentially embedded one-dimensional form is extracted.
When RH<h1 or CH<h2 is not met, a judgement is then made as to whether RH>h1, and when RH>h1 is met, only RH and C3 of the data division are extracted. When RH>h1 is not met, a judgement is then made as to whether CH>h2, and when CH>h2 is met, only CH and C3 of the data division are extracted.
Therefore, by executing the above-mentioned steps, the four key divisions ULC, RH, CH and Data may be identified and defined to determine the header division and the data division of the form B.
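The order of the judgements described above may be sketched as follows. This is an illustration only: RH, CH, h1 and h2 are treated as plain numeric inputs, which is an assumption, and the handling of the case in which RH=h1 and CH=h2 are both met is not reproduced, so the sketch merely names that branch. The names Branch and select_branch are hypothetical.

```python
from enum import Enum


class Branch(Enum):
    EQUAL = "RH=h1 and CH=h2 (handling not reproduced in this sketch)"
    EMBEDDED = "compute correlation, identify C3, extract embedded one-dimensional form"
    ROW_HEADER = "extract only RH and C3 of the data division"
    COLUMN_HEADER = "extract only CH and C3 of the data division"


def select_branch(rh, ch, h1, h2):
    """Apply the judgements in the order in which they are described."""
    if rh == h1 and ch == h2:
        return Branch.EQUAL
    if rh < h1 or ch < h2:
        return Branch.EMBEDDED
    if rh > h1:
        return Branch.ROW_HEADER
    if ch > h2:
        return Branch.COLUMN_HEADER
    # With comparable numeric inputs one of the branches above always applies.
    raise AssertionError("unreachable for numeric inputs")


for rh, ch in [(2, 2), (1, 3), (3, 2), (2, 3)]:
    print((rh, ch), select_branch(rh, ch, h1=2, h2=2).name)
```

Because the judgement RH<h1 or CH<h2 is made first, a form with RH>h1 but CH<h2 falls into the embedded-form branch rather than the RH-only branch.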
In substep S33, input of the key word apparatus 1203 is a form with a key position, and a form title and attribute are extracted by applying specifications and rules and are stored in a tree structure. The tree structure will be reorganized as weight vectors for subsequent analysis procedures.
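One possible way to hold the form title and attributes in a tree structure and to reorganize the tree as weighted vectors is sketched below. The class HeaderNode, the functions flatten and to_keyword_vectors, the example labels and the weight values are all assumptions made for illustration; the specifications and rules actually applied by the key word apparatus 1203 are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class HeaderNode:
    """A node of the header tree: the form title at the root, attributes below."""
    label: str
    children: list = field(default_factory=list)


def flatten(node, path=()):
    """Yield one root-to-leaf path of header labels per leaf of the tree."""
    path = path + (node.label,)
    if not node.children:
        yield path
    for child in node.children:
        yield from flatten(child, path)


def to_keyword_vectors(root, title_weight=2.0, attribute_weight=1.0):
    """Turn each path into (key words, weights); the weight values are assumed."""
    result = []
    for path in flatten(root):
        weights = [title_weight] + [attribute_weight] * (len(path) - 1)
        result.append((list(path), weights))
    return result


# Illustrative form header: a title with two attributes, one of them nested.
tree = HeaderNode("Machine faults", [
    HeaderNode("Machine", [HeaderNode("ID"), HeaderNode("Type")]),
    HeaderNode("Fault code"),
])
keyword_vectors = to_keyword_vectors(tree)
for keywords, weights in keyword_vectors:
    print(keywords, weights)
```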
For example, the attribute of a one-dimensional form is extracted as a tree structure and converted into the following form key word vectors:
Further, according to an exemplary embodiment of the present invention, the method also includes step S2 between step S1 and step S3: matching a near-synonym of a word of the semantic vector based on the semantic vector of the semantic model. Step S3 also includes the following step: the second converting apparatus 120 converts the semi-structured file into a key word vector based on the semantic vector of the semantic model and the near-synonym thereof.
The second converting apparatus 120 is configured to generate a group of near-synonyms for each word of the semantic vectors. Although existing software can automatically provide near-synonyms, it is difficult for these software tools to provide a reasonable result for a complicated or compound word, especially a word formed by more than one sub-word. As a result, the present invention provides the second converting apparatus 120 applicable to complicated words or compound words.
For example, a compound word is firstly divided into multiple sub-words (sub-word #1, sub-word #2 . . . sub-word #n), then a correlation of each sub-word is computed, and finally, the compound word is constructed by utilizing a correlation principle. As a result, the second converting apparatus 120 includes a synonym result list to establish a synonym matrix, and therefore, a key word lexicon is also formed by a key word matrix.
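The sub-word approach described above may be sketched as follows. The lexicon SUBWORD_SYNONYMS and its scores are invented toy data, and multiplying the sub-word scores is merely one assumed reading of the "correlation principle"; neither is taken from the described embodiment.

```python
from itertools import product

# Hypothetical sub-word lexicon: candidate near-synonyms with a relatedness score.
SUBWORD_SYNONYMS = {
    "machine": [("machine", 1.0), ("equipment", 0.8), ("device", 0.7)],
    "fault": [("fault", 1.0), ("failure", 0.9), ("error", 0.6)],
}


def split_compound(word):
    """Divide a compound word into sub-words on spaces, hyphens and underscores."""
    return word.lower().replace("-", " ").replace("_", " ").split()


def compound_synonyms(word, lexicon=SUBWORD_SYNONYMS, min_score=0.5):
    """Recombine sub-word synonyms into compound candidates, ranked by score."""
    sub_words = split_compound(word)
    original = " ".join(sub_words)
    # A sub-word without lexicon entry is kept unchanged with score 1.0.
    candidates = [lexicon.get(sub, [(sub, 1.0)]) for sub in sub_words]
    results = []
    for combination in product(*candidates):
        score = 1.0
        for _, sub_score in combination:
            score *= sub_score  # assumed combination rule: product of scores
        phrase = " ".join(sub for sub, _ in combination)
        if phrase != original and score >= min_score:
            results.append((phrase, round(score, 3)))
    return sorted(results, key=lambda item: -item[1])


synonyms = compound_synonyms("machine_fault")
for phrase, score in synonyms:
    print(phrase, score)
```

The resulting list corresponds to one row of a synonym result list; collecting such rows for every word of the semantic vectors would give a lexicon of the kind described above.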
Finally, step S4 is executed. The comparing and identifying apparatus 130 compares a correlation of the semantic vectors and the key word vectors, and identifies key word vectors corresponding to the semantic vectors. Specifically, according to a specific embodiment of the present invention, the key word vector is a form key word vector. As a result, the comparing and identifying apparatus 130 computes a correlation of the form key word vector and the semantic vector. Input of the comparing and identifying apparatus 130 includes key word vectors, semantic vectors and a synonym lexicon. According to the present invention, distinction between the key word vector and the semantic vector is computed by utilizing an algorithm.
Specifically, step S4 also includes the following steps: executing multiple correlation computing methods based on the semantic vector, the synonym lexicon and the key word vector to obtain multiple correlation values to compare a correlation of the semantic vector and the key word vector, weighting the correlation values to construct a correlation matrix and screening out parameter mapping to identify a key word vector corresponding to the semantic vector, where the parameter mapping shows a matched key word vector and semantic vector.
As shown in
$$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$$
where Mij is a correlation, O is a semantic vector, K is a key word vector, wq is a weight, Simq is a correlation algorithm, and i, j, q are natural numbers. A higher weight may be given to the correlation between the form title and the semantic class name, because a name generally expresses more information than an individual parameter.
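A minimal sketch of the weighted correlation matrix and of the screening of the parameter mapping is given below. The three similarity measures, the weights, the threshold and the toy lexicon are assumptions chosen for illustration; the description does not fix which correlation algorithms Simq are used.

```python
from difflib import SequenceMatcher


def string_similarity(o, k, lexicon):
    """Character-level similarity between the two terms."""
    return SequenceMatcher(None, o.lower(), k.lower()).ratio()


def token_overlap(o, k, lexicon):
    """Jaccard overlap of the word sets of the two terms."""
    a, b = set(o.lower().split()), set(k.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def synonym_match(o, k, lexicon):
    """1.0 if the key word is listed as a near-synonym of the semantic term."""
    return 1.0 if k.lower() in lexicon.get(o.lower(), ()) else 0.0


def correlation_matrix(semantic, keywords, measures, weights, lexicon):
    """M[i][j] = sum over q of w_q * Sim_q(O_i, K_j)."""
    return [
        [sum(w * sim(o, k, lexicon) for w, sim in zip(weights, measures)) for k in keywords]
        for o in semantic
    ]


def screen_mapping(matrix, semantic, keywords, threshold=0.5):
    """Keep, per semantic term, the best key word if it reaches the threshold."""
    mapping = {}
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            mapping[semantic[i]] = keywords[j]
    return mapping


semantic = ["machine", "fault", "worker"]
keywords = ["equipment", "failure code", "operator"]
lexicon = {"machine": {"equipment", "device"}, "fault": {"failure", "failure code"}}
measures = [string_similarity, token_overlap, synonym_match]
weights = [0.2, 0.2, 0.6]

M = correlation_matrix(semantic, keywords, measures, weights, lexicon)
mapping = screen_mapping(M, semantic, keywords)
print(mapping)
```

With these assumed weights a synonym-lexicon hit alone exceeds the threshold, whereas "worker" stays unmatched because no measure relates it strongly enough to any key word.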
Finally, the method also includes the following step after step S4: the extracting apparatus 150 extracts, to the database 160, instance data of the semi-structured file for the key word vector corresponding to the semantic vector. The extracting apparatus 150 extracts form data based on the output of the comparing and identifying apparatus 130. In one implementation, only data matched with the semantic model are extracted. In another implementation, data of both matched and unmatched form parameters are extracted and stored; however, these data are marked with different correlation ranks. Unmatched form parameters are extracted for the purpose of potential future analysis and utilization. Data correlation is also identified and extracted.
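The two extraction implementations may be sketched with an in-memory SQLite database as follows. The table layout, the column names, the rank labels "matched" and "unmatched" and the function extract_instances are hypothetical and stand in for the database 160 only for the purpose of illustration.

```python
import sqlite3


def extract_instances(conn, header, rows, mapping, keep_unmatched=True):
    """Store form cells as instance data, marked with a correlation rank.

    mapping: {form parameter: matched semantic attribute}.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS instance ("
        "row_id INTEGER, form_parameter TEXT, semantic_attribute TEXT, "
        "value TEXT, correlation_rank TEXT)"
    )
    for row_id, row in enumerate(rows):
        for column, value in zip(header, row):
            attribute = mapping.get(column)
            if attribute is None and not keep_unmatched:
                continue  # first implementation: matched data only
            rank = "matched" if attribute is not None else "unmatched"
            conn.execute(
                "INSERT INTO instance VALUES (?, ?, ?, ?, ?)",
                (row_id, column, attribute, str(value), rank),
            )
    conn.commit()


header = ["equipment", "failure code", "remark"]
rows = [["C-01", "E12", "noisy"], ["C-02", "E07", ""]]
mapping = {"equipment": "machine", "failure code": "fault"}

conn = sqlite3.connect(":memory:")
extract_instances(conn, header, rows, mapping)
counts = dict(
    conn.execute("SELECT correlation_rank, COUNT(*) FROM instance GROUP BY correlation_rank")
)
print(counts)
```

Calling the function with keep_unmatched=False corresponds to the implementation in which only matched data are extracted.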
According to a second aspect, the present invention provides a semantic model instantiation system, including a processor; and a memory coupled with the processor, where the memory has instructions stored therein, the instructions enable an electronic device to execute actions when being executed by the processor, and the actions include: S1, receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a characteristic vector set, where the characteristic vectors represent the classes and attributes of an ontology and a relation between the attributes; S3, importing a semi-structured file, and converting the semi-structured file into a key word vector based on the semantic vector of the semantic model; and S4, comparing a correlation between the semantic vector and the key word vector, and identifying a key word vector corresponding to the semantic vector. Further, the following action is included between action S1 and action S3: S2, matching a near-synonym of a word of the semantic vector based on the semantic vector of the semantic model. Action S3 also includes: converting the semi-structured file into a key word vector based on the semantic vector of the semantic model and the near-synonym thereof.
Further, the following action is included after action S4: extracting instance data of the semi-structured file of the key word vector corresponding to the semantic vector to a database.
Further, the ontology includes classes, attributes and a relation between the attributes.
Further, action S3 also includes the following action when the semi-structured file is a form file: determining a header position of the form file, and identifying a data division of the form file. Further, action S4 also includes: executing multiple correlation computing methods based on the semantic vector, a synonym lexicon and the key word vector to obtain multiple correlation values to compare a correlation of the semantic vector and the key word vector, weighting the correlation values to construct a correlation matrix and screening out parameter mapping to identify a key word vector corresponding to the semantic vector, where the parameter mapping shows a matched key word vector and semantic vector. Further, the correlation matrix is constructed according to the following algorithm:
$$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$$
where Mij is a correlation, O is a semantic vector, K is a key word vector, wq is a weight, Simq is a correlation algorithm, and i, j, q are natural numbers.
According to a third aspect, the present invention provides a semantic model instantiation apparatus, including a first converting apparatus, for receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a characteristic vector set, where the characteristic vectors represent the classes and attributes of an ontology and a relation between the attributes; a second converting apparatus, for importing a semi-structured file, and converting the semi-structured file into a key word vector based on the semantic vector of the semantic model; and a comparing and identifying apparatus, for comparing a correlation of the semantic vector and the key word vector, and identifying a key word vector corresponding to the semantic vector.
According to a fourth aspect, the present invention provides a computer program product that is tangibly stored on a computer readable medium and includes a computer executable instruction which, when executed, enables at least one processor to execute the method according to the first aspect of the present invention.
According to a fifth aspect, the present invention provides a computer readable medium that stores a computer executable instruction which, when executed, enables at least one processor to execute the method according to the first aspect of the present invention.
Innovations of the present invention lie in that a semantic model is converted into semantic vectors, including class vectors and correlation vectors, synonyms are computed, and a synonym lexicon is constructed for each semantic vector. Each separate semantic vector acts as guidance for information extraction. As a result, any semantic model may be decomposed into many retrieval formulae for data retrieval, which is conducive to automatic matching and to the data retrieval process described by the semantic model.
Innovation of the present invention also lies in that useful header data from any semi-structured file are organized and converted into key word vectors, which includes identifying the key word parameter division and the data division of form files, and the key word parameters are extracted to obtain a tree structure. As a result, a form may be converted into vectors, and the vectors may be used for further comparison and computation for data extraction. Innovation of the present invention further lies in that a correlation mapping between any semantic vector and a key word vector is extracted, and relevant information is extracted from a semi-structured file. This serves to compute the distinction between the semantic vector and the key word vector and to match the parameter mapping. According to the present invention, a model-based, rapid and automatic mode of estimating and matching data is realized.
The present invention can greatly reduce workload and expense for constructing a knowledge graph, and thus accelerates knowledge-based convenient service.
Although the content of the present invention has been described in detail through the above preferred embodiments, it should be understood that the above description should not be considered as a limitation on the present invention. For those skilled in the art, various modifications and replacements to the present invention will be apparent after reading the above content. Therefore, the protection scope of the present invention should be subject to the appended claims. In addition, any reference numerals in the claims shall not be construed as limiting the claims; the word “include/comprise” does not exclude other apparatuses or steps not listed in claims or the specification; the words such as “first” and “second” are only used to indicate names, and do not indicate any particular order.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2019/093873 | 6/28/2019 | WO | 00 |