The following relates to a method and system for producing a semantic mapping of sensor data.
Data preparation is one of the first steps in any pipeline for developing domain-specific applications. While this step appears to vary based on factors such as the domain, the source systems, or the data formats, a majority of the effort is actually repetitive, especially the effort spent on data modeling and data integration. This leads to several problems. First, data preparation activities demand high effort because both domain expertise and expertise in data/knowledge engineering are required. Second, Extraction-Transformation-Loading (ETL) activities involve repeated effort, since many scripts with overlapping code must be written. Different data formats and storage systems need to be understood in order to perform the ETL activities. Performing these tasks can be difficult for inexperienced users, thereby leading to incorrect data preparation practices, which are usually discovered quite late and can cause substantial delays.
Overall, data preparation is a cost-intensive and time-consuming process, in addition to being error-prone and repetitive. This makes it one of the major bottlenecks while developing data-driven applications for various businesses.
So far, data integration requires the extensive involvement of (i) domain experts, who are well-versed with the industrial domain (e.g., plant engineer, turbine engineer) but lack the required programming skills or knowledge of ETL tasks for integration into knowledge graphs, as well as (ii) data scientists/engineers, who are equipped with the necessary IT skills but not with the domain know-how. A data scientist or a data engineer would typically go through each data source, identify which data is of interest, usually together with a domain expert, and define transformations and mappings, which unify these data with other sources. This process usually includes writing many scripts with potentially overlapping code. Moreover, the interaction between the domain and data experts is quite tightly coupled.
An aspect relates to providing an alternative to the state of the art.
According to the computer implemented method for producing a semantic mapping of sensor data, the following operations are performed by one or more processors:
The system for producing a semantic mapping of sensor data comprises:
The method and system provide an automated or semi-automated creation of a semantic mapping for sensor data. The semantic mapping assigns each element of the data sources to one of the semantic types. The automated or semi-automated creation of the semantic mapping loosens the coupling between a domain expert and data scientist, serves as a bridge between them and significantly reduces their workload, speeding up data modeling and further data integration steps. Furthermore, it provides inexperienced users with access to domain expertise. Re-use of data models is facilitated, which simplifies further integration and exchange activities. The adaptive learning algorithm provides an incremental enhancement of the classification model.
According to an embodiment of the method and system, the adaptive learning algorithm comprises the steps of:
According to an embodiment of the method and system, the adaptive learning algorithm comprises the steps of
An embodiment of the method comprises the additional steps of
An embodiment of the method comprises the additional step of
An embodiment of the method comprises the initial steps of
An embodiment of the method comprises the initial step of
An embodiment of the method comprises the initial step of
The computer-readable storage media have stored thereon instructions executable by one or more processors of a computer system, wherein execution of the instructions causes the computer system to perform the method.
The computer program is executed by one or more processors of a computer system and performs the method.
Some of the embodiments will be described in detail, with reference to the following figures, wherein like designations denote like members, wherein:
According to an embodiment of the method, energy consumption by the industrial plant IP needs to be optimized, in particular by regulating temperature. To consider all factors contributing to the temperature setting, it is necessary to integrate information from several levels of the industrial plant IP: First, information about the hardware elements installed in the plant and their heat generation levels needs to be considered. As heat generation depends not only on the mechanical work performed by the hardware, the software executed by the corresponding hardware elements and a manufacturing plan need to be considered as well. In addition, external factors need to be considered, such as a weather forecast. To integrate all this information, domain experts as well as data engineers are needed.
Semantically identical concepts (for example, temperature, heat, pressure, or machine tool) need to be identified across various data formats (for example, CSV or JSON) and storage systems (for example, relational databases), in potentially different units of measure (for example, Celsius or Fahrenheit), and with potentially different syntactic formats (for example, 1° C., or 1, or one).
According to this first embodiment of the method, semantically similar concepts in data sources are detected. A semantic type groups together semantically similar or identical concepts. Given a set of potential semantic types (for example, temperature, heat, pressure, machine tool), automated suggestions how to label elements in the data sources are presented. Further, a mapping of each data source (for example CSV) to a knowledge graph is suggested.
For sensor data, we assume that the data is available in data sources DS as shown in
Returning to
Automatically detecting the data types can be achieved by first inferring the data type of each value in a column of a given data source DS. This can be done in Python using built-in functions such as type() and isinstance(), as documented in subsections of the webpage "Built-in Functions", available on the internet at https://docs.python.org/3/library/functions.html#type and https://docs.python.org/3/library/functions.html#isinstance on 21.09.2020.
Afterwards, the data type of the column is set to be the most frequently occurring data type among the column values.
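The majority-vote type inference described above can be sketched in Python as follows; the function name and the example column are merely illustrative and not part of the method:

```python
from collections import Counter

def infer_column_type(values):
    """Infer a column's data type as the most frequently
    occurring Python type among its values."""
    counts = Counter(type(v).__name__ for v in values)
    return counts.most_common(1)[0][0]

# A mostly-numeric column with one stray string is still typed as int.
column = [1, 2, 3, "n/a", 5]
print(infer_column_type(column))  # → int
```

In this sketch, a single malformed entry (such as "n/a") does not change the inferred column type, which is the intended robustness of the majority vote.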
Another way of automatically detecting the data types is to directly use a built-in method/property to fetch the data types of columns, e.g., dtypes or infer_objects in Python, if the data is loaded as a Pandas dataframe. These methods and properties are documented on the webpage "pandas.DataFrame.dtypes", available on the internet at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html on 21.09.2020, and on the webpage "pandas.DataFrame.infer_objects", available on the internet at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.infer_objects.html on 21.09.2020. These approaches work by trying to convert the values in the columns to different data types and checking which data type conversion succeeds for most values.
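A minimal illustration of this second approach with Pandas; the column name and values are illustrative only:

```python
import pandas as pd

# A column loaded with the generic "object" dtype, although all
# of its values are integers:
df = pd.DataFrame({"pressure": [101, 102, 99]}, dtype=object)
print(df.dtypes["pressure"])        # object before inference

# infer_objects() attempts the conversion and tightens the dtype.
inferred = df.infer_objects()
print(inferred.dtypes["pressure"])  # int64 after inference
```

Note that infer_objects performs only "soft" conversions of object columns that already hold uniformly typed values; parsing numeric strings would additionally require a function such as pandas.to_numeric.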
Automatically detecting the data types facilitates later processing steps. In order to process text data and numerical data differently, the data types of the corresponding columns must be known. For mapping the columns to the correct semantic types, we compute latent features for these columns. These latent features differ based on the data types: features such as domain, range, minimum, maximum, mean, and median apply only to numerical data, whereas features such as the proportion of special characters and character frequencies are applicable to textual data.
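The computation of such latent features can be sketched as follows; the exact feature set is an illustrative assumption:

```python
import statistics

def numeric_features(values):
    # Latent features that only apply to numerical columns.
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def text_features(values):
    # Latent features applicable to textual columns, e.g. the
    # proportion of special (non-alphanumeric, non-space) characters.
    text = "".join(values)
    special = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return {"special_char_ratio": special / len(text) if text else 0.0}

print(numeric_features([20.0, 22.0, 21.0]))
# → {'min': 20.0, 'max': 22.0, 'mean': 21.0, 'median': 21.0}
```

Each column is thus reduced to a fixed-length feature vector, which is the input representation used for the classification described further below.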
A second aspect of the data extraction step DES is the comprehensive modelling of numeric data types, in particular of measurement data using various unit formats (for example, the metric system vs. the imperial system). This feature is crucial in industrial domains and essential to correctly identify semantically similar concepts.
The comprehensive modelling of numeric data types can be achieved by automatically computing latent features of columns as mentioned above (e.g., domain, range, min, max, mean, median). Additionally, the units in which the values are expressed can be extracted. Overall, this step can be performed semi-automatically, with the possibility for a user to make modifications, if and where necessary.
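The extraction of units can, for instance, be sketched with a simple regular expression; the pattern and the list of recognized unit symbols are assumptions made for illustration only:

```python
import re

# Matches an optional sign, a number, and an optional known unit symbol.
UNIT_PATTERN = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*(°C|°F|K|bar|Pa|%)?\s*$")

def split_value_and_unit(raw):
    """Separate a raw measurement string into a numeric value and a unit."""
    m = UNIT_PATTERN.match(raw)
    if m is None:
        return None, None
    value, unit = m.groups()
    return float(value), unit

print(split_value_and_unit("23.5 °C"))  # → (23.5, '°C')
```

A production system would typically rely on a more complete unit registry, but the principle of separating the numeric value from its unit symbol remains the same.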
A third aspect of the data extraction step DES is extracting semantic types from the target data model TDM, which is an industrial model of the industrial plant IP as described with reference to
In other words, the data extraction step DES contains the step of
If the target data model TDM is an ontology expressed in OWL, then extracting the semantic types can simply be achieved by automatically fetching the semantic types as the different classes, data properties, and object properties from the ontology.
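As a minimal sketch, and assuming the ontology is serialized as RDF/XML, the classes and properties can be fetched with the standard-library XML parser; in practice a dedicated RDF library would typically be used, and the inline ontology snippet below is purely illustrative:

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A tiny illustrative RDF/XML ontology; a real target data model TDM
# would be loaded from a file.
ONTOLOGY = """<rdf:RDF xmlns:rdf="{rdf}" xmlns:owl="{owl}">
  <owl:Class rdf:about="#Temperature"/>
  <owl:Class rdf:about="#MachineTool"/>
  <owl:DatatypeProperty rdf:about="#hasValue"/>
  <owl:ObjectProperty rdf:about="#measuredBy"/>
</rdf:RDF>""".format(rdf=RDF, owl=OWL)

def extract_semantic_types(rdf_xml):
    """Fetch classes, data properties and object properties as semantic types."""
    root = ET.fromstring(rdf_xml)
    types = []
    for tag in ("Class", "DatatypeProperty", "ObjectProperty"):
        for el in root.findall(f"{{{OWL}}}{tag}"):
            types.append(el.get(f"{{{RDF}}}about").lstrip("#"))
    return types

print(extract_semantic_types(ONTOLOGY))
# → ['Temperature', 'MachineTool', 'hasValue', 'measuredBy']
```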
According to a first aspect of a data integration step DIS, the semantic types are used to label first elements in the data sources DS, with the labelled first elements forming a labelled cohort. This part of the data integration step DIS is performed manually by a domain expert.
In other words, the data integration step DIS begins with
Based on this labelled cohort, a classification model CM is trained according to a second aspect of the data integration step DIS. In other words, one or more of the processors are receiving:
The classification model CM can be implemented as a multi-class classification machine learning model. Details on suitable multi-class classification machine learning models and their training can be found in Natalia Rümmele, Yuriy Tyshetskiy, and Alex Collins: "Evaluating approaches for supervised semantic labeling", available on the internet on 21.09.2020 at https://arxiv.org/pdf/1801.09788.pdf.
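A minimal sketch of such a multi-class classifier, here using a scikit-learn random forest; the hand-picked toy features (mean and standard deviation of a column) and the labels are illustrative assumptions, not the feature set prescribed by the method:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data: each row holds latent features of one labelled
# column; labels are the semantic types from the labelled cohort.
X_train = [[21.0, 1.2], [22.5, 0.9], [1.01, 0.02], [0.98, 0.03]]
y_train = ["temperature", "temperature", "pressure", "pressure"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Predict the semantic type of an unseen column from its features.
print(model.predict([[20.4, 1.1]])[0])  # → temperature
```

The predict_proba method of such a model additionally yields per-class confidence scores, which are used in the feedback workflow described below.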
According to a further aspect of the data integration step DIS, after the initial training, the classification model CM is re-trained by one or more of the processors with an adaptive learning algorithm AL, with the adaptive learning algorithm AL implementing active learning and/or incremental learning, until the classification model is fully capable of mapping each element of the data sources to one of the semantic types.
Once trained, the classification model CM can be used to automatically recommend (predict) semantic types for other data sources DS.
The adaptive learning algorithm AL can harvest user feedback from the user U, for example a domain expert, and can be implemented as a form of active learning as described on the webpage "Active learning (machine learning)", available on the internet at https://en.wikipedia.org/wiki/Active_learning_(machine_learning) on 21.09.2020, and/or as a form of incremental learning as described on the webpage "Incremental learning", available on the internet at https://en.wikipedia.org/wiki/Incremental_learning on 23.09.2020.
This has the advantage that rather than stopping at the stage where the classification model CM generates predictions, the user U can be involved as a means for further enhancing the classification model CM.
While there are several ways of incorporating user feedback, one workflow is as follows. The user U is presented with the semantic types predicted for various elements in the data sources DS along with corresponding confidence scores. The user U can then choose to either auto-accept the predictions based on the confidence (for example if they are above a certain threshold) or correct the misclassifications, where necessary and known. This then serves as additional training data for retraining the classification model CM. There can be variations with respect to which predictions require attention of the user U, for example, those with lower confidence scores, those with fewer or no training data, or those which stand out as potential outliers.
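This triage of predictions can be sketched as follows; the threshold value, the data structures, and the element names are illustrative assumptions:

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off for auto-acceptance

def triage_predictions(predictions):
    """Split predictions into auto-accepted labels and items that
    require the attention of the user U (low-confidence predictions)."""
    accepted, needs_review = {}, []
    for element, (semantic_type, confidence) in predictions.items():
        if confidence >= CONFIDENCE_THRESHOLD:
            accepted[element] = semantic_type
        else:
            needs_review.append(element)
    return accepted, needs_review

preds = {
    "col_temp": ("temperature", 0.97),
    "col_unknown": ("pressure", 0.55),
}
accepted, needs_review = triage_predictions(preds)
print(accepted)      # → {'col_temp': 'temperature'}
print(needs_review)  # → ['col_unknown']
```

The labels corrected by the user for the items in needs_review then become additional training data for retraining the classification model CM.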
In more general words, the adaptive learning algorithm AL can comprise the steps of
The classification model CM (further enhanced through adaptive learning policies as described above) can be leveraged to potentially rectify errors and inconsistencies in user-labelled data. This is done by forcing the classification model CM to perform predictions not only on the unseen/unlabelled data but also on the training data and identify any mismatches between the user-provided labels and the ones generated by the classification model CM. These mismatches can then be highlighted to the user U for further correction, wherever applicable.
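The mismatch detection on the training data can be sketched as follows; the stand-in predictor and the example labels are illustrative assumptions:

```python
def find_label_mismatches(model_predict, training_data):
    """Run the classification model on its own training data and
    collect elements where the predicted semantic type disagrees
    with the user-provided label."""
    mismatches = []
    for element, features, user_label in training_data:
        predicted = model_predict(features)
        if predicted != user_label:
            mismatches.append((element, user_label, predicted))
    return mismatches

# Stand-in predictor for illustration only: calls anything above 100 "pressure".
fake_predict = lambda f: "pressure" if f[0] > 100 else "temperature"
training = [
    ("col_a", [21.0], "temperature"),
    ("col_b", [101.3], "temperature"),  # possibly mislabelled by the user
]
print(find_label_mismatches(fake_predict, training))
# → [('col_b', 'temperature', 'pressure')]
```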
In more general words, the adaptive learning algorithm AL can comprise the steps of:
The classification model CM can be retrained iteratively until each element of the data sources DS is mapped to one of the semantic types with a desired accuracy, which means that the classification model CM is now fully capable of mapping each element of the data sources DS to one of the semantic types. As a result, a semantic mapping is formed, with the semantic mapping assigning each element of the data sources DS to one of the semantic types. Any of the approaches described above can be used or combined for training and retraining the classification model CM in order to create the semantic mapping.
Once the semantic mapping is finalized, i.e., the user U (who can either be a data scientist or a domain expert) is satisfied with the mappings produced by the classification model CM, it can be exported in one of several supported formats. The choice of export format (for example, R2RML, CSV, JSON, Excel) could be based on various factors, such as the specifications of a mapping execution engine, a necessity for a target system to be OBDA (Ontology-Based Data Access) compliant, requirements to store the mappings along with instance data in a knowledge graph, and so on.
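For the simpler export formats, this step amounts to serializing an in-memory mapping structure; the record layout below is an assumption for illustration (R2RML export would instead require generating RDF mapping rules):

```python
import csv
import io
import json

# Assumed in-memory representation of the finalized semantic mapping.
mapping = [
    {"source": "sensors.csv", "column": "temp_1", "semantic_type": "Temperature"},
    {"source": "sensors.csv", "column": "p_out", "semantic_type": "Pressure"},
]

def export_json(mapping):
    """Serialize the mapping as pretty-printed JSON."""
    return json.dumps(mapping, indent=2)

def export_csv(mapping):
    """Serialize the mapping as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "column", "semantic_type"])
    writer.writeheader()
    writer.writerows(mapping)
    return buf.getvalue()

print(export_csv(mapping))
```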
In addition or as an alternative, the semantic mapping can be used to create or update a knowledge graph from data stored in the data sources DS in a materialization step MS.
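The materialization step MS can be sketched as follows, here emitting triples in N-Triples syntax; the namespace, the predicate names, and the row data are assumptions for illustration, and a production system would typically use an RDF library and a full mapping language such as R2RML:

```python
NS = "http://example.org/plant#"  # assumed namespace for illustration

def materialize(mapping, rows):
    """Apply the semantic mapping to rows from a data source and
    emit knowledge-graph triples as N-Triples strings."""
    triples = []
    for i, row in enumerate(rows):
        subject = f"<{NS}observation{i}>"
        for column, semantic_type in mapping.items():
            triples.append(
                f'{subject} <{NS}{semantic_type}> "{row[column]}" .'
            )
    return triples

mapping = {"temp_1": "hasTemperature"}
rows = [{"temp_1": 21.5}, {"temp_1": 22.0}]
for t in materialize(mapping, rows):
    print(t)
```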
In more general words, the workflow can continue with the steps of
The knowledge graph can then be processed by one or more of the processors in order to control physical devices of the industrial plant IP based on the sensor data contained in the data sources DS. For example, and with regard to the above-mentioned scenario of optimizing energy consumption by the industrial plant IP by regulating temperature, the knowledge graph can be processed automatically in order to determine an optimal setting for a heating and/or cooling system installed in the industrial plant IP, and the heating and/or cooling system can then be automatically controlled with the determined setting.
The method can be executed by a processor such as a microcontroller or a microprocessor, by an Application Specific Integrated Circuit (ASIC), by any kind of computer, including mobile computing devices such as tablet computers, smartphones or laptops, or by one or more servers in a control room or cloud. For example, a processor, controller, or integrated circuit of the computer system and/or another processor may be configured to implement the acts described herein.
The above-described method may be implemented via a computer program product (non-transitory computer readable storage medium having instructions, which when executed by a processor, perform actions) including one or more computer-readable storage media having stored thereon instructions executable by one or more processors of a computing system. Execution of the instructions causes the computing system to perform operations corresponding with the acts of the method described above.
The instructions for implementing processes or methods described herein may be provided on non-transitory computer-readable storage media or memories, such as a cache, buffer, RAM, FLASH, removable media, hard drive, or other computer readable storage media. Computer readable storage media include various types of volatile and non-volatile storage media. The functions, acts, or tasks illustrated in the figures or described herein may be executed in response to one or more sets of instructions stored in or on computer readable storage media. The functions, acts or tasks may be independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like.
Although the present invention has been disclosed in the form of embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention.
For the sake of clarity, it is to be understood that the use of “a” or “an” throughout this application does not exclude a plurality, and “comprising” does not exclude other steps or elements.
This application claims priority to PCT Application No. PCT/EP2021/074081, having a filing date of Sep. 1, 2021, which claims priority to EP Application No. 20198304.6, having a filing date of Sep. 25, 2020, the entire contents both of which are hereby incorporated by reference.