The present disclosure relates generally to machine learning, and more specifically to application of machine learning to the detection and suggestion of fixes for data errors in digital representations of infrastructure.
As part of the design, construction and operation of infrastructure (e.g., electrical networks, gas networks, water and/or wastewater networks, rail networks, road networks, buildings, plants, bridges, etc.) it is often desirable to create digital representations of the infrastructure, where each component in the infrastructure is represented by a digital counterpart. A digital representation may take the form of a built infrastructure model (BIM) or a digital twin of the infrastructure. A BIM is a digital representation of infrastructure as it should be built, providing a mechanism for visualization and collaboration. A digital twin is a digital representation of infrastructure as it is actually built, and is often synchronized with information representing current status, working conditions, position, or other qualities.
To create a digital representation of infrastructure (e.g., a BIM or digital twin), one typically imports data describing the components from multiple different data sources. This imported data typically includes information describing the type (e.g., class) of each component, as well as the properties of each component. There is typically a large number of components, dispersed across many types (e.g., classes), each of which may have a different number and/or selection of properties. Depending on the properties, their values may be discrete or continuous. For example, where the infrastructure is an electrical network (e.g., of a medium-sized city) there may be several hundred thousand components, arranged into several dozen types (classes), each with about 3 to 15 properties (the number and selection of properties differing for each type of component). Some of these 3 to 15 properties may be discrete and others may be continuous.
Ensuring quality and consistency of this sort of heterogeneous data in a digital representation of infrastructure presents a significant technical challenge. Imported data often includes large numbers of missing and erroneous property values, due to sourcing from incomplete or unreliable data sources, data corruption, human error and/or other factors. If missing/erroneous property values are not fixed, the digital representation of infrastructure may be unreliable or unusable.
Existing techniques for addressing data errors in a digital representation of infrastructure may be broadly classified into manual techniques, scripting language-based techniques, and Extract-Transform-Load (ETL)/Extract-Load-Transform (ELT) techniques. In manual techniques, a subject matter expert (SME), who is usually an experienced engineer, looks through the entirety of the imported data, and examines properties of each component to detect any erroneous/missing property values. The SME then, based on their technical understanding, fills in any missing property values and corrects any erroneous property values.
While usable on small projects, manual techniques have significant shortcomings. Foremost is that they do not scale, such that they become impractical as project size grows. For instance, considering the example above of an electrical network having several hundred thousand components, each with about 3 to 15 properties, manual techniques may require SME review of millions of individual property values. Even if the expense of all the engineer-hours required to conduct such review could be borne, reviewer fatigue may decrease the quality of review, such that significant missing/erroneous property values may still remain.
In scripting language-based techniques, a SME uses a scripting language to create custom rule-based algorithms, which are then applied to the imported data. Such rule-based algorithms are generally simplistic in nature, such that they look for particular predefined error types and, upon their detection, apply predefined fixes.
While usable with some types of simple data errors, rule-based algorithms are typically inadequate for addressing many types of data errors in digital representations of infrastructure. Infrastructure is typically very complex, and all the possible missing/erroneous property values that may be present in the data, and their appropriate fixes, typically cannot be foreseen beforehand and hand-coded into rules by a SME. As such, significant missing/erroneous property values may still remain after application of rule-based algorithms.
In ETL/ELT techniques, data science procedures are applied to validate, cleanse, and enrich data as it is being imported. ETL and ELT both share common stages of extraction (i.e., pulling the data from its original source), transformation (i.e., changing the structure of the data so it integrates with the target system), and loading (i.e., depositing the data in the storage of the target system). ETL and ELT differ in the ordering of these stages and their manner of execution. For example, ETL typically moves data from a data source to a staging area and transforms it before it is deposited. In contrast, ELT avoids data staging, and instead takes advantage of the target system to transform the data, which may result in better performance and flexibility.
While at least theoretically usable for addressing data errors in a digital representation of infrastructure, ETL/ELT techniques are not specifically designed for this use case and may require significant expertise to apply properly. Even if so applied, ETL/ELT techniques generally utilize rule-based algorithms, and therefore may suffer many shortcomings similar to scripting languages. Further, they may be biased towards specific error types and introduce additional errors in the transformation process. As such, significant missing/erroneous property values, as well as new data errors, may be present after application of ETL/ELT techniques.
Accordingly, there is a need for improved techniques for ensuring quality and consistency of the data in a digital representation of infrastructure. It would be useful if such techniques could both detect data errors, and suggest fixes for the data errors.
In example embodiments, machine learning techniques are provided for ensuring quality and consistency of the data in a digital representation of infrastructure (e.g., a BIM or digital twin). A machine learning model learns the structure of the digital representation of infrastructure, and then detects and suggests fixes for data errors. The machine learning model may include an embedding generator, an autoencoder, and decoding logic, employing embeddings and metamorphic truth to enable the handling of heterogeneous data with missing and erroneous property values. The machine learning model may be trained in an unsupervised manner from the digital representation of infrastructure itself (e.g., by assuming that a significant portion is correct). An SME review workflow may be provided to correct predictions and inject ground truth to improve performance.
It should be understood that a variety of additional features and alternative embodiments may be implemented other than those discussed in this Summary. This Summary is intended simply as a brief introduction to the reader and does not indicate or imply that the examples mentioned herein cover all aspects of the disclosure or are necessary or essential aspects of the disclosure.
The description below refers to the accompanying drawings of example embodiments, of which:
The application 100 may be a stand-alone software application or a component of a larger software application. In one example implementation, the application 100 is the Open Utilities™ digital twin services application, available from Bentley Systems, Inc. of Exton, Pa. However, it should be understood that the application 100 may take a variety of other forms. The application 100 may be divided into local software 110 that executes on one or more computing devices local to an end-user (collectively “local devices”) and, in some cases, cloud-based software 120 that executes on one or more computing devices remote from the end-user (collectively “cloud computing devices”) accessible via a network (e.g., the Internet). Each computing device may include processors, memory/storage, a display screen, and/or other hardware (not shown) for executing software, storing data and/or displaying information. The local software 110 may include a number of software modules operating on a local device, and in some cases within a web-browser of the local device, and the cloud-based software 120, if present, may include additional software modules operating on cloud computing devices.
Operations may be divided in a variety of different manners among the software modules. For example, in one implementation, software modules of the local software 110 may be responsible for performing non-processing intensive operations such as providing user interface functionality. To such end, the software modules of the local software 110 may include a user interface module 130, as well as other software modules (not shown). The software modules of the cloud-based software 120 may perform more processing intensive operations, such as operations related to machine learning. To such end, the software modules of the cloud-based software 120 may include a preprocessing module 140 that preprocesses a digital representation of infrastructure and ancillary data to facilitate machine learning. The software modules of the cloud-based software 120 may also include a machine learning model 150 that learns the structure of the digital representation of infrastructure and then predicts values for missing/erroneous property values. As explained in more detail below, the machine learning model 150 may include an embedding generator, an autoencoder, and decoding logic. The software modules of the cloud-based software 120 may also include an imputation and correction module 160 that uses the predicted values to replace missing/erroneous property values, subject to SME review.
At step 220 of the data acquisition and preparation phase, the preprocessing module 140 configures an embedding generator and decoding logic based on the received data. As explained in more detail below, the machine learning model 150 may include an embedding generator that includes a number of component-type specific encoders that convert a heterogeneous representation of data into a homogenous representation of data. Likewise, the machine learning model may include decoding logic that converts a homogenous representation of data back into a heterogeneous representation of data. Configuration of the embedding generator and decoding logic may include selection of individual encoders and decoders based on component types in the received data (e.g., in the list of components to check) as well as other customizations based on the received data.
At step 230 of the data acquisition and preparation phase, the preprocessing module 140 preprocesses the received data to at least convert the digital representation of infrastructure to a form more suitable for machine learning. The conversion may include converting data in the digital representation of infrastructure to a graph representation, where each component is a node, and the nodes are connected by edges that indicate connections among components or parent/child relationships. For instance, in an example where the infrastructure is an electrical network, the nodes may represent transformers, circuits, wires, etc. and the edges may indicate that wires are connected to particular transformers, that circuits include particular wires, etc. Each node in the graph representation may be associated with the type (e.g., class) of the component and one or more properties of the component. As mentioned above, properties may be discrete or continuous. In this context "discrete" refers to taking on values that come from a specific and restrained set, and "continuous" refers to taking on values that may fall within a range but that are not otherwise constrained. For instance, in an example where the infrastructure is an electrical network, the primary voltage of a transformer may be discrete (e.g., it may be 25 kV or 50 kV, but not 25.887 kV), while the length of a wire may be continuous (e.g., it may be any value greater than 0 but less than 100 m, such as 32.345 m). As part of the preprocessing of step 230, the preprocessing module 140 may convert discrete properties to a one-hot vector representation (i.e., a vector that typically contains all zeros except one position that indicates the discrete value). As a special case, if a particular property value is missing, the one-hot vector may contain only zeros. Further, the preprocessing module 140 may normalize continuous properties (e.g., such that their values all have a mean of 0 and a standard deviation of 1).
The preprocessing module 140 may concatenate the one-hot vector and the continuous values to produce a property value vector for each node of the graph representation. The resulting graph representation with a property value vector for each node may be considered heterogeneous. In this context, "heterogeneous" refers to data that occupies multiple distinct and different spaces of potentially different dimensionalities for different members (e.g., types of components).
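The one-hot encoding, normalization, and concatenation described above can be sketched as follows. This is a minimal illustration; the category lists, example values, and function names are assumptions, not the disclosed implementation.

```python
import numpy as np

def one_hot(value, categories):
    """Encode a discrete property as a one-hot vector.
    A missing value (None) yields an all-zero vector, per the special case above."""
    vec = np.zeros(len(categories))
    if value is not None:
        vec[categories.index(value)] = 1.0
    return vec

def normalize(values):
    """Normalize continuous property values to mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Example: a transformer node with a discrete primary voltage and a
# continuous length, concatenated into one property value vector.
voltages = ["25 kV", "50 kV"]               # illustrative category set
discrete_part = one_hot("25 kV", voltages)  # -> [1.0, 0.0]
lengths = normalize([32.345, 10.0, 75.5])   # normalized across the dataset
node_vector = np.concatenate([discrete_part, lengths[:1]])
```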
At step 240 of the machine learning model unsupervised training phase, the embedding generator of the machine learning model 150 converts the graph representation of heterogeneous data produced by the preprocessing module 140 to a graph representation of homogenous data. In this context, “homogenous” refers to data that occupies a common space across all members (e.g., across all components). Such a homogenous representation may be better suited for machine learning as it permits comparison and simultaneous processing across different types of components.
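One simple way to realize component-type specific encoders producing such a homogeneous representation is a per-type linear projection into a shared embedding space. The sketch below is illustrative only; the type names, dimensions, and random initialization are assumptions, not the disclosed architecture (in practice the projections would be learned).

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # common embedding dimension shared by all component types

# One encoder (here, a plain linear projection) per component type; input
# dimensionality differs per type, output dimensionality is shared.
encoders = {
    "transformer": rng.standard_normal((5, EMBED_DIM)),  # 5 raw features
    "wire": rng.standard_normal((3, EMBED_DIM)),         # 3 raw features
}

def embed(component_type, property_vector):
    """Project a type-specific property vector into the common space."""
    return np.asarray(property_vector) @ encoders[component_type]

t = embed("transformer", [1, 0, 0, 0.5, -0.2])
w = embed("wire", [0, 1, 0.3])
# Both vectors now live in the same 8-dimensional space and can be
# compared and processed simultaneously despite different input shapes.
```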
To convert the graph representation of heterogeneous data to a graph representation of homogenous data, the embedding generator may first group (“bag”) the nodes by their type. For example, as shown in
At step 250 of the machine learning model unsupervised training phase, the machine learning model 150 processes and predicts abstract feature values for the nodes of the graph representation of homogenous data. The machine learning model may use an autoencoder architecture based on metamorphic truth, such that it learns to produce the same abstract feature values it sees as inputs as outputs. The autoencoder architecture may use graph attention networks (GATs). A GAT is a type of neural network that operates on graph-structured data and leverages masked self-attentional layers that when stacked are able to attend over their neighborhoods' features.
Each GAT layer 612-626 of the autoencoder architecture may include a number of sublayers, including a leaky-rectified linear unit (RELU) activation layer 632, a dropout layer (including a drop attention layer 634 and a drop features layer 636) and a fully-connected layer 640. The leaky-RELU activation layer 632 may implement a piecewise linear function that outputs its input directly if it is positive, but only passes small negative values when the input is less than zero. The dropout layer randomly hides (e.g., voids as if they were missing) a portion (e.g., 20%) of its input. In the dropout layer, the drop attention layer 634 may randomly hide nodes while the drop features layer 636 may randomly hide abstract features. Use of a dropout layer may improve performance and robustness by avoiding a pure “copy-paste” operation between inputs and outputs and forcing consideration of node neighbors.
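A minimal single-head graph attention layer with the leaky-RELU activation and drop-attention behavior described above might look like the following NumPy sketch. The shapes and the drop probability are illustrative; the actual architecture stacks multiple learned layers.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Pass positive inputs directly; pass only small negative values otherwise."""
    return np.where(x > 0, x, slope * x)

def gat_layer(H, adj, W, a, drop_p=0.0, rng=None):
    """Single-head graph attention layer (illustrative sketch).
    H: (n, d_in) node features; adj: (n, n) adjacency with self-loops;
    W: (d_in, d_out) projection; a: (2*d_out,) attention vector."""
    Z = H @ W
    d = Z.shape[1]
    # Attention logit e_ij = LeakyReLU(a_l . z_i + a_r . z_j), masked to edges.
    logits = leaky_relu((Z @ a[:d])[:, None] + (Z @ a[d:])[None, :])
    logits = np.where(adj > 0, logits, -1e9)
    if rng is not None and drop_p > 0.0:
        # "Drop attention": randomly hide a portion of neighbor connections.
        drop = rng.random(logits.shape) < drop_p
        logits = np.where(drop & (adj > 0), -1e9, logits)
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
    return alpha @ Z  # each node aggregates its neighborhood's features
```

With only self-loops in the adjacency, each node attends solely to itself and the layer reduces to the linear projection; with a denser adjacency, each node's output mixes in its neighbors' features, which is what forces consideration of node neighbors when inputs are dropped.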
At step 260 of the machine learning model unsupervised training phase, the decoding logic of the machine learning model 150 converts the reconstructed graph representation of homogenous data back into a graph representation of heterogeneous data, where each node is associated with a property value vector that indicates real-world significant qualities. The decoder logic of the machine learning model 150 may implement an architecture similar to the architecture of the embedding generator of
At step 270 of the machine learning model unsupervised training phase, the application monitors training progress, and loops to step 240 if one or more metrics indicate further training is required. Various metrics may be examined alone or in combination. One metric may be a number of training iterations (e.g., specified in the error detection parameters and settings), and execution may loop to step 240 if a predetermined number of training iterations have not been completed. Another metric may be reconstruction loss for each component type and/or each property type, and execution may loop to step 240 if reconstruction loss fails to meet a predetermined minimum performance. In some implementations, reconstruction loss may be evaluated by applying the machine learning model to data where all values have been voided and comparing predictions to validation data.
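The looping logic of step 270 can be sketched as a simple training driver that stops when either metric is satisfied. The step function and thresholds below are hypothetical placeholders for one pass through steps 240-260 and the error detection parameters and settings.

```python
def train_until_done(train_step, max_iters, target_loss):
    """Run training iterations until the iteration budget is exhausted
    or reconstruction loss reaches the target (step 270)."""
    loss = float("inf")
    iters = 0
    for iters in range(1, max_iters + 1):
        loss = train_step()
        if loss <= target_loss:
            break
    return iters, loss

# Hypothetical stand-in for one training pass: reconstruction loss decays.
losses = iter([0.9, 0.5, 0.2, 0.05, 0.01])
iters, final_loss = train_until_done(lambda: next(losses),
                                     max_iters=10, target_loss=0.1)
```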
At step 280 of the inference and user feedback phase, the graph representation of heterogeneous data is applied again to the now trained machine learning model 150 to produce final predictions for property values of components. Operations similar to steps 240-260 discussed above may be repeated, such that the embedding generator converts the graph representation of heterogeneous data to a graph representation of homogenous data, an autoencoder architecture processes and predicts abstract feature values for the nodes of the graph representation of homogenous data, and decoding logic converts the reconstructed graph representation of homogenous data back into a graph representation of heterogeneous data, ultimately producing, for each component, predicted property values with respective confidences.
At step 285 of the inference and user feedback phase, the imputation and correction module 160 tentatively replaces missing property values in the digital representation of infrastructure with the predicted values for those property values from the final predictions. If there are multiple predicted values (as is typically the case), the predicted value with the greatest confidence may be selected and used for replacement. At least some of the predictions used in the tentative replacements of missing property values may be selected for SME review, as discussed further below. In one implementation, all predictions for missing property values may be selected for SME review. In an alternative implementation, some predictions are selected for SME review based on a comparison of confidence in the predictions to a missing value confidence threshold (which may be provided as part of the detection parameters and settings).
At step 290 of the inference and user feedback phase, the imputation and correction module 160 tentatively replaces erroneous property values in the digital representation of infrastructure with the predicted values for those property values from the final predictions. A property value may be considered erroneous based on a ratio between the confidence in the original property value and the confidence in the predicted property value with the greatest confidence. Where the ratio exceeds a threshold (e.g., 4×), the original property value may be considered erroneous and replaced with the predicted property value with the greatest confidence. In one implementation, all predictions used in the tentative replacement of erroneous property values may be selected for SME review. In an alternative implementation, some predictions are selected for SME review based on a comparison of the ratio to an erroneous value confidence threshold (which may be provided as part of the detection parameters and settings).
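The selection and replacement rules of steps 285 and 290 reduce to a few comparisons, sketched below. The 4x ratio follows the example above; the data shapes, function names, and example confidences are assumptions for illustration.

```python
def best_prediction(predictions):
    """predictions: list of (value, confidence) pairs from the final
    predictions; pick the one with the greatest confidence (step 285)."""
    return max(predictions, key=lambda p: p[1])

def is_erroneous(original_conf, predicted_conf, ratio_threshold=4.0):
    """Step 290: flag the original value as erroneous when the best
    prediction is sufficiently more confident than the original."""
    return predicted_conf / original_conf > ratio_threshold

value, conf = best_prediction([("25 kV", 0.6), ("50 kV", 0.3)])
flag = is_erroneous(original_conf=0.1, predicted_conf=0.6)  # ratio 6x > 4x
```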
At step 295 of the inference and user feedback phase, the imputation and correction module 160 working together with the user interface module 130 solicits SME review of the predictions for missing/erroneous property values used in the tentative replacements in steps 285-290. If a prediction for a missing/erroneous property value is incorrect, the SME may be requested to enter a replacement themselves. Alternatively, the SME may be presented with the N prediction alternatives with the highest confidences, and requested to select the correct property value from among them. This alternative can be particularly useful when a property has multiple possible values, and reducing the number of possibilities is helpful. Operation of step 295 produces a final digital representation of infrastructure. Ground truth from the SME review may also be fed back for use in further training of the machine learning model 150 to improve prediction in subsequent operation.
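Presenting the N highest-confidence alternatives for SME selection is straightforward, as in the illustrative helper below. N would come from the detection parameters and settings, and the example property values are hypothetical.

```python
def top_n_alternatives(predictions, n=3):
    """Return the n predictions with the highest confidences, for
    presentation to the SME during review (step 295)."""
    return sorted(predictions, key=lambda p: p[1], reverse=True)[:n]

# Hypothetical predictions for a wire-material property.
choices = top_n_alternatives(
    [("ACSR", 0.40), ("Copper", 0.35), ("AAAC", 0.15), ("Unknown", 0.10)],
    n=2,
)
```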
At step 720, the imputation and correction module 160 working together with the user interface module 130 receives a user selection of a prediction in the user interface. For example, referring to
At step 760, the imputation and correction module 160 working together with the user interface module 130 receives an SME indication of whether the selected prediction is correct (and should be validated and saved) or incorrect (and should be rejected and fixed). When the selected prediction is incorrect, a corrected value may also be received. In some implementations, the indication that the selected prediction is correct or incorrect may be implicit. For example, referring to
If the SME indicated the selected prediction is incorrect, at step 770, the imputation and correction module 160 replaces the prediction (or group of predictions if the prediction was placed in a group in step 740) with the corrected value, and sets a correction flag for the prediction (or for each member of the group of predictions). Then, at step 780, the imputation and correction module 160 compares the number of predictions that have correction flags set to a retraining threshold. If the number exceeds the retraining threshold, at step 785, retraining of the machine learning model 150 is initiated. In such case, steps similar to those discussed in connection with steps 240-270 of
Returning to
In summary, machine learning techniques are provided for ensuring quality and consistency of the data in a digital representation of infrastructure (e.g., a BIM or digital twin). The techniques may utilize a machine learning model that includes an embedding generator, an autoencoder, and decoding logic, employing embeddings and metamorphic truth to enable the handling of heterogeneous data with missing and erroneous property values. An SME review workflow may be provided to correct predictions and inject ground truth to improve performance. It should be understood that a wide variety of adaptations and modifications may be made to the architecture and techniques used therewith. It should be remembered that functionality may be implemented using different software, hardware, and various combinations thereof. Software implementations may include electronic device-executable instructions (e.g., computer-executable instructions) stored in a non-transitory electronic device-readable medium (e.g., a non-transitory computer-readable medium), such as a volatile memory, a persistent storage device, or other tangible medium. Hardware implementations may include logic circuits, application specific integrated circuits, and/or other types of hardware components. Further, combined software/hardware implementations may include both electronic device-executable instructions stored in a non-transitory electronic device-readable medium, as well as one or more hardware components. Above all, it should be understood that the above description is meant to be taken only by way of example.