This application claims priority to EP Application No. 21152148.9, having a filing date of Jan. 18, 2021, the entire contents of which are hereby incorporated by reference.
The following relates to an industrial device and method for building and/or processing a knowledge graph.
Graph-based data analytics are playing an increasingly crucial role in industrial applications. Prominent examples are knowledge graphs, which are based on graph-structured databases able to ingest and represent (with semantic information) knowledge from potentially multiple sources and domains. Knowledge graphs are rich data structures that enable a symbolic description of abstract concepts and how they relate to each other. The use of knowledge graphs makes it possible to integrate previously isolated data sources in a way that enables AI and data analytics applications to work on a unified, contextualized, semantically rich knowledge base. This in turn enables more generic, interpretable, interoperable and accurate AI algorithms, which perform their tasks (e.g., reasoning or inference) working with well-defined entities and relationships from the domain(s) of interest, e.g., industrial automation or building systems.
How these entities relate to each other is modeled with edges of different types between nodes. This way, the graph can be summarized using semantically meaningful statements, so-called triples or triple statements, that take the simple and human-readable shape ‘subject-predicate-object’, or in graph format, ‘node-relation-node’.
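As a minimal illustration of the 'subject-predicate-object' shape, the following Python sketch stores a few triple statements and reads them as labeled edges between nodes. The entity and relation names are hypothetical and chosen only for illustration; they are not taken from any particular embodiment.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements about a small automation system.
triples = [
    Triple("sensor_1", "attached_to", "conveyor_A"),
    Triple("sensor_1", "measures", "temperature"),
    Triple("plc_1", "controls", "conveyor_A"),
]

# The same statements in graph format: node -> list of (relation, node).
edges = defaultdict(list)
for t in triples:
    edges[t.subject].append((t.predicate, t.obj))

nodes = {t.subject for t in triples} | {t.obj for t in triples}
relation_types = {t.predicate for t in triples}
```

Each edge type corresponds to one predicate, so a graph with several predicates is multi-relational.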
Inference on graph data is concerned with evaluating whether the unknown triple statements UT are valid or not given the structure of the knowledge graph KG.
Multi-relational graphs such as the industrial knowledge graph shown in
Although multi-relational graphs are highly expressive, their symbolic nature prevents the direct usage of classical statistical methods for further processing and evaluation. Lately, graph embedding algorithms have been introduced to solve this problem by mapping nodes and edges to a vector space while conserving certain graph properties. For example, one might want to conserve a node's proximity, such that connected nodes or nodes with vastly overlapping neighborhoods are mapped to vectors that are close to each other. These vector representations can then be used in traditional machine learning approaches to make predictions about unseen statements, realizing abstract reasoning over a set of subjects, predicates and objects.
Existing systems able to train AI methods on knowledge-graph data require the extraction of large quantities of raw data (e.g., sensor data) from the source producing them. The extracted data is then mapped to a set of pre-defined vocabularies (e.g., ontologies) in order to produce so-called triples, statements about semantic data in the form of subject-predicate-object, represented in a machine-readable format such as RDF. A collection of such triples constitutes a knowledge graph, to which a wide range of existing algorithms can be applied to perform data analytics.
Examples are methods that learn representations (so-called embeddings) for entities in the graph in order to perform an inference task, such as knowledge graph completion by inferring/predicting unobserved relationships (link prediction) or finding multiple instances of the same entity (entity resolution).
These methods are based on intensive stochastic optimization algorithms that, due to their computational complexity, are best suited for offline learning with previously acquired and stored data. Only after an algorithm (e.g., a neural network for link prediction) has been trained with the extracted data on a dedicated server is it possible to perform predictions on new data, either by further extracting data from the relevant devices producing them, or by deploying the learned algorithm to the devices so that it can be applied locally. In either case, the learning step is implemented outside of the devices.
Recently, spiking neural networks (SNNs) have started to bridge the gap to their widely used cousins, artificial neural networks (ANNs). One crucial ingredient for this success was the consolidation of the error backpropagation algorithm with SNNs. However, so far SNNs have mostly been applied to tasks akin to sensory processing like image or audio recognition. Such input data is inherently well-structured, e.g., the pixels in an image have fixed positions, and applicability is often limited to a narrow set of tasks that utilize this structure and do not scale well beyond the initial data domain.
Complex systems like industrial factory systems can be described using the common language of knowledge graphs, allowing the usage of graph embedding algorithms to make context-aware predictions in these information-packed environments.
An aspect relates to an industrial device and a method for building and/or processing a knowledge graph that provide an alternative to the state of the art.
The industrial device for building and/or processing a knowledge graph comprises
The method for building and/or processing a knowledge graph comprises the following operations performed by an industrial device:
The following advantages and explanations are not necessarily the result of the object of the independent claims. Rather, they may be advantages and explanations that only apply to certain embodiments or variants.
Training of AI methods on knowledge graph data is typically an intensive task and therefore not implemented directly at the Edge, i.e., on the devices that produce the data. By Edge we refer to computing resources which either directly belong to a system that generates the raw data (e.g., an industrial manufacturing system), or are located very close to it (physically and/or logically in a networked topology, e.g., in a shop-floor network), and typically have limited computational resources.
According to some embodiments, the industrial device and the method provide for training AI algorithms on knowledge graph data, wherein the algorithms can be embedded directly into the industrial device and are able to continuously learn based on observations without requiring external data processing servers.
It is advantageous to train these algorithms directly at the devices producing the data because no data extraction or additional computing infrastructure is required. The latency between data observation and availability of a trained algorithm that the existing methods incur (due to the need to extract, transform and process the data off-device) is eliminated.
One of the main advantages of knowledge graphs is that they are able to seamlessly integrate data from multiple sources or multiple domains. Because of this, embodiments of the industrial device and the method are particularly advantageous on industrial devices which typically act as concentrators of information, like PLC controllers (which by design gather all the information from automation systems, e.g., from all the sensors), industrial PCs implementing SCADA systems, network hubs and switches, including industrial ethernet switches, and industrial gateways connecting automation systems to cloud computing resources.
According to some embodiments, the industrial device and the method integrate learning and inference in a single system, which eliminates the need to extract data. The learning system is able to adapt dynamically to data events and is more responsive. According to some embodiments, operator input and feedback can control the learning process.
According to some embodiments, the industrial device and the method integrate knowledge from different domains and sources, like dynamic, real-time process data and static data from diverse engineering tools. As a result, the learned model is capable of making context-aware predictions regarding novel system events and can be used to detect anomalies resulting from, e.g., cybersecurity incidents.
According to an embodiment, the learning component and/or the control component are implemented with a processor, for example a microcontroller or a microprocessor, executing a RESCAL algorithm, a TransE algorithm, a DistMult algorithm, or a Graph convolutional neural network.
According to other embodiments, the learning component and/or the control component are implemented with neuromorphic hardware. The neuromorphic hardware embodiments empower edge learning devices for online graph learning and analytics. Being inspired by the mammalian brain, neuromorphic processors promise energy efficiency, fast emulation times as well as continuous learning capabilities. In contrast, graph-based data processing is commonly found in settings foreign to neuromorphic computing, where huge amounts of symbolic data from different data silos are combined, stored on servers and used to train models on the cloud. The aim of the neuromorphic hardware embodiments is to bridge these two worlds for scenarios where graph-structured data has to be analyzed dynamically, without huge data stores or off-loading to the cloud—an environment where neuromorphic devices have the potential to thrive.
Some embodiments of the industrial device and the method implement innovative learning rules that facilitate online learning and are suitable to be implemented in ultra-efficient hardware architectures, for example in low-power, highly scalable processing units, e.g., neural processing units, neural network accelerators or neuromorphic processors, for example spiking neural network systems.
Some embodiments of the industrial device and the method combine learning and inference in a seamless manner.
Some embodiments of the industrial device and the method introduce an energy-based model for tensor-based graph embedding that is compatible with features of biological neural networks like dendritic trees, spike-based sampling, feedback-modulated, Hebbian plasticity and memory gating, suitable for deployment on neuromorphic processors.
Some embodiments of the industrial device and the method provide graph embeddings for multi-relational graphs, where instead of working directly with the graph structure, it is encoded in the temporal domain of spikes: entities and relations are represented as spikes of neuron populations and spike time differences between populations, respectively. Through this mapping from graph to spike-based coding, SNNs can be trained on graph data and predict novel triple statements not seen during training, i.e., perform inference on the semantic space spanned by the training graph. An embodiment uses non-leaky integrate-and-fire neurons, guaranteeing that the model is compatible with current neuromorphic hardware architectures that often realize some variant of the LIF neuron model.
Some embodiments of the industrial device and the method are especially interesting for the applicability of neuromorphic hardware in industrial use-cases, where graph embedding algorithms find many applications, e.g., in the form of recommendation systems, digital twins, semantic feature selectors or anomaly detectors.
In an embodiment of the industrial device and method, the learning component and/or the control component implement a RESCAL algorithm, a TransE algorithm, a DistMult algorithm, or a Graph convolutional neural network.
In an embodiment of the industrial device and method, the industrial device is a field device, an edge device, a sensor device, an industrial controller, in particular a PLC controller, an industrial PC implementing a SCADA system, a network hub, a network switch, in particular an industrial ethernet switch, or an industrial gateway connecting an automation system to cloud computing resources.
In an embodiment of the industrial device and method, the control component is autonomous or processes external signals.
In an embodiment of the industrial device and method, the learning component is configured for calculating a likelihood of a triple statement during inference mode.
In an embodiment of the industrial device and method, the triple store also stores a pre-loaded static sub-graph.
In an embodiment, the industrial device includes a statement handler, configured for triggering an automated action based on the inference of the learning component.
In an embodiment of the industrial device and method, the knowledge graph is an industrial knowledge graph describing parts of an industrial system, with nodes of the knowledge graph representing physical objects, in particular sensors, industrial controllers, robots, drives, manufactured objects, tools and/or elements of a bill of materials, and with nodes of the knowledge graph representing abstract entities, in particular attributes, configurations or skills of the physical objects, production schedules and plans, and/or sensor measurements.
In an embodiment of the industrial device and method, the learning component and/or the control component are implemented as neuromorphic hardware, in particular as an application specific integrated circuit, a field-programmable gate array, a wafer-scale integration, a hardware with mixed-mode VLSI neurons, or a neuromorphic processor, in particular a neural processing unit or a mixed-signal neuromorphic processor.
In an embodiment of the industrial device and method, the learning component consists of an input layer containing node embedding populations of neurons, with each node embedding population representing an entity contained in the triple statements, and an output layer, containing output neurons configured for representing a likelihood for each possible triple statement. The learning component models a probabilistic, sampling-based model derived from an energy function, wherein the triple statements have minimal energy. The control component is configured for switching the learning component into a data-driven learning mode, configured for training the component with a maximum likelihood learning algorithm minimizing energy in the probabilistic, sampling-based model, using only the triple statements, which are assigned low energy values; into a sampling mode, in which the learning component supports generation of triple statements; and into a model-driven learning mode, configured for training the component with the maximum likelihood learning algorithm using only the generated triple statements, with the learning component learning to assign high energy values to the generated triple statements.
In an embodiment of the industrial device and method, the control component is configured to alternatingly present inputs to the learning component by selectively activating subject and object populations among the node embedding populations, set hyperparameters of the learning component, in particular a factor (η) that modulates learning updates of the learning component, read output of the learning component, and use output of the learning component as feedback to the learning component.
In an embodiment of the industrial device and method, the output layer has one output neuron for each possible relation type of the knowledge graph.
In an embodiment of the industrial device and method, the output neurons are stochastic dendritic output neurons, storing embeddings of relations that are given between a subject and an object in the triple statements in their dendrites, summing all dendritic branches into a final score, which is transformed into a probability using an activation function.
In an embodiment of the industrial device and method, depending on the mode of the learning component, an output of the activation function is a prediction of the likelihood of a triple statement or a transition probability.
In an embodiment of the industrial device and method, learning updates for relation embeddings are computed directly in dendritic trees of the stochastic, dendritic output neurons.
In an embodiment of the industrial device and method, learning updates for entity embeddings are computed using static feedback connections from each output neuron to neurons of the node embedding populations.
In an embodiment of the industrial device and method, in the sampling mode, by sampling from the activation function, a binary output signals to the control component whether a triple statement is accepted.
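The stochastic dendritic output neuron described in the preceding embodiments can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any embodiment: the bilinear form of each dendritic branch (one stored weight per pair of subject and object components) and the logistic activation function are assumptions, and the class and method names are hypothetical.

```python
import math
import random
from typing import List, Sequence


class DendriticOutputNeuron:
    """One output neuron for one relation type (hypothetical sketch).

    weights[i][j] plays the role of a relation embedding component
    stored on a single dendritic branch.
    """

    def __init__(self, weights: List[List[float]]):
        self.weights = weights

    def score(self, subject: Sequence[float], obj: Sequence[float]) -> float:
        # Each branch combines one subject component and one object
        # component with its stored weight; all branches are summed.
        return sum(
            s_i * w_ij * o_j
            for s_i, row in zip(subject, self.weights)
            for w_ij, o_j in zip(row, obj)
        )

    def probability(self, subject: Sequence[float], obj: Sequence[float]) -> float:
        # Assumed activation function: logistic sigmoid of the summed score.
        # Adequate for the moderate scores of this sketch.
        return 1.0 / (1.0 + math.exp(-self.score(subject, obj)))

    def sample(self, subject: Sequence[float], obj: Sequence[float],
               rng: random.Random) -> bool:
        # Sampling mode: binary output signalling whether the triple is accepted.
        return rng.random() < self.probability(subject, obj)


neuron = DendriticOutputNeuron([[1.0, 0.0], [0.0, 1.0]])
final_score = neuron.score([1.0, 2.0], [3.0, 4.0])
accepted = neuron.sample([1.0, 2.0], [3.0, 4.0], random.Random(0))
```

In inference mode the value returned by `probability` would be read as the likelihood of the triple statement, whereas in sampling mode the binary value returned by `sample` would be passed to the control component.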
In an embodiment of the industrial device and method, the learning component includes first neurons forming a first node embedding population, representing a first entity contained in the triple statements by first spike times of the first neurons during a recurring time interval. The learning component includes second neurons forming a second node embedding population, representing a second entity contained in the triple statements by second spike times of the second neurons during the recurring time interval. A relation between the first entity and the second entity is represented as the differences between the first spike times and the second spike times.
In an embodiment of the industrial device and method, the differences between the first spike times and the second spike times consider an order of the first spike times in relation to the second spike times. Alternatively, the differences are absolute values.
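The spike-time coding of relations can be illustrated with a short Python sketch. The function `spike_time_differences` computes either order-preserving (signed) or absolute differences between the spike times of two populations. The function `mismatch`, which compares these differences with a stored relation vector using an L1 distance, is an assumption added for illustration; all names and spike times are hypothetical.

```python
from typing import List


def spike_time_differences(first: List[float], second: List[float],
                           signed: bool = True) -> List[float]:
    """Per-neuron differences between the spike times of two populations."""
    if len(first) != len(second):
        raise ValueError("populations must have the same number of neurons")
    diffs = [a - b for a, b in zip(first, second)]
    return diffs if signed else [abs(d) for d in diffs]


def mismatch(first: List[float], second: List[float],
             relation: List[float], signed: bool = True) -> float:
    """Assumed L1 distance between observed differences and a relation vector.

    A value of 0 means that the spike-time differences reproduce the stored
    relation exactly.
    """
    diffs = spike_time_differences(first, second, signed)
    return sum(abs(d - r) for d, r in zip(diffs, relation))


# Hypothetical spike times (in ms) within one recurring time interval.
t_first = [1.0, 4.0, 2.5]
t_second = [3.0, 1.0, 2.5]

signed_diffs = spike_time_differences(t_first, t_second)
absolute_diffs = spike_time_differences(t_first, t_second, signed=False)
```

Signed differences change sign when the two populations are swapped, so they can distinguish the direction of a relation. Absolute differences are identical in both directions and therefore can only represent symmetric relations.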
In an embodiment of the industrial device and method, the relation is stored in one of the output neurons. The relation is in particular given by vector components that are stored in dendrites of the output neuron.
In an embodiment of the industrial device and method, the first neurons are connected to a monitoring neuron. Each first neuron is connected to a corresponding parrot neuron. The parrot neurons are connected to the output neurons. The parrot neurons are connected to an inhibiting neuron.
In an embodiment of the industrial device and method, the first neurons and the second neurons are spiking neurons, in particular non-leaky integrate-and-fire neurons or current-based leaky integrate-and-fire neurons.
In an embodiment of the industrial device and method, each of the first neurons and second neurons only spikes once during the recurring time interval. Alternatively, only a first spike during the recurring time interval is counted.
In an embodiment of the industrial device and method, each node embedding population is connected to an inhibiting neuron, and therefore selectable by inhibition of the inhibiting neuron.
Some of the embodiments will be described in detail, with reference to the following figures, wherein like designations denote like members, wherein:
In the following description, various aspects of embodiments of the present invention and embodiments thereof will be described. However, it will be understood by those skilled in the art that embodiments may be practiced with only some or all aspects thereof. For purposes of explanation, specific numbers and configurations are set forth in order to provide a thorough understanding. However, it will also be apparent to those skilled in the art that the embodiments may be practiced without these specific details.
In the following description, the terms “mode” and “phase” are used interchangeably. If a learning component runs in a first mode, then it also runs for the duration of a first phase, and vice versa. Also, the terms “triple” and “triple statement” will be used interchangeably.
Nickel, M., Tresp, V. & Kriegel, H.-P.: A three-way model for collective learning on multi-relational data, in ICML 11 (2011), pp. 809-816, disclose RESCAL, a widely used graph embedding algorithm. The entire contents of that document are incorporated herein by reference.
Yang, B., Yih, W.-t., He, X., Gao, J. and Deng, L.: Embedding entities and relations for learning and inference in knowledge bases, arXiv preprint arXiv:1412.6575 (2014), disclose DistMult, which is an alternative to RESCAL. The entire contents of that document are incorporated herein by reference.
Bordes, A. et al.: Translating embeddings for modeling multi-relational data, in Advances in neural information processing systems (2013), pp. 2787-2795, disclose TransE, which is a translation based embedding method. The entire contents of that document are incorporated herein by reference.
Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I. and Welling, M.: Modeling Relational Data with Graph Convolutional Networks, arXiv preprint arXiv:1703.06103 (2017), disclose Graph Convolutional Neural networks. The entire contents of that document are incorporated herein by reference.
Hopfield, J. J.: Neural networks and physical systems with emergent collective computational abilities, in Proceedings of the National Academy of Sciences 79, pp. 2554-2558 (1982), discloses energy-based models for computational neuroscience and artificial intelligence. The entire contents of that document are incorporated herein by reference.
Hinton, G. E., Sejnowski, T. J., et al.: Learning and relearning in Boltzmann machines, Parallel distributed processing: Explorations in the microstructure of cognition 1, 2 (1986), disclose Boltzmann machines, which combine sampling with energy-based models, using wake-sleep learning. The entire contents of that document are incorporated herein by reference.
Mostafa, H.: Supervised learning based on temporal coding in spiking neural networks, in IEEE transactions on neural networks and learning systems 29.7 (2017), pp. 3227-3235, discloses the nLIF model, which is particularly relevant for the sections “Weight gradients” and “Regularization of weights” below. The entire contents of that document are incorporated herein by reference.
Comsa, I. M., et al.: Temporal coding in spiking neural networks with alpha synaptic function, arXiv preprint arXiv:1907.13223 (2019), disclose an extension of the results of Mostafa (2017) for the current-based LIF model. The entire contents of that document are incorporated herein by reference.
Göltz, J., et al.: Fast and deep: Energy-efficient neuromorphic learning with first-spike times, arXiv preprint arXiv:1912.11443 (2020), also disclose an extension of the results of Mostafa (2017) for the current-based LIF model, allowing for broad applications in neuromorphics and more complex dynamics. The entire contents of that document are incorporated herein by reference.
The industrial device ED contains one or more sensors S or is connected to them. The industrial device can also be connected to one or more data sources DS or contain them. In other words, the data sources DS can also be local, for example containing or providing internal events in a PLC controller.
Examples of the industrial device are a field device, an edge device, a sensor device, an industrial controller, in particular a PLC controller, an industrial PC implementing a SCADA system, a network hub, a network switch, in particular an industrial ethernet switch, or an industrial gateway connecting an automation system to cloud computing resources.
The sensors S and data sources DS feed raw data RD into an ETL component ETLC of the industrial device ED. The task of the ETL component ETLC is to extract, transform and load (ETL) sensor data and other events observed at the industrial device ED and received as raw data RD into triple statements T according to a predefined vocabulary (a set of entities and relationships) externally deployed in the industrial device ED in the form of a set of mapping rules MR. The mapping rules MR can map local observations contained in the raw data RD such as sensor values, internal system states or external stimuli to the triple statements T, which are semantic triples in the form ‘s-p-o’ (entity s has relation p with entity o), for example RDF triples. Different alternatives for mapping the raw data RD to the triple statements T exist in the literature, e.g., R2RML for mapping between relational database data and RDF. In this case a similar format can be generated to map events contained in the raw data RD to the triple statements T. An alternative to R2RML is RML, an upcoming, more general standard that is not limited to relational databases or tabular data.
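The role of the mapping rules MR can be sketched in Python as follows. The sketch deliberately does not use R2RML or RML syntax; instead, each rule is modeled as a pair of plain functions, which is an assumption made for brevity. The event fields, the threshold of 80.0 and all entity and relation names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
RawEvent = Dict[str, object]


@dataclass(frozen=True)
class MappingRule:
    """Hypothetical rule: a condition on a raw event plus a triple template."""
    applies: Callable[[RawEvent], bool]
    to_triple: Callable[[RawEvent], Triple]


def etl(raw_data: Iterable[RawEvent], rules: List[MappingRule]) -> List[Triple]:
    """Convert raw events into triple statements using the given rules."""
    triples = []
    for event in raw_data:
        for rule in rules:
            if rule.applies(event):
                triples.append(rule.to_triple(event))
    return triples


rules = [
    # A sensor value above an assumed threshold becomes a state statement.
    MappingRule(
        applies=lambda e: e.get("type") == "temperature" and e["value"] > 80.0,
        to_triple=lambda e: (e["source"], "has_state", "overheated"),
    ),
    # An internal login event becomes a statement about a user.
    MappingRule(
        applies=lambda e: e.get("type") == "login",
        to_triple=lambda e: (e["user"], "logged_into", e["source"]),
    ),
]

raw_data = [
    {"type": "temperature", "source": "motor_1", "value": 91.5},
    {"type": "temperature", "source": "motor_2", "value": 40.0},
    {"type": "login", "source": "plc_1", "user": "operator_7"},
]
triples = etl(raw_data, rules)
```

Events that match no rule, such as the second temperature reading above, produce no triple statement.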
Examples for the triple statements T are
The latter information may be available from events that are logged in an internal memory of the industrial device ED and fed into the raw data RD. The ETL component ETLC applies the mapping rules MR, converting specific sets of local readings contained in the raw data RD into the triple statements T.
The triple statements T are stored in an embedded triple store ETS, creating a dynamically changing knowledge graph. The embedded triple store ETS is a local database in a permanent storage of the industrial device ED (e.g., an SD card or hard disk).
Besides the previously described triple statements T, which are created locally and dynamically by the ETL component ETLC, and which can be termed observed triple statements, the embedded triple store ETS can contain a pre-loaded set of triple statements which constitute a static sub-graph SSG, i.e., a part of the knowledge graph which is static in nature and does not depend on the local observations contained in the raw data RD. The static sub-graph SSG can provide, for example, a self-description of the system (e.g., which sensors are available, which user roles or applications can interact with it, etc.). The triple statements of the static sub-graph SSG are also stored in the embedded triple store ETS. They can be linked to the observed data and provide additional context.
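A local triple store holding both the static sub-graph and the observed triple statements can be sketched with Python's built-in `sqlite3` module. The use of SQLite, the table layout and the class name `EmbeddedTripleStore` are assumptions made for illustration; an actual device could use any embedded database or RDF store. The in-memory path used in the example stands in for a file on permanent storage.

```python
import sqlite3
from typing import List, Tuple

Triple = Tuple[str, str, str]


class EmbeddedTripleStore:
    """Hypothetical local triple store backed by SQLite."""

    def __init__(self, path: str = ":memory:"):
        # On a device, 'path' would point to a file on an SD card or disk.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS triples ("
            "s TEXT, p TEXT, o TEXT, static INTEGER, UNIQUE (s, p, o))"
        )

    def add(self, triple: Triple, static: bool = False) -> None:
        # Duplicate statements are ignored, so the store behaves like a set.
        self.db.execute(
            "INSERT OR IGNORE INTO triples VALUES (?, ?, ?, ?)",
            (*triple, int(static)),
        )
        self.db.commit()

    def all_triples(self) -> List[Triple]:
        return self.db.execute(
            "SELECT s, p, o FROM triples ORDER BY rowid").fetchall()

    def observed(self) -> List[Triple]:
        return self.db.execute(
            "SELECT s, p, o FROM triples WHERE static = 0 ORDER BY rowid"
        ).fetchall()


store = EmbeddedTripleStore()
# Pre-loaded static sub-graph: a self-description of the device.
store.add(("device_1", "has_sensor", "sensor_1"), static=True)
# Observed triple statement produced at run time.
store.add(("sensor_1", "reports", "high_temperature"))
store.add(("sensor_1", "reports", "high_temperature"))  # duplicate, ignored
```

Because both kinds of statements share the node `sensor_1`, the observed statement is linked to the static self-description and inherits its context.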
All triple statements stored in the embedded triple store ETS are provided to a learning component LC, the central element of the architecture. The learning component LC implements a machine learning algorithm such as the ones described below. The learning component LC can perform both learning as well as inference (predictions). It is controlled by a control component CC that can switch between different modes of operation of the learning component LC, either autonomously (e.g., periodically) or based on external stimuli (e.g., a specific system state, or an operator provided input).
One of the selected modes of operation of the learning component LC is a learning mode, where the triple statements T are provided to the learning component LC, which in response iteratively updates its internal state with learning updates LU according to a specific cost function as described below. A further mode of operation is inference mode, where the learning component LC makes predictions about the likelihood of unobserved triple statements. Inference mode can either be a free-running mode, whereby random triple statements are generated by the learning component LC based on the accumulated knowledge, or a targeted inference mode, where the control component CC specifically sets the learning component LC in such a way that the likelihood of specific triple statements is evaluated.
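The switching between modes of operation by the control component CC can be sketched as a small scheduler. The periodic schedule, the class and method names and the step counts are assumptions chosen for illustration; the description above only requires that switching can happen autonomously or in response to external stimuli.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    LEARNING = "learning"
    FREE_RUNNING_INFERENCE = "free-running inference"
    TARGETED_INFERENCE = "targeted inference"


class ControlComponent:
    """Hypothetical controller that selects the mode of the learning component."""

    def __init__(self, learning_steps: int, inference_steps: int):
        self.learning_steps = learning_steps
        self.inference_steps = inference_steps
        self.step = 0

    def next_mode(self, external_request: Optional[Mode] = None) -> Mode:
        # External stimuli (e.g., operator input) override the schedule
        # without advancing it.
        if external_request is not None:
            return external_request
        # Autonomous operation: a fixed periodic schedule.
        position = self.step % (self.learning_steps + self.inference_steps)
        self.step += 1
        if position < self.learning_steps:
            return Mode.LEARNING
        return Mode.FREE_RUNNING_INFERENCE


cc = ControlComponent(learning_steps=2, inference_steps=1)
schedule = [cc.next_mode() for _ in range(4)]
override = cc.next_mode(external_request=Mode.TARGETED_INFERENCE)
```

With these settings the autonomous schedule yields two learning steps followed by one free-running inference step, after which the cycle repeats.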
Finally, the industrial device ED can be programmed to take specific actions whenever the learning component LC predicts specific events with an inference IF. Programming of such actions is made via a set of handling rules HR that map specific triple statements to software routines to be executed. The handling rules HR are executed by a statement handler SH that receives the inference IF of the learning component LC.
For instance, in a link prediction setting, the inference IF could be a prediction of a certain triple statement, e.g., “system enters_state error”, by the learning component LC. This inference IF can trigger a routine that alerts a human operator or that initiates a controlled shutdown of the industrial device ED or a connected system. Other types of triggers, different from a link prediction, are also possible. For instance, in an anomaly detection setting, a handler could be associated with the actual observation of a specific triple statement whenever its likelihood as predicted by the learning component LC (inference IF) is low, indicating that an unexpected event has occurred.
In a simple case, the handling rules HR can be hardcoded in the industrial device ED (e.g., a fire alarm that tries to predict the likelihood of a fire), but in a more general case can be programmed in a more complex device (e.g., a PLC controller as industrial device ED) from an external source, linking the predictions of the learning component LC to programmable software routines such as PLC function blocks.
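The interplay of handling rules HR and statement handler SH can be sketched as follows. The dictionary-based rule table, the likelihood thresholds of 0.9 and 0.1 and the routine shown are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Routine = Callable[[Triple, float], None]


class StatementHandler:
    """Hypothetical handler mapping predicted triples to software routines."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold          # assumed minimum likelihood
        self.rules: Dict[Triple, Routine] = {}

    def register(self, triple: Triple, routine: Routine) -> None:
        self.rules[triple] = routine

    def handle(self, triple: Triple, likelihood: float) -> bool:
        """Run the routine for a predicted triple; return True if triggered."""
        routine = self.rules.get(triple)
        if routine is None or likelihood < self.threshold:
            return False
        routine(triple, likelihood)
        return True


def is_anomalous(observed_likelihood: float, threshold: float = 0.1) -> bool:
    """Anomaly setting: an observed triple that the model considers unlikely."""
    return observed_likelihood < threshold


alerts: List[str] = []
handler = StatementHandler()
handler.register(
    ("system", "enters_state", "error"),
    lambda t, p: alerts.append(f"operator alert: {t} predicted with p={p:.2f}"),
)

triggered = handler.handle(("system", "enters_state", "error"), 0.95)
ignored = handler.handle(("system", "enters_state", "error"), 0.50)
```

In a PLC setting, the registered routine could instead call a function block; the list `alerts` merely stands in for such a side effect.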
Various learning algorithms and optimization functions are described in the following, which are suitable for implementing the learning component LC and/or control component CC. Some of these algorithms combine learning and inference in a seamless manner and are suitable for implementation in low-power, highly scalable processing units, e.g., neural network accelerators or neuromorphic processors such as spiking neural network systems.
The learning component LC (and the control component CC if it guides the learning process) can be implemented with any algorithm that can be trained on the basis of knowledge graphs. The embedded triple store ETS contains potentially multiple graphs derived from system observation (triple statements T generated by the ETL component ETLC, plus the pre-loaded set of triple statements which constitute the static sub-graph SSG). Separation into multiple graphs can be done on the basis of time (e.g., separating observations corresponding to specific time periods), or any other similar criteria. For example, in an industrial manufacturing system, the triple statements T can be separated into independent graphs depending on the type of action being carried out by the industrial manufacturing system, or the type of good being manufactured, when the triple statements T are observed.
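Time-based separation into multiple graphs can be sketched as follows, assuming that each observed triple statement carries a timestamp in seconds. The function name, the period length and the example statements are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def split_by_period(timestamped: Iterable[Tuple[float, Triple]],
                    period_seconds: float) -> Dict[int, List[Triple]]:
    """Group triples into one graph per time period of fixed length."""
    graphs: Dict[int, List[Triple]] = defaultdict(list)
    for timestamp, triple in timestamped:
        graphs[int(timestamp // period_seconds)].append(triple)
    return dict(graphs)


observations = [
    (0.0, ("robot_1", "performs", "welding")),
    (30.0, ("robot_1", "uses", "tool_3")),
    (65.0, ("robot_1", "performs", "painting")),
]
graphs = split_by_period(observations, period_seconds=60.0)
```

The same pattern applies to other criteria by replacing the period index with, for instance, an identifier of the action being carried out or of the good being manufactured.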
The learning component LC (and the control component CC if it guides the learning process) can be implemented using either transductive algorithms, which are able to learn representations for a fixed graph, for example RESCAL, TransE, or DistMult, or inductive algorithms, which can learn filters that generalize across different graphs, for example Graph Convolutional Neural networks (Graph CNN). In the case of the former an individual model is trained for each graph (feeding triple statements T corresponding to each single graph to independent model instances) whereas in the case of the latter, a single model is trained based on all the graphs.
In either case, we can differentiate between a learning mode, where the triple statements T are presented to the learning component LC which learns a set of internal operations, parameters and coefficients required to solve a specific training objective, and an inference mode, where the learning component LC evaluates the likelihood of newly observed or hypothetical triple statements on the basis of the learned parameters. The training objective defines a task that the learning algorithm implemented in the learning component LC tries to solve, adjusting the model parameters in the process. If the industrial device ED is an embedded device, then it is advantageous to perform this step in a semi-supervised or unsupervised manner, i.e., without explicitly providing ground truth labels (i.e., the solution to the problem). In the case of a graph algorithm, this can be accomplished for instance by using a link prediction task as the training objective. In this setting, the learning process is iteratively presented with batches containing samples from the observed triples, together with internally generated negative examples (non-observed semantic triples). The objective is to minimize a loss function based on the selected examples, which assigns a lower loss when positive and negative examples are assigned high and low likelihood, respectively, by the algorithm. The model parameters are iteratively adjusted accordingly.
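The link prediction objective with internally generated negative examples can be sketched in plain Python. This is a minimal sketch under stated assumptions: a DistMult-style score, a logistic loss, object corruption as the negative sampling scheme and plain stochastic gradient descent are chosen here for brevity, and the hyperparameter values and entity names are hypothetical.

```python
import math
import random
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def corrupt(triple: Triple, entities: List[str], observed: Set[Triple],
            rng: random.Random) -> Triple:
    """Negative example: same subject and predicate, random non-observed object.

    Assumes that at least one non-observed object exists.
    """
    s, p, _ = triple
    candidates = [e for e in entities if (s, p, e) not in observed]
    return (s, p, rng.choice(candidates))


def train(observed: List[Triple], dim: int = 4, epochs: int = 200,
          lr: float = 0.1, seed: int = 0) -> Callable[[Triple], float]:
    rng = random.Random(seed)
    entities = sorted({x for s, _, o in observed for x in (s, o)})
    relations = sorted({p for _, p, _ in observed})
    ent = {e: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for e in entities}
    rel = {r: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for r in relations}
    observed_set = set(observed)

    def score(t: Triple) -> float:
        s, p, o = t
        return sum(a * b * c for a, b, c in zip(ent[s], rel[p], ent[o]))

    for _ in range(epochs):
        for positive in observed:
            negative = corrupt(positive, entities, observed_set, rng)
            for (s, p, o), label in ((positive, 1.0), (negative, 0.0)):
                # Gradient of the logistic loss with respect to the score.
                err = sigmoid(score((s, p, o))) - label
                for k in range(dim):
                    es, rp, eo = ent[s][k], rel[p][k], ent[o][k]
                    ent[s][k] -= lr * err * rp * eo
                    rel[p][k] -= lr * err * es * eo
                    ent[o][k] -= lr * err * es * rp

    # Inference mode: likelihood of an arbitrary triple over known names.
    return lambda t: sigmoid(score(t))


observed = [
    ("sensor_1", "attached_to", "conveyor_A"),
    ("sensor_2", "attached_to", "conveyor_A"),
    ("plc_1", "controls", "conveyor_A"),
]
likelihood = train(observed)
p_observed = likelihood(("sensor_1", "attached_to", "conveyor_A"))
```

No ground truth labels are supplied from outside: the label 1.0 is derived from the fact that a triple was observed, and the label 0.0 from the fact that a corrupted triple was not.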
The algorithm selected determines the specific internal operations and parameters as well as the specific loss/scoring function that guides the learning process, which can be implemented in a conventional CPU or DSP processing unit of the industrial device ED, or alternatively on specialized machine learning co-processors. For example, in the case of a RESCAL implementation a graph is initially converted to its adjacency form with which the RESCAL gradient descent optimization process is performed. The mathematical foundations of this approach will be explained in more detail in later embodiments. An alternative is provided by the scoring function of DistMult, which reduces the number of parameters by imposing additional constraints in the learned representations. A further alternative would be to use a translation-based embedding method, such as TransE, which uses the distance between the object embedding and the subject embedding translated by a vectorial representation of the predicate connecting them.
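The three scoring functions mentioned above can be compared in a short Python sketch. The functions follow the commonly published forms of RESCAL (bilinear form with a full relation matrix), DistMult (bilinear form restricted to a diagonal relation matrix) and TransE (negative distance after translation); the use of the Euclidean norm for TransE and the example vectors are assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rescal_score(s: Vector, R: Matrix, o: Vector) -> float:
    """Bilinear score s^T R o with a full relation matrix R."""
    return sum(s_i * sum(r_ij * o_j for r_ij, o_j in zip(row, o))
               for s_i, row in zip(s, R))


def distmult_score(s: Vector, r: Vector, o: Vector) -> float:
    """Bilinear score with a diagonal relation matrix, stored as vector r."""
    return sum(s_k * r_k * o_k for s_k, r_k, o_k in zip(s, r, o))


def transe_score(s: Vector, r: Vector, o: Vector) -> float:
    """Negative Euclidean distance between the translated subject and the object."""
    return -math.sqrt(sum((s_k + r_k - o_k) ** 2
                          for s_k, r_k, o_k in zip(s, r, o)))


s, o = [1.0, 2.0], [3.0, 4.0]
r_diag = [0.5, 2.0]
R_full = [[0.5, 0.0], [0.0, 2.0]]  # same relation written as a full matrix

rescal = rescal_score(s, R_full, o)
distmult = distmult_score(s, r_diag, o)
transe = transe_score(s, [2.0, 2.0], [3.0, 7.0])
```

For an embedding dimension d, the relation in RESCAL has d × d parameters whereas the DistMult relation has only d, which is the parameter reduction referred to above. A consequence of the diagonal constraint is that the DistMult score is unchanged when subject and object are swapped.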
The previous examples can be considered as decoder based embedding methods. In the case of a Graph CNN based implementation, the algorithm to be trained consists of an encoder and a decoder. The encoder comprises multiple convolutional and dense filters which are applied to the observed graph provided in a tensor formulation, given by an adjacency matrix indicating existing edges between nodes, and a set of node features which typically correspond to literal values assigned to the corresponding node in the RDF representation in the embedded triple store ETS, to which a transformation can be optionally applied in advance (e.g. a clustering step if the literal is of numeric type, or a simple encoding into integer values if the literal is of categorical type). On the other hand, the decoder can be implemented by a DistMult or similar decoder network that performs link scoring from pairs of entity embeddings.
It should be noted that most of the score functions required by knowledge graph learning algorithms contain, in addition to tunable parameters which are optimized during learning, a set of hyperparameters that control the learning process of the learning component LC itself, such as learning rates, batch sizes, iteration counts, aggregation schemes and other model hyperparameters present in the loss function. In the context of the present embodiment, these can be preconfigured within the control component CC and/or the learning component LC in the industrial device ED with known working values determined by offline experimentation. Alternatively, a complete or partial hyperparameter search and tuning could be performed directly on the industrial device ED, at the cost of potentially having to perform an increased number of learning steps in order to locally evaluate the performance of the algorithms for different sets of hyperparameters on the basis of an additional set of triple statements reserved for this purpose.
To set up the industrial device ED, the mapping rules MR need to be defined and stored on the industrial device ED. The learning process can be controlled with external operator input into the control component CC and feedback, or be autonomous as described above.
The learning component LC is composed of two parts: first, a pool of node embedding populations NEP of neurons N that represent embeddings of graph entities (i.e., the subjects and objects in the triple statements), and second, a population of stochastic, dendritic output neurons SDON that perform the calculations (scoring of triple statements, proposing of new triple statements). Similar to
Each entity in the graph is represented by one of the node embedding populations NEP, storing both its embeddings (real-valued entries) and accumulated gradient updates. The neurons N of each node embedding population NEP project statically one-to-one to dendritic compartments of the stochastic, dendritic output neurons SDON, where inputs are multiplied together with a third factor R, as shown in
In the example shown in
Returning to
In general, the learning component LC can be operated in three modes or phases controlled by a single parameter η∈{1, 0, −1}: a data-driven learning mode (η=1) as shown in
An additional input ζ is used to explicitly control plasticity, i.e., how to clamp the stochastic, dendritic output neurons SDON, apply updates or clear (reset to 0) accumulated updates. Learning updates LU (as shown in
In other words,
In the data-driven learning mode shown in
In the sampling mode shown in
In the case of many entities, a sparse connectivity can be used between the node embedding populations NEP and the stochastic, dendritic output neurons SDON to reduce the amount of required wiring. To realize the RESCAL score function, each node embedding population NEP has to be doubled (once for subjects and once for objects, as the scoring function is not symmetric). This way, each graph entity has two embeddings (for subject and object, respectively), which can be synchronized again by including “subj_embedding isIdenticalTo obj_embedding” triple statements in the training data.
The learning component LC combines global parameters, feedback and local operations to realize distributed computing rendered controllable by a control component CC to allow seamless transition between inference and learning in the same system.
A widely used graph embedding algorithm is RESCAL. In RESCAL, a graph is represented as a tensor Xs,p,o, where entries are 1 if a triple ‘s-p-o’ (entity s has relation p with entity o) occurs in the graph and 0 otherwise. This allows us to rephrase the goal of finding embeddings as a tensor factorization problem
with each graph entity s being represented by a vector es and each relation p by a matrix Rp. The problem of finding embeddings is then equivalent to minimizing the reconstruction loss
which can be done using either alternating least-squares optimization or gradient-descent-based optimization. Usually, we are only aware of valid triples; the validity of all other triples is unknown to us and cannot be modeled by setting the respective tensor entries to 0. However, training only on positive triples would result in trivial solutions that score all possible triples high. To avoid this, so-called ‘negative samples’ are generated from the training data by randomly exchanging either the subject or the object entity in a data triple, e.g., ‘s-p-o’∈D→‘a-p-o’ or ‘s-p-o’∈D→‘s-p-b’. During training, these negative samples are then presented as invalid triples with tensor entry 0. Negative samples are not kept, but are newly generated for each parameter update.
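The corruption step described above can be sketched as follows. The function name `corrupt`, the entity labels and the 50/50 choice between subject and object are assumptions made for this illustration. The sketch does not check whether a corrupted triple happens to be valid.

```python
import random

def corrupt(triple, entities, rng):
    """Return a negative sample by replacing either the subject or the object."""
    s, p, o = triple
    if rng.random() < 0.5:
        # 's-p-o' -> 'a-p-o': exchange the subject
        candidates = [e for e in entities if e != s]
        return (rng.choice(candidates), p, o)
    # 's-p-o' -> 's-p-b': exchange the object
    candidates = [e for e in entities if e != o]
    return (s, p, rng.choice(candidates))

# Hypothetical entities and data triple, for illustration only.
entities = ["a", "b", "c", "d"]
data_triple = ("a", "p", "b")
rng = random.Random(0)

# Negative samples are drawn anew for every parameter update rather than stored.
negatives = [corrupt(data_triple, entities, rng) for _ in range(5)]
print(negatives)
```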
We propose a probabilistic model of graph embeddings based on an energy function that takes inspiration from the RESCAL scoring function. Energy-based models have a long history in computational neuroscience and artificial intelligence, and we use this as a vehicle to explore possible dynamic systems that are capable of implementing computations on multi-relational graph data.
Given a tensor X that represents a graph (or subgraph), we assign it the energy
where θs,p,o is the RESCAL score function (Eq. (4)). From this, we define the probability of observing X
where we sum over all possible graph realizations X′. Here, the Xs,p,o∈{0,1} are binary random variables indicating whether a triple exists, with the probability depending on the score of the triple. For instance, a triple (s, p, o) with positive score θs,p,o is assigned a negative energy and hence a higher probability that Xs,p,o=1. This elevates RESCAL to a probabilistic model by assuming that the observed graph is merely a sample from an underlying probability distribution, i.e., it is a collection of random variables. Since triples are treated independently here, the probability can be rewritten as
where σ(·) is the logistic function. Thus, the probability of a single triple (s,p,o) appearing is given by σ(θs,p,o).
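The factorization into independent triples can be checked numerically on a very small graph. The sketch below assumes an energy of the form E(X) = −Σ θs,p,o·Xs,p,o, which is consistent with the statement that a positive score yields a negative energy; this form, the function names and the random embeddings are assumptions made for this example.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(X, theta):
    """Assumed energy: E(X) = -sum over (s, p, o) of theta[s, p, o] * X[s, p, o]."""
    return -float(np.sum(theta * X))

def prob_bruteforce(X, theta):
    """p(X) = exp(-E(X)) / Z, with Z summed over all binary graph realizations."""
    Z = 0.0
    for bits in itertools.product([0, 1], repeat=theta.size):
        Z += np.exp(-energy(np.array(bits).reshape(theta.shape), theta))
    return float(np.exp(-energy(X, theta)) / Z)

def prob_factorized(X, theta):
    """Product over triples of sigma(theta) if X = 1 and 1 - sigma(theta) if X = 0."""
    p = sigmoid(theta)
    return float(np.prod(np.where(X == 1, p, 1.0 - p)))

# Tiny hypothetical graph: 2 entities, 1 relation, 3-dimensional embeddings.
rng = np.random.default_rng(1)
E = rng.normal(size=(2, 3))
R = rng.normal(size=(1, 3, 3))
theta = np.einsum("si,pij,oj->spo", E, R, E)  # theta[s, p, o] = e_s^T R_p e_o
X = rng.integers(0, 2, size=theta.shape)

print(prob_bruteforce(X, theta), prob_factorized(X, theta))
```

Both values agree up to floating-point error, so the probability of a single triple being present is σ(θs,p,o) under the assumed energy.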
The model is trained using maximum likelihood learning, i.e., node and edge embeddings are adjusted such that the likelihood (or log-likelihood) of observed triples is maximized
where D is a list of subgraphs (data graphs) available for learning. These update rules can be rewritten as
Relations learn to match the inner product of subject and object embeddings they occur with, while node embeddings learn to match the latent representation of their counterpart, e.g., es learns to match the latent representation of the object Rpeo if the triple ‘s-p-o’ is in the data. Both learning rules consist of two phases, a data-driven phase and a model-driven phase—similar to the wake-sleep algorithm used to train, e.g., Boltzmann machines. In contrast to the data-driven phase, during the model-driven phase, the likelihood of model-generated triples S is reduced. Thus, different from graph embedding algorithms like RESCAL, no negative samples are required to train the model.
To generate triples from the model, we use Markov Chain Monte Carlo (MCMC) sampling—more precisely, the Metropolis-Hastings algorithm—with negative sampling as the proposal distribution. For instance, if the triple (s, p, o) is in the data set, we propose a new sample by randomly replacing either subject, predicate or object, and accepting the change with probability
T({s,p,o}→{s,p,q})=min[1,exp (esTRp(eq−eo))] (13)
The transition probability directly depends on the distance between the embeddings, i.e., if the embeddings of nodes (or relations) are close to each other, a transition is more likely. This process can be repeated on the new sample to generate a chain of samples, exploring the neighborhood of the data triple under the model distribution. It can further be used to approximate conditional or marginal probabilities, e.g., by keeping the subject fixed and sampling over predicates and objects.
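A chain of samples of this kind can be sketched in a few lines. The sketch below only replaces the object of a triple, which corresponds to the case written out in Eq. (13), and it caps the acceptance probability at 1 with min[1, ·], as in the standard Metropolis-Hastings rule. The function names, the uniform proposal over entities and the embedding sizes are assumptions made for this illustration.

```python
import numpy as np

def accept_prob(e_s, R_p, e_o, e_q):
    """Acceptance probability for {s,p,o} -> {s,p,q}: min[1, exp(e_s^T R_p (e_q - e_o))]."""
    return min(1.0, float(np.exp(e_s @ R_p @ (e_q - e_o))))

def sample_chain(start, E, R, steps, rng):
    """Generate a chain of triples by repeatedly proposing a new object entity."""
    s, p, o = start
    chain = [start]
    for _ in range(steps):
        q = int(rng.integers(E.shape[0]))  # uniform proposal (assumption)
        if rng.random() < accept_prob(E[s], R[p], E[o], E[q]):
            o = q  # accept the proposed object
        chain.append((s, p, o))
    return chain

# Hypothetical embeddings: 5 entities, 2 relations, dimension 3.
rng = np.random.default_rng(2)
E = 0.5 * rng.normal(size=(5, 3))
R = 0.5 * rng.normal(size=(2, 3, 3))

chain = sample_chain((0, 1, 2), E, R, steps=20, rng=rng)
print(chain[:5])
```

Keeping the subject and predicate fixed, as done here, matches the use case of approximating conditional probabilities over objects.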
The described learning rules and sampling dynamics suggest a neural network structure with specific connectivity and neuron types as shown in
with η∈{−1, 0, 1}. Through η, the stochastic, dendritic output neurons SDON can return both the probability σ(·) that a triple statement is true (η=0) and the transition probabilities T(·) required for sampling (η=−1 or 1).
η is further used to gate between three different phases or modes for learning: the data-driven learning mode shown in
ΔRp∝η·s(θs,p,o)esTeo (15.1)
Δes∝η·s(θs,p,o)Rpeo (15.2)
Δeo∝η·s(θs,p,o)esTRp (15.3)
where updates are only applied when the stochastic, dendritic output neuron SDON ‘spiked’, i.e., sampling σ(θs,p,o) returns s (θs,p,o)=1.
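The gated updates of Eqs. (15.1) to (15.3) can be sketched as follows. Several points are assumptions made for this illustration: the relation update is written as the outer product of the subject and object embeddings so that its shape matches the matrix Rp, the object update is written as the column vector Rp^T es, the proportionality constant (learning rate) is omitted, and the function name is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_updates(e_s, R_p, e_o, eta, rng):
    """Return the spike indicator and the updates for R_p, e_s and e_o."""
    theta = e_s @ R_p @ e_o
    # The output neuron 'spikes' with probability sigma(theta).
    spike = 1.0 if rng.random() < sigmoid(theta) else 0.0
    dR = eta * spike * np.outer(e_s, e_o)   # cf. Eq. (15.1), outer-product reading
    de_s = eta * spike * (R_p @ e_o)        # cf. Eq. (15.2)
    de_o = eta * spike * (R_p.T @ e_s)      # cf. Eq. (15.3), as a column vector
    return spike, dR, de_s, de_o

# Hypothetical 3-dimensional embeddings.
rng = np.random.default_rng(3)
e_s, e_o = rng.normal(size=(2, 3))
R_p = rng.normal(size=(3, 3))

spike, dR, de_s, de_o = local_updates(e_s, R_p, e_o, eta=1, rng=rng)
print(spike, dR.shape, de_s.shape, de_o.shape)
```

With η=0 all updates vanish, and the sign of η decides whether the likelihood of the presented triple is increased or decreased.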
In this architecture, the learning rule Eq. (11) takes the form of a contrastive Hebbian learning rule and Eq. (12) of a contrastive predictive learning rule. To update the embeddings of the node embedding populations NEP, feedback signals have to be sent from the stochastic, dendritic output neurons SDON to the neurons N—which can be done through a pre-wired feedback structure due to the simple and static forward connectivity, as shown in
Input is presented to the network by selecting the corresponding node embedding populations NEP and stochastic, dendritic output neurons SDON, which can be achieved through inhibitory gating, resembling a ‘memory recall’ of learned concepts. Alternatively, the learned embeddings of concepts could also be interpreted as attractor states of a memory network. During the sampling phase, feedback from the stochastic, dendritic output neurons SDON (Eq. (13)) is used to decide whether the network switches to another memory (or attractor state).
Both the forward inference path and the learning path only require spike times and utilize a biologically inspired neuron model found in the current generation of neuromorphic, spike-based processors, as will be described in more detail in later embodiments. Furthermore, similarly to the previous embodiments, static feedback connections between the node embedding populations NEP and the output neurons ON are utilized to transmit parameter updates. Different from the previous embodiments, no probabilistic sampling is performed by the system.
To select node embedding populations NEP, for example the two active node embedding populations NEP shown in
For the following embodiments, the numbering of the equations starts anew.
In the following, we explain our spike-based graph embedding model (SpikE) and derive the required learning rule.
Spike-based graph embeddings
From graphs to spikes:
Our model takes inspiration from TransE, a shallow graph embedding algorithm where node embeddings are represented as vectors and relations as vector translations (see Section “Translating Embeddings” for more details). In principle, we found that these vector representations can be mapped to spike times and translations into spike time differences, offering a natural transition from the graph domain to SNNs.
We propose that the embedding of a node s is given by single spike times of a first node embedding population NEP1 of size N, ts∈[t0, tmax]^N, as shown in
In other words,
This coding scheme maps the rich semantic space of graphs into the spike domain, where the spike patterns of two populations encode how the represented entities relate to each other, but not only for one single relation p, but the whole set of relations spanning the semantic space. To achieve this, learned relations encompass a range of patterns from mere coincidence detection to complex spike time patterns. In fact, coding of relations as spike coincidence detection does naturally appear as a special case in our model when training SNNs on real data, see for instance
Formally, the ranking of triples can be written as
ϑs,p,o=Σ||d(ts, to)−rp|| (1)
where d is the distance between spike times and the sum is over vector components. In the remainder of this document, we call ϑs,p,o the score of triple (s, p, o), where valid triples have a score close to 0 and invalid ones a score >>0. We define the distance function for SpikE to be
d_A(ts,to)=ts−to (2)
where both the order and distance of spike times are used to encode relations. The distance function can be modified to only incorporate spike time differences,
d_S(ts,to)=||ts−to|| (3)
such that there is no difference between subject and object populations. We call this version of the model SpikE-S.
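The score of Eq. (1) and the two distance functions of Eqs. (2) and (3) can be sketched as follows. In this illustration the norm is taken per vector component as an absolute value before summing, which is an assumption about the notation; the function names and the spike times are hypothetical.

```python
import numpy as np

def d_asym(t_s, t_o):
    """Eq. (2): signed spike time difference, as used by SpikE."""
    return t_s - t_o

def d_sym(t_s, t_o):
    """Eq. (3): absolute spike time difference, as used by SpikE-S."""
    return np.abs(t_s - t_o)

def spike_score(t_s, t_o, r_p, d=d_asym):
    """Eq. (1): sum over components of |d(t_s, t_o) - r_p|; close to 0 for valid triples."""
    return float(np.sum(np.abs(d(t_s, t_o) - r_p)))

# Hypothetical spike times of two populations with N = 4 neurons each.
t_s = np.array([1.0, 2.5, 0.3, 4.0])
t_o = np.array([1.5, 2.0, 0.3, 3.0])
r_match = t_s - t_o  # a relation that exactly matches the spike time pattern

print(spike_score(t_s, t_o, r_match))          # 0.0, a perfectly valid triple
print(spike_score(t_o, t_s, r_match))          # > 0, since order matters for SpikE
print(spike_score(t_s, t_o, r_match, d=d_sym))
```

Swapping subject and object changes the SpikE score but leaves the SpikE-S score unchanged, which reflects the symmetry noted above.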
A suitable neuron model that satisfies the requirements of the presented coding scheme, i.e., single-spike coding and being analytically treatable, is the nLIF neuron model. For similar reasons, it has recently been used in hierarchical networks utilizing spike-latency codes. For the neuron populations encoding entities (the node embedding populations), we use the nLIF model with an exponential synaptic kernel
where us,i is the membrane potential of the ith neuron of population s, τs the synaptic time constant and θ(·) the Heaviside function. A spike is emitted when the membrane potential crosses a threshold value uth. Ws,i,j are synaptic weights from a pre-synaptic neuron population, with every neuron j emitting a single spike at fixed time tj (
Eq. (4) can be solved analytically
which is later used to derive a learning rule for the embedding populations. For relations, we use output neurons ON. Each output neuron ON consists of a ‘dendritic tree’, where branch k evaluates the kth component of the spike pattern difference, i.e., ||d(ts, to)−rp||k, and the tree structure subsequently sums over all contributions, giving ϑs,p,o (
Different from ordinary feedforward or recurrent SNNs, the input is not given by a signal that first has to be translated into spike times and is then fed into the first layer (or specific input neurons) of the network. Instead, inputs to the network are observed triples ‘s-p-o’, i.e., statements that have been observed to be true. Since all possible entities are represented as neuron populations, the input simply gates which populations become active (
Learning Rules
To learn spike-based embeddings for entities and relations, we use a soft margin loss
where ηs,p,o∈{1, −1} is a modulating teaching signal that establishes whether an observed triple ‘s-p-o’ is regarded as valid (ηs,p,o=1) or invalid (ηs,p,o=−1). This is required to avoid collapse to zero-embeddings that simply score all possible triples with 0. In the graph embedding literature, invalid examples are generated by corrupting valid triples, i.e., given a training triple ‘s-p-o’, either s or o is randomly replaced, a procedure called ‘negative sampling’.
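The role of the teaching signal can be illustrated with one common form of a soft margin loss, log(1+exp(η·ϑ)) averaged over the batch. This specific form is an assumption made for this sketch and is not necessarily the exact expression used in the embodiment; the function name is hypothetical.

```python
import numpy as np

def soft_margin_loss(scores, eta):
    """Assumed soft margin loss: mean of log(1 + exp(eta * score)).

    scores: non-negative scores of the triples (close to 0 means valid).
    eta:    +1 for triples regarded as valid, -1 for invalid ones.
    """
    scores = np.asarray(scores, dtype=float)
    eta = np.asarray(eta, dtype=float)
    # logaddexp(0, x) evaluates log(1 + exp(x)) in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, eta * scores)))

# For a valid triple, a smaller score gives a smaller loss.
print(soft_margin_loss([0.1], [1]), soft_margin_loss([3.0], [1]))
# For an invalid triple, a larger score gives a smaller loss.
print(soft_margin_loss([3.0], [-1]), soft_margin_loss([0.1], [-1]))
```

Under this form, valid triples are pushed towards a score of 0 while invalid triples are pushed towards large scores, which prevents the collapse to zero-embeddings described above.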
The learning rules are derived by minimizing the loss Eq. (6b) via gradient descent. In addition, we add a regularization term to the weight learning rule that counters silent neurons. The gradient for entities can be separated into a loss-dependent error and a neuron-model-specific term
while the gradient for relations only consists of the error
The error terms are given by (see section “Spike-based model”)
for SpikE-S, where σ(·) is the logistic function.
The neuron-specific term can be evaluated using Eq. (5), resulting in (see section “Spike-based model”)
For relations, all quantities in the update rule are accessible in the output neuron ON. Apart from an output error, this is also true for the update rules of nLIF spike times. Specifically, the learning rules only depend on spike times—or rather spike time differences—pre-synaptic weights and neuron-specific constants, compatible with recently proposed learning rules for SNNs.
Data:
To evaluate the performance of the spike-based model, we generated graph data from an industrial automation system as shown in
For the following experiments, we use a recording from the industrial automation system with some default network and app activity, resulting in a knowledge graph KG with 3529 nodes, 11 node types, 2 applications, 21 IP addresses, 39 relations, 360 network events and 472 data access events. We randomly split the graph with a ratio of 8/2 into mutually exclusive training and test sets, resulting in 12399 training and 2463 test triples.
We present a model for spike-based graph embeddings, where nodes and relations of a knowledge graph are mapped to spike times and spike time differences in an SNN, respectively. This allows a natural transition from symbolic elements in a graph to the temporal domain of SNNs, going beyond traditional data formats by enabling the encoding of complex structures into spikes. Representations are learned using gradient descent on an output cost function, which yields learning rules that depend on spike times and neuron-specific variables.
In our model, the input gates which populations become active and are consequently updated by plasticity. This memory mechanism allows the propagation of knowledge through all neuron populations, despite the input being isolated triple statements.
After training, the learned embeddings can be used to evaluate or predict arbitrary triples that are covered by the semantic space of the knowledge graph. Moreover, learned spike embeddings can be used as input to other SNNs, providing a native conversion of data into spike-based input.
The nLIF neuron model used in this embodiment is well suited to represent embeddings, but it comes with the drawback of a missing leak term, i.e., the neurons are modeled as integrators with infinite memory. This is critical for neuromorphic implementations, where—most often—variations of the nLIF model with leak are realized. Gradient-based optimization of current-based LIF neurons, i.e., nLIF with leak, can be used in alternative embodiments, making them applicable to energy-efficient neuromorphic implementations. Moreover, output neurons take a simple, but function-specific form that is different from ordinary nLIF neurons. Although realizable in neuromorphic devices, we believe that alternative forms are possible. For instance, each output neuron might be represented by a small forward network of spiking neurons, or relations could be represented by learnable delays.
Finally, the presented results bridge the areas of graph analytics and SNNs, promising exciting industrial applications of event-based neuromorphic devices, e.g., as energy efficient and flexible processing and learning units for online evaluation of industrial graph data.
In TransE, entities and relations are embedded as vectors in an N-dimensional vector space. If a triple ‘s-p-o’ is valid, then subject es and object eo vectors are connected via the relation vector rp, i.e., relations represent translations between subjects and objects in the vector space
es+rp≈eo (11)
In our experiments, similar to SpikE, we use a soft margin loss to learn the embeddings of TransE.
Spike-Based Model
Spike Time Gradients
The gradients for ds can be calculated as follows
All other gradients can be obtained similarly.
Weight gradients:
The spike times of nLIF neurons can be calculated analytically by setting the membrane potential equal to the spike threshold uth, i.e., us,i(t*)=uth:
In addition, for a neuron to spike, three additional conditions have to be met:
the spike occurs before the next causal pre-synaptic spike tc
t*<tc (16)
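The analytic spike time and the causality condition of Eq. (16) can be sketched numerically. The sketch assumes membrane dynamics of the form du/dt = (1/τs)·Σj Wj·θ(t−tj)·exp(−(t−tj)/τs), whose solution for t>tj is u(t)=Σj Wj·(1−exp(−(t−tj)/τs)); the normalization, the function names and the example weights are assumptions, since the membrane equation itself is not reproduced in the text.

```python
import math

def membrane(t, weights, times, tau_s=1.0):
    """Membrane potential under the assumed nLIF dynamics (no leak)."""
    return sum(w * (1.0 - math.exp(-(t - tj) / tau_s))
               for w, tj in zip(weights, times) if t > tj)

def nlif_spike_time(weights, times, tau_s=1.0, u_th=1.0):
    """Return the first threshold crossing t*, or None if the neuron stays silent."""
    order = sorted(range(len(times)), key=lambda j: times[j])
    w_sum = 0.0      # sum of causal weights
    w_exp_sum = 0.0  # sum of causal weights times exp(t_j / tau_s)
    for k, j in enumerate(order):
        w_sum += weights[j]
        w_exp_sum += weights[j] * math.exp(times[j] / tau_s)
        if w_sum <= u_th or w_exp_sum <= 0.0:
            continue  # the current causal set cannot drive the neuron to threshold
        t_star = tau_s * math.log(w_exp_sum / (w_sum - u_th))
        t_next = times[order[k + 1]] if k + 1 < len(order) else math.inf
        # The spike must follow the last causal input and precede the next
        # pre-synaptic spike, cf. Eq. (16).
        if times[j] <= t_star < t_next:
            return t_star
    return None

# Hypothetical pre-synaptic weights and spike times.
weights = [0.8, 0.9]
times = [0.0, 0.5]
t_star = nlif_spike_time(weights, times)
print(t_star, membrane(t_star, weights, times))
```

Evaluating the membrane potential at the returned time reproduces the threshold, and a neuron whose summed causal weights stay below the threshold remains silent, which is the situation the regularization of weights is meant to counter.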
From this, we can calculate the gradient
where we used that
Regularization of weights:
To ensure that all neurons in the embedding populations spike, we use the regularization term Lδ
with ws,i=ΣjWs,ij.
As was shown in
If an entity is represented by distinct subject s and object o populations, these representations will differ after training—although they represent the same entity. By adding triples of the form ‘s—#isIdenticalTo—o’ and keeping risIdenticalTo=0, further alignment can be enforced that increases performance during training.
Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention.
For the sake of clarity, it is to be understood that the use of “a” or “an” throughout this application does not exclude a plurality, and “comprising” does not exclude other steps or elements.
Number | Date | Country | Kind |
---|---|---|---|
21152148.9 | Jan 2021 | EP | regional |