The present application relates to a system and method for attention filtering input datasets for multiple-instance learning.
Multiple-instance learning (MIL) is a variation on conventional supervised machine learning (ML) techniques. Typically, ML techniques receive a labelled training dataset in which each training data instance is a labelled feature vector. A ML technique can be trained to generate a ML model based on the labelled training dataset. In MIL, the labelled training dataset X comprises a plurality of labelled sets of training data {X1, . . . , Xn, . . . , XT} for 1≤n≤T, where T is the number of labelled sets of training data. Each labelled set of training data (a.k.a. a set of labelled bags) comprises a plurality of training data instances {x1,n, . . . , xi,n, xN
Although the n-th set of training data Xn is described, by way of example only, as being mapped to or associated with label l for 1≤l≤L, it is to be appreciated by the skilled person that the remaining sets of data instances Xj for 1≤j≠n≤T may be associated with or mapped to a relationship based on any one of the labels k∈ℒ for 1≤k≤L, where k may be equal to l, from the set of labels ℒ. As each set of training data Xn for 1≤n≤T is mapped to one or more of the labels l∈ℒ for 1≤l≤L, each training data instance xi,n of a labelled set of training data Xn may be assumed to be associated with the same label(s) l that is mapped to the labelled set of training data Xn. For simplicity, it is assumed that the n-th labelled set of training data Xn is mapped to the l-th label l. Thus, each training data instance xi,n of the n-th labelled set of training data Xn represents evidence that can be used to support the value of l. In the simplest case l may be a boolean variable whose value determines whether a fact/relationship is true or false, and Xn comprises a set of potential evidence for this fact/relationship. It is to be appreciated by the skilled person that label l may be any binary or non-binary value whose value determines whether a fact/relationship is more likely or unlikely, and Xn comprises a set of potential evidence for this fact/relationship. Each training data instance may be represented as a feature or encoding vector in K-dimensional space, where K>1.
The n-th labelled set of training data Xn includes a plurality of training data instances {x1,n, . . . xi,n, . . . , xN
Generating a labelled training dataset may be costly and time consuming, especially as the number of training data instances increases. In MIL, a labelled training dataset includes a plurality of sets of labelled training data. Each set of labelled training data includes a plurality of training instances, which may be assumed to be associated with the same label representing evidence supporting or not supporting a relationship.
Even though each plurality of training instances of a set of training data may be associated with a label, several issues exist: not all the training instances in a labelled set of training data are necessarily relevant, and some may in fact contradict the label; there may be training instances in each labelled set of training data that do not convey any information about the relationship they are meant to support; and there may be training instances in each labelled set of training data that are more related to other sets of labelled training data.
Using a labelled training dataset in which each of the labelled sets of training data has one or more of these issues would severely limit, or provide confusing information to, the training of any ML technique that uses the labelled training dataset for generating an ML model (e.g. an ML classifier).
Although each training data instance in each set of training data may be manually verified/checked, this is impractical to do due to the increasing requirement for large datasets for training ML techniques. Automatically creating a labelled training dataset is preferable for generating the required datasets that are large enough for ML applications. However, the above problems are greatly exacerbated when creating training datasets automatically from, by way of example only but not limited to, a corpus of literature/citations or a corpus of image(s), or any other type of data as the application demands.
There is a desire for efficiently creating and/or using sufficiently large labelled training datasets for MIL and using these in a manner that further improves: a) the training of ML techniques and the resulting generated ML models and classifiers; and/or b) automatic creation of labelled training datasets.
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of the known approaches described above.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to determine the scope of the claimed subject matter; variants and alternative features which facilitate the working of the invention and/or serve to achieve a substantially similar technical effect should be considered as falling into the scope of the invention disclosed herein.
The present disclosure provides an attention mechanism that filters training datasets or input multi-instance datasets, each data instance representing one or more entities that may support a factual relationship associated with the one or more entities, to retain the most relevant training data or input data associated with that relationship by using prior knowledge of the relationships between the training data instances or input data instances.
In a first aspect, the present disclosure provides a computer-implemented method for filtering a set of data, the set of data comprising multiple data instances, the method comprising: receiving a set of scores for the set of data; determining attention filtering information based on prior knowledge of one or more relationships between the data instances in said set of data and calculating attention relevancy weights corresponding to the data instances and the set of scores; and providing the attention filtering information to a machine learning, ML, technique or ML model.
Preferably, calculating the attention relevancy weights comprises searching for a set of attention relevancy weights that minimise a cost function based on the set of scores and prior knowledge of one or more relationships between the data instances in said set of data.
Preferably, determining the attention filtering information further comprises filtering the data instances in the set of data by calculating a weighted combination of the calculated attention relevancy weights with an encoding vector associated with the corresponding data instances of said set of data; and providing the attention filtering information further comprises providing data representative of the filtered data instances to the ML technique or ML model.
Preferably, determining the attention filtering information further comprises: calculating attention weights based on the scoring vector; and calculating a weighted combination of the calculated attention relevancy weights with an encoding vector associated with the corresponding data instances of said set of data; and providing the attention filtering information further comprises providing data representative of the weighted combination and the prior knowledge of one or more relationships between data instances to the ML technique or ML model.
Preferably, the set of data is a labelled set of training data of a training dataset, the training dataset comprising a plurality of labelled sets of training data, wherein each labelled set of training data comprises multiple training data instances, and wherein each labelled set of training data is filtered.
Preferably, each of the multiple training data instances is representative of a relationship between one or more entities.
Preferably, each training data instance of each set of training data is associated with the same label n in relation to a relationship and comprises data representative of evidence supporting the relationship being true or false, or any other binary or non-binary value.
Preferably, each data instance comprises a sentence extracted from a corpus of literature, said sentence describing a relationship between multiple entities.
Preferably, each data instance comprises an image or image portion extracted from an image or a corpus of images, said data instance describing an object in an image. For example, image portions may be extracted from an image and may comprise image patches or portions that correspond to an object or a portion of an object in the image (e.g., a tumor).
Preferably, the set of scores is based on a scoring network operating on feature encoding vectors embedding the corresponding data instances, the scoring network based on a neural network structure.
Preferably, prior knowledge of one or more relationships comprises a set of prior knowledge networks or graphs, each prior knowledge network or graph representing a particular type of relationship between data instances of the set of data.
Preferably, the set of prior knowledge graphs comprises one or more prior knowledge networks or graphs from the group of: a citation network or graph; a reference network or graph providing an indication of a relationship between data instances located in the same document in a corpus of literature; or a reference network or graph providing an indication of a relationship between data instances located in different documents in a corpus of literature.
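By way of illustration only, prior knowledge graphs of this kind might be represented as sets of edges over the indices of the data instances. The sketch below assumes that each data instance carries a document identifier and that citations are available as (citing, cited) pairs; the function names and input formats are hypothetical and are not taken from the disclosure.

```python
from itertools import combinations


def same_document_graph(doc_ids):
    """Edges (i, j), i < j, between data instances found in the same document."""
    by_doc = {}
    for index, doc in enumerate(doc_ids):
        by_doc.setdefault(doc, []).append(index)
    edges = set()
    for members in by_doc.values():
        # members are appended in index order, so every pair has i < j
        edges.update(combinations(members, 2))
    return edges


def citation_graph(doc_ids, citations):
    """Edges between instances in different documents where one cites the other.

    `citations` is an assumed input: a set of (citing_doc, cited_doc) pairs.
    """
    linked = {frozenset(pair) for pair in citations}
    edges = set()
    for i, j in combinations(range(len(doc_ids)), 2):
        if doc_ids[i] != doc_ids[j] and frozenset((doc_ids[i], doc_ids[j])) in linked:
            edges.add((i, j))
    return edges


# Four data instances (e.g. sentences) drawn from three documents.
doc_ids = ["paperA", "paperA", "paperB", "paperC"]
g_same = same_document_graph(doc_ids)
g_cite = citation_graph(doc_ids, {("paperB", "paperA")})
print(sorted(g_same))  # [(0, 1)]
print(sorted(g_cite))  # [(0, 2), (1, 2)]
```

Each graph is kept separate so that its contribution can be weighted by its own hyperparameter.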
Preferably, determining the attention filtering information further comprises searching for an attention relevancy weight vector that minimises, over all attention relevancy weight vectors, a cost function based on a similarity between an attention relevancy weight vector and a scoring vector and prior knowledge between data instances of said set of data.
Preferably, searching for the attention relevancy weight vector further comprises minimising an attention cost function:
in relation to the attention relevancy weight vector, {right arrow over (αn)}, for 1≤n≤T, where T is the number of sets of data; Λ(·) is the attention cost function that maps a score vector, {right arrow over (sn)}, for each set of data to a probability distribution Δn={αi≥0, Σαi=1}; G1, . . . , Gm for m≥1 are prior knowledge networks or graphs representing whether each pair of data instances (xi,n,xj,n), for 1≤i≤j≤Nn, has a relationship or not; and each λr∈ℝ+ for 1≤r≤m is a hyperparameter selected to adjust the contribution of the prior knowledge graph Gr.
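By way of illustration only, one form of attention cost function consistent with the definitions above is sketched below. It assumes a squared Euclidean similarity term between the score vector and the attention relevancy weight vector, and an absolute-difference penalty over the pairs of data instances connected in each prior knowledge graph, in the manner of a generalised fused lasso; both choices are assumptions rather than a statement of the claimed expression.

```latex
\Lambda(\vec{s}_n) \;=\; \operatorname*{arg\,min}_{\vec{\alpha}_n \in \Delta^{n}}
\left\{ \tfrac{1}{2}\,\bigl\lVert \vec{s}_n - \vec{\alpha}_n \bigr\rVert_2^{2}
\;+\; \sum_{r=1}^{m} \lambda_r \sum_{(x_{i,n},\,x_{j,n}) \in G_r}
\bigl\lvert \alpha_{i,n} - \alpha_{j,n} \bigr\rvert \right\}
```

Here the first term keeps the weights close to the scores, and each graph term is zero when every connected pair of data instances receives equal weights.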
Preferably, a prior knowledge graph Gr assigns equal attention weights αi,n and αj,n to the pair of data instances (xi,n,xj,n) should they be connected/related; and a prior knowledge graph Gr assigns unequal attention weights αi,n and αj,n to a pair of data instances (xi,n,xj,n) that are not related by the prior knowledge network Gr, or depending on how distantly connected/related they are.
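By way of illustration only, the preference for equal weights on connected data instances might be expressed as a penalty that grows with the difference between the weights of each connected pair. The absolute-difference form and the name `graph_penalty` are assumptions made for this sketch.

```python
def graph_penalty(alpha, edges, lam):
    """Sum of |alpha_i - alpha_j| over connected pairs, scaled by lam (assumed form)."""
    return lam * sum(abs(alpha[i] - alpha[j]) for i, j in edges)


edges = {(0, 1)}            # instances 0 and 1 are related in the prior knowledge graph
equal = [0.4, 0.4, 0.2]     # connected pair shares the same weight: no penalty
unequal = [0.7, 0.1, 0.2]   # connected pair disagrees: penalty of about 0.3

print(graph_penalty(equal, edges, lam=0.5))
print(graph_penalty(unequal, edges, lam=0.5))
```

Unconnected instances, such as instance 2 here, do not contribute to the penalty and so remain free to take any weight.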
Preferably, searching for the set of attention relevancy weights that minimise the cost function further comprises searching for the set of attention relevancy weights using one or more from the group of: a neural network structure or layer configured for determining a set of attention relevancy weights that minimise the cost function; one or more ML techniques configured for determining a set of attention relevancy weights that minimise the cost function; one or more numerical methods or iterative numerical methods configured for determining a set of attention relevancy weights that minimise the cost function; and/or any other algorithm, structure or method for determining a set of attention relevancy weights that minimise the cost function.
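By way of illustration only, an iterative numerical method for such a search might be a projected subgradient descent over the probability simplex. The sketch below assumes a cost made of a squared-error term plus absolute-difference graph penalties; the cost form, step schedule and function names are assumptions, and the method is only one of many that could be used.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum(a) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:          # holds for a prefix of the sorted values
            tau = t
    return [max(x - tau, 0.0) for x in v]


def cost(alpha, scores, graphs, lams):
    """Assumed cost: 0.5 * ||s - alpha||^2 + sum_r lam_r * sum_edges |a_i - a_j|."""
    fit = 0.5 * sum((s - a) ** 2 for s, a in zip(scores, alpha))
    penalty = sum(lam * sum(abs(alpha[i] - alpha[j]) for i, j in edges)
                  for edges, lam in zip(graphs, lams))
    return fit + penalty


def sign(x):
    return (x > 0) - (x < 0)


def attention_weights(scores, graphs, lams, steps=500, step_size=0.1):
    """Projected subgradient descent; returns the lowest-cost iterate seen."""
    alpha = project_to_simplex(scores)
    best, best_cost = alpha, cost(alpha, scores, graphs, lams)
    for t in range(1, steps + 1):
        grad = [a - s for a, s in zip(alpha, scores)]
        for edges, lam in zip(graphs, lams):
            for i, j in edges:
                g = lam * sign(alpha[i] - alpha[j])
                grad[i] += g
                grad[j] -= g
        eta = step_size / t ** 0.5           # diminishing step size
        alpha = project_to_simplex([a - eta * g for a, g in zip(alpha, grad)])
        current = cost(alpha, scores, graphs, lams)
        if current < best_cost:
            best, best_cost = alpha, current
    return best


scores = [2.0, 0.5, 0.1]
graphs = [{(0, 1)}]      # one prior knowledge graph linking instances 0 and 1
lams = [1.0]
weights = attention_weights(scores, graphs, lams)
print([round(w, 2) for w in weights])
```

In this example the linked instances 0 and 1 end up with nearly equal weights, whereas a plain projection of the scores would place all of the weight on instance 0.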
Preferably, determining attention filtering information further comprises calculating an attention-loss function, AL(X,ℒ,{right arrow over (αn)}), comprising a loss function, L(ƒ(X),ℒ), and an attention function, AF(Gl, {right arrow over (αk)}, Xk), for introducing a penalty or reward based on applying one or more prior knowledge graph(s) G1, . . . , Gm and attention relevancy weight vector of attention weights {right arrow over (αn)}=[α1,n, . . . , αi,n, . . . , αN
Preferably, calculating the attention-loss function, AL, further comprises calculating the attention-loss function based on:
where λl∈ℝ+ is a hyperparameter selected to adjust the contribution of the prior knowledge graph Gl, and each attention score αi,n may be calculated based on an attention function.
Preferably, calculating the attention function further comprises calculating an attention function based on one or more from the group of: a SOFTMAX attention function, wherein each attention weight, αi,n, is calculated based on αi,n = exp(si,n)/Σj exp(sj,n), the sum being taken over the data instances j of the set of data, wherein si,n is a corresponding score from the set of scores associated with the set of data; a MAX attention function; a sparsemax attention function; and/or any suitable attention function for calculating attention weights based on at least the set of scores associated with the set of data.
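By way of illustration only, the three named attention functions might be sketched as follows. The softmax and sparsemax definitions are the standard ones; the reading of the MAX attention function as placing all weight on the highest score, with ties sharing equally, is an assumption made for this sketch.

```python
import math


def softmax_attention(scores):
    """alpha_i = exp(s_i) / sum_j exp(s_j), shifted by the maximum for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention(scores):
    """All weight on the highest score; ties share equally (assumed reading)."""
    top = max(scores)
    winners = [s == top for s in scores]
    share = 1.0 / sum(winners)
    return [share if is_winner else 0.0 for is_winner in winners]


def sparsemax_attention(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        threshold = (cumulative - 1.0) / k
        if value - threshold > 0:
            tau = threshold
    return [max(s - tau, 0.0) for s in scores]


scores = [1.0, 0.6, -1.0]
print(softmax_attention(scores))    # every instance keeps some weight
print(max_attention(scores))        # [1.0, 0.0, 0.0]
print(sparsemax_attention(scores))  # approximately [0.7, 0.3, 0.0]
```

Softmax never assigns an exactly zero weight, whereas sparsemax can remove low-scoring data instances entirely, which is closer to filtering in the strict sense.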
Preferably, determining the attention filtering information further comprises filtering the data instances of the set of data by calculating a weighted combination of the attention relevancy weight vector with the encoding vector of the corresponding set of data.
Preferably, the weighted combination is based on a Hadamard multiplication between a matrix of feature encoding vectors associated with the corresponding set of data and the associated attention relevancy weight vector.
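By way of illustration only, the combination of encoding vectors with attention relevancy weights might be sketched as below. The sketch assumes that the weight vector is applied row-wise, so that each encoding vector is scaled by its own weight, and that the scaled rows are then summed into a single filtered vector; the name `attention_filter` is hypothetical.

```python
def attention_filter(encodings, weights):
    """Scale each encoding vector by its weight, then sum the scaled rows.

    encodings: N rows, each a K-dimensional feature encoding vector.
    weights:   N attention relevancy weights, one per data instance.
    """
    # Element-wise (Hadamard) product of the N x K matrix with the weights
    # repeated across the K columns.
    weighted = [[w * value for value in row] for row, w in zip(encodings, weights)]
    # Column sums give one K-dimensional attention-filtered vector.
    filtered = [sum(column) for column in zip(*weighted)]
    return weighted, filtered


encodings = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
weights = [0.5, 0.5, 0.0]   # third instance judged irrelevant
weighted, filtered = attention_filter(encodings, weights)
print(filtered)  # [0.5, 0.5]
```

A data instance with zero weight contributes nothing to the filtered vector, however large its encoding.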
Preferably, the attention-loss function is implemented by the ML technique, ML model or classifier, the attention filtering information comprising data representative of the calculated weighted combination and the prior knowledge data associated with the set of data Xn output by each prior knowledge graph or network, wherein the attention filtering information is input to the attention-loss function of the ML technique, ML model or classifier.
Preferably, filtering of the set of data occurs during training of the ML technique when generating a ML model or classifier, wherein the attention-loss function is penalised if the ML model does not correctly associate the relationship between pairs of data instances based on the prior knowledge data.
Preferably, the method further comprises filtering each set of data of an input dataset, wherein the input dataset comprises a plurality of sets of data, in which each set of data comprises multiple data instances.
Preferably, each of the multiple data instances of a set of data is representative of a relationship between one or more entities of the data instances.
Preferably, each set of data is associated with a relationship between a different one or more entities.
Preferably, each set of data is associated with a relationship between one or more entities, wherein one or more of the relationships between each of the sets of data are different or dissimilar.
Preferably, each set of data is associated with a relationship between one or more entities, wherein one or more of the relationships between each of the sets of data are similar or the same.
Preferably, the determining attention filtering information is based on a structure in which attention relevancy weights are regularised with an attention function based on the generalised fused lasso (GFL) using one or more prior knowledge graphs or a graph of mentions.
Preferably, the GFL is used to calculate an attention relevancy weight vector, w, based on:
where G is a prior knowledge graph/network defined on the input data instances, λ∈ℝ+ is a hyper-parameter, and z is a vector of potentials associated with a potential function, which maps encoding vectors of data instances to a vector of potentials z or scores {right arrow over (s)}, followed by a subsequent mapping from the vector of potentials to the probability simplex Δ.
Preferably, the attention filtering information is used in a relational extraction model.
Preferably, the method further comprising: receiving an encoding vector of the data instances in the form of a matrix X of encoding vectors; calculating attention relevancy weights of the data instances with respect to a given relation ρr, based on an attention function, , defined as: (X, rr)r≡(w1r, . . . , wN
Preferably, the attention function embodies the calculation of potentials associated with the matrix X of encoding vectors and the calculation of attention relevancy weights based on the potentials and prior data associated with the data instances.
Preferably, the method further comprising: determining attention filtering information comprising data representative of the attention filtered vector x(r)=Σk=1N
Preferably, the attention function is implemented based on a potential network based on a potential function and an attention network.
Preferably, the attention network is based on a probability mapping function Δ based on:
where G is a prior knowledge graph defined on the input data instances and λ∈ℝ+ is a hyper-parameter.
In a second aspect, the present disclosure provides a computer-implemented method for training a ML technique to generate an ML model or classifier based on filtering a labelled training dataset comprising a plurality of sets of data according to the method of the first aspect, modifications thereof, and/or as herein described and the like.
In a third aspect, the present disclosure provides a computer-implemented method for classifying or using an ML model based on filtering an input dataset according to the method of the first aspect, modifications thereof, and/or as herein described and the like.
In a fourth aspect, the present disclosure provides a ML model or classifier obtained from the computer implemented method according to any of the first or third aspects, modifications thereof, and/or as herein described and the like.
In a fifth aspect, the present disclosure provides an attention apparatus comprising a processor, a memory and a communication interface, the processor is connected to the memory and the communication interface, wherein the processor, memory and/or communication interface are configured to implement the method or model according to any of the first, second, third and fourth aspects, modifications thereof, and/or as herein described and the like.
In a sixth aspect, the present disclosure provides an attention apparatus comprising a processor and a communication interface, the processor connected to the communication interface, wherein: the communication interface is configured to receive a set of scores for each set of data of an input dataset comprising a plurality of sets of data, in which each set of data comprises multiple data instances; the processor is configured to determine attention filtering information based on prior knowledge of one or more relationships between the data instances in said each set of data and calculating attention relevancy weights corresponding to the data instances and each set of scores; and the communication interface is configured to provide the attention filtering information to a machine learning, ML, technique or ML model.
Preferably, the processor, memory and/or communication interface are configured to implement the method or model according to any of the first, second, third and fourth aspects, modifications thereof, and/or as herein described and the like.
In a seventh aspect, the present disclosure provides a system comprising: an encoding network configured to encode an input dataset into one or more feature encoding vectors, wherein the input dataset comprises a plurality of sets of data, in which each set of data comprises multiple data instances; a scoring network configured to generate a scoring vector for each of the one or more feature encoding vectors; and an attention mechanism configured according to an attention apparatus according to the fifth and/or sixth aspects, modifications thereof, and/or as herein described and the like, the attention apparatus configured for providing attention filtering information based on the encoding vectors and scoring vectors to a ML technique, ML model and/or classifier.
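By way of illustration only, the flow of data through such a system might be sketched as below. The encoding network, scoring network and attention mechanism are replaced by simple stand-in functions (two hand-made features, a fixed linear scorer and a softmax without any prior knowledge graph); all names, features and parameters are hypothetical and serve only to show how the parts connect.

```python
import math


def encode(sentence):
    """Stand-in encoding network: two hand-made features per sentence."""
    return [float(len(sentence.split())), float("inhibits" in sentence)]


def score(encoding, params=(0.0, 3.0)):
    """Stand-in scoring network: a fixed linear layer producing one score."""
    return sum(p * v for p, v in zip(params, encoding))


def attend(scores):
    """Stand-in attention mechanism: softmax, with no prior knowledge graph."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_filtering_information(bag):
    """Encode, score and weight a set of data; return weights and filtered vector."""
    encodings = [encode(x) for x in bag]
    scores = [score(v) for v in encodings]
    weights = attend(scores)
    filtered = [sum(w * v[k] for w, v in zip(weights, encodings))
                for k in range(len(encodings[0]))]
    return weights, filtered


bag = [
    "compound A inhibits protein B",
    "compound A and protein B were purchased",
    "protein B is expressed in liver",
]
weights, filtered = attention_filtering_information(bag)
print(weights.index(max(weights)))  # 0
```

The returned weights and filtered vector stand for the attention filtering information that would be passed on to the ML technique, ML model or classifier.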
Preferably, the system further comprises an ML module configured to receive the attention filtering information for training the ML technique to generate an ML model or classifier.
Preferably, the system further comprises an ML module configured to receive the attention filtering information for input to an ML model.
Preferably, the system further comprises an ML module configured to receive the attention filtering information for input to a classifier.
Preferably, the encoding network, scoring network, attention mechanism and the machine learning module are configured to implement the computer-implemented method according to the first aspect, modifications thereof, and/or as herein described and the like.
In an eighth aspect, the present disclosure provides a computer-readable medium comprising data or instruction code, which when executed on a processor, causes the processor to implement the computer-implemented method of the first aspect, modifications thereof, and/or as herein described and the like.
In a ninth aspect, the present disclosure provides a tangible (or non-transitory) computer-readable medium comprising data or instruction code for filtering a set of data, the set of data comprising multiple data instances, which when executed on one or more processor(s), causes at least one of the one or more processor(s) to perform at least one of the steps of the method of: receiving a set of scores for the set of data; determining attention filtering information based on prior knowledge of one or more relationships between the data instances in said set of data and calculating attention relevancy weights corresponding to the data instances and the set of scores; and providing the attention filtering information to a machine learning, ML, technique or ML model.
Preferably, the tangible (or non-transitory) computer-readable medium further comprises data or instruction code, which when executed on a processor, causes the processor to implement one or more steps of the computer-implemented method of the first aspect, modifications thereof, and/or as herein described and the like.
The methods described herein may be performed by software in machine readable form on a tangible storage medium e.g. in the form of a computer program comprising computer program code means adapted to perform all the steps of any of the methods described herein when the program is run on a computer and where the computer program may be embodied on a computer readable medium. Examples of tangible (or non-transitory) storage media include disks, thumb drives, memory cards etc. and do not include propagated signals. The software can be suitable for execution on a parallel processor or a serial processor such that the method steps may be carried out in any suitable order, or simultaneously.
This application acknowledges that firmware and software can be valuable, separately tradable commodities. It is intended to encompass software, which runs on or controls “dumb” or standard hardware, to carry out the desired functions. It is also intended to encompass software which “describes” or defines the configuration of hardware, such as HDL (hardware description language) software, as is used for designing silicon chips, or for configuring universal programmable chips, to carry out desired functions.
The preferred features may be combined as appropriate, as would be apparent to a skilled person, and may be combined with any of the aspects of the invention.
Embodiments of the invention will be described, by way of example, with reference to the following drawings, in which:
Common reference numerals are used throughout the figures to indicate similar features.
Embodiments of the present invention are described below by way of example only. These examples represent the best ways of putting the invention into practice that are currently known to the Applicant, although they are not the only ways in which this could be achieved. The description sets forth the functions of the example and the sequence of steps for constructing and operating the example. However, the same or equivalent functions and sequences may be accomplished by different examples.
The invention is directed towards an attention mechanism for filtering or extracting not only the most relevant data instances of a set of data instances from an input dataset (e.g. labelled training dataset), but also those relevant data instances of the set of data instances that enhance the training of an ML technique to generate an ML model, and/or enhance the modelling/classification of an ML model/classifier. The input dataset may include a plurality of sets of data (e.g. labelled training data, test data, or input data) for use in training the ML technique and/or for input to an ML model or classifier. The attention mechanism may improve the training of an ML technique to generate an improved ML model or classifier for modelling one or more relationships represented by a set of labels ℒ by filtering out what is considered irrelevant or poor training data in relation to the set of labels ℒ. The attention mechanism may improve the input dataset to an ML model and/or classifier, which may be trained to classify or model relationship(s) represented by the set of label(s) ℒ and for outputting a model estimate or classification of an input set of data in relation to ℒ, by filtering out what may be considered irrelevant or poor data instances in the set of data, thereby preventing such data instances from adversely biasing the ML model/classifier output.
For simplicity, the following description uses MIL on natural language processing, by way of example only, to describe an attention mechanism according to the invention. Although the present invention may be described based on natural language processing in which the labelled training dataset is based on sentences from a corpus of literature/citations, it is to be appreciated by the skilled person that the invention may use any type of labelled training dataset as the application demands. For example, an image processing application may require a labelled training dataset X based on data representative of images or image portions from a corpus of images, which are associated with a set of relationships represented by a set of label(s) ℒ that are to be modelled by an ML model or classifier. Alternatively or additionally, each data instance may include an image or image portion extracted from an image or a corpus of images, and each said data instance may describe an object in an image, where the set of label(s) ℒ may be associated with or mapped to one or more objects. For example, image portions may be extracted from an image that may include image patches or portions that correspond to an object or a portion of an object in the image (e.g., a tumor).
The attention mechanism may use an attention neural network to determine attention filtering information for each set of data based on calculating attention relevancy weights (also known as attention scores or attention relevancy scores) that represent the relevancy of each data instance (e.g. a labelled training data instance) in each set of data, a set of scores or potentials in relation to each set of data, and using prior knowledge of the relationships between pairs of data instances (e.g. every pair of training data instances). The attention filtering information in relation to an input dataset may be provided to an ML technique, ML model and/or classifier in place of the input dataset.
ML technique(s) are used to train and generate one or more trained models or classifiers having the same or a similar output objective associated with input data. ML technique(s) may comprise or represent one or more or a combination of computational methods that can be used to generate analytical models, classifiers and/or algorithms that lend themselves to solving complex problems such as, by way of example only but not limited to, prediction and analysis of complex processes and/or compounds, and classification of input data in relation to one or more relationships. ML techniques can be used to generate analytical models associated with compounds for use in drug discovery, identification, and optimization and other related informatics, chem(o)informatics and/or bioinformatics fields.
Examples of ML technique(s) that may be used by the invention as described herein may include or be based on, by way of example only but not limited to, any ML technique or algorithm/method that can be trained on labelled and/or unlabelled datasets to generate a model or classifier associated with the labelled and/or unlabelled dataset, one or more supervised ML techniques, semi-supervised ML techniques, unsupervised ML techniques, linear and/or non-linear ML techniques, ML techniques associated with classification, ML techniques associated with regression and the like and/or combinations thereof. Some examples of ML techniques may include or be based on, by way of example only but not limited to, one or more of active learning, multitask learning, transfer learning, neural message passing, one-shot learning, dimensionality reduction, decision tree learning, association rule learning, similarity learning, data mining algorithms/methods, artificial neural networks (NNs), deep NNs, deep learning, deep learning ANNs, inductive logic programming, support vector machines (SVMs), sparse dictionary learning, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, genetic algorithms, rule-based machine learning, learning classifier systems, and/or one or more combinations thereof and the like.
Some examples of supervised ML techniques may include or be based on, by way of example only but not limited to, ANNs, DNNs, association rule learning algorithms, the Apriori algorithm, the Eclat algorithm, case-based reasoning, Gaussian process regression, gene expression programming, group method of data handling (GMDH), inductive logic programming, instance-based learning, lazy learning, learning automata, learning vector quantization, logistic model tree, minimum message length (decision trees, decision graphs, etc.), nearest neighbour algorithm, analogical modelling, probably approximately correct (PAC) learning, ripple down rules, a knowledge acquisition methodology, symbolic machine learning algorithms, support vector machines, random forests, ensembles of classifiers, bootstrap aggregating (BAGGING), boosting (meta-algorithm), ordinal classification, information fuzzy networks (IFN), conditional random field, ANOVA, quadratic classifiers, k-nearest neighbour, SPRINT, Bayesian networks, Naive Bayes, hidden Markov models (HMMs), hierarchical hidden Markov model (HHMM), and any other ML technique or ML task capable of inferring a function or generating a model from labelled training data and the like.
Some examples of unsupervised ML techniques may include or be based on, by way of example only but not limited to, expectation-maximization (EM) algorithm, vector quantization, generative topographic map, information bottleneck (IB) method and any other ML technique or ML task capable of inferring a function to describe hidden structure and/or generate a model from unlabelled data and/or by ignoring labels in labelled training datasets and the like. Some examples of semi-supervised ML techniques may include or be based on, by way of example only but not limited to, one or more of active learning, generative models, low-density separation, graph-based methods, co-training, transduction or any other ML technique, task, or class of supervised ML technique capable of making use of unlabelled datasets and labelled datasets for training (e.g. typically the training dataset may include a small amount of labelled training data combined with a large amount of unlabelled data) and the like.
Some examples of artificial NN (ANN) ML techniques may include or be based on, by way of example only but is not limited to, one or more of artificial NNs, feedforward NNs, recursive NNs (RNNs), Convolutional NNs (CNNs), autoencoder NNs, extreme learning machines, logic learning machines, self-organizing maps, and other ANN ML technique or connectionist system/computing systems inspired by the biological neural networks that constitute animal brains and capable of learning or generating a model based on labelled and/or unlabelled datasets. Some examples of deep learning ML technique may include or be based on, by way of example only but is not limited to, one or more of deep belief networks, deep Boltzmann machines, DNNs, deep CNNs, deep RNNs, hierarchical temporal memory, deep Boltzmann machine (DBM), stacked Auto-Encoders, and/or any other ML technique capable of learning or generating a model based on learning data representations from labelled and/or unlabelled datasets.
The attention mechanism according to the invention may be applied to various ML learning techniques based on multiple-instance learning (MIL). In MIL, an input dataset X may include a plurality of sets of data {X1, . . . , Xn, . . . , XD}, for D≥1 is the number of sets of data in the input dataset X, in which each set of data further includes a plurality of data instances {x1,n, . . . , xN
Typically, ML techniques may receive a labelled training dataset in which each training data instance is a labelled feature or encoding vector. An ML technique can be trained to generate an ML model or classifier based on the labelled training dataset. In MIL, the input dataset X may be a labelled training dataset X that includes a plurality of labelled sets of training data {X1, . . . , Xn, . . . , XT}, where T≥1 is the number of labelled sets of training data. Each labelled set of training data (a.k.a. a set of labelled bags) further includes a plurality of training data instances {x1,n, . . . , xi,n, . . . , xN
The n-th labelled set of training data Xn includes a plurality of training data instances {x1,n, . . . , xi,n, . . . , xN
Given the typically large amount of literature/citations, the reliability of the sentence(s) xi,n as evidence for a potential relationship extraction is mostly unknown, so it is useful to train a classifier/model that can reliably extract this evidence from the literature. The attention mechanism according to the invention may be used to filter input training datasets X using prior knowledge of relationships between the training data instances. This can be used to train reliable ML models/classifiers for relationship extraction for input datasets X.
Typically, ML models or classifiers, which have been trained by an ML technique to model and/or classify relationship(s) represented as a set of labels , may receive input dataset X={X1, . . . , Xn, . . . , XD}, where Xn={x1,n, . . . , xp,n, . . . , xN
For example, in gene/protein fields, the relation may be “C regulates D”, where entities C and D are compounds, proteins and/or targets etc. Each input data instance xi,n may include data representative of a sentence extracted from a corpus of literature/citation(s) containing the two or more entities (e.g. a sentence containing “C” and “D”). The n-th set of input data Xn may include data representative of all sentences from the corpus of literature/citations containing the two or more entities (e.g. “C” and “D”). When input to the ML model or classifier, the ML model or classifier may attempt to determine whether or how the set of input data Xn is associated with the relationship(s) represented by a set of label(s) ={1, . . . , L}, each representing a relationship/fact, where L is the number of relationships/facts that are being modelled. Each other set of input data may include other sentences from the corpus of literature/citations in relation to another two or more entities, where the ML model or classifier may be able to determine whether or how they are associated with the relationship(s) represented by a set of label(s) .
The attention filtering information for each set of data Xn may include, by way of example only but is not limited to, data representative of filtered set(s) of data, attention relevancy weights, and/or prior knowledge data from one or more prior knowledge networks/graphs associated with each set of data. Prior knowledge of one or more relationships between pairs of data instances may be represented as a prior knowledge network or graph providing an indication of the relationship(s) between pairs of data instances.
Each prior knowledge network/graph may be coupled to the attention network of the attention mechanism for assisting in determining attention filtering information, such as by way of example only but not limited to, calculating the attention relevancy weights. The attention relevancy weights may be output from the attention mechanism as an attention vector, where each attention relevancy weight represents the relevancy of the corresponding data instance in a set of data Xn. The attention relevancy weights may be used to “filter” the data instances to ensure only the most relevant are used.
For example, when training a ML technique to generate a ML model or classifier, the attention mechanism provides a filtering function by providing attention filtering information to the ML technique. This allows the ML technique to use only the most relevant labelled training data instances from the labelled set of training data during training of an ML model or classifier. This further optimises or enhances the generated ML model or classifier as it will not become over trained or fixated on irrelevant information, which may otherwise skew or bias the ML model or classifier in relation to the set of labels . The attention mechanism applies prior knowledge of a relationship between pairs of labelled training data instances in each set of training data, which may be represented as a prior knowledge network or graph. There may be multiple different relationships represented by the set of labels , which may be represented by a single prior knowledge network or graph or which may be represented by a plurality of prior knowledge networks/graphs. This prior information may be input to the attention network for calculating an attention relevancy weight for each labelled training data instance of a set of labelled training data. Alternatively or additionally, this prior information may be communicated to the ML technique for use in its loss or cost function when generating the ML model or classifier.
The system 100 includes an encoding module 108 coupled to a scoring module 112, the attention mechanism 116 and a machine learning (ML) module 118. The encoding module 108 may be configured to transform each input set of data Xn 106 into a corresponding set of feature encoding vectors 110 in K-dimensional space. The K-dimensional feature encoding vectors are an embedding of the corresponding data instances of the n-th set of data Xn={x1,n, . . . , xi,n, . . . , xN
The attention mechanism 116 is configured to provide attention filtering information, based on the encoding vectors and scoring vectors, to the ML module 118, which assists and/or enables filtering of the input sets of data Xn 106 and hence the input dataset X 104. The attention mechanism 116 receives a set of scores 114 for each set of data 106 of the input dataset 104. The attention mechanism 116 uses the set of scores 114 and also prior knowledge of one or more relationships between the data instances in said each set of data 106 to determine attention filtering information. Determining the attention filtering information may include calculating attention relevancy weights corresponding to the data instances of each set of data 106. The attention filtering information may be provided by the attention mechanism 116 to the ML module 118 (e.g. an ML technique, ML model or classifier) for processing.
For example, when the ML module 118 is configured to implement the training of an ML technique for generating an ML model or classifier, the input dataset X 104 may be a labelled training dataset X={X1, . . . , Xn, . . . , XT}, where T≥1 is the number of labelled sets of training data. The attention mechanism 116 outputs attention filtering information that is used by the ML module 118 for training the ML technique. The attention filtering information is used to identify and/or filter the training dataset X 104 such that a set of relevant training data instances of the training dataset X 104 are used that enhance the training of the ML model or classifier.
In another example, when the ML module 118 is configured to implement an ML model or classifier trained for identifying or classifying input datasets having relationship(s) represented by a set of labels , the input dataset X 104 may be an input dataset X={X1, . . . , Xn, . . . , XD}, where D≥1, comprising extracted data instances requiring modelling and/or classification based on the set of labels . The attention mechanism 116 outputs attention filtering information that is used by the ML module 118 for extracting a set of data instances of the input set of data 106, as represented by the corresponding subset of the encoded data 110 obtained via the encoding module 108, that are relevant for modelling and/or classification by the ML model or classifier. The attention mechanism 116 removes or attenuates the more irrelevant data instances of the input set of data that may otherwise bias the modelling and/or classification of the ML model or classifier.
Thus, a plurality of training data instances x1,n, . . . , xN
Although the present invention may be described based on natural language processing in which the labelled training dataset is based on sentences from a corpus of literature/citations, it is to be appreciated by the skilled person that the invention may use any type of labelled training dataset as the application demands. For example, an image processing application may require a labelled training dataset X based on data representative of images or image portions from a corpus of images; the images may include objects which are associated with a set of labels that are to be modelled by a classifier.
Calculating the attention relevancy weights may include searching or optimising a set of attention relevancy weights that minimise an overall loss objective such as, by way of example only but not limited to, a cost function based on the set of scores 114 and prior knowledge of one or more relationships between the data instances in said each set of data 106. Calculating the attention relevancy weights may further include searching and/or optimising an attention relevancy weight vector, over a set of attention relevancy weight vectors or all attention relevancy weight vectors, that minimises a function based on a similarity between an attention relevancy weight vector and a scoring vector and prior knowledge between data instances of said set of data. The searching and/or optimising over the set of attention relevancy weight vectors (e.g. set of attention relevancy weights) may include, by way of example only but not limited to, using one or more search or optimisation process(es) from the group of: a neural network structure configured for determining a set of attention relevancy weights that minimise the cost function; one or more ML techniques configured or trained for determining a set of attention relevancy weights that minimise the cost function; one or more numerical methods or iterative numerical methods configured for determining a set of attention relevancy weights that minimise the cost function; and/or any other process, algorithm, structure or method for determining a set of attention relevancy weights that may be used to minimise the cost function.
In an example, determining the attention filtering information may further include filtering the data instances in each set of data 106 by calculating a weighted combination of the calculated attention relevancy weights that minimise the cost function with each feature encoding vector 110 associated with the corresponding data instances of said set of data 106. Step 126 may further include providing the attention filtering information, which includes data representative of the filtered data instances, to the ML technique or ML model/classifier.
In another example, determining the attention filtering information may further include calculating attention relevancy weights based on an attention function of the scoring vector 114. For example, an attention function may include a SOFTMAX attention function, where an attention relevancy weight is calculated in relation to each score of the set of scores 114. In another example, an attention function may be based on a MAX attention function, which calculates an attention relevancy weight in relation to the maximum score of the set of scores, and assigns the remaining attention relevancy weights either a 0 or minimal value weight. Further examples may include, by way of example only but not limited to, a sparsemax attention function or any suitable attention function for calculating attention weights based on at least the set of scores associated with the set of data. Based on the calculated attention relevancy weights, a weighted combination of the attention relevancy weights with the corresponding feature encoding vectors associated with the corresponding data instances of said set of data is calculated; and step 126 may further include providing data representative of the weighted combination and the prior knowledge of one or more relationships between data instances as attention filtering information to the ML technique or ML model/classifier. The ML technique or ML model/classifier may use the attention filtering information for further filtering of the input dataset X.
In either example, the attention relevancy weights may form an attention relevancy weight vector, each attention relevancy weight in the attention relevancy weight vector corresponding to a feature encoding vector in the set of feature encoding vectors 110. Each feature encoding vector in the set of feature encoding vectors 110 corresponds to a data instance in the set of data 106. Determining the attention filtering information further includes filtering the set of data instances by calculating a weighted combination of the attention relevancy weight vector with the set of feature encoding vectors of the corresponding set of data 106. The weighted combination may be based on a linear weighted combination such as, by way of example only but not limited to, a Hadamard multiplication between a matrix of feature encoding vectors associated with the corresponding set of data and the associated attention relevancy weight vector.
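By way of illustration only, the three attention functions named above can be sketched in a few lines of Python. The function names and the example scores are hypothetical and chosen for this sketch; the MAX variant here assigns exactly 0 to the remaining weights (one of the two options mentioned above), and the sparsemax variant is written as the Euclidean projection of the scores onto the probability simplex.

```python
import math

def softmax_attention(scores):
    """SOFTMAX attention: every score receives a strictly positive weight."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention(scores):
    """MAX attention: weight 1 on the first maximum score, 0 elsewhere."""
    best = scores.index(max(scores))
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def sparsemax_attention(scores):
    """Sparsemax: Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    running, tau = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if 1.0 + k * value > running:  # the k-th largest score is still in the support
            tau = (running - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

# Hypothetical set of scores for a set of data with three data instances.
scores = [2.0, 1.0, 0.1]
weights = {
    "softmax": softmax_attention(scores),
    "max": max_attention(scores),
    "sparsemax": sparsemax_attention(scores),
}
```

In this sketch each function returns weights that are non-negative and sum to one, so any of them could serve as the attention relevancy weight vector for a set of scores.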
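As a minimal sketch of such a linear weighted combination, the Python below scales each row of a matrix of feature encoding vectors by its attention relevancy weight (a row-wise Hadamard product) and then sums the rows. The helper names, the two-dimensional encodings and the weights are hypothetical values chosen for illustration only.

```python
def hadamard_filter(encodings, attention):
    """Scale each feature encoding vector (row) by its attention relevancy weight."""
    if len(encodings) != len(attention):
        raise ValueError("one attention weight is required per encoding vector")
    return [[weight * x for x in row] for row, weight in zip(encodings, attention)]

def combine(weighted_rows):
    """Sum the weighted rows into a single K-dimensional vector."""
    return [sum(column) for column in zip(*weighted_rows)]

# Hypothetical set of three encoding vectors with K = 2.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = [0.5, 0.5, 0.0]  # the third data instance is filtered out

filtered = hadamard_filter(encodings, attention)
context = combine(filtered)
```

A weight of 0 removes a data instance from the combination entirely, while intermediate weights de-emphasise it without discarding it.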
As described with reference to
For example, in relationship extraction in natural language processing, l may be a label representative of a particular relationship/fact describing the relationship between two or more entities (e.g. “A regulates B”, where entities A and B are compounds, proteins and/or targets etc.). Each training instance xi,n may include data representative of a sentence extracted from a corpus of literature/citation(s) containing the two or more entities (e.g. a sentence containing “A” and “B”). The n-th labelled set of training data Xn may include all sentences from the corpus of literature/citations containing the same or similar two or more entities (e.g. “A” and “B”). Although the n-th set of training data Xn is described, by way of example only, as being mapped or associated with label l for 1≤l≤L, it is to be appreciated by the skilled person that the other labelled sets of training data Xj for 1≤j≠n≤T may include other sentences from the corpus of literature/citations containing two or more other entities, different to those of Xn, that may be associated or mapped to the same or a different relationship as Xn that is represented by any one of the labels k ∈ for 1≤k≤L, where k may be equal to l, from the set of labels .
Although not shown, the following operation of system 200 may be iterated one or more times over the labelled training dataset X 104 until it is considered that the ML model 212 is validly trained. This may be tested using a held-out set of the labelled training dataset X 104. Each of the encoding, scoring and attention modules 108, 112, and 116 may also include, where applicable or as demanded by the application, one or more neural network structures (e.g. recursive neural networks (RNNs), feedforward neural networks (FNNs) and the like) that may also be trained during training of the ML model 212.
Referring back to system 200, the n-th set of training data Xn 106 is input to encoding module 108 which may include an embedding or encoding network that is configured to output a set of K-dimensional encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
The generated vector of scores {right arrow over (s)}n==[s1,n, . . . , sN
As an example, the attention calculation unit 204 may optimise an attention function to determine a suitable attention relevancy weight vector {right arrow over (αn)} for each set of data Xn for 1≤n≤T. This may be used in an attention filter 210 according to the invention. An example attention function for use in determining an optimal vector of attention relevancy weights {right arrow over (αn)} for 1≤n≤T may be based on:
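One form consistent with the surrounding description is sketched below. The squared Euclidean distance used for the similarity term is an assumption; the description only requires a function based on a similarity between the attention relevancy weight vector and the score vector, together with a prior knowledge term weighted by λ.

```latex
\Lambda(\vec{s}_n) = \operatorname*{arg\,min}_{\vec{\alpha}_n \in \Delta^{n}}
  \left\{ \tfrac{1}{2}\,\lVert \vec{s}_n - \vec{\alpha}_n \rVert_2^{2}
  + \lambda \sum_{(x_{i,n},\,x_{j,n}) \in G} \lvert \alpha_{i,n} - \alpha_{j,n} \rvert \right\}
```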
where Λ(·) is the attention function that maps a score vector {right arrow over (sn)} 114 to a probability distribution Δn={αi≥0, Σαi=1}, G is the prior knowledge network/graph 206 that describes whether each pair of training data instances (xi,n,xj,n), for 1≤i≤j≤Nn, have a relationship or not, and λ∈ℝ+ is a hyperparameter selected to adjust the contribution of the prior knowledge network/graph G 206.
The prior knowledge graph G 206 is used to encourage the attention calculation unit 204 (or attention network) to assign equal weights αi,n and αj,n to the pair of training data instances (xi,n,xj,n) should they be connected/related. The prior knowledge term (weighted by λ) will be positive if the pairing (xi,n,xj,n) are related but αi,n≠αj,n and, hence, can be considered to apply a penalty to the task of minimising Λ({right arrow over (s)}n). If a pairing (xi,n,xj,n) is related, then the corresponding αi,n and αj,n should reflect this relationship by being, for example, equal for a “strong” relationship with each other or close to each other for a “medium” relationship. Furthermore, the size of this penalty will be proportional to |αi,n−αj,n| and, hence, a greater penalty will be applied when the difference between the attention relevancy weights is greater. However, the attention calculation unit 204 may assign unequal weights αi,n and αj,n to training data instance pairings (xi,n,xj,n) which are not related by the prior knowledge network G without incurring a penalty in the task of minimising Λ({right arrow over (s)}n). Thus, the solution obtained by minimising Λ({right arrow over (s)}n) corresponds to a set of attention relevancy weights or the attention relevancy weight vector {right arrow over (α)} that takes into account the relationships between the training data instances for each set of data Xn for 1≤n≤T.
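The effect of the penalty can be illustrated with a small numerical sketch. The Python below assumes the objective is half the squared Euclidean distance between the score vector and the weight vector, plus λ times the sum of |αi,n−αj,n| over related pairs; this specific form, the projected subgradient solver, the step size schedule and all names are assumptions made for illustration, since the description leaves the search or optimisation process open. The result is an approximation of the minimiser, not an exact solution.

```python
def project_simplex(z):
    """Euclidean projection of z onto the probability simplex."""
    ordered = sorted(z, reverse=True)
    running, tau = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if 1.0 + k * value > running:
            tau = (running - 1.0) / k
    return [max(v - tau, 0.0) for v in z]

def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def graph_attention(scores, edges, lam, steps=500):
    """Approximately minimise 0.5*||s - a||^2 + lam * sum |a_i - a_j| over the simplex.

    `edges` lists the index pairs (i, j) that the prior knowledge graph marks
    as related. Uses projected subgradient descent with step size 1/t.
    """
    n = len(scores)
    alpha = [1.0 / n] * n  # start from uniform attention
    for t in range(1, steps + 1):
        grad = [a - s for a, s in zip(alpha, scores)]  # gradient of the distance term
        for i, j in edges:  # subgradient of the prior knowledge penalty
            direction = sign(alpha[i] - alpha[j])
            grad[i] += lam * direction
            grad[j] -= lam * direction
        step = 1.0 / t
        alpha = project_simplex([a - step * g for a, g in zip(alpha, grad)])
    return alpha

scores = [2.0, 1.0, 0.1]  # hypothetical scores for three data instances
plain = graph_attention(scores, edges=[], lam=0.0)        # no prior knowledge
fused = graph_attention(scores, edges=[(0, 1)], lam=1.0)  # instances 0 and 1 related
```

Without the graph, almost all of the weight goes to the highest scoring instance; with instances 0 and 1 marked as related, their weights are pulled towards each other, which is the behaviour the penalty is intended to encourage.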
An attention filter 210 may be used by applying each attention vector {right arrow over (αn)} 208 to the corresponding feature encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
The attention filter 210 may achieve filtering by generating an n-th context vector {circumflex over (v)}n based on summing a weighted combination of each attention relevancy weight αi,n with the corresponding feature vector {right arrow over (v)}i,n to generate the n-th context vector {circumflex over (v)}n representing the most relevant feature encoding vectors/training instances of the labelled set of training data Xn. The attention filter 210 generates an n-th context vector {circumflex over (v)}n based on:
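Written out from the description above, and assuming Nn denotes the number of training data instances in the set Xn, the weighted sum may take the form:

```latex
\hat{v}_n = \sum_{i=1}^{N_n} \alpha_{i,n}\, \vec{v}_{i,n}
```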
The n-th context vector {circumflex over (v)}n is input as attention relevancy information for training the ML technique 118a, which generates an ML model ƒ(X) 212 for modelling the relationship represented by set of labels . The attention mechanism 116 processes each of a plurality of labelled sets of training data {X1, . . . , Xn, . . . , XT}, in which the relevant training data instances in each labelled set of training data are used or emphasised to train the ML technique 118a for generating the ML model ƒ(X) 212. Alternatively or additionally, the attention mechanism 116 may be configured to process each of a plurality of labelled sets of training data {X1, . . . , Xn, . . . , XT}, in which only the relevant training data instances in each labelled set of training data are used to train the ML technique 118a for generating the ML model ƒ(X) 212.
In this example, the generated vector of scores {right arrow over (s)}n=[s1,n, . . . , sN
As an example, the attention calculation unit 204 provides an attention function for determining the vector of attention weights {right arrow over (αn)} for 1≤n≤T for use in an attention filter based on:
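One form consistent with the surrounding description is sketched below, with one penalty term per prior knowledge graph. The squared Euclidean distance used for the similarity term is an assumption made for illustration; the description only requires a function based on a similarity between the attention relevancy weight vector and the score vector.

```latex
\Lambda(\vec{s}_n) = \operatorname*{arg\,min}_{\vec{\alpha}_n \in \Delta^{n}}
  \left\{ \tfrac{1}{2}\,\lVert \vec{s}_n - \vec{\alpha}_n \rVert_2^{2}
  + \sum_{r=1}^{m} \lambda_r \sum_{(x_{i,n},\,x_{j,n}) \in G_r} \lvert \alpha_{i,n} - \alpha_{j,n} \rvert \right\}
```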
where Λ(·) is the attention function that maps a score vector {right arrow over (sn)} to a probability distribution Δn={αi≥0, Σαi=1}, G1, . . . , Gm are prior knowledge networks/graphs 206a-206m that describe whether each pair of training data instances (xi,n, xj,n), for 1≤i≤j≤Nn, have a relationship or not, and each λr∈ℝ+ for 1≤r≤m is a hyperparameter selected to adjust the contribution of the prior knowledge graph Gr.
Each prior knowledge graph Gr represents a different relationship between the data instances and is used to encourage the attention calculation unit 204 (or attention network) to assign equal weights αi,n and αj,n to the pair of training data instances (xi,n,xj,n) should they be connected/related. The prior knowledge term (weighted by λr) will be positive if the pairing (xi,n,xj,n) are related but αi,n≠αj,n and, hence, can be considered to apply a penalty to the task of minimising Λ({right arrow over (s)}n). If a pairing (xi,n,xj,n) is related, then the corresponding αi,n and αj,n should reflect this relationship by being, for example, equal for a “strong” relationship with each other or close to each other for a “medium” relationship. Furthermore, the size of this penalty will be proportional to |αi,n−αj,n| and, hence, a greater penalty will be applied when the difference between the attention weights is greater. However, the attention network may assign unequal weights αi,n and αj,n to training data instance pairings (xi,n,xj,n) which are not related by the prior knowledge network Gr, or in a manner that depends on how distantly connected/related they are, without incurring a penalty in the task of minimising Λ({right arrow over (s)}n). Thus, the solution obtained by minimising Λ({right arrow over (s)}n) corresponds to a set of attention relevancy weights or the attention relevancy weight vector {right arrow over (α)} 208 that takes into account the relationships between the training data instances for each set of data Xn for 1≤n≤T.
The attention filter 210 applies each attention vector {right arrow over (αn)} 208 to the corresponding encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
The attention filter 210 may achieve filtering by generating an n-th context vector {circumflex over (v)}n based on summing a weighted combination of each attention relevancy weight αi,n with the corresponding encoding vector {right arrow over (v)}i,n to generate the n-th context vector {circumflex over (v)}n representing the most relevant encoding vectors/training instances of the labelled set of training data Xn. The attention filter 210 generates an n-th context vector {circumflex over (v)}n based on:
The n-th context vector {circumflex over (v)}n is used as input (e.g. as attention relevancy information) for the ML technique 118a, which generates an ML model ƒ(X) 212 for modelling the relationship(s) represented by the set of labels . The attention mechanism 116 processes a plurality of labelled sets of training data {X1, . . . , Xn, . . . , XT}, in which the relevant training data instances in each labelled set of training data are used or emphasised to train the ML technique 118a for generating the ML model ƒ(X) 212. Alternatively or additionally, the attention mechanism 116 may be configured to process each of a plurality of labelled sets of training data {X1, . . . , Xn, . . . , XT}, in which only the relevant training data instances in each labelled set of training data are used to train the ML technique 118a for generating the ML model ƒ(X) 212.
In this example, the modelling/classifying system 220 is illustrated using the attention mechanism 116 of
The n-th set of data Xn 106 may be input to system 220 for classification by ML model/classifier 212. The n-th set of data Xn 106 is input to encoding module 108 to generate a set of N-dimensional encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
As can be seen, the generated vector of scores 114 are passed to the attention mechanism 116, in which the attention calculation unit 204 coupled with the set of prior knowledge network(s) 120a-120m estimates the best set of attention relevancy weights representing the relevancy of each input data instance in the set of input data Xn 106. The attention mechanism 116 outputs an attention relevancy weight vector {right arrow over (α)}n=[α1,n, . . . , αN
In this example, the generated vector of scores {right arrow over (s)}n=[s1,n, . . . , sN
The attention filter 210 applies each attention vector {right arrow over (αn)} 208 to the corresponding encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
The n-th context vector {circumflex over (v)}n is generated based on each attention relevancy weight αi,n and the corresponding encoding vector {right arrow over (v)}i,n of input data instance xi,n, and represents the most relevant feature vectors/input data instances of the set of input data Xn. The n-th context vector {circumflex over (v)}n is input to the ML model/classifier ƒ(X,) 212, which outputs a predicted relationship represented by label estimate in relation to the set of data Xn 106. The label estimate may be compared or matched with the relationship(s) represented by a set of labels ={1, . . . , l, . . . , L} to determine or estimate the relationship/fact represented by the set of data Xn 106.
Although the attention mechanism 116 of
The attention mechanism 230 may include an attention calculation unit 204 (e.g. an attention network) that receives a vector of scores 114 and outputs a vector of attention relevancy weights 208 based on prior knowledge network(s) 206a-206b. Each prior knowledge network 206a or 206b includes data representative of a relationship between pairs of data instances from the set of data Xn.
Prior knowledge network/graph 206a is based on a pairing network 232a that indicates whether a pair of data instances from the set of data Xn are located in the same document from the corpus. For example, the pairing network/graph 232a may provide an indication that two sentences xi,n and xj,n are in the same article or document from the corpus, or an indication that two sentences xi,n and xj,n are not the same article or document. The pairing relationship between pairs of data instances (xi,n, xj,n) is used to adjust the attention function (e.g. the attention function of
Prior knowledge network/graph 206b is based on a citation network 232b that indicates whether a document corresponding to a first data instance xi,n cites another document corresponding to a second data instance xj,n from the set of data Xn. For example, the citation network 232b can be used to understand relationships between documents in the corpus and the sentences x1,n, . . . , xi,n, . . . , xN
The prior knowledge data based on each prior knowledge network/graph 206a and 206b may be injected into the attention calculation unit 204 and used by the attention function to assist the attention calculation unit 204 in calculating and focusing on the most relevant sentences x1,n, . . . , xi,n, . . . xN
Although the attention mechanism 230 uses, by way of example only but not limited to, two prior knowledge graphs/networks 206a and 206b, it is to be appreciated by the skilled person that a single prior knowledge graph based on merging the citation network and pairing networks may be used to inject prior knowledge data into the attention calculation unit 204, or that further relationships may be used for injecting further prior knowledge data into the attention calculation unit 204.
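As a rough sketch of how such prior knowledge graphs might be represented, the Python below builds a pairing edge set (data instances located in the same document), a citation edge set (data instances whose documents are linked by a citation) and a merged edge set. The document identifiers, function names and the treatment of citation edges as undirected are all assumptions made for illustration.

```python
from itertools import combinations

def pairing_edges(instance_docs):
    """Edges between data instances located in the same document."""
    return {
        (i, j)
        for i, j in combinations(range(len(instance_docs)), 2)
        if instance_docs[i] == instance_docs[j]
    }

def citation_edges(instance_docs, citations):
    """Edges between data instances whose documents are linked by a citation.

    `citations` holds (citing_document, cited_document) pairs; the resulting
    edges are treated as undirected in this sketch.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(instance_docs)), 2)
        if (instance_docs[i], instance_docs[j]) in citations
        or (instance_docs[j], instance_docs[i]) in citations
    }

# Hypothetical corpus: five sentences, three from one paper and two from another.
instance_docs = ["paper_a", "paper_a", "paper_a", "paper_b", "paper_b"]
citations = {("paper_a", "paper_b")}  # paper_a cites paper_b

pairing = pairing_edges(instance_docs)
citation = citation_edges(instance_docs, citations)
merged = pairing | citation  # a single merged prior knowledge graph
```

Keeping the two edge sets separate allows each relationship to be given its own weighting, whereas the merged set treats every relationship identically.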
Given that each of the data instances x1,n, x5,n, . . . 244a-244e represents a sentence describing that the first entity 256a and the second entity 246b are related in some way and may support a relationship/fact with label , an attention mechanism according to the invention may make use of prior knowledge or data that may characterise the relationships between the papers 242a and 242b, and the data instances x1,n, . . . , x5,n, . . . 244a-244e in a set of data Xn. In this example, the first and second papers 242a and 242b are related to each other in that the first paper 242a cites the second paper 242b. The knowledge that the first paper 242a cites the second paper 242b may be exploited as so-called prior knowledge for assisting an attention mechanism in determining/filtering out the most relevant data instances x1,n, . . . , x5,n, . . . 244a-244e in the set of data Xn that are more likely to support a common, same or a similar relationship/fact. There may be different types of prior knowledge that may be represented as data structure representing a prior knowledge graph or network. Each prior knowledge graph or network may be generated based on analysing all of the literature in the corpus of literature, identifying for each set of data Xn those citations that include the data instances x1,n, . . . , x5,n, . . . 244a-244e in the set of data Xn, and forming a graph/network based on these citations and data instances x1,n, . . . , x5,n, . . . 244a-244e representing each particular prior knowledge relationship. Each of the prior knowledge graphs/networks can represent prior knowledge data for each set of data Xn, which may be input to the attention mechanism according to the invention. As described with respect to
In this example, the pairing network/graph 250 comprises pairing subnetwork/subgraphs 252a and 252b representing the first paper 242a and second paper 242b, respectively. Each of the pairing subnetwork/subgraphs 252a and 252b represents pairings between two or more data instances x1,n, . . . , x5,n, . . . 244a-244e contained within the corresponding documents/papers 242a and 242b. In this example, the pairing subnetwork/subgraph 252a for the first paper 242a comprises a first node 254a representing data instance x1,n 244a, a second node 254b representing data instance x2,n 244b, and a third node 254c representing data instance x3,n 244c. These nodes are connected with edges indicating that these data instances appear in the same paper, that is the first paper 242a. Similarly, the pairing subnetwork/subgraph 252b for the second paper 242b comprises a first node 254d representing data instance x4,n 244d and a second node 254e representing data instance x5,n 244e. These nodes of pairing subnetwork/subgraph 252b are connected with edges indicating that the data instances x4,n,x5,n 244d-244e appear in the same paper, that is the second paper 242b. Given the pairing network/graph 250, a pairing relationship between pairs of data instances (xi,n,xj,n) may be determined and can be used to adjust the attention function (e.g. the attention function of
For example, the citation network/graph 260 can be used to understand relationships between documents/papers in the corpus 240 and the data instances x1,n, xi,n, x5,n, . . . 244a-244e in the set of data Xn. Thus, the citation relationships between every pair of data instances x1,n, . . . , xi,n, . . . x5,n, . . . 244a-244e in the set of data Xn. For example, the pairing (x1,n, x4,n) indicates data instance x1,n 244a is related to data instance x4,n 244d, because the first paper 242a (e.g. PAPER A) containing data instance x1,n cites the second paper 242b (e.g. PAPER B), which contains data instance x4,n. Furthermore, the edges connecting each pair of data instances xi,n and xj,n may be given a relationship weight based on, by way of example but not limited to, how close the citation is located to the data instance of the paper/document that cites the other paper/document. For example, a stronger/higher relationship weight may be given to a pairing (xi,n, xj,n) based on whether the citation located in the first paper 242a to the second paper 242b occurs within a data instance xi,n. A weaker/lower relationship weight may be given to this pairing (xi,n, xj,n) based on whether the citation in the first paper 242a that cites the second paper 242b is further away from the data instance xi,n. Given the citation network/graph 260, a citation relationship between pairs of data instances (xi,n,xj,n) may be determined and can also be used to adjust the attention function (e.g. the attention function of
The system 300 includes an encoding module 108, scoring module 112, attention mechanism 316 and ML module 310, which in this example implements training of the ML technique to generate the ML model ƒ(X,) 312. The ML technique uses a loss function 308. The encoding module 108 is coupled to the scoring module and the attention mechanism 316. The scoring module 112 is coupled to the attention mechanism 316. The set of data Xn 106 is input to the encoding module 108 and the attention mechanism 316.
The n-th set of training data Xn 106 is input to the encoding module 108 (e.g. an embedding network) that outputs a set of N-dimensional encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
the MAX attention function, or any other attention function based on the vector or set of scores {right arrow over (s)}n=[s1,n, . . . , sN
A first attention filter 210 applies each attention vector {right arrow over (αn )} 308 to the corresponding feature encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
The attention relevancy weight vector {right arrow over (α)}n, n-th context vector {circumflex over (v)}n and the prior knowledge data are input as so-called attention filtering information for use by ML module 310 in conjunction with the encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
The attention relevancy weight vector {right arrow over (α)}n, n-th context vector {circumflex over (v)}n and the prior knowledge data are input as so-called attention filtering information for use by ML module 310. The n-th context vector {circumflex over (v)}n is based on a function of the encoding vectors Vn={{right arrow over (v)}1,n, . . . , {right arrow over (v)}N
As an example, a loss function, L(ƒ(X),), may be used by the ML technique during training over labelled data set X, where the n-th context vectors {circumflex over (v)}n corresponding to each set of training data Xn 106 is input in place of the labelled dataset X and each set of training data Xn 106. That is the ML technique operates on the context vectors, which is a function or a transformation of the input dataset X. The loss function, L(ƒ(X),), is modified to further include an attention filtering mechanism according to the invention. The modified loss function 308 may be considered an attention-loss function, AL(X,,{right arrow over (αn)}), which includes a regularisation term/function (e.g. attention regularisation function), AF(G,{right arrow over (αk)}, Xk), based on using prior knowledge data PGn of the prior knowledge graph G 306 in conjunction with attention relevancy weight vector {right arrow over (αn)}=[α1,n, . . . , αi,n, . . . , αN
where each λ∈+ is a hyperparameter selected to adjust the contribution of unrelated training data instances based on the prior knowledge graph G 306, and each attention score αi,n corresponds to an element of the attention relevancy weight vector {right arrow over (α)}n, each of which may be calculated based on any attention function such as, by way of example only but is not limited to, the SOFTMAX attention function
Thus, for each pair of training data instances (xi,k,xj,k) which are related in the prior knowledge graph G 306, a loss component λ|αi,k−αj,k| is added to L(ƒ(X),). If the attention network in the ML technique assigns different attention weights to the training data instances (xi,k,xj,k), then this component will be positive and, hence, correspond to adding a penalty to the loss function L(ƒ(X),). Otherwise, a loss component is not added to L(ƒ(X),). Thus, the attention-loss function acts as an attention filter that implicitly filters non-relevant training data instances, while simultaneously attempting to retain instances that are related in the prior knowledge graph if at least one of them is deemed relevant; or allows the ML technique to learn how to filter out which training data instances are more relevant than others, while respecting the assumptions in the prior knowledge network.
where each λl∈+ is a hyperparameter selected to adjust the contribution of the prior knowledge graph Gl, and each attention score (or attention weight) αi,n may be calculated based on any attention function such as, by way of example only but is not limited to, the SOFTMAX attention function
or the MAX attention function, sparsemax attention function, or any suitable attention function for calculating attention weights based on at least the set of scores associated with the set of data.
Thus, for each pair of training data instances (xi,k,xj,k) which are related in the prior knowledge network Gl, a loss component λl|αi,k−αj,k| is added to L(ƒ(X),) which will be positive and, hence, a penalty in the task of minimizing AL(X,,{right arrow over (αn)}) if the attention function assigns different attention scores αi,k≠αj,k to a pair of training data instances (xi,k,xj,k) which are related in the corresponding prior knowledge graph Gl. Furthermore, this penalty is proportional to the difference between the attention weights, |αi,k−αj,k|, and, hence, applies a greater penalty when this difference is greater. Otherwise, a loss component is not added to L(ƒ(X),). Thus, the attention function acts as an attention filter that implicitly filters non-relevant training data instances, while simultaneously retaining instances that are related in the prior knowledge network if at least one of them is deemed to be relevant; or allows the ML model to learn how to filter out which training data instances are more relevant than others, and to use the prior knowledge networks to infer the relevance of a greater number of instances.
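By way of illustration only, the penalty described above may be sketched in Python as follows. This is a minimal sketch rather than the claimed implementation: the function names attention_penalty and attention_loss, the representation of each prior knowledge graph as a list of index pairs, and the numeric values are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]  # pair of instance indices (i, j) related in a prior knowledge graph


def attention_penalty(alpha: Sequence[float],
                      graphs: Sequence[List[Edge]],
                      lambdas: Sequence[float]) -> float:
    """Sum over graphs l of lambda_l * sum over related pairs (i, j) of |alpha_i - alpha_j|."""
    total = 0.0
    for edges, lam in zip(graphs, lambdas):
        total += lam * sum(abs(alpha[i] - alpha[j]) for i, j in edges)
    return total


def attention_loss(base_loss: float,
                   alpha: Sequence[float],
                   graphs: Sequence[List[Edge]],
                   lambdas: Sequence[float]) -> float:
    """Base loss plus the graph-based attention penalty (hypothetical helper)."""
    return base_loss + attention_penalty(alpha, graphs, lambdas)


# Illustrative values only: three instances and two prior knowledge graphs.
alpha = [0.5, 0.25, 0.25]
pairing_graph = [(0, 1)]    # instances 0 and 1 are related, but weighted differently
citation_graph = [(1, 2)]   # instances 1 and 2 are related and weighted equally
penalty = attention_penalty(alpha, [pairing_graph, citation_graph], [2.0, 1.0])
total_loss = attention_loss(1.0, alpha, [pairing_graph, citation_graph], [2.0, 1.0])
```

In this sketch only the first pair contributes to the penalty, because the second pair already receives equal attention weights.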
The n-th set of training data Xn 106 is input to the encoding module 108 (e.g. an embedding network) that outputs a set of N-dimensional feature encoding vectors Vn={{right arrow over (v)}1,n, {right arrow over (v)}N
Based on these attention scores, the attention mechanism 116 outputs an attention score vector {right arrow over (α)}n=[α1,n, . . . , αN
As an example, a loss function, L(ƒ(X),), used by an ML technique during training over labelled data set X is modified to further include an attention filtering mechanism according to the invention. The modified loss function 212 may be considered an attention-loss function, AL(X,,{right arrow over (αn)}), which includes an attention function, AF(Gl,{right arrow over (αk)},Xk), based on using one or more prior knowledge graph(s) G1, . . . , Gm in conjunction with attention vector of attention scores
{right arrow over (αn)}=[αi,n, . . . , αN
where each λl∈+ is a hyperparameter selected to adjust the contribution of unrelated training data instances based on the prior knowledge graph Gl, and each attention score αi,n may be calculated based on any attention function such as, by way of example only but is not limited to, the SOFTMAX attention function
Thus, for each pair of training data instances (xi,k,xj,k) which are related in the prior knowledge graph Gl, a loss component λl|αi,k−αj,k| is added to L(ƒ(X),). If the attention network in the ML model 214 assigns different attention weights to the training data instances (xi,k,xj,k), then this component will be positive and, hence, correspond to adding a penalty to the loss function L(ƒ(X),). Otherwise, a loss component is not added to L(ƒ(X),). Thus, the attention-loss function acts as an attention filter that implicitly filters non-relevant training data instances, while simultaneously attempting to retain instances that are related in the prior knowledge graph if at least one of them is deemed relevant; or allows the ML model 214 to learn how to filter out which training data instances are more relevant than others, while respecting the assumptions in the prior knowledge network.
Although
The attention filtering mechanism(s), method(s), apparatus and system(s) as described with reference to
The application of an example attention mechanism according to the invention is now described with respect to MIL in the context of RE, neural network structures (e.g. RNN to encode sequences and FFNN), and/or NLP. It is to be appreciated by the skilled person that the notation and terminology used in describing the attention mechanism(s) in relation to
RE is the process of extracting semantic relations between entities from across a range of text sources. While RE is important in its own right, both as a sub-field of computational linguistics and as a crucial component in natural language understanding/processing, turning vast and rapidly expanding bodies of unstructured information embedded in texts into structured data is also a necessary prerequisite for a wide variety of logical and probabilistic reasoning applications, such as models and/or classifiers generated by ML techniques/tasks. From question answering to relational inference, the performance of these downstream ML techniques/tasks typically relies on having access to complete and up-to-date structured knowledge bases (KB) for, by way of example only but not limited to, labelled training data X.
In most real-world KB completion projects, it is well known that employing a fully supervised RE on unstructured text is not an option given the expense and sheer impracticality of building the necessary datasets. For example, there are, combinatorially, ~10^8 potential protein/gene relation entity pairs and ~10^7 PubMed Central articles that may be used to generate an input dataset of text mentions X for either training a model/classifier and/or for input to a trained model/classifier for probabilistic reasoning applications, relationship extraction and the like. Instead, distant supervision can be used in which curated databases of known relations/relationships/facts are used to automatically annotate the corpus of text mentions. For example, a dataset of text mentions X may be generated that describes a plurality of sets of text mention data Xn for 1≤n≤T, where T is the number of sets of text mention data. Each set of text mention data Xn is representative of multiple text mentions that are mapped to or associated with one of the known relations that is represented by a label l∈={1, . . . , L}, where is a set of labels representing one or more of the known relations/relationship(s)/fact(s) and L≥1 is the number of known relations/relationships/facts in the set of labels . Each set of data Xn includes multiple text mention data instances {x1,n, . . . , xN
A text mention may include, by way of example only but is not limited to, a sentence, statement or paragraph that may describe a relation between multiple entities or entity pairs such as, by way of example only but not limited to, the subject(s) of a sentence. In the biological field of drug discovery and/or optimisation, an entity of a text mention (e.g. a sentence or statement in a text corpus) may include, by way of example but not limited to, one or more compound(s), one or more protein(s), one or more target(s), one or more gene(s), or combinations of pairs thereof and the like (e.g. an entity may be a biological/biomedical entity such as, by way of example only but is not limited to, a disease, a gene, a protein, a drug, a compound, a molecule, a biological pathway, a biological process, an anatomical region, anatomical entity, tissue, or cell type, and the like, etc.). For example, a text mention may be a sentence or statement found in a piece of literature such as, by way of example only but not limited to, “A is a treatment for modulating B”, where A is an entity such as a protein or drug/compound and B is another entity such as a gene etc. Thus, “A” may be extracted as a first entity and “B” may be extracted as a second entity. Since the first and second entities occur in a mention, then they may form an entity pair. In another example, a text mention may be a sentence or statement found in a piece of literature such as, by way of example only but not limited to, “More than 40% of familial cerebral cavernous malformations (CCM) flag patients are affected with mutations in KRIT1, most mutations causing the truncation of the KRIT1 protein”, where “CCM” is annotated as a disease entity, “KRIT1” is annotated as a gene entity, and “KRIT1 protein” is annotated as a protein entity. “CCM” may be referred to as the first entity, “KRIT1” may be referred to as the second entity, and “KRIT1 protein” may be referred to as the third entity in the sentence.
Each pair of entities in this sentence can be considered an entity pair in a text mention, and give rise to three text mentions describing the relationships between the first and second entities, the first and third entities, and the second and third entities, respectively. A text mention can be generated by a pair of entities, which could be proteins, genes, diseases, compounds, or any other concept of interest, occurring in a span of text, which could be a single sentence, a pair of consecutive sentences, a paragraph, or an entire document.
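By way of illustration only, the generation of one text mention per entity pair may be sketched in Python as follows. The function name entity_pair_mentions and the tuple layout are assumptions made for this example, and the entity annotations are taken as already given (no entity recognition is performed).

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Mention = Tuple[str, str, str]  # (first entity, second entity, span of text)


def entity_pair_mentions(span: str, entities: Sequence[str]) -> List[Mention]:
    """Return one mention per unordered pair of annotated entities occurring in the span."""
    return [(first, second, span) for first, second in combinations(entities, 2)]


# Example sentence and annotations as described in the text above.
sentence = ("More than 40% of familial cerebral cavernous malformations (CCM) flag "
            "patients are affected with mutations in KRIT1, most mutations causing "
            "the truncation of the KRIT1 protein")
annotated_entities = ["CCM", "KRIT1", "KRIT1 protein"]
mentions = entity_pair_mentions(sentence, annotated_entities)
```

With three annotated entities, this sketch yields three mentions, one for each of the first/second, first/third and second/third entity pairs.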
Although these examples describe using entity pairs within text mentions that are based on, by way of example only but not limited to, protein(s)/gene(s), it is to be appreciated by the skilled person that the entity pairs and mentions as described herein can be based on any kind of information from a corpus of data as the application demands. For example, the text mentions could be replaced by images and entity pairs replaced by multiple portions of an image that may have or be associated with a particular relation, with each of these image portions corresponding to an instance in a multi-instance data set. In binary relational extraction, for example, a positive label l∈={1, . . . , L} may be assigned to every text mention associated with the relation represented by label l and may form a labelled training text mention data instance xi,n containing data representative of the respective entity-group query.
In the preparation of datasets, such as an input dataset of text mentions X comprising a plurality of sets of text mention data Xn, and in the design/training of the model/classifier, a common question arises as to how much evidence or text mention data one should gather in support of each extracted relation represented by label l∈={1, . . . , L}. For example, in certain cases, a partial subset may be sufficient, such as a single text mention data instance xi,n in the extreme case. For instance, the statement “Aage is the son of Niels” unambiguously establishes a paternal relation. In general, however, this partial-evidence approach is highly fragile, for example, by way of example only but not limited to, for complex relationships (e.g. genetic interactions), temporal relational facts (e.g. “Barack Obama is the president of the United States”), or incomplete or inconsistent pieces of information. A robust RE framework has to be defined by a commitment to pull in the entire body of textual evidence for a prediction.
Distantly supervised RE, in its original formulation of labelling every co-occurring entity pair mention, takes this complete, but no doubt trivial, approach to evidence gathering. In practical implementations, however, one is driven to relax the completeness criteria by the need to mitigate the negative effects on the model/classifier predictions of such a noisy dataset with multiple false positive labels. For instance, a Multi-Instance Machine Learning (MIML)-RE approach, under the assumption that at least one mention supports each relation, reverts to the (extreme) partial evidence set-up by learning and using that single mention and ignoring other possibly relevant mentions. It is not difficult to appreciate that in addition to the consistency and temporal issues highlighted above, discarding a large proportion of the available text data, e.g. a large proportion of training data instances {X1,n, . . . , xN
Although implementing a naive attention mechanism over the text mention data instances xi,n of each set of text mention data Xn (e.g. via a SOFTMAX function over scores or potentials) may be able to select data instances, it suffers severely from the problem of being too “peaky”; that is, being overly selective and concentrated around a very limited set of text mention data instances. For example, for any given entity-pair query, it has been observed that almost all the probability mass (>0.99) gets allocated to the top 2 text mention data instances, regardless of the number of text mention data instances. It is desirable that this behaviour is mitigated, but maximizing the coverage over text mentions is typically not an explicit objective in the training process of an ML technique using naive attention mechanisms.
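By way of illustration only, the “peaky” behaviour of a SOFTMAX over scores may be sketched in Python as follows. The scores are made-up values chosen for this example and are not measurements; the helper names softmax and top_k_mass are likewise assumptions.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_mass(weights: Sequence[float], k: int = 2) -> float:
    """Probability mass allocated to the k largest attention weights."""
    return sum(sorted(weights, reverse=True)[:k])


# Hypothetical scores for eight text mention data instances of one entity pair.
scores = [12.0, 10.5, 3.0, 2.5, 2.0, 1.0, 0.5, 0.0]
weights = softmax(scores)
mass_on_top_two = top_k_mass(weights, k=2)
```

For these illustrative scores, more than 0.99 of the probability mass falls on the two highest-scoring instances, leaving the remaining six with negligible weight.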
The attention mechanism according to the present invention solves the overly “peaky problem” and aims to maximise the coverage over text mention data instances, and further aims to solve the above-mentioned challenges in relation to generating labelled training datasets, by implementing a selective attention mechanism over text mention data instances that is not too selective, yet is capable of selecting a set of text mention data instances that are most relevant for the training process. The attention mechanism makes use of prior knowledge between text mention data instances to maximise coverage over text mentions and, as a result, enhance the training process and reliability of the models/classifiers produced. In this context, the evidence selection task itself is wrapped up in an end-to-end differentiable learning program. This provides various advantages such as, by way of example only but is not limited to, tackling the false-positive labelling problem, and exposing the attention weights (also referred to herein as attention relevancy weights αi,n) that are, naturally, the evidence weights for a given prediction. This latter feature is important for both fact-checking and for presenting to human users nuanced evidence sets for ambiguously-expressed relationships.
The example attention mechanism is based on the concept that two separate text mention data instances that are somehow related are more likely than not to contribute equally weighted evidence towards a putative relation. This relatedness can take many forms, for example, text mention data instances that appear in the same publication, or text mention data instances with the same author, etc. Concretely, this implies that related text mention data instances should have similar attention weights. The example attention mechanism encodes this as prior knowledge between text mention data instances and aims to bias the ML technique for generating a model and also the resulting model towards effectively larger evidence sets and towards better overall generalization abilities. The example attention mechanism may be configured as a simple drop-in replacement for conventional attention mechanisms such as, by way of example only but not limited to, any (unstructured) softmax attention function or naive attention mechanisms and the like.
The example attention mechanism is based on a structure in which the attention weights (a.k.a. attention relevancy weights) are regularised with an attention function based on the Generalised Fused Lasso (GFL), where the total variation terms are imposed over a graph of mentions (e.g. cf. prior knowledge graph(s)/network(s)) where the edges encode a measure of similarity.
Conventional attention mechanism structures in neural networks are typically implemented as a potential function , which maps a list of text mention data instance encodings (also referred to as encoding vector V) to a vector of potentials z (also referred to herein as vector of scores {right arrow over (s)}), and a subsequent mapping from the vector of potentials to the probability simplex Δ. In most applications of attention mechanisms, this second step is implemented by the softmax operator to generate an attention relevancy weight, denoted wk (also referred to as αk), as:
Instead, the example attention mechanism according to the invention is based on the GFL that can be used to calculate an attention relevancy weight vector, w, based on:
where G is a prior knowledge graph/network defined on the input data instances and λ∈ℝ+ is a hyper-parameter. This graph-structured attention mechanism is used to incorporate a citation network as a prior knowledge graph G in biomedical relationship extraction.
The relational extraction model that may be used with the example attention mechanism according to the invention is now described and uses the following notation, definitions and conventions. In the following, the sequence of integers 1, 2, . . . , n is represented by [n]. Abstract objects such as word tokens, mentions, etc. are represented by Greek letters (μ, ρ, . . . ) while vector embeddings/encodings are represented by bold-faced non-italicised letters (v,m).
Relational extraction is based on identifying entities (e.g. protein “A”, protein “B”, gene “C”, compound “D” etc.) within a corpus. The entities are based on a fixed-sized vocabulary of word tokens in the following manner. Let be a fixed-sized vocabulary of word tokens and E={εi}i=1N
:E→P\∅
εi{τi,τi
where P(S) is the power-set of some set S and i*∈[] the indices that define the word representations in the vocabulary of the entity εi. The reverse function that captures the inherent ambiguity of word tokens may be defined as follows:
→P(E)\∅
τj{Σj
where j*∈[NE] indexes in E all the potential entities that the token τj references.
Let be some rule-based or separately-trained entity-linking model that projects a single entity from the set of dictionary candidates for each word, i.e. :{εj
Although a partition of the vocabulary may be defined in order to provide word embeddings that are appropriate to a given class of words, for simplicity a single word embedding/encoding vector space may be defined. Partitioning the vocabulary may nevertheless be useful: for example, a vocabulary that includes all the elements in a syntactic parse of a piece of text might separate the text symbols from the dependency relations that link them; the former contains ~10^6 elements while the latter contains only 10^2, which means that separate embedding spaces may be useful.
Having defined entities, the text mention data instances for a dataset X may be constructed based on so-called text mentions. For a given pair of entities (εi, εj), a set of entity-pair text mentions
(also referred to herein as a set of data Xn or a set of text mention data Xn, where n=M(i,j)) may be constructed from a text corpus in which each text mention is a sequence of words that contains, in the following sense, the two entities. Although each text mention is defined, by way of example only but not limited to, as a sequence of words, it is to be appreciated by the skilled person that mentions are not limited to only sequences, but may include, by way of example only but not limited to, any sentence, paragraph, statement of text and the like. Although most forms may be sequential, with sentences being the most common, other examples of mentions that are non-sequential include, by way of example only but is not limited to, a sentence span between the two entities, contiguous sentences, paragraphs, statements or even whole documents. Non-sequential forms may typically be parse-trees.
If μk(i,j)=(τk
Relations/relationships between entities may be defined based on the following. Let R={ρr}r=1NR be a set of relation objects in a pre-specified schema. A binary rank-3 tensor may be defined as: Y≅2N
One of the problems in relational extraction is to estimate the unknown components of P using a set of mentions M (e.g. one or more sets of data Xn) built from the relevant text corpuses. This involves constructing a statistical model to construct a score tensor S≈N
The known components of Ŷ (e.g. cf. relationships represented by labels ) may be used as training labels for training in a distant supervision framework.
Given this notation an example relational extraction model architecture may be described as follows. In this example, the RE model architecture includes an encoder module, the example attention module/mechanism according to the invention, and a classification module. The encoder module receives the abstract dataset and outputs a vector encoding. The attention module may form the example attention mechanism, which receives the vector encoding and, in this example, includes a potential network (e.g. scoring network) and an attention network, which outputs attention filtering information in the form of a filtered vector encoding. The filtered vector encoding may be based on a weighted sum of the vector encoding weighted with attention relevancy weights as calculated.
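By way of illustration only, the weighted sum that produces the filtered vector encoding may be sketched in Python as follows. The function name attention_filtered_vector and the numeric values are assumptions made for this example; the attention relevancy weights are taken as already calculated.

```python
from typing import List, Sequence


def attention_filtered_vector(encodings: Sequence[Sequence[float]],
                              weights: Sequence[float]) -> List[float]:
    """Aggregate the encoding vectors as a weighted sum using attention relevancy weights."""
    if len(encodings) != len(weights):
        raise ValueError("one attention weight is required per encoding vector")
    dimension = len(encodings[0])
    return [sum(w * vector[d] for w, vector in zip(weights, encodings))
            for d in range(dimension)]


# Illustrative values only: three 2-dimensional mention encodings and their weights.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]
filtered = attention_filtered_vector(encodings, weights)
```

The result has the same dimension as a single encoding vector, so a downstream classification module can consume it regardless of how many mentions were in the set.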
The encoder module operates in the following manner. Let → be the function that maps abstract word tokens to -dimensional real vectors, then the abstract word token, τi, may be mapped to ti based on: τi(τi)≡ti, for i∈[] the vocabulary index. Similarly for relations, the embedding :R→d
The word and relation embedding functions may be implemented as neural network structures that are configured to be either pre-trained, fixed embedding functions (e.g. via word2vec) or trainable embedding matrices with × and NR×dR parameter entries respectively. The former may be used with an already trained classifier/model and the example attention mechanism, whereas the latter may be trained when using the example attention mechanism during training of a ML technique that generates a model/classifier for relationship extraction and the like.
The entity-pair mentions are represented by the sequence of their word token vector representations ti, i.e. for a mention μk=(τk
ε:× . . . ×→
$m_k \mapsto x_k$
where, ε may be, by way of example only but is not limited to, a simple RNN, for example, xk is the final hidden state, i.e. xk≡hn=σ(Wihtk
The attention module is configured to receive the vector encoding of the mentions, which in this example, is the matrix X of mention encodings. The attention module uses an attention function, , to calculate the attention relevancy weights (or evidence weights) of the mentions with respect to a given relation ρr, defined as: (X,rr)wr≡(w1r, . . . , wN
Once the attention relevancy weight vector wr has been determined for each set of mention encodings (e.g. data instances), then an attention filtered vector based on the aggregation of the evidence of instances or encoded mentions via a simple weighted sum is calculated, where for each entity-pair x′∈d
In this example, the attention module implements the attention mechanism in two parts: a) a potential network (a.k.a. scoring network) based on a potential function; and b) an attention network. For the potential network, a potential function is derived that maps a set or list of mentions to a vector of potentials z (e.g. scores). Then, the attention network applies a function Δ to map z to the probability simplex.
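By way of illustration only, the two-part structure may be sketched in Python as follows. Both parts are stand-ins chosen for this example: the potential function is a plain dot product between each mention encoding and a relation vector, and the mapping to the probability simplex is an ordinary softmax rather than the graph-regularised mapping of the example attention mechanism. All names and values are assumptions.

```python
import math
from typing import List, Sequence


def potentials(mention_encodings: Sequence[Sequence[float]],
               relation_vector: Sequence[float]) -> List[float]:
    """Part a): map each mention encoding to a scalar potential (score).

    Assumption: a dot product with the relation vector is used as the potential.
    """
    return [sum(x * r for x, r in zip(encoding, relation_vector))
            for encoding in mention_encodings]


def to_simplex(z: Sequence[float]) -> List[float]:
    """Part b): map potentials to the probability simplex (softmax used as a stand-in)."""
    highest = max(z)
    exps = [math.exp(value - highest) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only: three mention encodings and one relation vector.
mention_encodings = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
relation_vector = [1.0, 1.0]
z = potentials(mention_encodings, relation_vector)
w = to_simplex(z)
```

Keeping the two parts as separate functions means that the simplex mapping can be swapped for a different one without touching the potential network.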
For example, the function may be implemented as a bilinear form that acts on each mention input independently. So for each relation ρr,
z:d
$(x_k, r_r) \mapsto z_k^r \equiv x_k^{\top} A\, r_r,$
where A∈d
In relation to the attention network, a probability mapping function Δ based on:
where G is a prior knowledge graph defined on the input data instances and λ∈ℝ+ is a hyper-parameter. The attention network may be, by way of example only but is not limited to, a neural network layer or a minimisation algorithm and the like. This graph-structured attention mechanism is used to incorporate a citation network as a prior knowledge graph G in biomedical relationship extraction, where G is the set of pairs of mention indices defining a structure graph (or prior knowledge graph) of mentions. For example, J Djolonga and A Krause, “Differentiable Learning of Submodular Models”, In Proceedings of Neural Information Processing Systems (NIPS), December, 2017 showed that the mapping from z to w defined by solving the above optimisation problem corresponds to a differentiable function. This permits the use of this mapping as a neural network layer or as a neural network structure and the like.
The structure graph (or prior knowledge graph/network) over mentions is represented by G. In this example, a single structure graph is used for simplicity, but it is to be appreciated (e.g. see
The restriction to the largest connected component of the citation network enables a significant simplification to be made, where the generalized fused lasso on the graph G may be approximated by a 1-dimensional fused lasso on the chain graph corresponding to a Depth First Search (DFS) traversal of G. In this approximation, the regularising term of the generalised fused lasso is replaced according to Σ(a,b)∈G|ya−yb|→Σc∈DFS(G)|yc+1−yc|, where DFS(G) is the sequence of mention indices corresponding to a depth-first search pre-ordering of nodes in the graph G. By construction, the root of the DFS tree is the ‘oldest’ mention in G.
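By way of illustration only, the chain approximation may be sketched in Python as follows. The sketch computes a depth-first search pre-ordering of a small connected graph and evaluates the resulting 1-dimensional total variation term; it does not solve the fused lasso problem itself. The function names, the toy graph, the choice of node 0 as the ‘oldest’ mention, and the tie-breaking rule (lowest index first) are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def build_adjacency(edges: Sequence[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Build an undirected adjacency list from pairs of mention indices."""
    adjacency: Dict[int, List[int]] = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def dfs_preorder(adjacency: Dict[int, List[int]], root: int) -> List[int]:
    """Depth-first search pre-ordering of the nodes reachable from the root."""
    order: List[int] = []
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbours in reverse so that the lowest index is visited first.
        for neighbour in sorted(adjacency.get(node, []), reverse=True):
            if neighbour not in seen:
                stack.append(neighbour)
    return order


def chain_total_variation(y: Sequence[float], order: Sequence[int]) -> float:
    """Sum of |y[next] - y[current]| along consecutive nodes of the DFS ordering."""
    return sum(abs(y[b] - y[a]) for a, b in zip(order, order[1:]))


# Toy graph of four mentions; node 0 is assumed to be the oldest mention.
edges = [(0, 1), (0, 2), (1, 3)]
order = dfs_preorder(build_adjacency(edges), root=0)
y = [0.4, 0.3, 0.2, 0.1]
chain_penalty = chain_total_variation(y, order)
```

Note that consecutive nodes in the pre-ordering (here nodes 3 and 2) need not be adjacent in G, which is why the chain term is an approximation of the graph term rather than an identity.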
In this example, the classifier module receives the attention filtering information, which may include the data representative of the attention filtered vector x(r). The classifier module processes the attention filtered vector to calculate a score tensor for classifying the input data associated with the attention filtered vector. For example, the classifier module may implement a simple classifier to calculate a score tensor via a simple linear function Sijr, =Wx(r)+b, where x(r)≡x(ijr), and W∈d
As an example, a simplified setting of a single binary relation with Sijr→Sij may be used to demonstrate the system. Let dM= and let A be a diagonal matrix and, without loss of generality, fix r≡(1, 1, . . . , 1). A model may be trained by a ML technique (e.g. neural network) to minimize the cross-entropy loss L(θ)=−Σ(i,j)∉Y′ log(1−Ŷij(θ))−Σ(i,j)∈Y′ log(Ŷij(θ)), where Y′ is the set of all known relation pairs in the knowledge base, and θ represents the trainable parameters in the neural network. During training, negative sampling may be performed from the complement set of un-linked pairs Y′c with a negative-positive ratio of 10. Specifically, for each positive pair (εi,εj), a negative pair (εi,εj′) is sampled, where j′ is sampled randomly from [NM] and satisfies (εi,εj′)∉Y′. The model/classifier may be trained using stochastic gradient descent with adaptive moment estimation (ADAM). In this manner, the attention module selects the most relevant subsets of the training dataset.
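By way of illustration only, a binary cross-entropy loss with negative sampling may be sketched in Python as follows. This is a minimal sketch under stated assumptions: predictions are supplied as a dictionary of probabilities keyed by entity-index pairs, negatives are drawn with replacement, and the names cross_entropy and sample_negatives are hypothetical. No optimiser is included.

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple

Pair = Tuple[int, int]  # (index of first entity, index of second entity)


def cross_entropy(predictions: Dict[Pair, float],
                  positives: Sequence[Pair],
                  negatives: Sequence[Pair]) -> float:
    """Binary cross-entropy: -sum log(p) over positives - sum log(1 - p) over negatives."""
    loss = 0.0
    for pair in positives:
        loss -= math.log(predictions[pair])
    for pair in negatives:
        loss -= math.log(1.0 - predictions[pair])
    return loss


def sample_negatives(positive: Pair, num_entities: int, known: Set[Pair],
                     ratio: int, rng: random.Random) -> List[Pair]:
    """Sample `ratio` pairs (i, j') that share the first entity and are not known relations."""
    i, _ = positive
    candidates = [j for j in range(num_entities) if (i, j) not in known]
    if not candidates:
        raise ValueError("no un-linked second entity is available for this first entity")
    return [(i, rng.choice(candidates)) for _ in range(ratio)]


# Illustrative values only: one known pair among four entities, ratio of 10.
known_pairs = {(0, 1)}
negatives = sample_negatives((0, 1), num_entities=4, known=known_pairs,
                             ratio=10, rng=random.Random(0))
example_loss = cross_entropy({(0, 1): 0.5, (0, 2): 0.5}, [(0, 1)], [(0, 2)])
```

In this sketch a prediction of 0.5 for one positive and one negative pair gives a loss of 2·ln 2, the value expected from an uninformative classifier.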
The attention module/mechanism according to the invention is now tested by incorporating a structured attention layer for relational extraction in the context of link prediction in the human protein-protein interaction (PPI) network.
Although the following structured attention layer for relational extraction is described, by way of example only but is not limited to, extracting human protein-protein interaction from the literature, the skilled person will appreciate that the structured attention layer may be applied to any relational extraction problem or process. Some biomedical applications/examples may include (where the bold text highlights the possible entity pairs), by way of example only but is not limited to, extracting disease-gene associations from the literature (e.g. targeting IRAK1 as a therapeutic approach for Myelodysplastic Syndrome); extracting protein-protein interactions from the literature (e.g. identify the molecular mechanisms by which p-cav-1 leads directly to the upregulation of CD86); extracting disease-drug associations from the literature (e.g. oral administration of topiramate significantly reduced gross pathological signs and microscopic damage in primary affected colon tissue in the TNBS-induced rodent model of IBD); extracting drug mechanism of action associations from the literature (e.g. Topiramate also inhibits some isozymes of carbonic anhydrase (CA), such as CA-II and CA-IV); or extracting any type of first biological/biomedical entity-second biological/biomedical entity (e.g. a biological/biomedical entity may include, by way of example only but is not limited to, a disease, a gene, a protein, a drug, a compound, a molecule, a biological pathway, a biological process, an anatomical region, anatomical entity, tissue, or cell type) interaction, association, mechanism of action, or other relationship of interest from the literature as the application demands.
In operation, a PPI network knowledge base is built from OmniPath to form a database of human signalling pathways curated from the biomedical literature. The set of all PubMed (e.g. http://www.ncbi.nlm.nih.gov/PubMed) abstracts is used as the source of unstructured data. The abstract text is parsed using the Stanford dependency parser and lemmatized by BioLemmatizer. Protein names in the text are linked to proteins in the knowledge base using LeadMine.
For a pair of proteins within a sentence, each mention is defined to be the sequence of lemmas and dependencies along the shortest dependency path between the entities (excluding the entity pair tokens). From this a citation network may be formed. A random 0.70/0.15/0.15 training/validation/test split of the OmniPath interacting protein pairs is performed.
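By way of illustration only, a random 0.70/0.15/0.15 training/validation/test split of entity pairs may be sketched as follows. This is a minimal sketch: the function name `split_pairs`, the fixed seed, and the toy list of protein pair identifiers are hypothetical and merely stand in for the OmniPath interacting protein pairs.

```python
import random


def split_pairs(pairs, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle the pairs and cut them into training/validation/test subsets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


# Hypothetical placeholder identifiers for interacting protein pairs.
pairs = [(f"P{i}", f"P{i + 1}") for i in range(100)]
train, val, test = split_pairs(pairs)
```

Splitting at the level of entity pairs, rather than individual mentions, keeps all mentions of a given pair within a single subset.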
In order to characterise the increased distribution of attention relevancy weights, the mean Effective Sample Size (ESS) is used and is defined by:
where {ŵk}, k=1, . . . , N
Each of the magnitudes of the attention relevancy weights is illustrated by a hatched pattern that corresponds to the hatched pattern of the magnitude scale 419. The attention relevancy weights 418a, 416, 417 and 418b for each mention in the set of mentions are also illustrated along with the hatched patterns corresponding to their magnitudes. Thus, for a given entity pair from the held-out test set, the attention relevancy weights 418a, 416, 417 and 418b are extracted and overlaid over the corresponding nodes 412a-412l and 414a-414b of the citation network 410. In this example, it is clear that the conventional softmax attention layer focuses on only one mention represented by node 414a, which is overlaid with an attention relevancy weight 416 with a magnitude in the order of 10⁰. The mention represented by node 414b is overlaid with an attention relevancy weight 417 with a magnitude in the order of 10⁻⁸ to 10⁻⁴. The remaining mentions represented by nodes 412a-412l are overlaid with attention relevancy weights 418a and 418b with a magnitude in the order of 10⁻³² to 10⁻²⁸, which for all intents and purposes is zero.
Each of the magnitudes of the attention relevancy weights is illustrated by a hatched pattern that corresponds to the hatched pattern of the magnitude scale 419. The attention relevancy weights 428a, 426 and 428b for each mention in the set of mentions are also illustrated along with the hatched patterns corresponding to their magnitudes. Thus, for a given entity pair from the held-out test set, the attention relevancy weights 428a, 426 and 428b are extracted and overlaid over the corresponding nodes 422a-422i and 424a-424e of the citation network 420. In this example, it is clear that the structured attention mechanism according to the invention 404 distributes the attention over multiple mentions represented by nodes 424a-424e, which are overlaid with an attention relevancy weight 426 with a magnitude in the order of 10⁰. The remaining mentions represented by nodes 422a-422i are overlaid with attention relevancy weights 428a and 428b with a magnitude in the order of 10⁻³² to 10⁻²⁸, which for all intents and purposes is zero. As illustrated, the structured attention mechanism according to the invention 404 has filtered the set of mentions to retain the most relevant mentions represented by nodes 424a-424e of the set of mentions.
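By way of illustration only, the contrast between attention concentrated on a single mention and attention spread over several mentions may be quantified with an effective sample size. The sketch below assumes the commonly used definition for normalised weights, ESS = 1/Σk ŵk², which is an assumption made here for illustration and may differ from the definition used above. The function names and the toy scores are hypothetical, and the evenly spread weights are a hand-constructed stand-in rather than the output of the structured attention mechanism.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a vector of scores."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def effective_sample_size(weights):
    """ESS = 1 / sum_k w_k^2 for weights normalised to sum to one (assumed definition)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(1.0 / np.sum(w ** 2))


# Softmax over widely separated scores puts almost all mass on one mention.
peaked = softmax([40.0, 20.0, 0.0, 0.0, 0.0])
# Hypothetical weights spread evenly over five mentions.
spread = np.full(5, 0.2)

ess_peaked = effective_sample_size(peaked)   # close to 1
ess_spread = effective_sample_size(spread)   # close to 5
```

A mean ESS may then be obtained by averaging this quantity over the sets of mentions of the entity pairs under evaluation.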
Further aspects of the invention may include one or more apparatus and/or devices that include a communications interface, a memory unit, and a processor unit, the processor unit connected to the communications interface and the memory unit, wherein the processor unit, memory unit and communications interface are configured to perform the system(s), apparatus, method(s) and/or process(es) or combinations thereof as described herein with reference to
The ML module/device 528 may be configured to receive the attention filtering information for training a ML technique to generate an ML model or classifier. Additionally or alternatively, the ML module/device 528 may be configured to receive the attention filtering information for input to an ML model (e.g. a trained ML model). Additionally or alternatively, the ML module/device 528 may be configured to receive the attention filtering information for input to a classifier. Although some of the functionalities of the system have, by way of example only but not limited to, been described with reference to
In other aspects, an attention apparatus according to the invention may include a processor, a memory and/or a communication interface, the processor is connected to the memory and/or the communication interface, where the processor is configured to implement the process(es) 120, 130, and/or apparatus/systems 100, 200, 220, 230, 300, 400, 410, 420, 500 and 520, and/or prior knowledge graphs 240, 250, 260, 270, and/or ML model(s), classifier(s), and/or any method(s)/process(es), step(s) of these process(es), modifications thereof, as described with reference to any one or more
In a further aspect, an attention apparatus according to the invention may include a processor and/or a communication interface, the processor connected to the communication interface, where: the communication interface is configured to receive a set of scores for each set of data of an input dataset comprising a plurality of sets of data, in which each set of data comprises multiple data instances; the processor is configured to determine attention filtering information based on prior knowledge of one or more relationships between the data instances in said each set of data and calculating attention relevancy weights corresponding to the data instances and each set of scores; and the communication interface is configured to provide the attention filtering information to a machine learning, ML, technique or ML model.
Furthermore, the process(es) 120, 130, and/or apparatus/systems 100, 200, 220, 230, 300, 400, 410, 420, 500 and 520, and/or prior knowledge graphs 240, 250, 260, 270, and/or ML model(s), classifier(s), and/or any method(s)/process(es), step(s) of these process(es), modifications thereof, as described with reference to any one or more
Furthermore, an ensemble model or a set of models may also be obtained from process(es) 100, 120, 500 and/or apparatus/systems 200, 220, 238, 250, 400, 410, and/or any method(s)/process(es), step(s) of these process(es), as described with reference to any one or more
In the embodiment(s) described above the computing device or system may be based on a server or server system that may comprise a single server or network of servers. In some examples the functionality of the server may be provided by a network of servers distributed across a geographical area, such as a worldwide distributed network of servers, and a user may be connected to an appropriate one of the network of servers based upon a user location.
The above description discusses embodiments of the invention with reference to a single user for clarity. It will be understood that in practice the system may be shared by a plurality of users, and possibly by a very large number of users simultaneously.
The embodiments described above are fully automatic. In some examples a user or operator of the system may manually instruct some steps of the method to be carried out.
In the described embodiments of the invention the system may be implemented as any form of a computing and/or electronic device. Such a device may comprise one or more processors which may be microprocessors, controllers or any other suitable type of processors for processing computer executable instructions to control the operation of the device in order to gather and record routing information. In some examples, for example where a system on a chip architecture is used, the processors may include one or more fixed function blocks (also referred to as accelerators) which implement a part of the method in hardware (rather than software or firmware). Platform software comprising an operating system or any other suitable platform software may be provided at the computing-based device to enable application software to be executed on the device.
Various functions described herein can be implemented in hardware, software, or any combination thereof. If implemented in software, the functions can be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media may include, for example, computer-readable storage media. Computer-readable storage media may include volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. A computer-readable storage medium can be any available storage medium that may be accessed by a computer. By way of example, and not limitation, such computer-readable storage media may comprise RAM, ROM, EEPROM, flash memory or other memory devices, CD-ROM or other optical disc storage, magnetic disc storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disc and disk, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and blu-ray disc (BD). Further, a propagated signal is not included within the scope of computer-readable storage media. Computer-readable media also includes communication media including any medium that facilitates transfer of a computer program from one place to another. A connection, for instance, can be a communication medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of communication medium. Combinations of the above should also be included within the scope of computer-readable media.
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, hardware logic components that can be used may include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
Although illustrated as a single system, it is to be understood that the computing device may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device.
Although illustrated as a local device it will be appreciated that the computing device may be located remotely and accessed via a network or other communication link (for example using a communication interface).
The term ‘computer’ is used herein to refer to any device with processing capability such that it can execute instructions. Those skilled in the art will realise that such processing capabilities are incorporated into many different devices and therefore the term ‘computer’ includes PCs, servers, mobile telephones, personal digital assistants and many other devices.
Those skilled in the art will realise that storage devices utilised to store program instructions can be distributed across a network. For example, a remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and some at the remote computer (or computer network). Those skilled in the art will also realise that, by utilising conventional techniques, all or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, programmable logic array, or the like.
It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. Variants should be considered to be included within the scope of the invention.
Any reference to ‘an’ item refers to one or more of those items. The term ‘comprising’ is used herein to mean including the method steps or elements identified, but that such steps or elements do not comprise an exclusive list and a method or apparatus may contain additional steps or elements.
As used herein, the terms “component” and “system” are intended to encompass computer-readable data storage that is configured with computer-executable instructions that cause certain functionality to be performed when executed by a processor. The computer-executable instructions may include a routine, a function, or the like. It is also to be understood that a component or system may be localized on a single device or distributed across several devices.
Further, as used herein, the term “exemplary” is intended to mean “serving as an illustration or example of something”.
Further, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.
The figures illustrate exemplary methods. While the methods are shown and described as being a series of acts that are performed in a particular sequence, it is to be understood and appreciated that the methods are not limited by the order of the sequence. For example, some acts can occur in a different order than what is described herein. In addition, an act can occur concurrently with another act. Further, in some instances, not all acts may be required to implement a method described herein.
Moreover, the acts described herein may comprise computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media. The computer-executable instructions can include routines, sub-routines, programs, threads of execution, and/or the like. Still further, results of acts of the methods can be stored in a computer-readable medium, displayed on a display device, and/or the like.
The order of the steps of the methods described herein is exemplary, but the steps may be carried out in any suitable order, or simultaneously where appropriate. Additionally, steps may be added or substituted in, or individual steps may be deleted from any of the methods without departing from the scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.
It will be understood that the above description of a preferred embodiment is given by way of example only and that various modifications may be made by those skilled in the art. What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable modification and alteration of the above devices or methods for purposes of describing the aforementioned aspects, but one of ordinary skill in the art can recognize that many further modifications and permutations of various aspects are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
1805293.6 | Mar 2018 | GB | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2019/050927 | 3/29/2019 | WO | 00 |