The present invention relates to information processing, and more particularly to analogy-based reasoning with memory networks for future prediction.
Making predictions about what might happen in the future is important for reacting adequately in many situations. For example, observing that “Man kidnaps girl” may have the consequence that “Man kills girl”. While this is part of common sense reasoning for humans, it is not obvious how machines can learn and generalize over such knowledge automatically. Hence, there is a need for a technique to enable machines to make accurate future predictions.
According to an aspect of the present invention, a computer-implemented method is provided. The method includes accessing, by a processor, a training set of positive and negative event pairs. The method further includes calculating, by the processor, (i) positive similarity scores between an input pair of events and the positive event pairs in the training set, and (ii) negative similarity scores between the input pair of events and the negative event pairs in the training set. The method also includes applying, by the processor, a Softmax process to (i) the positive similarity scores to produce an overall positive similarity score for the input pair of events relative to the negative event pairs, and (ii) the negative similarity scores to produce an overall negative similarity score for the input pair of events relative to the positive event pairs. The method additionally includes calculating, by the processor, the difference between the overall positive similarity score and the overall negative similarity score to obtain a future event prediction score indicating a future occurrence likelihood of at least one of two constituent events forming the input pair of events. The method also includes performing, by the processor, an action responsive to the future event prediction score.
According to another aspect of the present invention, a computer program product is provided. The computer program product includes a non-transitory computer readable storage medium having program instructions embodied therewith. The program instructions are executable by a computer to cause the computer to perform a method. The method includes accessing, by a processor, a training set of positive and negative event pairs. The method further includes calculating, by the processor, (i) positive similarity scores between an input pair of events and the positive event pairs in the training set, and (ii) negative similarity scores between the input pair of events and the negative event pairs in the training set. The method also includes applying, by the processor, a Softmax process to (i) the positive similarity scores to produce an overall positive similarity score for the input pair of events relative to the negative event pairs, and (ii) the negative similarity scores to produce an overall negative similarity score for the input pair of events relative to the positive event pairs. The method additionally includes calculating, by the processor, the difference between the overall positive similarity score and the overall negative similarity score to obtain a future event prediction score indicating a future occurrence likelihood of at least one of two constituent events forming the input pair of events. The method also includes performing, by the processor, an action responsive to the future event prediction score.
According to yet another aspect of the present invention, a computer processing system is provided. The computer processing system includes a processing element. The processing element is configured to access a training set of positive and negative event pairs. The processing element is further configured to calculate (i) positive similarity scores between an input pair of events and the positive event pairs in the training set, and (ii) negative similarity scores between the input pair of events and the negative event pairs in the training set. The processing element is also configured to apply a Softmax process to (i) the positive similarity scores to produce an overall positive similarity score for the input pair of events relative to the negative event pairs, and (ii) the negative similarity scores to produce an overall negative similarity score for the input pair of events relative to the positive event pairs. The processing element is additionally configured to calculate the difference between the overall positive similarity score and the overall negative similarity score to obtain a future event prediction score indicating a future occurrence likelihood of at least one of two constituent events forming the input pair of events. The processing element is also configured to perform an action responsive to the future event prediction score.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
The present invention is directed to analogy-based reasoning with memory networks for future prediction.
In an embodiment, the present invention exploits the distinction between logical relations and temporal relations. It has been noted that if an entailment relation holds between two events, then the second event is likely not a new future event. For example, the phrase “man kissed woman” entails that “man met woman”, where “man met woman” happens before (not after) “man kissed woman”. To find such entailments, we can leverage the relations between verbs in a lexical database (e.g., of English, if the target language is English; otherwise, a different language could be used, while maintaining the spirit of the present invention), wherein nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Verbs that tend to be in a temporal (happens-before) relation have been extracted on a large scale. For example, we observe that (subject, buy, object) tends to temporally precede (subject, use, object). We consider entailment and (logical) implication to be equivalent here. In particular, synonyms are considered to be in an entailment relation.
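As an illustration of how such a lexical database can be queried, the following is a minimal sketch assuming the WordNet interface of the NLTK toolkit (an illustrative assumption; the embodiment does not prescribe a particular toolkit). It tests whether one verb entails another, treating synonyms as entailments as noted above:

```python
# A minimal sketch, assuming NLTK's WordNet corpus is installed
# (pip install nltk, then nltk.download('wordnet') once).
from nltk.corpus import wordnet as wn

def verb_entails(verb_left: str, verb_right: str) -> bool:
    """Return True if some sense of verb_left entails (or is a synonym of)
    some sense of verb_right, per the entailment convention above."""
    right_synsets = set(wn.synsets(verb_right, pos=wn.VERB))
    for left_synset in wn.synsets(verb_left, pos=wn.VERB):
        # Synonyms share a synset; .entailments() lists implied verb synsets.
        if left_synset in right_synsets:
            return True
        if right_synsets & set(left_synset.entailments()):
            return True
    return False

print(verb_entails("snore", "sleep"))  # True: snoring implies sleeping
print(verb_entails("sleep", "snore"))  # False: sleeping does not imply snoring
```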
In an embodiment, a model is presented that can predict future events given a current event triplet (subject, verb, object). To make the model generalizable to unseen events, a deep learning structure is adopted such that the semantics of unseen events can be learned through word/event embeddings. A novel Memory Comparison Network (MCN) is provided that can learn to compare and combine the similarity of input events to the event relations saved in memory.
Referring now in detail to the figures in which like numerals represent the same or similar elements, and initially to FIG. 1, an exemplary processing system 100, to which the present invention can be applied, is shown in accordance with an embodiment of the present invention.
A first storage device 122 and a second storage device 124 are coupled to system bus 102 by the I/O adapter 120. The storage devices 122 and 124 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. The storage devices 122 and 124 can be the same type of storage device or different types of storage devices.
A speaker 132 is operatively coupled to system bus 102 by the sound adapter 130. A transceiver 142 is operatively coupled to system bus 102 by network adapter 140. A display device 162 is operatively coupled to system bus 102 by display adapter 160.
A first user input device 152, a second user input device 154, and a third user input device 156 are operatively coupled to system bus 102 by user interface adapter 150. The user input devices 152, 154, and 156 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles. The user input devices 152, 154, and 156 can be the same type of user input device or different types of user input devices. The user input devices 152, 154, and 156 are used to input and output information to and from system 100.
Of course, the processing system 100 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 100, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 100 are readily contemplated by one of ordinary skill in the art given the teachings of the present principles provided herein.
Moreover, it is to be appreciated that environment 200 described below with respect to FIG. 2 is an environment for implementing respective embodiments of the present invention.
Also, it is to be appreciated that mechanism 300 described below with respect to FIG. 3 is a mechanism for implementing respective embodiments of the present invention.
Further, it is to be appreciated that processing system 100 may perform at least part of the methods described herein including, for example, at least part of method 400 of FIG. 4 and/or at least part of method 600 of FIG. 6.
The environment 200 at least includes a computing node 210 operatively coupled to a set of computing nodes (e.g., servers, providers of services, etc.) 220.
Each of the computing node 210 and the computing nodes 220 at least includes a processing element 231, a memory 232, and a communication device 233. The communication device 233 can be, for example, but is not limited to, a wireless transceiver, an Ethernet adapter, a Network Interface Card (NIC), and so forth.
The computing node 210 is trained using training data. The training data can include, for example, positive event pairs and negative event pairs. The training data can be obtained from the set of computing nodes 220 or from another source(s). The training data is used for analogy-based reasoning by a memory comparison network formed by and/or otherwise included in the computing node 210. The analogy-based reasoning can be implemented by a model or other data structure in order to generate prediction scores as described herein. The Softmax process described herein with respect to at least FIGS. 3 and 4 can be implemented by the memory comparison network to generate the prediction scores.
The computing node 210 receives testing data from the set of computing nodes 220. The computing node 210 then performs analogy-based reasoning using the memory comparison network to generate a future prediction.
The computing node 210 and/or any of the computing nodes 220 can be and/or otherwise include any type of computer processing system or device such as, but not limited to, servers, desktops, laptops, tablets, smart phones, media playback devices, and so forth, depending upon the particular implementation. For the sake of illustration, the computing node 210 and the computing nodes 220 are servers.
The computing node 210 can be configured to perform an action (e.g., a control action) on a controlled system, machine, and/or device 230 responsive to detecting an anomaly. Such action can include, but is not limited to, one or more of: applying an antivirus detection and eradication program; powering down the controlled system, machine, and/or device 230 or a portion thereof; powering down, e.g., a system, machine, and/or a device that is affected by an anomaly in another device; opening a valve to relieve excessive pressure (depending upon the anomaly); locking an automatic fire door; and so forth. As is evident to one of ordinary skill in the art, the action taken is dependent upon the type of anomaly and the controlled system, machine, and/or device 230 to which the action is applied.
In an embodiment, a safety system or device 240 can implement the aforementioned or other action, responsive to a control signal from the computing node 210. The safety system or device 240 can be used to control a shut off switch, a fire suppression system, an overpressure valve, and so forth. As is readily appreciated by one of ordinary skill in the art, the particular safety system or device 240 used depends upon the particular implementation to which the present invention is applied. Hence, the safety system 240 can be located within or proximate to or remote from the controlled system, machine, and/or device 230, depending upon the particular implementation.
In the embodiment shown in FIG. 2, the elements thereof are interconnected by a network(s). However, in other embodiments, other types of connections can also be used.
At block 410, input a pair of events 311. The pair of events 311 can be represented by embedding vectors of the two events. In the example of FIG. 3, the pair of events 311 forms the input to the memory comparison network of mechanism 300.
At block 420, calculate the similarity scores, referred to as relation similarity or “rel-sim” 321 for short, between the input pair of events 311 and all positive event pairs 320 in a training set (of positive and negative event pairs). In the example of FIG. 3, the positive event pairs 320 are stored in a memory of the memory comparison network.
At block 430, record the results (the similarity scores “rel-sim” 321).
At block 440, apply a Softmax process (Equation (6)) 340 to the similarity scores “rel-sim” 321 to produce an overall positive similarity score Opos 341 with respect to positive event pairs 320. In an embodiment, Opos 341 can be calculated as a weighted average.
At block 450, calculate the similarity scores, referred to as relation similarity or “rel-sim” 351 for short, between the input pair of events 311 and all negative event pairs 350 in the training set. In the example of FIG. 3, the negative event pairs 350 are likewise stored in a memory of the memory comparison network.
At block 460, record the results (the similarity scores “rel-sim” 351).
At block 470, apply a Softmax process (Equation (6)) 370 to the similarity scores “rel-sim” 351 to produce an overall negative similarity score Oneg 371 with respect to the negative event pairs 350. In an embodiment, Oneg 371 can be calculated as a weighted average.
At block 480, calculate the difference 380 between the overall positive similarity score Opos and the overall negative similarity score Oneg as the predicted happens-before score (also interchangeably referred to as the “future prediction score”).
At block 490, perform an action responsive to the predicted happens-before score.
It is to be noted that, for historical reasons, the score was named the “happens-before score”. However, in practice, in another embodiment of the present invention, the score can also be treated as a “happens-after score” if one fixes the first event. Thus, we use this score to predict the likelihood of a future event.
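Before turning to training, the following is a minimal sketch of the scoring flow of blocks 410 through 490, assuming each event pair has already been encoded as a single l2-normalized embedding vector and substituting a plain dot product for the trained relation similarity of Equation (1); the function names and toy data are hypothetical:

```python
import numpy as np

def softmax_weights(u: np.ndarray, gamma: float) -> np.ndarray:
    """Softmax weighting of similarity scores; gamma controls peakedness."""
    w = np.exp(gamma * u - np.max(gamma * u))  # shifted for numerical stability
    return w / w.sum()

def happens_before_score(input_pair: np.ndarray,
                         pos_pairs: np.ndarray,
                         neg_pairs: np.ndarray,
                         gamma: float = 1.0) -> float:
    """Blocks 420-480: rel-sim against both memories, Softmax-pool, subtract."""
    u_pos = pos_pairs @ input_pair                 # blocks 420/430: rel-sim 321
    u_neg = neg_pairs @ input_pair                 # blocks 450/460: rel-sim 351
    o_pos = softmax_weights(u_pos, gamma) @ u_pos  # block 440: Opos 341
    o_neg = softmax_weights(u_neg, gamma) @ u_neg  # block 470: Oneg 371
    return float(o_pos - o_neg)                    # block 480: difference 380

# Toy usage: a positive score suggests a happens-before (future) relation.
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
memory_pos = np.stack([unit(rng.normal(size=8)) for _ in range(5)])
memory_neg = np.stack([unit(rng.normal(size=8)) for _ in range(5)])
print(happens_before_score(unit(memory_pos[0] + 0.1), memory_pos, memory_neg))
```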
At step 610, collect labels for positive and negative event pairs.
At step 620, sample a positive happens-before event pair $(e_l, e_r^{pos})$ and a negative happens-before event pair $(e_l, e_r^{neg})$.
At step 630, calculate a loss function value (Equation (9)), and apply backpropagation to reduce the loss function value. In backpropagation, the parameters to be optimized are the parameters of the neural network $g_\theta$ and/or the word embeddings ($x$ and $y$) in Equation (1).
At step 640, determine whether or not the summed loss value is small enough (e.g., below a threshold value). If so, then terminate the method. Otherwise, return to step 620.
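A rough sketch of this training loop follows; the toy scoring stub, the synthetic data, and the use of a numeric gradient in place of full backpropagation are all assumptions made for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: theta = (sigma, beta) plays the role of the trainable neuron
# parameters; the rows stand in for embedded happens-before event pairs.
pos_pairs = rng.normal(loc=0.5, size=(8, 4))
neg_pairs = rng.normal(loc=-0.5, size=(8, 4))
theta = np.array([1.0, 0.0])

def score(theta: np.ndarray, pair: np.ndarray) -> float:
    sigma, beta = theta
    return 1.0 / (1.0 + np.exp(-(sigma * pair.sum() + beta)))  # sigmoid neuron

def summed_loss(theta: np.ndarray) -> float:
    # Step 630: rank margin loss summed over sampled positive/negative pairs
    return sum(max(0.0, 1.0 - score(theta, p) + score(theta, n))
               for p, n in zip(pos_pairs, neg_pairs))

lr, eps = 0.1, 1e-5
for epoch in range(200):                    # step 620: (re)visit the samples
    if summed_loss(theta) < 2.0:            # step 640: loss small enough?
        break
    grad = np.zeros_like(theta)             # step 630: numeric gradient here
    for i in range(theta.size):             # stands in for backpropagation
        d = np.zeros_like(theta); d[i] = eps
        grad[i] = (summed_loss(theta + d) - summed_loss(theta - d)) / (2 * eps)
    theta -= lr * grad
print(theta, summed_loss(theta))
```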
A description will now be given regarding exploiting lexical features, in accordance with an embodiment of the present invention.
In an embodiment, the present invention is directed to distinguishing future events from other events. In texts, such as news stories, an event $e_l$ is more likely to have happened before an event $e_r$ (temporal order) if $e_l$ occurs earlier in the text than $e_r$ (textual order). However, there are also many situations where this is not the case: re-phrasing, introduction of background knowledge, conclusions, and so forth. One obvious solution is to use discourse parsers. However, without explicit temporal markers, discourse parsers suffer from low recall, and therefore, in practice, most script-learning systems use textual order as a proxy for temporal order. In an embodiment, the present invention explores whether common knowledge can help to improve the detection of future events from event sequences given in textual order.
In an embodiment, common knowledge is assumed to be given in the form of simple relations (or rules) like (company, buy, share)→(company, use, share), where “→” denotes the temporal happens-before relation. In contrast, we denote the logical entailment (implication) relation by “⇒”.
To extract such common knowledge rules, the use of lexical resources is explored. In an embodiment, the lexical resources used included (1) the VerbOcean semantic network of verbs and (2) the WordNet lexical database of English. Of course, the present invention is not limited to the preceding lexical resources and, thus, other lexical resources can also be used in accordance with the teachings of the present invention, while maintaining the spirit of the present invention. Logical and temporal relations are not independent, but an interesting overlap exists, as illustrated in the figures.
A description will now be given regarding data creation, in accordance with an embodiment of the present invention.
For simplicity, we restrict our investigation here to events of the form (subject, verb, object). In an embodiment, events are extracted from around 790,000 English news articles. The news articles were pre-processed using the Stanford dependency parser and co-reference resolution. We lemmatized all words; for subjects and objects, we considered only the head words and ignored words like WH-pronouns.
All relations are defined between two events of the form $(S, V_l, O)$ and $(S, V_r, O)$, where the subject $S$ and the object $O$ are the same. As candidates, only events in sequence (order of occurrence in the text) were considered.
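The following sketch illustrates this kind of extraction; it substitutes spaCy for the Stanford dependency parser used in the embodiment and omits co-reference resolution, so it is an approximation only:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(text: str):
    """Yield lemmatized (subject, verb, object) triples, head words only,
    ignoring WH-pronoun subjects as described above."""
    for sentence in nlp(text).sents:
        for token in sentence:
            if token.pos_ != "VERB":
                continue
            subjects = [t for t in token.children if t.dep_ == "nsubj"]
            objects = [t for t in token.children if t.dep_ in ("dobj", "obj")]
            if subjects and objects and subjects[0].tag_ not in ("WP", "WDT"):
                yield (subjects[0].lemma_, token.lemma_, objects[0].lemma_)

text = "The company bought the shares. Later, the company used the shares."
print(list(extract_svo(text)))
# e.g., [('company', 'buy', 'share'), ('company', 'use', 'share')]
```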
A description will now be given regarding positive samples, in accordance with an embodiment of the present invention.
Positive samples are extracted of the form $(S, V_l, O) \to (S, V_r^{pos}, O)$, if:
1. $V_l \to V_r^{pos}$ is listed in VerbOcean as a happens-before relation.
2. $\neg[V_l \Rightarrow V_r^{pos}]$ according to WordNet. That means, for example, if $(S, V_r, O)$ is a paraphrase of $(S, V_l, O)$, then this is not considered a temporal relation.
This way, we were able to extract 1699 positive samples. Examples of happens-before relations extracted from news articles are shown in TABLE 2.
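A sketch of this two-condition filter is given below; the flat-file format assumed for VerbOcean ("verb1 [happens-before] verb2 :: score") and the details of the WordNet entailment test are illustrative assumptions, not the patented implementation:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def load_happens_before(verbocean_path: str) -> set:
    """Parse VerbOcean lines of the assumed form 'v1 [happens-before] v2 :: s'."""
    pairs = set()
    with open(verbocean_path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 3 and parts[1] == "[happens-before]":
                pairs.add((parts[0], parts[2]))
    return pairs

def wordnet_entails(v_l: str, v_r: str) -> bool:
    """True if V_l => V_r per WordNet (entailment link or shared synset)."""
    right = set(wn.synsets(v_r, pos=wn.VERB))
    for synset in wn.synsets(v_l, pos=wn.VERB):
        if synset in right or right & set(synset.entailments()):
            return True
    return False

def is_positive_sample(v_l: str, v_r: str, happens_before: set) -> bool:
    # Condition 1: listed in VerbOcean as happens-before.
    # Condition 2: not an entailment (e.g., not a paraphrase) per WordNet.
    return (v_l, v_r) in happens_before and not wordnet_entails(v_l, v_r)
```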
A description will now be given regarding negative samples, in accordance with an embodiment of the present invention.
Using VerbOcean, we extracted negative samples of the form $(S, V_l, O) \nrightarrow (S, V_r^{neg}, O)$, i.e., the event on the left-hand side $(S, V_l, O)$ is the same as for a positive sample. If $(S, V_l, O) \nrightarrow (S, V_r^{neg}, O)$, then $V_l \to V_r^{neg}$ is not listed in VerbOcean. This way, we extracted 1177 negative samples.
There are several reasons why a pair of events may fail to be in a temporal relation. Using VerbOcean and WordNet, we analyzed the negative samples and found that the majority (1030 relations) could not be classified with either VerbOcean or WordNet. We estimated conservatively that around 27% of these relations are false negatives: for a sub-set of 100 relations, we labeled a sample as a false negative if it could have an interpretation as a happens-before relation. This criterion over-estimates the number of false negatives, because it also counts as a false negative a happens-before relation that is less likely than the corresponding happens-after relation.
To simplify the task, we created a balanced data set by pairing positive and negative samples: each sample pair contains one positive and one negative sample, and the task is to determine that the positive sample is more likely to be a happens-before relation than the negative sample. The resulting data set contains 1765 pairs in total.
A description will now be given regarding analogy-based reasoning for happens-before relation scoring, in accordance with an embodiment of the present invention.
In the following, let $r$ be a happens-before relation of the form:

$$r: e_l \to e_r$$

where $e_l$ and $e_r$ are two events of the form $(S, V_l, O)$ and $(S, V_r, O)$, respectively. Furthermore, let $e'$ be any event of the form $(S', V', O')$.
Our working hypothesis consists of the following two claims:
(I) If $(e' \Rightarrow e_l) \wedge (e_l \to e_r)$, then $e' \to e_r$.
(II) If $(e' \Rightarrow e_r) \wedge (e_l \to e_r)$, then $e_l \to e'$.
For example, consider the following:
“John buys computer”⇒“John acquires computer”,
“John acquires computer”→“John uses computer”.
Using (I), we can reason that:
“John buys computer”→“John uses computer”.
We note that, in some cases, “⇒” in (I) and (II) cannot be replaced by “⇐”. This is illustrated by the following example:
“John knows Sara” ⇐ “John marries Sara”,
“John marries Sara”→“John divorces from Sara”.
However, the next statement is considered wrong (or less likely to be true):
“John knows Sara”→“John divorces from Sara”.
In practice, using word embeddings, it can be difficult to distinguish between “⇒” and “⇐”. Therefore, our proposed method uses the following simplified assumptions:
(I*) If $(e' \sim e_l) \wedge (e_l \to e_r)$, then $e' \to e_r$.
(II*) If $(e' \sim e_r) \wedge (e_l \to e_r)$, then $e_l \to e'$.
where $\sim$ denotes some similarity that can be measured by means of word embeddings.
A description will now be given regarding a memory comparison network, in accordance with an embodiment of the present invention.
We propose a memory-based network model that uses the assumptions (I*) and (II*). It bases its decision on one (or more) training samples that are similar to a test sample. In contrast to other methods, such as neural networks for script learning and (non-linear) SVM ranking models, it has the advantage of giving an explanation of why a relation is considered (or not considered) a happens-before relation.
In the following, let $r_1$ and $r_2$ be two happens-before relations of the form:

$$r_1: (S_1, V_{l1}, O_1) \to (S_1, V_{r1}, O_1)$$
$$r_2: (S_2, V_{l2}, O_2) \to (S_2, V_{r2}, O_2)$$

Let $x_{S_i}$, $x_{V_{li}}$, $x_{V_{ri}}$, and $x_{O_i}$ denote the word embeddings of the subject, the left verb, the right verb, and the object of relation $r_i$, respectively.
We define the similarity $\text{sim}_\theta(r_1, r_2)$ between the two relations $r_1$ and $r_2$ (Equation (1)), where $g_\theta$ is an artificial neuron with $\theta = \{\sigma, \beta\}$, a scale parameter $\sigma \in \mathbb{R}$, and a bias parameter $\beta \in \mathbb{R}$, followed by a non-linearity. We use the sigmoid function as the non-linearity. Furthermore, we assume here that all word embeddings are l2-normalized.
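A minimal sketch of such a similarity follows, assuming one plausible form consistent with the description above: the per-slot dot products of the l2-normalized embeddings are summed and then passed through the neuron $g_\theta$. This combination rule and the parameter values are assumptions for illustration:

```python
import numpy as np

def g_theta(z: float, sigma: float, beta: float) -> float:
    """Artificial neuron: scale sigma, bias beta, then sigmoid non-linearity."""
    return 1.0 / (1.0 + np.exp(-(sigma * z + beta)))

def sim_theta(r1: dict, r2: dict, sigma: float = 5.0, beta: float = -2.0) -> float:
    """Similarity of two happens-before relations (cf. Equation (1)).

    r1, r2 map the slots S, Vl, Vr, O to l2-normalized embedding vectors.
    Summing the per-slot dot products is an assumed combination rule."""
    z = sum(float(r1[slot] @ r2[slot]) for slot in ("S", "Vl", "Vr", "O"))
    return g_theta(z, sigma, beta)

unit = lambda v: v / np.linalg.norm(v)  # l2-normalization, as assumed above
rng = np.random.default_rng(0)
r1 = {slot: unit(rng.normal(size=8)) for slot in ("S", "Vl", "Vr", "O")}
print(sim_theta(r1, r1))  # identical relations: z = 4, similarity near 1
```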
Given the input relation $r: e_l \to e_r$, we test whether the relation is correct or wrong as follows. Let $n_{pos}$ and $n_{neg}$ denote the number of positive and negative training samples, respectively. First, we compare $r$ to all positive and negative training relations in the training data set, and denote the resulting similarity vectors as $u^{pos} \in \mathbb{R}^{n_{pos}}$ and $u^{neg} \in \mathbb{R}^{n_{neg}}$, with

$$u_t^{pos} = \text{sim}_\theta(r, r_t^{pos}) \quad \text{and} \quad u_t^{neg} = \text{sim}_\theta(r, r_t^{neg})$$

where $r_t^{pos}$ and $r_t^{neg}$ denote the $t$-th positive and negative training sample, respectively.
Next, we define the scores that $r$ is correct/wrong as the weighted averages of the relation similarities:

$$o^{pos} = \text{softmax}_\gamma(u^{pos})^\top u^{pos} \quad \text{and} \quad o^{neg} = \text{softmax}_\gamma(u^{neg})^\top u^{neg} \quad (2)$$

where $\text{softmax}_\gamma(u)$ returns a column vector with the $t$-th output defined as

$$\text{softmax}_\gamma(u)_t = \frac{e^{\gamma u_t}}{\sum_{t'} e^{\gamma u_{t'}}}$$

and $\gamma \in \mathbb{R}_{\geq 0}$ is a weighting parameter. Note that for $\gamma \to \infty$, $\text{softmax}_\gamma(u)^\top u = \max(u)$, and for $\gamma = 0$, $o$ is the average of $u$.
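A short numerical check of these two limiting cases, sketched with hypothetical similarity values:

```python
import numpy as np

def softmax_gamma(u: np.ndarray, gamma: float) -> np.ndarray:
    """The weighting vector of Equation (2); gamma controls its peakedness."""
    w = np.exp(gamma * u - np.max(gamma * u))  # shifted for numerical stability
    return w / w.sum()

u = np.array([0.2, 0.9, 0.4])
print(softmax_gamma(u, 0.0) @ u)    # gamma = 0: uniform weights, average = 0.5
print(softmax_gamma(u, 100.0) @ u)  # gamma -> infinity: approaches max(u) = 0.9
```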
Finally, we define the happens-before score for $r$ as follows:

$$\ell(e_l, e_r) = o^{pos}(e_l, e_r) - o^{neg}(e_l, e_r) \quad (3)$$

The score $\ell(e_l, e_r)$ can be considered as an unnormalized log probability that relation $r$ is a happens-before relation. The basic components of the network are illustrated in the figures.
For optimizing the parameters of our model, we minimize the rank margin loss:

$$L(r^{pos}, r^{neg}) = \max\{0,\, 1 - \ell(e_l, e_r^{pos}) + \ell(e_l, e_r^{neg})\} \quad (4)$$

where $r^{pos}: e_l \to e_r^{pos}$ and $r^{neg}: e_l \to e_r^{neg}$ are positive and negative samples from the held-out training data. All parameters of the model are trained using stochastic gradient descent (SGD). The word embeddings ($x_S$, $x_V$, and $x_O$) are kept fixed during training.
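A direct transcription of Equation (4) as a sketch, with hypothetical score values showing when the loss vanishes:

```python
def rank_margin_loss(l_pos: float, l_neg: float, margin: float = 1.0) -> float:
    """Equation (4): hinge loss on the gap between the happens-before scores
    of the positive relation and of the negative relation."""
    return max(0.0, margin - l_pos + l_neg)

# Zero loss once the positive sample outscores the negative one by the margin:
print(rank_margin_loss(l_pos=1.5, l_neg=0.2))  # 0.0: gap of 1.3 exceeds margin
print(rank_margin_loss(l_pos=0.3, l_neg=0.2))  # 0.9: gap of 0.1 inside margin
```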
Our model can be interpreted as an instance of a memory network where, notation-wise, $I(\cdot)$ corresponds to the word embedding lookup, $G(\cdot)$ saves all training samples into the memory, the $O(\cdot)$ function corresponds to $(o^{pos}, o^{neg})$, and the output of $R(\cdot)$ equals Equation (3).
The model of the present invention is similar to a memory-based reasoning system, with at least the following two differences. First, we use a trainable similarity measure (see Equation (1)) rather than a fixed distance measure. Second, we use the trainable Softmax rather than the max.
Since the model of the present invention uses analogy-based reasoning, it can easily identify supporting evidence for the output of our system. In an embodiment, the supporting evidence can denote the training sample with the highest similarity $\text{sim}_\theta$ to the input. In the first example 801 and the second example 802, the input is a happens-before relation; in the third example 803 and the fourth example 804, the input is not a happens-before relation. It is to be appreciated that the preceding examples are merely illustrative and, thus, the present invention can be applied to many other applications and corresponding triples. The applications can include, for example, surveillance, e.g., (man, enter, private area), (private area, secure, lock), and so forth, as readily appreciated by one of ordinary skill in the art.
Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
Having described preferred embodiments of a system and method (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
This application claims priority to provisional application Ser. No. 62/428,069, filed on Nov. 30, 2016, incorporated herein by reference.