This disclosure generally relates to machine learning and, more specifically, to scanpath generation.
Natural language processing (NLP) models have a variety of applications, including sentiment analysis, text classification, and others. Digital traces of human cognitive processing can provide valuable signals for training NLP models. Eye tracking, which records the movement of the human eye during reading, is one example of such a signal. For instance, a data set of eye tracking data recorded over a well-known corpus can augment a machine learning model as an auxiliary task in multi-task learning or be used as an embedded representation input to a neural network.
While recorded eye tracking data has been shown to improve some NLP models, the real-world application of these methods remains limited. The equipment required to record eye tracking data is expensive. Additionally, the equipment is cumbersome and requires significant manual labor to set up and use under controlled conditions. Where eye tracking data can be obtained, for instance, through a web cam or similar commercial device, users might have significant privacy concerns about such recordings being made.
Some embodiments described herein relate to a training module comprising a scanpath generation model training system. The training module may be used to generate a scanpath generation model. The training module may comprise an adversarial training neural network. Using training data, which includes a text input and a recorded scanpath corresponding to the text input, the adversarial training neural network may be trained to generate a scanpath generation model. A scanpath may comprise a sequence of words and a corresponding sequence of fixation durations, wherein the sequence of words comprises one or more words of the text input. The training module may then output the trained scanpath generation model.
In some embodiments, the adversarial training neural network may include a conditional generator. The conditional generator may receive the text input from the training data. The conditional generator can transform the text input into a generated scanpath. A discriminator also receives the text input as well as the generated scanpath, or alternatively, a recorded scanpath from the training data. The discriminator determines a first probability that the generated scanpath is a recorded scanpath and a second probability that the recorded scanpath is a recorded scanpath. The conditional generator is trained using the first probability, the second probability, the recorded scanpath, and the generated scanpath. The discriminator is trained using the first probability and the second probability.
In some embodiments, generating the scanpath generation model may include transforming the text input into a dense text representation using a pre-trained neural network. The dense text representation may be used to condition the conditional generator and the discriminator. In some embodiments, generating the scanpath generation model may include transforming the dense text representation of the text input into a reconstruction of the text input. In such embodiments, the dense text representation and the reconstruction of the text input may be used to train the conditional generator. In some embodiments, the generated scanpath may include an end-of-sequence probability.
In some embodiments, the trained scanpath generation model may be used to augment one or more natural language processing models. The natural language processing models may also be trained using scanpaths generated by the trained scanpath generation model. In some embodiments, the training data may include feedback from one or more client devices utilizing the trained scanpath generation model. The scanpath generation model may be further trained based on the feedback from one or more client devices.
These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings.
As described above, obtaining recorded eye tracking data for use in augmenting and improving natural language processing (NLP) models can be impractical due to cost, effort, and privacy concerns. However, corpora of eye tracking data obtained from academic or other research settings may be available for developing machine learning models. Some embodiments described herein use a training module, which is or includes a machine-learning model trained to generate scanpaths that mirror those that would be derived from an actual recording of the human eye reading a given text. Generated scanpaths can then be used, for example, to train NLP models at significantly reduced cost and effort, and without the privacy concerns noted above. The training module may first access training data to train the machine-learning model. For example, the training data may include text inputs obtained from research corpora and recorded scanpaths derived from eye tracking data recorded while humans read those text inputs. The training module may include an adversarial training neural network, such as a generative adversarial network (GAN). The adversarial training neural network may be trained, using the training data, to generate a trained scanpath generation model. The training module may then output the trained scanpath generation model, which can then be used to generate scanpaths for various applications.
The following non-limiting example is provided to introduce certain embodiments. In this example, a training module is incorporated in a scanpath generation model training system. The training module can generate a scanpath generation model, which may then be used to generate realistic scanpaths for various applications.
In this example, the training module includes a GAN, which comprises two competing neural networks: a conditional generator and a discriminator. The conditional generator generates scanpaths given a text input. The output of the conditional generator is input to the discriminator, along with the text input. Alternatively, the discriminator may receive the recorded scanpath data for the same text input. The discriminator distinguishes recorded human scanpaths from the generated scanpaths. The outcome of the discriminator's determination is used to train both the discriminator and the conditional generator. The discriminator is penalized for incorrect determinations and the conditional generator is rewarded for generating a scanpath that the discriminator cannot correctly distinguish from the recorded one. Thus, in a zero-sum fashion, the conditional generator and the discriminator are adversarially trained together to train the scanpath generation model.
Certain embodiments described herein represent improvements in the technical fields of machine learning, NLP, and saliency prediction. Once the scanpath generation model is trained, it can be used as a pre-trained model for a variety of applications. For instance, the pre-trained model can be provided with a text input and configured to output a generated scanpath, which can be used to improve and evaluate NLP models. Some example NLP models suffer from an accuracy gap, wherein the model is inaccurate because it is not trained on domain-specific training data. In these examples, the accuracy gap can be narrowed with the addition of the cognitive signals provided by generated scanpaths.
Another example application includes saliency prediction. The pre-trained model can be provided with a text input and configured to output a generated scanpath which can be used as feedback to adjust the text input to achieve various optimizations. For instance, the generated scanpath can be used to select the words and word ordering of text to help ensure that the reader's fixation duration is maximized on certain key words and phrases. In some examples, saliency analysis is applied to text including content intended to capture a reader's attention and interest. Examples of such content may include emergency alerts, advertisements, or time-sensitive messages, among other possibilities. A reader may form an opinion in the first few seconds of viewing content. Therefore, which words the reader reads in those first few seconds may be of significant importance to the authors of the text. In some examples, the wording of the content may confuse readers or cause them to ignore the text altogether. In both examples, a generated scanpath associated with the text may be used to optimize the impact of the content and better achieve the goal of capturing the reader's attention and interest.
In some examples, saliency analysis may be used to optimize text in other contexts. For instance, an author may begin with an idea and then apply saliency analysis to select the wording to communicate the idea. In another example, an author may create a piece of text, and then submit a portion of the text for saliency analysis, to receive recommendations for how the text may be optimized to achieve a particular goal.
In some examples, intent-aware scanpath generation can be implemented in concert with downstream NLP tasks. For example, the performance of an NLP model that incorporates generated scanpath data may be determined. The performance can be used to determine a gradient that can be fed back to the conditional generator. Conditioning scanpath generation on NLP tasks may bias the conditional generator towards words that are relevant for the particular NLP task and could therefore boost the performance of the downstream NLP task.
As used herein, the term “scanpath” refers to a sequence representing the eye tracking data of a human eye reading a given text input. A scanpath can be either recorded or generated, for example, by a machine learning model. A recorded scanpath may also be called ground truth eye tracking data. For a text input comprising one or more words arranged in an ordered sequence, a scanpath comprises an ordered sequence of zero or more ordered pairs, each pair comprising a word, or fixation point, and the fixation duration for that word. The fixation duration is the amount of time the human eye remains fixed on a given word. The scanpath sequence need not be in the same order as the words comprising the text input, nor is every word from the text input necessarily contained in the scanpath.
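For illustration only, the following sketch shows one way such a scanpath could be represented in code; the class and field names are hypothetical and not part of any particular embodiment.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Scanpath:
    """An ordered sequence of (fixation point, fixation duration) pairs.

    Each fixation point is the index of a word in the text input; the duration is
    the time the eye rests on that word (e.g., in milliseconds). Fixations may
    revisit or skip words, so the sequence need not follow the text's word order.
    """
    text: List[str]                      # the text input as an ordered list of words
    fixations: List[Tuple[int, float]]   # (word index, fixation duration)

# Example: the reader fixates "nice" twice and skips "a" entirely.
scanpath = Scanpath(
    text=["have", "a", "nice", "day"],
    fixations=[(0, 180.0), (2, 240.0), (3, 150.0), (2, 90.0)],
)
```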
As used herein, the term “eye tracking” refers to the conversion of the physical movements and positioning of the human eye into a quantifiable data set. In some instances, the eye tracking data is extracted from video images captured of the human eye while reading a given text string.
As used herein, the term “scanpath generation model” refers to a set of software instructions that implement a machine learning model. For example, the scanpath generation model may be a pre-trained neural network that generates scanpaths given a text input. In another example, the scanpath generation model may be a component of an adversarial training neural network.
As used herein, the term “scanpath generation model training system” refers to a computer system configured to be used by a human user or an automated user to generate scanpath generation models for the purpose of, for example, augmenting natural language processing models or saliency analysis. In some embodiments, a scanpath generation model training system is implemented as one or more computing devices running program code to cause a processing unit to execute machine learning algorithms, access data, or perform other tasks on datasets.
As used herein, the term “training module” refers to a computer-implemented component configured to execute one or more neural networks to implement a machine learning algorithm. In some embodiments, a training module is implemented as software instructions, which, when executed by a processing unit, causes the processing unit to train one or more neural networks.
As used herein, the term “natural language processing” refers to the discipline concerned with the ability of computers to process, analyze, and model large amounts of natural language data. Results of such models can provide a basis for a spectrum of useful applications including searching, machine translation, summarization, paraphrasing, sentiment analysis, text classification, keyword extraction, automatic speech recognition, named entity recognition, paraphrase detection, part of speech tagging, and text difficulty classification, among others.
As used herein, the term “neural network” refers to a set of software instructions which comprise a machine learning model. The machine learning model includes a collection of interconnected neurons which can both receive inputs from and transmit outputs to other neurons and be trained via machine learning algorithms and deep learning methods to accomplish certain high-level tasks, like generation and classification, among other tasks. The term “neural networks” can include feedforward networks, recurrent neural networks, bi-directional neural networks, convolutional neural networks, long short-term memory (LSTM) networks, and bi-directional LSTMs (BiLSTM), among other variations.
As used herein, the term “generative adversarial network” or “GAN” refers to a set of software instructions which comprise a machine learning model, in particular, two or more neural networks which are adversarially trained. In some embodiments, GANs can be trained in an unsupervised manner by allowing the two or more neural networks to compete in a zero-sum game. A GAN may comprise a generator network and a discriminator network. The generator network is rewarded, through a machine learning training algorithm, for producing model output candidates that the discriminator cannot distinguish from ground-truth training data.
As used herein, the term “conditional generator” refers to a set of software instructions comprising a component of a generative adversarial network. Where a generator may model candidate values by sampling random data, in some embodiments, a conditional generator may model candidate values given some additional information. For example, a conditional generator may generate candidate values given labeled ground-truth training data as input.
As used herein, the term “transformer” refers to a set of software instructions comprising a neural network. In some embodiments, a transformer includes a component that includes a self-attention mechanism. In some examples, the self-attention mechanism provides contextual information about portions of the machine learning model to other portions of the model during training. A transformer may include an encoder component. In some embodiments, an encoder may process a set of input vectors representing a sequence. A transformer may include a decoder component. In some embodiments, a decoder may receive input from an encoder and process the input to produce an output sequence.
The training data 106 may be processed prior to training, for example, to ensure that the scanpath generation model 110 is not unduly affected by outlier data points. For example, fixation durations in a recorded scanpath 118 above a certain threshold can be considered outliers and removed from the training data 106. In some embodiments, fixation durations above the ninety-ninth percentile of all fixation durations are removed from the training data 106. Likewise, recorded scanpaths longer than a certain threshold can be trimmed, both to eliminate outliers and to ensure that each text input 120 string contains the same number of fixation points (e.g., words), so that the inputs to the language encoding neural networks used in the conditional generator 108 and the discriminator 112 have uniform dimension, as discussed below. In some embodiments, the maximum recorded scanpath length can be set to the ninety-ninth percentile, which corresponds to 80 fixation points (words). A recorded scanpath 118 shorter than the maximum scanpath length may be padded and a recorded scanpath 118 longer than the maximum scanpath length may be trimmed.
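As a rough sketch of the preprocessing described above, the snippet below removes outlier fixation durations and pads or trims scanpaths to a fixed length; the function name, the zero padding value, and the way the duration cap is supplied are illustrative assumptions rather than requirements.

```python
import numpy as np

MAX_LEN = 80  # ninety-ninth-percentile scanpath length, as discussed above

def preprocess_scanpath(word_ids, durations, duration_cap):
    """Drop outlier fixations, then pad or trim the scanpath to MAX_LEN."""
    # Remove fixations whose duration exceeds the outlier cap
    # (e.g., the 99th percentile of all durations in the corpus).
    kept = [(w, d) for w, d in zip(word_ids, durations) if d <= duration_cap]
    words = [w for w, _ in kept][:MAX_LEN]   # trim scanpaths that are too long
    durs = [d for _, d in kept][:MAX_LEN]
    pad = MAX_LEN - len(words)               # pad scanpaths that are too short
    return (np.array(words + [0] * pad),
            np.array(durs + [0.0] * pad))

# The cap would typically be computed once over the whole corpus, e.g.:
# duration_cap = np.percentile(all_durations_in_corpus, 99)
```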
Both the conditional generator 108 and the discriminator 112 may receive a text input 120 associated with a recorded scanpath 118 or as a prompt for generating a generated scanpath 116. The text input 120 may be converted into dense text representations 202, 224, or embedded representations, using a pre-trained language encoding neural network. The dense text representation 202 along with the text input 120 may be used for conditioning the conditional generator 108. The conditional generator 108 is conditional in the sense that its output is conditional on labeled input, in contrast to random noise, which may be input to a conventional generator. For instance, the dense text representations 202, 224 may be generated using a pre-trained language encoding neural network such as a Bidirectional Encoder Representations from Transformers (BERT) model (not shown in the figure). The BERT model can receive a text input 120 which comprises one or two sentences. The BERT model may tokenize the text input 120 into a plurality of tokens. Some tokens may be portions of words or individual characters comprising words of the text input 120. The tokenized text input may include a classification (CLS) token and a separation (SEP) token. The CLS token may be appended to the beginning of the token sequence. The SEP token may be placed at the end of the first and/or second sentence of the text input 120. In an example, the BERT model then encodes each word in the text input 120 into at least one 768-dimensional vector. The encoding also includes a 768-dimensional CLS vector, derived from the CLS token, which may be used as input for classification tasks. The encoding may also contain one or two 768-dimensional SEP vectors, derived from the SEP tokens, which may be used for predicting the end of the first or second sentence or the beginning of the second sentence. The result of providing a text input 120 to the BERT model is thus a plurality of 768-dimensional vectors encoding the text input 120, which may be used as a dense, embedded input to a neural network. As discussed above, the expected length of the text input 120 is 80 words, which may include one or more padding tokens and the CLS and SEP tokens. Therefore, in some embodiments, the output of the BERT model is an 80×768-dimensional tensor.
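A minimal sketch of producing such a dense text representation, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the fixed length of 80 follows the discussion above, and the function name is illustrative.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def dense_text_representation(text: str) -> torch.Tensor:
    """Encode a text input into an 80 x 768 dense representation."""
    tokens = tokenizer(
        text,
        padding="max_length",   # pad short inputs up to the fixed length
        truncation=True,
        max_length=80,          # CLS + word/subword tokens + SEP + padding
        return_tensors="pt",
    )
    with torch.no_grad():
        output = bert(**tokens)
    return output.last_hidden_state[0]   # shape (80, 768); row 0 is the CLS vector

representation = dense_text_representation("have a nice day")
print(representation.shape)  # torch.Size([80, 768])
```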
The dense text representation 202 of the text input 120 may be concatenated with random noise 204 to ensure that the output of the conditional generator 108 is non-deterministic. The random noise 204 may be sampled from a Gaussian noise distribution, but other distributions are possible. In some examples, random noise 204 is concatenated with the dense text representation 202 of the text input 120 to account for the fact that different individuals may read a given text input 120 in different ways.
In addition to encoding the text input 120 as a dense text representation 202, positional encoding 206 may be added to the dense text representation 202 so that the relative position of each word, or the distance between different words in the text input 120, is encoded in the data used to train the conditional generator 108. Sinusoidal positional encoding 206 may be applied over the dense text representation 202 that is provided to the conditional generator 108. Other positional encoding 206 schemes may also be used.
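The following sketch illustrates the noise concatenation and sinusoidal positional encoding described above; the per-token noise dimension of 8 is an assumption chosen so that 768 + 8 matches the 776-dimensional hidden size mentioned below, and the function names are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding (sine on even dims, cosine on odd dims)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def generator_conditioning(dense_text: torch.Tensor, noise_dim: int = 8) -> torch.Tensor:
    """Concatenate Gaussian noise to the dense text representation and add positions."""
    seq_len, _ = dense_text.shape                  # e.g., (80, 768)
    noise = torch.randn(seq_len, noise_dim)        # Gaussian noise for non-determinism
    x = torch.cat([dense_text, noise], dim=-1)     # e.g., (80, 776)
    return x + sinusoidal_positional_encoding(seq_len, x.shape[-1])
```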
The conditional generator 108 may comprise a transformer network which itself comprises an encoder 208 and a decoder 210. However, embodiments are not limited to transformer configurations. Other types of encoder/decoder frameworks could be used. The encoder 208 may receive the dense text representation 202 of the text input 120 combined with the positional encoding 206. The encoder 208 may comprise one or more layers, each layer comprising a self-attention layer and a feed-forward layer. The self-attention layer may contain one or more heads. Multiple heads in the self-attention layer allow the attention mechanism to model different portions of the text input in parallel. Through a self-attention mechanism, each encoder layer generates encodings that contain information about which parts of the text input 120 are relevant to each other. The encoder 208 may have any number of layers and must have a hidden dimension of at least 768, conforming to the cardinality of the input embedded vector. The hidden dimension size may be any integer, at or above that minimum, that is divisible by the number of attention heads. In some embodiments, a 3-layer encoder with four attention heads and a hidden dimension size of 776 followed by a feed-forward network is used, but other configurations are possible. The hidden dimension size of 776 corresponds to the dimension of the concatenation of the 768 dimensions of the BERT-encoded vector and the random noise 204.
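For illustration, the 3-layer, 4-head encoder with a 776-dimensional hidden size described above could be assembled with PyTorch's built-in transformer modules as follows; the feed-forward width of 1024 is an arbitrary assumption.

```python
import torch
from torch import nn

# 776 = 768 (BERT embedding) + 8 (noise); 776 is divisible by the 4 attention heads.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=776, nhead=4, dim_feedforward=1024, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

x = torch.randn(16, 80, 776)   # (batch, sequence length, hidden dimension)
memory = encoder(x)            # (16, 80, 776) contextualized representations
```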
The decoder 210 may receive the output of the encoder 208. The decoder 210 may comprise one or more task-specific feed-forward neural networks. The decoder 210 may use contextual information incorporated by the encoder 208 to generate an output sequence. In some embodiments, the output sequence may be a generated scanpath 116. In other embodiments, the output sequence may be a 768-dimensional reconstruction of the CLS token embedding 214 of the text input 120. The CLS token embedding may correspond to a global representation of the text input 120. In certain embodiments, the reconstruction of the CLS token embedding 214 may be adopted as an auxiliary task in order to boost model performance.
The conditional generator 108 may be a multi-task network with at least two tasks. The first task may include one branch of the task-specific feed-forward networks generating a generated scanpath 116. In a pre-trained scanpath generation model 110, this task may be used to generate scanpaths. The generated scanpath 116 can be output as a temporal sequence of word IDs 216, fixation durations 218, and end-of-sequence (EOS) probabilities 220. The second task may include a second branch of the task-specific feed-forward networks, which may include generating a 768-dimensional reconstruction of the CLS token embedding 214 of the text input 120. Both tasks have associated loss functions that are input to the overall GAN 104 loss function, as discussed in detail below.
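The sketch below shows one way the two task-specific branches could be arranged on top of the encoder output; the layer widths, the use of mean pooling for the CLS reconstruction branch, and the class name are illustrative assumptions.

```python
import torch
from torch import nn

class ScanpathDecoder(nn.Module):
    """Two task-specific feed-forward branches over the encoder output."""

    def __init__(self, hidden_dim: int = 776):
        super().__init__()
        # Task 1: per-position word ID, fixation duration, and EOS probability.
        self.scanpath_head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 3)
        )
        # Task 2: reconstruct the 768-dimensional CLS embedding of the text input.
        self.cls_head = nn.Linear(hidden_dim, 768)

    def forward(self, memory: torch.Tensor):
        out = self.scanpath_head(memory)               # (batch, 80, 3)
        word_ids = out[..., 0]                         # predicted fixation positions
        durations = out[..., 1]                        # predicted fixation durations
        eos_probs = torch.sigmoid(out[..., 2])         # end-of-sequence probabilities
        cls_recon = self.cls_head(memory.mean(dim=1))  # (batch, 768)
        return word_ids, durations, eos_probs, cls_recon
```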
The discriminator 112 may comprise one or more neural networks. The goal of the discriminator 112 is to distinguish between recorded and generated scanpaths. The outcome of that classification 114 can be used to train both the discriminator 112 itself and the GAN 104 as a whole, by propagating feedback through both the discriminator 112 and the conditional generator 108.
During training, the discriminator 112 may receive a text input 120 associated with a recorded scanpath 118 or a generated scanpath 116. The associated recorded scanpath 118 or a generated scanpath 116 may be offered to the discriminator 112 as a scanpath input 222. The text input 120 can be converted into a dense text representation 224 using a pre-trained language encoding neural network. For instance, the pre-trained language encoding neural network may include a BERT model, as described above. As with the conditional generator 108, the result of providing a text input 120 to the BERT model is thus a plurality of 768-dimensional vectors encoding the text input 120 which may be used as a dense, embedded input to a neural network.
The discriminator 112 may comprise two branches of Bi-directional Long Short-Term Memory (BiLSTM) neural networks 226, 228 that perform sequential modeling over the scanpath input 222 and text input 120 embeddings. The BiLSTMs 226, 228 may be followed by normalization layers 230, 232 for faster model convergence. For instance, in some embodiments, the Batch Norm algorithm may be implemented in normalization layers 230, 232. In some examples, the BiLSTMs 226, 228 can have a hidden size of 64 and a dropout ratio of 0.3, but other configurations are possible. During training, the first branch may receive a scanpath input 222 comprising either a recorded scanpath 118 from the training data 106 or a generated scanpath 116 from the conditional generator 108. The second branch may receive the text input 120 associated with the scanpath input 222 received by the first branch. The outputs of the two branches can be combined and passed to a multi-headed attention fusion network 234, followed by a BiLSTM 236. In some embodiments, the multi-headed attention fusion network 234 may have 4 heads, but other configurations are possible. The hidden states of the last layer of the BiLSTM 236 from both forward and backward directions may be concatenated and supplied to a feed-forward network. The output of the feed-forward network may be activated by the Sigmoid function (not shown), resulting in a probability that can be used for classification. In this example, the probability may be used to determine whether the scanpath input 222 received by the first branch was a recorded scanpath 118 or a generated scanpath 116. This classification by the discriminator 112 may be used to train both the discriminator 112 itself and the GAN 104 as a whole, by propagating feedback through both the discriminator 112 and the conditional generator 108.
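A condensed sketch of the discriminator described above, assuming a 3-feature scanpath input (word ID, fixation duration, EOS probability), two-layer BiLSTM branches so the 0.3 dropout applies between layers, and layer normalization in place of batch normalization; the class name and exact dimensions are illustrative.

```python
import torch
from torch import nn

class ScanpathDiscriminator(nn.Module):
    """Two BiLSTM branches, attention fusion, a final BiLSTM, and a sigmoid output."""

    def __init__(self, scanpath_dim: int = 3, text_dim: int = 768, hidden: int = 64):
        super().__init__()
        self.scanpath_lstm = nn.LSTM(scanpath_dim, hidden, num_layers=2, dropout=0.3,
                                     bidirectional=True, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden, num_layers=2, dropout=0.3,
                                 bidirectional=True, batch_first=True)
        self.norm_scanpath = nn.LayerNorm(2 * hidden)
        self.norm_text = nn.LayerNorm(2 * hidden)
        self.fusion = nn.MultiheadAttention(embed_dim=4 * hidden, num_heads=4,
                                            batch_first=True)
        self.final_lstm = nn.LSTM(4 * hidden, hidden, bidirectional=True,
                                  batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, scanpath: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        sp, _ = self.scanpath_lstm(scanpath)           # (batch, 80, 128)
        tx, _ = self.text_lstm(text)                   # (batch, 80, 128)
        combined = torch.cat([self.norm_scanpath(sp), self.norm_text(tx)], dim=-1)
        fused, _ = self.fusion(combined, combined, combined)   # attention fusion
        _, (h_n, _) = self.final_lstm(fused)
        # Concatenate the last layer's forward and backward hidden states.
        features = torch.cat([h_n[-2], h_n[-1]], dim=-1)       # (batch, 128)
        return torch.sigmoid(self.classifier(features))        # P(scanpath is recorded)
```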
In block 302, the training module 102 accesses training data 106 from a memory device. The memory device could be a local storage device included in or accessible to the scanpath generation model training system 100, or an external source, such as a cloud storage provider or other network location. The training data 106 includes a plurality of text inputs and a recorded scanpath 118 for each text input 120. These recorded scanpaths are the ground-truth training data used for building the scanpath generation model 110. The recorded scanpaths may be obtained from a publicly available eye tracking corpus including, for instance, the Corpus of Eye Movements in L1 and L2 English Reading (CELER) dataset.
In block 304, the training module 102, comprising a GAN 104, generates a trained scanpath generation model 110 through adversarial training. Block 304 will be discussed in detail in
In block 306, the trained scanpath generation model 110 is output. For instance, the trained scanpath generation model 110 can be used as input to improve or augment NLP modeling or to implement a saliency prediction system. The trained scanpath generation model 110 may be a trained neural network but may also be the set of parameters defining a model developed by training a neural network, implemented separately. In the example GAN 104 depicted in
In some examples, the training data 106 may further comprise feedback from one or more client devices that utilize the trained scanpath generation model 110. For example, a client device may include an application that displays generated scanpaths for saliency analysis or incorporates generated scanpaths into an NLP model. The client device may receive user feedback or automated, algorithmic feedback based on the scanpaths generated by the scanpath generation model 110 included in the client device. The training module 102 may receive feedback from the client device and further train the scanpath generation model 110 based on the feedback from the client device. The further training may be offline or online training. Offline training may include training with a static training data 106 set. For example, the feedback may be added to the training data 106. Online training may include training of a machine learning model that occurs while the model is in use, such that the model is constantly updated. For example, the feedback may be used to improve the performance of a trained scanpath generation model 110 while the model is in use. Further training of the scanpath generation model 110 may include additional components. For example, one or more additional neural networks may be used in training module 102 to incorporate the feedback.
In some examples, the algorithmic feedback may be used to implement intent-aware scanpath generation. For example, the performance of an NLP model with generated scanpath data incorporated may be determined. The performance can be used to determine a gradient that can be fed back to the conditional generator 108. Conditioning scanpath generation on NLP tasks may bias the conditional generator 108 towards words that are relevant for the particular NLP task and could therefore boost the performance of the downstream NLP task.
In block 402, a conditional generator 108 receives a text input 120. In some embodiments, the text input 120 is an element of the training data 106. However, if the conditional generator 108 is already trained, it may receive arbitrary text inputs in order to generate scanpaths. The conditional generator 108 may encode the text input 120 into a dense text representation 202. For instance, a pre-trained language encoding neural network such as a BERT model may be used to encode the text input 120. The dense text representation 202 may be concatenated with random noise 204 to ensure non-deterministic behavior of the conditional generator 108. In some examples, the random noise 204 may be sampled from a Gaussian distribution, but other distributions may be used. The framework may include the addition of positional encoding 206 to the dense text representation 202. In some embodiments, the length of the text input 120 is limited according to the constraints of the dense text representation 202 used by the conditional generator 108, as discussed above.
In block 404, the text input is transformed by the conditional generator 108 into a generated scanpath 116. The transformation may be realized by a transformer-based encoder-decoder framework. The conditional generator 108 may be a multi-task network with at least two tasks. The first task may include generation of a generated scanpath 116. This task is the source of the trained scanpath generation model 110, once the scanpath generation model 110 has been adversarially trained. The generated scanpath 116 is output as a temporal sequence of word IDs 216, fixation durations 218, and EOS probabilities 220. A visualization of a generated scanpath 116 is depicted in
Training of the scanpath generation task tries to minimize the deviation of the generated scanpath 116 from the ground-truth recorded scanpath 118 for a given text input 120. This deviation over a text input of k words for a recorded scanpath r and a generated scanpath g is measured using the generated scanpath loss function L_s:

L_s(r, g) = \alpha \cdot \frac{1}{k}\sum_{i=1}^{k}(w_i^r - w_i^g)^2 + \beta \cdot \frac{1}{k}\sum_{i=1}^{k}(d_i^r - d_i^g)^2 + \gamma \cdot \frac{1}{k}\sum_{i=1}^{k}(e_i^r - e_i^g)^2

where w_i, d_i, and e_i denote the fixation position (word), fixation duration, and end-of-sequence probability of the i-th element of a scanpath. The above equation captures the deviation L_s between generated scanpaths, which are a function of the text T concatenated with Gaussian noise N in order to ensure that the generator's output is non-deterministic, and recorded scanpaths included in the training data 106, which are made by recording a human h reading the text T. The first term measures the error between real and predicted fixation points (words), given by the mean squared difference between the generated and recorded word positions. In other words, this term captures the difference in position of the word ordering between each generated scanpath 116 and recorded scanpath 118 pair. For instance, if the text input 120 is “have a nice day,” the recorded scanpath 118 might contain the sequence {“have”, “a”, “nice”, “day”} where the generated scanpath 116 results in the sequence {“a”, “nice”, “have”, “day”}. In this example, the difference in position for the word “have” is 2, contributing a squared error of 4 to that term. The second term measures the mean squared difference between generated fixation durations 218 and recorded fixation durations. The fixation durations 218 correspond to the amount of time the human eye rests (or is predicted to rest) on a given word. Finally, the third term measures the mean squared difference between the generated end-of-sequence (EOS) probabilities 220 and recorded EOS probabilities. The EOS probabilities 220 correspond to the probability that the associated word IDs 216 are the end of the temporal scanpath sequence. The EOS probabilities 220 may be used to determine the length of the generated scanpath. For instance, an EOS probability above a pre-set threshold can be used to estimate the length of a given generated scanpath. The three terms are weighted by hyperparameters α, β, and γ, which can be tuned during training to optimize the learning process.
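A sketch of the loss above, assuming the generated and recorded scanpaths are already aligned as fixed-length tensors of word positions, durations, and EOS probabilities; the equal default weights are placeholders for tuned hyperparameters.

```python
import torch
import torch.nn.functional as F

def scanpath_loss(gen_words, gen_durations, gen_eos,
                  rec_words, rec_durations, rec_eos,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted mean squared deviation between a generated and a recorded scanpath."""
    return (alpha * F.mse_loss(gen_words, rec_words)            # fixation positions
            + beta * F.mse_loss(gen_durations, rec_durations)   # fixation durations
            + gamma * F.mse_loss(gen_eos, rec_eos))             # EOS probabilities
```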
The reconstruction of the CLS token embedding 214 task may also be used to train the GAN 104. Because scanpaths depend heavily on the linguistic properties of the text input 120, a reconstruction loss may be used to guide the conditional generator towards probable data manifolds. The reconstruction loss measures the mean-squared error between the reconstruction of the CLS token embedding 214 and the actual BERT-encoded CLS token of the text input 120 encoded by the dense text representation 202. In some examples, the reconstruction of the CLS token embedding 214 task represents how well the conditional generator 108 models the text input 120. Training of the reconstruction of the CLS token embedding 214 task tries to minimize the reconstruction loss L_r, given by:

L_r = \lVert \mathrm{BERT}(w_a^g, w_b^g, \ldots, w_n^g) - \mathrm{BERT}(w_a^r, w_b^r, \ldots, w_n^r) \rVert_2^2

The above equation captures the deviation L_r between the reconstructed CLS vector representations of the generated scanpaths and the recorded scanpaths, as defined previously. BERT(w_a^r, w_b^r, . . . , w_n^r) corresponds to the CLS vector representation of the recorded scanpaths. BERT(w_a^g, w_b^g, . . . , w_n^g) corresponds to the CLS vector representation of the generated scanpaths. Minimization of this loss function improves model performance, as observed empirically.
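Stated as code, under the same assumptions as the sketch above, the reconstruction term reduces to a mean squared error between two 768-dimensional vectors.

```python
import torch.nn.functional as F

def reconstruction_loss(cls_reconstruction, cls_target):
    """MSE between the generator's reconstructed CLS embedding and the BERT CLS vector."""
    return F.mse_loss(cls_reconstruction, cls_target)
```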
In block 406, the discriminator 112 receives the text input 120. The discriminator 112 may encode the text input 120 into a dense text representation 224. For instance, a pre-trained language encoding neural network such as a BERT model may be used to encode the text input 120. In some embodiments, the length of the text input 120 is limited according to the constraints of the dense text representation 202 used by the conditional generator 108, as discussed above.
In block 408, the discriminator 112 may receive the generated scanpath 116 generated in block 404. Alternatively, the discriminator 112 may receive the recorded scanpath 118 associated with the text input 120. The training module 102 may select a generated scanpath 116 or a recorded scanpath 118 randomly during training or according to another suitable training algorithm.
In block 410, the discriminator 112 generates a first probability that the generated scanpath 116 is a recorded scanpath 118. The discriminator 112 generates a second probability that the recorded scanpath 118 is a recorded scanpath 118. In other words, the discriminator 112 generates the probability that the scanpath received, whether generated or recorded, is recorded. The probability may be used to make a classification 114 as to whether the scanpath received was a generated scanpath 116 or a recorded scanpath 118. In some examples, the classification 114 may include a classification threshold, above which the scanpath is classified as a recorded scanpath 118 or below which the scanpath is classified as a generated scanpath 116. The classification threshold may be a hyperparameter that can be adjusted during training.
Training of the GAN 104 may include both training of the conditional generator 108 and training of the discriminator 112. In block 412, the conditional generator 108 is adversarially trained using the first probability, the second probability, the recorded scanpath 118, and the generated scanpath 116. In block 414, the discriminator 112 is adversarially trained using the first probability and the second probability.
In blocks 412 and 414, the GAN 104 may be adversarially trained using the classification 114 of the discriminator 112 of whether the input scanpath was a recorded scanpath 118 or a generated scanpath 116. The classification 114 of the discriminator 112 may be used to train both the discriminator 112 and the conditional generator 108. The discriminator 112 is penalized for incorrect determinations and the conditional generator 108 is rewarded for generating a generated scanpath 116 that the discriminator 112 cannot correctly distinguish from the recorded scanpath 118. Thus, in a zero-sum fashion, the conditional generator 108 and the discriminator 112 are adversarially trained together to train the scanpath generation model 110. The overall loss function for the GAN 104 is given by the “minimax” adversarial loss L_a:

\min_G \max_D L_a = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid T)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid T, N)))]
This equation illustrates the algorithm used during adversarial training. During adversarial training, training of the discriminator, D, operates to maximize the loss function, while training of the generator, G, operates to minimize it. In this equation, x represents recorded scanpath data, while z represents generated scanpath data. D(x | T) stands for the estimate, by the discriminator 112, of the probability that a given recorded scanpath x is a recorded scanpath 118, given text input T. The expectation over x ~ p_data(x) is taken over the set of all recorded scanpaths. G(z | T, N) represents the output of the conditional generator 108, a generated scanpath 116, given text input T concatenated with Gaussian noise N. D(G(z | T, N)) stands for the estimate, by the discriminator 112, of the probability that a given generated scanpath 116 is a recorded scanpath 118. The expectation over z ~ p_z(z) is taken over the set of all generated scanpaths for the log(1 - D(G(z | T, N))) term. The conditional generator 108 cannot directly affect the log D(x | T) term since it comprises only recorded scanpaths and the discriminator 112 output. Therefore, the conditional generator's loss function L_g can be given by:

L_g = L_s + L_r + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid T, N)))]
This equation combines the generated scanpath loss L_s and the reconstruction loss L_r with the second term of the minimax adversarial loss L_a described above to yield the conditional generator loss L_g.
In some examples, conditional generator 108 and discriminator 112 losses are minimized using the Adaptive Moment Estimation (Adam) and Root Mean Square Propagation (RMSProp) optimization algorithms, but other algorithms may be used as well. In some examples, hyperparameters may include a batch size of 128, a learning rate for the conditional generator 108 of 0.0001, a learning rate for the discriminator 112 of 0.00001, but hyperparameters can vary from one embodiment to the next. In some examples, the model is trained over 300 epochs, but this value is meant to be non-limiting.
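The following sketch of one training step combines the pieces above, assuming (for illustration) Adam for the conditional generator and RMSProp for the discriminator, the common non-saturating form of the generator's adversarial term, and schematic interfaces; generator, discriminator, loader, scanpath_loss, and reconstruction_loss are hypothetical names defined elsewhere.

```python
import torch

# Hypothetical components assumed to be defined elsewhere:
# generator, discriminator, loader, scanpath_loss, reconstruction_loss
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)         # generator learning rate
d_opt = torch.optim.RMSprop(discriminator.parameters(), lr=1e-5)  # discriminator learning rate
bce = torch.nn.BCELoss()

for epoch in range(300):
    for text_repr, real_scanpath in loader:                       # batches of 128
        fake_scanpath, cls_recon = generator(text_repr)

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
        d_opt.zero_grad()
        p_real = discriminator(real_scanpath, text_repr)
        p_fake = discriminator(fake_scanpath.detach(), text_repr)
        d_loss = (bce(p_real, torch.ones_like(p_real))
                  + bce(p_fake, torch.zeros_like(p_fake)))
        d_loss.backward()
        d_opt.step()

        # Generator step: adversarial term plus scanpath and reconstruction losses.
        g_opt.zero_grad()
        p_fake = discriminator(fake_scanpath, text_repr)
        g_loss = (bce(p_fake, torch.ones_like(p_fake))
                  + scanpath_loss(fake_scanpath, real_scanpath)
                  + reconstruction_loss(cls_recon, text_repr[:, 0]))  # row 0: CLS vector
        g_loss.backward()
        g_opt.step()
```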
Two additional examples of generated scanpaths 514, 518 and NLP models are given. In the second example generated scanpath 514, the NLP model is trained for sarcasm detection. The result 516 indicates the output of the sarcasm detection from the NLP model. In the third example generated scanpath 518, the NLP model is trained to paraphrase the text input 120. The result 520 indicates the output of paraphrasing the text input 120 by the NLP model. In all three examples, the example generated scanpaths 502, 514, 518 may be used to augment the training data provided to NLP models during training.
The application 600 allows for selection of a mode. In the content mode 606, a document 608 containing images and words may be identified by the user and displayed. The example application 600 can perform one or more analyses on the images and texts, which may include image saliency analysis and text saliency analysis. Image saliency analysis may be performed in the image saliency analysis mode 602. The text saliency analysis mode 604 is discussed below. In the content mode 606, the user may identify a portion of the text for text saliency analysis using a content optimizer window 612. In some examples, the application 600 may perform text saliency analysis on all text present in the document 608.
In
In
It should be stressed that this example application 600 is just one example of the various applications possible with a trained scanpath generation model 110 and is not in any way limiting. Other applications may use different user interfaces, including different visual presentations of the generated scanpaths. For instance, another example application may implement an NLP model.
The depicted example of a computing system 700 includes a processor 702 communicatively coupled to one or more memory devices 704. The processor 702 executes computer-executable program code stored in a memory device 704, accesses information stored in the memory device 704, or both. Examples of the processor 702 include a microprocessor, an application-specific integrated circuit (“ASIC”), a field-programmable gate array (“FPGA”), or any other suitable processing device. The processor 702 can include any number of processing devices, including a single processing device.
The memory device 704 includes any suitable non-transitory computer-readable medium for storing data, program code, or both. A computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, a memory chip, a ROM, a RAM, an ASIC, optical storage, magnetic tape or other magnetic storage, or any other medium from which a processing device can read instructions. The instructions may include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript.
The computing system 700 may also include a number of external or internal devices, such as input or output devices. For example, the computing system 700 is shown with one or more input/output (“I/O”) interfaces 708. An I/O interface 708 can receive input from input devices or provide output to output devices. One or more buses are also included in the computing system 700. The bus 706 communicatively couples one or more components of the computing system 700.
The computing system 700 executes program code that configures the processor 702 to perform one or more of the operations described herein. The program code may include, for example, the training module 102, the GAN 104, the scanpath generation model 110, other components of the scanpath generation model training system 100, or applications that perform one or more operations described herein. The program code may be resident in the memory device 704 or any suitable computer-readable medium and may be executed by the processor 702 or any other suitable processor.
The computing system 700 can access other models, datasets, or functions of the scanpath generation model training system 100 in any suitable manner. In some embodiments, some or all of one or more of these models, datasets, and functions are stored in the memory device 704 of a computing system 700, as in the example depicted in
The computing system 700 also includes a network interface device 710. The network interface device 710 includes any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks. Non-limiting examples of the network interface device 710 include an Ethernet network adapter, a modem, and the like. The computing system 700 is able to communicate with one or more other computing devices via a data network using the network interface device 710.
The computing system 700 also includes a display device 712. The display device 712 includes any device or group of devices suitable for viewing a user interface while the scanpath generation model 110 is trained. Other aspects of the scanpath generation model training system 100 may also be displayed on the display device 712. Examples of the display device 712 include a computer monitor, a laptop screen, a tablet screen, or a smartphone screen.
The computing system 700 also includes an input device 714. The input device 714 includes any device or group of devices suitable for operation of the scanpath generation model training system 100 according to output from the display device 712. Examples of the input device 714 include a keyboard, mouse, tablet screen, or smartphone screen.
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example and explanation rather than limitation, and does not preclude the inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.