A Computer Program Listing is included in an Appendix to the present specification. The Appendix is provided at the end of the specification and before the claims and includes the following files:
2.87 kb “tom.py.txt”
766 b “core_res_block.py.txt”
921 b “film_layer.py.txt”
778 b “film_res_block.py.txt”
911 b “inv_film_layer.py.txt”
The subject matter described herein, in general, relates to multi-task learning, and, in particular, relates to multi-task learning through spatial variable embeddings.
Natural organisms benefit from the fact that their sensory inputs and action outputs are all organized in the same space, that is, the physical universe. This consistency makes it easy to apply the same predictive functions across diverse settings. Deep multi-task learning (Deep MTL) has shown a similar ability to adapt knowledge across tasks whose observed variables are embedded in a shared space. Examples include vision, where the input for all tasks (photograph, drawing, or otherwise) is pixels arranged in a 2D plane; natural language, speech processing and genomics, which exploit the 1D structure of text, waveforms, and nucleotide sequences; and video game-playing, where interactions are organized across space and time. Yet, many real-world prediction tasks have no such spatial organization; their input and output variables are simply labeled values, e.g., the height of a tree, the cost of a haircut, or the score on a standardized test. To make matters worse, these sets of variables are often disjoint across a set of tasks.
These challenges have led the MTL community to avoid such tasks, despite the fact that general knowledge about how to make good predictions can arise from solving seemingly “unrelated” tasks. Table 1 highlights Deep MTL methods from the perspective of decomposition into encoders and decoders. In MTL, there are T tasks {(xt, yt)}t=1T that can, in general, be drawn from different domains and have varying input and output dimensionality. The tth task has nt input variables [xt1, . . . , xtnt] and mt output variables [yt1, . . . , ytmt].
The standard intra-domain approach is for all task models to share their encoder f, and for each to have its own task-specific decoder gt, as given in Table 1, (a). This setup was used in the original introduction of MTL, has been broadly explored in the linear regime, and remains the most common approach in Deep MTL. The main limitation of the intra-domain approach is that it applies only to sets of tasks drawn from the same domain. It also runs the risk that the separate decoders do so much of the learning that little is left to be shared, which is why the decoders are usually single affine layers.
To address the issue of limited sharing, the task embeddings approach as given in Table 1, (b) trains a single encoder f and single decoder g, with all task-specific parameters learned in embedding vectors zt that semantically characterize each task, and which are fed into the model as additional input. Such methods require that all tasks have the same input and output space, but are flexible in how the embeddings can be used to adapt the model to each task. As a result, they can learn tighter connections between tasks than separate decoders, and these relationships can be analyzed by looking at the learned embeddings.
Next, to exploit regularities across tasks from diverse and disjoint domains, cross-domain methods have been introduced. Existing methods address the challenge of disjoint output and input spaces by using separate decoders and encoders for each domain (Table 1, c), and thus they require some other mechanism for sharing model parameters across tasks, such as sharing some of their layers or drawing their parameters from a shared pool. For many datasets, the separate encoders and decoders absorb too much functionality to share optimally, and their complexity makes it difficult to analyze the relationships between tasks. Work prior to deep learning showed that, from an algorithmic learning theory perspective, sharing knowledge across tasks should always be useful, but the accompanying experiments were limited to learning biases in a decision tree generation process, i.e., the learned models themselves were not shared across tasks.
None of these methods provides an optimal multi-task encoder-decoder decomposition in the cross-domain setting. Against the background of the foregoing limitations, there exists a need for a solution that extends the notion of task embeddings so that the idea can be applied in the cross-domain setting.
In a first exemplary embodiment, a process, implemented in a computing environment, for training a single model across diverse tasks, includes: measuring tasks with disjoint input and output variable sets in a shared space; for each task, encoding by a function f a value of each observed variable xi given its shared space location zi; aggregating encodings by elementwise addition; and decoding by a function g the aggregated encodings to predict yj at its location zj, wherein zi and zj are variable embeddings.
In a second exemplary embodiment, at least one computer readable medium storing instructions that, when executed by a computer, perform a process for training a single model across diverse tasks, including: measuring tasks with disjoint input and output variable sets in a shared space; for each task, encoding by a function f a value of each observed variable xi given its shared space location zi; aggregating encodings by elementwise addition; and decoding by a function g the aggregated encodings to predict yj at its location zj, wherein zi and zj are variable embeddings.
In a third exemplary embodiment, a single universal prediction model trained across diverse tasks in a shared space with disjoint input and output variable sets, includes: an encoder, f, which is conditioned on vector zi, for generating an encoder output for each task variable xi given its location in the shared space; an aggregator for aggregating the encoder outputs; a core, g1, which is independent of output variable; and a decoder, g2, which is conditioned on vector zj, for generating a prediction yj given its location in the shared space.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
In describing the preferred and alternate embodiments of the present disclosure, specific terminology is employed for the sake of clarity. The disclosure, however, is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents that operate in a similar manner to accomplish similar functions. The disclosed embodiments are merely exemplary methods of the invention, which may be embodied in various forms.
Generally, the embodiments herein propose multi-task learning through spatial variable embeddings, wherein all variable locations in a shared space are learned, while simultaneously training the prediction model itself, as shown in
The input and output spaces of a prediction problem can be standardized so that the measured value of each input and output variable is a scalar. The prediction model can then be completely agnostic about the particular task for which it is making a prediction. By learning variable embeddings (VEs), i.e., the z's, the model can capture variable relationships explicitly and supports joint training of a single architecture across seemingly unrelated tasks with disjoint input and output spaces. The traveling observer model (TOM) thus establishes a new lower bound on the commonalities shared across real-world machine learning problems: They are all drawn from the same space of variables that humans can and do measure.
In accordance with one general embodiment of the present disclosure, the proposed solution develops a first implementation of TOM, using an encoder-decoder architecture, with variable embeddings incorporated using existing approaches. In one working embodiment, use of FiLM is proposed for incorporating variable embeddings. In the experiments, the implementation is shown to (1) recover the intuitive locations of variables in space and time, (2) exploit regularities across related datasets with disjoint input and output spaces, and (3) exploit regularities across seemingly unrelated tasks to outperform single-task models tuned to each task, as well as current Deep MTL alternatives. The results confirm that the proposed solution provides a promising framework for representing and exploiting the underlying processes of seemingly unrelated tasks.
As discussed further herein, TOM extends the notion of task embeddings to variable embeddings (VEs) in order to apply multi-task encoder decoder decomposition in the cross-domain setting. Accordingly, TOM embeds all input and output variables into a shared space as follows:
Consider the set of all scalar random variables that could possibly be measured {v1, v2, . . . }=V. Each vi∈V could be an input or output variable for some prediction task. To characterize each vi semantically, associate with it a vector zi∈ℝC that encodes the meaning of vi, e.g., “height of left ear of human adult in inches,” “answer to survey question 9 on a scale of 1 to 5”, “severity of heart disease”, “brightness of top-left pixel of photograph”, etc. This vector zi is called the variable embedding (VE) of vi. Variable embeddings could be handcoded, e.g., based on some featurization of the space of variables, but such a handcoding is usually unavailable, and would likely miss some of the underlying semantic regularities across variables. An alternative approach is to learn variable embeddings based on their utility in solving prediction problems of interest.
A prediction task (x, y)=([x1, . . . , xn], [y1, . . . , ym]) is defined by its set of observed variables {xi}i=1n⊆V and its set of target variables {yj}j=1m⊆V whose values are unknown. The goal is to find a prediction function Ω that can be applied across any prediction task of interest, so that it can learn to exploit regularities across such problems. Let zi and zj be the variable embeddings corresponding to xi and yj, respectively. Then, this universal prediction model is of the form
𝔼[yj|x]=Ω(x,{zi}i=1n,zj). (1)
Importantly, for any two tasks (xt, yt), (xt′, yt′), their prediction functions (Eq. 1) differ only in their z's, which enforces the constraint that functionality is otherwise completely shared across the models. One can view Ω as a traveling observer, who visits several locations in the C-dimensional variable space, takes measurements at those locations, and uses this information to make predictions of values at other locations.
To make Ω concrete, it must be a function that can be applied to any number of variables, can fit any set of prediction problems, and is invariant to variable ordering, since we cannot in general assume that a meaningful order exists. These requirements lead to the following decomposition:
𝔼[yj|x]=Ω(x,{zi}i=1n,zj)=g(Σi=1nf(xi,zi),zj), (2)
where f and g are functions called the encoder and decoder, with trainable parameters θf and θg, respectively. The variable embeddings z tell f and g which variables they are observing, and these z can be learned by gradient descent alongside θf and θg. A depiction of the model is shown in
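As a hedged illustration of this decomposition, the following is a minimal PyTorch sketch of Eq. 2, with generic multilayer perceptrons standing in for f and g rather than the FiLM-based architecture described below; the class, parameter, and dimension names are illustrative only, and the VEs are ordinary trainable tensors so that they are learned alongside θf and θg.

```python
import torch
import torch.nn as nn

class SimpleTOM(nn.Module):
    """Minimal sketch of Eq. 2: E[y_j|x] = g(sum_i f(x_i, z_i), z_j)."""
    def __init__(self, num_vars, ve_dim=2, hidden=128):
        super().__init__()
        # One learned variable embedding (VE) per variable in the universe.
        self.z = nn.Parameter(1e-3 * torch.randn(num_vars, ve_dim))
        # Encoder f maps (scalar value, VE) -> hidden code.
        self.f = nn.Sequential(nn.Linear(1 + ve_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden))
        # Decoder g maps (aggregated code, VE of target) -> scalar prediction.
        self.g = nn.Sequential(nn.Linear(hidden + ve_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, x_vals, in_idx, out_idx):
        # x_vals: (batch, n) observed values; in_idx/out_idx: variable indices.
        z_in = self.z[in_idx].unsqueeze(0).expand(x_vals.size(0), -1, -1)
        enc = self.f(torch.cat([x_vals.unsqueeze(-1), z_in], dim=-1))
        agg = enc.sum(dim=1)  # elementwise sum makes the model order-invariant
        z_out = self.z[out_idx].unsqueeze(0).expand(agg.size(0), -1)
        return self.g(torch.cat([agg, z_out], dim=-1)).squeeze(-1)
```

Because the per-variable encodings are combined by summation, the prediction is unchanged under any permutation of the observed variables, as required by the discussion that follows.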
The question remains: How can f and g be designed so that they can sufficiently capture a broad range of prediction behavior, and be effectively conditioned by variable embeddings? The next section introduces an experimental architecture that satisfies these requirements.
The encoder and decoder are conditioned on VEs via FiLM layers, which provide a flexible yet inexpensive way to adapt functionality to each variable, and have been previously used to incorporate task embeddings. For simplicity, the FiLM layers are based on affine transformations of VEs. Specifically, the ℓth FiLM layer is parameterized by affine layers Wℓγ and Wℓβ, and, given a variable embedding z, the hidden state h is modulated by
FiLMℓ(h)=Wℓγ(z)⊙h+Wℓβ(z), (3)
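A minimal PyTorch sketch of such a FiLM layer is given below; the module and weight names are illustrative assumptions and need not match the Appendix listings.

```python
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Sketch of Eq. 3: FiLM(h) = W_gamma(z) * h + W_beta(z)."""
    def __init__(self, ve_dim, hidden):
        super().__init__()
        self.w_gamma = nn.Linear(ve_dim, hidden)  # affine map producing the scale
        self.w_beta = nn.Linear(ve_dim, hidden)   # affine map producing the shift

    def forward(self, h, z):
        # h: (batch, hidden) hidden state; z: (batch, ve_dim) variable embedding
        return self.w_gamma(z) * h + self.w_beta(z)  # Hadamard product plus shift
```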
where ⊙ is the Hadamard product. A FiLM layer is located alongside each fully-connected layer in the encoder and decoder, both of which consist primarily of residual blocks. To avoid deleterious behavior of batch norm across diverse tasks and small datasets/batches, the recently proposed SkipInit described in De et al., Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks, arXiv:2002.10444v3, which is incorporated herein by reference in its entirety, is used as a replacement to stabilize training. SkipInit adds a trainable scalar α initialized to 0 at the end of each residual block, and uses dropout for regularization. Finally, for computational efficiency, the decoder is redecomposed into the Core, or g1, which is independent of output variable, and the Decoder proper, or g2, which is conditioned on the output variable. That way, generic transformations of the summed Encoder output can be learned by the Core and run in a single forward and backward pass each iteration. With this decomposition, Eq. 2 is rewritten as
𝔼[yj|x]=g2(g1(Σi=1nf(xi,zi)),zj). (4)
The complete architecture is depicted in
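Under assumed module names, the pieces described above (FiLM-modulated residual blocks with SkipInit, and the Core/Decoder split of Eq. 4) could fit together roughly as in the following sketch; it is a simplified reading of the architecture, not a transcription of the Appendix listings.

```python
import torch
import torch.nn as nn

class FiLMResBlock(nn.Module):
    """Residual block with a FiLM layer and a SkipInit scalar (alpha initialized to 0)."""
    def __init__(self, ve_dim, hidden, dropout=0.0):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.film = FiLMLayer(ve_dim, hidden)      # FiLM layer as sketched earlier
        self.drop = nn.Dropout(dropout)
        self.alpha = nn.Parameter(torch.zeros(1))  # SkipInit: block starts as the identity

    def forward(self, h, z):
        out = torch.relu(self.film(self.dense(h), z))
        return h + self.alpha * self.drop(out)

def tom_forward(f, g1, g2, x_vals, z_in, z_out):
    """Eq. 4: E[y_j|x] = g2(g1(sum_i f(x_i, z_i)), z_j); f, g1, g2 are placeholder callables."""
    enc = f(x_vals, z_in)       # per-variable encodings, conditioned on input VEs
    core = g1(enc.sum(dim=1))   # Core g1: run once, independent of the output variable
    return g2(core, z_out)      # Decoder g2: conditioned on the output VE
```

Running the Core once on the summed encoder output, and only the Decoder per output variable, is what yields the computational saving described above.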
In the following sections, all models are implemented in PyTorch, use Adam for optimization, and have a hidden layer size of 128 for all layers. Variable embeddings for TOM are initialized from 𝒩(0,10−3).
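As a concrete illustration of these settings (a minimal sketch only: the stand-in network and variable names are not from the Appendix, and reading 𝒩(0,10−3) as the variance is an assumption):

```python
import torch
import torch.nn as nn

num_vars, C, hidden = 1024, 2, 128
model = nn.Sequential(nn.Linear(1 + C, hidden), nn.ReLU(), nn.Linear(hidden, 1))  # stand-in network
# Variable embeddings drawn from a small-variance normal and learned by gradient
# descent alongside the network weights.
variable_embeddings = nn.Parameter((1e-3 ** 0.5) * torch.randn(num_vars, C))
optimizer = torch.optim.Adam(list(model.parameters()) + [variable_embeddings], lr=1e-3)
```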
In one working embodiment, we can test TOM's ability to learn variable embeddings that reflect our a priori intuition about the domain, in particular, the organization of space and time. In a first embodiment, the CIFAR dataset is utilized. The pixels of the 32×32 images are converted to grayscale values in [0, 1], yielding 1024 variables. The goal is to predict all variable values, given only a subset of them as input. The model is trained to minimize the binary cross-entropy of each output, and it uses 2D VEs. The a priori, i.e., Oracle, expectation is that the VEs form a 32×32 grid corresponding to how pixels are spatially laid out in an image.
In a second working embodiment, the Melbourne minimum daily temperature dataset, a subset of a larger database for tracking climate change, is utilized. As above, the goal is to predict the daily temperature of the previous 10 days, given only some subset of them, by minimizing the MSE of each variable. The a priori, Oracle, expectation is that the VEs are laid out linearly in a single temporal dimension. The question is whether TOM will also learn VEs (in a 2D space) that follow a clear 1D manifold that can be interpreted as time.
For both experiments, a subset of the input variables is randomly sampled at each training iteration, which simulates drawing tasks from a limited universe. The resulting learning process for the VEs is illustrated in
To get an idea of how learning VEs affects prediction performance, comparisons were run with three cases of fixed VEs: (1) all VEs set to zero, to address the question of whether differentiating variables with VEs is needed at all in the model; (2) random VEs, to address the question of whether simply having any unique label for variables is sufficient; and (3) Oracle VEs, which reflect the human a priori expectation of how the variables should be arranged. Table 2 compares test errors (±std. err.) of learned VEs to these fixed-VE alternatives in TOM. The results show that learned VEs outperform Zero and Random VEs, reaching performance on par with the Oracle. That is, TOM not only learns meaningful VEs, but also uses them to achieve superior prediction performance.
The next embodiment shows how such VEs can be used to exploit regularities across tasks in an MTL setting. In accordance with one example embodiment, two synthetic multi-task problems that contain underlying regularities across tasks are considered. These regularities are not known to the model a priori; it can only exploit them via its VEs.
The first problem, a transposed Gaussian process problem, evaluates TOM in a regression setting where input and output variables are drawn from the same continuous space; the second problem evaluates TOM in a classification setting. In the first problem, the universe is defined by a Gaussian process (GP). The GP is 1D, is zero-mean, and has an RBF kernel with length-scale 1. One task is generated for each (#inputs, #outputs) pair in {1, . . . , 10}×{1, . . . , 10}, for a total of 100 tasks. The “true” location of each variable lies in the single dimension of the GP, and is sampled uniformly from [0, 5]. Samples for the task are generated by sampling from the GP, and measuring the value at each variable location. Each task contains 10 training samples, 10 validation samples, and 100 test samples. Samples are generated independently for each task. The goal is to minimize MSE of the outputs.
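A sketch of how one such task could be generated with NumPy is shown below; the kernel helper, the small jitter added for numerical stability, and the function names are assumptions of this sketch rather than details stated above.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def make_gp_task(n_inputs, n_outputs, n_samples, rng):
    """One transposed-GP task: variable locations in [0, 5], values sampled from the GP."""
    locs = rng.uniform(0.0, 5.0, size=n_inputs + n_outputs)   # "true" variable locations
    cov = rbf_kernel(locs, locs) + 1e-6 * np.eye(len(locs))   # jitter for stability
    samples = rng.multivariate_normal(np.zeros(len(locs)), cov, size=n_samples)
    x, y = samples[:, :n_inputs], samples[:, n_inputs:]       # split into inputs/targets
    return locs, x, y

rng = np.random.default_rng(0)
locs, x_train, y_train = make_gp_task(n_inputs=3, n_outputs=2, n_samples=10, rng=rng)
```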
In the second problem, each task is defined by a set of concentric hyperspheres. Many areas of human knowledge have been organized abstractly as such hyperspheres, e.g., planets around a star, electrons around an atom, social relationships around an individual, or suburbs around Washington D.C.; the idea is that a model that discovers this common organization could then share general knowledge across such areas more effectively. To test this hypothesis, one task is generated for each (#features n, #classes m) pair in {1, . . . ,10}×{2, . . . ,10}, for a total of 90 tasks. For each task, its origin ot is drawn from 𝒩(0, In). Then, for each class c∈{1, . . . , m}, samples are drawn from ℝn uniformly at distance c from ot, i.e., each class is defined by a (hyper) annulus. The dataset for each task contains five training samples, five validation samples, and 100 test samples per class. The model has no a priori knowledge that the classes are structured in annuli, or which annulus corresponds to which class, but it is possible to achieve high accuracy by making analogies of annuli across tasks, i.e., discovering the underlying structure of this universe.
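The task-generation process can be sketched as follows; normalizing Gaussian vectors to obtain uniform directions is one standard way to realize “uniformly at distance c,” and the helper names are illustrative.

```python
import numpy as np

def make_hypersphere_task(n_features, n_classes, samples_per_class, rng):
    """One concentric-hyperspheres task: class c lies at distance c from the task origin."""
    origin = rng.normal(size=n_features)                 # o_t ~ N(0, I_n)
    xs, ys = [], []
    for c in range(1, n_classes + 1):
        d = rng.normal(size=(samples_per_class, n_features))
        d /= np.linalg.norm(d, axis=1, keepdims=True)    # uniform directions on the sphere
        xs.append(origin + c * d)                        # points at distance c from the origin
        ys.append(np.full(samples_per_class, c - 1))
    return np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(0)
X, y = make_hypersphere_task(n_features=3, n_classes=4, samples_per_class=5, rng=rng)
```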
In these experiments, TOM is compared to five alternative methods: (1) TOM-STL, i.e., TOM trained on each task independently; (2) DR-MTL (Deep Residual MTL), the standard cross-domain (Table 1, c) version of TOM, where instead of FiLM layers, each task has its own linear encoder and decoder layers, and all residual blocks are CoreResBlocks; (3) DR-STL, which is like DR-MTL except it is trained on each task independently; (4) SLO, which uses a separate encoder and decoder for each task, and which is (as far as we know) the only prior Deep MTL approach that has been applied across disjoint tabular datasets; and (5) Oracle, i.e., TOM with VEs fixed to intuitively correct values. The Oracle is included to give an upper bound on how well the TOM architecture can perform when the VE placement is ideal.
TOM outperforms the competing methods and achieves performance on par with the Oracle (Table 3, shown below). Note that the improvement of TOM over TOM-STL is much greater than that of DR-MTL over DR-STL, indicating that TOM is particularly well-suited to exploiting structure across disjoint data sets.
Learned VEs are shown in
Now that this suitability has been confirmed, the next embodiment evaluates TOM across a suite of disjoint, and seemingly unrelated, real-world problems. TOM is evaluated in the setting for which it was designed: learning a single shared model across seemingly unrelated real-world datasets. In one example embodiment, the set of tasks used is UCI-121, a set of 121 classification tasks that come from diverse areas such as medicine, geology, engineering, botany, sociology, politics, and game-playing. Prior work has tuned each model to each task individually in the single-task regime; no prior work has undertaken learning of all 121 tasks in a single joint model. The datasets are highly diverse. Each simply defines a classification task that a machine learning practitioner was interested in solving. The number of features for a task ranges from 3 to 262, the number of classes from 2 to 100, and the number of samples from 10 to 130,064. To avoid underfitting to the larger tasks, C=128, and after joint training all model parameters (θf, θg1, θg2, and the variable embeddings) are finetuned to tasks with more than 5,000 samples, as described below.
Results across a suite of metrics are shown in Tables 4a and 4b. Mean Accuracy is the test accuracy averaged across all tasks. Normalized Accuracy scales the accuracy within each task before averaging across tasks, with 0 and 100 corresponding to the lowest and highest accuracies. Mean Rank averages the method's rank across tasks, where the best method gets a rank of 0. Best % is the percentage of tasks for which the method achieves the top accuracy (with possible ties). Win % is the percentage of tasks for which the method achieves accuracy strictly greater than all other methods. Table 4a shows comparisons to external results of deep STL models tuned to each task. Table 4b shows comparisons across methods evaluated herein. Metrics are aggregated over all 121 tasks (±std. err.). TOM outperforms the alternative approaches across all metrics, showing its ability to learn many seemingly unrelated tasks successfully in a single model.
For the experiments underlying the comparisons in Tables 4a and 4b, C was selected to be equal to 128 in order to match the number of task-specific parameters of the other Deep MTL methods. Table 5 shows the results of additional experiments that were run on UCI-121 with C=64 and C=256 to evaluate the sensitivity of TOM to the setting of C. Metrics for all settings of C are computed with respect to the external comparison methods, i.e., those in Table 4a. TOM with C=64 produces performance comparable to C=128, suggesting that optimizing C could be a useful lever for balancing performance and VE interpretability.
Additional embodiment details are provided herein below. One skilled in the art will appreciate that variations to the proof-of-concept experimental configurations can be made without departing from the scope of the embodiments.
For the CIFAR experiments, a sigmoid layer is applied at the end of the decoder to squash the output between 0 and 1.
For the CIFAR and Daily Temperature experiments, a subset of the variables is sampled each iteration to be used as input. This subset is sampled in the following way: (1) Sample the size k of the subset uniformly from [1, nt], where nt is the number of variables in the experiment; (2) Sample a subset of variables of size k uniformly from all subsets of size k. This sampling method ensures that every subset size has an equal chance of getting selected, so that the universe is not biased towards tasks of a particular size. E.g., if instead the subset were created by sampling each variable independently with probability p, then the subset size would concentrate tightly around pnt.
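For illustration, this two-step sampling can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def sample_input_subset(n_vars, rng):
    """Sample a variable subset so that every subset size is equally likely."""
    k = rng.integers(1, n_vars + 1)                    # (1) subset size uniform in [1, n_vars]
    return rng.choice(n_vars, size=k, replace=False)   # (2) uniform subset of that size

rng = np.random.default_rng(0)
subset = sample_input_subset(n_vars=1024, rng=rng)     # indices of variables used as input
```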
For classification tasks, each class defines a distinct output variable, i.e., a K-class classification task has K output variables. The squared hinge loss was used for classification tasks. Square hinge loss is preferable to categorical cross-entropy loss in this setting because it does not require taking a softmax across output variables, so the outputs are kept separate. Also, the loss becomes exactly zero once a sample is learned strongly, so that the model does not continue to overfit as remaining samples and tasks are learned.
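A hedged sketch of this loss is given below; the output layout, with one raw score per class (i.e., per output variable), is assumed.

```python
import torch

def squared_hinge_loss(outputs, labels):
    """Squared hinge loss over per-class output variables; zero once a sample is learned strongly."""
    # outputs: (batch, num_classes) raw scores; labels: (batch,) integer class indices
    targets = torch.full_like(outputs, -1.0)
    targets.scatter_(1, labels.unsqueeze(1), 1.0)  # +1 for the true class, -1 elsewhere
    return torch.clamp(1.0 - targets * outputs, min=0.0).pow(2).mean()
```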
The number of blocks in the encoder, core, and decoder is N=3 for all problems except UCI-121, for which it is N=10. All experiments use a hidden size of 128 for all dense layers aside from the final decoder layer that maps to the output space.
The batch size was 32 for CIFAR and Daily Temperature, and max(200, #trainsamples) for all other tasks. At each step, To tasks are uniformly sampled from the set of all tasks, and gradients are summed over a batch for each task in the sample. To=1 in all experiments except UCI-121, for which To=32.
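For illustration, the per-step task sampling and gradient accumulation could look roughly like the following sketch; the task and batch accessors are placeholders, not APIs from the Appendix.

```python
import random

def training_step(model, tasks, optimizer, tasks_per_step, loss_fn):
    """One step: sample T_o tasks uniformly, sum gradients over one batch per task."""
    optimizer.zero_grad()
    for task in random.sample(tasks, tasks_per_step):   # uniform sample of T_o tasks
        x, y = task.next_batch()                        # placeholder batch accessor
        loss = loss_fn(model(x, task.input_ves, task.output_ves), y)
        loss.backward()                                 # gradients accumulate (i.e., sum)
    optimizer.step()
```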
To allow for multi-task training with datasets of varying numbers of samples, we say the model has completed one epoch each time it is evaluated on the validation set. An epoch is 1000 steps for CIFAR, 100 steps for Daily Temperature, 1000 steps for Transposed Gaussian Process, 1000 steps for Concentric Hyperspheres, and 10,000 steps for UCI-121.
For CIFAR, the official training and test splits are used for training and testing. No validation set is needed for CIFAR, because none of the models can overfit to the training set. For Daily Temperature, the second-to-last year of data is withheld for validation, and the final year is withheld for testing. The UCI-121 experiments use the preprocessed versions of the official train-val-test splits which are publicly available and known to those skilled in the art.
Adam is used for all experiments, with all parameters initialized to their default values. In all experiments except UCI-121, the learning rate is kept constant at 0.001 throughout training. In UCI-121, the learning rate is decreased by a factor of two when the mean validation accuracy has not increased in 20 epochs; it is decreased five times; model training stops when it would be decreased a sixth time. Models are trained for 500K steps for CIFAR, 100K steps for Daily Temperature, and 250K for Transposed Gaussian Process and Concentric Hyperspheres. The test performance for each task is its performance on the test set after the epoch of its best validation performance.
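The UCI-121 learning-rate schedule can be sketched as follows, assuming a helper is called once per epoch with the mean validation accuracy; the state-dictionary bookkeeping is illustrative, not code from the Appendix.

```python
def update_learning_rate(optimizer, val_acc, state, patience=20, factor=0.5, max_drops=5):
    """Halve the LR when mean validation accuracy stalls; stop at the would-be sixth drop."""
    if val_acc > state["best"]:
        state["best"], state["stale"] = val_acc, 0
        return False                                  # keep training
    state["stale"] += 1
    if state["stale"] >= patience:
        state["stale"] = 0
        state["drops"] += 1
        if state["drops"] > max_drops:
            return True                               # sixth would-be decrease: stop training
        for group in optimizer.param_groups:
            group["lr"] *= factor                     # decrease by a factor of two
    return False

state = {"best": 0.0, "stale": 0, "drops": 0}
```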
Weights are initialized using the default PyTorch initialization (aside from the SkipInit α scalars, which are initialized to zero). The CIFAR and daily temperature experiments use no weight decay; the transposed gaussian process and concentric hyperspheres experiments use weight decay of 10−4; and the UCI-121 experiments use weight decay of 10−5. Dropout is set to 0.0 for CIFAR, Daily Temperature, and Concentric Hyperspheres; and 0.5 for Transposed Gaussian Process and UCI-121.
In UCI-121, fully-trained MTL models are finetuned to tasks with more than 5,000 samples, using the same optimizer configuration as for joint training, except the steps-per-epoch is set to ⌈#trainsamples/batchsize⌉, the learning rate is initialized to 0.0001, the patience for early stopping is set to 100, and the validation performance is smoothed over every 10 epochs (simple moving average), following the protocol used to train single-task models in prior work by Klambauer et al., Self-normalizing neural networks, In Proc. of NeurIPS, pp. 971-980 (2017), which is incorporated herein by reference in its entirety.
TOM uses a VE size of C=2 for all experiments, except for UCI-121, where C=128 in order to accommodate the complexity of such a large and diverse set of tasks.
Autoencoding (i.e., predicting the input variables as well as unseen variables) was used for CIFAR, Daily Temperature, and Transposed Gaussian Process; it was not used for Concentric Hyperspheres or UCI-121.
The Soft Layer Ordering (SLO) architecture follows the original implementation described in co-owned U.S. patent application Ser. No. 16/172,660 entitled BEYOND SHARED HIERARCHIES: DEEP MULTITASK LEARNING THROUGH SOFT LAYER ORDERING, which is incorporated herein by reference in its entirety. There are four shared ReLU layers, each of size 128, with dropout after each to ease sharing across different soft combinations of layers.
As discussed herein with respect to various exemplary embodiments, TOM enables a single model to be trained across diverse tasks by embedding all task variables into a shared space. The framework is shown to discover intuitive notions of space and time and use them to learn variable embeddings that exploit knowledge across tasks, outperforming single- and multi-task alternatives. Thus, learning a single function that cares only about variable locations and their values is a promising approach to integrating knowledge across data sets that have no a priori connection. The TOM approach thus extends the benefits of multi-task learning to broader sets of tasks.
It is submitted that one skilled in the art would understand the various computing or processing environments, including computer readable mediums, which may be used to implement the processes described herein. Selection of computing environment and individual components may be determined in accordance with memory requirements, processing requirements, security requirements and the like. It is submitted that one or more steps or combinations of steps of the methods described herein may be developed locally or remotely, i.e., on a remote physical computer or virtual machine (VM). Virtual machines may be hosted on cloud-based IaaS platforms such as Amazon Web Services (AWS) and Google Cloud Platform (GCP), which are configurable in accordance with memory, processing, and data storage requirements. One skilled in the art further recognizes that physical and/or virtual machines may be servers, either stand-alone or distributed. Distributed environments may include coordination software such as Spark, Hadoop, and the like. For additional description of exemplary programming languages, development software and platforms and computing environments which may be considered to implement one or more of the features, components and methods described herein, the following articles are referenced and incorporated herein by reference in their entirety: Python vs R for Artificial Intelligence, Machine Learning, and Data Science; Production vs Development Artificial Intelligence and Machine Learning; Advanced Analytics Packages, Frameworks, and Platforms by Scenario or Task by Alex Castrounis of InnoArchiTech, published online by O'Reilly Media, Copyright InnoArchiTech LLC 2020.
The foregoing description is a specific embodiment of the present disclosure. It should be appreciated that this embodiment is described for purpose of illustration only, and that those skilled in the art may practice numerous alterations and modifications without departing from the spirit and scope of the invention. It is intended that all such modifications and alterations be included insofar as they come within the scope of the invention as claimed or the equivalents thereof.
The present application claims benefit of and priority to U.S. Provisional Patent Application No. 63/132,591 similarly entitled SYSTEM AND METHOD FOR MULTI-TASK LEARNING THROUGH SPATIAL VARIABLE EMBEDDINGS filed on Dec. 31, 2020, which is incorporated herein by reference in its entirety. Cross-reference is made to commonly-owned U.S. patent application Ser. No. 16/817,153 entitled System and Method For Implementing Modular Universal Reparameterization For Deep Multi-Task Learning Across Diverse Domains and U.S. patent application Ser. No. 16/172,660 entitled BEYOND SHARED HIERARCHIES: DEEP MULTITASK LEARNING THROUGH SOFT LAYER ORDERING, which are incorporated herein by reference in their entirety. The following document is also incorporated herein by reference in its entirety: Meyerson et al., THE TRAVELING OBSERVER MODEL: MULTI-TASK LEARNING THROUGH SPATIAL VARIABLE EMBEDDINGS, arXiv:2010.02354v4, Mar. 22, 2021. Additionally, one skilled in the art appreciates the scope of the existing art which is assumed to be part of the present disclosure for purposes of supporting various concepts underlying the embodiments described herein. By way of particular example only, prior publications, including academic papers, patents and published patent applications listing one or more of the inventors herein are considered to be within the skill of the art and constitute supporting documentation for the embodiments discussed herein.