The present teaching generally relates to machine learning. More specifically, the present teaching relates to augmented machine learning.
With the development of the Internet and ubiquitous network connections, more and more commercial and social activities are conducted online. Digitized content is served or recommended to millions of online users. Advertising activities have also increasingly shifted online, with ads displayed to users as content is delivered to them. To make online advertising more effective, targeting has been practiced. This includes targeting users from the perspective of advertisers and selecting appropriate ads for online users who may be interested in the content of the ads. Online advertising has played an important role in the continued growth of many industries. To sustain that growth, content customization has been practiced, which allows online advertisers and other parties participating in online advertising to effectively target, budget, schedule, and display ads so as to maximize their gains. A typical task in targeting is, given a user and one or more predefined groups or segments, to assign the user to a corresponding user segment based on the user's available online activities and possibly existing memberships between users and user segments. An effective solution to this commonly faced issue in online information exchange can have a significant social and economic impact.
In a user segmentation system, preferences of content consumers (end users) and advertisers may be identified either from declared interests or learned from their activities or specifications. Ads may be ranked with respect to different consumers or consumer segments based on the predicted outcome of displaying them, and recommendations are made according to the rankings. Large-scale ranking and recommendation models may be specifically designed for particular types of prediction tasks. This type of approach has drawbacks, especially when the number of possible prediction tasks is high. For example, task heterogeneity is an issue because digital footprints, such as users' online activities, may be logged and integrated from a wide range of contexts, platforms, and even physical machine types, so that they may vary significantly in modality and schema, making it hard for learning systems to adapt. Another issue is the long-tailness of data: the fine granularity of segments used for information customization often results in extremely large numbers of prediction tasks, many of which fall in the long tail of the distribution with insufficient signals or observations. As another example, data availability is also an issue, sometimes due to missing-at-random (MAR) effects and at other times due to restrictions imposed by user privacy and regulatory compliance considerations. As a consequence, a learning scheme that relies on adequate availability of training data may not be able to learn adequately.
Efforts have been made to address some of these issues. For example, to integrate heterogeneous systems, ensemble learning is used to leverage multiple machine learning models, commonly known as experts, to achieve super learning. Such integration approaches developed from the branch of statistics that employs machine learning models based on linear models, such as regression models, for combining predictions from individual experts into a final decision. However, due to the complexity of the inter-relationships among data and different data sources, it is generally not possible to capture such inter-relationships via linear models.
In this kind of scheme, each individual expert may learn only the knowledge available in its own training data, from a limited setting, without being able to leverage the relationships among the experts and the data used for training. This is particularly so when the experts are heterogeneous. In addition, each expert system may be designed for its local circumstances, so that it learns only from such perspectives, without more and without being able to leverage the knowledge of other experts. Given that, combining expert outputs using a linear combination cannot capture the actuality of the world.
Thus, there is a need for solutions that address the challenges discussed above and enhance the performance of segment prediction for targeting.
The teachings disclosed herein relate to methods, systems, and programming for information management. More particularly, the present teaching relates to methods, systems, and programming for predicting user segments via augmented machine learning.
In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, is disclosed for predicting a user segment. An expert hierarchy is created with an initial expert layer having multiple initial experts and at least one augmented expert layer. Each augmented expert layer has one or more augmented experts that are derived via machine training to augment at least the initial experts. When an input is received by the expert hierarchy, each of the experts, including initial and augmented experts, generates an expert prediction based on the input.
In a different example, a system is disclosed for predicting a user segment. The system includes an initial expert layer with a plurality of initial experts for prediction and at least one augmented expert layer, each with one or more augmented experts. An augmented expert at any of the at least one augmented expert layer augments the plurality of initial experts and is trained via machine learning for the prediction. An expert hierarchy is generated to include the initial expert layer and the at least one augmented expert layer, configured to facilitate each of the initial and augmented experts in the expert hierarchy generating a respective expert prediction based on a received input.
Other concepts relate to software for implementing the present teaching. A software product, in accordance with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.
Another example is a machine-readable, non-transitory, and tangible medium having information recorded thereon for predicting a user segment. The information, when read by the machine, causes the machine to perform various steps to create an expert hierarchy with an initial expert layer having multiple initial experts and at least one augmented expert layer. Each augmented expert layer has one or more augmented experts that are derived via machine training to augment at least the initial experts. When an input is received by the expert hierarchy, each of the experts, including initial and augmented experts, generates an expert prediction based on the input.
Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well-known methods, procedures, components, and/or systems have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.
The present teaching discloses solutions that address these challenges in the art. To resolve the issues associated with task heterogeneity, data long-tailness, and data availability in predicting based on online data, the present teaching presents a scheme of augmenting experts at one or more levels to not only leverage the expertise learned by the original experts but also expand it with aspects of knowledge not yet learned by the existing experts, including inter-relationships among existing experts that traditional systems completely ignore. To achieve that, in deriving a new augmented expert, in addition to training data, the outputs from previously trained experts (including original and previously augmented experts) are also used to train the new augmented expert, where the outputs from the previously trained experts are generated by these experts based on the same training data. The disclosed expert augmentation scheme yields heterogeneous experts that form an expert hierarchy. This expert hierarchy provides an expanded range of knowledge learned by different experts, so that their respective expertise on the same task may be integrated to enhance the quality of the prediction as compared with traditional systems.
The present teaching also discloses a nonlinear framework for integrating outputs from different experts, to overcome the deficiencies of the traditional approaches that use a linear weighted sum to integrate different experts. The present teaching presents a scheme of combining multiple experts in a nonlinear manner via learning. The multiple experts combined using the scheme as disclosed herein may include homogeneous and/or heterogeneous experts. In some embodiments, the experts being combined may include conventional experts and/or augmented experts created from some given existing experts using the augmentation scheme as disclosed herein. In some embodiments, an artificial neural network (ANN) is employed for the integration, so that embeddings of the ANN may be learned to capture the nonlinear complex relationships and serve as a nonlinear integration function for combining multiple expert outputs. Such a trained ANN with learned embeddings, when receiving outputs from multiple experts as input, yields an integrated expert decision via the complex nonlinear function learned and implicitly specified by the parameterized ANN.
Augmented experts in the augmented expert layers 220 may be organized as multiple layers. Augmented experts at each layer may be created at a separate time. For instance, augmented expert 1 220-1, augmented expert 2 220-2, . . . , augmented expert j 220-j may be at the first augmented expert layer 1 and are created via learning from both training data and by leveraging the expertise of previously trained experts at the initial expert layer 210. In this case, the original experts from the initial expert layer 210 constitute the base experts in creating each of the augmented experts 220-1, 220-2, . . . , 220-j. Augmented experts at the next higher level in the augmented expert layers 220 may be further created by augmenting based on, e.g., the initial experts in the initial expert layer 210 as well as augmented expert 1 220-1, augmented expert 2 220-2, . . . , and augmented expert j 220-j, etc. In this manner, augmented experts incorporate the expertise from all or part of the (base) experts at lower levels, and their creation may be based on both the training data as well as the knowledge learned by lower-level experts, via outputs from such base experts generated based on the same training data.
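For illustration only, the following Python sketch shows one way such an augmented expert might be built. It assumes scikit-learn-style base experts, and all names (expert_outputs, X_train, etc.) are hypothetical rather than part of the present teaching:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in training data for the sketch.
X_train, y_train = make_classification(n_samples=500, n_features=20,
                                       random_state=0)

# Hypothetical initial experts (layer 210), each trained on the training data.
initial_experts = [
    LogisticRegression(max_iter=1000).fit(X_train, y_train),
    GradientBoostingClassifier(random_state=0).fit(X_train, y_train),
]

def expert_outputs(experts, X):
    """Concatenate each base expert's predicted class probabilities."""
    return np.hstack([e.predict_proba(X) for e in experts])

# An augmented expert (layer 220) is trained on the raw features PLUS the
# outputs the base experts produce on the same training data, so it can
# leverage, and go beyond, the expertise already learned below it.
X_aug = np.hstack([X_train, expert_outputs(initial_experts, X_train)])
augmented_expert = LogisticRegression(max_iter=1000).fit(X_aug, y_train)
```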
Once the expert hierarchy is created via iterative augmentation, the experts in the hierarchy are connected in a manner that facilitates nonlinear integration of expert outputs to generate, based on an input feature vector, an integrated expert prediction decision. These layers of experts form an expert hierarchy in which each expert's output may be connected to the input of every expert at a higher level. When an input feature vector is received by the experts in the hierarchy, each may generate its output not only based on the given input and the previously trained models, as in conventional approaches, but also by considering the expertise of other experts at lower levels in the system, i.e., by taking outputs from experts at lower levels as input.
The framework of 200 addresses the deficiencies of the traditional approaches by introducing a mechanism for creating augmented experts, so that additional dimensions of expertise may be expanded and the experts enriched. The augmentation is carried out hierarchically, so that knowledge may be deepened with each level of augmentation. Through this mechanism, while the expansion extends vertically through the layers, the complementary and interactive relationships among experts at the same level may also be leveraged. Integrating the expertise of different experts is in general not a simplistic linear combination. To overcome this deficiency of the prior art, the present teaching employs a nonlinear approach to model the complex relationships among different experts, learning the embedded interactions among the experts and their expertise.
The number of augmentation levels and the number of augmented experts at each augmented level may be controlled based on different criteria in accordance with, e.g., the needs of specific applications. For instance, to ensure the dynamic coverage of the learning, the augmentation of the experts in the hierarchy may be developed along multiple directions by varying, e.g., model parameters, ways to initialize model parameters, cost functions, convergence criteria, etc., which can be set up when each augmented expert is created.
Once the expert hierarchy is built, i.e., all experts, including the original and the augmented experts, are trained and ready to be used, it can make the predictions it was trained for. As discussed herein, when an input, e.g., a feature vector, is received by the expert hierarchy, the input is sent to all experts in the hierarchy, and each expert may then act on the input and generate its respective output. Some of the expert outputs are further sent to augmented experts at a higher level as additional input, so that augmented experts in the hierarchy also generate their respective outputs based on the outputs of other experts. These multiple expert outputs are further combined to generate an integrated expert output as the integrated expert prediction of the expert hierarchy. To facilitate the integration, a nonlinear heterogeneous expert integration model is trained first. To do so, it is configured, at 235, prior to its training. To train this nonlinear expert integration model, input training data is used together with the outputs from the experts in the hierarchy; during training, the model parameters or embeddings are adjusted, at 245, by learning from the discrepancies between the ground truth from the training data and the integrated expert outputs produced using the current model parameters. Once the nonlinear integration model converges via learning, when outputs from the experts in the hierarchy (generated based on a given input feature vector) are received, at 255, the trained nonlinear heterogeneous expert integration model is used to combine the outputs from the experts in the hierarchy to generate, at 265, an integrated output.
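As a minimal sketch of this inference flow (assuming each trained expert exposes a scikit-learn-style predict_proba interface; the hierarchy layout and names are hypothetical), the input may be propagated upward through the hierarchy before the integration model combines all expert outputs:

```python
import numpy as np

def hierarchy_predict(layers, integration_model, X):
    """Forward pass through the expert hierarchy.

    `layers`: list of lists of trained experts, lowest (initial) layer
    first; experts above the initial layer expect the raw features
    concatenated with the outputs of all lower-level experts.
    `integration_model`: trained model combining all expert outputs.
    """
    collected = []
    lower = np.empty((X.shape[0], 0))       # no lower-level outputs yet
    for layer in layers:
        X_in = np.hstack([X, lower])        # features + lower expert outputs
        outs = np.hstack([e.predict_proba(X_in) for e in layer])
        collected.append(outs)
        lower = np.hstack([lower, outs])    # expose to all higher layers
    # All expert outputs are combined by the (nonlinear) integration model.
    return integration_model.predict_proba(np.hstack(collected))
```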
As discussed herein, experts in the expert hierarchy, once trained, may be used to carry out the tasks that they are trained to perform. Outputs from the experts are then integrated so that all experts' opinions can be leveraged to derive a more reliable integrated expert prediction.
In the illustrated example, to develop the augmented experts at augmented expert layer 320, training data set T2 is used for training the augmented experts 320-1 and 320-2. In creating the augmented experts 320-1 and 320-2, the trained initial experts 310-1 and 310-2 are used to generate their respective outputs (expert outputs) based on the same training data from T2. That is, training data T2 is fed to both the initial expert layer 310 and the augmented expert layer 320 for training the augmented experts 320-1 and 320-2. As discussed herein, the augmented experts are trained by leveraging the expertise of the initial experts 310-1 and 310-2, so that the augmented experts are more refined. To achieve that, the training data in T2 are also provided to the initial experts 310-1 and 310-2 so that they produce expert outputs o11 and o12, both of which are fed to the augmented experts 320-1 and 320-2 as input. That is, the training of augmented experts 320-1 and 320-2 is based on both the input from training data T2 and the input expert outputs from the initial experts. As shown, the initial expert outputs o11 and o12 are sent as inputs to both augmented expert 21 320-1 and augmented expert 22 320-2, so that the augmented experts learn in consideration of the expertise of the initial experts, in a manner that goes further than what the initial experts can achieve. An exemplary formal formulation for generating augmented experts is provided in detail below.
With training data set T2, the augmented experts 21 and 22 are trained. In this exemplary architecture with two layers of experts, once the training of augmented experts 21 320-1 and 22 320-2 is completed, the nonlinear integration modeling unit 240 may be trained based on training data T3. The training data in T3 is sent to all experts in the expert hierarchy, i.e., initial expert 11 310-1 and initial expert 12 310-2 as well as augmented expert 21 320-1 and augmented expert 22 320-2. These trained experts, reacting to the training data in T3, generate their respective expert outputs, i.e., o11 from initial expert 11 310-1, o12 from initial expert 12 310-2, o21 from augmented expert 21 320-1, and o22 from augmented expert 22 320-2. Note that the expert outputs o11 and o12 from the initial experts 310-1 and 310-2 are also sent to augmented experts 21 320-1 and 22 320-2 as inputs. All these expert outputs are then sent to the nonlinear integration modeling unit 240 so that the nonlinear expert integration model 260 can be trained.
The nonlinear integration modeling unit 240 is provided for training the nonlinear expert integration model 260. In some embodiments, the modeling unit 240 includes a deep learning engine 300 that takes input data (including training data T3 as well as expert outputs o11, o12, o21, and o22 generated based on the same training data T3) and learns the various parameters that define the nonlinear expert integration model 260 by adjusting these parameters during training to minimize some defined loss function. During learning, the current weights in the learned integration weights 260-3, which implicitly define a nonlinear integration function for combining the expert outputs, are used by the deep learning engine 300 to combine the expert outputs into an integrated prediction. This integrated prediction is compared with the ground truth prediction provided by the training data in T3. The discrepancy is used to determine how to adjust the current weights stored in 260-3 to minimize the loss. The process repeats until a convergence condition defined in the operational parameters 260-2 is satisfied. As discussed herein, in some embodiments, the learnable parameters during training may include the embedding parameters of the neural network (nonlinear integration weights 260-3). In some embodiments, the learning may also be conducted to learn other parameters, such as model parameters 260-1 and operational parameters 260-2. An exemplary formal formulation of learning the nonlinear integration model for combining expert outputs is provided in detail below.
As discussed herein, in some embodiments, experts in the hierarchy are trained one layer at a time. That is, the initial experts may be trained first. Once the initial experts are trained, they are used in training the augmented experts at the next layer.
To ensure that augmented experts learn or expand the expertise already learned by lower-level experts, different learning dynamics may be introduced. This may include using heterogeneous experts in diversified modalities, applying different initialization approaches, employing different loss functions, or controlling the learning process with different convergence conditions. In some embodiments, the parameters that can be learned by different experts via training may also vary. For instance, some experts may be trained with parameters initialized using random numbers, while others may be trained starting from pre-initialized parameters.
To train the augmented experts at the next layer, training data for training augmented experts at the next augmented expert layer are fed, at 370, to both the previously trained experts (including the initial experts and the augmented experts at the first augmented expert layer) and the augmented experts at the next layer. The trained initial and augmented experts then generate outputs based on the training data and such outputs are used as input to the augmented experts at the next layer, which are trained, at 375, based on both the training data as well as the outputs from all previously trained experts. The augmented experts at this next layer are trained in an iterative learning process until the learning converges. If there are more layers, determined at 380, the steps of 370 and 375 are repeated until augmented experts of all layers are trained.
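A compact sketch of this layer-by-layer procedure (steps 370/375), under the same hypothetical scikit-learn-style interface as above, might look as follows:

```python
import numpy as np

def lower_outputs(trained_layers, X):
    """Outputs that all previously trained layers produce on data X."""
    lower = np.empty((X.shape[0], 0))
    for layer in trained_layers:
        X_in = np.hstack([X, lower])
        lower = np.hstack([lower] +
                          [e.predict_proba(X_in) for e in layer])
    return lower

def train_hierarchy(layer_specs, layer_datasets):
    """Train the hierarchy one layer at a time.

    `layer_specs`: list of lists of untrained estimators, lowest first;
    `layer_datasets`: one (X, y) training set per layer.  Each new layer
    is fit on its own training data plus the outputs the already-trained
    lower layers generate on that same data (step 370), iterating until
    all layers are trained (steps 375/380).
    """
    layers = []
    for (X, y), spec in zip(layer_datasets, layer_specs):
        X_in = np.hstack([X, lower_outputs(layers, X)])
        layers.append([est.fit(X_in, y) for est in spec])
    return layers
```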
The above-described framework for developing an expert hierarchy of heterogeneous experts corresponds to a concept called SuperCone. This learning framework is general and provides a unified approach that can be applied to a variety of prediction tasks, such as user segmentation, performance prediction, etc. It builds a distributed concept representation to obtain a reliable representation of the signals output by heterogeneous experts, and models each task by combining heterogeneous prediction models that may vary in architecture, learning method, or the way they learn the prediction tasks. The framework as disclosed herein can flexibly incorporate adaptive expert combination modules and a deep representation learning module over the original input to augment the heterogeneous experts. It is an end-to-end approach for jointly learning the heterogeneous experts and the expert combination module. In some embodiments, the problem of representation learning may be formulated via a meta-learning framework known as "learning to learn," which focuses on a learning mechanism that gains experience and improves its performance over multiple learning episodes. In some embodiments, meta-learning is applied, according to the present teaching, in the context of optimizing learning based on heterogeneous experts.
Below, the learning of experts in the SuperCone framework is formally defined in the context of meta-learning. Solutions according to the present teaching are formally formulated with respect to the exemplary task of predicting user segments. With this framework, items of interest may be ingested from a variety of domains with a diverse range of knowledge enrichment, resulting in a heterogeneous information network of users, events, and existing segments, each with its own schema, modality, and patterns of interconnection. Formally, in order to predict a particular segment, let $\mathcal{S}$ be the set of users (i.e., entities) for which segment prediction is to be performed, and $\mathcal{Y}$ be the set of possible prediction labels. Assume that the resulting unfolded concepts are represented as a real-valued concept vector $\vec{c}_s$ for each user $s$, with the index being the list of the concept vocabulary $C$ and each value being the intensity of the association with the corresponding concept.
For clarity, the scenario for learning with homogeneous experts is disclosed first. Specifically, assume a particular expert $h_j$ associated with a hypothesis space $\mathcal{H}_j \subseteq \{\mathbb{R}^{|C|} \to \mathcal{Y}\}$. Assume that the algorithm for training the expert corresponds to an efficient oracle $\theta_j^*(\omega; \mathcal{D})$, which may be used for obtaining trained experts based on a given dataset $\mathcal{D}$ and a meta-parameter $\omega \in \Omega$ that controls how the models are learned, such as model hyperparameters:

$$\theta_j^*(\omega; \mathcal{D}) = \arg\min_{\theta_j \in \Theta_j} L_j\bigl(h_j(\cdot; \theta_j, \omega); \mathcal{D}\bigr) \qquad (1)$$

where $\theta_j$ is the set of learnable parameters contained in the parameter space $\Theta_j$, and $L_j$ is the loss used for training $h_j$, e.g., the loss function used for back-propagation. The task of learning unfolded concepts with a homogeneous expert may utilize one such oracle, which can be defined as follows.
DEFINITION 1 (LEARNING UNFOLDED CONCEPTS WITH A HOMOGENEOUS EXPERT). Assuming a label function of interest $y: \mathcal{S} \to \mathcal{Y}$ mapping each user to a label in $\mathcal{Y}$, a probability density over the entities $q: \mathcal{S} \to [0,1]$, and a sampled dataset $\mathcal{D}$, the task is to learn a model $h_j \in \mathcal{H}_j$ that minimizes the expected risk according to a given criterion $L$ defined below:

$$\min_{\theta_j \in \Theta_j} \; \mathbb{E}_{s \sim q}\bigl[L\bigl(h_j(\vec{c}_s; \theta_j, \omega),\, y(s)\bigr)\bigr]$$

where $\theta_j \in \Theta_j$ denotes the task-specific parameters and $\omega \in \Omega$ denotes the meta-parameter.
The formalization of the user segmentation problem can be considered as a meta-learning problem in a more general setting. Assume a distribution over tasks $p(\mathcal{T}): \mathcal{T} \to [0,1]$, and a source (i.e., meta-training) dataset of $M$ tasks sampled from $p(\mathcal{T})$, each containing a training set $\mathcal{D}_j^{train}$ (i.e., a support set in meta-learning) and a validation set $\mathcal{D}_j^{val}$ (i.e., a query set in meta-learning) with non-overlapping i.i.d. samples drawn from the instance distribution $q_j$ of task $\mathcal{T}_j$, as $\mathcal{D}_{source} = \{(\mathcal{D}_j^{train}, \mathcal{D}_j^{val})\}_{j=1}^{M}$. Likewise, a target (i.e., meta-test) dataset of $Q$ tasks sampled from $p(\mathcal{T})$ is also assumed, each containing a training set (i.e., a support set) and a test set (i.e., a query set) with non-overlapping i.i.d. samples drawn from the instance distribution $q_j$ of task $\mathcal{T}_j$, as $\mathcal{D}_{target} = \{(\mathcal{D}_j^{train}, \mathcal{D}_j^{test})\}_{j=1}^{Q}$. The goal is to obtain the "meta-knowledge" in the form of $\omega$ from $\mathcal{D}_{source}$, which may then be applied to improve the downstream performance on $\mathcal{D}_{target}$ by fine-tuning on each individual training set at meta-test time.
In learning heterogeneous experts, however, a source and a target set may not be separate. In some embodiments, the only requirement may be that one dataset serves as the source dataset for meta-training and one target dataset exists for meta-testing. It is assumed that each of the tasks $\mathcal{T}_j$, $j = 1 \ldots J$, where the only difference between tasks is the particular expert $h_j$, is associated with a hypothesis space $\mathcal{H}_j \subseteq \{\mathbb{R}^{|C|} \to \mathcal{Y}\}$, a set of learnable parameters $\theta_j \in \Theta_j$, and a training oracle $\theta_j^*(\omega; \mathcal{D})$ that satisfies Equation (1). The goal of meta-training is to obtain optimal generalization error on the single test target set.
In some embodiments, formally, it is assumed that all the available instances will be used as both the source and the target set. Given a sample of data $\mathcal{D} = \{(\vec{x}_s, y_s)\}$ drawn i.i.d. from the instance distribution $q(s)$, some or all of the instances from $\mathcal{D}$ may be used for training the individual experts $h_j(\cdot; \theta_j; \omega)$, i.e., $\mathcal{D}_j^{train} \subseteq \mathcal{D}$, $\mathcal{D}_j^{val} \subseteq \mathcal{D}$, $\mathcal{D}_j^{train} \cap \mathcal{D}_j^{val} = \emptyset$. Likewise, the dataset used for meta-testing may consume some or all of the training instances, i.e., $\mathcal{D}_j^{test} \subseteq \mathcal{D}$, $j = 1, \ldots, J$. The goal is to learn a joint model based on the experts adapted on the target training set, $\theta_j^*(\omega; \mathcal{D}^{train})$ for $j = 1 \ldots J$, denoted as $h(\cdot; \omega, \{h_j(\cdot; \theta_j^*(\omega; \mathcal{D}^{train}))\})$, to achieve the best generalization error.
Learning of heterogeneous experts may be formally defined as below. DEFINITION 2 (LEARNING UNFOLDED CONCEPTS WITH HETEROGENEOUS EXPERTS). Assuming a label function of interest $y: \mathcal{S} \to \mathcal{Y}$, a sampled dataset $\mathcal{D}$, and a set of heterogeneous experts $h_j$ with inner training oracles $\theta_j^*(\omega; \mathcal{D})$ for $j = 1 \ldots J$, the task is to learn a combined model $h$ that minimizes a given loss criterion $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, which defines an objective function analogous to that of Equation (1) for learning with homogeneous experts. That is,

$$\omega^* = \arg\min_{\omega \in \Omega} \; \mathbb{E}_{s \sim q}\Bigl[L_{meta}\Bigl(h\bigl(\vec{c}_s; \omega, \{h_j(\cdot; \theta_j^*(\omega; \mathcal{D}))\}_{j=1}^{J}\bigr),\, y(s)\Bigr)\Bigr]$$

where $L_{meta}$ is a meta-loss to be specified by the meta-training procedure, e.g., the cross-entropy error or a temporal-difference error. The formulation as presented herein on unfolded concept learning with heterogeneous experts enhances efficiency and scalable distributed machine learning, yet retains the representation power.
The discussion below involves two parts. The first part involves the representation of a meta-module. The second part has to do with an optimization procedure. In terms of the representation of a meta-module, $\Omega$ is used to denote a meta-parameter space with respect to any given choice of $\Theta_j$ for each of the individual experts $\mathcal{H}_j$. The solution space induced by the meta-parameter $\omega$ brings inductive bias to the downstream tasks and affects the efficiency of the learning procedure of each task. In general, there are some key desired criteria in building a model for the task of user segmentation. For instance, one issue relates to task-agnostic expertise modeling: the choice of $\Omega$ in general should allow flexible modeling over a large variety of task types and best utilize the power of the experts from $\mathcal{H} = \{\mathcal{H}_j \mid j = 1 \ldots J\}$ in an adaptive way, without task-specific engineering. Another issue relates to representation power, i.e., the choice of $\Omega$ should possess adequate representation capacity for inducing deep representations of the data and not limit itself to specific features or classes of functions. A further example issue concerns first-order influence: the influence of the meta-parameter $\omega$ over the learning mechanism should allow efficient meta-optimization for performance-critical applications, without incurring higher-order gradient computation when learning $\omega$.
Traditional approaches mostly fall into the categories of traditional super learning and ensemble learning schemes that are heuristic in nature and fail to meet the second criterion stated above. Traditional deep learning approaches fail the first criterion because they do not incorporate the power of heterogeneous experts. Other existing meta-learning approaches rely on higher-order and bi-level optimization, so they do not meet the third criterion. The expert augmentation as disclosed herein according to the present teaching has a meta-learning architecture that constructs a large portfolio of augmented experts and learns deep representations for both direct prediction from unfolded concepts and indirect combination of heterogeneous experts. At the same time, each of the experts possesses its own individual prediction power and the expertise learned in its training.
A sluice network for heterogeneous experts may be developed based on the exemplary criteria discussed below. Given a set of experts $\mathcal{H}$, an augmented set of experts $\mathcal{H}_{Aug}$ may be constructed by, e.g., enumerating nested combinations among the experts, as sketched below. For example, an augmented expert in $\mathcal{H}_{Aug}$ can be (1) any expert model with a hypothesis space belonging to the initial experts $\mathcal{H}$, (2) any arithmetic combination of an arbitrary number of experts in $\mathcal{H}_{Aug}$, or (3) any recursive application of an expert with a hypothesis space belonging to $\mathcal{H}_{Aug}$ over an arbitrary number of outputs from models in $\mathcal{H}_{Aug}$. Such expert expansion may be implemented following the sluice network architecture, with, e.g., additional layer-by-layer skip connections. As discussed herein, the output at each level of densely connected experts $\sigma(\cdot)$ may be fed both to the immediately next level as input and to higher levels and the subsequently connected layers henceforth.
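As a toy illustration of this enumeration (a sketch only, assuming scikit-learn-style experts; the helper names are hypothetical and this is not the sluice network itself), rules (1) and (2) might be realized as follows, with rule (3) corresponding to the stacked augmented experts built earlier:

```python
from itertools import combinations

def arithmetic_combo(e1, e2, w=0.5):
    """Rule (2): an arithmetic (weighted-average) combination of two experts."""
    class Combo:
        def predict_proba(self, X):
            return w * e1.predict_proba(X) + (1 - w) * e2.predict_proba(X)
    return Combo()

def enumerate_augmented(base_experts):
    """One pass of the nested enumeration: rule (1) keeps the base experts;
    rule (2) adds pairwise arithmetic combinations.  Rule (3), recursive
    application of an expert over other experts' outputs, can be appended
    analogously (cf. the augmented-expert sketch above)."""
    aug = list(base_experts)                        # rule (1)
    for e1, e2 in combinations(base_experts, 2):    # rule (2)
        aug.append(arithmetic_combo(e1, e2))
    return aug
```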
To further augment the model capacity and obtain a deep representation of the data, a complementary expert module $h_{comp}$ may be incorporated, with a hypothesis space $\mathcal{H}_{comp}$ that allows flexible modulation of information flow while respecting the simplicity of the network design. To that end, a neural multi-mixture-of-experts architecture may be adopted that learns an ensemble of individual experts in an end-to-end fashion. To achieve that, the neural net may be divided into an end output module, Tower, and inner expert neural submodules. The output module produces the output for the particular task at hand. The experts in the inner expert neural submodules are called $\mathrm{InnerExpert}_t$, $1 \le t \le E$. There may also be a gating network, $\mathrm{Gate}_i$, that projects an input from the original data representation $\vec{c}_s$ directly into $\mathbb{R}^E$. The prediction of the final complementary expert maps a concept vector representation $\vec{c}_s$ into the label space $\mathcal{Y}$, $h_{comp}(\vec{c}_s)$, which can be expressed as:

$$h_{comp}(\vec{c}_s) = \mathrm{Tower}(v_s), \qquad v_s = \sum_{t=1}^{E} \mathrm{softmax}\bigl(\mathrm{Gate}_i(\vec{c}_s)\bigr)_t \cdot \mathrm{InnerExpert}_t(\vec{c}_s)$$
Here, the intermediate representation $v_s$ corresponds to a weighted sum produced by a shallow network $\mathrm{Gate}_i(\vec{c}_s)$ after normalizing into the unit simplex via $\mathrm{softmax}(\cdot)$. Each $\mathrm{InnerExpert}_t$ may then, in turn, correspond to an ensemble of sub-modules mapping $\vec{c}_s$ to a fixed-length vector:

$$\mathrm{InnerExpert}_t(\vec{c}_s) = \mathrm{Depth}_{t,n}, \qquad \mathrm{Depth}_{t,i} = \mathrm{Proj}_{t,i}\bigl(\mathrm{Depth}_{t,i-1}\bigr), \quad \mathrm{Depth}_{t,0} = \vec{c}_s$$
where $\mathrm{Depth}_{t,i}$ denotes the intermediate output of inner expert $t$ at depth $i$, consisting of a projection $\mathrm{Proj}_{t,i}$, which may be implemented as a linear layer followed by a ReLU activation. In some embodiments, an ensemble of neural experts may first be combined to form a deep representation from the concept vector, and then be further combined with the rest of the heterogeneous experts.
For combining different experts, the approaches as disclosed herein according to the present teaching are capable of adaptively weighing different predictions across experts. To achieve that, the weights over individual candidates may be learned in an adaptive fashion based on a separate neural network component, denoted by, e.g., $\mathrm{Comb}(\cdot)$, which may be implemented using an architecture similar to $h_{comp}$. Assuming the experts from $\mathcal{H}_{Aug}$ are arranged as an array of mapping functions $\{h_1, h_2, \ldots\}$, $\mathrm{Comb}(\cdot)$ may then be used to map the concept vector $\vec{c}_s$ into a vector with dimension equal to $|\mathcal{H}_{Aug}| + 1$. The final model prediction, $h(\vec{c}_s)$, may then be produced using an additional layer of weighted sums over all possible experts.
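For concreteness, a minimal PyTorch sketch of such a gated mixture is given below; the module names and dimensions are illustrative assumptions, not the exact architecture of the present teaching:

```python
import torch
import torch.nn as nn

class InnerExpert(nn.Module):
    """A stack of Proj layers (linear + ReLU), per the Depth/Proj scheme."""
    def __init__(self, in_dim, hidden, depth=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, c):
        return self.net(c)

class ComplementaryExpert(nn.Module):
    """h_comp: a softmax gate mixes inner experts; Tower maps to labels."""
    def __init__(self, in_dim, hidden, n_inner, n_labels):
        super().__init__()
        self.inner = nn.ModuleList(
            InnerExpert(in_dim, hidden) for _ in range(n_inner))
        self.gate = nn.Linear(in_dim, n_inner)    # shallow gating network
        self.tower = nn.Linear(hidden, n_labels)  # task output module

    def forward(self, c):
        w = torch.softmax(self.gate(c), dim=-1)        # unit simplex weights
        stacked = torch.stack([e(c) for e in self.inner], dim=1)
        v = (w.unsqueeze(-1) * stacked).sum(dim=1)     # weighted sum v_s
        return self.tower(v)
```

A separate network playing the role of $\mathrm{Comb}(\cdot)$ could reuse this architecture, producing $|\mathcal{H}_{Aug}| + 1$ weights that are applied in a final weighted sum over the expert predictions.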
As discussed herein, the learning process during meta-learning is to learn or optimize the meta-parameters $\omega$ that are agnostic to the heterogeneous experts in $\mathcal{H}$. A naive approach that directly uses the original input dataset $\mathcal{D}$ to compute the meta-loss $L_{meta}$, or uses it as the support set, might lead to "meta-overfit," where the combination network and the added experts from $\mathcal{H}_{Aug}$ may falsely rely on overfitted experts. To avoid such issues, the present teaching discloses a principled framework for constructing a meta-training set that eliminates this phenomenon in general. The basic idea is to extract non-overlapping subsets of the data as the support and query sets of the source data for meta-training, to minimize the discrepancy between meta-training and deployment. Such an optimization scheme makes no assumptions about the heterogeneous experts, including the existence of gradients in their learning processes.
According to this optimization scheme, each level of heterogeneous experts is trained recursively on previous levels with its own meta-training set based on the cross-validation split approach as discussed herein, with the final level corresponding to a super augmented expert hierarchy or architecture. Heterogeneous experts may be indexed by the depth they depend on, e.g., with $h_j^{(k)}$ denoting the $j$th expert at the $k$th layer, $k = 1, 2, \ldots, K$. At each depth, a cross-validation scheme may be adopted, with $V^{(k)}$ mapping an instance $s$ from $\mathcal{D}$ to a fold among $1, 2, \ldots, V$, and the learning proceeds by creating a higher-order meta-training dataset at each $k$th layer, as:
$$\mathcal{D}^{(k)} = \bigl\{\bigl(\vec{x}_s^{(k)}, \vec{z}_s^{(k)}\bigr) \;\big|\; \vec{x}_s \in \mathcal{D}\bigr\} \qquad (9)$$

where the component of $\vec{x}_s^{(k)}$ corresponding to expert $j$ is that expert's out-of-fold output,

$$\vec{x}_s^{(k)(j)} = h_j^{(k)}\bigl(\vec{x}_s;\; \theta_j^*\bigl(\omega, \mathcal{D}^{(k)}_{\sim s}\bigr)\bigr) \qquad (10)$$
with $\mathcal{D}^{(k)}_{\sim s}$ denoting the subset of $\mathcal{D}^{(k)}$ not in the same fold as instance $s$, formally:

$$\mathcal{D}^{(k)}_{\sim s} = \bigl\{\vec{x}_{s'} \in \mathcal{D}^{(k)} \;\big|\; V^{(k)}(s') \neq V^{(k)}(s)\bigr\} \qquad (11)$$
The meta-parameter set $\omega$ is trained using the last layer of the constructed meta-training dataset, $\mathcal{D}^{(K)}$, with respect to the meta-loss, which may be defined as follows:

$$\omega^* = \arg\min_{\omega} \sum_{(\vec{x}_s, \vec{z}_s) \in \mathcal{D}^{(K)}} L_{meta}\bigl(h_{train}(\vec{x}_s),\, \vec{z}_s\bigr)$$
with the meta-training-time model $h_{train}(\vec{x}_s)$ defined by replacing the outputs of all heterogeneous experts directly with all but the first $|C|$ elements of the input, $\vec{x}_s[|C|:]$, and feeding the alternative expert and the combination network with the original features, $\vec{x}_s[:|C|]$.
That is, in this illustrated embodiment, the learning of the network parameters is posed as an end-to-end optimization problem, which can be solved using efficient gradient-based methods. At meta-test time, the source set for each of the heterogeneous experts $h_j^{(k)}$ is defined as the $k$th higher-order meta-training dataset, i.e., $\mathcal{D}^{(k)}$.
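The out-of-fold construction of Equations (9)-(11) may be sketched as follows (a simplified Python illustration for a binary task; the helper name and fold count are hypothetical):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def out_of_fold_outputs(expert, X, y, n_folds=5):
    """Each instance receives the output of an expert trained only on the
    folds that do NOT contain it (Eqs. 10-11), avoiding meta-overfit."""
    z = np.zeros((X.shape[0], 2))            # binary case: two class probs
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, held_idx in folds.split(X):
        model = clone(expert).fit(X[train_idx], y[train_idx])
        z[held_idx] = model.predict_proba(X[held_idx])
    return z

# Per Eq. (9), the k-th meta-training set pairs each instance's features
# with the out-of-fold outputs of every expert at layer k; stacking these
# across layers yields the higher-order meta-training datasets.
```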
In this illustrated embodiment, to learn the learnable model parameters 440 of the HEI model 420, previously trained experts in the expert hierarchy (e.g., initial experts at layer 210 and all augmented experts at higher layers) take training data (e.g., feature vectors) as input and generate their respective expert outputs (some experts generate their outputs based on expert outputs from experts at lower levels of the expert hierarchy). These expert outputs are then fed to the HEI model learning engine 410 as inputs. To learn the values of the learnable parameters 440, in each iteration, the HEI model learning engine 410 takes the expert outputs from the experts in the expert hierarchy and computes an integrated expert decision by integrating them using the current learnable model parameters in 440. This is performed by the ANN 430, which is configured using the current values of the learnable model parameters in 440. This ANN-generated integrated expert decision is then compared with the ground truth label corresponding to the training data to determine a discrepancy, if any. If the discrepancy warrants a modification (learning) of the current values of the learnable model parameters, the modifications to the learnable parameters are determined by minimizing a defined loss function based on the discrepancy. The iteration continues until the discrepancy meets a pre-defined convergence criterion. Upon convergence, the ANN 430 configured with the converged learnable model parameters 440 constitutes the learned HEI model 420, which can be used to combine expert outputs in a manner consistent with the knowledge learned from the training data.
An ANN is a network of neurons at different layers that are connected in some fashion to form a structure. An ANN may be structured in alternative ways, which accordingly determines the parameters of the architecture that can be learned during training, so that the converged network operates under these parameters in a manner consistent with the training data provided. Such parameters include the weights on the connections between neurons as well as the variables and constants associated with the node function(s) that each neuron performs. These are the embeddings 440-2 of the ANN; they are embedded in the operation of the ANN and are learnable parameters.
As discussed herein, via learning, the HEI model learning engine 410 is to learn a non-linear function embedded in the HEI model 420 which can be used to map a set of expert outputs to an integrated expert decision. To facilitate the learning, the integrated label prediction unit 540 in the HEI model learning engine 410 combines, based on the current values of learnable model parameters (e.g., the values of the embeddings 440-2), the input expert outputs to generate, at 525, an integrated expert decision (or label). As discussed herein, each piece of training data includes a ground truth label, which serves as the ultimate answer as to the label and can be used to facilitate learning. That is, if there is a discrepancy between the ground truth label from the training data and the predicted integrated expert decision, a loss is computed, at 535, by the loss assessment unit 550 in accordance with the parameters related to the loss function (e.g., 440-3). The computed loss is then used to evaluate, at 545, whether the loss is such that it satisfies a convergence condition expressed by convergence parameters 440-4.
If the loss is such that there is a convergence, determined at 545, it may mean that the current values of the learnable parameters in 440 are satisfactory, producing integration results substantially consistent with the training data. In this case, the learning process may end at 565. Otherwise, the integration parameter adjuster 560 updates, at 555, the values of the various learnable parameters to minimize the loss. In this scenario, the training enters the next iteration based on the next piece of training data. In the next iteration, the updated values of the learnable parameters are used to compute the integrated expert decision. The iterations may continue until the convergence condition is satisfied.
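A bare-bones PyTorch sketch of this iteration is shown below; the network shape, stand-in tensors, and the tolerance are illustrative assumptions, not the HEI model itself:

```python
import torch
import torch.nn as nn

# Stand-in expert outputs and ground-truth labels for the sketch.
n_expert_outputs, max_steps, tol = 8, 1000, 1e-3
expert_out = torch.rand(256, n_expert_outputs)
labels = torch.randint(0, 2, (256,)).float()

# Hypothetical integration network; its weights play the role of the
# learnable embeddings (440-2) discussed above.
hei = nn.Sequential(nn.Linear(n_expert_outputs, 32), nn.ReLU(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(hei.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(max_steps):
    logits = hei(expert_out).squeeze(-1)   # integrated expert decision
    loss = loss_fn(logits, labels)         # discrepancy vs. ground truth
    if loss.item() < tol:                  # convergence check (545)
        break
    opt.zero_grad()
    loss.backward()                        # determine parameter adjustments
    opt.step()                             # update learnable parameters (555)
```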
Below, an exemplary algorithmic implementation of training the original experts, generating and training the augmented experts, and learning the learnable parameters of the nonlinear integration model via an ANN architecture is disclosed:
This exemplary implementation operates on the unfolded concepts. In this exemplary implementation, the meta-training sets for experts from level 1 to level K are constructed in a, e.g., bottom-up progressive fashion following the cross-validation scheme (lines 2-7), with the $K$th layer of the meta-dataset used for end-to-end training of the meta-parameters (line 8); the meta-testing-time model can be obtained by adaptation on the support set (lines 9-13) and by combining the expert outputs according to the disclosed architecture (line 14). The above algorithm requires roughly $V \cdot \rho$ times more computation cost compared to vanilla differentiable architecture training, with $\rho$ being the ratio of the average training cost between the heterogeneous experts and the differentiable architecture.
To implement the various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems, and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar with them to adapt those technologies to the appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment, and as a result the drawings should be self-explanatory.
Computer 800, for example, includes COM ports 850 connected to and from a network connected thereto to facilitate data communications. Computer 800 also includes a central processing unit (CPU) 820, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 810, program storage and data storage of different forms (e.g., disk 870, read only memory (ROM) 830, or random-access memory (RAM) 840), for various data files to be processed and/or communicated by computer 800, as well as possibly program instructions to be executed by CPU 820. Computer 800 also includes an I/O component 860, supporting input/output flows between the computer and other components therein such as user interface elements 880. Computer 800 may also receive programming and data via network communications.
Hence, aspects of the methods of information analytics and management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with information analytics and management. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium, or a physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server. In addition, the techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or a hardware/firmware/software combination.
While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.
The present application is related to U.S. Patent Application No. ______ (Attorney Docket No. 146555.560778), entitled “SYSTEM AND METHOD FOR INTEGRATED LARGE-SCALE AUDIENCE TARGETING VIA AUGMENTED HETEROGENEOUS SUB-SYSTEMS” and U.S. Patent Application No. ______ (Attorney Docket No. 146555.562796), entitled “SYSTEM AND METHOD FOR INTEGRATING MULTIPLE EXPERT PREDICTIONS IN A NONLINEAR FRAMEWORK VIA LEARNING”, both of which are incorporated herein by reference in their entireties.