Importance Sampling with Bandwidth Constraints

Information

  • Patent Application
  • Publication Number
    20230162022
  • Date Filed
    November 24, 2021
  • Date Published
    May 25, 2023
Abstract
At an iteration k of a training procedure for training a deep neural network (DNN), a first computer system can sample a batch bk of data instances from a training dataset local to that computer system in a manner that mostly conforms to importance sampling probabilities of the data instances, but also applies a “stiffness” factor with respect to data instances appearing in batch bk−1 of a prior iteration k−1. This stiffness factor makes it more likely, or guarantees, that some portion of the data instances in prior batch bk−1—which is present on a second computer system holding the DNN—will be reused in current batch bk. The first computer system can then transmit the new data instances in batch bk to the second computer system and the second computer system can reconstruct batch bk using the received new data instances and its local copy of prior batch bk−1.
Description
BACKGROUND

Unless otherwise indicated, the subject matter described in this section is not prior art to the claims of the present application and is not admitted as being prior art by inclusion in this section.


Deep neural networks (DNNs), which are machine learning (ML) models composed of multiple layers of interconnected nodes, are widely used to solve tasks in various fields such as computer vision, natural language processing, telecommunications, bioinformatics, and so on. A DNN is typically trained via a batch-based stochastic gradient descent (SGD) training procedure that involves (1) randomly sampling a batch (sometimes referred to as a “minibatch”) of labeled data instances from a training dataset, (2) forward propagating the batch through the DNN to generate a set of predictions, (3) computing a difference (i.e., “loss”) between the predictions and the batch's labels, (4) performing backpropagation through the DNN with respect to the loss to compute a gradient estimate, (5) updating the DNN's parameters in accordance with the gradient estimate, and (6) iterating steps (1)-(5) until the DNN converges (i.e., reaches a state where the loss falls below a desired threshold). Once trained in this manner, the DNN can be applied during an inference phase to generate predictions for unlabeled data instances.
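For illustration only, the following Python sketch walks through steps (1)-(6) above using a simple linear model and synthetic data as stand-ins for a DNN and a real training dataset; all names, shapes, and hyperparameter values are hypothetical and not taken from this disclosure.

import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))                     # training data instances
w_true = rng.normal(size=D)
y = X @ w_true + 0.01 * rng.normal(size=N)      # labels

w = np.zeros(D)                                 # model parameters (stand-in for the DNN)
B, lr = 32, 0.05
for k in range(500):
    idx = rng.choice(N, size=B, replace=False)  # (1) sample a random batch
    preds = X[idx] @ w                          # (2) forward propagate the batch
    residual = preds - y[idx]
    loss = np.mean(residual ** 2)               # (3) loss between predictions and labels
    grad = 2.0 * X[idx].T @ residual / B        # (4) gradient estimate via backpropagation
    w -= lr * grad                              # (5) update the parameters
    if loss < 1e-4:                             # (6) iterate until convergence
        break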


Generally speaking, the use of larger datasets for training results in more accurate DNNs. However, as the amount of training data increases, the computational overhead and time needed to carry out the SGD training procedure also rise. To address this, importance sampling has been proposed as a technique for accelerating the training of DNNs. With importance sampling, each data instance in the training dataset is assigned an importance sampling probability that corresponds to the “importance” of the data instance to the training procedure, or in other words the degree to which that data instance contributes to progress of the training towards model convergence. Then, at each training iteration, data instances are sampled from the training dataset based on their respective importance sampling probabilities rather than at random, thereby causing more important data instances to be selected with higher likelihood than less important data instances and leading to an overall reduction in training time. It has been found that the optimal sampling probability for a given data instance is proportional to the norm (i.e., size) of the gradient computed for that data instance via SGD.
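As a minimal sketch of the sampling step under importance sampling (assuming per-instance gradient norms are available; the function names and values below are hypothetical, not part of this disclosure):

import numpy as np

def importance_probabilities(grad_norms):
    """p_i proportional to the per-instance gradient norm."""
    g = np.asarray(grad_norms, dtype=float)
    return g / g.sum()

def sample_batch(probs, batch_size, rng):
    """Sample a batch of indices according to the importance sampling probabilities."""
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)

rng = np.random.default_rng(0)
grad_norms = rng.gamma(2.0, size=1000)          # hypothetical per-instance gradient norms
p = importance_probabilities(grad_norms)
batch = sample_batch(p, batch_size=32, rng=rng)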


While existing importance sampling implementations work reasonably well if the training dataset and DNN are co-located, in many real-world scenarios the training dataset will be held by a first computer system (or group of systems) and the DNN will be trained by a second computer system (or group of systems) that is remote from the first computer system. In these scenarios, network congestion and/or other issues may introduce fluctuating bandwidth constraints that limit, to varying degrees, the amount of training data (and thus data instance batch size) that may be communicated from the first computer system to the second computer system in each training iteration. Because larger batch sizes generally result in faster training, a reduction in batch size caused by such bandwidth constraints can undesirably negate some or all of the speed gains provided by importance sampling.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 depicts an example environment in which embodiments of the present disclosure may be implemented.



FIG. 2 depicts an example DNN.



FIG. 3 depicts a flowchart for training a DNN via a batch-based SGD training procedure with importance sampling according to certain embodiments.



FIG. 4 depicts a workflow of an enhanced importance sampling solution according to certain embodiments.



FIG. 5 depicts a flowchart of a first implementation of the solution of FIG. 4 according to certain embodiments.



FIG. 6 depicts a flowchart of a second implementation of the solution of FIG. 4 according to certain embodiments.





DETAILED DESCRIPTION

In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details or can be practiced with modifications or equivalents thereof.


1. Overview

Embodiments of the present disclosure are directed to techniques for implementing importance sampling in the presence of bandwidth constraints. For example, consider a scenario in which (1) a first computer system holds (i.e., maintains in local storage) a training dataset, (2) a second computer system remote from the first computer system trains a DNN on that training dataset using a batch-based SGD training procedure with importance sampling, and (3) the first and second computer systems are subject to one or more bandwidth constraints that limit the amount of data that may be communicated between the systems over the course of the training procedure.


In this and other similar scenarios, the first computer system can, at each training iteration k, sample data instances from the training dataset for inclusion in batch bk of iteration k in a manner that mostly conforms to the conventional (e.g., optimal or near-optimal) importance sampling probabilities of the data instances, but also applies a “stiffness” factor with respect to data instances appearing in batch bk−1 of prior iteration k−1. This stiffness factor makes it more likely, or guarantees, that some portion of the data instances in prior batch bk−1—which are already present on the second computer system by virtue of being processed in iteration k−1—will be reused (i.e., included again) in current batch bk. The first computer system can then transmit the “new” data instances in batch bk (i.e., those that are not also in prior batch bk−1) to the second computer system, and the second computer system can reconstruct the entirety of batch bk using the received new data instances and local copies of the reused data instances from batch bk−1. Finally, the second computer system can execute iteration k of the training procedure using reconstructed batch bk.


In one set of embodiments, the stiffness factor can be implemented probabilistically by modifying the importance sampling probability distribution used for sampling batch bk in a way that favors/prioritizes data instances appearing in prior batch bk−1 over those not appearing in bk−1 according to a weight Qk. Weight Qk can be chosen such that, on average, the number of new data instances in batch bk (and thus the number of data instances that need to be sent from the first computer system to the second computer system in iteration k) will be less than or equal to a data instance limit Lk imposed by bandwidth constraints in effect at the time of k. In another set of embodiments, the stiffness factor can be implemented deterministically by bounding the number of new data instances in batch bk according to a fixed value nk that is less than or equal to limit Lk.


With this general approach—which effectively recycles certain data instances from prior batches that are locally available to the second computer system for use in subsequent batches—the amount of training data that is sent from the first computer system to the second computer system in each iteration can be substantially reduced, thereby allowing the training procedure to adhere to the bandwidth constraints placed on those systems.


2. Example Environment and High-Level Solution Design


FIG. 1 depicts an example environment 100 in which embodiments of the present disclosure may be implemented. As shown, environment 100 includes two computer systems S1 and S2 (reference numerals 102 and 104) that are communicatively coupled via a network 106. Computer system S1 holds a training dataset X (reference numeral 108) that comprises N data instances x1, . . . , xN, each associated with a label yi indicating the correct prediction/output for that data instance and an importance sampling probability pi indicating the training importance of that data instance.


Computer system S2 holds a DNN M (reference numeral 110) and is configured to train M on training dataset X. A DNN is a type of ML model that comprises a collection of nodes, also known as neurons, that are organized into layers and interconnected via directed edges. For instance, FIG. 2 depicts an example representation 200 of DNN M that includes a total of fourteen nodes and four layers 1-4. The nodes and edges are associated with parameters (e.g., weights and biases, not shown) that control how a data instance, when provided as input via the first layer, is forward propagated through the DNN to generate a prediction, which is output by the last layer. These parameters are the aspects of the DNN that are adjusted via training in order to optimize the DNN's accuracy (i.e., ability to generate correct predictions).
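As a rough illustration of forward propagation through such a network (the layer sizes below are illustrative only and are not taken from FIG. 2; weights are drawn randomly):

import numpy as np

def forward(params, x):
    """Forward propagate an input through a small fully connected DNN."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)   # hidden layers with ReLU activation
    W, b = params[-1]
    return h @ W + b                     # last layer outputs the prediction

rng = np.random.default_rng(0)
layer_sizes = [5, 4, 3, 2]               # illustrative: fourteen nodes across four layers
params = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
prediction = forward(params, rng.normal(size=5))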



FIG. 3 depicts a flowchart 300 that may be executed by computer systems S1 and S2 for training DNN M on training dataset X using a batch-based SGD training procedure with conventional importance sampling. Generally speaking, the goal of this training procedure is to minimize a risk function











F_N(x) := \frac{1}{N} \sum_{i=1}^{N} f(x, \xi_i) := \frac{1}{N} \sum_{i=1}^{N} f_i(x)

where x represents the parameters of DNN M (which determine the output, i.e., prediction, that M generates), ξi represents a data instance xi and its corresponding label yi in training dataset X, and f(x, ξi) is a loss function computed on x and ξi. Flowchart 300 depicts the steps performed in a single training iteration k.


Starting with steps 302 and 304, computer system S1 samples a batch bk of data instances from training dataset X in accordance with current importance sampling probabilities p1k, . . . , pNk in X and transmits bk to computer system S2.


At step 306, computer system S2 forward propagates batch bk through DNN M, resulting in a set of predictions. Computer system S2 further computes a loss between the predictions and the labels of the data instances in batch bk using loss function f (step 308) and performs backpropagation through DNN M with respect to the computed loss, resulting in a gradient estimate Gk for bk (step 310). In a particular embodiment, this gradient estimate can be computed as shown below, where bik represents data instance xi in batch bk and pbikk represents the importance sampling probability of bik at iteration k:










G_k := \frac{1}{\lvert b_k \rvert} \sum_{i=1}^{\lvert b_k \rvert} \frac{1}{N \, p^k_{b_i^k}} \nabla f_{b_i^k}(x^k) \qquad \text{(Listing 1)}







Finally, computer system S2 updates the parameters of DNN M using gradient estimate Gk (step 312), sends a message to computer system S1 indicating completion of the current iteration k (step 314), and the flowchart ends. Steps 302-314 are thereafter repeated for further iterations until DNN M converges (i.e., achieves a desired level of accuracy) or some other termination criterion, such as a maximum number of training iterations, is reached.
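A minimal sketch of steps 306-312 is shown below, assuming the per-instance gradients for batch bk are available as rows of a matrix; the 1/(N·p) weighting mirrors Listing 1 and compensates for the non-uniform sampling. All names, shapes, and values are hypothetical.

import numpy as np

def importance_weighted_gradient(per_instance_grads, batch_probs, N):
    """Listing 1: G_k = (1/|b_k|) * sum_i grad_i / (N * p_{b_i^k}^k)."""
    weights = 1.0 / (N * np.asarray(batch_probs))            # 1 / (N * p) per sampled instance
    return (weights[:, None] * per_instance_grads).mean(axis=0)

rng = np.random.default_rng(0)
per_instance_grads = rng.normal(size=(32, 10))               # gradients for a batch of 32 instances
batch_probs = np.full(32, 1.0 / 1000)                        # sampling probability of each instance
G_k = importance_weighted_gradient(per_instance_grads, batch_probs, N=1000)
params = np.zeros(10)
params -= 0.05 * G_k                                         # step 312: parameter update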


In some embodiments, prior to sending the message to computer system S1 at step 314, computer system S2 can compute, based on the current state of DNN M, updated importance sampling probabilities p1k+1, . . . , pNk+1 corresponding to data instances x1, . . . , xN for use in next training iteration k+1 and can include these updated importance sampling probabilities in the message. According to one approach, the computation of each pik+1 can comprise taking the norm (i.e., size) of the gradient for data instance xi in iteration k and dividing that value by the sum of the gradient norms of all data instances as shown below:










p_i^{k+1} = \frac{\lVert \nabla f_i(x^k) \rVert}{\sum_{j \in X} \lVert \nabla f_j(x^k) \rVert} \qquad \text{(Listing 2)}







Computer system S1 can then receive the updated importance sampling probabilities and store them in training dataset X, thereby overwriting prior probabilities p1k, . . . , pNk. In alternative embodiments, the computation of updated importance sampling probabilities p1k+1, . . . , pNk+1 can be performed by a different entity and/or via a different method, such as an ML-based gradient approximation approach that is disclosed in commonly owned U.S. patent application Ser. No. 17/518,107 (Atty. Docket No. H833 (86-38800)), entitled “Importance Sampling via Machine Learning (ML)-Based Gradient Approximation.”
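A minimal sketch of the computation in Listing 2, assuming the per-instance gradient norms at iteration k are available (whether computed exactly or approximated as noted above); the numeric values are hypothetical.

import numpy as np

def updated_probabilities(grad_norms_k):
    """Listing 2: p_i^{k+1} = ||grad f_i(x^k)|| / sum_j ||grad f_j(x^k)||."""
    g = np.asarray(grad_norms_k, dtype=float)
    return g / g.sum()

grad_norms_k = np.array([0.5, 2.0, 1.0, 0.25])    # hypothetical norms for a 4-instance dataset
p_next = updated_probabilities(grad_norms_k)       # -> [0.1333..., 0.5333..., 0.2667..., 0.0667...]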


As mentioned previously, in some scenarios computer systems S1 and S2 may be subject to one or more hard or soft bandwidth constraints that place a limit on the number of data instances that may be communicated from S1 to S2 at each training iteration k. A hard network bandwidth constraint is one where the data instance limit cannot be exceeded due to, e.g., characteristics of the systems or the network. For example, computer system S2 may be an edge device (e.g., a smartphone, tablet, Internet of Things (IoT) device, etc.) with unstable network reception and/or network hardware that is constrained by power limitations. A soft network bandwidth constraint is one where the data instance limit can be exceeded, but there are reasons/motivations to avoid doing so. For example, computer system S1 may be part of a cloud storage service platform such as Amazon S3 that charges customers a fee for every M units of data that are retrieved from the platform, thereby motivating the owner/operator of computer system S2 to stay within the per-iteration limit in order to minimize training costs. The presence of these hard or soft bandwidth constraints is problematic because they can significantly lengthen the overall time needed to train DNN M.


To address the foregoing and other similar problems, FIG. 4 depicts a high-level workflow 400 of an enhanced importance sampling solution that can be implemented by computer systems S1 and S2 of FIG. 1 as part of training DNN M according to certain embodiments. Workflow 400 illustrates the steps of this enhanced solution with respect to a single training iteration k.


Starting with step 402, computer system S1 can sample, based at least in part on importance sampling probabilities p1k, . . . , pNk, a batch bk of data instances from training dataset X composed of two logically distinct sub-batches: a first sub-batch reusedk that comprises zero or more data instances from batch bk−1 of immediately prior iteration k−1 and a second sub-batch newk that comprises zero or more data instances from the entirety of training dataset X (or a subset of X that excludes batch bk−1). It is assumed that the sizes of these two sub-batches add up to a desired batch size B for batch bk. Further, it is assumed that computer system S2 maintains, in a local memory or storage, a copy 404 of the data instances in batch bk−1 by virtue of having processed those data instances during prior iteration k−1.


In various embodiments, computer system S1 can perform the sampling at step 402 in a manner that makes it likely, or guarantees, that the size of sub-batch newk will not exceed a limit Lk on the number of data instances that may be sent from S1 to S2 in iteration k, per the bandwidth constraints in effect at the time of k. For example, according to one set of embodiments (referred to herein as the “probabilistic approach” and detailed in section (3) below), computer system S1 can select a weight Qk between 0 and 1, modify (or compute) importance sampling probabilities p1k, . . . , pNk such that the sum of the probabilities of the data instances in prior batch bk−1 equals Qk (and conversely, the sum of the probabilities of the data instances not in bk−1 equals (1−Qk)), and sample batch bk from training dataset X in accordance with these modified importance sampling probabilities (resulting in a natural partitioning of data instances in bk into sub-batches newk and reusedk). By selecting a sufficiently large value for weight Qk, computer system S1 can bias the sampling process to probabilistically favor the selection of data instances in prior batch bk−1 over data instances that are not in bk−1 (while maintaining the relative differences in training importance between the data instances in each of these groups) and thereby make it likely that the size of sub-batch newk will not exceed data instance limit Lk.


According to another set of embodiments (referred to herein as the “deterministic approach” and detailed in section (4) below), computer system S1 can directly fix the size of sub-batch newk to a value nk that is less than or equal to data instance limit Lk. In addition, computer system S1 can fix the size of sub-batch reusedk to B−nk. Computer system S1 can then perform two independent sampling procedures as part of step 402: (1) a first sampling procedure that samples nk data instances from training dataset X for inclusion in sub-batch newk based on importance sampling probabilities p1k, . . . , pNk in X, and (2) a second sampling procedure that samples B−nk data instances from prior batch bk−1 for inclusion in sub-batch reusedk based on another set of sampling probabilities qb1k−1k, . . . , qbBk−1k that are specific to the members of bk−1. Sampling probabilities qb1k−1k, . . . , qbBk−1k can be defined in several different ways, which are discussed in section (4).


Once batch bk and its constituent sub-batches newk and reusedk have been sampled/determined, computer system S1 can transmit the full data content of the data instances in newk, along with identifiers (IDs) of the data instances in reusedk, to computer system S2 (step 406). In response, computer system S2 can reconstruct batch bk by retrieving, from its local copy 404 of prior batch bk−1, the data instances identified as being included in sub-batch reusedk and combining those data instances with the received data instances in sub-batch newk (step 408).
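The following sketch illustrates this exchange using Python dictionaries keyed by data instance ID as stand-ins for the systems' local storage; the message format and helper names are hypothetical, and duplicates within a batch are ignored for simplicity.

# On S1: split batch b_k into new content and reused IDs.
def build_message(batch_k, prev_batch_ids):
    """batch_k maps instance ID -> content; prev_batch_ids are the IDs in b_{k-1}."""
    new = {i: x for i, x in batch_k.items() if i not in prev_batch_ids}
    reused_ids = [i for i in batch_k if i in prev_batch_ids]
    return new, reused_ids

# On S2: rebuild b_k from the received new instances and the local copy of b_{k-1}.
def reconstruct_batch(new, reused_ids, local_prev_batch):
    batch_k = dict(new)
    for i in reused_ids:
        batch_k[i] = local_prev_batch[i]       # pull reused content from local copy 404
    return batch_k

prev = {1: "x1", 2: "x2", 3: "x3"}             # local copy of b_{k-1} on S2
new, reused_ids = build_message({2: "x2", 3: "x3", 7: "x7"}, prev.keys())
b_k = reconstruct_batch(new, reused_ids, prev)  # {7: "x7", 2: "x2", 3: "x3"}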


Finally, at step 410, computer system S2 can carry out the training of DNN M for iteration k using reconstructed batch bk (per, e.g., steps 306-314 of flowchart 300) and workflow 400 can end.


With the enhanced importance sampling solution shown in FIG. 4, a number of advantages are achieved. First, because computer system S1 only needs to send the contents of the data instances in sub-batch newk of batch bk to computer system S2 (due to the existence of a local copy of prior batch bk−1 at S2), this solution significantly reduces the amount of training data that needs to be communicated over network 106 in each training iteration, which in turn enables the training procedure to operate successfully in the presence of network bandwidth constraints. As mentioned previously, in various embodiments sub-batch newk can be sampled/constructed in a manner that ensures, or at least makes it probable, that its size will not exceed a data instance limit Lk imposed by the bandwidth constraints present at the time of iteration k.


Second, because this solution still leverages, at least in part, importance sampling probabilities to sample data instances and allows for the use of a constant (e.g., large) batch size, the gains in training speed provided by these features/optimizations can be mostly preserved.


It should be appreciated that FIGS. 1-4 and the foregoing description are illustrative and not intended to limit embodiments of the present disclosure. For example, although workflow 400 of FIG. 4 indicates that computer system S1 sends IDs of the data instances in sub-batch reusedk to computer system S2 at step 406 in order to facilitate reconstruction of batch bk at S2, in some embodiments this may not be needed. For example, it is possible for computer system S2 to independently carry out the exact same sampling process executed by computer system S1 at step 402 (using, e.g., a mutually agreed-upon random number generator seed value) and thereby determine the data instances in sub-batch reusedk. Accordingly, in these embodiments computer system S1 can simply provide the content of the data instances in sub-batch newk to computer system S2 at step 406 and S2 can thereafter reconstruct batch bk using that data and its local sampling of reusedk.
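For example, under the assumption that both systems agree on the reuse count and a seed value (a protocol detail not specified here), a shared pseudo-random generator yields identical reused IDs on S1 and S2 without any ID transfer:

import numpy as np

def sample_reused_ids(prev_batch_ids, reuse_count, seed):
    """Both S1 and S2 run this with the same arguments and obtain the same IDs."""
    rng = np.random.default_rng(seed)
    return rng.choice(prev_batch_ids, size=reuse_count, replace=False)

ids_on_s1 = sample_reused_ids([11, 42, 7, 99], reuse_count=2, seed=1234)
ids_on_s2 = sample_reused_ids([11, 42, 7, 99], reuse_count=2, seed=1234)
assert (ids_on_s1 == ids_on_s2).all()    # identical on both systems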


Further, although computer systems S1 and S2 are shown in FIG. 1 as singular systems, each of these entities may be implemented using multiple computer systems for increased performance, redundancy, and/or other reasons.


Yet further, in certain embodiments training dataset X and DNN M may be held by two different components C1 and C2 of a single computer system S that are subject to inter-component, rather than network, bandwidth constraints. For example, training dataset X may be stored in a memory or storage component that is accessible by a central processing unit (CPU) of S, while DNN M may be held and trained by a graphics processing unit (GPU) of S that is coupled with the CPU via a peripheral bus. In these embodiments, the same techniques described with respect to computer systems S1 and S2 may be applied to components C1 and C2 for implementing importance sampling in the presence of bandwidth constraints (arising out of, e.g., bus bandwidth limitations, bus contention, etc.) between the components.


3. Probabilistic Approach


FIG. 5 depicts a flowchart 500 that may be performed by computer systems S1 and S2 for implementing the enhanced importance sampling solution of FIG. 4 via the probabilistic approach according to certain embodiments. Like workflow 400, flowchart 500 illustrates the steps performed in a single training iteration k.


Starting with step 502, computer system S1 can select a weight value Qk where 0≤Qk≤1 and where Qk is intended to bias the sampling of data instances for batch bk of iteration k in a manner that favors data instances appearing in batch bk−1 of prior iteration k−1 over those not appearing in batch bk−1. In certain embodiments, computer system S1 can select Qk in consideration of data instance limit Lk mentioned previously, such that it will be unlikely for the number of new data instances in batch bk (or in other words, the size of sub-batch newk) to exceed Lk.
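As one hypothetical heuristic (not prescribed by this disclosure), if batch bk of size B is drawn with replacement, the expected number of draws falling outside bk−1 is B·(1−Qk); requiring B·(1−Qk) ≤ Lk then suggests choosing Qk ≥ 1 − Lk/B, for example:

def choose_stiffness_weight(batch_size_B, limit_Lk):
    """Pick Q_k so that the expected size of new_k is at most L_k."""
    q = 1.0 - limit_Lk / batch_size_B
    return min(max(q, 0.0), 1.0)         # clamp to the valid range [0, 1]

Q_k = choose_stiffness_weight(batch_size_B=256, limit_Lk=64)   # -> 0.75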


At step 504, computer system S1 can modify (or compute) importance sampling probabilities p1k, . . . , pNk for the data instances in training dataset X according to the constraint that the sum of the probabilities for the data instances in batch bk−1 equals Qk (i.e., \sum_{i \in b_{k-1}} p_{b_i^{k-1}}^k = Q_k). In the scenario where computer system S1 computes importance sampling probabilities p1k, . . . , pNk from scratch, S1 can compute pik for each data instance xi in batch bk−1 (i.e., ∀i ϵ bk−1) and each data instance xi not in batch bk−1 (i.e., ∀i ∉ bk−1) as follows according to one embodiment:















\forall i \in b_{k-1}: \quad p_i^k = Q_k \cdot \frac{\lVert \nabla f_i(x^k) \rVert}{\sum_{j \in b_{k-1}} \lVert \nabla f_j(x^k) \rVert}

\forall i \notin b_{k-1}: \quad p_i^k = (1 - Q_k) \cdot \frac{\lVert \nabla f_i(x^k) \rVert}{\sum_{j \notin b_{k-1}} \lVert \nabla f_j(x^k) \rVert}

(Listing 3)







At step 506, computer system S1 can sample data instances from training dataset X in accordance with the importance sampling probabilities modified/computed at step 504, resulting in batch bk comprising sub-batches newk and reusedk. Computer system S1 can then transmit (1) the content of the data instances in sub-batch newk and (2) IDs of the data instances in sub-batch reusedk (without the content of those data instances) to computer system S2 (step 508).
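A minimal sketch of steps 504 and 506 on computer system S1 (including the split needed for step 508) is shown below, assuming per-instance gradient norms for the full dataset are available; the probability computation follows Listing 3 and all sizes and names are hypothetical.

import numpy as np

def modified_probabilities(grad_norms, prev_idx, Q_k):
    """Listing 3: give total mass Q_k to b_{k-1} and 1 - Q_k to the rest,
    preserving the relative gradient-norm weighting inside each group."""
    g = np.asarray(grad_norms, dtype=float)
    in_prev = np.zeros(len(g), dtype=bool)
    in_prev[list(prev_idx)] = True
    p = np.empty_like(g)
    p[in_prev] = Q_k * g[in_prev] / g[in_prev].sum()
    p[~in_prev] = (1.0 - Q_k) * g[~in_prev] / g[~in_prev].sum()
    return p / p.sum()                                      # guard against floating-point drift

rng = np.random.default_rng(0)
grad_norms = rng.gamma(2.0, size=1000)
prev_idx = rng.choice(1000, size=256, replace=False)        # indices of the instances in b_{k-1}
p = modified_probabilities(grad_norms, prev_idx, Q_k=0.75)  # step 504
batch_k = rng.choice(1000, size=256, p=p)                   # step 506: sample b_k
new_k = np.setdiff1d(batch_k, prev_idx)                     # contents to transmit to S2
reused_k = np.intersect1d(batch_k, prev_idx)                # only IDs transmitted to S2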


At step 510, computer system S2 can receive (1) and (2) from computer system S1 and can reconstruct the entirety of batch bk using the received information and its local copy 404 of prior batch bk−1. For example, for each data instance ID received for sub-batch reusedk, computer system S2 can retrieve the content of that data instance from local copy 404. As part of this step, computer system S2 can overwrite local copy 404 with the contents of reconstructed batch bk.


Finally, at step 512, computer system S2 can execute the training of DNN M at iteration k using reconstructed batch bk and flowchart 500 can end. Although not shown in flowchart 500, in certain embodiments computer system S2 may transmit a message to computer system S1 at the conclusion of iteration k that includes gradient (or gradient norm) information which S1 can use to compute updated importance sampling probabilities (per step 504) in the next iteration k+1.


In alternative embodiments, computer system S2 may directly compute updated importance sampling probabilities in accordance with steps 502 and 504 and provide those probabilities to computer system S1 for use in the next iteration.


4. Deterministic Approach


FIG. 6 depicts a flowchart 600 that may be performed by computer systems S1 and S2 for implementing the enhanced importance sampling solution of FIG. 4 via the deterministic approach according to certain embodiments. Like workflows/flowcharts 400 and 500, flowchart 600 illustrates the steps performed in a single training iteration k.


Starting with step 602, computer system S1 can determine or retrieve a value nk indicating the number of data instances whose contents will be transmitted to computer system S2 as part of batch bk of iteration k (or in other words, the size of sub-batch newk), where nk is less than or equal to the data instance limit Lk.


At step 604, computer system S1 can sample nk data instances from training dataset X (or from a subset of data instances in X that excludes those in prior batch bk−1) in accordance with their current importance sampling probabilities p1k, . . . , pNk. This group of nk data instances constitutes sub-batch newk of batch bk.


In addition, at step 606, computer system S1 can sample B−nk data instances from prior batch bk−1 (where B is the batch size for bk) in accordance with a set of sampling probabilities qb1k−1k, . . . , qbBk−1k determined for the data instances in bk−1. This group of B−nk data instances constitutes sub-batch reusedk of batch bk.


In one set of embodiments, computer system S1 can define sampling probabilities qb1k−1k, . . . , qbBk−1k using a uniform distribution such that all the probabilities are equal (i.e., qb1k−1k = . . . = qbBk−1k := q, where 0 ≤ q < 1 and Σi=1B qbik−1k = 1). In these embodiments, if a data instance xi in prior batch bk−1 also appeared in the batch before that one (i.e., bk−2), computer system S1 can optionally “penalize” xi—or in other words reduce its likelihood of being sampled for current batch bk—by reducing its sampling probability qbik−1k by some factor and increasing the sampling probabilities of all other data instances in bk−1 accordingly.
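A minimal sketch of steps 604 and 606 under the uniform-q option just described, with a hypothetical penalty factor of 0.5 for data instances carried over from bk−2 (the disclosure does not fix a particular factor); all sizes and names are illustrative.

import numpy as np

def sample_deterministic(grad_norms, prev_idx, prev_prev_idx, B, n_k, rng, penalty=0.5):
    """new_k: n_k instances by importance; reused_k: B - n_k instances from b_{k-1}."""
    g = np.asarray(grad_norms, dtype=float)
    new_k = rng.choice(len(g), size=n_k, p=g / g.sum())          # step 604

    prev_idx = np.asarray(prev_idx)
    q = np.full(len(prev_idx), 1.0 / len(prev_idx))              # uniform q over b_{k-1}
    q[np.isin(prev_idx, prev_prev_idx)] *= penalty               # penalize carry-overs from b_{k-2}
    q /= q.sum()                                                 # redistribute the removed mass
    reused_k = rng.choice(prev_idx, size=B - n_k, p=q)           # step 606
    return new_k, reused_k

rng = np.random.default_rng(0)
grad_norms = rng.gamma(2.0, size=1000)
prev_idx = rng.choice(1000, size=256, replace=False)             # b_{k-1}
prev_prev_idx = rng.choice(1000, size=256, replace=False)        # b_{k-2}
new_k, reused_k = sample_deterministic(grad_norms, prev_idx, prev_prev_idx,
                                       B=256, n_k=64, rng=rng)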


In another set of embodiments, computer system S1 can define sampling probabilities qb1k−1k, . . . , qbBk−1k to reflect the relative training importance of the data instances in batch bk−1, thereby increasing the probability that more important data instances in bk−1 will be sampled at step 606. In a particular embodiment, this can be achieved by computing each qbik−1k as follows:










q^k_{b_i^{k-1}} = \frac{\lVert \nabla f_i(x^k) \rVert}{\sum_{j \in b_{k-1}} \lVert \nabla f_j(x^k) \rVert} \qquad \text{(Listing 4)}







Upon completing steps 604 and 606, computer system S1 can then transmit (1) the content of the data instances in sub-batch newk and (2) IDs of the data instances in sub-batch reusedk (without the content of those data instances) to computer system S2 (step 608). In response, computer system S2 can receive (1) and (2) from computer system S1 and can reconstruct the entirety of batch bk using the received information and its local copy 404 of prior batch bk−1 (step 610). For example, for each data instance ID received for sub-batch reusedk, computer system S2 can retrieve the content of that data instance from local copy 404. As part of this step, computer system S2 can further overwrite local copy 404 with the contents of reconstructed batch bk.


Finally, at step 612, computer system S2 can execute the training of DNN M at iteration k using reconstructed batch bk and flowchart 600 can end.


Certain embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.


Further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a generic computer system comprising one or more general purpose processors (e.g., Intel or AMD x86 processors) selectively activated or configured by program code stored in the computer system. In particular, various generic computer systems may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.


Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any storage device, based on any existing or subsequently developed technology, that can store data and/or computer programs in a non-transitory state for access by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid state disk), persistent memory, NVMe device, a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.


Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.


As used in the description herein and throughout the claims that follow, “a,” “an,” and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.


The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations, and equivalents can be employed without departing from the scope hereof as defined by the claims.

Claims
  • 1. A method comprising: sampling, by a first computer system, a batch of data instances from a training dataset local to the first computer system, wherein the sampling is based at least in part on importance sampling probabilities associated with the training dataset, and wherein the batch is composed of a first sub-batch of new data instances not present in a prior batch and a second sub-batch of reused data instances present in the prior batch; and transmitting, by the first computer system, contents of the new data instances in the first sub-batch and identifiers of the reused data instances in the second sub-batch to a second computer system.
  • 2. The method of claim 1 wherein the second computer system: reconstructs the batch using the contents of the new data instances, the identifiers of the reused data instances, and a local copy of the prior batch; and executes an iteration of a batch-based training procedure for training a local machine learning (ML) model using the reconstructed batch.
  • 3. The method of claim 1 wherein the first and second computer systems are subject to one or more bandwidth constraints that place a limit on a number of data instances that may be communicated between the first and second computer systems, and wherein a size of the first sub-batch is less than or equal to the limit.
  • 4. The method of claim 1 wherein the sampling comprises selecting a weight that favors sampling of data instances present in the prior batch.
  • 5. The method of claim 4 wherein the sampling further comprises: modifying or computing the importance sampling probabilities based on the weight; and sampling data instances from the training dataset in accordance with the modified or computed importance sampling probabilities.
  • 6. The method of claim 1 wherein the sampling comprises: setting a size of the first sub-batch to a value n; and sampling n data instances from the training dataset in accordance with the importance sampling probabilities.
  • 7. The method of claim 6 wherein the sampling further comprises: sampling B−n data instances from the prior batch in accordance with a set of sampling probabilities different from the importance sampling probabilities, wherein B is a desired batch size for the batch.
  • 8. A non-transitory computer readable storage medium having stored thereon program code executable by a first computer system holding a training dataset, the program code causing the first computer system to execute a method comprising: sampling a batch of data instances from the training dataset, wherein the sampling is based at least in part on importance sampling probabilities associated with the training dataset, and wherein the batch is composed of a first sub-batch of new data instances not present in a prior batch and a second sub-batch of reused data instances present in the prior batch; and transmitting contents of the new data instances in the first sub-batch and identifiers of the reused data instances in the second sub-batch to a second computer system.
  • 9. The non-transitory computer readable storage medium of claim 8 wherein the second computer system: reconstructs the batch using the contents of the new data instances, the identifiers of the reused data instances, and a local copy of the prior batch; and executes an iteration of a batch-based training procedure for training a local machine learning (ML) model using the reconstructed batch.
  • 10. The non-transitory computer readable storage medium of claim 8 wherein the first and second computer systems are subject to one or more bandwidth constraints that place a limit on a number of data instances that may be communicated between the first and second computer systems, and wherein a size of the first sub-batch is less than or equal to the limit.
  • 11. The non-transitory computer readable storage medium of claim 8 wherein the sampling comprises selecting a weight that favors sampling of data instances present in the prior batch.
  • 12. The non-transitory computer readable storage medium of claim 11 wherein the sampling further comprises: modifying or computing the importance sampling probabilities based on the weight; and sampling data instances from the training dataset in accordance with the modified or computed importance sampling probabilities.
  • 13. The non-transitory computer readable storage medium of claim 8 wherein the sampling comprises: setting a size of the first sub-batch to a value n; and sampling n data instances from the training dataset in accordance with the importance sampling probabilities.
  • 14. The non-transitory computer readable storage medium of claim 13 wherein the sampling further comprises: sampling B−n data instances from the prior batch in accordance with a set of sampling probabilities different from the importance sampling probabilities, wherein B is a desired batch size for the batch.
  • 15. A computer system comprising: a processor; a storage component holding a training dataset; and a non-transitory computer readable medium having stored thereon program code that, when executed by the processor, causes the processor to: sample a batch of data instances from the training dataset, wherein the sampling is based at least in part on importance sampling probabilities associated with the training dataset, and wherein the batch is composed of a first sub-batch of new data instances not present in a prior batch and a second sub-batch of reused data instances present in the prior batch; and transmit contents of the new data instances in the first sub-batch and identifiers of the reused data instances in the second sub-batch to another computer system.
  • 16. The computer system of claim 15 wherein said another computer system: reconstructs the batch using the contents of the new data instances, the identifiers of the reused data instances, and a local copy of the prior batch; and executes an iteration of a batch-based training procedure for training a local machine learning (ML) model using the reconstructed batch.
  • 17. The computer system of claim 15 wherein the computer system and said another computer system are subject to one or more bandwidth constraints that place a limit on a number of data instances that may be communicated between them, and wherein a size of the first sub-batch is less than or equal to the limit.
  • 18. The computer system of claim 15 wherein the program code that causes the processor to sample the batch comprises program code that causes the processor to select a weight that favors sampling of data instances present in the prior batch.
  • 19. The computer system of claim 15 wherein the program code that causes the processor to sample the batch further comprises program code that causes the processor to: modify or compute the importance sampling probabilities based on the weight; and sample data instances from the training dataset in accordance with the modified or computed importance sampling probabilities.
  • 20. The computer system of claim 15 wherein the program code that causes the processor to sample the batch comprises program code that causes the processor to: set a size of the first sub-batch to a value n; and sample n data instances from the training dataset in accordance with the importance sampling probabilities.
  • 21. The computer system of claim 15 wherein the program code that causes the processor to sample the batch further comprises program code that causes the processor to: sample B−n data instances from the prior batch in accordance with a set of sampling probabilities different from the importance sampling probabilities, wherein B is a desired batch size for the batch.