The present disclosure pertains to the field of machine learning technologies, such as but not necessarily limited to the training of machine learning models, and in particular to a method and apparatus for the distributed training of machine learning models in which large amounts of update vector data are communicated.
Machine learning includes classes of computer-implemented solutions and applications in which computers are trained and optimized to perform tasks without being explicitly programmed to do so. Machine learning models may be advantageously used for problems where it would be challenging for a human to create the needed algorithms manually. Examples of applications that benefit from machine learning solutions include self-driving vehicles, language translation, fraud detection, weather forecasting, and others.
Machine learning models are trained using sample datasets selected for a particular application. For example, a character recognition model may be trained using a database of handwriting samples. Training data includes input data and information indicating the correct solution to the problem and is used to train and improve the machine learning model until it produces sufficiently accurate results. Training can involve very large datasets and can require significant time to produce a sufficiently accurate model. Solutions involving decentralized, distributed computers and multiple computer nodes connected through computer networks have been used to decrease training time.
Decentralized optimization using multiple computer nodes, where update training vectors are exchanged among nodes, has become the norm for training machine learning models on large datasets. With the need to train bigger models on ever-growing datasets, scalability of communications between computer nodes has become a concern. A potential solution to growing dataset sizes is to increase the number of nodes; however, communication among nodes can become a processing bottleneck, and communication time can account for a significant portion of the overall machine learning model training time.
Therefore, there is a need for a method and apparatus for optimizing computer node updates by minimizing the size of update vector transmissions, thereby obviating or mitigating one or more limitations of the prior art, for example by reducing communication overhead and the time and bandwidth required to communicate update vectors, while minimizing any impact on the convergence rate.
This background information is provided to reveal information believed by the applicant to be of possible relevance to the present disclosure. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present disclosure.
An object of embodiments of the present disclosure is to provide a method and apparatus for compressing a time series of vector data for transmission in a network such as a computer network. Embodiments use vector data where there is a temporal correlation between consecutive vector data of the time series. The time series of vector data may also be referred to as time correlated vector data.
Embodiments may be used in applications where machine learning models are trained using a plurality of computer nodes connected by a computer network, which may be configured in a master-worker arrangement where one master computer node coordinates update vector calculations performed by one or more worker computer nodes.
Embodiments may use error-feedback to improve compression rates without decreasing the convergence rate while training a machine learning model.
In accordance with embodiments of the present disclosure, there is provided a method of communicating vector data within a network. The method includes obtaining, by a transmitting node, a first vector data including a plurality of elements. Then, selecting, by the transmitting node, a subset of elements of the plurality of elements and sending, by the transmitting node, the subset of elements to a receiving node. Also, estimating, by the transmitting node, a plurality of elements not included in the subset of elements based on a previously transmitted subset of elements based on a second vector data, and forming, by the transmitting node, a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In further embodiments, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is transmitted, a predicted value of the one of the subset of elements and resetting a counter, and setting, when one of the subset of elements is not transmitted, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In further embodiments, the subset of elements is selected based on a criterion, and the criterion is an absolute value of each of the plurality of elements of the first vector data.
In further embodiments, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
In further embodiments, the first vector data and the second vector data are part of a time series of vectors.
In further embodiments, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
In accordance with embodiments of the present disclosure, there is provided a network node for transmitting vector data over a network connection. The network node includes a processor and a non-transient memory for storing instructions which when executed by the processor cause the network node to read a first vector data including a plurality of elements, select a subset of elements of the plurality of elements, and send the subset of elements to a receiving node. Also, to estimate a plurality of elements not included in the subset of elements based on a previously transmitted subset of elements based on a second vector data, and to form a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In further embodiments, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is transmitted, a predicted value of the one of the subset of elements and resetting a counter, and setting, when one of the subset of elements is not transmitted, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In further embodiments, the subset of elements is selected based on a criterion, and the criterion is an absolute value of each of the plurality of elements of the first vector data.
In further embodiments, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
In further embodiments, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
In accordance with embodiments of the present disclosure, there is provided a network node for receiving vector data over a network connection. The network node includes a processor and a non-transient memory for storing instructions which when executed by the processor cause the network node to receive, from a transmitting node, a subset of elements of a first vector data. Furthermore, to estimate a plurality of elements not included in the subset of elements based on a previously received subset of elements based on a second vector data, the first vector data and the second vector data being part of a time series of vectors, the subset of elements selected by the transmitting node. Also, to form a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In further embodiments, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is received, a predicted value of the one of the subset of elements and clearing a counter, and setting, when one of the subset of elements is not received, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In further embodiments, the subset of elements is selected based on a criterion, and the criterion is an absolute value of each of the plurality of elements of the first vector data.
In further embodiments, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
In further embodiments, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
In accordance with embodiments of the present disclosure, there is provided a method of communicating vector data within a network. The method includes receiving, from a transmitting node, a subset of elements of a first vector data, and estimating a plurality of elements not included in the subset of elements based on a previously received subset of elements based on a second vector data. The first vector data and the second vector data are part of a time series of vectors. The subset of elements is selected by the transmitting node. Also, forming a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In further embodiments, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is received, a predicted value of the one of the subset of elements and resetting a counter, and setting, when one of the subset of elements is not received, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In further embodiments, the subset of elements is selected based on a criterion, and the criterion is an absolute value of each of the plurality of elements of the first vector data.
In further embodiments, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
In further embodiments, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
In accordance with embodiments of the present disclosure, there is provided a method of communicating vector data within a network. The method includes obtaining, by a transmitting node, a first vector data including a plurality of elements, compressing, by the transmitting node, the first vector data to produce a compressed vector data, and sending, by the transmitting node, the compressed vector data to a receiving node. Also, estimating, by the transmitting node, a reconstructed vector from the compressed vector data and a second vector data, the first vector data and the second vector data being part of a time correlated series of vector data, the second vector data being earlier in time than the first vector data. Furthermore, receiving, by the receiving node, from the transmitting node, the compressed vector data, and estimating, by the receiving node, the reconstructed vector.
Embodiments have been described above in conjunction with aspects of the present disclosure upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described, but may also be implemented with other embodiments of that aspect. When embodiments are mutually exclusive, or are otherwise incompatible with each other, it will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.
Further features and advantages of the present disclosure will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
Embodiments of the present disclosure relate to methods, systems, and apparatus for compressing a time series of vector data for transmission in a computer network. Embodiments use time series vector data where there is a temporal correlation between consecutive vector data in the time series.
Embodiments may use error-feedback to improve compression rates without decreasing the convergence rate during a process to train a machine learning model.
Embodiments may be used in applications where machine learning models are trained using a plurality of computer nodes connected by a computer network, which may be configured in a master-worker arrangement where one master computer node coordinates update vector calculations performed by one or more worker computer nodes.
Though
Stochastic gradient descent (SGD) is an algorithm for training a wide range of models in machine learning and for training artificial neural networks. SGD is an iterative method for optimizing an objective function with suitable smoothness properties and may be seen as a stochastic approximation of gradient descent optimization. SGD replaces the actual gradient calculated from the entire data set with an estimated gradient calculated from a randomly selected subset of the data. In high-dimensional optimization problems SGD reduces the computational burden and achieves faster iterations at the expense of a lower convergence rate. Other algorithms may also be used in embodiments, such as the ADAM algorithm, an optimization algorithm for stochastic gradient descent for training deep learning models, which combines momentum with adaptive step sizes.
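By way of a non-limiting illustration, the following sketch contrasts the full-batch gradient with the stochastic estimate that SGD computes from a randomly selected subset of the data; the quadratic loss, dataset, and learning rate are assumptions chosen for illustration only and are not part of the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # illustrative dataset
y = rng.normal(size=1000)
w = np.zeros(10)

def loss_grad(w, Xb, yb):
    # Gradient of the mean squared error 0.5 * ||Xb @ w - yb||^2 / len(yb)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = loss_grad(w, X, y)                       # exact gradient over the entire dataset
batch = rng.choice(len(y), size=32, replace=False)   # randomly selected subset of the data
stochastic_grad = loss_grad(w, X[batch], y[batch])   # SGD's estimate of the gradient
w = w - 0.01 * stochastic_grad                       # one SGD step using the estimate
```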
A variation of the SGD algorithm is momentum-SGD, an iterative algorithm where all workers 104 collectively optimize a machine learning model while the master 102 facilitates synchronization. Each worker 104 computes an update vector and sends it to the master 102. The master 102 computes the average of all update vectors received from the workers 104 and broadcasts the average back to the workers 104. Each worker 104 then uses the average to update the learning model. These steps are executed iteratively until a convergence criterion is met. Successive update vectors transmitted between the master 102 and each worker 104 may be viewed as a plurality of time series of vector data, which in this embodiment may be iterative optimization parameters of the machine learning model. As used herein, “time series” refers to iterations of data occurring at consecutive times. In some cases, iterations may be spaced equally apart in time, while in other cases, iterations may be spaced at varying or random intervals. In embodiments, each time iteration may depend on processing or communication time so that the time between iterations will vary within an expected range of values. In embodiments, each iteration occurs before or after another, and the time between samples places no limitation on embodiments. Examples of time series data include update vectors used to train machine learning models, video processing data, stock market data, and astronomical and meteorological data.
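By way of a non-limiting illustration, the following single-process sketch emulates the master-worker momentum-SGD iteration described above; the network transport, compression, and convergence test are omitted, and worker_gradient together with all numeric values are placeholders rather than part of the disclosure.

```python
import numpy as np

def worker_gradient(worker_id, model, rng):
    # Placeholder for the stochastic gradient a worker computes on its local data.
    return rng.normal(size=model.shape)

rng = np.random.default_rng(1)
model = np.zeros(100)
n_workers, beta, lr = 8, 0.9, 0.01
momenta = [np.zeros(100) for _ in range(n_workers)]   # one update vector per worker

for step in range(100):                               # iterate until a convergence criterion is met
    updates = []
    for i in range(n_workers):                        # each worker computes its update vector
        g = worker_gradient(i, model, rng)
        momenta[i] = beta * momenta[i] + (1 - beta) * g
        updates.append(momenta[i])                    # worker sends its update vector to the master
    average = np.mean(updates, axis=0)                # master averages and broadcasts the result
    model = model - lr * average                      # each worker applies the averaged update
```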
Vector data transmitted from the n workers 104 to the master 102 may be viewed as n separate time series of vector data. Vector data transmitted from the master 102 to each of the n workers 104 may be viewed as another n separate time series of vector data. By using a momentum-SGD algorithm, the update vector smooths the stochastic gradient over the iterations of each time series of vector data. The momentum-SGD algorithm applies an exponentially weighted low-pass filter (LPF) to the gradients across time iterations of the update vector, which filters out high-frequency components, preserves low-frequency ones, and reduces the variation between update vectors in consecutive iterations. This causes each entry in the update vector to change slowly over the time iterations. Embodiments may also be optimized using different filter variations, such as filters that implement a combination of low-pass and band-pass characteristics. Embodiments use this temporal correlation between elements of consecutive update vectors when compressing the update vectors transmitted between master 102 and workers 104.
The memory 220 may include any type of non-transitory memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), any combination of such, or the like. The mass storage element 230 may include any type of non-transitory storage device, such as a solid state drive, hard disk drive, a magnetic disk drive, an optical disk drive, USB drive, or any computer program product configured to store data and machine executable program code. According to certain embodiments, the memory 220 or mass storage 230 may have recorded thereon statements and instructions executable by the processor 210 for performing any of the method operations described above.
Computing device 200 may also include one or more optional components and modules such as video adapter 270 coupled to a display 275 for providing output, and I/O interface 240 coupled to I/O devices 245 for providing input and output interfaces.
In embodiments, the machine learning model 302 includes one or more machine learning algorithms that may be broadly classified as decision trees, support vector machines, regression, clustering, and other machine learning algorithms as is known in the art. Machine learning model 302 may also include an evaluation module 306 to evaluate the results of algorithm 304 which, in the case of supervised learning applications, may be done by comparing the tags of the training dataset 310 to the classification results produced by algorithm 304 in response to the training data 310. Training dataset 310 is used to tune and configure algorithm 304 which may include tuning parameters of the algorithm 304. Once tuned, machine learning model 302 may be tested or verified using testing dataset 312. The machine learning model 302 may be tested for accuracy, speed, and other parameters such as the number of false negative or false positive results, as required to qualify machine learning model 302 for use on production data 316. Once qualified, production data 316 may be input to the model 314, which implements machine learning model 302, to produce prediction results 318.
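By way of a non-limiting illustration, the following sketch shows one possible train / evaluate / test flow of the kind described above, using a scikit-learn decision tree as a stand-in for algorithm 304; the dataset, model choice, and split are assumptions for illustration only and not part of the disclosure.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)                          # tagged sample data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)                     # training and testing datasets

model = DecisionTreeClassifier().fit(X_train, y_train)       # tune/configure the algorithm
accuracy = accuracy_score(y_test, model.predict(X_test))     # compare predictions to the tags
print(f"test accuracy: {accuracy:.3f}")                      # qualify before use on production data
```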
With reference to
With reference to
Parameter $\beta$ 403, $0 \le \beta < 1$, is used to control the low-pass filter effect of the momentum-SGD algorithm and in practice may be set close to 1. In embodiments, $\beta$ may be 0.9 or 0.99. The gradient vector, $g_t^i$ 402, is used to produce the update vector, $v_t^i$ 404, where $v_t^i = \beta v_{t-1}^i + (1-\beta) g_t^i$. Using a value of $\beta$ close to 1 ensures that the value of $v_t^i$ is determined mainly by the previous value of $v$, that is $v_{t-1}^i$.
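By way of a non-limiting illustration, the following sketch applies the recursion $v_t^i = \beta v_{t-1}^i + (1-\beta) g_t^i$ to random placeholder gradients and shows that consecutive update vectors change slowly when $\beta$ is close to 1; the vector length and gradients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.9                                        # close to 1, e.g. 0.9 or 0.99
v = np.zeros(4)
for t in range(5):
    g = rng.normal(size=4)                        # placeholder stochastic gradient g_t
    v_prev = v.copy()
    v = beta * v + (1 - beta) * g                 # v_t = beta * v_{t-1} + (1 - beta) * g_t
    print(t, np.round(np.abs(v - v_prev).max(), 3))   # consecutive update vectors differ little
```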
In embodiments, switch EF 424 may be open and vector $r_t^i = v_t^i$. In embodiments with error feedback, switch EF 424 is closed and vector $r_t^i$ 406 may be formed as $r_t^i = \eta_t v_t^i + e_{t-1}^i$, where $\eta_t$ is a learning rate, and $e_t^i$ is an error vector, indicating a difference between the vector, $r_t^i$ 406, and the reconstructed or predicted vector, $\tilde{r}_t^i$ 428.
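By way of a non-limiting illustration, the following sketch shows one common error-feedback formulation consistent with the description above; scaling $v_t^i$ by the learning rate and adding the error carried over from the previous iteration is an assumption for illustration, not a quotation of the disclosed equations, and the compressor and predictor passed in are placeholders.

```python
import numpy as np

def error_feedback_step(v_t, e_prev, lr, compress, predict):
    r_t = lr * v_t + e_prev            # fold the previous iteration's error back in (assumed form)
    r_hat = compress(r_t)              # sparse vector that is actually transmitted
    r_tilde = predict(r_hat)           # reconstruction shared by worker and master
    e_t = r_t - r_tilde                # error vector carried to the next iteration
    return r_hat, e_t

rng = np.random.default_rng(3)
v, e = rng.normal(size=6), np.zeros(6)
top2 = lambda r: np.where(np.abs(r) >= np.sort(np.abs(r))[-2], r, 0.0)   # keep 2 largest magnitudes
r_hat, e = error_feedback_step(v, e, lr=0.1, compress=top2, predict=lambda r: r)  # identity predictor
```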
In embodiments, a quantizer, Q 414, is used to compress the vector, $r_t^i$ 406, to produce a sparse vector, $\hat{r}_t^i$ 408, given by the equation $\hat{r}_t^i = Q(r_t^i)$. In embodiments, the Q 414 operator produces the sparse vector, $\hat{r}_t^i$ 408, by setting all elements in vector $r_t^i$ 406 to zero except for the K elements with the largest absolute values. Alternatively, other quantizers may be used that select or omit vector elements based on other criteria.
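By way of a non-limiting illustration, the following sketch shows one way to implement the top-K magnitude quantizer Q 414; the use of argpartition is an implementation choice for illustration, not a requirement of the disclosure.

```python
import numpy as np

def top_k_quantize(r, k):
    """Zero every element of r except the k elements with the largest absolute values."""
    r_hat = np.zeros_like(r)
    keep = np.argpartition(np.abs(r), -k)[-k:]   # indices of the k largest-magnitude elements
    r_hat[keep] = r[keep]
    return r_hat

r = np.array([0.05, -1.2, 0.3, 0.01, 0.9, -0.02])
print(top_k_quantize(r, 2))                      # keeps -1.2 and 0.9, zeroes all other elements
```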
In embodiments, encoder, ε 416, is used to produce a bit stream 426 that is transmitted to the master 102 by encoding the non-zero locations in $\hat{r}_t^i$ 408 and the corresponding values.
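By way of a non-limiting illustration, the following sketch shows one possible encoder/decoder pair in which the non-zero locations of the sparse vector and their values are packed into a byte stream; the struct-based wire format is an assumption for illustration only.

```python
import struct
import numpy as np

def encode(r_hat):
    idx = np.flatnonzero(r_hat)
    payload = struct.pack("<I", len(idx))                        # number of non-zero entries
    for i in idx:
        payload += struct.pack("<If", int(i), float(r_hat[i]))   # (location, value) pairs
    return payload

def decode(payload, length):
    (n,) = struct.unpack_from("<I", payload)
    r_hat, offset = np.zeros(length), 4
    for _ in range(n):
        i, v = struct.unpack_from("<If", payload, offset)
        r_hat[i] = v
        offset += 8
    return r_hat

sparse = np.array([0.0, -1.2, 0.0, 0.0, 0.9, 0.0])
assert np.allclose(decode(encode(sparse), len(sparse)), sparse, atol=1e-6)
```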
Master 102 receives bitstream 426 at decoder, D 418, and recreates the sparse vector, $\hat{r}_t^i$ 410, that was transmitted by worker 104. A prediction system, P 420, may then be used to predict one or more of the zero-valued elements of vector $r_t^i$ 406 that were not included in sparse vector $\hat{r}_t^i$ 410 and therefore not included in received bitstream 426. Predicted vector elements may be used to create a reconstructed vector, $\tilde{r}_t^i$, using the equation $\tilde{r}_t^i = P(\hat{r}_t^i)$, and are combined with the received vector elements to produce vector $\tilde{r}_t^i$ 412. An example of a prediction method is described below.
In embodiments, worker node 104 may also use its own prediction system, P 422, to apply the same predicted vector elements to the sparse vector to obtain a vector $\tilde{r}_t^i$ 428 that is the same as the vector $\tilde{r}_t^i$ 412 used by master 102.
In embodiments, workers 104a through 104h may calculate stochastic gradient vectors, $g_t^1, g_t^2, \ldots, g_t^n$, which are used to produce the update vectors used in training the machine learning model.
Though the embodiment of
In embodiments, the number of elements in the vectors will vary depending on the application and the value of K may also be varied to obtain a compression factor that yields acceptable results.
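By way of a non-limiting illustration, the following back-of-the-envelope sketch estimates the compression factor obtained by transmitting only K of d 32-bit elements, assuming each retained element is sent as a 32-bit value plus an index; the figures are illustrative assumptions only.

```python
import math

d, K = 1_000_000, 1_000                          # vector length and number of retained elements
index_bits = math.ceil(math.log2(d))             # bits needed to address one element location
compressed_bits = K * (32 + index_bits)          # values plus locations for the K elements
uncompressed_bits = d * 32                       # full dense vector
print(f"compression factor ~ {uncompressed_bits / compressed_bits:.0f}x")   # roughly 615x here
```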
On a more general level, quantizer 414 may be any compression method that can be applied to a time correlated series of vector data. Furthermore, the prediction systems 420 and 422 may include any number of methods, designed jointly with quantizer 414, in order to produce a more efficiently compressed bit stream 426 consisting of fewer bits. The decoder 418 and prediction system 420 can act on the bit stream 426 to produce the predicted vector 412.
Referring again to
In embodiments, predictors, P 422 at worker 104 and P 420 at master 102, may be used to predict or estimate any of the vector elements that are not in the top K most significant elements. Since both computer nodes, master 102 and worker 104, use the same predictor, both sides have access to the same data.
The operation of encoder 416 is illustrated in
and in step 516, the counter, $\tau^i[k]$, is reset to zero. In step 518, the next vector element is analyzed until all vector elements have been processed and a complete predicted vector, $\tilde{r}_t^i$, is produced.
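By way of a non-limiting illustration, the following sketch shows one possible prediction system P of the kind described above: received elements update the stored prediction and reset a per-element counter, while omitted elements are filled in from the stored prediction and the counter is incremented. Holding the last received value is an assumption for illustration; the disclosure only requires that worker and master run the same predictor.

```python
import numpy as np

class Predictor:
    def __init__(self, length):
        self.last = np.zeros(length)             # last received value for each element
        self.tau = np.zeros(length, dtype=int)   # iterations since each element was last received

    def reconstruct(self, r_hat):
        r_tilde = np.empty_like(self.last)
        for k, value in enumerate(r_hat):
            if value != 0.0:                     # element k was transmitted this iteration
                self.last[k] = value             # update the predicted value
                self.tau[k] = 0                  # reset the counter
                r_tilde[k] = value
            else:                                # element k was omitted from the transmission
                self.tau[k] += 1                 # increment the counter
                r_tilde[k] = self.last[k]        # fill in with the predicted value
        return r_tilde
```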
With reference to
In accordance with embodiments of the present disclosure, there is provided a method of communicating time correlated vector data within a network. The method includes reading, by a transmitting node, a first vector data including a plurality of elements. The transmitting node selects a subset of elements of the plurality of elements based on a criterion and sends the subset of elements to a receiving node. A receiving node receives the subset of elements and estimates a plurality of elements not included in the subset of elements based on a previously received subset of elements based on a second vector data. The first vector data and the second vector data are part of a time series of vectors.
In a further embodiment, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is received, a predicted value of the one of the subset of elements and clearing a counter, and setting, when one of the subset of elements is not received, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In a further embodiment, the criterion is an absolute value of each of the plurality of elements of the first vector data.
In a further embodiment, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
Further embodiments include estimating, by the transmitting node, the plurality of elements not included in the subset of elements based on the previously received subset of elements based on the second vector data. The transmitting node forms a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In a further embodiment, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
In accordance with embodiments of the present disclosure, there is provided a network node for transmitting vector data over a network connection. The network node includes a processor and a non-transient memory for storing instructions which when executed by the processor cause the network node to read a first vector data including a plurality of elements, select a subset of elements of the plurality of elements based on a criterion, and send the subset of elements to a receiving node. The instructions further cause the network node to estimate a plurality of elements not included in the subset of elements based on a previously transmitted subset of elements based on a second vector data, and form a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In a further embodiment, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is received, a predicted value of the one of the subset of elements and clearing a counter, and setting, when one of the subset of elements is not received, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In further embodiments, the criterion is an absolute value of each of the plurality of elements of the first vector data.
In further embodiments, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
In further embodiments, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
In accordance with embodiments of the present disclosure, there is provided a network node for receiving vector data over a network connection. The network node includes a processor and a non-transient memory for storing instructions which when executed by the processor cause the network node to receive, from a transmitting node, a subset of elements of a first vector data and estimate a plurality of elements not included in the subset of elements based on a previously received subset of elements based on a second vector data. The first vector data and the second vector data are part of a time series of vectors and the subset of elements selected by the transmitting node is based on a criterion. The receiving node also forms a reconstructed vector data including the subset of elements and the estimated plurality of elements not included in the subset of elements.
In further embodiments, the estimating the plurality of elements not included in the subset of elements includes updating, when one of the subset of elements is received, a predicted value of the one of the subset of elements and clearing a counter, and setting, when one of the subset of elements is not received, one of the plurality of elements not included in the subset of elements with the predicted value, and incrementing the counter.
In further embodiments, the criterion is an absolute value of each of the plurality of elements of the first vector data.
In further embodiments, the first vector data and the second vector data are update vectors as part of a machine learning model training process.
In further embodiments, the first vector data is obtained by combining an initial vector data with a weighted difference of the reconstructed vector data and the first vector data.
Acts associated with the method described herein can be implemented as coded instructions in a computer program product. In other words, the computer program product is a computer-readable medium upon which software code is recorded to execute the method when the computer program product is loaded into memory and executed on the microprocessor of the computing device.
Further, each operation of the method may be executed on any computing device, such as a personal computer, server, PDA, or the like and pursuant to one or more, or a part of one or more, program elements, modules or objects generated from any programming language, such as C++, Java, or the like. In addition, each operation, or a file or object or the like implementing each said operation, may be executed by special purpose hardware or a circuit module designed for that purpose.
Through the descriptions of the preceding embodiments, the present disclosure may be implemented by using hardware or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present disclosure may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disk read-only memory (CD-ROM), USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present disclosure. For example, such an execution may correspond to a simulation of the logical operations as described herein. The software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with embodiments of the present disclosure.
It will be appreciated that, although specific embodiments of the technology have been described herein for purposes of illustration, various modifications may be made without departing from the scope of the technology. The specification and drawings are, accordingly, to be regarded simply as an illustration of the disclosure as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present disclosure. In particular, it is within the scope of the technology to provide a computer program product or program element, or a program storage or memory device such as a magnetic or optical wire, tape or disc, or the like, for storing signals readable by a machine, for controlling the operation of a computer according to the method of the technology and/or to structure some or all of its components in accordance with the system of the technology.
This application claims the benefit of and priority to U.S. Provisional Application Ser. No. 63/218,468 filed Jul. 5, 2021, the entire contents of which are hereby incorporated by reference.