The present disclosure relates to a data processing system comprising multiple sets of processing units and, in particular, to techniques that adapt the training of neural networks for such a system.
Neural networks are used in the field of machine learning and artificial intelligence. Neural networks comprise arrangements of sets of nodes which are interconnected by links and which interact with each other. The principles of neural networks in computing are based on information about how electrical stimuli convey information in the human brain. For this reason, the nodes are often referred to as neurons. They may also be referred to as vertices. The links are sometimes referred to as edges. The network can take input data and certain nodes perform operations on the data. The result of these operations is passed to other nodes. The output of each node is referred to as its activation or node value. Each link is associated with a weight. A weight defines the connectivity between nodes of the neural network. Many different techniques are known by which neural networks are capable of learning, which takes place by altering values of the weights to reproduce a target or label.
There are different learning approaches, but in each case there is a forward propagation through the network from left to right in
In order to determine the magnitude and direction of the updates that are applied to each of the model parameters in the network, a loss function is evaluated. Updates to the model parameters are made to attempt to minimise the loss function. The loss function represents the difference between the output of a neural network and the target defined in the training data for the neural network. The loss function (Ls) calculated for sets of training data is used to update the model parameters, θ. In some cases, the loss function may be calculated to perform updating of the model parameters for each sample of input data to the neural network. However, typically the training data will be divided into mini-batches with the loss function being calculated once for each mini-batch. Learning is based on backpropagation of the gradient of the loss function with respect to the model parameters. At iteration k, the updated model parameters for the next training iteration, k+1, are calculated:
$\theta_{k+1} = \theta_k - \eta_k \nabla_\theta L_s(\theta_k)$  Equation 1
where $\eta_k$ is the learning rate for training iteration k, and $\nabla_\theta L_s(\theta_k)$ is the gradient of the loss function with respect to the model parameters or, in the case of mini-batch Stochastic Gradient Descent, is the average gradient across the mini-batch.
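By way of illustration only, the update of Equation 1 for mini-batch Stochastic Gradient Descent may be sketched as follows. The function names `sgd_step` and `squared_error_grad`, the toy squared-error loss and the numerical values are assumptions chosen for the example and are not taken from the disclosure.

```python
def sgd_step(theta, grads_per_sample, learning_rate):
    """Apply Equation 1 using the average gradient across a mini-batch."""
    batch_size = len(grads_per_sample)
    avg_grad = [
        sum(g[i] for g in grads_per_sample) / batch_size
        for i in range(len(theta))
    ]
    # theta_{k+1} = theta_k - eta_k * (average gradient)
    return [t - learning_rate * g for t, g in zip(theta, avg_grad)]


def squared_error_grad(theta, x, y):
    """Gradient of the toy loss 0.5 * (theta . x - y)^2 (an assumed loss)."""
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [err * xi for xi in x]


theta = [0.0, 0.0]
mini_batch = [([1.0, 0.0], 2.0), ([0.0, 1.0], 4.0)]  # (input, target) pairs
grads = [squared_error_grad(theta, x, y) for x, y in mini_batch]
theta_next = sgd_step(theta, grads, learning_rate=0.5)
print(theta_next)
```

In this sketch the per-sample gradients are averaged before the learning rate is applied, which corresponds to the mini-batch case described above.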
Different types of loss function to be minimised are known. For example, one type of loss function corresponds to the sum of the squares of the differences between the output values and the target values. Another type of loss function, which may be used for classification problems, is given by the cross-entropy:
$L_s = -\sum_{x} y(x) \log p(x)$
where p(x) is the model's output probability (also referred to as a prediction) of the class being correct. p(x) is derived from the output of the xth node in the output layer of the neural network, and y(x) is the corresponding target value (also referred to as a label) for the class.
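As a minimal sketch, the cross-entropy between the predictions p(x) and the labels y(x) may be computed as shown below, assuming the standard form in which the negative sum of y(x) log p(x) is taken over the output nodes. The function name `cross_entropy` and the small constant `eps` (used to avoid taking the logarithm of zero) are assumptions introduced for the example.

```python
import math


def cross_entropy(probs, targets, eps=1e-12):
    """Return -sum_x y(x) * log p(x) over the output-layer nodes."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, targets))


probs = [0.7, 0.2, 0.1]    # p(x): model output probabilities per class
targets = [1.0, 0.0, 0.0]  # y(x): one-hot label, class 0 is correct
loss = cross_entropy(probs, targets)
print(loss)
```

With a one-hot label, only the probability assigned to the correct class contributes to the loss.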
New types of data processing systems are being designed that are specifically adapted for the training of neural networks. Such data processing systems make use of a very large number of processors that are capable of performing massively parallel processing that can be applied for training neural networks. Such data processing systems may make use of sets of processing units provided, for example, in clusters. Each of the processing units may itself contain a plurality of processors. The Graphcore Intelligence Processing Unit (IPU) is an example of such a processing unit.
When training neural networks, it is important to consider how to optimise the training to make use of the parallel processing that is provided by a data processing system having multiple sets of processing units. These multiple sets of processing units enable the training of neural networks to be performed in a distributed fashion.
According to a first aspect, there is provided a data processing system for training a neural network, the data processing system comprising: a first set of one or more processing units, a second set of one or more processing units, at least one data storage, and at least one interconnect between the first set of one or more processing units, the second set of processing units and the at least one data storage, wherein the at least one data storage is configured to provide over the at least one interconnect, training data to the first set of one or more processing units and the second set of one or more processing units, wherein each of the first and second set of processing units is configured to, for each of at least some of a plurality of training iterations for training the neural network: perform a series of operations on at least part of the respective training data received from the at least one data storage to derive output values for the neural network; exchange over the at least one interconnect, with the other of the first and second set of processing units, the output values calculated by the respective one of the first and second set of processing units; evaluate a loss function for the respective training iteration, said loss function including a metric measuring the dissimilarity between the output values calculated by the first and second set of processing units, wherein the metric is weighted in the evaluation of the loss function in accordance with a parameter; update model parameters of the neural network using the respective evaluated loss function; and update the parameter for use in subsequent ones of the training iterations.
The different sets of processing units are configured to each train a model, but to do so by exchanging their predictions for each training iteration, and using the predictions of the other sets of processing units to each update their models. This effect is optimised by introducing a parameter that changes over the course of training to control how much the dissimilarity of the predictions between sets of processing units impacts the updates to the model parameters.
In some embodiments, each of the first set of one or more processing units and the second set of one or more processing units comprises a cluster of processing units, each of the processing units being formed as part of a separate integrated circuit.
In some embodiments, the updating of the parameter by each of the first and second set of processing units comprises at least one of the first and second set of processing units receiving an updated value for the parameter.
In some embodiments, the updating the parameter comprises updating a value of the parameter to one of a set of values predefined before the training of the neural network.
In some embodiments, each of the first and second set of processing units is configured to perform the updating of the parameter for a predefined portion of the training iterations.
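One possible reading of the two preceding embodiments is sketched below: the parameter takes one of a set of values fixed before training, and it is only changed on a predefined portion of the iterations. Treating that portion as every `update_every`-th iteration, the name `scheduled_parameter` and the listed values are all assumptions made for the example.

```python
def scheduled_parameter(iteration, predefined_values, update_every):
    """Pick the parameter value for an iteration from a predefined set.

    The value only changes once every `update_every` iterations and stays
    at the last predefined value once the set is exhausted.
    """
    index = min(iteration // update_every, len(predefined_values) - 1)
    return predefined_values[index]


ALPHA_VALUES = [0.0, 0.25, 0.5, 1.0]  # hypothetical values fixed before training
schedule = [
    scheduled_parameter(k, ALPHA_VALUES, update_every=100)
    for k in (0, 99, 100, 250, 1000)
]
print(schedule)
```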
In some embodiments, the training data provided by the at least one data storage over the interconnect comprises a first set of training data provided to the first set of one or more processing units and a second set of training data provided to the second set of one or more processing units, wherein the first set of training data is different to the second set of training data.
In some embodiments, the training data provided by the at least one data storage over the interconnect comprises a same set of training data provided to the first set of one or more processing units and the second set of one or more processing units.
In some embodiments, the updating the parameter is performed in dependence upon a learning rate for the neural network.
In some embodiments, at least one of the first and second set of processing units is configured to calculate the updated parameter in dependence upon values calculated in dependence upon the training data and model parameters during the respective training iteration.
In some embodiments, the values calculated in dependence upon the training data comprise at least one of: the loss function; one or more gradients of the loss function; and a learning rate for the previous training iteration.
In some embodiments, the calculating the updated parameter comprises calculating the updated parameter in dependence upon a moving average using previously determined parameter values for a plurality of previous training iterations.
In some embodiments, the moving average is an exponential moving average.
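A minimal sketch of an exponential moving average applied to successive parameter values is given below. The names `ema_update` and `smooth_parameter`, the decay factor and the choice to seed the average with the first value are assumptions for illustration; the disclosure does not fix them here.

```python
def ema_update(previous_average, new_value, decay):
    """One exponential-moving-average step."""
    return decay * previous_average + (1.0 - decay) * new_value


def smooth_parameter(raw_values, decay=0.9):
    """Smooth a sequence of per-iteration parameter values with an EMA."""
    average = raw_values[0]  # seed with the first value (an assumption)
    smoothed = [average]
    for value in raw_values[1:]:
        average = ema_update(average, value, decay)
        smoothed.append(average)
    return smoothed


smoothed = smooth_parameter([0.0, 1.0, 1.0], decay=0.5)
print(smoothed)
```

A decay closer to 1 gives more weight to the values from previous training iterations, so the parameter changes more slowly.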
In some embodiments, each of the processing units is configured to alternate between operating in: a compute phase in which the respective processing unit performs calculations for training the neural network; and an exchange phase in which data for training the neural network is exchanged with others of the processing units, said data for training the neural network including the output values calculated by the first and second sets of processing units.
In some embodiments, the metric measuring the dissimilarity comprises the Kullback-Leibler divergence between the output values calculated by the first and second sets of processing units.
In some embodiments, the metric measuring the dissimilarity comprises the mean squared error between the output values calculated by the first and second sets of processing units.
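The loss evaluation of the first aspect, with either of the two dissimilarity metrics just mentioned, may be sketched as follows. The additive combination of the task loss and the weighted metric, the direction of the Kullback-Leibler divergence, the cross-entropy task loss and the names `combined_loss`, `alpha`, `own_probs` and `peer_probs` are assumptions made for the example rather than details confirmed by the text.

```python
import math


def cross_entropy(probs, targets, eps=1e-12):
    """Task loss against the labels (assumed to be cross-entropy)."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, targets))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def mean_squared_error(a, b):
    """Mean of the squared element-wise differences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def combined_loss(own_probs, peer_probs, targets, alpha, metric=kl_divergence):
    """Task loss plus the dissimilarity metric weighted by the parameter alpha."""
    task = cross_entropy(own_probs, targets)
    return task + alpha * metric(peer_probs, own_probs)


own = [0.7, 0.2, 0.1]    # outputs of this set of processing units
peer = [0.6, 0.3, 0.1]   # outputs received from the other set
labels = [1.0, 0.0, 0.0]
print(combined_loss(own, peer, labels, alpha=0.5))
print(combined_loss(own, peer, labels, alpha=0.5, metric=mean_squared_error))
```

Setting `alpha` to zero recovers independent training of each model, while larger values make disagreement with the other set's outputs contribute more to the update.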
In some embodiments, the data processing system comprises a host system comprising at least one processor configured to: interface the first and second set of processing units with the at least one data storage; and provide the training data to the first and second set of processing units from the at least one data storage.
According to a second aspect, there is provided a method for training a neural network, the method implemented in a data processing system comprising: a first set of one or more processing units, a second set of one or more processing units, at least one data storage, and at least one interconnect between the first set of one or more processing units, the second set of processing units and the at least one data storage, wherein the method comprises: providing from the at least one data storage, over the at least one interconnect, training data to the first set of one or more processing units and the second set of one or more processing units; and, for each of at least some of a plurality of training iterations for training the neural network: performing a series of operations on at least part of the respective training data received from the at least one data storage to derive output values for the neural network; exchanging over the at least one interconnect, with the other of the first and second set of processing units, the output values calculated by the respective one of the first and second set of processing units; evaluating a loss function for the respective training iteration, said loss function including a metric measuring the dissimilarity between the output values calculated by the first and second set of processing units, wherein the metric is weighted in the evaluation of the loss function in accordance with a parameter; updating model parameters of the neural network using the respective evaluated loss function; and updating the parameter for use in subsequent ones of the training iterations.
In some embodiments, each of the first set of one or more processing units and the second set of one or more processing units comprises a cluster of processing units, each of the processing units being formed as part of a separate integrated circuit.
In some embodiments, the updating of the parameter by each of the first and second set of processing units comprises at least one of the first and second set of processing units receiving an updated value for the parameter.
In some embodiments, the updating the parameter comprises updating a value of the parameter to one of a set of values predefined before the training of the neural network.
In some embodiments, the updating of the parameter is performed for a predefined portion of the training iterations.
In some embodiments, the training data provided by the at least one data storage over the interconnect comprises a first set of training data provided to the first set of one or more processing units and a second set of training data provided to the second set of one or more processing units, wherein the first set of training data is different to the second set of training data.
In some embodiments, the training data provided by the at least one data storage over the interconnect comprises a same set of training data provided to the first set of one or more processing units and the second set of one or more processing units.
In some embodiments, the updating the parameter is performed in dependence upon a learning rate for the neural network.
In some embodiments, at least one of the first and second set of processing units is configured to calculate the updated parameter in dependence upon values calculated in dependence upon the training data and model parameters used for the respective training iteration.
In some embodiments, the values calculated in dependence upon the training data comprise at least one of: the loss function; one or more gradients of the loss function; and a learning rate for the previous training iteration.
In some embodiments, the calculating the updated parameter comprises calculating the updated parameter in dependence upon a moving average using previously determined parameter values for a plurality of previous training iterations.
In some embodiments, the moving average is an exponential moving average.
In some embodiments, the method comprises each of the processing units of the first and second sets of processing units alternating between operating in: a compute phase in which the respective processing unit performs calculations for training the neural network; and an exchange phase in which data for training the neural network is exchanged with others of the processing units, said data for training the neural network including the output values calculated by the first and second sets of processing units, wherein the step of exchanging, over the at least one interconnect, the output values is performed during one of the exchange phases.
In some embodiments, the metric measuring the dissimilarity comprises the Kullback-Leibler divergence between the output values calculated by the first and second sets of processing units.
In some embodiments, the metric measuring the dissimilarity comprises the mean squared error between the output values calculated by the first and second sets of processing units.
In some embodiments, the data processing system comprises a host system comprising at least one processor configured to: interface the first and second set of processing units with the at least one data storage; and provide the training data to the first and second set of processing units from the at least one data storage.
For a better understanding of the present invention and to show how the same may be carried into effect, reference will now be made by way of example to the accompanying Figures in which:
Reference is made to
In embodiments, each processing unit 2 also comprises one or more external links 8, enabling the processing unit 2 to be connected to one or more other processing units (e.g. one or more other instances of the same processing unit 2). These external links 8 may comprise any one or more of: one or more processor-to-host links for connecting the processing unit 2 to a host processor, and/or one or more processor-to-processor links for connecting together with one or more other instances of the processing unit 2 on the same IC package or card, or on different cards. In one example arrangement, the processing unit 2 receives work from a host processor (not shown) which is connected to the processing unit 2 via one of the processor-to-host links in the form of input data to be processed by the processing unit 2. Multiple instances of the processing unit 2 can be connected together into cards by processor-to-processor links. Thus a host accesses a computer, which is architected as a multi-tile system on a chip, depending on the workload required for the host application. The processing unit 2 functions as an accelerator subsystem for the host processor.
The interconnect 34 is configured to enable the different tiles 4 in the array 6 to communicate with one another. However, as well as there potentially being dependencies between threads on the same tile 4, there may also be dependencies between the portions of the program running on different tiles 4 in the array 6. A technique is, therefore, required to prevent a piece of code on one tile 4 running ahead of data upon which it is dependent being made available by another piece of code on another tile 4.
Each tile 4 is itself a processor capable of executing instructions (code) from a local instruction memory and handling data in local data memory. A tile 4 may comprise a respective instance of a barrel-threaded processor and a memory. By way of illustration, the processing unit 2 may comprise on the order of hundreds of tiles 4, or even over a thousand. For completeness, note also that an “array” as referred to herein does not necessarily imply any particular number of dimensions or physical layout of the tiles 4.
Communication between tiles 4 on the processing unit 2 occurs in a time deterministic fashion. However, other forms of inter-tile exchange are possible. There may be dependencies between the portions of the program running on different tiles 4 in the array 6. That is, processing data on one tile may depend on results from another tile, e.g. may provide results on which another tile depends. A technique is, therefore, required to prevent a piece of code on one tile 4 running ahead of data upon which it is dependent being made available by another piece of code on another tile 4.
Parallel programming models for AI and Data Science usually follow a 3-phase iterative execution model: Compute, Barrier, and Exchange. The implications are that data transfer to and from a processor is usually barrier dependent to provide data-consistency between the processors and between each processor and a host. Typically used data consistency models are Bulk Synchronous Parallel (BSP), Stale Synchronous Parallel (SSP) and Asynchronous. Embodiments described herein use a BSP model, but it will be apparent that the other synchronization models could be utilised as an alternative.
Reference is made to
During the compute phase 33, each tile 4 performs one or more computation tasks locally on-tile, but does not communicate any results of these computations with any others of the tiles 4. In the exchange phase 32, each tile 4 is allowed to exchange one or more results of the computations from the preceding compute phase to and/or from one or more others of the tiles, but does not perform any new computations until it has received from other tiles 4 any data on which its task(s) has/have dependency. Neither does it send to any other tile, any data except that computed in the preceding compute phase. It is not excluded that other operations such as internal control-related operations may be performed in the exchange phase 32. The communication external to the tile group may optionally utilise the BSP mechanism, but alternatively may not utilize BSP and may instead use some other synchronization mechanism of its own.
According to the BSP principle, a barrier synchronization 30 is placed at the juncture transitioning from the compute phase 33 into the exchange phase 32, or the juncture transitioning from the exchange phase 32 into the compute phase 33, or both. That is to say, either: (a) all tiles 4 are required to complete their respective compute phases 33 before any in the group is allowed to proceed to the next exchange phase 32, or (b) all tiles 4 in the group are required to complete their respective exchange phases 32 before any tile in the group is allowed to proceed to the next compute phase 33, or (c) both of these conditions are enforced. In all three variants, it is the individual tiles which alternate between phases, and the whole assembly which synchronizes. The sequence of exchange and compute phases may then repeat over multiple repetitions. In BSP terminology, each repetition of exchange phase and compute phase is sometimes referred to as a “superstep” (though note that in the literature the terminology is not always used consistently: sometimes each individual exchange phase and compute phase individually is called a superstep, whereas elsewhere, as in the terminology adopted herein, the exchange and compute phases together are referred to as a superstep).
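Variant (c) above, in which a barrier is enforced on both transitions, may be illustrated in software as follows. This is a sketch only: software threads stand in for the tiles 4, `threading.Barrier` stands in for the barrier synchronization 30, and the doubling computation and neighbour exchange pattern are assumptions chosen to make the result easy to check.

```python
import threading


def run_bsp(num_tiles=4, num_supersteps=3):
    """Run a toy BSP program with a barrier on both phase transitions."""
    barrier = threading.Barrier(num_tiles)
    outbox = [0] * num_tiles                # results published by each tile
    state = list(range(1, num_tiles + 1))   # private per-tile state

    def tile(tile_id):
        for _ in range(num_supersteps):
            # Compute phase: local work only, no communication.
            outbox[tile_id] = state[tile_id] * 2
            barrier.wait()  # all tiles finish computing before any exchange
            # Exchange phase: receive the left neighbour's result.
            received = outbox[(tile_id - 1) % num_tiles]
            barrier.wait()  # all tiles finish exchanging before next compute
            state[tile_id] = received

    threads = [threading.Thread(target=tile, args=(i,)) for i in range(num_tiles)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


print(run_bsp())
```

Because no tile overwrites its published result until every tile has finished reading, the outcome does not depend on how the threads happen to be scheduled.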
Note also, it is not excluded that multiple different independent groups of tiles 4 on the same processing unit 2 or different processing units could each form a separate respective BSP group operating asynchronously with respect to one another, with the BSP cycle of compute, synchronize and exchange being imposed only within each given group, but each group doing so independently of the other groups. I.e. a multi-tile array 6 might include multiple internally synchronous groups each operating independently and asynchronously to the other such groups (discussed in more detail later). In some embodiments there is a hierarchical grouping of sync and exchange, as will be discussed in more detail later.
In embodiments, multiple instances of the processing unit 2 are connected together to form an even larger array of tiles 4 spanning multiple processing units 2. This is illustrated in
When using the processing units 2 for training a neural network, the calculations are performed during the compute phase 33. These calculations include the determining of activations, the evaluation of the loss function and the gradient of the loss function, and the determining of the updates to the model parameters, i.e. weights and biases. During the exchange phase, the activations of the output layer (i.e. the outputs of the neural network) are exchanged between processing units 2 and used, during a following one or more compute phases, to determine a dissimilarity metric of the loss function. The dissimilarity metric may also be referred to as a dissimilarity measure of the loss function. This dissimilarity metric/measure is referred to in this description as the distillation loss.
At the physical layer, the interconnect mechanism is lossy, but at the transaction layer, the mechanism is not lossy due to the architecture of the link layer: if a packet is not acknowledged it will be resent automatically by the hardware in the interconnect 72. The possibility for loss and resending at the data link layer, however, means that the delivery of data packets over the external interconnect 72 is not time-deterministic. Further, all the packets of a given exchange may arrive together or separated apart in time, and in any order, so the external interconnect requires flow control and queuing. Further, the interconnect may use clock-data-recovery (CDR) technology to infer a clock from a received data stream having sufficient data signal transitions to maintain bit-lock. This inferred clock will be of unknown phase relationship to the sending clock and hence represent an additional source of non-determinism.
As illustrated, the external interconnect 72 comprises an external exchange block (XB) 78. The compiler nominates one of the tiles 4 to send an external exchange request (XREQ) to the exchange block 78 (operation S1). The XREQ is a message comprising one or more control packets, indicating which of the tiles 4 have data packets (content) to send to another tile or tiles 4 on another processing unit 2. This is illustrated schematically in
An example mechanism for implementing the synchronization amongst a selected sync group 91, 92 is illustrated in
The respective GSP 95 associated with each chip 2 is connected to its respective chip 2, such that it can detect the sync request (Sync_req) raised by that chip 2 and the exit state of that chip 2, and so that it can return the sync acknowledgment (Sync_ack) and global exit state to the respective chip 2. The respective GSP 95 associated with each chip 2 is also connected to the GSP 95 of at least one other of the chips 2 via an external sync interface comprising a bundle of four sync wires 96, details of which will be discussed in more detail shortly. This may be part of one of the chip-to-chip links 8. In the case of a link between chips on different cards, the interface 8 may for example comprise a PCI interface and the four sync wires 96 may be implemented by re-using four wires of the PCI interface. Some of the chips' GSPs 95 are connected to that of two adjacent chips 2, each connection via a respective instance of the four sync wires 96. This way, the chips 2 can be connected in one or more daisy chains via their GSPs 95. This enables the sync requests, sync acknowledgments, running aggregates of exit states, and global exit states, to be propagated up and down the chain.
In operation, for each sync group 91, 92, the GSP 95 associated with one of the chips 2 in that group is set as the master for synchronization and exit state aggregation purposes, the rest in the group being slaves for this purpose. Each of the slave sync blocks 95 is configured with the direction (e.g. left or right) that it needs to propagate sync requests, sync acknowledgments and exit states for each sync group 91, 92 (i.e. the direction toward the master). In embodiments these settings are configurable by software, e.g. in an initial configuration phase after which the configuration remains set throughout the subsequent operation of the system. For instance, this may be configured by the host processor. Alternatively, it is not excluded that the configuration could be hard-wired. Either way, the different sync groups 91, 92 can have different masters and, in general, it is possible for a given chip 2 (or rather its GSP 95) to be master of one group and not another group of which it is a member, or to be master of multiple groups.
For instance, by way of illustration, consider the example scenario of
The GSP 95 of the master then determines a global aggregate of all the exit states based on the running aggregate it receives and the exit state of its own chip 2IV. It propagates this global aggregate back out along the chain to all the chips 2, along with the sync acknowledgement (Sync_ack).
If the master is part way along a chain, as opposed to being at one end as in the above example, then the sync and exit state information propagates in opposite directions either side of the master, both sides toward the master. In this case, the master only issues the sync acknowledgment and global exit state once the sync request from both sides has been received. E.g. consider the case where chip 2III is master of group 92. Further, in embodiments the GSP 95 of some of the chips 2 could connect to that of three or more other chips 2, thus creating multiple branches of chains toward the master. Each chain then behaves as described above, and the master only issues the sync acknowledgment and global exit state once the sync request from all chains has been received. And/or, one or more of the chips 2 could connect to an external resource such as the host processor, a network card, a storage device or an FPGA.
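The propagation of running aggregates toward the master and of the global aggregate back along a single daisy chain may be sketched as below. The aggregation operator is not stated in this passage; a logical AND over binary exit states is assumed purely for illustration, and the name `chain_sync` is hypothetical.

```python
def chain_sync(exit_states, master_index):
    """Return the global exit state delivered to every chip in the chain.

    exit_states holds one binary exit state per chip, in chain order.
    """
    def aggregate(states):
        running = 1  # identity element for AND (assumed operator)
        for s in states:
            running &= s
        return running

    # Sync requests from chips left of the master propagate rightwards,
    # each GSP folding its own chip's exit state into the running aggregate.
    left = aggregate(exit_states[:master_index])
    # Requests from chips right of the master propagate leftwards.
    right = aggregate(reversed(exit_states[master_index + 1:]))
    # The master acts only once requests from both sides have arrived,
    # then combines them with the exit state of its own chip.
    global_state = left & right & exit_states[master_index]
    # The global aggregate is returned to all chips with the sync_ack.
    return [global_state] * len(exit_states)


print(chain_sync([1, 1, 1, 1], master_index=3))  # master at one end
print(chain_sync([1, 0, 1, 1], master_index=2))  # master part way along
```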
In embodiments, the signalling of the sync and exit state information is implemented as follows. The bundle of four sync wires 96 between each pair of chips 2 comprises two pairs of wires, a first pair 96_0 and a second pair 96_1. Each pair comprises an instance of a sync request wire and an instance of a sync acknowledgment wire. To signal a running aggregate exit state of value 0, the GSP 95 of the sending chip 2 uses the sync request wire of the first wire pair 96_0 when signalling the sync request (sync_req), or to signal a running aggregate of value 1 the GSP 95 uses the sync request wire of the second wire pair 96_1 when signalling the sync request. To signal a global aggregate exit state of value 0, the GSP 95 of the sending chip 2 uses the sync acknowledgment wire of the first wire pair 96_0 when signalling the sync acknowledgment (sync_ack), or to signal a global aggregate of value 1 the GSP 95 uses the sync acknowledgment wire of the second wire pair 96_1 when signalling the sync acknowledgment.
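The wire selection described above may be summarised by the following sketch, which assumes that a global aggregate of value 1 is signalled on the sync acknowledgment wire of the second pair. The function names and the tuple representation of a wire are hypothetical and serve only to illustrate the encoding.

```python
def _pair_for(value):
    """Map a binary aggregate exit state to the wire pair that carries it."""
    if value not in (0, 1):
        raise ValueError("exit state must be 0 or 1")
    return "96_0" if value == 0 else "96_1"


def sync_request_wire(running_aggregate):
    """Wire asserted to signal sync_req together with the running aggregate."""
    return (_pair_for(running_aggregate), "sync_req")


def sync_ack_wire(global_aggregate):
    """Wire asserted to signal sync_ack together with the global aggregate."""
    return (_pair_for(global_aggregate), "sync_ack")


print(sync_request_wire(1))
print(sync_ack_wire(0))
```

In this way a single wire assertion conveys both the sync event and one bit of exit state, which is why four wires suffice for the interface.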
Note that the above is only the mechanism for propagating sync and exit state information. The actual data (content) is transmitted by another channel, for example as discussed earlier with reference to
There is additionally provided a mechanism for enabling a host processor 93 to communicate with any processing unit 2 that operates with either a single point of rendezvous for all its participants (such as BSP), or in some embodiments a sufficiently small number of points of rendezvous (such as a number of independent processing units all connected to one host) such that implementation of a host-processor friendly synchronisation mechanism can be implemented in hardware in a particularly efficient manner. This situation may be contrasted with a traditional CSP approach in which the number of points of rendezvous is application specific and thus the synchronization mechanisms such as semaphores must be software defined and thus subject to inefficiencies that follow from this (e.g. processor interrupt latency).
As shown in
In embodiments, one HSP module 98 is provided per chip 2 and per corresponding GSP 95. In this case, whichever GSP 95 is configured as the master of a given sync group 91, 92, the HSP 98 of that sync block is set as the proxy of the host 93 within the group and the other HSPs are disabled. Thus, as with the sync blocks 95, the HSPs 98 can be configured per sync group 91, 92. So one HSP 98 can be set as the host proxy for one sync group, e.g. 91A or 91B, whilst another HSP 98 can be set as the host proxy for another group, e.g. 91B or 92; or the same HSP 98 may be set as the host proxy for multiple groups, e.g. both 91 and 92. To this end, the host interface 97 is connected to the HSPs 98 so that the HSP 98 selected for each group 91, 92 may be configurable by software by writing to registers of the HSP modules 98 via the PCI interface 97. Alternatively, it is not excluded that the configuration could be hard-wired or the HSP registers updated via a different interface or protocol. It is also not excluded that in yet further alternative embodiments, there could be a single fixed HSP 98 per sync group 91, 92, or even a single fixed HSP 98 for the whole array or subsystem 6.
The or each host sync proxy (HSP) module 98 comprises hardware circuitry configured to enable the host 93 to participate in the respective sync group 91, 92 in which that HSP 98 is arranged to act as the host's proxy. A sync request emitted by the tiles 4, if it is a sync with host involvement, will be conveyed by the sync logic 95 to the active HSP 98 for that group whereas a sync request which does not specify host involvement will be aggregated and returned to the requesting tiles without involving the HSP 98 in any way. Thus the tiles 4 determine by virtue of the program they execute when, if at all, the processing unit 2 needs to interact with the host via the HSP 98.
By way of illustration, consider an instance of the HSP 98 configured to act as proxy of the host 93 with respect to the global sync group 92. E.g. in
The host 93 is asynchronous and non-time-deterministic with respect to the rest of the sync group 92, and separated by a relatively large amount of wiring and physical logic. In addition any communication with the host likely requires the host to take an interrupt following which there is a considerable latency for handling the interrupt and then switching contexts to the host code that would deal with the sync request. These factors mean the latency of any interaction involving the host 93 is poor. It would be desirable to avoid needing to communicate directly with the host 93 as much as possible.
To this end, the HSP 98 comprises a set of registers comprising at least one counter 99, and associated counting logic arranged to operate as follows. The counter 99 is arranged so that an integer value n can be written to it by the host 93 via the host interface 97, in embodiments such that the value written is added to the value already present in this register 99. When the HSP counter 99 has a value of 1 or greater then, in the sync group 92 in which the HSP 98 in question is acting as the host's proxy, the HSP 98 is configured to generate a sync acknowledgement (sync_ack) when it receives a sync request from the tiles 4 in the sync group 92. The associated counting logic automatically decrements n by one in the counter 99 each time a sync acknowledgement is generated and the corresponding barrier is passed (e.g. barrier 80 in the case of sync group 92). This process occurs without the requirement for the HSP 98 to contact or otherwise interrupt the host. But if the counter value n has now reached zero, the HSP 98 does not generate the sync acknowledgement and therefore does not allow the tiles 4 in the group 92 to continue running again until both: i) all the tiles 4 in that group 92 have sent a sync request (sync_req), and ii) the host 93 performs a write to the HSP 98 via the host interface 97 explicitly granting the barrier to be released. In embodiments, this second subcondition ii) is implemented by the HSP 98 checking that the HSP counter 99 now has a value of 1 or greater, i.e. the counter has been granted more credits by the host 93 writing to the counter 99 via the host interface 97. Thus the tiles 4 of the group can be allowed to continue running through n barriers without deferring at all to the host 93, after which they must then synchronize with the host 93 (and may then exchange data to and/or from the host).
In some cases, the host may arrange its operation for maximum efficiency by ensuring that the HSP counter value never falls to zero and thus the processing unit 2 never pauses to sync with the host.
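The credit mechanism described above can be illustrated with a small software model. This is only a sketch: the HSP 98 is hardware circuitry, and the class name, method names and the boolean `all_tiles_requested` flag are hypothetical names introduced here for illustration.

```python
class HostSyncProxy:
    """Toy model of the HSP credit counter (illustrative names, not from the disclosure)."""

    def __init__(self):
        # Models counter 99: the number of barriers the host has approved in advance.
        self.counter = 0

    def host_write(self, n):
        # The host posts credits via the host interface; the value written is
        # added to the value already present in the counter.
        self.counter += n

    def on_sync_request(self, all_tiles_requested):
        """Return True if a sync_ack is generated and the barrier is passed."""
        if not all_tiles_requested:
            return False  # still waiting for a sync_req from every tile in the group
        if self.counter >= 1:
            self.counter -= 1  # one credit is consumed per barrier passed
            return True
        return False  # counter is zero: the tiles stall until the host grants credits


hsp = HostSyncProxy()
hsp.host_write(2)  # host approves two barriers in advance
results = [hsp.on_sync_request(True) for _ in range(3)]
# The first two barriers pass without host involvement; the third stalls.
```

In this model a host that keeps posting credits before the counter reaches zero never causes the tiles to stall, which corresponds to the efficient mode of operation described above.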
Preferably the software running on the tiles 4 is free to choose whether to request HSP involvement or not, by collectively marking their respective sync requests as either requiring or not requiring host involvement. In such embodiments the above behaviour is applied only by the HSP 98 for the barriers corresponding to sync requests marked as requiring host involvement (the “involvement” of the host for any given barrier being either the proxy granting of the sync ack by the HSP 98 on behalf of the host, or occasionally the explicit granting of more credit). The program is arranged so that all tiles 4 in a given group 91, 92 signal the same choice in their sync requests (HSP involvement or not) for a given barrier synchronization. In embodiments the host involvement is selected by different variants of the mode of the SYNC instruction. That is, for each sync group 91, 92, there are effectively two variants that the operand of the SYNC instruction can take: zone_1_host, zone_1_no_host; and zone_2_host, zone_2_no_host. The execution unit 18 is configured to act upon the operand, and in response to cause the synchronization logic in the interconnect 72, 76 to signal the host involvement marker accordingly. In other embodiments however, it is not excluded that other mechanisms could be implemented for requesting host involvement, or even (though less preferred) that host involvement is hardwired and therefore always imposed (i.e. counter 99 is always consulted).
Another function of the HSP 98 is to notify the host by writing a notification message directly to the host's memory (in this embodiment, over the PCI interface). The notification message includes the current contents of the HSP 98 which includes the aforementioned counter value. Optionally the HSP 98 can also be configured to interrupt the host at this point. The host therefore has the option of waiting for an interrupt from the HSP or of polling the memory location written by the HSP with either method serving to alert the host to the current new state of the HSP including the value of its counter. The host program may then take such steps as it requires in order to prepare for future barriers following which it posts incremental values to the HSP counter 99.
In embodiments, preparation for barriers performed by the host may include the preparation of data to be fetched by the processing unit 2, such as experience data sets required by the processing unit 2 for the next stage in learning a model. Preparation in this context may include fetching the data from storage disks or other media, formatting data in a form which is required by the training algorithm running on the processing unit 2 or decompression of image data. Additionally, preparation for barriers may include consuming output data produced by the processing unit 2.
Another function of the HSP 98 is to communicate the exit state value of the processing unit 2 that accompanies the sync request from the tiles 4 to the host 93, via the notification message mentioned previously.
Another function of the HSP 98 is to allow the host program to specify its own exit state value by writing it to one of the HSP registers. Thereafter, when the HSP 98 generates a sync-acknowledgment for the tiles 4, the aggregated exit state of all the tiles 4 is also aggregated with the exit state value that has been provided by the host 93.
Another function of the HSP 98 is to allow the host program to specify an expected exit state value which corresponds to the exit state it most commonly expects the tiles 4 to provide along with their sync request. When the host 93 provides an expected exit state in this way, then so long as the exit state of the tiles 4 matches the value provided by the host, the operation of the HSP is as described previously, with the HSP generating a sync acknowledgement while the HSP counter value n is greater than zero. Alternatively, if the host's expected exit state value does not match the value provided by the tiles 4, then the HSP 98 does not generate a sync acknowledgement to the tiles 4. Because the exit state of the tiles 4 is provided during the notification write mentioned above, and the processing unit 2 will be stalled at the barrier where the tile exit state and host exit state differ, the host program is able to take such barrier preparation steps as may be required to satisfy the conditions signalled by the change in exit state and then re-establish the counter value n such that the value reflects the new preparations made. To facilitate this re-establishment of the counter value, the HSP interprets a write to the HSP register with a count value of zero as an instruction to zero the counter value, rather than to increment the counter value by zero, which would have the undesired effect of leaving the counter value unchanged.
An unexpected exit state event as described above may entail abandoning previous preparations made by the host in anticipation of the tile exit state matching the expected value. In general, however, the loss of efficiency resulting from this event is small compared to the loss of efficiency that would be incurred if the processing unit 2 had to interrupt or involve the host directly at each barrier, so long as the occurrence of the unexpected exit state value is rare relative to occurrences of the expected exit state value.
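The expected exit state check and the special meaning of a zero write can be sketched in the same style. The names below are hypothetical, and two details are assumptions made for illustration only: that a notification is recorded on every sync request, and that it carries the counter value as it stood when the request arrived.

```python
class HostSyncProxyWithExitState:
    """Toy model of the HSP with an expected exit state (illustrative names only)."""

    def __init__(self, expected_exit_state):
        self.counter = 0
        self.expected_exit_state = expected_exit_state
        # Stands in for notification writes to host memory (assumed timing and content).
        self.notifications = []

    def host_write_count(self, n):
        # A write of zero clears the counter; any other value is added to it.
        if n == 0:
            self.counter = 0
        else:
            self.counter += n

    def on_sync_request(self, tile_exit_state):
        """Called once all tiles have sent sync_req; returns True if sync_ack is generated."""
        self.notifications.append((tile_exit_state, self.counter))
        if tile_exit_state != self.expected_exit_state:
            return False  # unexpected exit state: stall and leave it to the host
        if self.counter >= 1:
            self.counter -= 1
            return True
        return False


hsp = HostSyncProxyWithExitState(expected_exit_state=1)
hsp.host_write_count(3)
passed_expected = hsp.on_sync_request(1)    # matches: barrier passes, one credit used
passed_unexpected = hsp.on_sync_request(0)  # mismatch: barrier held, credits untouched
hsp.host_write_count(0)                     # host discards the stale credits
```

After the zero write the host can post a fresh count that reflects its new preparations.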
In some cases, the processing units 2 may be arranged into clusters and connected together using gateways. Such clusters may be applied for training neural networks when a larger amount of processing power is required than is available in a single machine. Reference is made to
In this model illustrated by
Reference is made to
Each processing unit in a cluster comprises at least one processor configured to execute computer readable instructions to perform the calculating and exchanging operations described herein. Each processing unit in a cluster 710, 720 may be provided on a separate integrated circuit. Each of the processing units in the clusters 710, 720 may be an intelligence processing unit 2 as described above with respect to
The external storage 750 is configured to provide sets of training data for training a neural network to both of the clusters 710, 720 of processing units. The external storage 750 is associated with a host 740, which provides the training data from the external storage 750 to the clusters 710, 720 over the interconnect 730. The training data provided to one of the clusters 710, 720 may be the same as or different from the training data provided to the other of the clusters 710, 720. In the example, only a single host 740 is used to provide the training data to each of the clusters 710, 720. However, in other examples multiple hosts may be used or, alternatively, a decentralised setup with no explicit host could be used, with each of the clusters 710, 720 being able to read data from the external storage 750.
The training data is preferably divided into mini-batches for training. A mini-batch of data is a plurality of training samples that are a subset of the whole training data set for training the neural network. The whole training data set comprises a plurality of mini-batches. When the training data is distributed to the clusters of processing units in mini-batches, each mini-batch is used to determine a single set of updated model parameters during a single training iteration. During each training iteration, each of the clusters will produce sets of output values based on the mini-batch of data received and use these output values to compute a gradient of a loss function, which is used to perform an update to the model parameters. Once each of the clusters 710, 720 has performed the training using all of the mini-batches defined from the training data set, each of the clusters again performs updating of the model parameters using a number of mini-batches from the same training data set. Each such cycle of the clusters 710, 720 through the training data set is known as an epoch. In some cases, the mini-batches that are used may be the same as the mini-batches used in previous epochs. In other cases, the training data set is shuffled after each epoch and used to define a new set of mini-batches that differ from the previous mini-batches defined from the same training data set. The training process for training the neural network therefore comprises a plurality of epochs, with each of the plurality of epochs comprising a plurality of training iterations.
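The relationship between epochs, mini-batches and training iterations can be sketched as follows. This is a minimal illustration assuming the variant in which the data set is reshuffled for each epoch; the function name, the batch size and the toy data set are hypothetical.

```python
import random


def make_minibatches(dataset, batch_size, rng):
    """Shuffle the data set and split it into mini-batches (the last may be smaller)."""
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    return [
        [dataset[i] for i in indices[start:start + batch_size]]
        for start in range(0, len(indices), batch_size)
    ]


dataset = list(range(10))  # stand-in for ten training samples
rng = random.Random(0)
iterations = 0
for epoch in range(3):
    # A new set of mini-batches is defined from the same data set for each epoch.
    for minibatch in make_minibatches(dataset, batch_size=4, rng=rng):
        iterations += 1  # one training iteration (one parameter update) per mini-batch
```

With ten samples and a batch size of four, each epoch comprises three training iterations.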
Two different methods may be applied to distribute training data to the clusters 710, 720 of processing units. In a first method, a host 740 (which is associated with the external storage 750) accesses the external storage 750 and distributes mini-batches of the training data to each of the clusters. The mini-batches of data distributed to each cluster 710, 720 of processing units may be the same or different.
In a second method, the host 740 distributes the entire set of training data from the external storage 750 to each of the clusters 710, 720 of processing units. Each of the clusters 710, 720, generates or receives from the host 740 a random seed, which it uses to sample the training data set to obtain a mini-batch. In this way, each cluster 710, 720 will use a randomly selected mini-batch of training data for performing the training during a particular training iteration.
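The second method can be sketched as each cluster holding the full data set together with its own seeded random number generator. The class and method names are hypothetical, and sampling without replacement within a mini-batch is an assumption; the description only requires that the seed is used to sample the data set.

```python
import random


class ClusterSampler:
    """Per-cluster mini-batch sampler driven by a random seed (illustrative only)."""

    def __init__(self, dataset, seed):
        self.dataset = dataset
        self.rng = random.Random(seed)  # seed generated locally or received from the host

    def next_minibatch(self, batch_size):
        # Draw a random mini-batch from the full data set for one training iteration.
        return self.rng.sample(self.dataset, batch_size)


full_dataset = list(range(100))
cluster_a = ClusterSampler(full_dataset, seed=1)
cluster_b = ClusterSampler(full_dataset, seed=2)
batch_a = cluster_a.next_minibatch(8)
batch_b = cluster_b.next_minibatch(8)
```

Giving two clusters the same seed makes them draw the same sequence of mini-batches, while different seeds make their mini-batches independent.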
In the case that different data sets are distributed to the clusters 710, 720, one of the clusters 710, 720 will provide training data to the other of the clusters 710, 720 to allow the other cluster 710, 720 to determine predictions using the same training data. By doing so, both clusters 710, 720 will then obtain predictions that can be compared, since they were generated using the same training data. It is these output values that are then exchanged and compared by the models.
The initial model parameters that are used by each of the clusters 710, 720 at the start of training are initialised to different starting values. This is particularly important when the same training data is used by each cluster during an iteration, since otherwise the output values produced by each cluster 710, 720 would be the same and the value of the distributed training would be lost.
Once each cluster 710, 720 has computed one or more sets of output values during a training iteration, each of the clusters 710, 720 exchanges the one or more sets of output values over the interconnect 730. To benefit from the distributed training, each of the clusters 710, 720 then computes a dissimilarity metric, which measures how different the sets of output values are from one another.
There are different methods that may be applied to calculate the dissimilarity metric between the sets of output values. One approach that works well is the use of the Kullback-Leibler divergence as the dissimilarity metric. The Kullback-Leibler divergence between two different discrete probability distributions, P and Q, is given by:
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Therefore, when applying the Kullback-Leibler divergence to determine the dissimilarity metric, each of the clusters 710, 720 of processing units determines the measure of dissimilarity as:
respectively, where p1 is the set of predictions calculated by one of the clusters 710, 720 and p2 is the set of predictions calculated by the other of the clusters 710, 720.
It will be appreciated that whilst the Kullback-Leibler divergence is one example of a calculation used to determine the dissimilarity metric, other measures of the differences between probability distributions may be applied for that purpose. For example, in some examples, the mean squared error may be used as the dissimilarity metric:
where C is the number of prediction values output by the neural network.
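Both dissimilarity metrics can be sketched in a few lines. The function names and the small `eps` guard against taking the logarithm of zero are assumptions added for illustration; `p1` and `p2` stand for the prediction sets of the two clusters, each a list of C probabilities.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discrete distributions p and q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_squared_error(p, q):
    """Mean squared difference over the C prediction values."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)


p1 = [0.7, 0.2, 0.1]  # predictions of one cluster
p2 = [0.5, 0.3, 0.2]  # predictions of the other cluster
d_12 = kl_divergence(p1, p2)
d_21 = kl_divergence(p2, p1)
d_mse = mean_squared_error(p1, p2)
```

The Kullback-Leibler divergence is not symmetric, so `d_12` and `d_21` generally differ, whereas the mean squared error is the same whichever cluster computes it.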
Although in
The dissimilarity metric may be referred to herein as the distillation loss. This distillation loss is included as an additional penalty term in the loss function that is calculated by each of the clusters 710, 720 of processing units. The overall loss function calculated by the ith cluster in the system is given by:
$$L^{(i)}(\theta_k) = L_S(\theta_k^{(i)}) + \xi_k L_D^{(i)}(\theta_k) \qquad \text{(Equation 5)}$$
where $\theta_k$ represents the parameters for the models held by each of the clusters, $\theta_k = \{\theta_k^{(i)}\}_{i=1}^{N}$, where N is the number of clusters (which is two in the example shown in
In equation 5, $L_S(\theta_k^{(i)})$ represents the supervised loss calculated by the ith cluster by comparing its output values obtained to the labels for the model. This term is the same as the loss function term shown in Equation 1. In equation 5, $L_D^{(i)}(\theta_k)$ represents the distillation loss calculated by the ith cluster by comparing its predictions to the predictions of the other cluster/s.
Using the distillation loss to perform distributed training using sets of processing units has advantages when compared to data parallel training, in which each set of processing units independently obtains updates to weights, with these updates or updated weights then being shared between the sets of processing units. Specifically, the data parallel training scheme is limited in that, once the batch size exceeds a certain size, generalisation is observed to degrade. Using the distillation loss to derive the model parameter updates on each set of processing units avoids this limitation.
The inventors have realised that the distributed training over clusters of processing units can be improved by varying a weighting that is applied to the distillation loss over the training process. This weighting is represented by the parameter ξ in equation 5, which is varied over the training process. The parameter ξ is a hyperparameter since its value is not derived from training. The hyperparameter ξk is the value of ξ for the kth training iteration. In particular, the hyperparameter is increased over the training period, which has been empirically determined to improve the training process by improving the accuracy of predictions made by a neural network that is undergoing training.
Reference is made to
In one embodiment, the values taken by the hyperparameter are predefined before the training begins. In this case, the values taken by the hyperparameter ξ may be the same or different for each of the clusters 710, 720. The values for the hyperparameter may be stored in the host 740, external storage 750 or in the clusters 710, 720, and then used to adjust the hyperparameter to weight the distillation loss term differently throughout the training process. Such predefined values are set such that the hyperparameter ξ gradually increases over the training process, representing an increase in the weighting of the distillation loss.
In another embodiment, the values taken by the hyperparameter are calculated by the clusters 710, 720 during the training. In this case, each cluster 710, 720 uses a different value for the hyperparameter and updates the hyperparameter during training using the supervised loss functions and the distillation loss functions calculated by the respective cluster 710, 720 in real time during training.
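A predefined schedule of this kind, and its use in the overall loss of equation 5, can be sketched as below. The linear ramp, its end points and the function names are assumptions chosen for illustration; the description only requires that the predefined values increase gradually over training.

```python
def xi_schedule(k, total_iterations, xi_start=0.0, xi_end=1.0):
    """Hypothetical linear ramp for the distillation weighting at iteration k."""
    fraction = k / max(total_iterations - 1, 1)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to [0, 1]
    return xi_start + fraction * (xi_end - xi_start)


def overall_loss(supervised_loss, distillation_loss, xi_k):
    """Overall loss of equation 5: supervised loss plus weighted distillation loss."""
    return supervised_loss + xi_k * distillation_loss


schedule = [xi_schedule(k, total_iterations=11) for k in range(11)]
early = overall_loss(1.0, 2.0, schedule[0])   # distillation term ignored at the start
late = overall_loss(1.0, 2.0, schedule[-1])   # distillation term fully weighted at the end
```

The same table of values could be stored on the host or on each cluster and looked up by iteration number.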
To show how the values for a hyperparameter may be updated by each cluster 710, 720, consider that an optimum value for ξk is one that minimises the expected loss function. This minimum may be found by setting the gradient of the loss function with respect to the hyperparameter, ξ, to be equal to 0:
In equation 6, the expression ⟨·,·⟩ denotes the inner product. To determine the optimum value of the hyperparameter ξk from equation 6, an expression for
is used. Using the general expression for updating θk via Stochastic Gradient Descent from equation 1, it is seen that:
By substituting this expression for
in equation 7 into equation 6, it is seen that:
Re-arranging the expression in Equation 8, allows the hyperparameter ξk for the kth training iteration to be expressed as:
Therefore, the hyperparameter, ξk, for a particular training iteration, k, can be calculated as a function of the loss functions, the gradients of the loss functions with respect to the model parameters, and the learning rate. Each cluster 710, 720 performs this calculation to determine a new value for the hyperparameter ξ.
The learning rate itself may change throughout the training process, with each cluster 710, 720 being configured to calculate new learning rates for a training iteration and use the newly calculated learning rate to update the hyperparameter ξ.
Therefore, each cluster 710, 720 may calculate a new value for the hyperparameter ξ for each of at least some of the training iterations and use the new value to calculate the overall loss function, which is then applied in equation 1 to determine updates to the model parameters.
It may be unnecessary to update the value of the hyperparameter ξ for each and every training iteration, due to the small rate of change for the parameter ξ between the training iterations. Therefore, to reduce the burden placed on the computational resources of the clusters 710, 720, by finding updated values for ξ, each of the clusters may be configured to only calculate an updated value for ξ for a predefined portion of the training iterations.
One issue that may arise when calculating an updated value for ξ is that the value calculated is heavily dependent upon the particular training data used for the current training iteration and the preceding training iteration. This can result in noise in the values that are calculated for ξ. This noise is addressed by the system 700 as follows.
Firstly, to smooth out the updates to the hyperparameter ξ, a moving average is taken over the previously calculated values for ξ. The moving average used is the exponential moving average. Each cluster 710, 720 applies this moving average when determining an update to the hyperparameter ξ.
When applying the moving average, the updated value for the hyperparameter ξk+1 for the k+1th training iteration is given by:
$$\xi_{k+1} = \alpha \xi_k + (1 - \alpha)\hat{\xi}_k \qquad \text{(Equation 10)}$$
where α is a smoothing coefficient, and $\hat{\xi}_k$ is a new value that is calculated when determining the updated value for the hyperparameter ξk+1. The new value is input into the moving average function applied by the clusters 710, 720, such that the moving average that is determined is taken over the new value and the previously calculated values. The previously calculated values are represented by ξk in equation 10. The value of α is between 0 and 1, and preferably between 0.9 and 1.
When calculating $\hat{\xi}_k$, instead of applying equation 9, a different expression is used, which is found to yield updated values for the hyperparameter with less noise. Specifically, by determining the value $\hat{\xi}_k$ using only values for the gradient based on the data used for the current training iteration, noise resulting from the inner product of gradients from two different mini-batches is avoided. This may be expressed by determining $\hat{\xi}_k$ as:
where ϵ is a small constant used to prevent division by 0.
Each of the clusters 710, 720 is configured to determine the new value $\hat{\xi}_k$ and use it as an input into the moving average function along with the previously calculated values for ξ. By doing so, the clusters 710, 720 each determine a value for the updated hyperparameter ξ that is less affected by noise.
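The smoothing step of equation 10 can be sketched as follows. The function name is hypothetical and the sequence of noisy new values is made up purely to show the smoothing effect; how each new value is derived from the gradients is not modelled here.

```python
def ema_update(xi_k, xi_new, alpha=0.9):
    """Exponential moving average of equation 10: blend the previous value with the new one."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * xi_k + (1.0 - alpha) * xi_new


xi = 0.0
noisy_new_values = [1.0, 0.2, 1.4, 0.6, 1.1]  # made-up per-iteration estimates
history = []
for xi_new in noisy_new_values:
    xi = ema_update(xi, xi_new, alpha=0.9)
    history.append(xi)
```

With a smoothing coefficient close to 1 each new value moves ξ only slightly, so the swings between consecutive smoothed values are much smaller than the swings in the raw estimates.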
Reference is made to
At S910, each of the sets of processing units performs a series of operations on at least part of the respective training data to derive output values for the neural network.
At S920, the sets of processing units exchange with one another the output values that each of those sets of processing units has calculated.
At S930, the sets of processing units each evaluate a loss function for the training iteration. The loss function includes the metric measuring the dissimilarity between the predictions. The metric is weighted by the parameter.
At S940, the sets of processing units each update the model parameters of the neural network using their evaluated loss function.
At S950, the sets of processing units update the parameter for use in subsequent ones of the training iterations.
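Steps S910 to S950 can be put together in a toy example with two sets of processing units, each training its own copy of a very small model. Everything specific here is an assumption made for illustration: the one-weight logistic model, the synthetic data, the mean squared error used as the dissimilarity metric, the linear ramp for the weighting parameter, and all names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(params, xs):
    w, b = params
    return [sigmoid(w * x + b) for x in xs]


def train(num_iterations=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Toy data: the label is 1 when x is positive.
    data = [(x, 1.0 if x > 0 else 0.0)
            for x in (rng.uniform(-2, 2) for _ in range(64))]
    # Two model copies, one per set of processing units, with different initial values.
    params = [[rng.uniform(-1, 1), 0.0], [rng.uniform(-1, 1), 0.0]]
    xi = 0.0  # weighting applied to the dissimilarity metric
    for k in range(num_iterations):
        batch = rng.sample(data, 16)
        xs = [x for x, _ in batch]
        ys = [y for _, y in batch]
        # S910: each set derives output values from the training data.
        outputs = [predict(p, xs) for p in params]
        new_params = []
        for i in (0, 1):
            # S920: set i uses the output values received from the other set.
            own, peer = outputs[i], outputs[1 - i]
            # S930: gradient of (cross-entropy + xi * squared error) w.r.t. the logit.
            dz = [(p - y) + xi * 2.0 * (p - q) * p * (1.0 - p)
                  for p, q, y in zip(own, peer, ys)]
            grad_w = sum(d * x for d, x in zip(dz, xs)) / len(xs)
            grad_b = sum(dz) / len(xs)
            # S940: gradient descent update of this set's model parameters.
            w, b = params[i]
            new_params.append([w - lr * grad_w, b - lr * grad_b])
        params = new_params
        # S950: update the weighting for subsequent iterations (assumed linear ramp).
        xi = (k + 1) / num_iterations
    return params, data


trained_params, toy_data = train()
```

Each model copy treats the other's output values as fixed targets within an iteration, so only output values, not gradients or weights, cross between the two sets.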
It will be appreciated that the above embodiments have been described by way of example only.