Some electronic devices perform operations for processing instances of input data through computational models, or “models,” to generate outputs. There are a number of different types of models, for each of which electronic devices generate specified outputs based on processing respective instances of input data. For example, one type of model is a recommendation model. Processing instances of input data through a recommendation model causes an electronic device to generate outputs such as ranked lists of items from among a set of items to be presented to users as recommendations (e.g., products for sale, movies or videos, social media posts, etc.), probabilities that a particular user will click on/select a given item if presented with the item (e.g., on a web page, etc.), and/or other outputs. For a recommendation model, instances of input data therefore include information about users and/or others, information about the items, information about context, etc.
In some electronic devices, multiple compute nodes, or “nodes,” are used for processing instances of input data through models to generate outputs. These electronic devices can include many nodes, with each node including one or more processors and a local memory. For example, the nodes can be or include interconnected graphics processing units (GPUs) on a circuit board or in an integrated circuit chip, server nodes in a data center, etc. When using multiple nodes for processing instances of input data through models, different schemes can be used for determining where model data is to be stored in memories in the nodes. Generally, model data includes information that describes, enumerates, and/or identifies arrangements or properties of internal elements of a model—and thus defines or characterizes the model. For example, for model 100, model data includes embedding tables 112, information about the internal arrangement of multilayer perceptrons 106 and 116, and/or other model data. One scheme for determining where model data is stored in memories in the nodes is “data parallelism.” For data parallelism, full copies of model data are replicated/stored in the memory in individual nodes. For example, a full copy of model data for multilayer perceptrons 106 and/or 116 can be replicated in each node that performs processing operations for multilayer perceptrons 106 and/or 116. Another scheme for determining where model data is stored in memories in the nodes is “model parallelism.” For model parallelism, separate portions of model data are stored in the memory in individual nodes. The memory in each node therefore stores a different part—and possibly a relatively small part—of the particular model data. For example, for model 100, a different subset of embedding tables (i.e., the model data) from among embedding tables 112 can be stored in the local memory of each node among multiple nodes. For instance, given M embedding tables and N nodes, the memory in each node can store a subset that includes M/N of the embedding tables (M=100, 1000, or another number and N=10, 50, or another number). In some cases, model parallelism is used where particular model data is sufficiently large in terms of bytes that it is impractical or impossible to store a full copy of the model data in any particular node's memory. For example, embedding tables 112 can include thousands of embedding tables that are too large as a group to be stored in any individual node's memory and thus the embedding tables are distributed to the local memories in multiple nodes.
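As a purely illustrative sketch of this distribution (not part of any described embodiment), the following Python fragment assigns contiguous subsets of M/N embedding tables to the local memory of each of N nodes; the counts and names are examples only.

```python
# Illustrative sketch only: distributing M embedding tables across N nodes under model
# parallelism, so that each node's local memory stores a distinct contiguous subset of
# roughly M/N tables. The counts below are examples.
M_TABLES = 1000   # total embedding tables (M)
N_NODES = 10      # nodes with local memories (N)

def tables_for_node(node_id, m_tables=M_TABLES, n_nodes=N_NODES):
    """Return the identifiers of the embedding tables stored in the given node's local memory."""
    per_node = m_tables // n_nodes
    start = node_id * per_node
    return list(range(start, start + per_node))

# Node 0 stores tables 0..99, node 1 stores tables 100..199, and so on.
assert len(tables_for_node(0)) == 100 and tables_for_node(1)[0] == 100
```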
In electronic devices in which portions of model data are distributed among multiple nodes in accordance with model parallelism, individual nodes may need model data stored in memories in other nodes for processing instances of input data through the model. For example, when the individual embedding tables from among embedding tables 112 in model 100 are stored in the local memories of multiple nodes, a given node may need lookup data from the individual embedding tables stored in other nodes' local memories for processing instances of input data. In this case, each node receives or acquires indices (or other records) that identify lookup data from the individual embedding tables stored in that node's memory that is needed by each other node. Each node then acquires/looks up and communicates, to each other node, respective lookup data from the individual embedding tables stored in that node's memory (or data generated based thereon, e.g., by combining or adding multiple rows, etc.) in an all-to-all communication.
In many electronic devices, a significant part of the computational and/or communication effort expended by nodes for processing instances of input data through a model is expended on the above described acquisition of lookup data (i.e., model data) for other nodes and/or all-to-all communication of lookup data to the other nodes. Performing the lookups and the all-to-all communication can therefore absorb a considerable amount of processing capacity for the nodes and add latency to operations for processing instances of input data through the model.
Throughout the figures and the description, like reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the described embodiments and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art, and the general principles described herein may be applied to other embodiments and applications. Thus, the described embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features described herein.
In the following description, various terms are used for describing embodiments. The following is a simplified and general description of some of the terms. Note that these terms may have significant additional aspects that are not recited herein for clarity and brevity and thus the description is not intended to limit these terms.
Functional block: functional block refers to a set of interrelated circuitry such as integrated circuit circuitry, discrete circuitry, etc. The circuitry is “interrelated” in that circuit elements in the circuitry share at least one property. For example, the circuitry may be included in, fabricated on, or otherwise coupled to a particular integrated circuit chip, substrate, circuit board, or portion thereof, may be involved in the performance of specified operations (e.g., computational operations, control operations, memory operations, etc.), may be controlled by a common control element and/or a common clock, etc. The circuitry in a functional block can have any number of circuit elements, from a single circuit element (e.g., a single integrated circuit logic gate or discrete circuit element) to millions or billions of circuit elements (e.g., an integrated circuit memory). In some embodiments, functional blocks perform operations “in hardware,” using circuitry that performs the operations without executing program code.
Data: data is a generic term that indicates information that can be stored in memories and/or used in computational, control, and/or other operations. Data includes information such as actual data (e.g., results of computational or control operations, outputs of processing circuitry, inputs for computational or control operations, variable values, sensor values, etc.), files, program code instructions, control values, variables, and/or other information.
In the described embodiments, computational nodes, or “nodes,” in an electronic device perform operations for processing instances of input data through a computational model, or “model.” A model generally includes, or is defined as, a number of operations to be performed on, for, or using instances of input data to generate corresponding outputs. For example, in some embodiments, the nodes perform operations for processing instances of input data through a model such as model 100 as shown in
Models are defined or characterized by model data, which is or includes information that describes, enumerates, and identifies arrangements or properties of internal elements of a model. For example, for model 100, the model data includes embedding tables 112 such as tables, hashes, or other data structures including index-value pairings; configuration information for multilayer perceptrons 106 and 116 such as weights, bias values, etc. used for processing operations for hidden layers within the multilayer perceptrons (not shown in
For processing instances of input data through a model, the instances of input data are processed through internal elements of the model to generate an output from the model. Generally, an “instance of input data” is one piece of the particular input data that is to be processed by the model, such as information about a user to whom a recommendation is to be provided for a recommendation model, information about an item to be recommended, etc. Using model 100 as an example, each instance of input data includes dense features 108 and categorical features 110, which include and/or are generated based on information about a user, context information, item information, and/or other information.
In some embodiments, for processing instances of input data through the model, a number of instances of input data are divided up and assigned to each of multiple nodes in an electronic device to be processed therein. As an example, assume that there are eight nodes and 32,000 instances of input data to be processed. In this case, evenly dividing the instances of input data up among the eight nodes means that each node will process 4,000 instances of input data through the model. Further assume that model 100 is the model and that there are 1024 total embedding tables 112, with 128 different embedding tables stored in the local memory in each of the eight nodes. For processing instances of input data through the model, each of the eight nodes receives the dense features 108 for all the instances of input data to be processed by that node—and therefore receives the dense features for 4,000 instances of input data. Each node also receives a respective portion of the categorical features 110 for all 32,000 instances of input data. The respective portion for each node includes a portion of the categorical features for which the node is to perform lookups in locally stored embedding tables 112. Generally, the categorical features 110 include 1024 input index vectors, with one input index vector for each embedding table. Each input index vector includes elements with indices to be looked up in the corresponding embedding table for each instance of input data and thus each of the 1024 input index vectors has 32,000 elements. For receiving the respective portion of the categorical features 110, each node receives an input index vector for each of the 128 locally stored embedding tables with 32,000 indices to be looked up in that locally stored embedding table. In other words, in the respective set of input index vectors, each node receives a different 128 of the 1024 input index vectors.
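The division of dense features and input index vectors among nodes described in this example can be sketched as follows. The sketch is illustrative only: NumPy arrays stand in for dense features 108 and categorical features 110, and smaller counts are used so that the fragment stays lightweight, with the counts from the example noted in comments.

```python
import numpy as np

# The example above uses 8 nodes, 32,000 instances, and 1,024 embedding tables (128 per node);
# smaller illustrative counts are used here so the sketch stays lightweight.
N_NODES, N_INSTANCES, N_TABLES, N_DENSE = 8, 320, 64, 13
PER_NODE_INSTANCES = N_INSTANCES // N_NODES   # 4,000 in the example above
PER_NODE_TABLES = N_TABLES // N_NODES         # 128 in the example above

rng = np.random.default_rng(0)
dense_features = rng.random((N_INSTANCES, N_DENSE), dtype=np.float32)       # stand-in for dense features 108
input_index_vectors = rng.integers(0, 1_000, size=(N_TABLES, N_INSTANCES))  # one index vector per table

def shard_for_node(node_id):
    """Dense features for this node's instances; input index vectors for its locally stored tables."""
    inst = slice(node_id * PER_NODE_INSTANCES, (node_id + 1) * PER_NODE_INSTANCES)
    tabs = slice(node_id * PER_NODE_TABLES, (node_id + 1) * PER_NODE_TABLES)
    return dense_features[inst], input_index_vectors[tabs]

dense_0, index_vectors_0 = shard_for_node(0)   # shapes here: (40, 13) and (8, 320)
```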
After receiving dense features 108 and categorical features 110, each node uses the respective embedding tables for processing the categorical features 110. For this operation, each node performs lookups in the embedding tables stored in that node's memory using indices from the received input index vectors to acquire lookup data needed for processing instances of input data. Continuing the example, based on the 32,000 input indices in each of the 128 input index vectors, each node performs 32,000 lookups in each of the 128 locally stored embedding tables to acquire both that node's own data and data that is needed by the other seven nodes for processing their respective instances of input data. Each node then communicates lookup data acquired during the lookups to other nodes in an all-to-all communication via a communication fabric. For this operation, each node communicates a portion of the lookup data acquired from the locally stored embedding table to the other node that is to use the lookup data for processing instances of input data. Continuing the example from above, each node communicates the lookup data from the 128 locally stored embedding tables for processing the respective 4,000 instances of input data to each other node, so that each other node receives a block of lookup data that is 128×4,000 in size. For example, a first node can communicate a block of lookup data for the second 4,000 instances of input data to a second node, a block of lookup data for the third 4,000 instances of input data to a third node, and so forth (the first node keeps the lookup data for the first 4,000 instances of input data for processing its own instances of input data).
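A sketch of the per-node lookups and of splitting the resulting lookup data into one block per destination node is shown below. The function names are illustrative, and the splitting stands in for the actual all-to-all communication performed over a communication fabric.

```python
import numpy as np

def local_lookups(local_tables, local_index_vectors):
    """Gather rows from each locally stored embedding table for all instances.
    local_tables: list of (rows, dim) arrays; local_index_vectors: (tables, instances) indices.
    Returns lookup data of shape (local tables, instances, dim)."""
    return np.stack([table[idx] for table, idx in zip(local_tables, local_index_vectors)])

def blocks_for_all_to_all(lookup_data, n_nodes):
    """Split the instance axis so that block j holds the lookup data needed by node j."""
    return np.split(lookup_data, n_nodes, axis=1)

# Illustrative use: 2 locally stored tables of 1,000 rows each, 320 instances, 4 nodes.
rng = np.random.default_rng(1)
tables = [rng.random((1_000, 16), dtype=np.float32) for _ in range(2)]
index_vectors = rng.integers(0, 1_000, size=(2, 320))
blocks = blocks_for_all_to_all(local_lookups(tables, index_vectors), n_nodes=4)
# blocks[j] (shape (2, 80, 16)) would be communicated to node j over the communication fabric.
```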
Each of the nodes additionally processes dense features 108 through multilayer perceptron 106 to generate an output for multilayer perceptron 106. Each node next combines the outputs from multilayer perceptron 106 and that node's lookup data in combination 114 to generate corresponding intermediate values (e.g., combined vectors, etc.). For this operation, that node's lookup data includes the lookup data acquired by that node from the locally stored embedding tables as well as all the portions of lookup data received by that node from the other nodes. As an output of this operation each node produces 4,000 intermediate values, one intermediate value for each instance of input data being processed in that node. Each node processes each of that node's intermediate values through multilayer perceptron 116 to generate model output 118. The model output 118 for each instance of input data in each node is in the form of a ranked list (e.g., a vector or other listing) of items to be presented to a user as a recommendation, an identification of a probability of a user clicking on/selecting an item presented on a website, etc.
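Purely as an illustration of this per-node forward flow (and not a description of model 100's actual internals), the following sketch assumes that the multilayer perceptrons are simple stacks of dense layers and that combination 114 is a concatenation.

```python
import numpy as np

def mlp(x, layers):
    """Apply a stack of dense layers with ReLU activations; layers is a list of (W, b) pairs."""
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

def forward_one_node(dense, lookup_data, bottom_layers, top_layers):
    """dense: (batch, d) dense features; lookup_data: (tables, batch, dim) assembled after the all-to-all."""
    bottom_out = mlp(dense, bottom_layers)                          # multilayer perceptron 106
    combined = np.concatenate([bottom_out, *lookup_data], axis=1)   # combination 114 (assumed to be a concatenation)
    return mlp(combined, top_layers)                                # multilayer perceptron 116, yielding model output 118
```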
Although a particular model (i.e., model 100) is used as an example herein, the described embodiments are operable with other types of models. Generally, in the described embodiments, any type of model can be used for which separate embedding tables are stored in local memories in multiple nodes in an electronic device (i.e., for which the embedding tables are distributed using model parallelism). In addition, although eight nodes are used for describing processing 32,000 instances of input data through a model in the example above, in some embodiments, different numbers of nodes are used for processing different numbers of instances of input data. Generally, in the described embodiments, any number and/or arrangement of nodes in an electronic device can be used for processing instances of input data through a model, as long as some or all of the nodes have a local memory in which separate embedding tables are stored.
In the described embodiments, nodes perform lookups in embedding tables stored in local memories in the nodes in order to acquire lookup data that is needed by the nodes themselves and other nodes for processing instances of input data through a model. The nodes perform the lookups using indices from respective sets of input index vectors from among input index vectors in a full set of input index vectors.
For the example in
Processing each instance of input data through the model includes using lookup data acquired from each of the twelve embedding tables as input to specified operations for the model (as described in more detail above). In other words, in order to process an instance of input data through the model, a given node needs a piece of lookup data from each of the twelve embedding tables—the three embedding tables in that node's local memory and the nine embedding tables stored in local memories of other nodes. For example, when processing the first instance of input data (labeled as 0 at the top left of
The indices to be looked up in each embedding table for processing the instances of input data are included in a corresponding input index vector 302 in the full set of input index vectors 300 (only three input index vectors 302 are labeled in
The four heavier solid-line vertical boxes in
When processing instances of input data through the model, each node performs lookups in embedding tables in the local memory in that node using the respective input index vector 302. Each node then either uses the lookup data itself or provides respective portions of the lookup data to other nodes. Generally, for providing respective portions of lookup data to other nodes, each node provides, to each other node, only the lookup data that is needed by that other node. In other words, a given node looks up data in the locally stored embedding tables using the indices from the input index vectors in the respective set of input index vectors and then communicates only the data needed by each other node to that other node via the above-described all-to-all communication. The input indices in each respective set of input index vectors that are used for acquiring lookup data for the node itself and each other node are shown via three heavy dashed horizontal lines in
The above-described division of the respective sets of input index vectors by the heavier dashed lines divides each respective set of input index vectors into four “parts” (only three parts are labeled in
In the described embodiments, an electronic device includes a number of nodes communicatively coupled together via a communication fabric. Each of the nodes includes at least one processor and a local memory. For example, in some embodiments, the nodes are or include GPUs, with the processors being GPU cores and the local memories being GPU memories, and the communication fabric being a GPU interconnect to which the GPUs are connected. The nodes (i.e., the processors and memories in the nodes) perform operations for processing instances of input data through a model. For example, in some embodiments, the nodes perform operations for processing instances of input data through a recommendation model such as model 100 as shown in
For “compressing” the lookup data communicated between the nodes, the described embodiments avoid communicating duplicated lookup data that would be communicated between the nodes in existing electronic devices. Generally, duplicated lookup data includes two or more pieces of lookup data that match one another. The described embodiments identify duplicated lookup data that would otherwise be communicated between the nodes and perform operations for preventing duplicated lookup data from being communicated between the nodes. For this operation, when processing instances of input data through the model, each node receives or acquires a respective set of input index vectors (e.g., respective set of input index vectors 310) having a number of parts (e.g., part 312). Each node processes each of the parts to identify duplicate input indices in the input index vectors in that part, i.e., input indices that match other input indices elsewhere in the elements of each input index vector that are included within that part. Each node removes duplicate input indices from each input index vector in each part to generate a compressed set of input index vectors for that part. In this way, each node generates a compressed set of input index vectors for each part with duplicate input indices removed. Each node then uses the compressed set of input index vectors for each part for performing lookups for acquiring lookup data from the embedding table(s) stored in the local memory in that node. Each node locally uses the lookup data acquired using the indices from one of the parts for processing instances of input data. Each node communicates the lookup data acquired using the compressed set of input index vectors for each remaining part, i.e., compressed lookup data for that part, to a corresponding other node in an all-to-all communication (such as shown in
In some embodiments, for generating the compressed set of input index vectors as described above, each node first receives or acquires a respective set of input index vectors with a number of parts. Each node then processes input index vectors in each of the parts to identify duplicate input indices in the input index vectors. For example, in some embodiments, each node identifies unique input indices in the elements of each input index vector in each of the parts. The unique input indices are first appearances of input indices that are subsequently duplicated one or more times within the elements of a given input index vector in a part—as well as input indices that appear only once in a given input index vector in the part. Each node then identifies locations of input indices in the elements of each input index vector in each part that are duplicates of unique input indices in that input index vector. Each node then generates the compressed set of input index vectors for each part by removing duplicate input indices from each input index vector for that part—leaving only unique input indices in the input index vectors in the compressed set of input index vectors for each part. Each node then uses the compressed set of input index vectors for each of the parts of the respective set of input index vectors to perform lookups in the embedding tables stored in the local memory to generate compressed lookup data as described above.
Along with generating the compressed set of input index vectors for each part of the respective set of input index vectors, each node produces a record of locations of duplicate input indices for each part. Generally, the record of locations of duplicate input indices for each part identifies locations from where input indices were removed from input index vectors in that part when generating the compressed set of input index vectors for that part. Using the information in the records of duplicate input indices, the original/uncompressed input index vectors in each part can be regenerated from the compressed set of input index vectors for that part. In addition, and as described in more detail below, the information in the record of locations of duplicate input indices for each part can be used to generate decompressed lookup data from compressed lookup data that was acquired using the compressed set of input index vectors for that part. For producing the record of locations of duplicate input indices for each part, each node produces a record that identifies: (1) locations of unique input indices in each input index vector in the compressed set of input index vectors for that part, (2) locations of removed input indices that are duplicates of the unique input indices in each input index vector in that part, and (3) a number of indices in the input index vectors in the compressed set of input index vectors for that part. After generating a record of locations of duplicate input indices for each part, the node communicates that record to a corresponding other node—i.e., to another node that is to use that record for decompressing compressed lookup data as described herein.
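One possible way to implement the compression and record generation described above is sketched below, using NumPy's unique-value routine to identify unique and duplicate input indices; the function names are illustrative. For each input index vector, the sketch returns the compressed vector (first appearances only), a record entry per original location, and the number of indices remaining after compression.

```python
import numpy as np

def compress_index_vector(index_vector):
    """Remove duplicate indices (keeping first appearances) and build the record entries that
    map each original location to a location in the compressed input index vector."""
    index_vector = np.asarray(index_vector)
    uniques, first_pos, inverse = np.unique(index_vector, return_index=True, return_inverse=True)
    order = np.argsort(first_pos)            # restore first-appearance order
    compressed = uniques[order]
    remap = np.empty_like(order)
    remap[order] = np.arange(len(order))
    record = remap[inverse]                  # record[i] = where original element i now lives
    return compressed, record

def compress_part(part):
    """Compress every input index vector in one part of a respective set of input index vectors.
    Returns the compressed set, the per-vector record entries, and the per-vector sizes."""
    compressed_set, records, sizes = [], [], []
    for vec in part:
        compressed, record = compress_index_vector(vec)
        compressed_set.append(compressed)
        records.append(record)
        sizes.append(len(compressed))        # number of indices remaining after compression
    return compressed_set, records, sizes
```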
In some embodiments, although lookup data is compressed for the all-to-all communication between the nodes, the compressed lookup data is eventually decompressed by a receiving node for use in subsequent operations (i.e., for use in operations that rely on the lookup data that is not present in the compressed lookup data). For this operation, a sending node first generates compressed lookup data for a part of a respective set of input index vectors using a compressed set of input index vectors for the part as described above. The sending node then generates and communicates a record of locations of duplicate input indices for the part to a given node and separately communicates the compressed lookup data for the part to the given node in an all-to-all communication as described above. The given node sets aside/generates a buffer for storing the compressed lookup data based on a size of the data associated with the compressed lookup data (e.g., included in or with the record of locations of duplicate input indices). The given node then receives the compressed lookup data for the part, which is stored in the buffer, and decompresses the compressed lookup data for the part to generate decompressed lookup data for the part. For this operation, the given node identifies locations of missing duplicate lookup data in the compressed lookup data for the part using the record of locations of duplicate input indices for the part. The given node then copies lookup data from the corresponding locations in the compressed lookup data for the part to the locations of the missing duplicate lookup data. During the decompression operation, therefore, the given node creates the full lookup data for the part that would have been generated by the sending node had the sending node performed lookups using all of the indices in the part of the respective set of input index vectors (instead of using the compressed set of input index vectors for the part). The given node combines decompressed lookup data generated in this way from all of the other nodes to produce the full lookup data used for processing instances of input data through the model. Because the lookup data in the compressed lookup data is sufficient to create the full lookup data using the record of duplicate input indices, the compression of the lookup data is “lossless.” That is, all data needed for creating the full lookup data can be found in the compressed lookup data.
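A corresponding sketch of the decompression follows: the compressed lookup data contains one row per unique index, and the record of duplicate input indices indicates which compressed row to copy into each original location. The usage example (with made-up index values) shows that the roundtrip is lossless.

```python
import numpy as np

def decompress_lookup_data(compressed_rows, record):
    """compressed_rows: (n_unique, dim) rows acquired using the compressed input index vector.
    record: for each original location, the compressed row that holds its lookup data.
    Copying rows according to the record rebuilds the full lookup data."""
    return np.asarray(compressed_rows)[np.asarray(record)]

# Usage with made-up values: the compressed indices and record below are what the compression
# sketch above would produce for original indices [42, 7, 42, 42, 9].
table = np.arange(100 * 4, dtype=np.float32).reshape(100, 4)   # stand-in embedding table
original_indices = np.array([42, 7, 42, 42, 9])                # 5 locations, 3 unique indices
compressed_indices = np.array([42, 7, 9])                      # duplicates removed
record = np.array([0, 1, 0, 0, 2])                             # original location -> compressed row
compressed_rows = table[compressed_indices]                    # 3 lookups instead of 5
assert np.array_equal(decompress_lookup_data(compressed_rows, record), table[original_indices])
```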
By using the compressed sets of input index vectors for performing lookups in embedding tables to generate compressed lookup data, the described embodiments reduce the number of lookups that are performed in each of the nodes for acquiring the lookup data (in contrast to existing electronic devices, which use the full lookup data). This reduces the operational load on the nodes and the local memories in the nodes. In addition, by communicating compressed lookup data from node to node, the described embodiments reduce the amount of data communicated on the communication fabric and reduce the latency for operations of the nodes that rely on the lookup data (in contrast to existing electronic devices that communicate the full lookup data). This renders the nodes and the communication fabric more available for performing other operations, which can increase the performance of the nodes and the communication fabric. Increasing the performance of the nodes or the communication fabric increases the performance of the electronic device, which increases user satisfaction with the electronic device.
Each node 402 includes a processor 406. The processor 406 in each node 402 is a functional block that performs computational, memory access, and/or other operations (e.g., control operations, configuration operations, etc.). For example, each processor 406 can be or include a graphics processing unit (GPU) or GPU core, a central processing unit (CPU) or CPU core, an accelerated processing unit (APU), a system on a chip (SOC), a field programmable gate array (FPGA), and/or another form of processor.
Each node 402 includes a memory 408 (which can be called a “local memory” herein). The memory 408 in each node 402 is a functional block that performs operations for storing data for accesses by the processor 406 in that node 402 (and possibly processors 406 in other nodes). Each memory 408 includes volatile and/or non-volatile memory circuits for storing data, as well as control circuits for handling accesses of the data stored in the memory circuits, performing control or configuration operations, etc. For example, in some embodiments, the processor 406 in each node 402 is a GPU or GPU core and the respective local memory 408 is or includes graphics memory circuitry such as graphics double data rate synchronous DRAM (GDDR). As described herein, the memories 408 in some or all of the nodes 402 store embedding tables and other model data for use in processing instances of input data through a model (e.g., model 100).
Communication fabric 404 is a functional block and/or device that performs operations for or associated with communicating data between nodes 402. Communication fabric 404 is or includes wires, guides, traces, wireless communication channels, transceivers, control circuits, antennas, and/or other functional blocks and devices that are used for communicating data. For example, in some embodiments, nodes 402 are or include GPUs and communication fabric 404 is a graphics interconnect and/or other system bus. In some embodiments, compressed lookup data and records of duplicate input indices are communicated from node 402 to node 402 via communication fabric 404 as described herein.
Although electronic device 400 is shown in
Electronic device 400 and nodes 402 are simplified for illustrative purposes. In some embodiments, however, electronic device 400 and/or nodes 402 include additional or different functional blocks, subsystems, elements, and/or communication paths. For example, electronic device 400 and/or nodes 402 may include display subsystems, power subsystems, input-output (I/O) subsystems, etc. Electronic device 400 generally includes sufficient functional blocks, subsystems, elements, and/or communication paths to perform the operations herein described. In addition, although four nodes 402 are shown in
Electronic device 400 can be, or can be included in, any device that can perform the operations described herein. For example, electronic device 400 can be, or can be included in, a desktop computer, a laptop computer, a wearable computing device, a tablet computer, a piece of virtual or augmented reality equipment, a smart phone, an artificial intelligence (AI) or machine learning device, a server, a network appliance, a toy, a piece of audio-visual equipment, a home appliance, a vehicle, and/or combinations thereof. In some embodiments, electronic device 400 is or includes a circuit board or other interposer to which multiple nodes 402 are mounted or connected and communication fabric 404 is an inter-node communication route. In some embodiments, electronic device 400 is or includes a set or group of computers (e.g., a group of server nodes in a data center) and communication fabric 404 is a wired and/or wireless network that connects the nodes 402. In some embodiments, electronic device 400 is included on one or more semiconductor chips. For example, in some embodiments, electronic device 400 is entirely included in a single “system on a chip” (SOC) semiconductor chip, is included on one or more ASICs, etc.
In the described embodiments, nodes in an electronic device perform operations for compressing lookup data. Generally, for this operation, the nodes generate compressed sets of input index vectors for parts of respective sets of input index vectors that are used for acquiring lookup data from embedding tables. Because the data is acquired from the embedding tables using compressed sets of input index vectors, the acquired lookup data, i.e., compressed lookup data, includes less lookup data than would be present if the full respective sets of input index vectors were used for acquiring the lookup data. The nodes also generate records of duplicate input indices that identify indices that were removed from the input index vectors during the compression operations. The records of duplicate input indices can be used for operations including decompressing corresponding compressed lookup data.
For the example in
For the example in
As can be seen in
The operations in
Each node also generates a record of duplicate input indices. For this operation, each node creates a record that can be used for identifying locations from where indices were removed from input index vectors in the corresponding part when generating the compressed set of input index vectors. Generally, each record of duplicate input indices includes information for identifying locations of unique input indices in each input index vector in the compressed set of input index vectors for the corresponding part, as well as for identifying locations of removed input indices that are duplicates of the unique input indices in each input index vector for that part. In some embodiments, the records of the duplicate input indices are arranged as shown in records of duplicate input indices (RECORD OF DUPLICATE INP IND) 516-522, with each location in the record associated with a location in the part (i.e., having a one-to-one correspondence with elements that were present in the part of the respective set of input index vectors). In these embodiments, each element in the record of duplicate input indices includes a reference to a location of a unique index in the compressed set of input index vectors. Continuing the example from part 502, a first column in the record of duplicate input indices 518 has 0s in the first and third elements, which identify that the index in the first input index vector in part 502 that was originally found in the first and third locations of the first input index vector in part 502 can now be found in the first element of the first input index vector in compressed set of input index vectors 510. In other words, record of duplicate input indices 518 indicates that, in the original state of part 502, the index in the first and third elements matched (the second instance was removed during the compression as described above). In addition, the first column in the record of duplicate input indices has a 1 in the second element, which identifies that the index in the first input index vector in part 502 that was originally found in the second location of the first input index vector in part 502 can now be found in the second element of the first input index vector in compressed set of input index vectors 510. Also, the first column in the record of duplicate input indices has 2s in the fourth and fifth elements, which identify that the index in the first input index vector in part 502 that was originally found in the fourth and fifth locations of the first input index vector in part 502 can now be found in the third element of the first input index vector in compressed set of input index vectors 510. In other words, the record of duplicate input indices indicates that, in the original state of part 502, the index in the fourth and fifth elements of the first input index vector matched. The first instance of the index 3 was moved from the fourth location during the compression and the second instance of the index was removed during the compression as described above.
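To make the column walked through above concrete, the following illustrative values (only the index 3 in the fourth and fifth locations is taken from the description; the other values are made up) reproduce the first input index vector of part 502, its compressed counterpart, and the matching column of record of duplicate input indices 518.

```python
# First input index vector of part 502 (illustrative values; only the index 3 in the fourth
# and fifth locations is taken from the description above).
original_column   = [5, 8, 5, 3, 3]   # locations 1 and 3 match; locations 4 and 5 both hold index 3
compressed_column = [5, 8, 3]         # first appearances kept, duplicates removed
record_column     = [0, 1, 0, 2, 2]   # location i of the original is found at compressed_column[record_column[i]]

assert all(original_column[i] == compressed_column[record_column[i]] for i in range(5))
```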
Each node also separately generates a size of the data and includes the size of the data in or with the record of duplicate input indices. Examples of sizes of the data are shown as size of the data (SIZE OF DATA) 524-530 for each of parts 500-506, respectively, in
In the described embodiments, nodes perform operations for decompressing compressed lookup data received from other nodes. Generally, for this operation, when processing instances of input data through a model, each node receives, from the other nodes, compressed lookup data that was generated using a compressed set of input index vectors. The nodes decompress the compressed lookup data using a record of duplicate input indices to generate decompressed data that is then used for subsequent operations for processing instances of input data through the model.
For the examples in
For the example in
The operations in
After generating compressed lookup data 600, node1 communicates compressed lookup data 600 to node0 in an all-to-all communication (recall that node0 is the node processing the corresponding instances of input data). Before communicating compressed lookup data 600 to node0, however, node1 communicates record of duplicate input indices 518 and size of the data 526 to node0. Receiving size of the data 526 in advance of compressed lookup data 600 enables node0 to determine the size/amount of data in compressed lookup data 600 for operations such as reserving receiver buffer space for storing compressed lookup data 600, etc.
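The following small sketch (with illustrative names and values) shows how a receiving node such as node0 can use the size of the data to reserve buffer space before the compressed lookup data arrives.

```python
import numpy as np

def reserve_receive_buffer(size_of_data, embedding_dim, dtype=np.float32):
    """Reserve space for incoming compressed lookup data: one row per remaining unique index."""
    return np.empty((size_of_data, embedding_dim), dtype=dtype)

# For example, if size of the data 526 indicated that three unique indices remain for the part,
# node0 could reserve a three-row buffer before compressed lookup data 600 arrives.
recv_buffer = reserve_receive_buffer(size_of_data=3, embedding_dim=4)
```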
The operations in
Although a particular compressed set of input index vectors, embedding tables, compressed lookup data, record of duplicate input indices, size of the data, and decompressed lookup data are shown in
In the described embodiments, lookup data is compressed in order to avoid the need for communicating full lookup data between nodes during an all-to-all communication while processing instances of input data through a model.
For the operations in
For the example in
The operations shown in
The nodes then communicate the records of duplicate input indices to corresponding other nodes (step 808). Recall that, in some embodiments, the records include size of the data information (e.g., size of the data 524, etc.) that is subsequently used by a receiving node for determining a size of compressed lookup data to be received from the node that sent the record of duplicate input indices. The nodes therefore send the records of duplicate input indices prior to the all-to-all communication in step 812 (and possibly substantially in parallel with performing embedding table lookups in step 810) so that the receiving nodes will be able to process received compressed lookup data.
The nodes then use the compressed sets of input index vectors to perform embedding table lookups for generating compressed lookup data (step 810). For this operation, for input index vectors (or elements thereof) in the compressed set of input index vectors for each part of the respective set of input index vectors, each node looks up data in the corresponding embedding table. The nodes generate corresponding compressed lookup data using lookup data acquired during the lookups. For example, in some embodiments, for each compressed set of input index vectors, each node performs operations similar to those shown in
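A sketch of step 810 for a single node is shown below (names are illustrative): for each part of the respective set of input index vectors, the node gathers rows from each of its locally stored embedding tables using that part's compressed input index vectors, producing one block of compressed lookup data per table.

```python
import numpy as np

def compressed_lookups_for_part(local_tables, compressed_set_for_part):
    """local_tables: list of (rows, dim) embedding-table arrays stored in this node's local memory.
    compressed_set_for_part: one compressed input index vector per locally stored table.
    Returns the compressed lookup data for the part, one block of rows per table."""
    return [table[np.asarray(vec)] for table, vec in zip(local_tables, compressed_set_for_part)]
```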
The nodes then perform an all-to-all communication of the compressed lookup data (step 812). For this operation, the nodes communicate the compressed lookup data for each other node to that other node in an operation similar to that shown in
The nodes receive the compressed lookup data from each other node (i.e., each node receives compressed lookup data from three other nodes) and perform operations for decompressing the compressed lookup data (step 814). For this operation, the nodes use the respective record of duplicate input indices to identify locations of missing lookup data in the compressed lookup data received from each other node and copy the missing lookup data to the locations in preparation for using the decompressed lookup data in subsequent operations for processing instances of input data through the model. For example, in some embodiments, the nodes can perform operations similar to those shown in
The nodes then use the decompressed lookup data for processing instances of input data through the model (step 816). For example, in some embodiments, the nodes use the decompressed lookup data as input to combination 114, where the decompressed lookup data is combined with output from multilayer perceptron 106 to prepare input data for multilayer perceptron 116. Step 816 is the last operation of the forward portion 800, and so a model output 118 is generated during/as a result of step 816. For example, during step 816, the nodes can generate ranked lists of items from among a set of items to be presented to users as recommendations, probabilities that a particular user will click on/select a given item if presented with the item, and/or other outputs.
The nodes then commence operations for backward portion 802, i.e., for using the output from model 100 to update/train the model. The nodes therefore calculate a loss based on the output from the model (step 818). The nodes then use the record that identifies locations of duplicate input indices (from steps 806-808) to generate compressed training data (step 820). For this operation, the nodes remove, from the full set of training data, training data at the locations of duplicated input indices, thereby reducing the size of the training data. The nodes then send the compressed training data (i.e., corresponding portions of the compressed training data) to all the other nodes using an all-to-all communication (step 822). The other nodes use the compressed training data to make model data updates (step 824). The model data updates include updates to the embedding tables stored in the local memory in the nodes—and to other model data, such as parameters for the multilayer perceptrons, etc. The other nodes therefore use the compressed training data to compute/determine model data updates and then write the updates to the embedding tables. For this operation, the use of the compressed training data (i.e., without decompressing the training data) is functionally correct due to the duplicative nature of the training data that is removed when generating the compressed training data.
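The use of the record during the backward portion can be sketched as follows. The sketch is illustrative only and assumes, per the description above, that the training data entries removed at duplicated locations are duplicative, so that the table-owning node can apply the compressed training data directly; the plain SGD update is a stand-in for whatever optimizer is actually used.

```python
import numpy as np

def compress_training_data(full_training_rows, record):
    """Keep one training-data row per unique index, using the record from the forward pass.
    full_training_rows: (original locations, dim); record: original location -> compressed location."""
    record = np.asarray(record)
    keep = np.zeros(record.max() + 1, dtype=int)
    for pos in range(len(record) - 1, -1, -1):   # first occurrence of each compressed location wins
        keep[record[pos]] = pos
    return full_training_rows[keep]

def apply_compressed_updates(embedding_table, compressed_indices, compressed_training_rows, lr=0.01):
    """Owner-side update of locally stored embedding-table rows (plain SGD, for illustration only)."""
    embedding_table[compressed_indices] -= lr * compressed_training_rows
```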
In some embodiments, as part of using the decompressed lookup data in the model (step 816), the nodes perform operations for preparing the decompressed lookup data for being used in the model. For example, in some embodiments, the nodes perform accumulation operations on decompressed lookup data in order to combine and/or reduce the lookup data. For instance, the nodes can perform mathematical, bitwise, logical, and/or other operations on portions of the decompressed lookup data (e.g., data from individual rows of the embedding table) in order to combine the portions of the decompressed lookup data.
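As one example of such an accumulation, the following minimal sketch combines a group of decompressed lookup-data rows by elementwise summation (one of several possible reductions).

```python
import numpy as np

def accumulate_rows(decompressed_rows):
    """Combine several decompressed lookup-data rows into one vector by elementwise summation
    (one of several possible reductions, e.g., sum, mean, or max)."""
    return np.asarray(decompressed_rows).sum(axis=0)
```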
Although a training process having a forward portion 800 and a backward portion 802 is presented as an example in
In the described embodiments, a node performs operations for compressing lookup data that is to be communicated to other nodes for processing instances of input data through a model. Generally, these operations include generating compressed sets of input index vectors for parts of respective sets of input index vectors (e.g., parts 312-316 of respective set of input index vectors 310, etc.) so that nodes can perform a smaller number of lookups in embedding tables stored in local memories in the nodes—and can therefore generate less lookup data to be communicated to other nodes.
For the example in
For the example in
The operations in
In the described embodiments, nodes perform operations for decompressing compressed lookup data received from other nodes. Generally, for this operation, when processing instances of input data through a model, each node receives, from each other node, compressed lookup data that was generated using a compressed set of input index vectors. Each node then decompresses the compressed lookup data using a record of duplicate input indices to generate decompressed data that is then used for subsequent operations for processing instances of input data through the model.
For the example in
For the example in
The process shown in
In some embodiments, at least one electronic device (e.g., electronic device 400, etc.) or some portion thereof uses code and/or data stored on a non-transitory computer-readable storage medium to perform some or all of the operations described herein. More specifically, the at least one electronic device reads code and/or data from the computer-readable storage medium and executes the code and/or uses the data when performing the described operations. A computer-readable storage medium can be any device, medium, or combination thereof that stores code and/or data for use by an electronic device. For example, the computer-readable storage medium can include, but is not limited to, volatile and/or non-volatile memory, including flash memory, random access memory (e.g., DDR5 DRAM, SRAM, eDRAM, etc.), non-volatile RAM (e.g., phase change memory, ferroelectric random access memory, spin-transfer torque random access memory, magnetoresistive random access memory, etc.), read-only memory (ROM), and/or magnetic or optical storage mediums (e.g., disk drives, magnetic tape, CDs, DVDs, etc.).
In some embodiments, one or more hardware modules perform the operations described herein. For example, the hardware modules can include, but are not limited to, one or more central processing units (CPUs)/CPU cores, graphics processing units (GPUs)/GPU cores, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), compressors or encoders, encryption functional blocks, compute units, embedded processors, accelerated processing units (APUs), controllers, requesters, completers, network communication links, and/or other functional blocks. When circuitry (e.g., integrated circuit elements, discrete circuit elements, etc.) in such hardware modules is activated, the circuitry performs some or all of the operations. In some embodiments, the hardware modules include general purpose circuitry such as execution pipelines, compute or processing units, etc. that, upon executing instructions (e.g., program code, firmware, etc.), performs the operations. In some embodiments, the hardware modules include purpose-specific or dedicated circuitry that performs the operations “in hardware” and without executing instructions.
In some embodiments, a data structure representative of some or all of the functional blocks and circuit elements described herein (e.g., electronic device 400 or some portion thereof) is stored on a non-transitory computer-readable storage medium that includes a database or other data structure which can be read by an electronic device and used, directly or indirectly, to fabricate hardware including the functional blocks and circuit elements. For example, the data structure may be a behavioral-level description or register-transfer level (RTL) description of the hardware functionality in a high-level design language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool which may synthesize the description to produce a netlist including a list of transistors/circuit elements from a synthesis library that represent the functionality of the hardware including the above-described functional blocks and circuit elements. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits (e.g., integrated circuits) corresponding to the above-described functional blocks and circuit elements. Alternatively, the database on the computer accessible storage medium may be the netlist (with or without the synthesis library) or the data set, as desired, or Graphic Data System (GDS) II data.
In this description, variables or unspecified values (i.e., general descriptions of values without particular instances of the values) are represented by letters such as N, T, and X. As used herein, despite possibly using similar letters in different locations in this description, the variables and unspecified values in each case are not necessarily the same, i.e., there may be different variable amounts and values intended for some or all of the general variables and unspecified values. In other words, particular instances of N and any other letters used to represent variables and unspecified values in this description are not necessarily related to one another.
The expression “et cetera” or “etc.” as used herein is intended to present an and/or case, i.e., the equivalent of “at least one of” the elements in a list with which the etc. is associated. For example, in the statement “the electronic device performs a first operation, a second operation, etc.,” the electronic device performs at least one of the first operation, the second operation, and other operations. In addition, the elements in a list associated with an etc. are merely examples from among a set of examples—and at least some of the examples may not appear in some embodiments.
The foregoing descriptions of embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the embodiments to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the embodiments. The scope of the embodiments is defined by the appended claims.