When training machine learning models, computations such as matrix multiplication are frequently performed on large matrices, for example, matrices with tens of thousands or hundreds of thousands of rows and columns. Such matrices may occupy large amounts of memory when stored. In addition, computations performed on them are often resource-intensive in terms of both memory and processor utilization.
According to one aspect of the present disclosure, a computing device is provided, including one or more processing devices configured to receive a first matrix including a plurality of first matrix elements arranged in a plurality of first submatrices. The one or more processing devices may be further configured to generate first matrix sparsity metadata indicating one or more zero submatrices and one or more nonzero submatrices of the plurality of first submatrices. Each of the first matrix elements included in the one or more zero submatrices may be equal to zero. The one or more processing devices may be further configured to store, in memory, a compressed first matrix including the first matrix sparsity metadata and the one or more nonzero submatrices and not including the one or more zero submatrices.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Matrices that are processed in machine learning settings are frequently sparse matrices in which large proportions of the matrix elements are equal to zero. In order to reduce the amount of memory required to store such matrices, the systems and methods for compressing sparse matrices described herein are provided, as discussed in further detail below. In addition, when sparse matrices are compressed according to such systems and methods, shortcuts may be taken when performing computations with the compressed matrices. These shortcuts may allow the processor and memory utilization of such computations to be reduced.
In some examples, the functionality of the computing device 10 may be distributed between a plurality of networked physical computing devices rather than being provided in a single physical computing device. For example, the computing device 10 may be instantiated in a data center, and one or more components of the computing device 10 may be provided in a plurality of physical computing devices that are located in the data center and connected via a network. The physical computing devices located in the data center may be configured to communicate with one or more client computing devices which may be located outside the data center and which may also at least partially instantiate one or more of the components of the computing device 10.
The one or more processing devices 12 may be configured to receive a first matrix 20 including a plurality of first matrix elements 24. Each first matrix element 24 included in the first matrix 20 may be a numerical value. In addition, the first matrix elements 24 may be arranged in a plurality of first submatrices 22. The plurality of first submatrices 22 may each be of a same size, such as 16×16 or 16×32. The size shared by each of the plurality of first submatrices 22 may be set at the one or more processing devices 12, for example, in response to receiving a user input. The number of rows included in the first matrix 20 may be a multiple of the number of rows included in each of the plurality of first submatrices 22, and the number of columns included in the first matrix 20 may be a multiple of the number of columns included in each of the plurality of first submatrices 22.
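By way of a concrete illustration (and not as a description of the disclosed hardware), the tiling discussed above might be sketched in Python/NumPy as follows. The function name, the 16×16 default tile size, and the nested-list representation of the tiles are assumptions made for this sketch.

```python
import numpy as np

def partition_into_submatrices(matrix, block_rows=16, block_cols=16):
    # Split a matrix into equally sized submatrices (tiles). As described
    # above, the matrix dimensions are assumed to be exact multiples of
    # the tile dimensions.
    rows, cols = matrix.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    return [
        [matrix[i:i + block_rows, j:j + block_cols]
         for j in range(0, cols, block_cols)]
        for i in range(0, rows, block_rows)
    ]
```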
The one or more processing devices 12 may be further configured to generate first matrix sparsity metadata 26 indicating one or more zero submatrices 22A and one or more nonzero submatrices 22B of the plurality of first submatrices 22. Each of the first matrix elements 24 included in the one or more zero submatrices 22A is equal to zero. In addition, each of the one or more nonzero submatrices 22B includes at least one first matrix element 24 that is not equal to zero. Each first submatrix 22 may, in some examples, have a corresponding bit in the first matrix sparsity metadata 26 that indicates whether that submatrix is a zero submatrix 22A or a nonzero submatrix 22B. In such examples, the first matrix sparsity metadata 26 may indicate each of the one or more zero submatrices 22A with a zero and each of the one or more nonzero submatrices 22B with a one. Alternatively, the first matrix sparsity metadata 26 may indicate each of the one or more nonzero submatrices 22B with a zero and each of the one or more zero submatrices 22A with a one.
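A minimal sketch of generating such per-submatrix bitmask metadata, assuming the nested tile list produced by the previous sketch and the first convention described above (zero marks a zero submatrix):

```python
def sparsity_bitmask(tiles):
    # One bit per submatrix: 0 marks an all-zero tile, 1 marks a tile
    # with at least one nonzero element.
    return [[1 if tile.any() else 0 for tile in row] for row in tiles]
```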
In some examples, prior to generating the first matrix sparsity metadata 26, the one or more processing devices 12 may be further configured to determine that one or more first matrix elements 24 of the plurality of first matrix elements 24 are below a predefined threshold 28. In response to making this determination, the one or more processing devices 12 may be further configured to set the one or more first matrix elements 24 that are below the predefined threshold 28 to zero. For example, the predefined threshold 28 may be equal to zero. Thus, in such examples, the one or more processing devices 12 may be configured to apply a rectified linear unit (ReLU) function to the first matrix elements 24. In other examples, the predefined threshold 28 may be a positive number.
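As an illustrative sketch of this preprocessing step (the function name and NumPy usage are assumptions of this sketch, not the disclosed implementation):

```python
def threshold_to_zero(matrix, threshold=0.0):
    # Set every element below the predefined threshold to zero. With
    # threshold == 0.0 this reproduces the ReLU function mentioned above;
    # a positive threshold also zeroes out small positive values.
    out = matrix.copy()
    out[out < threshold] = 0.0
    return out
```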
In some examples, the one or more processing devices 12 may include a hardware accelerator 12B configured to multiply the first matrix 20 and a second matrix 50 to compute a result matrix 70. The second matrix 50 may include a plurality of second matrix elements 54 arranged in a plurality of second submatrices 52, and the result matrix 70 may include a plurality of result matrix elements arranged in a plurality of result submatrices 72. The hardware accelerator 12B may be configured to receive the compressed first matrix at a first input buffer and to receive the second matrix 50 at a second input buffer.
The hardware accelerator 12B may be configured to compute the result matrix 70 at least in part by computing a plurality of submatrix products 60 of the plurality of first submatrices 22 of the first matrix 20 and the plurality of second submatrices 52 of the second matrix 50, respectively. The plurality of submatrix products 60 may be computed at a front-end processing area 42 of the hardware accelerator 12B. As discussed in further detail below, the plurality of submatrix products 60 may be summed to compute the result submatrices 72. Computing the plurality of submatrix products 60 may include, for each submatrix product 60 of a zero submatrix 22A of the one or more zero submatrices 22A and a second submatrix 52 of the plurality of second submatrices 52, setting each submatrix product element 62 of the submatrix product 60 to zero. Each submatrix product element 62 of the submatrix product of a zero submatrix 22A and a second submatrix 52 may be set to zero without retrieving, from the memory 14, the plurality of first matrix elements 24 included in the zero submatrix 22A or the plurality of second matrix elements 54 included in the second submatrix 52. Thus, the number of memory calls made by the hardware accelerator 12B when multiplying the first matrix 20 and the second matrix 50 may be reduced. In addition, the hardware accelerator 12B may save processing time and bandwidth that would otherwise have been spent computing dot products between the first matrix elements 24 of the zero submatrix 22A and the second matrix elements 54 of the second submatrix 52.
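The following is a software sketch in the spirit of this behavior, not the accelerator datapath itself; square tiles of a single assumed size and the nested-list tile format from the earlier sketches are assumptions of this sketch.

```python
import numpy as np

def tiled_matmul(a_tiles, a_mask, b_tiles, block=16):
    # Blocked multiply: when the sparsity metadata marks a first-matrix
    # tile as zero, its submatrix product is known to be all zeros, so
    # the tile data is never read and no dot products are computed.
    n_i, n_k, n_j = len(a_tiles), len(b_tiles), len(b_tiles[0])
    result = [[np.zeros((block, block)) for _ in range(n_j)]
              for _ in range(n_i)]
    for i in range(n_i):
        for j in range(n_j):
            for k in range(n_k):
                if a_mask[i][k] == 0:
                    continue  # zero submatrix: skip the fetch and multiply
                result[i][j] += a_tiles[i][k] @ b_tiles[k][j]
    return result
```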
In examples in which the hardware accelerator 12B is configured to compute a plurality of submatrix products 60, the hardware accelerator 12B may be further configured to assign submatrix product sparsity metadata 64 to each submatrix product 60 of the plurality of submatrix products 60. The submatrix product sparsity metadata 64 may indicate whether the submatrix product 60 is a zero submatrix product for which all the submatrix product elements 62 of the submatrix product 60 are equal to zero. For example, the hardware accelerator 12B may be configured to assign a zero to the submatrix product 60 as the submatrix product sparsity metadata 64 when the submatrix product 60 is a zero submatrix product and assign a one to the submatrix product 60 as the submatrix product sparsity metadata 64 when the submatrix product 60 is a nonzero submatrix product.
Multiplying the first matrix 20 and the second matrix 50 may further include computing a submatrix product sum 66 of two or more submatrix products 60 of the plurality of submatrix products 60 that share respective locations in the result matrix 70. The location of a submatrix product 60 in the result matrix 70 may be determined by the respective locations, in the first matrix 20 and the second matrix 50, of the first submatrix 22 and the second submatrix 52 for which the submatrix product 60 is computed.
When computing the submatrix product sum 66, the hardware accelerator 12B may be configured to determine, for each submatrix product 60 of the two or more submatrix products 60, whether that submatrix product 60 is a zero submatrix product in which all the submatrix product elements 62 are equal to zero. This determination may be made based on the submatrix product sparsity metadata 64 associated with each submatrix product 60. The hardware accelerator 12B may be further configured to skip adding each zero submatrix product to the submatrix product sum 66. Thus, unnecessary computations that would not change the submatrix product sum 66 may be avoided.
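A minimal sketch of the two steps just described, assuming the single-bit-per-product convention above and the NumPy arrays used in the earlier sketches:

```python
import numpy as np

def product_sparsity_bit(product):
    # Submatrix product sparsity metadata: 1 if any element is nonzero,
    # 0 for a zero submatrix product.
    return 1 if product.any() else 0

def accumulate_products(products, bits, block=16):
    # Sum the submatrix products that share a location in the result
    # matrix, skipping any product whose sparsity bit marks it as all
    # zeros, since adding it cannot change the sum.
    total = np.zeros((block, block))
    for product, bit in zip(products, bits):
        if bit == 0:
            continue  # zero submatrix product: skip the addition
        total += product
    return total
```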
Subsequently to computing the result matrix 70, the one or more processing devices 12 may be further configured to generate a compressed result matrix 80. The compressed result matrix 80 may include result matrix sparsity metadata indicating one or more zero result submatrices and one or more nonzero result submatrices of the result matrix 70, and may further include the one or more nonzero result submatrices while not including the one or more zero result submatrices. The one or more processing devices 12 may be further configured to store the compressed result matrix 80 in the memory 14.
At step 102, the method 100 may include receiving a first matrix including a plurality of first matrix elements arranged in a plurality of first submatrices. The first matrix may be received from memory at a processing device of the one or more processing devices. The plurality of first submatrices may each be of a same size, such as 16×16 or 16×32.
At step 104, the method 100 may further include generating first matrix sparsity metadata for the first matrix. The first matrix sparsity metadata may indicate one or more zero submatrices and one or more nonzero submatrices of the plurality of first submatrices, where each of the first matrix elements included in the one or more zero submatrices is equal to zero. Each of the one or more nonzero submatrices includes at least one respective first matrix element that is not equal to zero. In some examples, the first matrix sparsity metadata may be stored as a header of the compressed first matrix. The first matrix sparsity metadata may use a respective bit associated with each of the first submatrices to indicate whether that submatrix is a zero submatrix. For example, the first matrix sparsity metadata may indicate each of the one or more zero submatrices with a zero and each of the one or more nonzero submatrices with a one.
At step 106, the method 100 may further include storing, in memory, a compressed first matrix including the first matrix sparsity metadata and the one or more nonzero submatrices. The compressed first matrix does not include the one or more zero submatrices. Thus, storage space that would otherwise be used to store the one or more zero submatrices may be saved.
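Steps 102-106 might be sketched as follows; the dict-based container, the function names, and the row-major tile order are illustrative assumptions of this sketch rather than the disclosed storage format.

```python
import numpy as np

def compress(matrix, block=16):
    # Record one header bit per submatrix and keep only the nonzero
    # tiles; zero tiles are dropped entirely.
    rows, cols = matrix.shape
    header, tiles = [], []
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = matrix[i:i + block, j:j + block]
            if tile.any():
                header.append(1)
                tiles.append(tile.copy())
            else:
                header.append(0)
    return {"shape": (rows, cols), "block": block,
            "header": header, "tiles": tiles}

def decompress(compressed):
    # Inverse operation: reinsert a zero tile wherever the header bit is 0.
    rows, cols = compressed["shape"]
    block = compressed["block"]
    out = np.zeros((rows, cols))
    tiles = iter(compressed["tiles"])
    bits = iter(compressed["header"])
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            if next(bits):
                out[i:i + block, j:j + block] = next(tiles)
    return out
```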
At step 112, computing the plurality of submatrix products may include, for each submatrix product of a zero submatrix of the one or more zero submatrices and a second submatrix of the plurality of second submatrices, setting each submatrix product element of the submatrix product to zero. The submatrix product elements may be set to zero without retrieving, from the memory, the plurality of first matrix elements included in the zero submatrix or the plurality of second matrix elements included in the second submatrix. Instead, the one or more processing devices at which the method 100 is performed may refer to the first matrix sparsity metadata and shortcut the computation of the submatrix product elements when the first submatrix is a zero submatrix. When the first submatrix is a nonzero submatrix, the submatrix product may instead be computed by computing a plurality of dot products between rows and columns of the nonzero submatrix and the second submatrix.
In some examples, at step 114, step 108 may further include assigning submatrix product sparsity metadata to each submatrix product of the plurality of submatrix products computed at step 110. The submatrix product sparsity metadata may indicate whether the submatrix product is a zero submatrix product for which all the submatrix product elements of the submatrix product are equal to zero. In some examples, the submatrix product sparsity metadata may be a single bit provided as a header of the submatrix product.
In examples in which the submatrix products are assigned submatrix product sparsity metadata, step 108 may further include, at step 116, computing a submatrix product sum of two or more submatrix products of the plurality of submatrix products that share respective locations in the result matrix. At step 118, computing the submatrix product sum may include, for each submatrix product of the two or more submatrix products, determining whether that submatrix product is a zero submatrix product. Whether the submatrix product is a zero submatrix product may be determined based on the submatrix product sparsity metadata for that submatrix product. In addition, at step 120, step 116 may further include skipping adding each zero submatrix product to the submatrix product sum. Thus, addition operations that would not affect the values of the result matrix elements may be skipped. In examples in which the result matrix is computed at the hardware accelerator, the result matrix may be output to a result buffer of the hardware accelerator after each result submatrix of the result matrix has been computed.
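Combining the earlier sketches, a quick end-to-end check that the metadata-driven shortcuts leave the numerical result unchanged; the 64×64 shapes and the zeroed column band are arbitrary test data, and the functions referenced are those defined in the sketches above.

```python
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
a[:, 32:] = 0.0                      # force half of A's tiles to be zero
b = rng.standard_normal((64, 64))

a_tiles = partition_into_submatrices(a)
a_mask = sparsity_bitmask(a_tiles)
b_tiles = partition_into_submatrices(b)

c_tiles = tiled_matmul(a_tiles, a_mask, b_tiles)
assert np.allclose(np.block(c_tiles), a @ b)  # matches the dense product
```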
Using the devices and methods discussed above, the amount of memory used to store sparse matrices may be reduced. In addition, matrix multiplication operations performed on the compressed matrices may be performed more quickly by referring to matrix sparsity metadata. These savings in storage space and computing time may be large in machine learning applications, in which sparse matrices are frequently used.
In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer-application program or service, an application-programming interface (API), a library, and/or other computer-program product.
Computing system 200 includes a logic processor 202, volatile memory 204, and a non-volatile storage device 206. Computing system 200 may optionally include a display subsystem 208, input subsystem 210, communication subsystem 212, and/or other components not shown.
Logic processor 202 includes one or more physical devices configured to execute instructions. For example, the logic processor may be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
The logic processor may include one or more physical processors (hardware) configured to execute software instructions. Additionally or alternatively, the logic processor may include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. Processors of the logic processor 202 may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic processor optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic processor may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration. In such a case, it will be understood that these virtualized aspects may be run on different physical logic processors of various different machines.
Non-volatile storage device 206 includes one or more physical devices configured to hold instructions executable by the logic processor to implement the methods and processes described herein. When such methods and processes are implemented, the state of non-volatile storage device 206 may be transformed—e.g., to hold different data.
Non-volatile storage device 206 may include physical devices that are removable and/or built-in. Non-volatile storage device 206 may include optical memory (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory (e.g., ROM, EPROM, EEPROM, FLASH memory, etc.), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), or other mass storage device technology. Non-volatile storage device 206 may include nonvolatile, dynamic, static, read/write, read-only, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. It will be appreciated that non-volatile storage device 206 is configured to hold instructions even when power is cut to the non-volatile storage device 206.
Volatile memory 204 may include physical devices that include random access memory. Volatile memory 204 is typically utilized by logic processor 202 to temporarily store information during processing of software instructions. It will be appreciated that volatile memory 204 typically does not continue to store instructions when power is cut to the volatile memory 204.
Aspects of logic processor 202, volatile memory 204, and non-volatile storage device 206 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.
The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 200 typically implemented in software by a processor to perform a particular function using portions of volatile memory, which function involves transformative processing that specially configures the processor to perform the function. Thus, a module, program, or engine may be instantiated via logic processor 202 executing instructions held by non-volatile storage device 206, using portions of volatile memory 204. It will be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
When included, display subsystem 208 may be used to present a visual representation of data held by non-volatile storage device 206. The visual representation may take the form of a graphical user interface (GUI). As the herein described methods and processes change the data held by the non-volatile storage device, and thus transform the state of the non-volatile storage device, the state of display subsystem 208 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 208 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with logic processor 202, volatile memory 204, and/or non-volatile storage device 206 in a shared enclosure, or such display devices may be peripheral display devices.
When included, input subsystem 210 may comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on- or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; as well as electric-field sensing componentry for assessing brain activity; and/or any other suitable sensor.
When included, communication subsystem 212 may be configured to communicatively couple various computing devices described herein with each other, and with other devices. Communication subsystem 212 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured for communication via a wireless telephone network, or a wired or wireless local- or wide-area network, such as an HDMI over Wi-Fi connection. In some embodiments, the communication subsystem may allow computing system 200 to send and/or receive messages to and/or from other devices via a network such as the Internet.
The following paragraphs describe several aspects of the present disclosure. According to one aspect of the present disclosure, a computing device is provided, including one or more processing devices configured to receive a first matrix including a plurality of first matrix elements arranged in a plurality of first submatrices. The one or more processing devices may be further configured to generate first matrix sparsity metadata indicating one or more zero submatrices and one or more nonzero submatrices of the plurality of first submatrices. Each of the first matrix elements included in the one or more zero submatrices may be equal to zero. The one or more processing devices may be further configured to store, in memory, a compressed first matrix including the first matrix sparsity metadata and the one or more nonzero submatrices and not including the one or more zero submatrices.
According to this aspect, the one or more processing devices may be further configured to multiply the first matrix and a second matrix to compute a result matrix. Multiplying the first matrix and the second matrix may include computing a plurality of submatrix products of the plurality of first submatrices of the first matrix and a plurality of second submatrices of the second matrix, respectively. Computing the plurality of submatrix products may include, for each submatrix product of a zero submatrix of the one or more zero submatrices and a second submatrix of the plurality of second submatrices, setting each submatrix product element of the submatrix product to zero without retrieving, from the memory, the plurality of first matrix elements included in the zero submatrix or the plurality of second matrix elements included in the second submatrix.
According to this aspect, the one or more processing devices may be further configured to assign, to each submatrix product of the plurality of submatrix products, submatrix product sparsity metadata indicating whether the submatrix product is a zero submatrix product for which all the submatrix product elements of the submatrix product are equal to zero.
According to this aspect, multiplying the first matrix and the second matrix may further include computing a submatrix product sum of two or more submatrix products of the plurality of submatrix products that share respective locations in the result matrix. When computing the submatrix product sum, the one or more processing devices may be configured to determine, based on the submatrix product sparsity metadata, for each submatrix product of the two or more submatrix products, whether that submatrix product is a zero submatrix product. The one or more processing devices may be further configured to skip adding each zero submatrix product to the submatrix product sum.
According to this aspect, the one or more processing devices may include a hardware accelerator configured to receive the compressed first matrix at a first input buffer, receive the second matrix at a second input buffer, and output the result matrix to a result buffer.
According to this aspect, the one or more processing devices may be further configured to generate a compressed result matrix including result matrix sparsity metadata indicating one or more zero result submatrices and one or more nonzero result submatrices of the result matrix. The compressed result matrix may further include the one or more nonzero result submatrices. The compressed result matrix may not include the one or more zero result submatrices. The one or more processing devices may be further configured to store the compressed result matrix in the memory.
According to this aspect, the first matrix sparsity metadata may indicate each of the one or more zero submatrices with a zero and each of the one or more nonzero submatrices with a one.
According to this aspect, the first matrix sparsity metadata may be stored as a header of the compressed first matrix.
According to this aspect, the plurality of first submatrices may each be of a same size.
According to this aspect, prior to generating the first matrix sparsity metadata, the one or more processing devices may be further configured to determine that one or more first matrix elements of the plurality of first matrix elements are below a predefined threshold. The one or more processing devices may be further configured to set the one or more first matrix elements that are below the predefined threshold to zero.
According to another aspect of the present disclosure, a method for use with a computing device is provided. The method may include receiving a first matrix including a plurality of first matrix elements arranged in a plurality of first submatrices. The method may further include generating first matrix sparsity metadata indicating one or more zero submatrices and one or more nonzero submatrices of the plurality of first submatrices. Each of the first matrix elements included in the one or more zero submatrices may be equal to zero. The method may further include storing, in memory, a compressed first matrix including the first matrix sparsity metadata and the one or more nonzero submatrices and not including the one or more zero submatrices.
According to this aspect, the method may further include multiplying the first matrix and a second matrix to compute a result matrix. Multiplying the first matrix and the second matrix may include computing a plurality of submatrix products of the plurality of first submatrices of the first matrix and a plurality of second submatrices of the second matrix, respectively. Computing the plurality of submatrix products may include, for each submatrix product of a zero submatrix of the one or more zero submatrices and a second submatrix of the plurality of second submatrices, setting each submatrix product element of the submatrix product to zero without retrieving, from the memory, the plurality of first matrix elements included in the zero submatrix or the plurality of second matrix elements included in the second submatrix.
According to this aspect, the method may further include assigning, to each submatrix product of the plurality of submatrix products, submatrix product sparsity metadata indicating whether the submatrix product is a zero submatrix product for which all the submatrix product elements of the submatrix product are equal to zero.
According to this aspect, multiplying the first matrix and the second matrix may further include computing a submatrix product sum of two or more submatrix products of the plurality of submatrix products that share respective locations in the result matrix. Computing the submatrix product sum may include determining, based on the submatrix product sparsity metadata, for each submatrix product of the two or more submatrix products, whether that submatrix product is a zero submatrix product. Computing the submatrix product sum may further include skipping adding each zero submatrix product to the submatrix product sum.
According to this aspect, the method may further include generating a compressed result matrix including result matrix sparsity metadata indicating one or more zero result submatrices and one or more nonzero result submatrices of the result matrix. The compressed result matrix may further include the one or more nonzero result submatrices. The compressed result matrix may not include the one or more zero result submatrices. The method may further include storing the compressed result matrix in the memory.
According to this aspect, the first matrix sparsity metadata may indicate each of the one or more zero submatrices with a zero and each of the one or more nonzero submatrices with a one.
According to this aspect, the first matrix sparsity metadata may be stored as a header of the compressed first matrix.
According to this aspect, the plurality of first submatrices may each be of a same size.
According to this aspect, the method may further include determining that one or more first matrix elements of the plurality of first matrix elements are below a predefined threshold. The method may further include setting the one or more first matrix elements that are below the predefined threshold to zero.
According to another aspect of the present disclosure, a computing device is provided, including one or more processing devices configured to receive a compressed first matrix including first matrix sparsity metadata and one or more nonzero submatrices. The compressed first matrix may be a compressed form of a first matrix arranged in a plurality of first submatrices and stored in memory. The one or more nonzero submatrices may each include a respective plurality of first matrix elements of the first matrix, with at least one first matrix element included in each of the nonzero submatrices not being equal to zero. The first matrix sparsity metadata may indicate the one or more nonzero submatrices and one or more zero submatrices of the first matrix. Each of the first matrix elements included in the one or more zero submatrices may be equal to zero. The one or more processing devices may be further configured to multiply the compressed first matrix and a second matrix to compute a result matrix. Multiplying the compressed first matrix and the second matrix may include computing a plurality of submatrix products of the plurality of first submatrices of the first matrix and a plurality of second submatrices of the second matrix, respectively. Computing the plurality of submatrix products may include, for each submatrix product of a zero submatrix of the one or more zero submatrices and a second submatrix of the plurality of second submatrices, setting each submatrix product element of the submatrix product to zero without retrieving, from the memory, the plurality of first matrix elements included in the zero submatrix or the plurality of second matrix elements included in the second submatrix. The one or more processing devices may be further configured to output the result matrix.
It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.