This application claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2024-0011143 filed on Jan. 24, 2024, in the Korean Intellectual Property Office, the disclosure of which is incorporated by reference herein in its entirety.
Embodiments of the present disclosure described herein relate to an electronic device, and more particularly, relate to a machine learning accelerator with an improved data loading speed, a computing device including the machine learning accelerator, and a method of loading data to the machine learning accelerator.
A machine learning accelerator is configured to load weight data and to perform a machine learning-based operation. With the development of technologies for machine learning, the capacity of data loaded to the machine learning accelerator is increasing. Accordingly, a time taken to load weight data to the machine learning accelerator is also increasing.
The weight data may be compressed to decrease the time taken to load the weight data to the machine learning accelerator. The compressed weight data may be transferred to the machine learning accelerator in a shorter time.
Embodiments of the present disclosure provide a machine learning accelerator capable of loading compressed weight data in a shorter time, a computing device including the machine learning accelerator, and a method of loading data to the machine learning accelerator.
According to at least one embodiment, a machine learning accelerator includes a first data controller configured to store original length information indicating an original length for data, to receive first data with a first length, to generate second data having the original length by decompressing the first data with the first length, and to output the second data having the original length; a second data controller configured to store the original length information, to receive third data with a second length shorter than the first length, to generate fourth data having the original length by decompressing the third data with the second length, and to output the fourth data having the original length; a first accelerator core configured to receive the second data having the original length and to perform a first machine learning-based operation using the second data as first weight data; and a second accelerator core configured to receive the fourth data having the original length and to perform a second machine learning-based operation using the fourth data as second weight data, wherein each of the first data controller and the second data controller is configured to monitor a timing at which decompression is completed, based on the original length, and terminate the decompression at the timing at which the decompression is completed.
According to at least one embodiment, a computing device includes a memory configured to store first data with a first length and second data with a second length shorter than the first length; and a machine learning accelerator configured to receive the first data and third data from the memory, the third data having the first length and including the second data, wherein the machine learning accelerator is configured to generate first weight data by decompressing the first data with the first length, convert the third data with the first length into the second data with the second length, generate second weight data by decompressing the second data with the second length, and perform a machine learning-based operation based on the first weight data and the second weight data.
According to at least one embodiment, a method in which a processor loads data to a machine learning accelerator includes simultaneously programming, at the processor, two or more direct memory access (DMA) masters using a first start address and first length information; and reading, at the two or more DMA masters, data from a memory in parallel based on the first start address and the first length information and transferring the data read in parallel to the machine learning accelerator, wherein the machine learning accelerator is configured to generate first weight data by decompressing first data corresponding to the first length information from among the data read in parallel, generate third data with a second length by converting second data corresponding to the first length information, the second length being shorter than a first length indicated by the first length information, generate second weight data by decompressing the third data with the second length, and perform a machine learning-based operation based on the first weight data and the second weight data.
The above and other objects and features of the present disclosure will become apparent by describing in detail embodiments thereof with reference to the accompanying drawings.
Below, embodiments of the present disclosure will be described in detail and clearly to such an extent that one of ordinary skill in the art may easily carry out the present disclosure.
The system bus 110 may be configured to enable communication between the components of the computing device 10, e.g., by providing channels between the components of the computing device 10. For example, the system bus 110 may provide the channels based on one or more of various communication protocols such as peripheral component interconnect express (PCIe), non-volatile memory express (NVMe), a dual in-line memory module (DIMM), an advanced extensible interface (AXI), and/or the like.
The processor 120 may be configured to execute an operating system and various applications. The processor 120 may be configured to control the components of the computing device 10 depending on requests of the operating system and the applications. For example, the processor 120 may include a central processing unit (CPU) and/or an application processor (AP) which includes one or more processing cores.
The accelerator 130 may be a machine learning accelerator which runs a machine learning module trained based on machine learning or a trained machine learning module. For example, the machine learning may be based on various algorithms including a convolutional neural network (CNN), a deep NN (DNN), and a generative adversarial network (GAN). In at least some embodiments, for example, the trained machine learning module may be configured to perform voice recognition, text-to-speech, image recognition, image classification, and/or image processing by using a neural network, and/or may be mounted in one of various kinds of electronic devices such as a tablet device, a smart TV, an augmented reality (AR) device, an Internet of things (IoT) device, a self-driving vehicle, robots, a medical device, a drone, an advanced driver assistance system (ADAS), an image display device, a data processing server, a measuring device, etc.
The accelerator 130 may include a plurality of accelerator cores “C”. In at least one embodiment, the accelerator 130 may include four accelerator cores “C”. The accelerator cores “C” may be configured to operate in parallel to perform a machine learning operation. The machine learning operation may include at least some of various operations of the machine learning module, such as learning, inference, and/or classification.
The accelerator 130 may include data controllers CT respectively corresponding to the accelerator cores “C”. For example, the accelerator 130 may include four data controllers CT. The data controllers CT may be configured to operate in parallel. The data controllers CT may decompress the compressed data transferred to the accelerator 130 and may transfer the decompressed data to the corresponding accelerator core “C”. The data controllers CT may also provide uncompressed data transferred to the accelerator 130 to the accelerator cores “C”.
The modem 140 may be configured to communicate with an external device. For example, the modem 140 may communicate with the external device based on various wired and/or wireless communication protocols such as IEEE 802.11, Ethernet, 5G, and Bluetooth. The modem 140 may transmit data stored in the memory 160 to the external device or may store data received from the external device in the memory 160.
The memory controller 150 may be configured to access the memory 160. For example, the memory controller 150 may access the memory 160 depending on a request of the processor 120 and/or the DMAC 170. For example, the memory controller 150 may write data in the memory 160 or may read data from the memory 160. The memory controller 150 may perform various operations for managing the data stored in the memory 160, for example, various background operations including a refresh operation and a row hammering prevention operation.
The memory 160 may be a random access memory. For example, the memory 160 may be implemented with one or more of various random access memories such as a dynamic random access memory (DRAM), a static RAM (SRAM), a phase-change RAM (PRAM), a magnetic RAM (MRAM), a ferroelectric RAM (FRAM), a resistive RAM (RRAM), and/or the like.
The DMAC 170 may be configured to request the memory controller 150 to write data output from at least one of the components of the computing device 10 in the memory 160, in compliance with the program of the processor 120. For example, the DMAC 170 may request the memory controller 150 to output the data stored in the memory 160 to at least one of the components of the computing device 10, in compliance with the program of the processor 120. The DMAC 170 may include a plurality of DMA masters “M”. In at least one embodiment, the DMAC 170 may include four DMA masters “M”. The DMA masters “M” may request access to the memory 160 (e.g., a read operation or a write operation) from the memory controller 150 independently of each other.
In at least one embodiment, the DMA masters “M” may request the read or write operation of the memory 160 from the memory controller 150 in parallel, independently, or simultaneously. Depending on requests of the DMA masters “M”, the memory controller 150 may perform the write operations or the read operations on the memory 160 in parallel, independently, or simultaneously. For example, a plurality of channels may be provided in parallel between the DMA masters “M” and the memory controller 150. Also, a plurality of channels may be provided in parallel between the memory controller 150 and the memory 160.
The storage device 180 may be an auxiliary memory device of the computing device 10. For example, the storage device 180 may be a hard disk drive (HDD), a solid state drive (SSD), an embedded storage device, and/or the like.
First compressed layer weight data L1_WD may be compressed weight data to be used in the first layer of the machine learning module. The first compressed layer weight data L1_WD may include 1a-th compressed weight data WD1a, 1b-th compressed weight data WD1b, 1c-th compressed weight data WD1c, and 1d-th compressed weight data WD1d respectively corresponding to the four accelerator cores “C”.
Second compressed layer weight data L2_WD may be compressed weight data to be used in the second layer of the machine learning module. The second compressed layer weight data L2_WD may include 2a-th compressed weight data WD2a, 2b-th compressed weight data WD2b, 2c-th compressed weight data WD2c, and 2d-th compressed weight data WD2d respectively corresponding to the four accelerator cores “C”.
Third compressed layer weight data L3_WD may be compressed weight data to be used in the third layer of the machine learning module. The third compressed layer weight data L3_WD may include 3a-th compressed weight data WD3a, 3b-th compressed weight data WD3b, 3c-th compressed weight data WD3c, and 3d-th compressed weight data WD3d respectively corresponding to the four accelerator cores “C”.
In at least one embodiment, a compression ratio may vary depending on a pattern of weight data. As a difference between a ratio of 0's of weight data and a ratio of 1's of weight data decreases, a compression ratio of weight data may decrease. As a difference between a ratio of 0's of weight data and a ratio of 1's of weight data increases, a compression ratio of weight data may increase. As values of “0” or “1” are clustered in weight data, a compression ratio of weight data may increase. As values of “0” or “1” are distributed across weight data, a compression ratio of weight data may decrease.
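For illustration only, the following Python sketch (not part of the disclosed embodiments; the actual compression scheme of the weight data is not specified here) uses a simple run-length encoding to show why clustered values of “0” or “1” compress better than distributed values:

```python
def rle_compress(bits: str):
    """Compress a bit string into (value, run length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs

clustered = "0" * 12 + "1" * 4   # values of "0" and "1" are clustered
distributed = "0101" * 4         # values of "0" and "1" are distributed

print(len(rle_compress(clustered)))    # 2 runs: compresses well
print(len(rle_compress(distributed)))  # 16 runs: compresses poorly
```

Both bit strings have the same length and the same ratio of 0's to 1's within each example, yet the clustered pattern yields far fewer runs and therefore a higher compression ratio, consistent with the behavior described above.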
When weight data have different compression ratios, the compressed weight data may have different sizes. Examples in which compressed weight data have different sizes are illustrated in
The first compressed layer weight data L1_WD, the second compressed layer weight data L2_WD, and the third compressed layer weight data L3_WD may be stored in the storage device 180. For example, the first compressed layer weight data L1_WD, the second compressed layer weight data L2_WD, and the third compressed layer weight data L3_WD may be acquired and stored in the memory 160 (e.g., through the modem 140 from an external device) and/or may be stored in the storage device 180. As another example, the first compressed layer weight data L1_WD, the second compressed layer weight data L2_WD, and the third compressed layer weight data L3_WD may be stored in a removable storage device, and the removable storage device may be connected to the computing device 10 as a part of the storage device 180.
The first compressed layer weight data L1_WD may further include 1a-th length information LI1a, 1b-th length information LI1b, 1c-th length information LI1c, and 1d-th length information LI1d respectively indicating lengths of the 1a-th compressed weight data WD1a, the 1b-th compressed weight data WD1b, the 1c-th compressed weight data WD1c, and the 1d-th compressed weight data WD1d. The 1a-th length information LI1a, the 1b-th length information LI1b, the 1c-th length information LI1c, and the 1d-th length information LI1d may include information about start addresses on the storage device 180, from which the 1a-th compressed weight data WD1a, the 1b-th compressed weight data WD1b, the 1c-th compressed weight data WD1c, and the 1d-th compressed weight data WD1d are stored, and information about lengths (or sizes) of the 1a-th compressed weight data WD1a, the 1b-th compressed weight data WD1b, the 1c-th compressed weight data WD1c, and the 1d-th compressed weight data WD1d.
The second compressed layer weight data L2_WD may further include 2a-th length information LI2a, 2b-th length information LI2b, 2c-th length information LI2c, and 2d-th length information LI2d respectively indicating lengths of the 2a-th compressed weight data WD2a, the 2b-th compressed weight data WD2b, the 2c-th compressed weight data WD2c, and the 2d-th compressed weight data WD2d. The 2a-th length information LI2a, the 2b-th length information LI2b, the 2c-th length information LI2c, and the 2d-th length information LI2d may include information about start addresses on the storage device 180, from which the 2a-th compressed weight data WD2a, the 2b-th compressed weight data WD2b, the 2c-th compressed weight data WD2c, and the 2d-th compressed weight data WD2d are stored, and information about lengths (or sizes) of the 2a-th compressed weight data WD2a, the 2b-th compressed weight data WD2b, the 2c-th compressed weight data WD2c, and the 2d-th compressed weight data WD2d.
The third compressed layer weight data L3_WD may further include 3a-th length information LI3a, 3b-th length information LI3b, 3c-th length information LI3c, and 3d-th length information LI3d respectively indicating lengths of the 3a-th compressed weight data WD3a, the 3b-th compressed weight data WD3b, the 3c-th compressed weight data WD3c, and the 3d-th compressed weight data WD3d. The 3a-th length information LI3a, the 3b-th length information LI3b, the 3c-th length information LI3c, and the 3d-th length information LI3d may include information about start addresses on the storage device 180, from which the 3a-th compressed weight data WD3a, the 3b-th compressed weight data WD3b, the 3c-th compressed weight data WD3c, and the 3d-th compressed weight data WD3d are stored, and information about lengths (or sizes) of the 3a-th compressed weight data WD3a, the 3b-th compressed weight data WD3b, the 3c-th compressed weight data WD3c, and the 3d-th compressed weight data WD3d.
The first type length information may include the 1a-th length information LI1a, the 1b-th length information LI1b, the 1c-th length information LI1c, and the 1d-th length information LI1d of the first compressed layer weight data L1_WD, the 2a-th length information LI2a, the 2b-th length information LI2b, the 2c-th length information LI2c, and the 2d-th length information LI2d of the second compressed layer weight data L2_WD, and the 3a-th length information LI3a, the 3b-th length information LI3b, the 3c-th length information LI3c, and the 3d-th length information LI3d of the third compressed layer weight data L3_WD.
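As a minimal, hypothetical sketch (the disclosure does not prescribe any particular data structure; all names and values below are illustrative only), the first type length information may be modeled as one start-address/length record per compressed weight data chunk:

```python
from dataclasses import dataclass

@dataclass
class LengthInfo:
    """One first-type record: where a compressed chunk starts and how long it is."""
    start_address: int  # start address on the storage device 180
    length: int         # length (or size) of the compressed chunk, in bytes

# Made-up first type length information for the first layer, LI1a..LI1d,
# one record per accelerator core:
first_type_li = {
    "LI1a": LengthInfo(start_address=0x0000, length=96),
    "LI1b": LengthInfo(start_address=0x0060, length=64),
    "LI1c": LengthInfo(start_address=0x00A0, length=128),  # largest chunk
    "LI1d": LengthInfo(start_address=0x0120, length=80),
}
```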
The processor 120 may read the compressed weight data WD from the storage device 180 so as to be loaded to the memory 160 and/or may program the DMAC 170 to read the compressed weight data WD from the storage device 180 and to load the compressed weight data WD to the memory 160.
In operation S120, the computing device 10 loads the compressed weight data WD from the memory 160 to the accelerator 130 based on second type length information. For example, the second type length information may be a portion of the first type length information. As the compressed weight data WD are loaded to the accelerator 130 (using the second type length information), the computing device 10 may improve a speed at which the compressed weight data WD are loaded to the accelerator 130 and may reduce a loading time.
In at least one embodiment, the processor 120 may program the DMAC 170 to read the compressed weight data WD stored in the memory 160 and to load the compressed weight data WD to the accelerator 130. The accelerator 130 may receive the compressed weight data WD and may generate weight data by decompressing the compressed weight data WD (or releasing the compression of the compressed weight data WD). The accelerator 130 may load the weight data to the accelerator cores “C”.
In operation S130, the computing device 10 runs the accelerator 130. For example, the processor 120 may control the accelerator 130 such that the accelerator 130 performs machine learning operations (e.g., learning, inference, or classification). The accelerator cores “C” of the accelerator 130 may perform the machine learning operations based on data transferred from the processor 120 and the weight data loaded to the accelerator cores “C”.
Referring to
In operation S220, the computing device 10 may read the compressed weight data WD. Each of the 1a-th length information LI1a, the 1b-th length information LI1b, the 1c-th length information LI1c, the 1d-th length information LI1d, the 2a-th length information LI2a, the 2b-th length information LI2b, the 2c-th length information LI2c, the 2d-th length information LI2d, the 3a-th length information LI3a, the 3b-th length information LI3b, the 3c-th length information LI3c, and the 3d-th length information LI3d may include information about a start address (referenced with an up arrow) of the storage device 180, from which relevant compressed weight data are stored, and a length (referenced with a side-to-side arrow) of the relevant compressed weight data.
The processor 120 or the DMAC 170 may read the compressed weight data WD from the storage device 180 by using the length information LI.
In operation S230, the computing device 10 may load the compressed weight data WD to the memory 160 based on the same address. For example, the processor 120 and/or the DMAC 170 may load the 1a-th compressed weight data WD1a, the 1b-th compressed weight data WD1b, the 1c-th compressed weight data WD1c, and the 1d-th compressed weight data WD1d of the first compressed layer weight data L1_WD onto storage spaces of the memory 160, which start from the same first start address and have sequential addresses, in parallel.
As illustrated in
The processor 120 and/or the DMAC 170 may load the 2a-th compressed weight data WD2a, the 2b-th compressed weight data WD2b, the 2c-th compressed weight data WD2c, and the 2d-th compressed weight data WD2d of the second compressed layer weight data L2_WD onto storage spaces of the memory 160, which start from the same second start address and have sequential addresses, in parallel.
For example, the second start address may be an address immediately following the end address of the 1c-th compressed weight data WD1c having the largest length (or size) from among the 1a-th compressed weight data WD1a, the 1b-th compressed weight data WD1b, the 1c-th compressed weight data WD1c, and the 1d-th compressed weight data WD1d of the first compressed layer weight data L1_WD. In other words, the location of the second start address in the storage spaces of the memory 160 may be based on the largest length (or size) from among the 1a-th compressed weight data WD1a, the 1b-th compressed weight data WD1b, the 1c-th compressed weight data WD1c, and the 1d-th compressed weight data WD1d of the first compressed layer weight data L1_WD.
As such, the end address of the 1a-th compressed weight data WD1a may be smaller than the end address of the 1c-th compressed weight data WD1c. The processor 120 or the DMAC 170 may write dummy data, a random pattern, or a given pattern in the storage space of the memory 160, which ranges from an address immediately following the end address of the 1a-th compressed weight data WD1a to the end address of the 1c-th compressed weight data WD1c, or may not write data therein. The storage space where data are not written may have a fixed value, for example, a value of “0” or “1”.
As illustrated in
The processor 120 and/or the DMAC 170 may load the 3a-th compressed weight data WD3a, the 3b-th compressed weight data WD3b, the 3c-th compressed weight data WD3c, and the 3d-th compressed weight data WD3d of the third compressed layer weight data L3_WD onto storage spaces of the memory 160, which start from the same third start address and have sequential addresses, in parallel.
For example, the third start address may be an address immediately following the end address of the 2c-th compressed weight data WD2c having the largest length (or size) from among the 2a-th compressed weight data WD2a, the 2b-th compressed weight data WD2b, the 2c-th compressed weight data WD2c, and the 2d-th compressed weight data WD2d of the second compressed layer weight data L2_WD. In other words, the location of the third start address in the storage spaces of the memory 160 may be based on the largest length (or size) from among the 2a-th compressed weight data WD2a, the 2b-th compressed weight data WD2b, the 2c-th compressed weight data WD2c, and the 2d-th compressed weight data WD2d of the second compressed layer weight data L2_WD.
As such, the end address of the 2a-th compressed weight data WD2a may be smaller than the end address of the 2c-th compressed weight data WD2c. The processor 120 or the DMAC 170 may write dummy data, a random pattern, or a given pattern in the storage space of the memory 160, which ranges from an address immediately following the end address of the 2a-th compressed weight data WD2a to the end address of the 2c-th compressed weight data WD2c, or may not write data therein. The storage space where data are not written may have a fixed value, for example, a value of “0” or “1”.
As illustrated in
The end address of the 3a-th compressed weight data WD3a may be smaller than the end address of the 3c-th compressed weight data WD3c. The processor 120 and/or the DMAC 170 may write dummy data, a random pattern, or a given pattern in the storage space of the memory 160, which ranges from an address immediately following the end address of the 3a-th compressed weight data WD3a to the end address of the 3c-th compressed weight data WD3c, or may not write data therein. The storage space where data are not written may have a fixed value, for example, a value of “0” or “1”.
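For illustration only, the following Python sketch (not part of the disclosed embodiments; the function names, chunk sizes, and per-lane memory model are assumptions) packs the compressed chunks of each layer so that every layer starts at a common layer start address, the next layer begins immediately after the end of the layer's largest chunk, and shorter chunks are padded with a dummy value up to that boundary:

```python
DUMMY = 0x00  # a fixed dummy value; a random or given pattern also works

def pack_layers(layers, num_lanes: int):
    """Return (per-lane memory images, list of common layer start addresses).

    Each accelerator core is modeled as owning one lane of the memory 160;
    within every lane, the chunk of a given layer starts at the same layer
    start address, and the next layer start address follows the end address
    of the layer's largest chunk.
    """
    lanes = [bytearray() for _ in range(num_lanes)]
    layer_starts = []
    for chunks in layers:                     # chunks[i] goes to lane i
        layer_starts.append(len(lanes[0]))    # same start address per lane
        stride = max(len(c) for c in chunks)  # largest chunk sets the end
        for lane, chunk in zip(lanes, chunks):
            lane += chunk + bytes([DUMMY]) * (stride - len(chunk))
    return lanes, layer_starts

# Example with made-up chunk sizes for three layers and four cores:
l1 = [bytes(96), bytes(64), bytes(128), bytes(80)]   # WD1a..WD1d
l2 = [bytes(72), bytes(88), bytes(112), bytes(56)]   # WD2a..WD2d
l3 = [bytes(40), bytes(104), bytes(120), bytes(64)]  # WD3a..WD3d
lanes, starts = pack_layers([l1, l2, l3], num_lanes=4)
print(starts)  # [0, 128, 240]: each layer starts after the largest chunk
```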
In operation S240, the computing device 10 may generate layer length information LLI. The layer length information LLI may include first layer length information LLI1 including information about a start address (referenced with an up arrow) of the memory 160, from which the first compressed layer weight data L1_WD are stored, or information about a length (referenced with a side-to-side arrow) of the first compressed layer weight data L1_WD.
The layer length information LLI may include second layer length information LLI2 including information about a start address (referenced with an up arrow) of the memory 160, from which the second compressed layer weight data L2_WD are stored, or information about a length (referenced with a side-to-side arrow) of the second compressed layer weight data L2_WD.
The layer length information LLI may further include third layer length information LLI3 including information about a start address (referenced with an up arrow) of the memory 160, from which the third compressed layer weight data L3_WD are stored, or information about a length (referenced with a side-to-side arrow) of the third compressed layer weight data L3_WD.
In operation S250, the computing device 10 may load the original length information OLI and the layer length information LLI to the memory 160. For example, the processor 120 and/or the DMAC 170 may store the original length information OLI and the layer length information LLI in a second length information area LIA2 of the memory 160.
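Continuing the hypothetical sketch above, operations S240 and S250 may be modeled as recording, for each layer, the common start address in the memory 160 and the layer length given by the layer's largest compressed chunk (names and the in-memory representation are illustrative; the disclosure does not prescribe one):

```python
def make_layer_length_info(layers, base: int):
    """Record, per layer, the common start address and the largest chunk length."""
    lli = []
    start = base
    for chunks in layers:
        length = max(len(c) for c in chunks)  # layer length = largest chunk
        lli.append({"start_address": start, "length": length})
        start += length                       # next layer follows the largest chunk
    return lli

# Same made-up chunk sizes as in the previous sketch:
layers = [[bytes(96), bytes(64), bytes(128), bytes(80)],
          [bytes(72), bytes(88), bytes(112), bytes(56)],
          [bytes(40), bytes(104), bytes(120), bytes(64)]]
print(make_layer_length_info(layers, base=0))
# [{'start_address': 0, 'length': 128},
#  {'start_address': 128, 'length': 112},
#  {'start_address': 240, 'length': 120}]
```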
In operation S320, the computing device 10 may simultaneously program the DMA masters “M”. For example, the processor 120 may program the DMA masters “M” by using the second type length information.
The processor 120 may program the direct memory access (DMA) masters “M” to read data from the memory 160 by using the first layer length information LLI1 (see
In operation S330, the processor 120 may determine whether the data transferred to the accelerator 130 are the last compressed layer weight data L_WD. For example, the processor 120 may determine whether the data transferred to the accelerator 130 are data including the third compressed layer weight data L3_WD. When the data transferred to the accelerator 130 correspond to the last compressed layer weight data L_WD, the processor 120 may terminate the loading of the compressed weight data WD to the accelerator 130.
When the data transferred to the accelerator 130 do not correspond to the last compressed layer weight data L_WD, the processor 120 may load the compressed layer weight data of a next layer to the accelerator 130. For example, the processor 120 may load the second compressed layer weight data L2_WD to the accelerator 130 by programming the DMA masters “M” by using the second layer length information LLI2. Also, the processor 120 may load the third compressed layer weight data L3_WD to the accelerator 130 by programming the DMA masters “M” by using the third layer length information LLI3.
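As a minimal, hypothetical model of operations S320 and S330 (the DmaMaster and AcceleratorStub interfaces are assumptions, not part of the disclosure), all DMA masters may be programmed together with the same start address and the same layer length, read their lanes of the memory 160 in parallel, and repeat once per layer until the last compressed layer weight data are transferred:

```python
class DmaMaster:
    """Hypothetical model of one DMA master 'M' with its own memory lane."""
    def __init__(self, lane: bytes):
        self.lane = lane

    def program_and_read(self, start: int, length: int) -> bytes:
        # In hardware: a register write followed by a DMA transfer; modeled
        # here as a slice of this master's lane of the memory 160.
        return self.lane[start:start + length]

class AcceleratorStub:
    """Hypothetical sink standing in for the accelerator 130."""
    def receive(self, core: int, data: bytes):
        print(f"core {core}: {len(data)} bytes of compressed weight data")

def load_all_layers(masters, layer_length_info, accelerator):
    for info in layer_length_info:          # one iteration per layer (S330)
        for core, m in enumerate(masters):  # programmed together (S320)
            data = m.program_and_read(info["start_address"], info["length"])
            accelerator.receive(core, data)

# Usage with made-up lanes and the layer length information from above:
lanes = [bytes(360)] * 4
lli = [{"start_address": 0, "length": 128},
       {"start_address": 128, "length": 112},
       {"start_address": 240, "length": 120}]
load_all_layers([DmaMaster(l) for l in lanes], lli, AcceleratorStub())
```

Because every master receives the same start address and length, a single programming step per layer suffices, which is the source of the speedup described below.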
As described above, the computing device 10 according to at least one embodiment of the present disclosure may load the compressed weight data WD to the accelerator 130 by using the plurality of DMA masters “M”. Accordingly, a speed at which the compressed weight data WD are loaded to the accelerator 130 may be improved, and a loading time may be shortened.
Also, the computing device 10 according to at least one embodiment of the present disclosure may load weight data of all the layers to the accelerator 130 by using the length information of weight data whose length is the largest from among the weight data of the layers, which have different lengths (sizes). Accordingly, because the number of times that the DMA masters “M” are programmed decreases, a speed at which the compressed weight data WD are loaded to the accelerator 130 may be improved, and a loading time may be shortened.
Each of the data controllers CT may include a drain circuit DC, a decompression circuit DCC, and a validity monitor VM (or a validity monitor circuit). One DMA master “M” may be connected to one accelerator core “C” through one drain circuit DC, one decompression circuit DCC, and one validity monitor VM of one data controller CT. In an embodiment, one DMA master “M”, one drain circuit DC, one decompression circuit DCC, and one validity monitor VM may constitute a data chain which converts data read from the memory 160 (i.e., data including compressed weight data) into weight data to be transferred to one accelerator core “C”.
Each of the drain circuits DC may receive data from the corresponding DMA master “M” among the DMA masters “M”. Each of the drain circuits DC may receive a drain signal DS from the corresponding validity monitor VM among the validity monitors VM. When the drain signal DS is in an inactive state, each of the drain circuits DC may be configured to transfer data provided from the corresponding DMA master “M” to the corresponding decompression circuit DCC among the decompression circuits DCC. When the drain signal DS is in an active state, each of the drain circuits DC may be configured to drain the data provided from the corresponding DMA master “M” without transferring the data to the corresponding decompression circuit DCC. For example, when the drain signal DS is in the active state, each of the drain circuits DC may discard or ignore the data provided from the corresponding DMA master “M”.
Each of the decompression circuits DCC may be configured to perform decompression on the data transferred from the corresponding drain circuit DC. Each of the decompression circuits DCC may transfer the decompressed data, that is, the weight data, to the corresponding validity monitor VM.
Each of the validity monitors VM may receive the weight data (e.g., the decompressed weight data) from the corresponding decompression circuit DCC. The validity monitors VM may receive the original length information OLI in common. Each of the validity monitors VM may monitor whether the length (or size) of the data (e.g., decompressed weight data) received from the corresponding decompression circuit DCC reaches a length (or size) indicated by the original length information OLI.
When the length (or size) of the data received from the corresponding decompression circuit DCC does not reach the length (or size) indicated by the original length information OLI, each of the validity monitors VM may deactivate the drain signal DS and may transfer the data provided from the decompression circuit DCC to the corresponding accelerator core among the accelerator cores “C”. When the length (or size) of the data received from the corresponding decompression circuit DCC reaches the length (or size) indicated by the original length information OLI, each of the validity monitors VM may activate the drain signal DS.
In operation S420, the drain circuit DC may determine whether the drain signal DS is in the active state. When the drain signal DS is not in the active state, in operation S430, the drain circuit DC may transfer the data. For example, the drain circuit DC may transfer the input data received from the corresponding DMA master “M” to the corresponding decompression circuit DCC. Afterwards, operation S450 may be performed.
When the drain signal DS is in the active state, in operation S440, the drain circuit DC may drain the input data. For example, the drain circuit DC may ignore or discard the input data. Afterwards, operation S450 may be performed.
In operation S450, the drain circuit DC may determine whether the data has ended. For example, when the input data are not received from the corresponding DMA master “M” any longer, the drain circuit DC may determine that the data has ended. When it is determined that the data has ended, the drain circuit DC may terminate the process. When it is determined that the data has not ended, the drain circuit DC may again perform operation S410.
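The drain-circuit flow of operations S410 through S450, together with the validity monitor, may be modeled behaviorally as in the following hypothetical Python sketch (decompress() stands in for the unspecified compression scheme; names are illustrative):

```python
def data_chain(words, decompress, original_length: int):
    """Hypothetical model of one data chain: DC -> DCC -> VM -> core.

    Yields decompressed weight words until the original length OLI is
    reached; afterwards the drain signal DS is active and remaining input
    from the DMA master is drained (discarded) without decompression.
    """
    produced = 0
    drain_signal = False             # DS, deactivated by the VM initially
    for word in words:
        if drain_signal:
            continue                 # S440: drain (ignore/discard) the input
        for out in decompress(word):     # S430: transfer to DCC, decompress
            if produced < original_length:
                produced += 1
                yield out            # VM forwards valid data to the core
            if produced >= original_length:
                drain_signal = True  # VM activates DS: decompression done
                break

# Example with an identity "decompression" standing in for the real scheme:
out = list(data_chain([b"a", b"b", b"c"], lambda w: [w], original_length=2))
print(out)  # [b'a', b'b']; the third input word is drained
```

The drain mechanism lets every DMA master transfer the same (padded) layer length while each core still receives exactly the original length of weight data.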
As described above, the computing device 10 according to at least one embodiment of the present disclosure may store compressed weight data with different sizes in the storage space of the memory 160, which has the same start address and the same layer length information. The computing device 10 may simultaneously program the DMA masters “M” to transfer weight data to the accelerator 130 in parallel by using the same start address and the same layer length information. The drain circuits DC and the validity monitors VM of the data controllers CT of the accelerator 130 may extract decompressed weight data so as to be transferred to the accelerator cores “C”. Because the number of times that the DMA masters “M” are programmed decreases, a speed at which the computing device 10 loads compressed weight data to the accelerator 130 may be improved, and a loading time may decrease.
The demultiplexers DX may be provided between the DMA masters “M” and the drain circuits DC. The multiplexers MX may be provided between the validity monitors VM and the accelerator cores “C”. The demultiplexers DX and the multiplexers MX may operate in response to a mode signal MS.
In at least one embodiment, the computing device 10 may provide a compression mode and a non-compression mode. In the compression mode, compressed weight data may be provided to the storage device 180. The computing device 10 may load compressed weight data to the accelerator 130 depending on the method described with reference to
In the non-compression mode, uncompressed weight data may be provided to the storage device 180. The uncompressed weight data may have the same sizes and may be loaded to the memory 160. The computing device 10 may program the DMA masters “M” by using the same start address and the same layer length information. The DMA masters “M” may transfer the uncompressed weight data stored in the memory 160 to the accelerator 130.
In the non-compression mode, the mode signal MS may have a second value different from the first value. In response to the mode signal MS having the second value, the demultiplexers DX may transfer the data provided from the DMA masters “M” to the multiplexers MX. In response to the mode signal MS having the second value, the multiplexers MX may transfer the data provided from the demultiplexers DX to the accelerator cores “C”.
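A hypothetical sketch of the mode signal MS follows (the mode values and the routing function are assumptions; the disclosure only states that the two values of MS differ): in the compression mode the data pass through the decompression path, and in the non-compression mode the demultiplexer routes them directly to the multiplexer, bypassing the drain circuit, the decompression circuit, and the validity monitor:

```python
# Hypothetical mode values for the mode signal MS:
COMPRESSION, NON_COMPRESSION = 1, 2

def route(words, mode_signal, decompress_path):
    """Model of DX/MX routing: bypass decompression in non-compression mode."""
    if mode_signal == COMPRESSION:
        yield from decompress_path(words)  # DX -> DC -> DCC -> VM -> MX
    else:
        yield from words                   # DX -> MX directly

# In the non-compression mode, uncompressed weight data pass through as-is:
print(list(route([b"x", b"y"], NON_COMPRESSION, decompress_path=None)))
# [b'x', b'y']
```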
Each of the decompression circuits DCC may decompress data transferred from the corresponding DMA master among the DMA masters “M”, that is, data including compressed weight data. Each of the decompression circuits DCC may transfer the decompressed data to the corresponding validity monitor VM among the validity monitors VM.
Each of the validity monitors VM may monitor whether the length (or size) of the data transferred from the corresponding decompression circuit DCC reaches the length (or size) indicated by the original length information OLI. When the length (or size) of the data transferred from the decompression circuit DCC does not reach the length (or size) indicated by the original length information OLI, each of the validity monitors VM may deactivate a reset signal RS. Each of the validity monitors VM may transfer the data provided from the decompression circuit DCC to the corresponding accelerator core “C” among the accelerator cores “C”.
When the length (or size) of the data transferred from the decompression circuit DCC reaches the length (or size) indicated by the original length information OLI, each of the validity monitors VM may activate the reset signal RS. The reset signal RS may be transferred to the corresponding DMA master “M” among the DMA masters “M”. The corresponding DMA master “M” may be reset in response to a determination that the reset signal RS is activated.
In at least one embodiment, the reset signals RS may also be provided to the decompression circuits DCC. The corresponding decompression circuit DCC may be reset in response to a determination that the reset signal RS is activated.
In at least one embodiment, as described with reference to
In operation S520, the validity monitor VM may determine whether the data has ended (or whether a data end has been reached). For example, when the length (or size) of the received weight data reaches the length (or size) indicated by the original length information OLI, the validity monitor VM may determine that the data has ended. When it is determined that the data has ended, in operation S530, the validity monitor VM may reset the corresponding DMA master “M” among the DMA masters “M”. Afterwards, the validity monitor VM may terminate the process.
In operation S540, the validity monitor VM may transfer the input data to the corresponding accelerator core “C” among the accelerator cores “C” and may continue monitoring. Afterwards, the validity monitor VM may again start from operation S510.
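The reset-signal variant of operations S520 through S540 may be modeled as in the following hypothetical sketch (the reset() interface and all names are assumptions): instead of draining surplus input, the validity monitor asserts the reset signal RS once the decompressed length reaches the original length, resetting the corresponding DMA master and, optionally, the decompression circuit:

```python
class ResettableStub:
    """Hypothetical stand-in for a DMA master 'M' (or a DCC) with a reset input."""
    def __init__(self, name: str):
        self.name = name
    def reset(self):
        print(f"{self.name}: reset by RS")

def monitor_with_reset(decompressed_words, original_length, dma_master,
                       decompression_circuit=None):
    """Model of the CTb validity monitor: forward data, then assert RS."""
    produced = 0
    for word in decompressed_words:
        produced += 1
        yield word                         # S540: forward to the core
        if produced >= original_length:    # S520: data end reached
            dma_master.reset()             # S530: RS resets the DMA master
            if decompression_circuit is not None:
                decompression_circuit.reset()  # optional DCC reset
            return

out = list(monitor_with_reset([b"w1", b"w2", b"w3"], 2, ResettableStub("M0")))
print(out)  # [b'w1', b'w2']; M0 is reset before the third word is processed
```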
As described above, the computing device 10 according to at least one embodiment of the present disclosure may store compressed weight data with different sizes in the storage space of the memory 160, which has the same start address and the same layer length information. The computing device 10 may simultaneously program the DMA masters “M” to transfer weight data to the accelerator 130 in parallel by using the same start address and the same layer length information. The validity monitors VM of the data controllers CTb of the accelerator 130 may extract decompressed weight data so as to be transferred to the accelerator cores “C”. Because the number of times that the DMA masters “M” are programmed decreases, a speed at which the computing device 10 loads compressed weight data to the accelerator 130 may be improved, and a loading time may decrease.
Each of fourth compressed weight data WD4, fifth compressed weight data WD5, sixth compressed weight data WD6, and seventh compressed weight data WD7 may correspond to one layer of the machine learning module. Fourth length information LI4 indicating the length of the fourth compressed weight data WD4, fifth length information LI5 indicating the length of the fifth compressed weight data WD5, sixth length information LI6 indicating the length of the sixth compressed weight data WD6, and seventh length information LI7 indicating the length of the seventh compressed weight data WD7 may be stored in the storage device 180 together with the fourth compressed weight data WD4, the fifth compressed weight data WD5, the sixth compressed weight data WD6, and the seventh compressed weight data WD7.
Each of the fourth compressed weight data WD4, the fifth compressed weight data WD5, the sixth compressed weight data WD6, and the seventh compressed weight data WD7 may be stored in a storage space whose addresses are sequential.
The fourth length information LI4 may include information about a start address (referenced by an up arrow) and a length (referenced by a side-to-side arrow) of the fourth compressed weight data WD4. The fifth length information LI5 may include information about a start address (referenced by an up arrow) and a length (referenced by a side-to-side arrow) of the fifth compressed weight data WD5.
The sixth length information LI6 may include information about a start address (referenced by an up arrow) and a length (referenced by a side-to-side arrow) of the sixth compressed weight data WD6. The seventh length information LI7 may include information about a start address (referenced by an up arrow) and a length (referenced by a side-to-side arrow) of the seventh compressed weight data WD7.
The processor 120 or the DMAC 170 may load each of the fourth compressed weight data WD4, the fifth compressed weight data WD5, the sixth compressed weight data WD6, and the seventh compressed weight data WD7 onto storage spaces of the memory 160, which start from the same start address and have sequential addresses, in parallel.
As illustrated in
The fourth compressed weight data WD4, the fifth compressed weight data WD5, the sixth compressed weight data WD6, and the seventh compressed weight data WD7 loaded to the memory 160 may be loaded to the accelerator 130 depending on the method described with reference to
The accelerator 230 may include the accelerator cores “C”, the data controllers CT, and the DMA masters “M”. The DMA masters “M” may be dedicated for the accelerator 230. The loading of compressed weight data to the accelerator 230 may be the same as described with reference to
In at least one embodiment, the DMA masters “M” included in the accelerator 230 may be called AXI masters.
Some embodiments in which the computing device 10 or 20 includes one accelerator 130 or 230 are described. However, the examples are not limited thereto, and the computing device 10 or 20 may include two or more accelerators. The computing device 10 or 20 may load compressed weight data to at least one accelerator among the two or more accelerators by using the DMA masters “M” included in the at least one accelerator. Alternatively, the computing device 10 or 20 may load compressed weight data to at least one accelerator by using the DMA masters “M” of a separate DMA controller.
In the above embodiments, components according to the present disclosure are described by using the terms “first”, “second”, “third”, etc. However, the terms “first”, “second”, “third”, etc. may be used to distinguish components from each other and do not limit the present disclosure. For example, the terms “first”, “second”, “third”, etc. do not involve an order or a numerical meaning of any form.
In the above embodiments, functional components according to embodiments of the present disclosure are referenced by using functional blocks and terms directed towards said functional blocks (e.g., “processor,” “accelerator,” etc.). The blocks may be implemented with processing circuitry, such as various hardware devices (such as an integrated circuit, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), and a complex programmable logic device (CPLD)), firmware driven in hardware devices, software (such as an application), and/or a combination of a hardware device and software. In at least some embodiments, the processing circuitry more specifically may be included in (and/or enabled by), but is not limited to, a central processing unit (CPU), an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, an FPGA, a System-on-Chip (SoC), a programmable logic unit, a microprocessor, an ASIC, etc., and/or may include active and/or passive electrical components such as transistors, resistors, capacitors, etc., and/or electronic circuits including one or more of said components. Also, the blocks may include circuits implemented with semiconductor elements in an integrated circuit, or circuits enrolled as an intellectual property (IP).
According to embodiments of the present disclosure, direct memory access (DMA) masters which load compressed weight data to a machine learning accelerator may be simultaneously programmed by using the same start address and the same length information. Accordingly, a machine learning accelerator capable of reducing a time taken to load compressed weight data, a computing device including the machine learning accelerator, and a method of loading data to the machine learning accelerator are provided.
While the present disclosure has been described with reference to embodiments thereof, it will be apparent to those of ordinary skill in the art that various changes and modifications may be made thereto without departing from the spirit and scope of the present disclosure as set forth in the following claims.