Embodiments generally relate to the transfer of data during artificial intelligence (AI) compute operations. More particularly, embodiments relate to maximizing on-chip data reuse in compute in memory (CiM) and compute near memory (CnM) architectures that perform AI compute operations.
A neural network (NN) can be represented as a graph of neuron layers in which data flows from one layer to the next. During AI/DNN (deep neural network) model execution, intermediate feature maps (e.g., tensors) are created at each layer and consumed in subsequent layers. Storing feature maps off-chip and re-fetching the feature maps during model execution may have a negative impact on power consumption and/or performance.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Compute in memory (CiM) and compute near memory (CnM) architectures are pursued for DNN-based workloads due to the significant energy savings and performance gains they provide by addressing data access issues. During AI/DNN model execution, intermediate feature maps (e.g., tensors, intermediate state data) are created at each layer and consumed in subsequent layers. Given the relatively high cost of off-chip data accesses (e.g., accesses to on-die or off-die memories external to on-chip (CiM/CnM enabled) memory), on-chip data reuse of intermediate feature maps can provide enhanced performance (e.g., in off-die bandwidth-constrained scenarios) and energy savings as compared to storing feature maps off-die and re-fetching the feature maps from off-die memory. Reusing feature map data within on-chip memories, however, presents the following challenges:
The technology described herein provides hardware support for in-memory direct memory access (DMA) operations. The technology described herein also provides an enhanced compiler (e.g., dataflow optimizer) framework addressing the above-mentioned challenges toward achieving improved performance and energy savings at the end-to-end workload level.
The illustrated architecture provides several advantages over conventional solutions. For example, with on-chip inter-layer data reuse across a variety of AI workloads, a substantial reduction in off-die data accesses and a performance improvement in bandwidth-constrained scenarios are achieved, as shown in a chart 150 (
Tensor Mapping and Address Allocation—Address Space Management
With continuing reference to
Additionally, a memory sub-array view 62 of buffer allocation demonstrates that the buffer allocation strategy minimizes memory fragmentation and provides maximum allocation size for tile data at each layer. For example, after deducting persistent data usage (e.g., pre-fetching weight data), the remaining memory space is used for feature maps and tile data storage.
In one example, feature map allocation is divided into two categories. Category #1 is used for long reuse buffers (e.g., feature-maps) and can fall on either of two ends/polarities (e.g., north/upper or south/lower) of the address space. Category #2 buffers are used for feature-maps that are produced and consumed in consecutively executed NN layers.
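By way of illustration only, the following Python sketch shows how such a two-ended allocation scheme may be organized: Category #1 buffers are pinned to the north (upper) or south (lower) end of the address space, while Category #2 buffers and tile data share the middle region. The class and method names are assumptions introduced purely for illustration and are not taken from the embodiments.

    # Hypothetical sketch of two-ended (north/south) buffer placement.
    class TwoEndedAllocator:
        def __init__(self, base, size):
            self.low = base           # next free byte at the south/lower end
            self.high = base + size   # end of the free region at the north/upper end

        def alloc_long_reuse(self, size, end):
            # Category #1: pinned for its whole lifetime at one polarity.
            if self.high - self.low < size:
                raise MemoryError("on-chip capacity exhausted")
            if end == "south":
                addr, self.low = self.low, self.low + size
            else:
                self.high -= size
                addr = self.high
            return addr

        def middle_region(self):
            # Category #2 buffers and per-layer tile data share whatever is left
            # between the two ends, so fragmentation stays at the boundaries.
            return (self.low, self.high)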
Tensor Mapping and Address Allocation—Tensor Memory Mapping
In an embodiment, tensor residencies are expressed as one of two modes (RESIDE_ON_CHIP, RESIDE_OFF_CHIP). Inputs to every layer include IFMs (input feature maps) and weights, and the outputs include OFMs (output feature maps).
Input and output operands to the neural network model may be defined as I0 and ON-1, respectively. Without loss of generality, each neural network can be expressed as a Directed Acyclic Graph (DAG). In one example, an architecture aware “greedy” algorithm runs through a linearized traversal of the DAG from source I0 to destination ON-1 and returns per-layer tensor residencies (e.g., residency data) and per-tensor available memory capacity (e.g., memory availability). Stages of the procedure are described below:
By contrast, processing block 72a of the method 72 estimates an access score for each tensor across the layers of the NN. In one example, the access score is a function of the write cost, the differential read cost, the number of consumers, and the lifetime of the tensor. Block 72b traverses a linearized representation of the DAG and block 72c identifies active tensors at the start of each layer. A determination is made at block 72d as to whether all active tensors fit within the available capacity. If so, block 72e marks the current layer tensor as RESIDE_ON_CHIP and the method 72 returns to block 72c. Otherwise, block 72f inspects the access scores of the active tensors and block 72g marks the tensor(s) with the lowest score(s) as RESIDE_OFF_CHIP. While the tensors with the lowest score(s) are marked as RESIDE_OFF_CHIP, other active tensors whose cumulative size is well within the available on-chip memory capacity are marked as RESIDE_ON_CHIP. Thus, the method 72 may be considered to be less "naïve" than the method 70. The residual memory capacity (e.g., after allocating the above buffers) is used to store tiled data for each layer in the intra-layer data flow optimization process.
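A minimal Python sketch of this access-score-driven traversal is given below for illustration. The score weighting, the tensor attributes, and the capacity query are assumptions chosen for readability rather than the exact operation of blocks 72a-72g.

    # Hypothetical sketch of the greedy residency pass (illustrative only).
    RESIDE_ON_CHIP, RESIDE_OFF_CHIP = "on_chip", "off_chip"

    def access_score(t):
        # Block 72a analogue: combine write cost, differential read cost,
        # number of consumers, and lifetime (the weighting is an assumption).
        return (t.write_cost + t.diff_read_cost * t.num_consumers) / max(t.lifetime, 1)

    def assign_residencies(layers, capacity):
        residency = {}
        for layer in layers:                          # linearized DAG traversal (72b)
            active = list(layer.active_tensors)       # tensors live at layer start (72c)
            for t in active:
                residency.setdefault(t, RESIDE_ON_CHIP)
            # Evict lowest-scoring tensors until the active set fits (72d/72f/72g).
            while sum(t.size for t in active if residency[t] == RESIDE_ON_CHIP) > capacity:
                on_chip = [t for t in active if residency[t] == RESIDE_ON_CHIP]
                victim = min(on_chip, key=access_score)
                residency[victim] = RESIDE_OFF_CHIP
        return residency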
Tensor Mapping and Address Allocation—Tensor Address Allocation
To minimize memory fragmentation due to allocation and deallocation procedures at each layer, an address (e.g., in the system address map) for each active tensor (e.g., feature-map allocated into CiM/CnM on-chip memory) is assigned as described herein. Embodiments maximize space allocation for feature-map and tile-data storage towards efficient execution:
Turning now to
In-Memory (e.g., Intra) Support for Local DMA
Turning now to
Responses received by a DMA response handler 120 from the inner decoder level are either passed-through to the next (e.g., outer) decoder level or sent to the decoder engine based on a DMA tag bit set in a response FIFO 122 at the time of request generation. Responses to the DMA engine 110 are appropriately handled by read/write managers. Once all the data movements associated with a DMA configuration request are completed, the DMA engine 110 issues a response command for an upstream configuration request to the next (e.g., outer) decoder level.
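The tag-based routing of responses can be pictured with the short Python sketch below. The queue, field, and method names are assumptions used purely for illustration of the pass-through versus local-handling decision.

    # Illustrative sketch of response routing based on a DMA tag bit recorded
    # at request-generation time (queue/field names are assumptions).
    from collections import deque

    class DmaResponseHandler:
        def __init__(self, outer_level, dma_engine):
            self.response_fifo = deque()   # one tag entry per outstanding request
            self.outer_level = outer_level
            self.dma_engine = dma_engine

        def on_request_issued(self, dma_tag):
            # Remember whether this request was generated by the local DMA engine.
            self.response_fifo.append(dma_tag)

        def on_response(self, response):
            dma_tag = self.response_fifo.popleft()
            if dma_tag:
                # Locally generated request: hand the response to the DMA engine's
                # read/write managers.
                self.dma_engine.handle_response(response)
            else:
                # Pass-through to the next (e.g., outer) decoder level.
                self.outer_level.forward_response(response)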
Inter-Layer Spatial Data Flow Optimization—Overview
Embodiments provide a framework for obtaining an optimal spatial data flow (e.g., of the feature-map data of a layer) for a distributed memory (e.g., CnM/CiM) architecture using local DMA capability for inter-memory data transfers and inter-layer data reuse (e.g., feature maps). First, a taxonomy to obtain an optimal data flow for a CnM/CiM architecture with multicast capability at each decoder level in the memory hierarchy will be provided, followed by a procedure to find the optimal spatial data flow, and a procedure to identify decoder levels (e.g., in the memory hierarchy) with DMA capability that achieves the best trade-off between performance and hardware complexity.
Inter-Layer Spatial Data Flow—Taxonomy
In general, multiple tiles are grouped into a "super tile", which constitutes a quantum of work that is executed simultaneously on the memory architecture. More particularly, IFM tiles and weight data tiles may be grouped individually to form an "IFM super tile" and a "weight super tile". Based on NN layer parameters and the compiler strategy for optimal data flows, an IFM super tile may contain one or more tiles along the spatial (I chunks) and input channel (Ic chunks) dimensions. Similarly, a weight super tile contains one or more tiles along the input channel (Ic chunks) and output channel/filter (Oc chunks) dimensions. IFM and weight super tiles are divided hierarchically such that the product of #I chunks, #Ic chunks, and #Oc chunks for a fan-out at any given level in the memory hierarchy is equal to the total number of compute cores under that fan-out at that level. In other words, the division factors div I, div Ic, and div Oc (e.g., for I chunks, Ic chunks, and Oc chunks, respectively) at a given hierarchy level are chosen such that the product of all three division factors is equal to the number of fan-outs at that hierarchy level, to efficiently parallelize computations across N nodes. Thus, the chunk counts at any level are the cumulative product of the division factors for all levels below the level in question.
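The constraint that the three division factors multiply to the fan-out at each level, and that chunk counts accumulate multiplicatively from the leaves upward, can be checked with a short sketch such as the following. The level parameters and fan-out values are hypothetical and chosen only to make the arithmetic concrete.

    # Illustrative check of super-tile division factors (values are hypothetical).
    # Each level lists its fan-out and the chosen (div_I, div_Ic, div_Oc).
    levels = [
        {"fanout": 4, "div_I": 2, "div_Ic": 2, "div_Oc": 1},   # innermost level
        {"fanout": 8, "div_I": 2, "div_Ic": 2, "div_Oc": 2},
        {"fanout": 2, "div_I": 1, "div_Ic": 1, "div_Oc": 2},   # outermost level
    ]

    chunks_I = chunks_Ic = chunks_Oc = 1
    for lvl in levels:
        # Division factors at a level must multiply to that level's fan-out so that
        # work is fully parallelized across the nodes under one decoder.
        assert lvl["div_I"] * lvl["div_Ic"] * lvl["div_Oc"] == lvl["fanout"]
        # Chunk counts at any level are the cumulative product of the divs below it.
        chunks_I *= lvl["div_I"]
        chunks_Ic *= lvl["div_Ic"]
        chunks_Oc *= lvl["div_Oc"]

    total_cores = 1
    for lvl in levels:
        total_cores *= lvl["fanout"]
    assert chunks_I * chunks_Ic * chunks_Oc == total_cores   # 4 * 4 * 4 = 64 cores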
Inter-Layer Spatial Data Flow—Optimizing Tensor Spatial Layout
Assuming hardware support for local DMA at each decoder stage in the memory hierarchy (e.g., CiM/CnM), the below procedure optimizes the spatial arrangement of the current layer to minimize data transfers over longer wires (e.g., maximize data transfers at the innermost levels). Such an approach reduces the cost of data movement, achieving energy savings and lower latency (e.g., in bandwidth-constrained scenarios).
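As a purely hypothetical illustration of the kind of cost model such a procedure may rely on (and not the procedure of the figure described below), the sketch that follows scores a candidate spatial arrangement by weighting the bytes moved at each decoder level, with outer (longer-wire) levels penalized more heavily. The weights and the traffic estimator are assumptions.

    # Hypothetical cost model: prefer arrangements that keep transfers at the
    # innermost decoder levels (weights and traffic estimates are illustrative).
    def arrangement_cost(bytes_per_level, wire_cost_per_level):
        # bytes_per_level[k]: data moved across level k for this candidate layout.
        # wire_cost_per_level[k]: relative energy/latency of a transfer at level k,
        # increasing toward the outer levels (longer wires).
        return sum(b * w for b, w in zip(bytes_per_level, wire_cost_per_level))

    def pick_layout(candidates, wire_cost_per_level, traffic_model):
        # traffic_model(candidate) -> bytes moved at each level for that layout.
        return min(candidates,
                   key=lambda c: arrangement_cost(traffic_model(c), wire_cost_per_level))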
Turning now to
Inter-Layer Spatial Data Flow—DMA Engine Placement
Placing a DMA engine at each decoder level in the memory hierarchy may add significant complexity (e.g., higher area and power). Accordingly, embodiments restrict the DMA hardware to a subset of locations corresponding to the plurality of address decoders (e.g., selectively picking decoder levels with DMA capability). The procedure detailed below identifies the optimal decoder levels at which to place DMA engines for facilitating data transfers within the distributed memory:
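One illustrative way to frame such a selection (offered as a sketch only, not the procedure of the embodiments) is to score each candidate subset of decoder levels by the transfer cost it achieves against the hardware cost it adds, as in the hypothetical Python fragment below; the cost models and the weighting factor are assumptions.

    # Hypothetical sketch: choose a subset of decoder levels to receive DMA engines
    # by trading transfer cost against added area/power (cost models are assumptions).
    from itertools import combinations

    def best_dma_placement(levels, transfer_cost, hw_cost, alpha=1.0):
        # transfer_cost(subset): estimated data-movement cost when only the levels in
        #   `subset` can perform local DMA transfers.
        # hw_cost(subset): estimated area/power complexity of adding DMA at `subset`.
        best, best_score = None, float("inf")
        for k in range(1, len(levels) + 1):
            for subset in combinations(levels, k):
                score = transfer_cost(subset) + alpha * hw_cost(subset)
                if score < best_score:
                    best, best_score = subset, score
        return best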
Turning now to
Spatial data flow alignment helps to reduce data transfers by aligning the desired spatial arrangement of feature maps to the expected arrangement hierarchically. Identifying the correct level of data transfers not only reduces data transfers, but also aids in architecture design—to assess the right level to place decoders to move data across nodes of the same hierarchy. Both of these advantages jointly contribute to the performance improvements visible between base-MR and all-levels-MR in the chart 152 (
Illustrated processing block 162 provides for obtaining DMA configuration data (e.g., descriptor details, sideband signals). In an embodiment, the DMA configuration data is received by a chip including compute hardware, a plurality of address decoders coupled to the compute hardware, a hierarchical interconnect fabric coupled to the plurality of address decoders, and DMA hardware positioned adjacent to one or more of the plurality of address decoders. Block 164 conducts on-chip transfers of intermediate state data via the hierarchical interconnect fabric. In one example, the intermediate state data is inter-layer data associated with a neural network. The method 160 therefore enhances performance at least to the extent that conducting the on-chip transfers of intermediate state data reduces DRAM (dynamic random access memory) transfers and/or latency during operation.
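The DMA configuration data of block 162 can be pictured as a descriptor along the lines of the hypothetical sketch below. The specific fields are assumptions made for illustration only and do not represent the descriptor format of the embodiments.

    # Hypothetical DMA descriptor fields (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class DmaDescriptor:
        src_addr: int        # source address within the on-chip address map
        dst_addr: int        # destination address within the on-chip address map
        length: int          # bytes to move per transfer
        num_transfers: int   # how many strided transfers this descriptor covers
        src_stride: int      # stride between successive source transfers
        dst_stride: int      # stride between successive destination transfers
        multicast_mask: int  # fan-outs at this decoder level that receive a copy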
Computer program code to carry out operations shown in the method 170 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
Processing block 172 determines residency data (e.g., per-layer residency data) associated with the intermediate state data (e.g., inter-layer data). In an embodiment, residency data indicates whether the intermediate state data in each layer is to reside on-chip or off-chip. Block 174 determines a memory availability (e.g., per-tensor memory availability) associated with the intermediate state data. In an embodiment, the memory availability indicates the available memory capacity for each tensor in the intermediate state data. Block 176 allocates address space on the chip to the intermediate state data. In one example, block 176 determines access scores associated with the intermediate state data, wherein the address space is further allocated based on the access scores. Additionally, block 176 may hierarchically align spatial data flows between output state data (e.g., output feature maps) and input state data (e.g., input feature maps) in the intermediate state data. Block 178 stores the intermediate state data to the allocated address space. The method 170 therefore further enhances performance at least to the extent that allocating address space to the intermediate state data based on the residency data and/or the memory availability addresses concerns over bandwidth bottlenecks, limited capacity of on-chip memories and/or locally produced data tiles not being fully consumed by the same local compute cores/nodes.
Turning now to
In the illustrated example, the system 280 includes a host processor 282 (e.g., central processing unit/CPU) that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM including a plurality of dynamic RAMs/DRAMs) and an integrated memory controller (IMC) 284. In an embodiment, an IO (input/output) module 288 is coupled to the host processor 282. The illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), mass storage 302 (e.g., hard disk drive/HDD, optical disc, solid state drive/SSD) and a network controller 292 (e.g., wired and/or wireless). The host processor 282 may be combined with the IO module 288, a graphics processor 294, and the IMC 284 into a system on chip (SoC) 298. The illustrated IMC 284 includes decoder hierarchies 304 (e.g., including DMA hardware) and an AI accelerator 296 (e.g., compute hardware).
In an embodiment, the IMC 284 is incorporated onto a chip such as, for example, the chip 30 (
The IMC 284 and/or the host processor 282 may also execute instructions 300 (e.g., compiler and/or data flow optimizer instructions) retrieved from the system memory 286 and/or the mass storage 302 to perform one or more aspects of the method 70 (
The computing system 280 and/or the chip containing the IMC 284 are therefore considered performance-enhanced at least to the extent that conducting the on-chip transfers of intermediate state data reduces DRAM (dynamic random access memory) transfers and/or latency during operation. The computing system 280 and/or the chip containing the IMC 284 are also considered performance-enhanced because allocating address space to the intermediate state data based on the residency data and/or the memory availability addresses concerns over bandwidth bottlenecks, limited capacity of on-chip memories and/or locally produced data tiles not being fully consumed by the same local compute cores/nodes.
The logic 354 may be implemented at least partly in configurable or fixed-functionality hardware. In one example, the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction. The logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352.
The processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 450 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order retirement of instructions. Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 400 is transformed during execution of the code 413, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 425, and any registers (not shown) modified by the execution logic 450.
Although not illustrated in
Referring now to
The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in
As shown in
Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.
While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of the processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package.
The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088. As shown in
The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively. As shown in
In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
As shown in
Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of
Example 1 includes a performance-enhanced chip comprising a memory structure including compute hardware, a plurality of address decoders coupled to the compute hardware, and a hierarchical interconnect fabric coupled to the plurality of address decoders, and direct memory access (DMA) hardware positioned adjacent to one or more of the plurality of address decoders, wherein the DMA hardware is to conduct on-chip transfers of intermediate state data via the hierarchical interconnect fabric.
Example 2 includes the chip of Example 1, wherein the chip further includes logic coupled to one or more substrates, and wherein the logic is implemented at least partly in one or more of configurable hardware or fixed-functionality hardware, the logic to allocate address space in the chip to the intermediate state data, and store the intermediate state data to the allocated address space.
Example 3 includes the chip of Example 2, wherein the logic is further to determine residency data associated with the intermediate state data, and determine a memory availability associated with the intermediate state data, wherein the address space is allocated based on the residency data and the memory availability.
Example 4 includes the chip of Example 3, wherein the logic is further to determine access scores associated with the intermediate state data, and wherein the address space is further allocated based on the access scores.
Example 5 includes the chip of Example 2, wherein the logic is further to hierarchically align spatial data flows between output state data and input state data in the intermediate state data.
Example 6 includes the chip of Example 2, wherein the logic coupled to the one or more substrates includes transistor regions that are positioned within the one or more substrates.
Example 7 includes the chip of Example 1, wherein the intermediate state data is to be inter-layer data associated with a neural network.
Example 8 includes the chip of Example 1, wherein the memory structure further includes a plurality of memory cell sub-arrays, and wherein the compute hardware includes a plurality of compute cores corresponding to the plurality of memory cell sub-arrays.
Example 9 includes the chip of Example 1, wherein the memory structure includes a memory bit-cell array, and wherein the memory bit-cell array includes the compute hardware.
Example 10 includes the chip of any one of Examples 1 to 9, wherein the DMA hardware is restricted to a subset of locations corresponding to the plurality of address decoders.
Example 11 includes a performance-enhanced computing system comprising a network controller, a processor coupled to the network controller, and a chip comprising a memory structure coupled to the processor, wherein the memory structure includes compute hardware, a plurality of address decoders coupled to the compute hardware, and a hierarchical interconnect fabric coupled to the plurality of address decoders, and direct memory access (DMA) hardware positioned adjacent to one or more of the plurality of address decoders, wherein the DMA hardware is to conduct on-chip transfers of intermediate state data via the hierarchical interconnect fabric.
Example 12 includes the computing system of Example 11, wherein the intermediate state data is to be inter-layer data associated with a neural network.
Example 13 includes the computing system of Example 11, wherein the chip further includes a plurality of memory cell sub-arrays, and wherein the compute hardware includes a plurality of compute cores corresponding to the plurality of memory cell sub-arrays.
Example 14 includes the computing system of Example 11, wherein the memory structure includes a memory bit-cell array, and wherein the memory bit-cell array includes the compute hardware.
Example 15 includes the computing system of any one of Examples 11 to 14, wherein the DMA hardware is restricted to a subset of locations corresponding to the plurality of address decoders.
Example 16 includes at least one computer readable storage medium comprising a set of instructions, which when executed by a chip, cause the chip to allocate address space in the chip to inter-layer data associated with a neural network, and store the inter-layer data to the allocated address space.
Example 17 includes the at least one computer readable storage medium of Example 16, wherein the instructions, when executed, further cause the chip to determine per-layer residency data associated with the inter-layer data, and determine a per-tensor memory availability associated with the inter-layer data, wherein the address space is allocated based on the per-layer residency data and the per-tensor memory availability.
Example 18 includes the at least one computer readable storage medium of Example 17, wherein the instructions, when executed, further cause the chip to determine access scores associated with the inter-layer data, and wherein the address space is further allocated based on the access scores.
Example 19 includes the at least one computer readable storage medium of any one of Examples 16 to 18, wherein the instructions, when executed, further cause the chip to hierarchically align spatial data flows between output feature maps and input feature maps in the inter-layer data.
Example 20 includes a method of operating a chip, the method comprising allocating address space in the chip to inter-layer data associated with a neural network, and storing the inter-layer data to the allocated address space.
Example 21 includes an apparatus comprising means for performing the method of Example 20.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.