ACCELERATOR SYSTEM USING DIGITAL IN-MEMORY COMPUTE CHIPLET DEVICES FOR COMPUTATIONAL WORKLOADS

Description

BACKGROUND OF THE INVENTION

Conventional Network Interface Cards (NICs) that enable Ethernet connectivity face significant challenges when scaling across distributed AI accelerators for applications such as natural language processing (NLP), computer vision (CV), generative AI, agentic AI, autonomous reasoning/decision-making, dataset analysis, and the like. The challenges become especially pronounced in environments where multiple nodes are involved. A solution, RoCEv2 (commonly known as “RDMA over Converged Ethernet”) combined with RDMA IB (“InfiniBand”) fabric, can facilitate multi-node GPU accelerator communication

Various limitations, however, exist with RDMA over Converged Ethernet combined with InfiniBand. Such a solution often requires a complex shared address space to support one-sided communication, making deployment and management more complex. Additionally, both software and hardware fabric solutions face constraints when attempting to meet the low-latency demands required for Generative AI (GenAI) inferences. Such limitation hinders their ability to fully capitalize on the performance potential of modern accelerators.

Other conventional techniques involve the use of PCIe (Peripheral Component Interface Express) topologies. However, PCIe fabric topologies present scalability limitations. While such PCIe can support intra-switch communication, PCIe is restricted by a limited number of PCIe lanes provided by CPU sockets and PCIe switches. Accordingly, PCIe further complicates achieving efficient multi-CPU socket Peer-to-Peer (P2P) connectivity across nodes.

Certain PCIe switch vendors offer synthetic fabric models that enable cross-switch x16 link communication through the use of custom firmware. Unfortunately, such synthetic fabric models remain highly specialized and not yet broadly adopted.

From the above, it is seen that techniques for scaling across distributed accelerators are highly desirable.

BRIEF SUMMARY OF THE INVENTION

The present invention relates generally to integrated circuit (IC) devices and artificial intelligence (AI) systems. More particularly, the present invention relates to methods and device structures for accelerating computing workloads of neural network models (e.g., transformer models, convolution neural network models, etc.). These methods and structures can be used in applications such as natural language processing (NLP), computer vision (CV), generative AI, agentic AI, autonomous reasoning/decision-making, and the like. Merely by way of example, the invention has been applied to AI accelerator apparatuses and chiplet devices configured in a PCIe card.

According to an example, the present invention provides for a digital in-memory compute (DIMC) accelerator system using a chiplet architecture. The system includes a host device is configured to compile computational workload data for a target application obtained from data gathering devices into an instruction set architecture (ISA) graph to be executed by a plurality of accelerator apparatuses. Each such accelerator includes a plurality of chiplets, each of which includes a plurality of tiles, and each such tile includes a plurality of slices, a central processing unit (CPU), and a DIMC device configured to perform high throughput computations using the ISA graph to process the computational workload. The target application can include natural language processing (NLP), autonomous reasoning/decision-making, video/image processing, cybersecurity/fraud detection, manufacturing/industrial processes, agentic artificial intelligence (AI), or smart cities/Internet of Things (IoT). And the data gathering devices can include a web-scrapers, a dataset readers, a crowdsourcing devices, a sensors, a simulation devices, an IoT network, and others.

In an example, the DIMC architecture and high memory bandwidth can significantly speed up the processing of target computational workloads of a particular application, such as those mentioned previously. The DIMC accelerator system can perform precise and efficient computations of data in a block floating point (BFP) format and can also switch to a lower precision floating point (FP) during runtime. By dynamically switching between precision levels based on real-time analysis of the target workload, the DIMC system can optimize computational efficiency while maintaining the necessary level of accuracy for each step of the workload computation. And with a high memory bandwidth, the DIMC architecture enables a high throughput of workload computations.

The accelerator and chiplet architecture and its related methods can provide many benefits. With modular chiplets, the accelerator apparatus can be easily scaled to accelerate the workloads for transformers of different sizes. The DIMC configuration within the chiplet slices also improves computational performance and reduces power consumption by integrating computational functions and memory fabric. Further, embodiments of the accelerator apparatus can allow for quick and efficient mapping from computational workload data to enable effective implementation of AI applications, and the like.

A further understanding of the nature and advantages of the invention may be realized by reference to the latter portions of the specification and attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more fully understand the present invention, reference is made to the accompanying drawings. Understanding that these drawings are not to be considered limitations in the scope of the invention, the presently described embodiments and the presently understood best mode of the invention are described with additional detail through use of the accompanying drawings in which:

FIG. 1A-1B are simplified block diagrams illustrating AI accelerator apparatuses according to examples of the present invention.

FIGS. 2A-2B are simplified block diagrams illustrating 16-slice chiplet devices according to examples of the present invention.

FIGS. 3A-B are simplified block diagrams illustrating slice devices according to examples of the present invention.

FIG. 4 is a simplified block diagram illustrating an in-memory-compute (IMC) module according to an example of the present invention.

FIG. 5A is a simplified block flow diagram illustrating numerical formats of the data being processed in a slice device according to an example of the present invention.

FIG. 5B is a simplified diagram illustrating example numerical formats.

FIG. 6 is a simplified block diagram of a transformer architecture.

FIG. 7 is a simplified diagram illustrating a self-attention layer process for an example NLP model.

FIG. 8 is a simplified block diagram illustrating an example transformer.

FIG. 9 is a simplified block diagram illustrating an attention head layer of an example transformer.

FIG. 10 is a simplified table representing an example mapping process between a 24-layer transformer and an example eight-chiplet AI accelerator apparatus according to an example of the present invention.

FIG. 11 is a simplified block flow diagram illustrating a mapping process between a transformer and an AI accelerator apparatus according to an example of the present invention.

FIG. 12 is a simplified table representing a tiling attention process of a transformer to an AI accelerator apparatus according to an example of the present invention.

FIGS. 13A-13C are simplified tables illustrating data flow through the IMC and single input multiple data (SIMD) modules according to an example of the present invention.

FIG. 14 is a simplified block diagram illustrating a digital in-memory compute (DIMC) accelerator system configured for a variety of AI applications according to an example of the present invention.

FIG. 15 is a simplified block diagram illustrating a DIMC accelerator system according to an example of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates generally to integrated circuit (IC) devices and artificial intelligence (AI) systems. More particularly, the present invention relates to methods and device structures for accelerating computing workloads of neural network models (e.g., transformer models, convolution neural network models, etc.). These methods and structures can be used in applications such as natural language processing (NLP), computer vision (CV), and the like. Merely by way of example, the invention has been applied to AI accelerator apparatuses and chiplet devices configured in a PCIe card.

Currently, the vast majority of NLP models are based on the transformer model, such as the bidirectional encoder representations from transformers (BERT) model, BERT Large model, and generative pre-trained transformer (GPT) models such as GPT-2 and GPT-3, etc. However, these transformers have very high compute and memory requirements. According to an example, the present invention provides for an apparatus using chiplet devices that are configured to accelerate transformer computations for AI applications. Examples of the AI accelerator apparatus are shown in FIGS. 1A and 1B.

FIG. 1A illustrates a simplified AI accelerator apparatus 101 with two chiplet devices 110. As shown, the chiplet devices 110 are coupled to each other by one or more die-to-die (D2D) interconnects 120. Also, each chiplet device 110 is coupled to a memory interface 130 (e.g., static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic RAM (SDRAM), or the like). The apparatus 101 also includes a substrate member 140 that provides mechanical support to the chiplet devices 110 that are configured upon a surface region of the substrate member 140. The substrate can include interposers, such as a silicon interposer, glass interposer, organic interposer, or the like. The chiplets can be coupled to one or more interposers, which can be configured to enable communication between the chiplets and other components (e.g., serving as a bridge or conduit that allows electrical signals to pass between internal and external elements).

FIG. 1B illustrates a simplified AI accelerator apparatus 102 with eight chiplet devices 110 configured in two groups of four chiplets on the substrate member 140. Here, each chiplet device 110 within a group is coupled to other chiplet devices by one or more D2D interconnects 120. Apparatus 102 also shows a DRAM memory interface 130 coupled to each of the chiplet devices 110. The DRAM memory interface 130 can be coupled to one or more memory modules, represented by the “Mem” block.

As shown, the AI accelerator apparatuses 101 and 102 are embodied in peripheral component interconnect express (PCIe) card form factors, but the AI accelerator apparatus can be configured in other form factors as well. These PCIe card form factors can be configured in a variety of dimensions (e.g., full height, full length (FHFL); half height, half length (HHHL), etc.) and mechanical sizes (e.g., 1×, 2×, 4×, 16×, etc.). In an example, one or more substrate members 140, each having one or more chiplets, are coupled to a PCIe card. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to these elements and configurations of the AI accelerator apparatus.

Embodiments of the AI accelerator apparatus can implement several techniques to improve performance (e.g., computational efficiency) in various AI applications. The AI accelerator apparatus can include digital in-memory-compute (DIMC) to integrate computational functions and memory fabric. Algorithms for the mapper, numerics, and sparsity can be optimized within the compute fabric. And, use of chiplets and interconnects configured on organic interposers can provide modularity and scalability.

According to an example, the present invention implements chiplets with in-memory-compute (IMC) functionality, which can be used to accelerate the computations required by the workloads of transformers. The computations for training these models can include performing a scaled dot-product attention function to determine a probability distribution associated with a desired result in a particular AI application. In the case of training NLP models, the desired result can include predicting subsequent words, determining contextual word meaning, translating to another language, etc.

The chiplet architecture can include a plurality of slice devices (or slices)

controlled by a central processing unit (CPU) to perform the transformer computations in parallel. Each slice is a modular IC device that can process a portion of these computations. The plurality of slices can be divided into tiles/gangs (i.e., subsets) of one or more slices with a CPU coupled to each of the slices within the tile. This tile CPU can be configured to perform transformer computations in parallel via each of the slices within the tile. A global CPU can be coupled to each of these tile CPUs and be configured to perform transformer computations in parallel via all of the slices in one or more chiplets using the tile CPUs. Further details of the chiplets are discussed in reference to FIGS. 2A-5B, while transformers are discussed in reference to FIGS. 6-9.

FIG. 2A is a simplified block diagram illustrating an example configuration of a 16-slice chiplet device 201. In this case, the chiplet 201 includes four tile devices 210, each of which includes four slice devices 220, a CPU 221, and a hardware dispatch (HW DS) device 222. In a specific example, these tiles 210 are arranged in a symmetrical manner. As discussed previously, the CPU 221 of a tile 210 can coordinate the operations performed by all slices within the tile. The HW DS 222 is coupled to the CPU 221 and can be configured to coordinate control of the slices 220 in the tile 210 (e.g., to determine which slice in the tile processes a target portion of transformer computations). In a specific example, the CPU 221 can be a reduced instruction set computer (RISC) CPU, or the like. Further, the CPU 221 can be coupled to a dispatch engine, which is configured to coordinate control of the CPU 221 (e.g., to determine which portions of transformer computations are processed by the particular CPU).

The CPUs 221 of each tile 210 can be coupled to a global CPU via a global CPU interface 230 (e.g., buses, connectors, sockets, etc.). This global CPU can be configured to coordinate the processing of all chiplet devices in an AI accelerator apparatus, such as apparatuses 101 and 102 of FIGS. 1A and 1B, respectively. In an example, a global CPU can use the HW DS 222 of each tile to direct each associated CPU 221 to perform various portions of the transformer computations across the slices in the tile. Also, the global CPU can be a RISC processor, or the like. The chiplet 201 also includes D2D interconnects 240 and a memory interface 250, both of which are coupled to each of the CPUs 221 in each of the tiles. In an example, the D2D interconnects can be configured with single-ended signaling. The memory interface 250 can include one or more memory buses coupled to one or more memory devices (e.g., DRAM, SRAM, SDRAM, or the like).

Further, the chiplet 201 includes a PCIe interface/bus 260 coupled to each of the CPUs 221 in each of the tiles. The PCIe interface 260 can be configured to communicate with a server or other communication system. In the case of a plurality of chiplet devices, a main bus device is coupled to the PCIe bus 260 of each chiplet device using a master chiplet device (e.g., main bus device also coupled to the master chiplet device). This master chiplet device is coupled to each other chiplet device using at least the D2D interconnects 240. The master chiplet device and the main bus device can be configured overlying a substrate member (e.g., same substrate as chiplets or separate substrate). An apparatus integrating one or more chiplets can also be coupled to a power source (e.g., configured on-chip, configured in a system, or coupled externally) and can be configured and operable to a server, network switch, or host system using the main bus device. The server apparatus can also be one of a plurality of server apparatuses configured for a server farm within a data center, or other similar configuration.

In a specific example, an AI accelerator apparatus configured for GPT-3 can incorporate eight chiplets (similar to apparatus 102 of FIG. 1B). The chiplets can be configured with D2D 16×16 Gb/s interconnects, 32-bit LPDDR5 6.4 Gb/s memory modules, and 16 lane PCIe Gen 5 PHY NRZ 32 Gb/s/lane interface. LPDDR5 (16×16 GB) can provide the necessary capacity, bandwidth and low power for large scale NLP models, such as quantized GPT-3. Of course, there can be other variations, modifications, and alternatives.

FIG. 2B is a simplified block diagram illustrating an example configuration of a 16-slice chiplet device 202. Similar to chiplet 201, chiplet 202 includes four gangs 210 (or tiles), each of which includes four slice devices 220 and a CPU 221. As shown, the CPU 221 of each gang/tile 210 is coupled to each of the slices 220 and to each other CPU 221 of the other gangs/tiles 210. In an example, the tiles/gangs serve as neural cores, and the slices serve as compute cores. With this multi-core configuration, the chiplet device can be configured to take and run several computations in parallel. The CPUs 221 are also coupled to a global CPU interface 230, D2D interconnects 240, a memory interface 250, and a PCIe interface 260. As described for FIG. 2A, the global CPU interface 230 connects to a global CPU that controls all of the CPUs 221 of each gang 210.

FIG. 3A is a simplified block diagram illustrating an example slice device 301 of a chiplet. For the 16-slice chiplet example, slice device 301 includes a compute core 310 having four compute paths 312, each of which includes an input buffer (IB) device 320, a digital in-memory-compute (DIMC) device 330, an output buffer (OB) device 340, and a Single Instruction, Multiple Data (SIMD) device 350 coupled together. Each of these paths 312 is coupled to a slice cross-bar/controller 360, which is controlled by the tile CPU to coordinate the computations performed by each path 312.

In an example, the DIMC is coupled to a clock and is configured within one or more portions of each of the plurality of slices of the chiplet to allow for high throughput of one or more matrix computations provided in the DIMC such that the high throughput is characterized by 512 multiply accumulates per a clock cycle. In a specific example, the clock coupled to the DIMC is a second clock derived from a first clock (e.g., chiplet clock generator, AI accelerator apparatus clock generator, etc.) configured to output a clock signal of about 0.5 GHz to 4 GHz; the second clock can be configured at an output rate of about one half of the rate of the first clock. The DIMC can also be configured to support a block structured sparsity (e.g., imposing structural constraints on weight patterns of a neural networks like a transformer).

In an example, the SIMD device 350 is a SIMD processor coupled to an output of the DIMC. The SIMD 350 can be configured to process one or more non-linear operations and one or more linear operations on a vector process. The SIMD 350 can be a programmable vector unit or the like. The SIMD 350 can also include one or more random-access memory (RAM) modules, such as a data RAM module, an instruction RAM module, and the like.

In an example, the slice controller 360 is coupled to all blocks of each compute path 312 and also includes a control/status register (CSR) 362 coupled to each compute path. The slice controller 360 is also coupled to a memory bank 370 and a data reshape engine (DRE) 380. The slice controller 360 can be configured to feed data from the memory bank 370 to the blocks in each of the compute paths 312 and to coordinate these compute paths 312 by a processor interface (PIF) 364. In a specific example, the PIF 364 is coupled to the SIMD 350 of each compute path 312.

Further details for the compute core 310 are shown in FIG. 3B. The simplified block diagram of slice device 302 includes an input buffer 320, a DIMC matrix vector unit 330, an output buffer 340, a network on chip (NoC) device 342, and a SIMD vector unit 350. The DIMC unit 330 includes a plurality of in-memory-compute (IMC) modules 332 configured to compute a Scaled Dot-Product Attention function on input data to determine a probability distribution, which requires high-throughput matrix multiply-accumulate operations.

These IMC modules 332 can also be coupled to a block floating point alignment module 334 and a partial products reduction module 336 for further processing before outputting the DIMC results to the output buffer 540. In an example, the input buffer 320 receives input data (e.g., data vectors) from the memory bank 370 (shown in FIG. 3A) and sends the data to the IMC modules 332. The IMC modules 332 can also receive instructions from the memory bank 370 as well.

In addition to the details discussed previously, the SIMD 350 can be configured as an element-wise vector unit. The SIMD 350 can include a computation unit 352 (e.g., add, subtract, multiply, max, etc.), a look-up table (LUT) 354, and a state machine (SM) module 356 configured to receive one or more outputs from the output buffer 340.

The NoC device 342 is coupled to the output buffer 340 configured in a feedforward loop via shortcut connection 344. Also, the NoC device 342 is coupled to each of the slices and is configured for multicast and unicast processes. More particularly, the NoC device 342 can be configured to connect all of the slices and all of the tiles, multi-cast input activations to all of the slices/tiles, and collect the partial computations to be unicast for a specially distributed accumulation.

Considering the previous eight-chiplet AI accelerator apparatus example, the input buffer can have a capacity of 64 KB with 16 banks and the output buffer can have a capacity of 128 KB with 16 banks. The DIMC can be an 8-bit block have dimensions 64×64 (eight 64×64 IMC modules) and the NoC can have a size of 512 bits. The computation block in the SIMD can be configured for 8-bit and 32-bit integer (int) and unsigned integer (uint) computations. These slice components can vary depending on which transformer the AI accelerator apparatus will serve.

FIG. 4 is a simplified block diagram illustrating an example IMC module 700. As shown, module 700 includes one or more computation tree blocks 410 that are configured to perform desired computations on input data from one or more read-write blocks 420. Each of these read-write blocks 420 includes one or more first memory-select units 422 (also denoted as “W”), one or more second memory-select units 424 (also denoted as “I”), an activation multiplexer 426, and an operator unit 428. The first memory-select unit 422 provides an input to the operator unit 428, while the second memory-select unit 424 controls the activation multiplexer 426 that is also coupled to the operator unit 428. In the case of multiply-accumulate operations, the operator unit 428 is a multiplier unit and the computation tree blocks 410 are multiplier adder tree blocks (i.e., Σx.w).

As shown in close-up 401, each of the memory-select units 422, 424 includes a memory cell 430 (e.g., SRAM cell, or the like) and a select multiplexer 432. Each of the memory-select units 422, 424 is coupled to a read-write controller 440, which is also coupled to a memory bank/driver block 442. In an example, the read-write controller 440 can be configured with column write drivers and column read sense amplifiers, while the memory bank/driver block 432 can configured with sequential row select drivers.

An input activation controller 450 can be coupled to the activation multiplexer 426 each of the read-write blocks 420. The input activation controller 450 can include precision and sparsity aware input activation register and drivers. The operator unit 428 receives the output of the first memory-select unit 422 and receives the output of this block 450 through the activation multiplexer 426, which is controlled by the output of the second memory-select unit 424. The output of the operator unit 428 is then fed into the computation tree block 410.

The input activation block 450 is also coupled to a clock source/generator 460. As discussed previously, the clock generator 460 can produce a second clock derived from a first clock configured to output a clock signal of about 0.5 GHz to 4 GHz; the second clock can be configured at an output rate of about one half of the rate of the first clock. The clock generator 460 is coupled to one or more sign and precision aware accumulators 470, which are configured to receive the output of the computation tree blocks 410. In an example, an accumulator 470 is configured to receive the outputs of two computation tree blocks 410. Example output readings of the IMC are shown in FIGS. 13A-13C.

Referring back to the eight-chiplet AI accelerator apparatus example, the memory cell can be a dual bank 2×6T SRAM cell, and the select multiplexer can be an 8T bank select multiplexer. In this case, the memory bank/driver block 442 includes a dual-bank SRAM bank. Also, the read/write controller can include 64 bytes of write drivers and 64 bytes of read sense amplifiers. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to these IMC module components and their configurations.

FIG. 5A is a simplified block flow diagram illustrating example numerical formats of the data being processed in a slice. Diagram 501 shows a loop with the data formats for the GM/input buffer 510, the IMC 520, the output buffer 530, the SIMD 540, and the NoC 550, which feeds back to the GM/input buffer 510. The IMC block 520 shows the multiply-accumulate operation (Σx.w). Additionally, the format for the data from IMC 532 flows to the output buffer 530 as well. In this example, the numerical formats include integer (int), floating point (float), and block floating (bfloat) of varying lengths.

FIG. 5B is a simplified diagram illustrating certain numerical formats, including certain formats shown in FIG. 5A. Block floating point numerics can be used to address certain barriers to performance. Training of transformers is generally done in floating point, i.e., 32-bit float or 16-bit float, and inference is generally done in 8-bit integer (“int8”). With block floating point, an exponent is shared across a set of mantissa significant values (sec diagonally line filled blocks of the int8 vectors at the bottom of FIG. 5B), as opposed to floating point where each mantissa has a separate exponent (see 32-bit float and 16-bit float formats at the top of FIG. 5A). The method of using block floating point numerical formats for training can exhibit the efficiency of fixed point without the problems of integer arithmetic, and can also allow for use of a smaller mantissa, e.g., 4-bit integer (“int4”) while retaining accuracy. Further, by using the block floating point format (e.g., for activation, weights, etc.) and sparsity, the inference of the training models can be accelerated for better performance. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to these numerical formats used to process transformer workloads.

FIG. 6 illustrates a simplified transformer architecture 600. The typical transformer can be described as having an encoder stack configured with a decoder stack, and each such stack can have one or more layers. Within the encoder layers 610, a self-attention layer 612 determines contextual information while encoding input data and feeds the encoded data to a feed-forward neural network 616. The encoder layers 610 process an input sequence from bottom to top, transforming the output into a set of attention vectors K and V. The decoder layers 620 also include a corresponding self-attention layer 622 and feed-forward neural network 626, and can further include an encoder-decoder attention layer 624 uses the attention vectors from the encoder stack that aid the decoder in further contextual processing. The decoder stack outputs a vector of floating points (as discussed for FIG. 5B), which is fed to linear and softmax layers 630 to project the output into a final desired result (e.g., desired word prediction, interpretation, or translation). The linear layer is a fully-connected neural network that projects the decoder output vector into a larger vector (i.e., logits vector) that contains scores associated with all potential results (e.g., all potential words), and the softmax layer turns these scores into probabilities. Based on this probability output, the projected word meaning may be chosen based on the highest probability or by other derived criteria depending on the application.

Transformer model variations include those based on just the decoder stack (e.g., transformer language models such as GPT-2, GPT-3, etc.) and those based on just the encoder stack (e.g., masked language models such as BERT, BERT Large, etc.). Transformers are based on four parameters: sequence length(S) (i.e., number of tokens), number of attention heads (A), number of layers (L), and embedding length (H). Variations of these parameters are used to build practically all transformer-based models today. Embodiments of the present invention can be configured for any similar model types.

A transformer starts as untrained and is pre-trained by exposure to a desired data set for a desired learning application. Transformer-based language models are exposed to large volumes of text (e.g., Wikipedia) to train language processing functions such as predicting the next word in a text sequence, translating the text to another language, etc. This training process involves converting the text (e.g., words or parts of words) into token IDs, evaluating the context of the tokens by a self-attention layer, and predicting the result by a feed forward neural network.

The self-attention process includes (1) determining query (Q), key (K), and value (V) vectors for the embedding of each word in an input sentence, (2) calculating a score for from the dot product of Q and K for each word of the input sentence against a target word, (3) dividing the scores by the square root of the dimension of K, (4) passing the result through a softmax operation to normalize the scores, (5) multiplying each V by the softmax score, and (6) summing up the weighted V vectors to produce the output. An example self-attention process 700 is shown in FIG. 7.

As shown, process 700 shows the evaluation of the sentence “the beetle drove off” at the bottom to determine the meaning of the word “beetle” (e.g., insect or automobile). The first step is to determine the q_beetle, k_beetle, and v_beetlevectors for the embedding vector e_beetle. This is done by multiplying ebeetle by three different pre-trained weight matrices W_q, W_k, and W_v. The second step is to calculate the dot products of qbeetle with the K vector of each word in the sentence (i.e., k_the, k_beetle, k_drove, and k_off), shown by the arrows between q_beetleand each K vector. The third step is to divide the scores by the square root of the dimension d_k, and the fourth step is to normalize the scores using a softmax function, resulting in λ_i. The fifth step is to multiply the V vectors by the softmax score (λ_iv_i) in preparation for the final step of summing up all the weight value vectors, shown by v′ at the top.

Process 700 only shows the self-attention process for the word “beetle”, but the self-attention process can be performed for each word in the sentence in parallel. The same steps apply for word prediction, interpretation, translation, and other inference tasks. Further details of the self-attention process in the BERT Large model are shown in FIGS. 8 and 9.

A simplified block diagram of the BERT Large model (S=384, A=16, L=34, and H=1024) is shown in FIG. 8. This figure illustrates a single layer 800 of a BERT Large transformer, which includes an attention head device 810 configured with three different fully-connected (FC) matrices 821-823. As discussed previously, the attention head 810 receives embedding inputs (384×1024 for BERT Large) and measures the probability distribution to come up with a numerical value based on the context of the surrounding words. This is done by computing different combination of softmax around a particular input value and producing a value matrix output having the attention scores.

Further details of the attention head 810 are provided in FIG. 9. As shown, the attention head 900 computes a score according to an attention head function: Attention (Q, K, V)=softmax (QK^T/√d_k)V. This function takes queries (Q), keys (K) of dimension dk, and values (V) of dimension dk and computes the dot products of the query with all of the keys, divides the result by a scaling factor √d_kand applies a softmax function to obtain the weights (i.e., probability distribution) on the values, as shown previously in FIG. 7.

The function is implemented by several matrix multipliers and function blocks. An input matrix multiplier 910 obtains the Q, K, and V vectors from the embeddings. The transpose function block 920 computes K^T, and a first matrix multiplier 931 computes the scaled dot product QK^T/√d_k. The softmax block 940 performs the softmax function on the output from the first matrix multiplier 931, and a second matrix multiplier 932 computes the dot product of the softmax result and V.

For BERT Large, 16 such independent attention heads run in parallel on 16 AI slices. These independent results are concatenated and projected once again to determine the final values. The multi-head attention approach can be used by transformers for (1) “encoder-decoder attention” layers that allow every position in the decoder to attend over all positions of the input sequence, (2) self-attention layers that allows each position in the encoder to attend to all positions in the previous encoder layer, and (3) self-attention layers that allow each position in the decoder to attend to all positions in the decoder up to and including that position. Of course, there can be variations, modifications, and alternatives in other transformers.

Returning to FIG. 8, the attention score output then goes to a first FC matrix layer 821, which is configured to process the outputs of all of the attention heads. The first FC matrix output goes to a first local response normalization (LRN) block 841 through a short-cut connection 830 that also receives the embedding inputs. The first LRN block output goes to a second FC matrix 822 and a third FC matrix 823 with a Gaussian Error Linear Unit (GELU) activation block 850 configured in between. The third FC matrix output goes to a second LRN block 842 through a second short-cut connection 832, which also receives the output of the first LRN block 841.

Using a transformer like BERT Large, NLP requires very high compute (e.g., five orders of magnitude higher than CV). For example, BERT Large requires 5.6 giga-multiply-accumulate operations per second (“GMACs”) per transformer layer. Thus, the NLP inference challenge is to deliver this performance at the lowest energy consumption.

Although the present invention is discussed in the context of a BERT Large transformer for NLP applications, those of ordinary skill in the art will recognize variations, modifications, and alternatives. The particular embodiments shown can also be adapted to other transformer-based models and other AI/machine learning applications.

Many things impact the performance of such transformer architectures. softmax function tends to be the critical path of the transformer layers (and has been difficult to accelerate in hardware). Requirements for overlapping the compute, SIMD operations and NoC transfers also impacts performance. Further, efficiency of NoC, SIMD, and memory bandwidth utilization is important as well.

Different techniques can be applied in conjunction with the AI accelerator apparatus and chiplet device examples to improve performance, such as quantization, sparsity, knowledge distillation, efficient tokenization, and software optimizations. Supporting variable sequence length (i.e., not requiring padding to the highest sequence lengths) can also reduce memory requirements. Other techniques can include optimizations of how to split self-attention among slices and chips, moving layers and tensors between the slices and chips, and data movement between layers and FC matrices.

According to an example, the present invention provides for an AI accelerator apparatus (such as shown in FIGS. 1A and 1B) coupled to an aggregate of transformer devices (e.g., BERT, BERT Large, GPT-2, GPT-3, or the like). In a specific example, this aggregate of transformer devices can include a plurality of transformers configured in a stack ranging from three to N layers, where N is an integer up to 128.

In an example, each of the transformers is configured within one or more DIMCs such that each of the transformers comprises a plurality of matrix multipliers including QKV matrices configured for an attention layer of a transformer followed by three fully-connected matrices (FC). In this configuration, the DIMC is configured to accelerate the transformer and further comprises a dot product of Q K^Tfollowed by a softmax (Q K^T/square root (d_k))V. In an example, the AI accelerator apparatus also includes a SIMD device (as shown in FIGS. 3A and 3B) configured to accelerate a computing process of the softmax function.

According to an example, the present invention provides for methods of compiling the data representations related to transformer-based models mapping them to an AI accelerator apparatus in a spatial array. These methods can use the previously discussed numerical formats as well as sparsity patterns. Using a compile algorithm, the data can be configured to a dependency graph, which the global CPU can use to map the data to the tiles and slices of the chiplets. Example mapping methods are shown in FIGS. 10-13B.

FIG. 10 is a simplified table representing an example mapping process between a 24-layer transformer and an example eight-chiplet AI accelerator apparatus. As shown, the chiplets are denoted by the row numbers on the left end and the model layers mapped over time are denoted by the table entry numbers. In this case, the 24 layers of the transformer (e.g., BERT Large) are mapped to the chiplets sequentially in a staggered manner (i.e., first layer mapped onto the first chiplet, the second layer mapped onto the second chiplet one cycle after the first, the third layer mapped onto the third chiplet two cycles after the first, etc.) After eight cycles, the mapping process loops back to the first chiplet to start mapping the next eight model layers.

FIG. 11 is a simplified block flow diagram illustrating a mapping process between a transformer and an example AI accelerator apparatus. As shown, a transformer 1101 includes a plurality of transformer layers 1110, each having an attention layer 1102. In this case, there are 16 attention heads 1110 (e.g., BERT Large) computing the attention function as discussed previously. These 16 attention heads are mapped to 16 slices 1130 of an AI accelerator apparatus 1103 (similar to apparatuses 201 and 202) via global CPU 1132 communicating to the slice CPUs 1134.

FIG. 12 is a simplified table representing an example tiling attention process between a transformer and an example AI accelerator apparatus. Table 1200 shows positions of Q, K, and V vectors and the timing of the softmax performed on these vectors. The different instances of the softmax are distinguished by fill pattern (e.g., diagonal line filled blocks representing Q, K, V vectors and diagonal line filled blocks representing Q-K and Softmax-V dot products).

In an example, the embedding E is a [64L, 1024] matrix (L=6 for sentence length of 384), and E_iis a [64, 1024] submatrix of E, which is determined as E_i=E_{(64i-63):(64i), 1:1024}, where i=1 . . . L. Each of the K and Q matrices can be allocated to two slices (e.g., @[SL1:AC3,4]: K_j←E_i×K_{1 . . . 1024, 1 . . . 64}; and @[SL1:AC1,2]: Q_i←e_i×Q_{1 . . . 1024, 1 . . . 64}). An example data flows through IMC and SIMD modules are shown in the simplified tables of FIGS. 13A-13C.

FIG. 13A shows table 1301 representing mapping self-attention to an AI slice according to an example of the present invention. The left side shows the IMC cycles for matrix multiplications performed by IMC modules AC1-AC4, while the right side shows SIMD cycles for element-wise computations performed by SIMD modules SIMD1-SIMD4. In this example, the IMC modules determine the key vectors K1-K6 (a[64×512]; w[512×64]; o[64×64]), and query vectors Q1-Q6 (a[64×512]; w[512×64]; o[64×64]), followed by the transpose QKT1-QKT6 (a[64×64]; w[64×384]; o[64×384]). Then, the SIMD modules compute the softmax Smax1-Smax6 (a[64×384]). Meanwhile, the IMC modules determine the value vectors V1-V6 (a[64×512]; w[512×64]; o[64×64]), followed by the multiplication of the value vectors and the softmax results.

FIG. 13B shows table 1302 representing mapping dense embedding vectors and the second FC matrix to an AI slice (left: IMCs; right: SIMDs) according to an example of the present invention. In this example, the IMCs process the embedding vectors E1-E6 (a[64×512]; w[512×64]; o[64×64]), which corresponds to the path from the attention head 810 to the second FC matrix 822 in FIG. 8. Following the processing of each embedding vector E, the SIMDs process the GELU (a[64×64]), which corresponds to the path through the first LRN block 841 and the GELU block 850 in FIG. 8.

FIG. 13C shows table 1303 representing mapping the third FC matrix to an AI slice (left: IMCs; right: SIMDs) according to an example of the present invention. In this example, the IMCs process the results through the second FC matrix, which corresponds to the path through the third FC matrix 823 and the second LRN block 842 in FIG. 8. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to the mappings shown in FIGS. 10-13C.

According to an example, the advancement of artificial intelligence (AI) and large language models (LLMs) facilitated by high-speed inference engines can be improved or even optimized with the present chiplet-based processor and memory hardware. The use of advanced hardware accelerators configured with the chiplets and high-bandwidth memory, for example, enhances computational efficiency, reducing latency and power consumption while increasing throughput. The advancements enable real-time decision-making and automation across a wide range of industries. This disclosure describes applications of high-speed inference engines in domains such as natural language processing (NLP), reasoning, video and image processing, cybersecurity, manufacturing, drug discovery, and autonomous AI agents. By integrating AI-driven capabilities into these sectors, we can achieve improvements in efficiency, automation, and decision-making. This disclosure provides an example of these engines across multiple applications.

FIG. 14 is a simplified block diagram 1400 illustrating a digital in-memory compute (DIMC) accelerator system configured for a variety of AI applications according to an example of the present invention. The block diagram 1400 includes a DIMC accelerator system 1410 includes a plurality of accelerator apparatuses 1412, each of which includes a plurality of chiplets 1414. These accelerator apparatuses 1412 and chiplets 1414 can be configured similarly to the examples discussed previously. As shown, the DIMC accelerator system 1410 can be used for a variety of applications, including, but not limited to, Natural Language Processing (NLP) 1420, autonomous reasoning/decision-making 1430, video/image processing 1440, cybersecurity/fraud detection 1450, manufacturing/industrial processes 1460, agentic AI 1470, and smart cities/Internet of Things (IoT) 1480.

In a specific example, the present accelerator apparatuses can use the DIMC architecture and high memory bandwidth to significantly speed up the processing of target computational workloads of a particular application, such as those mentioned previously. The DIMC accelerator system can perform precise and efficient computations of data in a block floating point (BFP) format and can also switch to a lower precision floating point (FP) during runtime. By dynamically switching between precision levels based on real-time analysis of the target workload, the DIMC system can optimize computational efficiency while maintaining the necessary level of accuracy for each step of the workload computation. And with a high memory bandwidth, the DIMC architecture enables a high throughput of workload computations. Specific applications are discussed in further detail below.

As discussed previously, the present techniques can be configured for NLP, such as shown in FIG. 7. In an example, NLP applications configured by high-speed inference engines are provided in communication, information retrieval, content generation, and other applications. AI-powered chatbots and virtual assistants provide human-like interactions for customer service and enterprise automation. Real-time translation and transcription services bridge language barriers, while AI-driven search engines deliver contextualized responses to complex queries. In journalism and legal analysis, for example, the present AI assists in summarizing large volumes of text and identifying key insights. In a preferred example, chiplet-based AI accelerators are configured with NLP workloads by distributing tasks across multiple processing units, improving efficiency in model execution.

In an example, the present techniques are configured for reasoning and decision-making. That is, high-speed inference engines enhance AI-driven decision-making, enabling autonomous reasoning in applications such as financial analysis, legal research, scientific hypothesis generation, and others. Financial markets benefit from AI-based trading systems capable of making real-time risk assessments, while the legal industry leverages AI for contract analysis and regulatory compliance. Scientific research is further accelerated by AI's ability to analyze datasets and simulate experiments, driving innovation in fields like genomics and materials science. The integration of the present chiplet-based architectures within AI hardware allows for distributed processing, increasing computational efficiency and supporting more various models.

In an example, the present techniques are configured for video and image processing, such as for computer vision (CV) applications. In an example, AI-powered video and image processing applications include surveillance, healthcare, and entertainment. In an example, the present high-speed inference engines facilitate real-time video analytics (e.g., facial), enabling security systems to detect anomalies and track individuals. In healthcare, AI-driven medical imaging provides diagnostic accuracy in radiology and pathology. Additionally, generative AI is content creation by automating video editing and generating realistic visual media for entertainment and advertising. The incorporation of high-bandwidth memory in AI accelerators according to the present technique provides data access, reducing bottlenecks in processing large-scale image, video datasets, and others.

In an example, the present techniques are configured for cybersecurity and fraud detection. In an example, cybersecurity using chiplet-based AI-powered inference engines contribute to real-time threat detection and risk mitigation. In example, the present AI systems analyze network traffic patterns to identify cyber threats, while fraud detection models in banking and e-commerce prevent financial crimes. Automated penetration testing further enhances security by identifying system vulnerabilities before exploitation occurs. In an example, the utilization of AI hardware with chiplets enables rapid (e.g., nearly real-time) anomaly detection and real-time response, ensuring cybersecurity defenses.

In an example, the present techniques are configured for manufacturing and advanced Industry 4.0 (a.k.a. the Fourth Industrial Revolution or 4IR). Industry 4.0 can be defined as the integration of intelligent digital technologies into manufacturing and industrial processes and includes a set of technologies that include industrial IoT (Internet of Things) networks, AI (Artificial Intelligence), Big Data (e.g., large datacenters), robotics, automation, among other technologies based upon compute. In an example, AI-driven manufacturing is provided with industrial processes through predictive maintenance, robotic automation, and supply chain optimization. In an example, the present chiplet-based high-speed inference engines power intelligent robotics for assembly and quality control, and reducing or event minimizing human intervention. In an example, predictive maintenance systems analyze sensor data to prevent equipment failures, reducing downtime and operational costs. AI-powered logistics and inventory management enhance supply chain efficiency by dynamically adjusting production schedules. The present chiplet-based accelerators in industrial AI systems provide execution of complex manufacturing workflows by efficiently managing parallel computations in an example.

In an example, the present techniques are configured for drug discovery and healthcare. In an example, the present techniques can be applied to AI-driven drug discovery, where high-speed chiplet-based inference engines facilitate molecular simulations and compound screening. In an example, AI algorithms predict drug efficacy and improve and/or optimize clinical trial processes, accelerating the development of new treatments and discoveries. In an example, personalized medicine, enabled by AI's analysis of patient genomics, allows for tailored therapies with improved outcomes. In healthcare, for example, AI-powered diagnostic tools assist medical professionals by analyzing patient data and identifying potential health risks. In an example, the present AI accelerators provide for bioinformatics processing, enabling faster analysis of large-scale genomic datasets.

In an example, the present techniques are configured for agentic AI applications. In an example, the present agentic AI systems provide for AI-driven automation, characterized by autonomous decision-making and task execution with no or minimal human oversight. The intelligent agents autonomously conduct research, manage business operations, and optimize supply chains, etc. In an example, AI-powered legal agents provide in contract negotiation and compliance monitoring, while autonomous AI-driven customer service agents handle inquiries and support requests without human intervention. In an example, an emergence of AI-powered CEOs and business decision-makers are enabled with agentic AI in strategic planning, enterprise management, and other applications. In an example, the present hardware accelerators including chiplets and high-bandwidth memory further enhance these applications by enabling real-time learning and adaptation.

In an example, the present techniques are configured for smart cities and IoT. In an example, smart cities are configured using the present AI-powered inference engines to optimize urban infrastructure, traffic management, energy distribution, among other applications. In an example, AI-driven traffic control systems reduce congestion by analyzing real-time data, while smart grid solutions enhance energy efficiency through predictive analytics. In an example, environmental monitoring applications utilize AI to track pollution levels and model climate change impacts, enabling data-driven policy decisions. In an example, the present AI hardware accelerators provide for low-latency processing, making real-time optimizations feasible for large-scale IoT deployments.

FIG. 15 is a simplified block diagram illustrating a digital in-memory compute (DIMC) accelerator system according to an example of the present invention. As shown, the system 1500 includes a host computing device 1510 with a host runtime 1512 that operates at least a compiler stack 1520, a workload preprocessor 1530, and an execute stack 1540. The host device 1510 is configured to manage and coordinate the plurality of accelerator apparatuses 1512 to perform computational workloads for target applications, such as those discussed previously. Embodiments of this configurable system 1500 allow for the selection of computing throughput, latency, energy consumption, and functional accuracy.

The compiler stack 1520 includes at least a handles layer 1522 and an instruction set architecture (ISA) graph layer 1524. The host runtime 1512 can use the handles layer 1522 to determine references to resources for program, workload, or model; and the host runtime 1512 can use the ISA graph layer 1524 to translate the program, workload, or model into an ISA graph. For example, the ISA graph layer 1524 can translate a computation graph representing a target neural network model workload in machine code.

The workload preprocessor 1530 can be configured to determine a plurality of workload parameters using the translated computation graph from the ISA graph layer 1524. Afterwards, the host runtime 1512 can use the compiler stack 1520 to issue commands for the workload parameters and instructions to the execute stack 1540, which sends these commands to a target hardware. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to configuration of the host computing device 1510 and the associated software system.

The host computing device 1510 is also coupled to one or more data gathering devices 1580. Each such data gathering device 1580 can be configured to obtain data for one or more target applications (e.g., for a program, workload, or model) and to send the data to the host computing device 1510 to be translated as an ISA graph to be processed via the plurality of accelerator apparatuses 1512. Depending on the application, these data gathering devices 1520 can take many forms and implement different methods. To gather text data for NLP or image/video data for CV, the data gathering device 1520 can include a web-scraping device, a dataset reader device, a crowdsourcing device, and the like. Similar methods can be used to gather data for financial analysis, legal research, scientific research, and other fields as well.

For gathering real-time data, the data gathering device 1520 can include a variety of sensor devices, such as medical imaging devices used to capture patient data, equipment monitoring devices used to determine manufacturing data, network scanning devices used to detect abnormal transactions, and others. The data gathering device 1520 can also accumulate data from a network of devices, vehicles, appliances or other physical objects embedded with sensors, software and network connectivity (i.e., IoT). Further, the scale of such a network of devices can be expanded to smart city networks that manage traffic flow, energy systems, waste collection, and the like.

In addition, the data gathering device 1520 can be configured to analyze collected data to determine fraudulent activity or run simulations using collected data to drive autonomous reasoning and decision-making. For example, traffic data can be analyzed and simulated to assist in autonomous vehicle pathing, supply chain data can be analyzed and simulated to assist in delivery routes and manage inventory, medical data can be analyzed and simulated to assist in personalized treatment schedules or even assist in new discoveries, etc. Of course, there can be other variations, modifications, and alternatives the devices and methods used to gather data for the present DIMC accelerator system.

In an example, the target hardware includes a plurality of accelerator apparatus 1550 with a plurality of chiplet devices 1560 coupled to a CPU 1562, which can include a global CPU and a plurality of local CPUs. The chiplet CPU 1562 is coupled to a plurality of matrix compute apparatuses 1570 via their crossbar devices 1572, each of which is coupled to at least a compute device 1574 (e.g., DIMC device) and a Single Input, Multiple Data (SIMD) device 1576. In an example, the compiler commands are sent to accelerator apparatuses 1550, which can be used to program the CPU 1532 (or CPUs) and connected elements of matrix compute apparatus 1570 via the crossbar device 1572. The AI accelerator apparatus 1550, the chiplet devices 1560, and the matrix compute apparatus 1570 can be configured similarly to any of the previously discussed examples.

In a specific example, the host device 1510 and the plurality of accelerator apparatuses 1512 can be configured within a server device or within multiple connected server devices of a server system. In such cases, the host device 1510 can be configured to coordinate the operations of the accelerator apparatuses 1512 in the server device or a host server can be configured to coordinate the operations of the server devices in the server system.

Although the matrix compute apparatus 1570 is configured within a chiplet device 1560 in an AI accelerator apparatus 1550 in this example, the host computing device 1510 can also be configured send the compiler commands to an independent chiplet device with matrix compute apparatuses or a server system having a plurality of AI accelerator apparatuses. For example, the server system can include a plurality of AI accelerator PCIe card devices coupled to a plurality of switches, each of with is coupled to one or more server CPUs. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to this workload transfer configuration.

The integration of high-speed inference engines into AI applications is driving advancements across various industries. The adoption of advanced hardware accelerators including the present chiplets and, e.g., high-bandwidth memory, enhances computational efficiency, supporting larger models and reducing inference latency. In an example, by enhancing decision-making, automation, and predictive analytics, these systems enable organizations to operate more efficiently and intelligently. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to these applications.

While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. As an example, the AI accelerator apparatus and chiplet devices can include any combination of elements described above, as well as outside of the present specification. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.

Claims

1. A digital in-memory compute (DIMC) accelerator system, the system comprising: one or more data gathering devices configured to obtain computational workload data for a target application;a host computing device coupled to the one or more data gathering devices and a plurality of accelerator apparatuses, wherein the host computing device is configured to compile the computational workload data in an instruction set architecture (ISA) graph and to execute the ISA graph using the plurality of accelerator apparatuses;wherein each accelerator apparatus includes a global CPU coupled to one or more chiplets and configured to receive a plurality of matrix inputs; wherein each of the chiplets comprises a plurality of tiles; wherein each of the tiles comprises a plurality of slices, a CPU coupled to the plurality of slices; and wherein each of the plurality of slices includes a DIMC device coupled to a clock;wherein each of the CPUs of the plurality of tiles is configured to receive a portion of the plurality of matrix inputs from the global CPU; andwherein each of the DIMC devices is configured to perform a throughput of one or more matrix computations according to the ISA graph using one or more of the plurality of matrix inputs such that the throughput is characterized by a plurality of multiply accumulates per a clock cycle.
2. The system of claim 1 wherein the one or more chiplets of each accelerator apparatus are coupled to one or more double data rate (DDR) dynamic random access memory (DRAM) devices using a DRAM interface; and wherein the DDR DRAM devices are configured to store the plurality of matrix inputs using the DRAM interface.
3. The system of claim 1 wherein each of the CPUs in each of the tiles is coupled to a peripheral component interconnect express (PCIe) bus; wherein a main bus device is coupled to each PCIe bus in each chiplet using a master chiplet device; wherein the master chiplet device is coupled to each of the other chiplet devices using at least a plurality of die-to-die (D2D) interconnects coupled to each of the CPUs in each of the tiles.
4. The system of claim 1 wherein each of the plurality of slices is coupled to a network on chip (NoC) device configured to perform a multicast process.
5. The system of claim 1 wherein each of the DIMC devices is configured to support one or more block floating point data types using a shared exponent; and wherein each the DIMC devices is configured to support a block structured sparsity.
6. The system of claim 1 wherein the one or more data gathering devices includes a web-scraping device, a dataset reader device, a crowdsourcing device, a sensor device, a simulation device, or an Internet of Things (IoT) network.
7. The system of claim 1 wherein the host computing device includes a compiler stack configured to determine the ISA graph using the computational workload data.
8. The system of claim 1 wherein the host computing device includes a workload preprocessor configured to determine a plurality of workload parameters using the ISA graph.
9. The system of claim 1 wherein the host computing device includes an execution stack configured to transfer the ISA graph to the plurality of accelerator apparatuses.
10. The system of claim 1 wherein the target application includes natural language processing (NLP), autonomous reasoning/decision-making, video/image processing, cybersecurity/fraud detection, manufacturing/industrial processes, agentic artificial intelligence (AI), or smart cities/Internet of Things (IoT).
11. A digital in-memory compute (DIMC) accelerator system, the system comprising: one or more data gathering devices configured to obtain computational data for a program, workload, or model of a target application;a host computing device coupled to the one or more data gathering devices and a plurality of accelerator apparatuses;wherein the host computing device includes a compiler stack having an instruction set architecture (ISA) graph layer configured to determine an ISA graph using the computational data, wherein the compiler stack includes a handles layer configured to determine references to resources for the program, workload, or model of the target application; andwherein the host computing device includes an execution stack configured to transfer the ISA graph to the plurality of accelerator apparatuses;wherein each accelerator apparatus includes a global CPU coupled to one or more chiplets and configured to receive a plurality of matrix inputs; wherein each of the chiplets comprises a plurality of tiles; wherein each of the tiles comprises a plurality of slices, a CPU coupled to the plurality of slices, and a hardware dispatch device coupled to the CPU; andwherein each of the plurality of slices includes a DIMC device coupled to a clock;wherein each of the CPUs of the plurality of tiles is configured to receive a portion of the plurality of matrix inputs from the global CPU; andwherein each of the DIMC devices is configured to perform a throughput of one or more matrix computations according to the ISA graph using one or more of the plurality of matrix inputs such that the throughput is characterized by a plurality of multiply accumulates per a clock cycle.
12. The system of claim 11 wherein the one or more chiplets of each accelerator apparatus are coupled to one or more double data rate (DDR) dynamic random access memory (DRAM) devices using a DRAM interface; and wherein the DDR DRAM devices are configured to store the plurality of matrix inputs using the DRAM interface.
3. The system of claim 11 wherein each of the CPUs in each of the tiles is coupled to a peripheral component interconnect express (PCIe) bus; wherein a main bus device is coupled to each PCIe bus in each chiplet using a master chiplet device; wherein the master chiplet device is coupled to each of the other chiplet devices using at least a plurality of die-to-die (D2D) interconnects coupled to each of the CPUs in each of the tiles; and wherein each of the plurality of slices is coupled to a network on chip (NoC) device configured to perform a multicast process.
14. The system of claim 11 wherein each of the DIMC devices is configured to support one or more block floating point data types using a shared exponent; and wherein each the DIMC devices is configured to support a block structured sparsity.
15. The system of claim 11 wherein the one or more data gathering devices includes a web-scraping device, a dataset reader device, a crowdsourcing device, a sensor device, a simulation device, or an Internet of Things (IoT) network.
16. The system of claim 11 wherein the host computing device includes a workload preprocessor configured to determine a plurality of workload parameters using the ISA graph.
17. The system of claim 11 wherein the target application includes natural language processing (NLP), autonomous reasoning/decision-making, video/image processing, cybersecurity/fraud detection, manufacturing/industrial processes, agentic artificial intelligence (AI), or smart cities/Internet of Things (IoT).
18. A digital in-memory compute (DIMC) accelerator system, the system comprising: one or more data gathering devices configured to obtain computational workload data for a target application;a host computing device coupled to the one or more data gathering devices and a plurality of accelerator apparatuses;wherein the host computing device includes a compiler stack configured to determine an instruction set architecture (ISA) graph using the computational workload data, and wherein the host computing device includes an execution stack configured to transfer the ISA graph to the plurality of accelerator apparatuses;wherein each of the accelerator apparatuses includes a plurality of tiles configured to receive a plurality of matrix inputs using a global reduced instruction set computer (RISC) interface;wherein each of the tiles comprises a plurality of slices, a RISC CPU coupled to the plurality of slices and the global RISC interface, and a hardware dispatch device coupled to the RISC CPU; and wherein each of the plurality of slices includes a digital in memory compute (DIMC) device coupled to a clock;wherein each of the RISC CPUs of the plurality of tiles is configured to receive a portion of the plurality of matrix inputs; andwherein the global RISC interface is configured to map each attention layer on to one of the plurality of slices to communicate with the RISC CPU associated with the tile of the slice to process the portion of the workload associated with the attention layer; andwherein each of the digital in memory compute (DIMC) devices is configured to perform a throughput of one or more matrix computations according to the ISA graph using one or more of the plurality of matrix inputs such that the throughput is characterized by a plurality of multiply accumulates per a clock cycle.
19. The system of claim 18 wherein the one or more data gathering devices includes a web-scraping device, a dataset reader device, a crowdsourcing device, a sensor device, a simulation device, or an Internet of Things (IoT) network.
20. The system of claim 18 wherein the target application includes natural language processing (NLP), autonomous reasoning/decision-making, video/image processing, cybersecurity/fraud detection, manufacturing/industrial processes, agentic artificial intelligence (AI), or smart cities/Internet of Things (IoT).

CROSS-REFERENCES TO RELATED APPLICATIONS

The present application is a continuation-in-part of U.S. patent application Ser. No. 18/493,616, filed Oct. 24, 2023; which is a continuation of U.S. patent application Ser. No. 17/538,923, filed Nov. 30, 2021 (now U.S. Pat. 11,847,072).

Continuations (1)

	Number	Date	Country
Parent	17538923	Nov 2021	US
Child	18493616		US

Continuation in Parts (1)

	Number	Date	Country
Parent	18493616	Oct 2023	US
Child	19076153		US

ACCELERATOR SYSTEM USING DIGITAL IN-MEMORY COMPUTE CHIPLET DEVICES FOR COMPUTATIONAL WORKLOADS

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

CROSS-REFERENCES TO RELATED APPLICATIONS

Continuations (1)

Continuation in Parts (1)