High bandwidth non-volatile memory for AI inference system

Information

  • Patent Grant
  • Patent Number
    12,321,603
  • Date Filed
    Wednesday, February 22, 2023
  • Date Issued
    Tuesday, June 3, 2025
Abstract
A high bandwidth non-volatile memory (NVM) suitable for providing neural network weight data to one or more AI accelerator processing cores is described. An artificial intelligence (AI) inference memory device employing the high bandwidth NVM technology described herein can comprise a logic layer die including channel logic implementing connections between a plurality of channels for conducting data to and from an accelerator core via a bus, and a plurality of non-volatile memory (NVM) dies stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel in the logic layer.
Description
RELATED APPLICATION

This application is being filed concurrently with commonly owned, commonly invented pending U.S. patent application Ser. No. 18/112,827 (publication U.S. 2024/0281636), titled “DATA OPTIMIZATION FOR HIGH BANDWIDTH (HBW) NVM AI INFERENCE SYSTEM”, which is incorporated in its entirety herein for all purposes.


BACKGROUND
Field

The present invention relates to integrated circuit memory devices, such as non-volatile memory devices, that support storage and retrieval of information used in Artificial Intelligence (AI) inferencing applications, and particularly relates to a layered die memory architecture and command decoder to provide high bandwidth storage and retrieval of weights and other information supporting AI inferencing applications.


Description of Related Art

Modern information technology applications, such as artificial intelligence (AI) inferencing, can consume copious amounts of data, such as weighting information, in the conduct of inferencing operations.


Many different types of memory architectures have been created, each providing storage of digital data and addressing different needs and requirements of a variety of applications. However, conventional approaches to memory devices often suffer from insufficient bandwidth, leading to poor performance, or require refresh logic and are therefore higher in cost.


It is desirable to provide mechanisms for storing AI inferencing data that are capable of greater bandwidths.


SUMMARY

A high bandwidth non-volatile memory (NVM) suitable for providing neural network weight data to one or more AI accelerator processing cores (e.g., an accelerator core) is described. An artificial intelligence (AI) inference memory device employing the high bandwidth NVM technology as described herein can comprise a logic layer die including channel logic implementing connections between a plurality of channels for conducting data to and from an accelerator core via a bus, and a plurality of non-volatile memory (NVM) dies stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel in the logic layer.


In a representative implementation, an artificial intelligence (AI) inference memory device comprises a logic layer die, mounted to a substrate, that can include channel logic implementing connections between a plurality of channels for conducting data to and from an accelerator core via at least one bus. A plurality of non-volatile memory (NVM) dies can be stacked vertically one above another, forming a layered vertical stack of NVM dies. Each of the NVM dies can include at least one memory chip and a plurality of direct vertical connections to a corresponding channel logic in the logic layer. The direct vertical connections can be via-to-via connections of a through silicon via (TSV) integrated circuit. AI inference devices can implement NVMs using any suitable NVM technology, such as, for example and without limitation, a phase change memory (PCM), a three-dimensional cross point memory (3D Xpoint), a NOR flash memory, a resistive random-access memory (RRAM), a magneto-resistive random-access memory (MRAM), a ferroelectric random-access memory (FeRAM), a conductive bridge random-access memory (CBRAM), and a NAND flash memory.


In some implementations, the stacked NVM dies can be organized into one or more banks. A controller in the logic layer can execute interface commands with one or more NVM dies of corresponding banks. Thus, interface commands can be executed by two or more channels in parallel. The interface commands include a read neural network weights data command, and a write neural network weights data command.
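By way of a non-limiting illustration, the following sketch models the bank/controller organization and the read/write weight interface commands in Python; the class names, command names, and page-oriented addressing are assumptions made for illustration and are not recited in this disclosure.

```python
# Minimal sketch, assuming a software model of the bank/controller organization
# described above. Names and the page-addressed layout are illustrative only.
from dataclasses import dataclass, field
from enum import Enum, auto


class InterfaceCommand(Enum):
    READ_NN_WEIGHTS = auto()   # read neural network weights data command
    WRITE_NN_WEIGHTS = auto()  # write neural network weights data command


@dataclass
class NVMDie:
    """One NVM die in the vertical stack, addressed through one channel."""
    die_id: int
    pages: dict = field(default_factory=dict)  # page number -> list of weights


@dataclass
class BankController:
    """Controller in the logic layer serving the dies of one bank."""
    bank_id: int
    dies: list

    def execute(self, command: InterfaceCommand, die_id: int, page: int, data=None):
        die = self.dies[die_id]
        if command is InterfaceCommand.WRITE_NN_WEIGHTS:
            die.pages[page] = list(data)
            return None
        if command is InterfaceCommand.READ_NN_WEIGHTS:
            return die.pages.get(page, [])
        raise ValueError(f"unsupported command: {command}")


# Two banks, each with its own controller, can execute commands in parallel
# (modeled here sequentially for simplicity).
banks = [BankController(b, [NVMDie(d) for d in range(4)]) for b in range(2)]
banks[0].execute(InterfaceCommand.WRITE_NN_WEIGHTS, die_id=0, page=0, data=[0.1, 0.2])
print(banks[0].execute(InterfaceCommand.READ_NN_WEIGHTS, die_id=0, page=0))
```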


In one example implementation, NVM dies are arranged in two stacks of four NVM dies on the logic layer die and provide data to eight parallel channels. Interface commands can be executed by memory corresponding to a plurality of channels in parallel; thereby achieving 50 GB/second throughput to the accelerator core.
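By way of a simple check on the example above (the even per-channel split is an inference from the stated aggregate figure, not a number recited in the text), the per-channel share works out as follows:

```python
# Back-of-the-envelope check of the example above: eight parallel channels
# delivering an aggregate 50 GB/s imply roughly 6.25 GB/s per channel if the
# load is spread evenly (an assumption; only the aggregate figure is stated).
num_channels = 8
aggregate_gb_per_s = 50
per_channel_gb_per_s = aggregate_gb_per_s / num_channels
print(f"{per_channel_gb_per_s:.2f} GB/s per channel")  # 6.25 GB/s per channel
```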


In another aspect, the channel logic further includes an interface to a substrate mounting a processor or field programmable gate array (FPGA) that performs neural network computations and a static random-access memory (SRAM) storing activation data for use in neural network computations. Activation data can include, for example, stored non-linear function(s) relating the input and output of a neural network, such as, for example, a sigmoid function, a hyperbolic tangent (tanh) function, a rectified linear unit (ReLU) function, a leaky rectified linear unit (LReLU) function, and a maxout function. Alternatively, the substrate can mount a processor or a field programmable gate array (FPGA) that performs neural network computations and a dynamic random-access memory (DRAM) storing activation data for use in neural network computations.


In a yet further aspect, the logic layer and the vertical stack comprising the plurality of NVM dies can be affixed to an interposer layer. The interposer layer provides connection between the plurality of direct vertical connections of the vertical stack and corresponding channel logic in the logic layer. The AI inference memory device can be packaged as a 2.5D through silicon via (TSV) integrated circuit.


In a still yet further aspect, the logic layer can be affixed to an interposer layer. The plurality of NVM dies are stacked vertically above the logic layer and the interposer layer in a vertical stack, thereby establishing connections between the plurality of direct vertical connections of the vertical stack and corresponding channel logic in the logic layer. The AI inference memory device can be packaged as a three-dimensional (3D) through silicon via (TSV) integrated circuit. In some implementations, a plurality of solder bumps that have been deposited onto chip pads of the logic layer and the plurality of NVM dies provide connection with a device immediately below. In some implementations, a plurality of backside interconnects that have been deposited onto chip pads of the logic layer and the plurality of NVM dies provide wafer-to-wafer connection with a device above. The AI inference memory device can be packaged as a three-dimensional (3D) system-on-chip (3D SOC) integrated circuit.


In another representative implementation, an artificial intelligence (AI) inference method is provided. The method can be conducted by a processor coupled with a logic layer die including channel logic implementing connections between a plurality of channels conducting data to and from an accelerator core via at least one bus, and a plurality of non-volatile memory (NVM) dies stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel logic in the logic layer. The method can comprise retrieving, from the plurality of NVM dies that are stacked vertically above the logic layer die, a plurality of neural network weights stored therein. Using the accelerator core, the plurality of neural network weights can be applied to input data for each one of a plurality of nodes of a neural network to obtain an intermediate output. Activation data defining an activation function or functions can be applied to the intermediate output to obtain a result for a neural network level. The result can be stored in the plurality of NVM dies, a field programmable gate array (FPGA) buffer, activation memory, or any storage device facilitating computations at additional neural network levels until a final result is reached. In embodiments, suitable activation data can include stored data implementing an activation function including one or more of a sigmoid function, a hyperbolic tangent (tanh) function, a rectified linear unit (ReLU) function, a leaky rectified linear unit (LReLU) function, and a maxout function. In embodiments, the plurality of NVM dies are arranged in two stacks of four NVM dies on the logic layer die, and the method further comprises providing data to eight parallel channels.


In a further representative implementation, an AI inference system can comprise a processor chip, a first memory chip suitable for storing activation functions, and a second (high bandwidth) memory chip suitable for storing arrays of weights, coupled together. The processor can apply weights retrieved from the NVM dies to specific inputs and can apply activation functions retrieved from the first memory chip to provide inferencing output. The inference system can be implemented as a multichip module in a single package. The package can be mounted on a circuit board or other type of substrate and connected to sensors and other components that can generate data consumed by the execution of inference processing using the weight data stored in the high bandwidth NVM, and consume data generated by execution of the inference processing.


Other aspects and advantages of the present invention can be seen on review of the drawings, the detailed description, and the claims, which follow.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is an illustration of a multichip module including a high bandwidth memory for AI inferencing as described herein.



FIG. 2 is an illustration of another embodiment of a multichip module including a high bandwidth memory for artificial intelligence (AI) inferencing as described herein.



FIG. 3 is an illustration of yet another embodiment of a multichip module including a high bandwidth memory for AI inferencing as described herein.



FIG. 4 is a simplified functional block diagram of an inference system as described herein.



FIG. 5 is a flowchart of a runtime procedure which can be executed by an inference system as described herein.



FIG. 6 is a simplified functional block diagram of a representative neural network node (neuron) processed by the AI inference system as described herein.



FIG. 7 illustrates representative activation functions suitable for implementing artificial intelligence inferencing as described herein.



FIG. 8 is a simplified diagram illustrating storage and retrieval of weight data in a memory system including a flash memory device implemented on an integrated circuit as described herein.



FIG. 9 illustrates weight data used by a neural network to process images being stored according to various storage paradigms by a high bandwidth non-volatile memory (NVM) in accordance with the technologies as described herein.



FIG. 10 is a schematic illustration of a technique of allocating weight data among multiple NVM channels by a high bandwidth NVM in accordance with the technologies as described herein.



FIG. 11 is a schematic diagram illustrating multiple scenarios for allocating weight data among multiple NVM channels by a high bandwidth NVM in accordance with the technologies as described herein.





DETAILED DESCRIPTION

A detailed description of embodiments of the present invention is provided with reference to FIGS. 1-11.



FIG. 1 illustrates a multichip module (MCM) 120 that includes a processor chip 101, a first memory chip 103 (e.g., an activation memory storing activation data), and a second memory chip 102 (e.g., a weight memory storing weight data for a neural network). The weight memory can be considered high bandwidth (HB) memory. In this example, mounted on an interposer 110 are the processor chip 101, the second memory chip 102, which can be a HB non-volatile memory (NVM) for storing weights (e.g., weight data) used in neural network computations, and the first memory chip 103, which can be a dynamic random access memory (DRAM), static random access memory (SRAM), or NAND memory (DRAM and SRAM being presently preferred options) for storing activation data for activation functions used in neural network computations. The assembly is configured as a multichip module 120 in a single package.


The processor chip 101 can include a runtime processor core (e.g. CPU) and an accelerator core, such as an artificial intelligence accelerator (e.g. AIAcc) or a neuron processing unit.


In this example, processor chip 101 includes an input/output interface 113 disposed on the surface of the chip 101. The input/output interface 113 is connected to interconnection wiring 111 on the interposer 110.


The first memory chip 103 includes an interface 112 for connection to the interconnection wiring 111 on the interposer 110.


The second memory chip 102 includes an interface 114 for connection to the interconnection wiring 111 on the interposer 110. While depicted as a single entity in FIG. 1 for clarity, the second memory chip 102 can comprise a multi-layered stack, in which memory dies are disposed on different layers and can be connected by vertical connectors, such as through silicon via (TSV) connections, to the interconnection wiring 111 on the interposer 110.


Thus, interconnection wiring 111 provides part of the data path between the first memory chip 103, the second memory chip 102, and the processor chip 101.


In the example illustrated in FIG. 1, the processor chip 101 includes another input/output interface 122 for connection to external contact structures 121 of the multichip module 120.



FIG. 2 illustrates another configuration of an inference system as described herein. This configuration includes a processor chip 201, a first memory chip 203 (e.g., an activation memory storing activation data), and a second memory chip 202 (e.g., a weight memory storing weight data for a neural network), which can be considered HB memory. In this example, the second memory chip 202 comprises individual layered non-volatile memory (NVM) dies 202a-202N mounted to an I/O die 204 implementing connection logic in a logic layer. The processor chip 201, the I/O die 204 of the second memory chip 202, and the first memory chip 203 are mounted on an interposer 210. The assembly is configured as a multichip module (MCM) 220 in a single package.


The processor chip 201 can include a runtime processor core (e.g. CPU) and an accelerator core, such as an artificial intelligence accelerator (e.g. AIAcc) or a neuron processing unit.


The second memory chip 202 includes the plurality of non-volatile memory dies 202a-202N stacked one on top of another and stacked above the I/O die 204. One such NVM die 202a illustrated includes a chip-to-chip bonding surface on which an interface 231 is exposed for connection to the I/O die 204. The I/O die 204 includes an interface 232 exposed on a surface of the I/O die 204, and complementary to the interface 231 on the NVM die 202a. In this example, direct vertical connections at the surfaces are provided between the memory interface 232 and the interface 231. The direct vertical connections can comprise very short length copper via-to-via conductors or other chip-to-chip contact technologies suitable for high speed, low latency, and low power communication between the chips.


With continuing reference to the second memory chip 202, another NVM die 202b illustrated includes a chip-to-chip bonding surface on which an interface 241 (e.g., an NVM-NVM interface) is exposed for connection to NVM die 202a of the second memory chip 202. The NVM die 202a includes an interface 242 (e.g., an NVM-NVM interface) exposed on a surface of the NVM die 202a, and complementary to the interface 241 on the NVM die 202b. In this example, direct vertical connections at the surfaces are provided between the NVM-NVM interface 242 and the NVM-NVM interface 241.


In some implementations, an interposer layer provides connection between the plurality of direct vertical connections of the vertical stack (e.g., NVM dies 202a-202N) and corresponding channel logic in the logic layer (e.g., I/O die 204). In some implementations, a plurality of solder bumps that have been deposited onto chip pads of the logic layer (e.g., I/O die 204) and the plurality of NVM dies (e.g., 202a-202N) provide connection with a device immediately below. In some implementations, a plurality of backside interconnects that have been deposited onto chip pads of the logic layer and the plurality of NVM dies provide wafer-to-wafer connection with a device above. The AI inference memory device can be packaged as any of a 2.5D through silicon via (TSV) integrated circuit, a three-dimensional (3D) through silicon via (TSV) integrated circuit, and a three-dimensional (3D) system-on-chip (3D SOC) integrated circuit.


In this example, the processor chip 201 includes an input/output interface 213 disposed on the surface of the chip 201. The input/output interface 213 is connected to interconnection wiring 211 on the interposer 210.


The first memory chip 203 includes an interface 212 for connection to the interconnection wiring 211 on the interposer 210.


Also, the I/O chip 204 includes an interface 214 for connection to the interconnection wiring 211 on the interposer 210.


Thus, interconnection wiring 211 provides part of the data path between the first memory chip 203 and the second memory chip 202, and the processor chip 201.


In the example illustrated in FIG. 2, the processor chip 201 includes another input/output interface 222 for connection to external contact structures 221 of the multichip module 220.



FIG. 3 illustrates another configuration of an inference system as described herein. This configuration includes a processor chip 301, a first memory chip 303 (e.g., an activation memory storing activation data), and a second memory chip 302 (e.g., weight memory storing weight data for a neural network), which can be considered HB memory. In this example, the second memory chip 302 comprises individual layered non-volatile memory (NVM) dies 302a-302N mounted to an I/O die 304. The processor chip 301, the I/O die 304 of the second memory chip 302, and the first memory chip 303 are mounted on an interposer 310. The assembly is configured as a multichip module (MCM) 320 in a single package.


The processor chip 301 can include a runtime processor core (e.g. CPU) and an accelerator core, such as an artificial intelligence accelerator (e.g. AIAcc) or a neuron processing unit.


The second memory chip 302 includes the plurality of non-volatile memory dies 302a-302N stacked one on top of another and stacked above an I/O die 304. One such NVM die 302a illustrated includes a chip-to-chip bonding surface on which an interface 331 is exposed for connection to the I/O die 304. The I/O die 304 includes an interface 332 exposed on a surface of the I/O die 304, and complementary to the interface 331 on the NVM die 302a. In this example, direct vertical connections at the surfaces are provided between the interface 332 and the interface 331. The direct vertical connections can comprise very short length copper via-to-via conductors or other chip-to-chip contact technologies suitable for high speed, low latency, and low power communication between the chips.


With continuing reference to the second memory chip 302, another NVM die 302b illustrated includes a chip-to-chip bonding surface on which an interface 341 (e.g., an NVM-NVM interface) is exposed for connection to NVM die 302a of the second memory chip 302. The NVM die 302a includes an interface 342 (e.g., an NVM-NVM interface) exposed on a surface of the NVM die 302a, and complementary to the interface 341 on the NVM die 302b. In this example, direct vertical connections at the surfaces are provided between the NVM-NVM interface 342 and the NVM-NVM interface 341.


In some implementations, interposer layer 310 provides connection between the plurality of direct vertical connections of the vertical stack (e.g., NVM chips 302a-302N) and corresponding channel logic in the logic layer (e.g., I/O die 304). In some implementations, a plurality of solder bumps that have been deposited onto chip pads of the logic layer (e.g., I/O die 304) and the plurality of NVM dies (e.g., 302a-302N) provide connection with a device immediately below. In some implementations, a plurality of backside interconnects that have been deposited onto chip pads of the logic layer and the plurality of NVM dies provide wafer-to-wafer connection with a device above. The AI inference memory device can be packaged as any of a 2.5D through silicon via (TSV) integrated circuit, a three-dimensional (3D) through silicon via (TSV) integrated circuit, and a three-dimensional (3D) system-on-chip (3D SOC) integrated circuit.


In this example, processor chip 301 includes an input/output interface 313 disposed on the surface of the chip 301. The input/output interface 313 is connected to interconnection wiring 311 on the interposer 310.


The first memory chip 303 includes an interface 312 for connection to the interconnection wiring 311 on the interposer 310.


Also, the I/O chip 304 includes an interface 314 for connection to the interconnection wiring 311 on the interposer 310.


Thus, interconnection wiring 311 provides part of the data path between the first memory chip 303 and the second memory chip 302, and the processor chip 301.


In the example illustrated in FIG. 3, the multichip module (MCM) 320 includes another input/output interface 352 for connection to external contact structures 351 of the multichip module 320.



FIGS. 1-3 provide example arrangements of an inference system with high bandwidth NVM as described herein, showing varieties of configurations of the chips and connections among the chips, the interposer and external contacts of the package. Other arrangements can be implemented as suits a particular need.



FIG. 4 is a simplified functional block diagram of an inference system implemented as described with reference to FIGS. 1-3. The platform includes a processor chip 401 (e.g., a CPU and an AIAcc), a first memory chip 403 (an activation memory storing activation data), and a second memory chip 402 (e.g., a weight memory storing weight data for a neural network). The processor chip 401 in this example includes a CPU or processor core 410, an accelerator core 411, on-chip memory 405, such as SRAM (or another type of memory), which can be used as working memory and as a cache memory, a first I/O interface 413, and a second I/O interface 422. A bus system 420 provides for intra-chip communications among the components of the processor chip 401.


The first memory chip 403 in this example comprises a high capacity, volatile memory 440 such as DRAM or SRAM (or a nonvolatile memory such as 3D NAND or other type of memory implemented using charge trapping storage technology), for example. The first memory chip 403 includes a first memory I/O interface 412 for off-chip communications. The first memory I/O interface 412 can comprise a high-speed serial port, such as a serial peripheral interface (SPI) compatible port, or a parallel port, depending on the particular implementation of the memory chip 403 that is utilized. A data path 415 is provided in this example between the first memory I/O interface 412, and the first I/O interface 413 on the processor chip 401.


The second memory chip 402, in this example, comprises a high-bandwidth (HB), nonvolatile memory (NVM) configured in one or more banks 430a, 430b, each of which can comprise one or more layers of NVM dies arranged in channels. The NVM can be one of a phase change memory (PCM), a three-dimensional cross point memory (3D Xpoint), and a NAND flash memory. In other examples, the second memory chip 402 can comprise NOR flash memory using charge trapping storage technology, or other suitable random-access technologies like resistive RAM (e.g., metal oxide memory), magnetic RAM, ferroelectric RAM, conductive bridge random-access memory (CBRAM), and so on.


The second memory chip 402 includes a memory I/O interface 414 for off-chip communications via a logic layer 404 to the I/O interface 413 on the processor chip 401. Logic layer 404 includes channel controllers 434, 444 that provide control of multiple channels forming one or more sets of high-speed data pathways on which weight data can flow across interfaces 432a, 432b exposed on a surface of the logic layer 404 and complementary to interfaces 431a, 431b on a surface of banks 430a, 430b of NVM dies arranged in layers, directly connected by vertical connections 450a, 450b provided at the surfaces between the IO-memory interfaces 432a, 432b and the memory-IO interfaces 431a, 431b. The direct vertical connections 450a, 450b can comprise very short length copper via-to-via conductors or other chip-to-chip contact technologies suitable for high speed, low latency, and low power communication between the chips. In an implementation and by way of example, two stacks are formed by stacking four NVM dies, each with N/8 through silicon via (TSV) I/O, onto a logic die, where N is the total number of TSV I/O and 8 is the number of dies, so that N/8 is the I/O per die. One NVM die has one channel; one channel is N/8 TSV I/O. Each channel is completely independent and can operate independently of the others. One controller can control multiple channels. An external controller can be provided in a field programmable gate array (FPGA) or system on a chip (SoC) die (e.g., implementing processor 401).
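As a non-limiting illustration of the TSV allocation just described, the short sketch below works the arithmetic for a hypothetical total of N = 1024 TSV I/O; the value of N is an assumption chosen only to make the numbers concrete.

```python
# Worked example of the TSV allocation described above. N = 1024 is an assumed
# value for illustration only; the text specifies only that each of the 8 dies
# gets N/8 TSV I/O and one independent channel.
N_total_tsv_io = 1024          # assumed total number of TSV I/O
num_dies = 8                   # two stacks of four NVM dies on the logic die
tsv_io_per_die = N_total_tsv_io // num_dies
num_channels = num_dies        # one NVM die has one channel

print(f"TSV I/O per die (one channel): {tsv_io_per_die}")   # 128
print(f"Independent channels: {num_channels}")              # 8
```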


DRAM is an option that can be bonded into the system in package (SiP) in cases where the on-chip SRAM is not large enough.


Thermal (heat) management can be used to guarantee data retention.


An AI accelerator (e.g., accelerator core 411), as the term is used herein, is a configurable logic circuit including components designed or suitable for execution of some or all of the arithmetic operations of AI inference operations. Configuration of the accelerator core can include loading a set of weights from memory 402 to be used in conducting inference operations, or parts of the set of weights. In some embodiments, configuration of the accelerator core can include loading some or all of the computation graphs of an inference model that define the sequence and architecture of the operation of the inference model. The inference model can comprise a computation graph of a deep learning neural network, in some examples having a plurality of fully connected and partially connected layers, activation functions, normalization functions, and so on.


An accelerator core can be implemented using configurable logic, like arrays of configurable units used in field programmable gate arrays for example, in which compiled computation graphs are configured using bit files. An accelerator core can be implemented using a hybrid of data flow configurable logic and sequential processing configurable logic.


A runtime processor core (e.g. CPU 410) can execute a runtime program to coordinate operation of the accelerator core to accomplish real time inference operations, including data input/output operations, loading computation graphs, moving the set of weights to be applied in the inference operation into and out of the accelerator core, delivering input data to the accelerator core, and performing parts of the computations to obtain inference results.



FIG. 5 is a flowchart illustrating an example of logic of operations executed by an inference system, such as described with reference to FIGS. 1-4. FIG. 6 is a simplified functional block diagram of a representative neural network node (neuron) processed by the AI inference system logic as described herein. The logic can be implemented using machine executable instructions stored in memory, such as the SRAM on-chip memory 405, or other memory accessible by the processor 410 coupled with a logic layer die including channel logic implementing connections between a plurality of channels conducting data to and from an accelerator core via at least one bus, and a plurality of non-volatile memory (NVM) dies stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel logic in the logic layer. In this example, a collection of weights for artificial intelligence neural network computations downloaded from an external source, such as a network, is loaded into the high bandwidth NVM of the inference system.


With reference to FIGS. 5, 6, and 7, during runtime, an operation retrieves a plurality of neural network weights 602 (e.g., Wb, W1, W2, W3, and Wn) stored in the high bandwidth NVM (operation 501). The weight data is directly transferred into the AI accelerator core 611 to perform calculations. An activation function 604 is stored in activation memory (e.g., 403 of FIG. 4) (operation 502). With reference to FIG. 7, suitable activation data can include stored data implementing an activation function including one or more of a sigmoid function 702, a hyperbolic tangent (tanh) function 704, a rectified linear unit (ReLU) function 706, a leaky rectified linear unit (LReLU) function 708, and a maxout function 710.
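To make the node computation of FIG. 6 and the activation functions of FIG. 7 concrete, the following Python sketch implements them directly; the numeric inputs and the maxout parameterization are illustrative assumptions, not values taken from the figures.

```python
import math

# Representative activation functions corresponding to FIG. 7.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):          # alpha is an assumed slope
    return x if x > 0 else alpha * x

def maxout(x, weight_sets, biases):     # maximum over several linear pieces
    return max(sum(w * xi for w, xi in zip(ws, x)) + b
               for ws, b in zip(weight_sets, biases))

# A single neural network node in the style of FIG. 6: a bias weight Wb plus
# weights W1..Wn applied to inputs, followed by an activation function.
def node_output(inputs, weights, bias_weight, activation=relu):
    intermediate = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return activation(intermediate)

# Illustrative values (assumed, not from the patent).
print(node_output(inputs=[0.5, -1.2, 0.3],
                  weights=[0.8, 0.1, -0.4],
                  bias_weight=0.05,
                  activation=sigmoid))
```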


After loading the weights and configuring the AI accelerator core 611, the operations include executing an inference procedure using the AI accelerator core 611 to apply the plurality of neural network weights to input data 606 for each one of a plurality of nodes of a neural network to obtain an intermediate output (operation 503) for one or more levels of nodes of the neural network. The intermediate output 608 comprises a set of computational results from a level of the neural network.


Next, the activation function 604 stored in, for example, the first memory chip 103 (e.g., activation memory 403 of FIG. 4) is applied to the intermediate output to obtain an output 610 (e.g., a result) for one or more levels of the neural network (operation 504). For example, in operation 504, the AIAcc can apply activation data to results of nodes of a level of a neural network. The result is stored in the high bandwidth NVM, an FPGA buffer or any other storage device, or the activation memory 403 (operation 505). The procedure includes checking whether nodes at further neural network levels are to be processed (operation 506) and, if so, restarting operations 501-506 for the next batch of neural network nodes to be processed. If there are no further levels to process, then the output is provided directly and/or stored (e.g., for storage back in the second memory chip 102, an FPGA buffer, or any other storage device).
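The following sketch strings operations 501-506 together as a level-by-level loop, using simple in-memory stand-ins for the high bandwidth NVM, the activation memory, and the accelerator core; the class names and numeric values are illustrative assumptions rather than an actual device interface.

```python
# Minimal sketch of the runtime procedure of FIG. 5 (operations 501-506), using
# in-memory stand-ins. All class names and values are illustrative assumptions.
class WeightMemory:                       # stand-in for the HB NVM
    def __init__(self, weights_per_level):
        self.weights = weights_per_level  # level -> (bias, [w1..wn])
        self.results = {}
    def read_weights(self, level):
        return self.weights[level]
    def store_result(self, level, result):
        self.results[level] = result

class ActivationMemory:                   # stand-in for SRAM/DRAM activation memory
    def load_activation(self, level):
        return lambda x: max(0.0, x)      # ReLU as the stored activation data

class Accelerator:                        # stand-in for the AI accelerator core
    def apply_weights(self, weights, data):
        bias, ws = weights
        return [bias + sum(w * x for w, x in zip(ws, data))]
    def apply_activation(self, activation, intermediate):
        return [activation(x) for x in intermediate]

def run_inference(levels, input_data, weight_memory, activation_memory, accelerator):
    data = input_data
    for level in levels:
        weights = weight_memory.read_weights(level)               # operation 501
        activation = activation_memory.load_activation(level)     # operation 502
        intermediate = accelerator.apply_weights(weights, data)   # operation 503
        result = accelerator.apply_activation(activation, intermediate)  # operation 504
        weight_memory.store_result(level, result)                 # operation 505
        data = result                     # operation 506: continue to the next level
    return data                           # final result when no further levels remain

wm = WeightMemory({0: (0.1, [0.5, -0.2]), 1: (0.0, [1.5])})
print(run_inference([0, 1], [1.0, 2.0], wm, ActivationMemory(), Accelerator()))
```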


Thus, the operations of FIG. 5 include an operation to load activation data and weights, stored in the first and second memory chips, into the accelerator core, and to execute the neural network nodes using the weights and activation data. Also, as shown in FIG. 5, after executing or beginning to execute the selected neural network nodes, the operations loop to operation 501 to process a next level of the neural network. If more neural network nodes are to be processed, the operations 501 to 506 are traversed, and can include changing the activation data to a different function, loading the weights for the nodes of the next neural network level, and executing the different nodes.


It will be appreciated with reference to FIG. 5, that many of the steps can be combined, performed in parallel or performed in a different sequence without affecting the functions achieved. In some cases, as the reader will appreciate, a rearrangement of steps will achieve the same results only if certain other changes are made as well. In other cases, as the reader will appreciate, a rearrangement of steps will achieve the same results only if certain conditions are satisfied. Furthermore, it will be appreciated that the flow charts herein show only steps that are pertinent to an understanding of the invention, and it will be understood that numerous additional steps for accomplishing other functions can be performed before, after and between those shown.


During training processing, weights stored in the NVM dies (e.g., of the weight memory 402) are adjusted by the processor chip 401 (e.g., a CPU and an AIAcc) based upon training dataset(s). As training progresses, the processor chip 401 will keep updating values for the weights, and these values will be stored in the NVM. In some implementations, weights are fixed during inference processing.


In processing-constrained environments, the speed with which the AI application arrives at a result becomes an important factor. Conventionally, randomly stored weight data in the HBW NAND flash can increase the number of data fetching operations required and lead to low data transmission efficiency due to the page-level granularity inherent in NAND flash page read operations. Weight data storage paradigms in accordance with embodiments described herein can reduce flash access times and increase data processing efficiency.



FIG. 8 is a simplified diagram illustrating storage and retrieval of weight data used to compute node values during inferencing operations conducted by a neural network in a memory system including a flash memory device 1 implemented on an integrated circuit as described herein. The memory device can be implemented on a single integrated circuit chip, on a multichip module, or on a plurality of chips configured to suit a particular need.


As shown in FIG. 8, memory device 1 is a NAND flash in which read operations are performed at a one-page granularity, meaning that every time a fetch operation for data is performed, one whole page must be read from the flash. Data fetching can become slow when data is stored in different pages. This may impact the processing speed of the entire AI inferencing platform. Further, the AI accelerator may stall while waiting for data to arrive. This can result in low bandwidth utilization if locality of the data is not well preserved.


Continuing with FIG. 8, D0 to D5 indicate data storage locations; W0 to W5 indicate weight data. The weight data is stored in the NVM. In a first scenario 801, weight data (i.e., neural network weights) are stored randomly in data locations from location D0 to location D5 in page 0 and page 1. Accordingly, if neural network processing calls for Data 0, Data 4, and Data 1, e.g., weights (W1, W3), (W5, W2), and (W2, W4), to perform a calculation, i.e., obtain a node value or set of node values, then the NAND flash memory device 1 will read page 0 first, in order to obtain Data 0, e.g., weights W1 and W3, then read page 1, in order to obtain Data 4, e.g., weights W5 and W2, and then go back to re-read page 0, in order to obtain Data 1, e.g., weights W2 and W4. However, if the data is stored in data locations in accordance with an expected retrieval order indicated by the neural network to be processed, or re-ordered to be in such storage locations, as illustrated by second scenario 802, expected data acquisition times can be reduced. In second scenario 802, the weights belonging to Data 0, Data 4, and Data 1, e.g., (W1, W3), (W5, W2), and (W2, W4), respectively, are all stored in locations in page 0. Thus, obtaining the weights belonging to Data 0, Data 4, and Data 1 calls for the NAND flash memory device 1 to perform a single read of page 0, which reduces access time and increases efficiency and performance. In various implementations, the structure of the neural network, which indicates which weights are to be used and in what order, can be received as an input by the AI inferencing system, read from storage, or otherwise made accessible. Because the whole neural network structure can be obtained from local storage on chip, or received from a host or another chip in the AI inferencing system, the AI inferencing system can determine what weight data needs to be input during AI inference, and thus in what order the weight data can be stored in order to reduce access times during later inferencing.
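The difference between scenarios 801 and 802 can be illustrated with a short page-read count, assuming a page must be (re)read whenever a fetch moves to a page other than the one currently buffered; the layouts below are simplified stand-ins for the D0-D5 arrangement of FIG. 8.

```python
# Sketch of the page-read count difference between scenarios 801 and 802.
# Page contents are simplified placeholders, not the exact D0-D5 layout.
def page_reads(fetch_order, location_to_page):
    reads, current_page = 0, None
    for location in fetch_order:
        page = location_to_page[location]
        if page != current_page:          # switching pages forces a new page read
            reads += 1
            current_page = page
    return reads

fetch_order = ["Data0", "Data4", "Data1"]   # order called for by the network

# Scenario 801: the needed data is scattered across page 0 and page 1.
random_layout = {"Data0": 0, "Data4": 1, "Data1": 0}
# Scenario 802: data placed to match the expected retrieval order (all in page 0).
ordered_layout = {"Data0": 0, "Data4": 0, "Data1": 0}

print(page_reads(fetch_order, random_layout))   # 3 page reads (page 0, 1, 0 again)
print(page_reads(fetch_order, ordered_layout))  # 1 page read
```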



FIG. 9 illustrates weight data used by a neural network to process images being stored according to various storage paradigms by a high bandwidth NVM in accordance with the technologies as described herein. In the example implementation illustrated by FIG. 9, weight data is stored in the same page or multiple pages in the NVM and read in groups or sets of nine (9) weights for each of three (3) color neural network channels: red (1 to 9), green (1 to 9) and blue (1 to 9). A group (or set) of weights can comprise a filter or a feature map such that the filter can include weights that correspond to network nodes in a layer of the neural network defined in a neural network definition.


In the example storage paradigms 901, 902, 903 illustrated by FIG. 9, there are 32 sets of 9 weights, each having a specific storage location, where a set of 9 weights 911, 912, 913 forms a respective 3×3 filter, e.g., 921, 922, 923. A set of 9 red weights 911 can be used to populate a red filter 921, a set of green weights 912 can be used to populate a green filter 922, and a set of blue weights 913 can populate a blue filter 923. Accordingly, there are 288 red weights (e.g., 32×9), 288 green weights and 288 blue weights. In sum, there is a total of 864 weights (288×3).


As depicted in FIG. 9, weights are stored from left to right in the diagram, which in paradigm 901 is from red to green to blue. Accordingly, in paradigm 901, 3 fetch instances are needed to retrieve the first filter for each of the 3 different channels, i.e., red, green, and blue, because, for example, this data might not be stored in the same page in memory, depending on the page size. Accordingly, for example, to populate each of the filters, at least 96 read operations (32×3) will be required. Even if the weights are all stored in the same page, this processing can still require 3 read operations to obtain values for a filter for the 3 different (e.g., R, G, and B) channels. This is why the order of the data can be re-arranged so that an AI inferencing engine can process all of the red data at once, if one page is large enough to hold all of the red data. If the page size is not large enough to hold the weights for all filters of the same color, the remainder of the same-color weights will be stored continuing onto the next page in memory.


When the weight data is stored according to the first storage paradigm 901, read processing of lower efficiency may result. If the data locations can be arranged (or re-arranged) and stored in a different ordering, such as paradigms 902, 903, then the weight data for the red, green, and blue filters can be read in at once. This reduces the number of data fetches, resulting in decreased processing time during inferencing when the data needs to be read.


In paradigm 902, 9 weights for a red filter 911, 9 weights for a green filter 912 and 9 weights for a blue filter 913 have been stored together. In this configuration, a read operation can obtain weights to populate filters 921, 922, and 923 in a single read operation.


In paradigm 903, weights are arranged with all 1s grouped together, then all 2s grouped together, and so forth. This is another data re-arrangement to improve access when performing convolution processing, in which the input image is multiplied by the 3×3 filter.
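The three orderings of FIG. 9 can be illustrated as follows, using a reduced example of 2 filters of 4 weights per color channel instead of 32 filters of 9 weights; the reduction and the exact interleaving within each grouping are assumptions made only to keep the illustration short.

```python
# Sketch of the three weight orderings of FIG. 9, with a reduced example of
# 2 filters of 4 weights per color channel (labels like "R0_1" mean red,
# filter 0, weight position 1).
filters, weights_per_filter = 2, 4
colors = ["R", "G", "B"]

# Paradigm 901: all red weights, then all green, then all blue.
p901 = [f"{c}{f}_{w}" for c in colors
        for f in range(filters) for w in range(weights_per_filter)]

# Paradigm 902: the R, G, and B weights of each filter stored together, so one
# read can populate the red, green, and blue filters of a filter index at once.
p902 = [f"{c}{f}_{w}" for f in range(filters)
        for c in colors for w in range(weights_per_filter)]

# Paradigm 903: weights grouped by position within the filter (all 1s together,
# then all 2s, ...), matching the order in which convolution consumes them.
p903 = [f"{c}{f}_{w}" for w in range(weights_per_filter)
        for c in colors for f in range(filters)]

print(p901)
print(p902)
print(p903)
```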


Of course, while 3×3 filters are used in the example illustrated by FIG. 9, filters can be constructed having practically any size, such as, for example and without limitation, 5×5 or 8×8, in which cases there would be 25 or 64 total weight values per filter, respectively.



FIG. 10 is a schematic illustration of a technique of allocating weight data among multiple NVM channels by a high bandwidth NVM in accordance with the technologies as described herein. The high bandwidth memory architecture 1000 illustrated by FIG. 10 incorporates the use of channels 1010, 1011, 1012 to enable data to be read from or stored to different NVM dies carrying memory arrays contemporaneously. Such memory channels are used for data transfer along pathways to and from the dies implementing non-volatile storage arrays. In some implementations, a controller can be shared by, i.e., can control, multiple channels, as illustrated by controllers 434, 444 in FIG. 4, which control channels 0-3 and channels 4-7 respectively.


In an implementation and by way of example, if a memory system has 4 channels, or multichannel NVM memory, the total weight data length can be distributed equally or otherwise among the memory channels (data length/channel). Of course, utilization of the page size can be maximized using the techniques described above with reference to FIGS. 8-9. For example, 288 weights per neural network channel, e.g., 288 red weights (32 filters×9 weights per filter), can be distributed among 4 memory channels, yielding 72 red weights per memory channel, or, at 9 weights per filter, 8 red filters per memory channel. Analogously, there are 8 green filters per memory channel and 8 blue filters per memory channel.
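The distribution arithmetic above can be checked directly; the sketch below simply reproduces the example's numbers.

```python
# Worked version of the distribution example above: 288 weights per color
# (32 filters x 9 weights) spread evenly across 4 memory channels.
filters_per_color = 32
weights_per_filter = 9
memory_channels = 4

weights_per_color = filters_per_color * weights_per_filter        # 288
weights_per_channel = weights_per_color // memory_channels        # 72
filters_per_channel = weights_per_channel // weights_per_filter   # 8

total_weights = weights_per_color * 3                              # 864 (R, G, B)
print(weights_per_color, weights_per_channel, filters_per_channel, total_weights)
```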


Further, different layers of the neural network will employ different groups of weights. As shown in FIG. 10, for example, in the convolutional neural network (CNN), there are different layers 1020A, 1022A. For each layer, there are different filters 1020B, 1022B. Here, layer 0 is the first layer 1020A in the convolutional neural network, and the corresponding first layer filters (weights) 1020B are stored in page 0. If page 0 is not large enough, the weight data will be stored continuing in a following page. Further, weights can be of a different type and serve a different function for the different layers. For example, the weights associated with layer 0 1020B include weights grouped according to red, green, and blue filters. Layer 1 1022A includes weights grouped according to some other paradigm defined by the neural network definition. In some implementations, the weights corresponding to a first layer of a neural network can be stored according to one paradigm (e.g., 901, 902, or 903 of FIG. 9) and the weights corresponding to a second layer of the neural network can be stored according to a different paradigm.



FIG. 11 is a schematic diagram illustrating multiple scenarios for allocating weight data among multiple NVM channels by a high bandwidth NVM in accordance with the technologies as described herein. Appropriate storage scenarios according to the present technology can provide for retrieval of the weights for two or more groups of weights (e.g., filters) without necessitating a read of an additional page of memory. For example, weights can be stored such that retrieval of all groups for a layer can be performed without triggering an additional page read. In FIG. 11, weight data belonging to a plurality of filters (filter 1-filter 7) 1101-1108 can be read out by multiple memory channels (channel 0-channel 3) 1110-1113. A memory channel is used for data transfer via pathways to and from the dies implementing non-volatile storage arrays. In some implementations, a controller can be shared by, i.e., can control, multiple channels, as illustrated by controllers 434, 444 in FIG. 4, which control channels 0-3 and channels 4-7 respectively.


In scenario 1100A, weights grouped according to filters can be allocated to individual memory channels. For example, the weights for filter 1 1101 and filter 2 1102 are stored for retrieval by channel 0 1110. Accordingly, weights from filters can be stored to or retrieved from multiple arrays contemporaneously using multiple NVM channels. In this configuration, weights belonging to different filters can be read contemporaneously using different channels, however, if weights from different groups assigned to the same channel need to be read, for example from filter 1 and filter 2 in scenario 1100A of FIG. 11, both belonging to channel 0 1110, then multiple page reads might need to occur.


In scenario 1100B, weights grouped according to filters are allocated to storage space across different channels. With reference to FIG. 11, the weights for filter 1 through filter 6 1101-1106 are apportioned among each of the four available channels (channel 0-channel 3) 1110-1113. In this scenario, weight data distributed among the multiple channels can be read in contemporaneously from NVM arrays by each of channel 0 through channel 3 1110-1113. Further, read operations against the NVM arrays of each channel have the capability to read weight data for each of the filters without necessitating a page retrieval when switching read operations from one filter to the next filter.
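As a non-limiting illustration of the two allocation scenarios, the sketch below assigns whole filters to channels (in the manner of scenario 1100A) and stripes each filter's weights across all channels (in the manner of scenario 1100B); the filter and weight counts are chosen for illustration and do not exactly match the figure.

```python
# Sketch contrasting the two allocation scenarios of FIG. 11, using 8 filters of
# 8 weights and 4 channels (counts chosen for illustration). In scenario 1100A
# whole filters are assigned to individual channels (e.g., filters 1 and 2 to
# channel 0); in scenario 1100B each filter's weights are apportioned across
# all four channels so the channels can stream a filter's data contemporaneously.
filters = {f"filter{i}": [f"w{i}_{j}" for j in range(8)] for i in range(1, 9)}
channels = [f"channel{c}" for c in range(4)]

# Scenario 1100A: two whole filters per channel.
alloc_a = {ch: [] for ch in channels}
for i, name in enumerate(filters):
    alloc_a[channels[i // 2]].append(name)

# Scenario 1100B: each filter's weights striped across all four channels.
alloc_b = {ch: [] for ch in channels}
for name, weights in filters.items():
    for j, w in enumerate(weights):
        alloc_b[channels[j % len(channels)]].append((name, w))

print(alloc_a["channel0"])       # ['filter1', 'filter2']
print(alloc_b["channel0"][:4])   # weight slices of filters 1 and 2 on channel 0
```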


Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above. Yet another implementation of the method described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above.


Any data structures and code described or referenced above are stored according to many implementations on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, volatile memory, non-volatile memory, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.


As used herein, a network node, including network nodes referred to as client side nodes and server side nodes, is an application hosted on an active electronic device, or a virtual device executed in an active electronic device such as a cloud-based platform, that is attached to a network and is capable of sending, receiving, or forwarding information, in support of computer programs such as servers and clients, over a physical medium for a communications channel on the network, and having, for example, media access control addresses and protocol stacks that support higher network layers. A network can include networks using Internet Protocol addresses, or other types of network layer addresses. In some embodiments the network comprises the Internet. Examples of electronic devices which can host network nodes include all varieties of computers, workstations, laptop and desktop computers, hand-held computers and smart phones, and cloud-based platforms.


A byte is a basic storage unit used in many integrated circuit logic and memory circuits and consists of eight bits. Basic storage units can have other sizes, including, for example, one bit, two bits, four bits, 16 bits, and so on. Thus, the description of a high bandwidth NVM set out above, and in other examples described herein utilizing the term byte, applies generally to circuits using different sizes of storage units, as would be described by replacing the term byte or set of bytes with storage unit or set of storage units. Also, in some embodiments, different sizes of storage units can be used in a single command sequence, such as one or more four-bit storage units combined with eight-bit storage units.


A number of flowcharts illustrating logic executed by a memory controller or by memory device are described herein. The logic can be implemented using processors programmed using computer programs stored in memory accessible to the computer systems and executable by the processors, by dedicated logic hardware, including field programmable integrated circuits, and by combinations of dedicated logic hardware and computer programs. With all flowcharts herein, it will be appreciated that many of the steps can be combined, performed in parallel or performed in a different sequence without affecting the functions achieved. In some cases, as the reader will appreciate, a re-arrangement of steps will achieve the same results only if certain other changes are made as well. In other cases, as the reader will appreciate, a re-arrangement of steps will achieve the same results only if certain conditions are satisfied. Furthermore, it will be appreciated that the flow charts herein show only steps that are pertinent to an understanding of the invention, and it will be understood that numerous additional steps for accomplishing other functions can be performed before, after and between those shown.


While the present invention is disclosed by reference to the preferred embodiments and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than in a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.

Claims
  • 1. An artificial intelligence (AI) inference memory device, comprising: a logic layer die including channel logic implementing connections between a plurality of channels for conducting data to and from an accelerator core via at least one bus; anda plurality of non-volatile memory (NVM) dies storing arrays of weights and stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel logic in the logic layer, the stacked NVM dies being organized into banks of NVM dies,wherein the logic layer die includes, for each bank of NVM dies, a respective controller to execute interface commands with one or more NVM dies of the corresponding bank.
  • 2. The AI inference memory device of claim 1, wherein the controllers for the banks of NVM dies are configured to execute the interface commands in parallel and to contemporaneously provide arrays of weights, of the stored arrays of weights, to a processor chip including the accelerator core.
  • 3. The AI inference memory device of claim 1, wherein the interface commands include a read neural network weights data command, and a write neural network weights data command.
  • 4. The AI inference memory device of claim 1, wherein the plurality of NVM dies are arranged in two stacks of four NVM dies on the logic layer die, and provide data to eight parallel channels.
  • 5. The AI inference memory device of claim 4, wherein the interface commands are executed by memory corresponding to a plurality of channels in parallel; thereby achieving 50 GB/second throughput to the accelerator core.
  • 6. The AI inference memory device of claim 1, wherein a corresponding channel, of the plurality of channels, further includes an interface to a substrate mounting a processor or field programmable gate array (FPGA) that performs neural network computations and a static random-access memory (SRAM) storing activation data for use in neural network computations.
  • 7. The AI inference memory device of claim 6, wherein storing activation data includes storing a non-linear function for relating relationships between input and output of a neural network.
  • 8. The AI inference memory device of claim 6, wherein storing activation data includes storing data implementing an activation function selected from at least one of a sigmoid function, a hyperbolic tangent (tanh) function, a rectified linear unit (ReLU) function, a leaky rectified linear unit (LReLU) function, and a maxout function.
  • 9. The AI inference memory device of claim 1, wherein a corresponding channel, of the plurality of channels, further includes an interface to a substrate mounting a processor or a field programmable gate array (FPGA) that performs neural network computations and a dynamic random-access memory (DRAM) storing activation data for use in neural network computations.
  • 10. The AI inference memory device of claim 1, wherein the NVM of the plurality of NVM dies is one of a phase change memory (PCM), a three-dimensional cross point memory (3D Xpoint), a NOR flash memory, a resistive random-access memory (RRAM), a magneto-resistive random-access memory MRAM, a ferroelectric random-access memory FeRAM, a conductive bridge random-access memory CBRAM, and a NAND flash memory.
  • 11. The AI inference memory device of claim 1, wherein the direct vertical connections comprise via-to-via connections of a through silicon via (TSV) integrated circuit.
  • 12. The AI inference memory device of claim 1, further including an interposer layer to which the logic layer and a vertical stack comprising the plurality of NVM dies are affixed; wherein the interposer layer provides connection between the plurality of direct vertical connections of the vertical stack to corresponding channel logic in the logic layer; thereby packaging the AI inference memory device as a 2.5D through silicon via (TSV) integrated circuit.
  • 13. The AI inference memory device of claim 1, further including an interposer layer to which the logic layer is affixed; wherein the plurality of NVM dies are stacked vertically above the logic layer and the interposer layer in a vertical stack; thereby establishing connections between the plurality of direct vertical connections of the vertical stack to corresponding channel logic in the logic layer; thereby packaging the AI inference memory device as a three-dimension (3D) through silicon via (TSV) integrated circuit.
  • 14. The AI inference memory device of claim 13, wherein a plurality of solder bumps that have been deposited onto chip pads of the logic layer and the plurality of NVM dies provide connection with a device immediately below.
  • 15. The AI inference memory device of claim 13, wherein a plurality of backside interconnects have been deposited onto chip pads of the logic layer and the plurality of NVM dies provide wafer-to-wafer connection with a device above; thereby packaging the AI inference memory device as a three-dimensional (3D) system-on-chip (3D SOC) integrated circuit.
  • 16. An artificial intelligence (AI) inference method conducted by a processor coupled with a logic layer die including channel logic implementing connections between a plurality of channels conducting data to and from an accelerator core via at least one bus, and a plurality of non-volatile memory (NVM) dies storing arrays of weights and stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel logic in the logic layer, the stacked NVM dies being organized into banks and the logic layer die including, for each bank of NVM dies, a respective controller to execute interface commands with one or more NVM dies of the corresponding bank, the method comprising: implementing the respective controller for each bank of NVM dies to retrieve from the plurality of NVM dies that are stacked vertically above the logic layer die, a plurality of weights of the arrays of weights stored therein;using the accelerator core, applying the plurality of weights to input data for each one of a plurality of nodes of a neural network to obtain an intermediate output;applying activation data to the intermediate output to obtain a result for a neural network level; andstoring the result in at least one of the plurality of NVM dies, an FPGA buffer, and an activation memory facilitating computations at additional neural network levels until a final result is reached.
  • 17. The AI inference method of claim 16, wherein applying activation data includes applying stored data implementing an activation function selected from at least one of a sigmoid function, a hyperbolic tangent (tanh) function, and a rectified linear unit (ReLU) function.
  • 18. The AI inference method of claim 16, wherein the plurality of NVM dies are arranged in two stacks of four NVM dies on the logic layer die, and wherein the method further comprises providing data to eight parallel channels.
  • 19. An AI inference system, comprising: a substrate coupling: a processor chip including an accelerator core, a first memory chip suitable for storing activation functions, and an artificial intelligence (AI) inference memory device for storing arrays of weights, the memory device comprising: a logic layer die including channel logic implementing connections between a plurality of channels for conducting data to and from the accelerator core via at least one bus; and a plurality of non-volatile memory (NVM) dies storing arrays of weights and stacked vertically one above another, forming a layered vertical stack of NVM dies, each of the NVM dies including at least one memory chip and a plurality of direct vertical connections to a corresponding channel logic in the logic layer, the stacked NVM dies being organized into banks of NVM dies, wherein the logic layer die further includes, for each bank of NVM dies, a respective controller to execute interface commands with one or more NVM dies of the corresponding bank, wherein the processor chip applies weights, of the stored arrays of weights, retrieved from the NVM dies to specific inputs and applies activation functions retrieved from the first memory chip to provide inferencing output; and wherein the inference system is implemented as a multichip module in a single package.
  • 20. The AI inference memory device of claim 1, further including activation memory storing activation data for use in neural network computations, wherein the logic layer die including the controllers and the plurality of NVM dies are part of a particular vertical stack, wherein the activation memory is part of a vertical stack that is different from the particular vertical stack and wherein the accelerator core is part of another vertical stack that is different from both the particular vertical stack and the vertical stack including the activation memory.
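For orientation only, the following is a minimal NumPy sketch of the inference flow recited in method claims 16 and 17: per-bank controllers retrieve stored weights, the accelerator core applies them to the layer inputs, an activation function (sigmoid, tanh, or ReLU) is applied to the intermediate output, and the result is stored for the next neural network level. The function names (`fetch_bank_weights`, `run_layer`) and the in-memory stand-ins for the NVM banks are hypothetical illustrations under stated assumptions, not part of the claimed device.

```python
import numpy as np

# Hypothetical stand-ins for the claimed hardware: each "bank" of stacked NVM
# dies holds a slice of the weight array for one network level. Ordinary
# arrays are used here purely for illustration.
ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh,
    "relu": lambda x: np.maximum(x, 0.0),
}

def fetch_bank_weights(nvm_banks):
    """Model the per-bank controllers retrieving weight slices over parallel
    channels and assembling the full weight matrix for one network level."""
    return np.concatenate(nvm_banks, axis=1)

def run_layer(inputs, nvm_banks, activation="relu"):
    """Apply the retrieved weights to the input data (claim 16), then apply
    the selected activation function (claim 17) to the intermediate output."""
    weights = fetch_bank_weights(nvm_banks)       # retrieve weights from NVM banks
    intermediate = inputs @ weights               # accelerator-core multiply-accumulate
    return ACTIVATIONS[activation](intermediate)  # activation applied to intermediate output

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 16))
    # Two banks, each storing half of a 16x8 weight array for this level.
    banks = [rng.standard_normal((16, 4)) for _ in range(2)]
    result = run_layer(x, banks, activation="sigmoid")
    # The result would then be stored (NVM die, buffer, or activation memory)
    # and fed to the next neural network level until a final result is reached.
    print(result.shape)  # (1, 8)
```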
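Likewise, a hedged sketch of the channel organization described in claims 16, 18, and 19: two vertical stacks of four NVM dies sit on the logic layer die, each die connecting through its direct vertical connections to a corresponding channel, so the device presents eight parallel channels, each served by its own bank controller. The class and field names below are illustrative assumptions, not terminology from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Channel:
    stack: int          # which vertical stack of NVM dies (claim 18: two stacks)
    die: int            # position of the NVM die within its stack (four per stack)
    controller_id: int  # per-bank controller in the logic layer die (claims 16, 19)

def build_channel_map(num_stacks: int = 2, dies_per_stack: int = 4) -> List[Channel]:
    """Enumerate the parallel channels exposed by the logic layer die: one
    channel per NVM die, each with a controller executing interface commands
    for the corresponding bank."""
    channels = []
    for stack in range(num_stacks):
        for die in range(dies_per_stack):
            controller_id = stack * dies_per_stack + die
            channels.append(Channel(stack=stack, die=die, controller_id=controller_id))
    return channels

if __name__ == "__main__":
    channel_map = build_channel_map()
    print(len(channel_map))  # 8 parallel channels, as recited in claim 18
```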
Related Publications (1)
Number: 20240281142 A1; Date: Aug 2024; Country: US