This application claims priority from Chinese Patent Application No. 201610162375.7, filed Mar. 21, 2016; Chinese Patent Application No. 201610180422.0, filed Mar. 26, 2016; and Chinese Patent Application No. 201610182229.0, filed Mar. 27, 2016, in the State Intellectual Property Office of the People's Republic of China (CN), the disclosures of which are incorporated herein by reference in their entireties.
1. Technical Field of the Invention
The present invention relates to the field of integrated circuits, and more particularly to a neuro-processor for artificial intelligence (AI) applications.
2. Prior Art
AI is the next big wave in computing. Artificial neural network (hereinafter, neural network) is a powerful AI tool. An exemplary neural network is shown in
U.S. Pat. No. 6,199,057 issued to Tawel on Mar. 6, 2001 discloses a circuit implementation of a neuro-processor. As shown in
There is a clear trend towards increasingly large neural networks. Most neural networks use 1-billion to 10-billion Ws-parameters. Clearly, this large amount of Ws-parameters (on the order of GB) cannot be stored in a small Ws RAM 40X. To address this issue, prior art uses the von Neumann architecture, where the Ws-parameters are stored in an external RAM (i.e. the main memory). However, neural networks have become so computationally intensive that the Ws-parameters have to be frequently written back to and read from the main memory. These frequent memory accesses become the performance bottleneck. It was reported that the neuro-processor loses at least an order of magnitude in performance due to memory accesses.
To solve the above memory-access problem, Chen (referring to Chen et al. “DaDianNao: A Machine-Learning Supercomputer”, IEEE/ACM International Symposium on Micro-architecture, 5(1), pp. 609-622, 2014) taught a machine-learning supercomputer comprising a plurality of accelerator dice. Each accelerator die contains enough RAM so that the sum of the RAM of all dice can contain the whole neural network, thus requiring no external main memory.
The sixteen tiles 70 in the accelerator 60 have similar architecture.
Although having many advantages, the accelerator of Chen still has several drawbacks. First of all, from a system perspective, even though it does not need an external main memory, this accelerator still needs an external storage for permanently storing the Ws-parameters, because the eDRAM banks 40 only serve as a temporary storage. Before operation, the Ws-parameters still need to be loaded into the eDRAM 40. This takes time. Secondly, each accelerator die 60 contains 32 MB eDRAM for the Ws-parameters. This number, although much larger than that of Tawel, is still quite small for neural networks. A typical neural network contains billions of Ws-parameters. To store all of them inside the eDRAM 40, hundreds of accelerator dice 60 are needed (e.g. 125 accelerator dice for one billion 32-bit Ws-parameters). This is too many for a mobile device. Accordingly, the accelerator 60 is not suitable for mobile applications. Thirdly, the accelerator 60 adopts an asymmetric architecture where the tile area is heavily biased towards storage rather than computation. Inside each tile, eDRAM 40 occupies nearly 80% of the area, whereas the NPU 50 occupies less than 10%. As a result, the computational power per die area is small.
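The die count quoted above can be checked with simple arithmetic. The following Python sketch reproduces it under two assumptions that are not stated in the text: that 1 MB means 10^6 bytes, and that the full 32 MB of eDRAM on each die is available for Ws-parameters. The function name dice_needed is illustrative only.

```python
def dice_needed(num_params: int, bits_per_param: int, edram_bytes_per_die: int) -> int:
    """Smallest number of dice whose combined eDRAM can hold all parameters."""
    total_bits = num_params * bits_per_param
    total_bytes = -(-total_bits // 8)               # integer ceiling division
    return -(-total_bytes // edram_bytes_per_die)   # integer ceiling division


# Assumption: decimal megabytes (1 MB = 10**6 bytes).
EDRAM_BYTES_PER_DIE = 32 * 10**6

# One billion 32-bit Ws-parameters occupy 4 GB, i.e. 125 dice of 32 MB each.
print(dice_needed(10**9, 32, EDRAM_BYTES_PER_DIE))  # 125
```

Under the same assumptions, a network at the upper end of the range mentioned earlier (ten billion 32-bit Ws-parameters) would need ten times as many dice.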
A root cause of the above issues is that the integration between the eDRAM 40 and the NPU 50 is two-dimensional (2-D), i.e. both are formed at the same physical level (i.e. on the substrate). This 2-D integration leads to a dilemma: more computational power per die area means less eDRAM 40 on an accelerator die 60; however, the resulting extra external-memory accesses would void much of the performance gain from the increased computational power. As long as the 2-D integration is used, this dilemma would remain. A fundamentally different integration is desired.
It is a principal object of the present invention to advance the art of neural networks.
It is a principal object of the present invention to improve computational power per die area of a neuro-processor.
It is a principal object of the present invention to improve storage capacity per die area of a neuro-processor.
It is a further object of the present invention to provide a neuro-processor suitable for mobile applications.
In accordance with these and other objects of the present invention, the present invention discloses an integrated neuro-processor comprising at least a three-dimensional memory (3D-M) array.
The present invention discloses an integrated neuro-processor comprising at least a three-dimensional memory (3D-M) array. It not only performs neural processing, but also stores the synaptic weights used thereby. The integrated neuro-processor comprises a plurality of neural storage-processing units (NSPU), with each NSPU comprising a neuro-processing circuit and at least a 3D-M array. The neuro-processing circuit performs neural processing, while the 3D-M array stores the synaptic weights. The 3D-M array is vertically stacked above the neuro-processing circuit. This integration between the 3D-M array and the neuro-processing circuit is referred to as 3-D integration. The 3D-M array is communicatively coupled with the neuro-processing circuit through a plurality of contact vias. These coupling contact vias are collectively referred to as inter-storage-processor (ISP)-connections.
The 3-D integration has a profound effect on the computational power per die area. Because the 3D-M array is vertically stacked above the neuro-processing circuit, the footprint of an NSPU is roughly equal to that of the neuro-processing circuit. This is significantly smaller than in the prior art. For the 2-D integration used by the prior art, the footprint of the tile 70 (equivalent to the NSPU) is roughly equal to the sum of those of the eDRAM 40 (equivalent to the 3D-M array) and the NPU 50 (equivalent to the neuro-processing circuit). Recalling that the NPU 50 occupies less than 10% of the tile area and the eDRAM 40 occupies ~80% of the tile area, it can be concluded that, after moving the memory array storing the synaptic weights from beside the processing circuit to above it, the NSPU could be ~10× smaller than the tile 70 of the prior art. Accordingly, the integrated neuro-processor could contain ~10× more NSPUs per die area than the prior art and is therefore ~10× more computationally powerful. The integrated neuro-processor thus supports more massive parallelism.
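The ~10× figure follows from a simple area ratio. The Python sketch below is a back-of-the-envelope illustration only: it assumes that the NSPU footprint equals the footprint of the processing circuit alone (ignoring any peripheral circuits) and takes the 10% upper bound on the NPU share of the tile as its input. The function name nspu_density_gain is hypothetical.

```python
def nspu_density_gain(npu_fraction_of_tile: float) -> float:
    """Ratio of the prior-art tile footprint to the NSPU footprint.

    Assumes the NSPU footprint is the processing circuit alone, because
    the memory array sits above it rather than beside it.
    """
    if not 0.0 < npu_fraction_of_tile <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / npu_fraction_of_tile


# NPU occupies at most ~10% of the tile, so roughly ten NSPUs fit in one tile area.
print(round(nspu_density_gain(0.10), 6))  # 10.0
```

Because the NPU share is stated as "less than 10%", the value computed here is a lower bound on the gain under these assumptions.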
The 3-D integration also has a profound effect on the storage capacity per die area. Because each 3D-M cell occupies ~4F² die area whereas each eDRAM cell occupies >100F² die area (F is the minimum feature size for a processing node, e.g. 14 nm), 3D-M is more area-efficient. Adding the fact that the 3D-M comprises multiple memory levels (e.g. 4 memory levels) whereas the eDRAM comprises only a single memory level, the integrated neuro-processor has significantly more (~100×) storage capacity per die area than the prior art. Considering that a 3D-XPoint die has a storage capacity of 128 Gb, the integrated neuro-processor can easily store up to 16 GB of synaptic weights. This is more than enough for most AI applications. Because a single integrated neuro-processor die, or a few such dice, can store the synaptic weights of a whole neural network, the integrated neuro-processor is suitable for mobile applications.
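The two figures in the preceding paragraph (~100× density and 16 GB) can be reproduced as follows. This Python sketch assumes one bit per memory cell and uses the lower bound of 100F² for the eDRAM cell; both function names are illustrative and not part of the disclosure.

```python
def area_per_bit_f2(cell_area_f2: float, memory_levels: int) -> float:
    """Effective die area per bit, in units of F^2, assuming one bit per cell."""
    return cell_area_f2 / memory_levels


def gigabits_to_gigabytes(gigabits: float) -> float:
    """Convert a capacity in Gb to GB (8 bits per byte)."""
    return gigabits / 8


edram = area_per_bit_f2(100, 1)   # >100 F^2 per cell, single memory level
three_d_m = area_per_bit_f2(4, 4) # ~4 F^2 per cell, 4 stacked memory levels

print(edram / three_d_m)           # 100.0
print(gigabits_to_gigabytes(128))  # 16.0
```

With four memory levels, the effective area per bit of the 3D-M drops to about 1F², which is where the ~100× ratio comes from.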
Accordingly, the present invention discloses an integrated neuro-processor, comprising: a semiconductor substrate having transistors thereon; an array of neural storage-processing units (NSPU) formed on said semiconductor substrate, each of said NSPUs comprising at least a first three-dimensional memory (3D-M) array and a neuro-processing circuit, wherein said first 3D-M array is stacked above said neuro-processing circuit, said 3D-M array storing at least a synaptic weight; said neuro-processing circuit is formed on said substrate, said neuro-processing circuit performing neural processing with said synaptic weight; said first 3D-M array and said neuro-processing circuit are communicatively coupled by a plurality of contact vias.
The present invention further discloses an integrated neuro-processor, comprising: a semiconductor substrate having transistors thereon; an array of neural storage-processing units (NSPU) formed on said semiconductor substrate, each of said NSPUs comprising at least a first three-dimensional memory (3D-M) array and a neuro-processing circuit, wherein said first 3D-M array is stacked above said neuro-processing circuit, said 3D-M array storing at least a synaptic weight; said neuro-processing circuit is formed on said substrate, said neuro-processing circuit comprising a multiplier, wherein one input of said multiplier is said synaptic weight; said first 3D-M array and said neuro-processing circuit are communicatively coupled by a plurality of contact vias.
It should be noted that this Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
It should be noted that all the drawings are schematic and not drawn to scale. Relative dimensions and proportions of parts of the device structures in the figures have been shown exaggerated or reduced in size for the sake of clarity and convenience in the drawings. The same reference symbols are generally used to refer to corresponding or similar features in the different embodiments.
Throughout the present invention, the phrase “memory” is used in its broadest sense to mean any semiconductor-based holding place for information, either permanent or temporary; the phrase “storage” is used in its broadest sense to mean any permanent holding place for information; the phrase “permanent” is used in its broadest sense to mean any long-term storage; the phrase “communicatively coupled” is used in its broadest sense to mean any coupling whereby information may be passed from one element to another element.
Furthermore, the phrase “on the substrate” means the functional elements of a circuit component (e.g. transistors) are formed on the surface of the substrate, while the interconnects between these functional elements may be formed above the substrate, i.e. they do not touch the substrate. On the other hand, the phrase “above the substrate” means the functional elements (e.g. memory cells) are formed above the substrate, i.e. they do not touch the substrate.
In other publications, the term “neural processing unit” is also referred to as “neural functional unit” and the like; the term “neuro-processor” is also referred to as “accelerator”, “neural-network accelerator”, “machine-learning accelerator” and the like. The symbol “/” means a relationship of “and” or “or”.
Those of ordinary skill in the art will realize that the following description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons from an examination of the within disclosure.
Referring now to
Referring now to
The 3D-M can be categorized into 3D-RAM (random access memory) and 3D-ROM (read-only memory). As used herein, the phrase “RAM” is used in its broadest sense to mean any memory for temporarily holding information, including but not limited to registers, SRAM, and DRAM; the phrase “ROM” is used in its broadest sense to mean any memory for permanently holding information, wherein the information being held could be either electrically alterable or un-alterable. Most 3D-M under development is 3D-ROM. The 3D-ROM is further categorized into 3-D writable memory (3D-W) and 3-D printed memory (3D-P).
For the 3D-W, data can be electrically written (i.e. the memory is programmable). Based on the number of programming cycles allowed, a 3D-W can be categorized into three-dimensional one-time-programmable memory (3D-OTP) and three-dimensional multiple-time-programmable memory (3D-MTP). The 3D-OTP can be written once, while the 3D-MTP is electrically re-programmable. An exemplary 3D-MTP is 3D-XPoint. Other types of 3D-MTP include memristor, resistive random-access memory (RRAM or ReRAM), phase-change memory, programmable metallization cell (PMC), conductive-bridging random-access memory (CBRAM), and the like.
For the 3D-P, data are recorded thereto using a printing method during manufacturing. These data are fixedly recorded and cannot be changed after manufacturing. The printing methods include photo-lithography, nano-imprint, e-beam lithography, DUV lithography, and laser-programming, etc. An exemplary 3D-P is three-dimensional mask-programmed read-only memory (3D-MPROM), whose data are recorded by photo-lithography. Because electrical programming is not required, a memory cell in the 3D-P can be biased at a larger voltage during read than the 3D-W. Thus, the 3D-P is faster in read than the 3D-W.
The 3D-W cell 5aa comprises a programmable layer 12 and a diode layer 14. The programmable layer 12 could be an antifuse layer (which can be programmed once and is used for the 3D-OTP) or a re-programmable layer (which is used for the 3D-MTP). The diode layer 14 is broadly interpreted as any layer whose resistance at the read voltage is substantially lower than when the applied voltage has a magnitude smaller than or polarity opposite to that of the read voltage. The diode could be a semiconductor diode (e.g. p-i-n silicon diode), or a metal-oxide (e.g. TiO2) diode.
Referring now to
Because it is bounded on four sides by the peripheral circuits 15, 15′, 17, 17′, the neuro-processing circuit 180 occupies a small die area and has limited functionality. It is a simple neuro-processing circuit. Clearly, complex neural processing requires a larger processor area.
The embodiment of
The embodiment of
The 3-D integration has a profound effect on the computational power per die area. Because the 3D-M array 170 is vertically stacked above the neuro-processing circuit 180 (
The 3-D integration also has a profound effect on the storage capacity per die area. Because each 3D-M cell occupies ~4F² die area whereas each eDRAM cell occupies >100F² die area (F is the minimum feature size for a processing node, e.g. 14 nm), 3D-M is more area-efficient. Adding the fact that the 3D-M comprises multiple memory levels (e.g. 4 memory levels) whereas the eDRAM comprises only a single memory level, the preferred integrated neuro-processor 200 has significantly more (~100×) storage capacity per die area than the prior art. Considering that a 3D-XPoint die has a storage capacity of 128 Gb, the preferred integrated neuro-processor 200 can easily store up to 16 GB of synaptic weights. This is more than enough for most AI applications. Because a single integrated neuro-processor die, or a few such dice, can store the synaptic weights of a whole neural network, the integrated neuro-processor is suitable for mobile applications.
In the preferred embodiments of
When the storage capacity of the 3D-ROM is large enough (e.g. on the order of GB) so that all values of the synaptic weights can be stored internally, a neuro-processing system (i.e. a system comprising an integrated neuro-processor, e.g. a machine-learning supercomputer) does not need to use an external main memory or an external storage. The synaptic weights can be directly fetched from an internal 3D-M array 170. This simplifies the system design. More importantly, because no data is transferred to and from the external main memory or the external storage, the “memory wall” in the von Neumann architecture is avoided.
Referring now to
In the preferred embodiment of
In the preferred embodiment of
The activation function (e.g. a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function, a tanh function, etc.) controls the amplitude of its output to be between certain values (e.g. between 0 and 1 or between −1 and 1). It is difficult to realize in hardware. Tawel disclosed an activation-function circuit using a look-up table (LUT). It comprises a ROM which stores the LUT of the activation function. As in other prior art, the ROM storing the LUT is formed on the substrate, i.e. on the same physical level as the other components (e.g. RAMs 40X, 40Y, NPU 50) of the neuro-processor. This type of 2-D integration has the same drawbacks as those faced by other prior art. Because the inclusion of the ROM (for the LUT) expands the area of the NPU 50, the computational power per die area is lowered, as is the storage capacity per die area (for the synaptic weights).
Following the same inventive spirit of the present invention, besides storing the synaptic weights, at least a 3D-M array on at least one memory level can be used to store the LUT for the activation function. Because the LUT is to be stored permanently, the 3D-M array is preferably a 3D-ROM array.
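As a purely functional illustration of a LUT-based activation function, the Python sketch below models a table of precomputed sigmoid values (standing in for the contents of a ROM) that is addressed by the quantized result of a multiply-accumulate step. The table size, the input range, the nearest-entry addressing, and all function names are assumptions made for this example; the disclosure does not specify them, and the sketch does not model any circuit.

```python
import math

LUT_SIZE = 256            # hypothetical number of table entries
X_MIN, X_MAX = -8.0, 8.0  # hypothetical input range covered by the table


def build_sigmoid_lut(size=LUT_SIZE, x_min=X_MIN, x_max=X_MAX):
    """Precompute sigmoid at evenly spaced inputs (models the stored LUT)."""
    step = (x_max - x_min) / (size - 1)
    return [1.0 / (1.0 + math.exp(-(x_min + i * step))) for i in range(size)]


def lut_address(x, size=LUT_SIZE, x_min=X_MIN, x_max=X_MAX):
    """Map an input to the nearest table address, clamping out-of-range inputs."""
    position = (x - x_min) / (x_max - x_min) * (size - 1)
    return min(size - 1, max(0, round(position)))


def neuron_output(inputs, weights, lut):
    """Multiply-accumulate followed by a table look-up for the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    acc = sum(w * x for w, x in zip(weights, inputs))  # multiplication and addition only
    return lut[lut_address(acc, len(lut))]


lut = build_sigmoid_lut()
print(neuron_output([1.0, 2.0], [0.5, -0.25], lut))  # close to sigmoid(0) = 0.5
```

In this model the arithmetic path contains only multiplication and addition; the non-linear function is obtained entirely from the table, at the cost of a quantization error that shrinks as the table grows.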
From the simplified cross-sectional view of
Since the activation function is now realized by the 3D-ROM array 196, the computing component 150 becomes quite simple: it only needs to realize multiplication and addition, but not the activation function. As a result, the preferred computing component 150 based on the 3D-ROM LUT occupies a smaller die area than if the activation function were realized otherwise. Thus, the neuro-processing circuit 180 may use the simple neuro-processing circuit of
In
While illustrative embodiments have been shown and described, it would be apparent to those skilled in the art that many more modifications than those mentioned above are possible without departing from the inventive concepts set forth herein. The invention, therefore, is not to be limited except in the spirit of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
201610162375.7 | Mar 2016 | CN | national |
201610180422.0 | Mar 2016 | CN | national |
201610182229.0 | Mar 2016 | CN | national |
Number | Name | Date | Kind |
---|---|---|---|
5519811 | Yoneda | May 1996 | A |
5627943 | Yoneda | May 1997 | A |
5835396 | Zhang | Nov 1998 | A |
6034882 | Johnson et al. | Mar 2000 | A |
6067536 | Maruyama et al. | May 2000 | A |
6199057 | Tawel | Mar 2001 | B1 |
6717222 | Zhang | Apr 2004 | B2 |
6861715 | Zhang | Mar 2005 | B2 |
7450414 | Scheuerlein | Nov 2008 | B2 |
8450781 | Rothberg | May 2013 | B2 |
8700552 | Yu et al. | Apr 2014 | B2 |
9153230 | Maaninen | Oct 2015 | B2 |
20050231855 | Tran | Oct 2005 | A1 |
20090175104 | Leedy | Jul 2009 | A1 |
20130072775 | Rogers | Mar 2013 | A1 |
20140275887 | Batchelder | Sep 2014 | A1 |
20150339570 | Scheffler | Nov 2015 | A1 |
20170060697 | Berke | Mar 2017 | A1 |
20190164038 | Zhang | May 2019 | A1 |
Entry |
---|
Chen et al. “DaDianNao: A Machine-Learning Supercomputer”, IEEE/ACM International Symposium on Micro-architecture, 5(1), pp. 609-622, 2014. |
Number | Date | Country | |
---|---|---|---|
20170270403 A1 | Sep 2017 | US |