The present invention relates to the field of integrated circuits, and more particularly to processors.
Conventional processors use logic-based computation (LBC), which carries out computation primarily with logic circuits (e.g. XOR circuits). Logic circuits are suitable for arithmetic functions, whose operations consist of basic arithmetic operations only, i.e. addition, subtraction and multiplication. However, logic circuits are not suitable for non-arithmetic functions, whose operations go beyond the basic arithmetic operations that conventional logic circuits can perform. Exemplary non-arithmetic functions include transcendental functions and special functions. Non-arithmetic functions are computationally hard, and their hardware implementation has been a major challenge.
A complex function is a non-arithmetic function with multiple independent variables (independent variable is also known as input variable or argument). It can be expressed as a combination of basic functions. A basic function is a non-arithmetic function with a single independent variable. Exemplary basic functions include basic transcendental functions, such as exponential function (exp), logarithmic function (log), trigonometric functions (sin, cos, tan, atan) and others.
For the conventional processors, all complex functions and most basic functions are implemented by software; only a small number of basic functions (e.g. basic algebraic functions and basic transcendental functions) are implemented by hardware, which are referred to as built-in functions. These built-in functions are realized by a combination of arithmetic operations and look-up tables (LUT). For example, U.S. Pat. No. 5,954,787 issued to Eun on Sep. 21, 1999 taught a method for generating sine/cosine functions using look-up tables; U.S. Pat. No. 9,207,910 issued to Azadet et al. on Dec. 8, 2015 taught a method for calculating a power function using LUTs.
Realization of built-in functions is further illustrated in
The 2-D integration puts stringent requirements on the manufacturing process. As is well known in the art, the memory transistors in the LUT 200X are vastly different from the logic transistors in the ALC 100X. The memory transistors have stringent requirements on leakage current, while the logic transistors have stringent requirements on drive current. To form high-performance memory transistors and high-performance logic transistors at the same time is a challenge.
The 2-D integration also limits computational density and computational complexity. Computation has been developed towards higher computational density and greater computational complexity. The computational density, i.e. the computational power (e.g. the number of floating-point operations per second) per die area, is a figure of merit for parallel computation. The computational complexity, i.e. the total number of built-in functions supported by a processor, is a figure of merit for scientific computation. For the 2-D integration, inclusion of the LUT 200X increases the die size of the conventional processor 00X and lowers its computational density. This has an adverse effect on parallel computation. Moreover, because the ALU 100X, as the primary component of the conventional processor 00X, occupies a large die area, the LUT 200X is left with only a small die area and therefore, supports few built-in functions.
The LBC-based processor 00X suffers from another drawback. Because different logic circuits are used to realize different built-in functions, the processor 00X is fully customized. In other words, once its design is complete, the processor 00X can only realize a fixed set of pre-defined built-in functions. Configurable computation, in which the same hardware can realize different mathematical functions under the control of a set of configuration signals, is more desirable.
In the past, configurable logic, i.e. the same hardware realizing different logic functions under the control of a set of configuration signals, was realized by a configurable gate array, which is also known as field-programmable gate array (FPGA), complex programmable logic device (CPLD), or other names. U.S. Pat. No. 4,870,302 issued to Freeman on Sep. 26, 1989 (hereinafter Freeman) discloses a configurable gate array. It comprises an array of configurable logic elements and a hierarchy of configurable interconnects that allow the configurable logic elements to be wired together. In the prior-art configurable gate arrays, only logic functions are configurable; mathematical functions are not. A small number of mathematical functions (i.e. built-in functions) are realized in fixed computing elements, which are part of hard blocks. Namely, the circuits realizing these built-in functions are fixedly connected and are not subject to change by programming. Clearly, fixed computing elements limit further applications of the configurable gate array. To overcome this difficulty, the present invention expands the original concept of the configurable gate array by making the fixed computing elements configurable.
It is a principal object of the present invention to realize configurable computation.
It is a further object of the present invention to realize field-configurable computation.
It is a further object of the present invention to realize re-configurable computation.
It is a further object of the present invention to realize configurable computation for complex functions.
It is a further object of the present invention to provide a configurable processor with a greater computational complexity.
It is a further object of the present invention to provide a configurable processor with a higher computational density.
It is a further object of the present invention to provide a configurable gate array with a greater computational flexibility.
In accordance with these and other objects of the present invention, the present invention discloses a configurable processor.
The present invention discloses a configurable processor with a backside look-up table (BS-LUT), i.e. a BS-LUT configurable processor. The preferred BS-LUT configurable processor comprises a processor substrate and a plurality of configurable computing elements. Each configurable computing element comprises at least a programmable memory array on the backside of the processor substrate and at least an arithmetic logic circuit (ALC) on the front side of the processor substrate. The programmable memory array stores at least a portion of a look-up table (LUT) for a mathematical function, which includes numerical values related to said mathematical function (e.g. functional values and/or derivative values thereof), while the ALC performs arithmetic operations on selected data from the LUT. In general, the front side of the processor substrate comprises the ALCs of a plurality of configurable computing elements, while the backside of the processor substrate comprises the programmable memory arrays of another plurality of configurable computing elements. The programmable memory array on the backside and the ALC on the front side are communicatively coupled by a plurality of through-silicon vias (TSVs). Located on the opposite side of the ALC, the LUT is referred to as backside LUT (BS-LUT). It should be apparent to those skilled in the art that the roles of the front side and backside can be reversed in other preferred embodiments, i.e. the programmable memory array can be formed on the front side of the processor substrate while the ALC can be formed on the backside of the processor substrate.
The preferred BS-LUT configurable processor uses memory-based computation (MBC), which realizes mathematical functions primarily with the LUT. Compared with the LUT used by the conventional processor, the BS-LUT used by the preferred BS-LUT configurable processor has a much larger capacity. Although arithmetic operations are still performed, the MBC only needs to calculate a polynomial to a much lower order because it uses a much larger BS-LUT as a starting point for computation. For the MBC, the fraction of computation done by the BS-LUT is greater than that done by the ALC.
Each usage cycle of the BS-LUT configurable processor comprises two stages: a configuration stage and a computation stage. In the configuration stage, the LUT for a desired mathematical function is written into the programmable memory array. In the computation stage, selected values of the mathematical function are read out from the programmable memory array. The BS-LUT configurable processor can realize field-configurable computation and re-configurable computation. For the field-configurable computation, a mathematical function is realized by writing its LUT into the programmable memory array in the field of use. For re-configurable computation, the programmable memory array is re-programmable and different mathematical functions can be realized by writing different LUTs for different mathematical functions thereto during different usage cycles. For example, during a first usage cycle, a first LUT for a first mathematical function is written into the re-programmable memory array; during a second usage cycle, a second LUT for a second mathematical function is written into the re-programmable memory array.
Because the ALC and the LUT are located on opposite sides of the processor substrate, this type of vertical integration is referred to as double-sided integration. The double-sided integration has a profound effect on the computational density and computational complexity. For the conventional 2-D integration, the footprint of a conventional processor 00X is roughly equal to the sum of those of the ALU 100X and the LUT 200X. On the other hand, because the double-sided integration moves the LUT from beside the ALC to the backside, the BS-LUT processor becomes smaller and computationally more powerful. In addition, the total LUT capacity of the conventional processor 00X is less than 100 Kb, whereas the total BS-LUT capacity for the BS-LUT processor could reach 100 Gb. Consequently, a single BS-LUT processor could support as many as 10,000 built-in functions (including various types of complex functions), far more than the conventional processor 00X. Furthermore, because the ALC and the LUT are on different sides of the processor substrate, the logic transistors in the ALC and the memory transistors in the LUT are formed in separate processing steps, which can be individually optimized.
To further improve configurability, the present invention further discloses a preferred BS-LUT configurable computing array for implementing complex functions. It is a special type of the BS-LUT configurable processor and comprises an array of configurable computing elements, an array of configurable logic elements and a plurality of configurable interconnects. Each configurable computing element comprises at least a programmable memory array for storing the LUT for a mathematical function and at least an ALC for performing arithmetic operations on selected data from the LUT. The configurable logic elements and configurable interconnects in the BS-LUT configurable computing array are similar to those in the conventional configurable gate array. During computation, a complex function is first decomposed into a combination of basic functions. Each basic function is then realized by an associated configurable computing element. Finally, the complex function is realized by programming the corresponding configurable logic elements and configurable interconnects.
Accordingly, the present invention discloses a configurable processor comprising a semiconductor substrate including a first side and a second side opposite to said first side and a plurality of configurable computing elements on said semiconductor substrate, each of said configurable computing elements comprising: at least a programmable memory array on said first side for storing at least a portion of a look-up table (LUT) for a mathematical function; at least an arithmetic logic circuit (ALC) on said second side for performing at least an arithmetic operation on selected data from said LUT; and means for communicatively coupling said programmable memory array and said ALC; wherein said mathematical function includes more operations than arithmetic operations performable by said ALC.
The present invention further discloses a configurable processor for implementing a mathematical function, comprising: a semiconductor substrate comprising a first side and a second side opposite to said first side; at least first and second programmable memory arrays on said first side, wherein said first programmable memory array stores at least a first portion of a first look-up table (LUT) for a first mathematical function; and, said second programmable memory array stores at least a second portion of a second LUT for a second mathematical function; at least an arithmetic logic circuit (ALC) on said second side for performing at least an arithmetic operation on selected data from said first or second LUT; and means for communicatively coupling said first or second programmable memory array with said ALC; wherein said mathematical function is a combination of at least said first and second mathematical functions; and, each of said first and second mathematical functions includes more operations than arithmetic operations performable by said ALC.
The present invention further discloses a configurable computing array for implementing a mathematical function, comprising: a semiconductor substrate comprising a first side and a second side opposite to said first side; at least an array of configurable computing elements comprising at least a first programmable memory array, a second programmable memory array and an arithmetic logic circuit (ALC), wherein said first programmable memory array stores at least a first portion of a first look-up table (LUT) for a first mathematical function; said second programmable memory array stores at least a second portion of a second LUT for a second mathematical function; and, said ALC performs at least an arithmetic operation on selected data from said first or second LUT; at least an array of configurable logic elements including a configurable logic element for selectively realizing a logic function in a logic library, wherein said first and second programmable memory arrays are located on said first side; and, either said configurable logic element or said ALC is located on said second side; means for communicatively coupling said configurable computing elements and said configurable logic elements; whereby said configurable computing array realizes said mathematical function by programming said configurable computing elements and said configurable logic elements, wherein said mathematical function is a combination of at least said first and second mathematical functions; wherein each of said first and second mathematical functions includes more operations than arithmetic operations included in said logic library; and, each of said first and second mathematical functions includes more operations than arithmetic operations performable by said ALC.
It should be noted that all the drawings are schematic and not drawn to scale. Relative dimensions and proportions of parts of the device structures in the figures have been shown exaggerated or reduced in size for the sake of clarity and convenience in the drawings. The same reference symbols are generally used to refer to corresponding or similar features in the different embodiments.
Throughout this specification, the phrase “mathematical functions” refers to non-arithmetic functions only; the phrase “memory” is used in its broadest sense to mean any semiconductor-based holding place for information, either permanent or temporary; the phrase “permanent” is used in its broadest sense to mean any long-term storage; the phrase “communicatively coupled” is used in its broadest sense to mean any coupling whereby information may be passed from one element to another element; the term “LUT” (or, “BS-LUT”) could refer to the logic look-up table (LUT) stored in the programmable memory array(s), or the physical LUT circuit in the form of the programmable memory array(s), depending on the context; the symbol “/” means a relationship of “and” or “or”.
Those of ordinary skill in the art will realize that the following description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons from an examination of the within disclosure.
Referring now to
The configurable computing element 300-i comprises at least a programmable memory array 170 and an arithmetic logic circuit (ALC) 180, which are communicatively coupled by connections 160 (
Each usage cycle of the BS-LUT configurable processor 300 comprises two stages: a configuration stage and a computation stage. In the configuration stage, the LUT for a desired mathematical function is written into the programmable memory array 170. In the computation stage, selected values of the mathematical function are read out from the programmable memory array 170. The BS-LUT configurable processor 300 can be used to realize field-configurable computation and re-configurable computation. For the field-configurable computation, a mathematical function is realized by writing its LUT into the programmable memory array 170 in the field of use. For re-configurable computation, the programmable memory array 170 is re-programmable and different mathematical functions can be realized by writing different LUTs for different mathematical functions into the re-programmable memory array 170. For example, during a first usage cycle, a first LUT for a first mathematical function is written into the re-programmable memory array 170; during a second usage cycle, a second LUT for a second mathematical function is written into the re-programmable memory array 170.
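The two-stage usage cycle described above can be illustrated with a minimal software sketch. This is a toy model under stated assumptions, not the disclosed hardware: the class name `ConfigurableComputingElement`, the method names `configure` and `compute`, the 4096-entry table and the input range [0, 1) are all hypothetical choices made for illustration, and a plain Python list stands in for the programmable memory array 170.

```python
import math
from typing import Callable, List, Optional


class ConfigurableComputingElement:
    """Toy software model of one configurable computing element (names are hypothetical)."""

    def __init__(self, entries: int, lo: float, hi: float) -> None:
        self.entries = entries
        self.lo = lo
        self.hi = hi
        self.step = (hi - lo) / entries
        # Stands in for the programmable memory array; empty until configured.
        self.memory: Optional[List[float]] = None

    def configure(self, func: Callable[[float], float]) -> None:
        """Configuration stage: write the LUT for `func` into the memory array."""
        self.memory = [func(self.lo + i * self.step) for i in range(self.entries)]

    def compute(self, x: float) -> float:
        """Computation stage: read out the stored value for the cell containing x."""
        if self.memory is None:
            raise RuntimeError("element has not been configured")
        if not (self.lo <= x < self.hi):
            raise ValueError("input outside the configured range")
        index = min(int((x - self.lo) / self.step), self.entries - 1)
        return self.memory[index]


element = ConfigurableComputingElement(entries=1 << 12, lo=0.0, hi=1.0)

# First usage cycle: the element realizes a first mathematical function.
element.configure(math.sin)
sin_value = element.compute(0.5)

# Second usage cycle: the same element is re-programmed for a second function.
element.configure(math.exp)
exp_value = element.compute(0.5)
```

In this sketch the same object returns values of sin after the first configuration and values of exp after the second, which mirrors re-configurable computation. The read-out here is a bare table look-up; the arithmetic refinement attributed to the ALC 180 is deliberately left out.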
In the preferred configurable computing element 300-i, the ALC 180 is formed on the front side 0F of the processor substrate 0S, while the programmable memory array 170 is formed on the backside 0B of the processor substrate 0S (
The BS-LUT configurable processor 300 uses memory-based computation (MBC), which realizes mathematical functions primarily with the LUT. Compared with the LUT 200X used by the conventional processor 00X, the BS-LUT 170 used by the BS-LUT configurable processor 300 has a much larger capacity. Although arithmetic operations are still performed, the MBC only needs to calculate a polynomial to a much lower order because it uses a much larger BS-LUT 170 as a starting point for computation. For the MBC, the fraction of computation done by the BS-LUT 170 is greater than that done by the ALC 180.
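The trade-off between LUT size and polynomial order can be made concrete with a rough estimate. The sketch below is an illustration only and rests on several assumptions that are not taken from this specification: the input range has unit length and is divided into equal cells, the function is expanded as a Taylor series about the left edge of each cell, every derivative of the function is bounded by 1 in magnitude (true for sin and cos), and the target error is 2^-24. The function name `min_taylor_order` is hypothetical.

```python
import math


def min_taylor_order(entries: int, tol: float = 2.0 ** -24, span: float = 1.0) -> int:
    """Smallest Taylor order whose Lagrange remainder bound falls below `tol`.

    Assumes every derivative is bounded by 1 in magnitude, so the remainder
    after order k is at most h**(k + 1) / (k + 1)!, where h is the cell width.
    """
    h = span / entries
    order = 0
    while h ** (order + 1) / math.factorial(order + 1) >= tol:
        order += 1
    return order


# Required order for a small, a medium and a large table.
orders = {entries: min_taylor_order(entries) for entries in (1 << 4, 1 << 8, 1 << 16)}
```

Under these assumptions a 16-entry table needs a fourth-order polynomial, a 256-entry table needs second order, and a 65,536-entry table needs only first order, which is the sense in which a larger LUT shifts work from the arithmetic circuit to the memory.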
Referring now to
Because the ALC 180 and the LUT 170 are formed on opposite sides 0F, 0B of the processor substrate 0S, this type of vertical integration is referred to as double-sided integration. The double-sided integration has a profound effect on the computational density and computational complexity. For the conventional 2-D integration, the footprint of a conventional processor 00X is roughly equal to the sum of those of the ALU 100X and the LUT 200X. On the other hand, because the double-sided integration moves the LUT from beside the ALC to the backside 0B, the BS-LUT processor 300 becomes smaller and computationally more powerful. In addition, the total LUT capacity of the conventional processor 00X is less than 100 Kb, whereas the total BS-LUT capacity for the BS-LUT processor 300 could reach 100 Gb. Consequently, a single BS-LUT processor 300 could support as many as 10,000 built-in functions (including various types of complex functions), far more than the conventional processor 00X. Moreover, the double-sided integration can improve the communication throughput between the BS-LUT 170 and the ALC 180. Because they are physically close and coupled by a large number of TSVs 160, the BS-LUT 170 and the ALC 180 have a larger communication throughput than that between the LUT 200X and the ALU 100X in the conventional processor 00X. Lastly, the double-sided integration benefits the manufacturing process. Because the ALC 180 and the LUT 170 are on different sides 0F, 0B of the processor substrate 0S, the logic transistors in the ALC 180 and the memory transistors in the LUT 170 may be formed in separate processing steps, which can be individually optimized.
Referring now to
When realizing a mathematical function, combining the LUT with polynomial interpolation can achieve a high precision without using an excessively large LUT. For example, if only LUT (without any polynomial interpolation) is used to realize a single-precision function (32-bit input and 32-bit output), it would have a capacity of 2^32 × 32 bits = 128 Gb. By including polynomial interpolation, significantly smaller LUTs can be used. In the above embodiment, a single-precision function can be realized using a total of 4 Mb LUT (2 Mb for the functional values, and 2 Mb for the first-order derivative values) in conjunction with a first-order Taylor series. This is significantly less than the LUT-only approach (4 Mb vs. 128 Gb).
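The capacity comparison and the first-order Taylor evaluation can be sketched in software as follows. The sketch is illustrative and makes assumptions beyond the text: each table is taken to have 2^16 entries of 32 bits (one reading that yields 2 Mb per table with 1 Mb = 2^20 bits), the input is restricted to [0, 1), sin is used as the example function, and the tables hold ordinary Python floats rather than 32-bit words. The names `value_lut`, `deriv_lut` and `lut_taylor_sin` are hypothetical.

```python
import math

ADDRESS_BITS = 16                 # assumed: 2**16 entries per table
WORD_BITS = 32                    # single-precision word width used for sizing
ENTRIES = 1 << ADDRESS_BITS
STEP = 1.0 / ENTRIES              # assumed input range [0, 1)

# One table of functional values and one of first-order derivative values.
value_lut = [math.sin(i * STEP) for i in range(ENTRIES)]
deriv_lut = [math.cos(i * STEP) for i in range(ENTRIES)]


def lut_taylor_sin(x: float) -> float:
    """Approximate sin(x) as f(x0) + f'(x0) * (x - x0), with x0 the cell's left edge."""
    if not (0.0 <= x < 1.0):
        raise ValueError("input outside the tabulated range")
    index = int(x * ENTRIES)      # memory read address
    dx = x - index * STEP         # offset within the cell
    return value_lut[index] + deriv_lut[index] * dx   # one multiply, one add


# Capacity arithmetic, using binary prefixes (1 Mb = 2**20 bits, 1 Gb = 2**30 bits).
lut_only_bits = (1 << 32) * WORD_BITS        # one 32-bit word per 32-bit input
lut_taylor_bits = 2 * ENTRIES * WORD_BITS    # value table plus derivative table
lut_only_gb = lut_only_bits // (1 << 30)     # 128
lut_taylor_mb = lut_taylor_bits // (1 << 20) # 4
```

With a cell width of 2^-16, the neglected second-order term is at most about 1.2 × 10^-10 for sin, well below single-precision resolution, while the arithmetic per evaluation is limited to one multiplication and one addition.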
Besides transcendental functions, the preferred embodiment of
Referring now to
The configurable computing elements 300AA-300BD are similar to those in the BS-LUT configurable processor 300 (
The first preferred BS-LUT configurable computing array 700 can realize a complex function by programming the configurable logic elements 400AA-400BD and the configurable computing elements 300AA-300BD. The complex function is a combination of basic functions, which can be implemented by selected configurable computing elements. The mathematical operations included in each basic function are not only more than the arithmetic operations included in the logic library of the configurable logic elements 400AA-400BD, but also more than the arithmetic operations performable by the ALC 180. In general, the arithmetic operations included in the logic library consist of addition and subtraction; and, the arithmetic operations performable by the ALC 180 consist of addition, subtraction and multiplication.
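The idea of composing a complex function from basic functions can be sketched in software. The example uses e = a·sin(b) + c·cos(d), the four-variable function discussed elsewhere in this description. It is only one possible decomposition under stated assumptions and does not reproduce the mapping onto the configurable computing elements 300AA-300BD or the configurable logic elements 400AA-400BD: the helper `make_basic_element`, the 4096-entry tables and the input range [0, 2) are hypothetical, and the multiplications and the addition are written as ordinary Python arithmetic.

```python
import math
from typing import Callable, List


def make_basic_element(func: Callable[[float], float],
                       entries: int = 1 << 12,
                       lo: float = 0.0,
                       hi: float = 2.0) -> Callable[[float], float]:
    """Build a table for a single-variable basic function and return its read-out."""
    step = (hi - lo) / entries
    values: List[float] = [func(lo + i * step) for i in range(entries)]

    def read(x: float) -> float:
        if not (lo <= x < hi):
            raise ValueError("input outside the tabulated range")
        index = min(int((x - lo) / step), entries - 1)
        return values[index]

    return read


# Each basic function is realized by its own element.
sin_element = make_basic_element(math.sin)
cos_element = make_basic_element(math.cos)


def complex_function(a: float, b: float, c: float, d: float) -> float:
    """Combine the basic-function outputs into e = a*sin(b) + c*cos(d)."""
    # Routing b and d to the elements stands in for the configurable interconnects.
    return a * sin_element(b) + c * cos_element(d)


e = complex_function(1.5, 0.3, 2.0, 1.2)
```

Each table here is indexed by a single variable, so the storage grows with the number of basic functions rather than with the product of all input ranges.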
In one preferred BS-LUT configurable computing array 700, the programmable memory arrays 170 of the configurable computing elements 300AA-300BD are located on the backside 0B of the processor substrate 0S, while the configurable logic elements 400AA-400BD are located on the front side 0F of the processor substrate 0S. The ALCs 180 may be located on the front side 0F, together with the configurable logic elements 400AA-400BD. Alternatively, the ALCs 180 may be located on the backside 0B, together with the programmable memory arrays 170. The programmable memory arrays 170 and the configurable logic elements 400AA-400BD preferably at least partially overlap. It should be apparent to those skilled in the art that the programmable memory array 170 may be located on the front side 0F of the processor substrate 0S, while the configurable logic elements 400AA-400BD may be located on the backside 0B of the processor substrate 0S.
The first preferred BS-LUT configurable computing array 700 is particularly suitable for realizing complex functions. If only LUT is used to realize the above 4-variable function, i.e. e=a·sin(b)+c·cos(d), an enormous LUT is needed: 2^16 × 2^16 × 2^16 × 2^16 × 16 bits = 256 Eb even for half precision, which is impractical. Using the BS-LUT configurable computing array 700, only 8 Mb of LUT (8 configurable computing elements, each with 1 Mb capacity) is needed to realize this 4-variable function. It should be apparent to those skilled in the art that the first preferred BS-LUT configurable computing array 700 can be used to realize other complex functions.
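The capacity figures above can be checked with a few lines of integer arithmetic. The sketch assumes binary prefixes (1 Mb = 2^20 bits, 1 Eb = 2^60 bits), which is the convention that reproduces the stated numbers; the variable names are illustrative only.

```python
BITS_PER_VARIABLE = 16      # half-precision input width
OUTPUT_BITS = 16            # half-precision output width
VARIABLES = 4               # a, b, c and d

# A single LUT addressed by all four inputs at once.
direct_lut_bits = (2 ** BITS_PER_VARIABLE) ** VARIABLES * OUTPUT_BITS
direct_lut_eb = direct_lut_bits // 2 ** 60      # in Eb, assuming 1 Eb = 2**60 bits

# Eight configurable computing elements of 1 Mb each.
ELEMENTS = 8
ELEMENT_BITS = 2 ** 20                          # 1 Mb, assuming 1 Mb = 2**20 bits
array_lut_bits = ELEMENTS * ELEMENT_BITS
array_lut_mb = array_lut_bits // 2 ** 20

# How many times smaller the decomposed storage is.
reduction_factor = direct_lut_bits // array_lut_bits
```

Under these conventions the direct table needs 2^68 bits (256 Eb) against 2^23 bits (8 Mb) for the array, a reduction by a factor of 2^45.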
Referring now to
While illustrative embodiments have been shown and described, it would be apparent to those skilled in the art that many more modifications than those mentioned above are possible without departing from the inventive concepts set forth herein. For example, the BS-LUT configurable processor of the present invention could be a micro-controller, a controller, a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), a network-security processor, an encryption/decryption processor, an encoding/decoding processor, a neural-network processor, or an artificial intelligence (AI) processor. These BS-LUT configurable processors can be found in consumer electronic devices (e.g. personal computers, video game machines, smart phones) as well as engineering and scientific workstations and server machines. The invention, therefore, is not to be limited except in the spirit of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
201610300576.9 | May 2016 | CN | national |
201710311013.4 | May 2017 | CN | national |
This application is a continuation-in-part of U.S. patent application Ser. No. 15/588,643, filed May 6, 2017, which claims priority from Chinese Patent Application 201610300576.9, filed May 7, 2016, and Chinese Patent Application 201710311013.4, filed May 5, 2017, in the State Intellectual Property Office of the People's Republic of China (CN), the disclosures of which are incorporated herein by reference in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15588643 | May 2017 | US |
Child | 16203602 | | US |