This relates to integrated circuits and, more particularly, to programmable integrated circuits.
Programmable integrated circuits are a type of integrated circuit that can be programmed by a user to implement a desired custom logic function. In a typical scenario, a logic designer uses computer-aided design tools to design a custom logic circuit. When the design process is complete, the computer-aided design tools generate configuration data. The configuration data is loaded into memory elements on a programmable integrated circuit to configure the device to perform the functions of the custom logic circuit.
Configuration data may be supplied to a programmable device in the form of a configuration bit stream. After a first configuration bit stream has been loaded onto a programmable device, the programmable device may be reconfigured by loading a different configuration bit stream in a process known as reconfiguration. An entire set of configuration data is often loaded during reconfiguration.
Programmable devices may be used for coprocessing in big-data or fast-data applications. For example, programmable devices may be used in application acceleration tasks in a datacenter and may be reprogrammed during datacenter operation to perform different tasks. However, the speed of reconfiguration of programmable devices is traditionally several orders of magnitude slower than the desired rate of virtualization in datacenters. Moreover, on-chip caching or buffering of pre-fetched configuration bit-streams to hide the latency of reconfiguration is undesirably expensive in terms of silicon real estate. Additionally, repeated fetching of configuration bit-streams from off-chip storage via the entire configuration circuit chain is energy intensive.
Situations frequently arise where it would be desirable to design and implement programmable devices with off-chip memory that enables improved reconfiguration speed and reduced energy consumption.
It is within this context that the embodiments herein arise.
It is appreciated that the present invention can be implemented in numerous ways, such as a process, an apparatus, a system, a device, or a method on a computer readable medium. Several inventive embodiments of the present invention are described below.
A system may include a host processor and an integrated circuit package. The integrated circuit package may include a package substrate, an active interposer mounted on the package substrate, an integrated circuit (e.g., a coprocessor) mounted on the active interposer, and an auxiliary chip. The auxiliary chip may be mounted on the package substrate or, if desired, on the interposer. The interposer and the auxiliary chip may each contain memory elements for storing configuration data (e.g., configuration bit streams) for configuring programmable circuitry on the integrated circuit to perform a variety of tasks. The integrated circuit package may also include a heat sink that is attached to and in contact with the coprocessor integrated circuit. In some instances, the heat sink may also be placed in contact with the auxiliary chip.
The programmable circuitry of the integrated circuit may include multiple logic sectors that are coupled to respective associated logic sector managers. These logic sector managers may help retrieve the configuration bit streams from the memory elements of the interposer and the auxiliary chip. Each of the logic sectors may include an array of memory cells (e.g., configuration random access memory cells), an address register coupled to the array of memory cells, and a data register coupled to the array of memory cells.
The integrated circuit, the interposer, and the auxiliary chip may each include an active layer having an active side at which transistor circuitry (e.g., memory elements and programmable logic circuitry) is formed and an inactive layer (e.g., a bulk semiconductor layer) having an inactive back side. In some embodiments, the active sides of the integrated circuit and the interposer may be facing one another to facilitate faster reconfiguration of programmable circuitry on the integrated circuit. In these embodiments, the interposer may communicate with the package substrate using through silicon vias in the inactive layer of the interposer and may communicate with the integrated circuit through microbumps interposed between the active side of the integrated circuit and the active side of the interposer.
In other embodiments, the active side of the interposer may be facing the package substrate and the backside of the interposer may be facing the active side of the integrated circuit. In these embodiments, the interposer may communicate with the integrated circuit using through silicon vias in the inactive layer of the interposer.
In some embodiments, the auxiliary chip may be mounted on the package substrate adjacent to the interposer. In these embodiments, the active side of the auxiliary chip may be facing the package substrate and may be electronically coupled to the interposer through an embedded multi-die interconnect bridge in the package substrate. The backside of the auxiliary chip may be in contact with the heat sink.
In other embodiments, the auxiliary chip may be interposed between the package substrate and the integrated circuit. In these embodiments, the active side of the auxiliary chip may be facing the active side of the integrated circuit, and the auxiliary chip may communicate with the package substrate using through silicon vias in the inactive layer of the auxiliary chip.
In other embodiments, the auxiliary chip may be mounted on the interposer. In these embodiments, the active side of the auxiliary chip may be facing the interposer and the backside of the auxiliary chip may be in contact with the heat sink.
When new configuration bit streams are received by the integrated circuit package, the bit streams may be stored in the memory elements of the interposer or in the memory elements of the auxiliary chip. For instances in which the interposer includes decryption/decompression circuitry, the received configuration bit streams may be decrypted and decompressed using the decryption/decompression circuitry in the interposer. For instances in which the integrated circuit includes decryption/decompression circuitry, the received configuration bit streams may instead be decrypted and decompressed at the integrated circuit before being passed back to the memory elements of the interposer or the memory elements of the auxiliary chip for storage. Some or all of the stored configuration bit streams may then be sequentially loaded onto one or more logic sectors within the programmable circuitry of the integrated circuit.
When a configuration bit stream is requested for configuring the integrated circuit die, a request may be sent to the interposer and optionally to the auxiliary chip to determine whether the requested configuration bit stream is stored in the memory elements of the interposer or in the memory elements of the auxiliary chip. In response to determining that the requested configuration bit stream is missing from both the memory elements of the interposer and the memory elements of the auxiliary chip, the integrated circuit may request the configuration bit stream from off-package (e.g., from the host processor or from an external memory). Otherwise, if the requested configuration bit stream is stored on the memory elements of the auxiliary chip or on the memory elements of the interposer, then the requested configuration bit stream may be sequentially loaded onto the programmable circuitry of the integrated circuit die.
Further features of the invention, its nature and various advantages will be more apparent from the accompanying drawings and following detailed description.
Embodiments of the present invention relate to integrated circuits and, more particularly, to programmable integrated circuits. It will be recognized by one skilled in the art that the present exemplary embodiments may be practiced without some or all of these specific details. In other instances, well-known operations have not been described in detail in order not to unnecessarily obscure the present embodiments.
Programmable integrated circuits use programmable memory elements to store configuration data. Configuration data may be generated based on source code corresponding to application-specific tasks to be performed in parallel on the programmable integrated circuit. During programming of a programmable integrated circuit, configuration data is loaded into the memory elements. The memory elements may be organized in arrays having numerous rows and columns. For example, memory array circuitry may be formed in hundreds or thousands of rows and columns on a programmable logic device integrated circuit.
During normal operation of the programmable integrated circuit, each memory element provides a static output signal. The static output signals that are supplied by the memory elements serve as control signals. These control signals are applied to programmable logic on the integrated circuit to customize the programmable logic to perform a desired logic function.
It may sometimes be desirable to configure or reconfigure the programmable integrated circuit as an accelerator circuit to efficiently perform parallel processing tasks. The accelerator circuit may include multiple columns of soft processors of various types that are specialized for different types of parallel tasks. The accelerator circuit may be dynamically reconfigured to optimally assign and perform the parallel tasks.
An illustrative programmable integrated circuit such as programmable logic device (PLD) 10 is shown in
Programmable integrated circuit 10 contains memory elements 20 that can be loaded with configuration data (also called programming data) using pins 14 and input-output circuitry 12. Once loaded, the memory elements 20 may each provide a corresponding static control output signal that controls the state of an associated logic component in programmable logic 18. Typically, the memory element output signals are used to control the gates of metal-oxide-semiconductor (MOS) transistors. Some of the transistors may be p-channel metal-oxide-semiconductor (PMOS) transistors. Many of these transistors may be n-channel metal-oxide-semiconductor (NMOS) pass transistors in programmable components such as multiplexers. When a memory element output is high, an NMOS pass transistor controlled by that memory element will be turned on to pass logic signals from its input to its output. When the memory element output is low, the pass transistor is turned off and does not pass logic signals.
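As a rough illustration of how static memory element outputs can steer a pass-transistor multiplexer, the following Python sketch models each NMOS pass transistor as a switch gated by one configuration bit. The function names and the one-hot encoding (exactly one memory element high per multiplexer) are assumptions made for the example and are not taken from the embodiments.

```python
from typing import List, Optional


def nmos_pass(gate: int, signal: int) -> Optional[int]:
    """Model an NMOS pass transistor: pass the signal when the gate is high.

    Returns None when the transistor is off (no signal is passed).
    """
    return signal if gate == 1 else None


def pass_mux(inputs: List[int], cram_bits: List[int]) -> int:
    """Hypothetical multiplexer built from one pass transistor per input.

    Assumption: the configuration bits are one-hot, so exactly one
    pass transistor conducts and drives the output.
    """
    if len(inputs) != len(cram_bits):
        raise ValueError("one CRAM bit is needed per input")
    if sum(cram_bits) != 1:
        raise ValueError("exactly one CRAM bit must be high (one-hot assumption)")
    for gate, signal in zip(cram_bits, inputs):
        out = nmos_pass(gate, signal)
        if out is not None:
            return out
    raise AssertionError("unreachable: one gate is high")
```

Under these assumptions, `pass_mux([0, 1, 1, 0], [0, 0, 1, 0])` selects the third input, so the multiplexer output follows that input for as long as the memory elements hold their values.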
A typical memory element 20 is formed from a number of transistors configured to form cross-coupled inverters. Other arrangements (e.g., cells with more distributed inverter-like circuits) may also be used. With one suitable approach, complementary metal-oxide-semiconductor (CMOS) integrated circuit technology is used to form the memory elements 20, so CMOS-based memory element implementations are described herein as an example. In the context of programmable integrated circuits, the memory elements store configuration data and are therefore sometimes referred to as configuration random-access memory (CRAM) cells.
An illustrative system environment for device 10 is shown in
Circuit 40 may be an erasable-programmable read-only memory (EPROM) chip, a programmable logic device configuration data loading chip with built-in memory (sometimes referred to as a “configuration device”), or another suitable device. When system 38 boots up (or at another suitable time), the configuration data for configuring the programmable logic device may be supplied to the programmable logic device from device 40, as shown schematically by path 42. The configuration data that is supplied to the programmable logic device may be stored in the programmable logic device in its configuration random-access-memory elements 20.
System 38 may include processing circuits 44, storage 46, and other system components 48 that communicate with device 10. The components of system 38 may be located on one or more boards such as board 36 or other suitable mounting structures or housings and may be interconnected by buses, traces, and other electrical paths 50.
Configuration device 40 may be supplied with the configuration data for device 10 over a path such as path 52. Configuration device 40 may, for example, receive the configuration data from configuration data loading equipment 54 or other suitable equipment that stores this data in configuration device 40. Device 40 may be loaded with data before or after installation on board 36.
It can be a significant undertaking to design and implement a desired logic circuit in a programmable logic device. Logic designers therefore generally use logic design systems based on computer-aided-design (CAD) tools to assist them in designing circuits. A logic design system can help a logic designer design and test complex circuits for a system. When a design is complete, the logic design system may be used to generate configuration data for electrically programming the appropriate programmable logic device.
As shown in
In a typical scenario, logic design system 56 is used by a logic designer to create a custom circuit design. The system 56 produces corresponding configuration data, which is provided to configuration device 40. Upon power-up, configuration device 40 and data loading circuitry on programmable logic device 10 are used to load the configuration data into CRAM cells 20 of device 10. Device 10 may then be used in normal operation of system 38.
After device 10 is initially loaded with a set of configuration data (e.g., using configuration device 40), device 10 may be reconfigured by loading a different set of configuration data. Sometimes it may be desirable to reconfigure only a portion of the memory cells on device 10 via a process sometimes referred to as partial reconfiguration. As memory cells are typically arranged in an array, partial reconfiguration can be performed by writing new data values only into selected portion(s) of the array while leaving portions of the array other than the selected portion(s) in their original state.
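The idea of writing only a selected portion of the memory array can be sketched in Python as follows. This is a minimal model, assuming the array is a list of rows and the selected portion is a rectangle; the function name and the `(row0, col0, rows, cols)` region format are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def partial_reconfigure(
    cram: List[List[int]],
    region: Tuple[int, int, int, int],
    new_values: List[List[int]],
) -> None:
    """Overwrite a rectangular region of the CRAM array in place.

    region is (row0, col0, rows, cols). Cells outside the region are
    left in their original state.
    """
    row0, col0, rows, cols = region
    if row0 < 0 or col0 < 0 or row0 + rows > len(cram):
        raise ValueError("region exceeds array rows")
    if any(col0 + cols > len(row) for row in cram[row0:row0 + rows]):
        raise ValueError("region exceeds array columns")
    if len(new_values) != rows or any(len(r) != cols for r in new_values):
        raise ValueError("new_values does not match region size")
    for r in range(rows):
        # Slice assignment with equal lengths touches only the selected cells.
        cram[row0 + r][col0:col0 + cols] = new_values[r]
```

A full reconfiguration corresponds to a region that covers the whole array; partial reconfiguration simply uses a smaller region.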
Partial reconfiguration may be a particularly useful feature when developing an acceleration framework. For example, consider a scenario in which a system such as system 300 includes a host processor 302 that is coupled to other network components via paths 304 (see, e.g.,
Configured as such, accelerator circuit 310 may sometimes be referred to as a “hardware accelerator.” As examples, the processing cores on the coprocessor may be used to accelerate a variety of functions, which may include but are not limited to: encryption, Fast Fourier transforms, video encoding/decoding, convolutional neural networks (CNN), firewalling, intrusion detection, database searching, domain name service (DNS), load balancing, caching, network address translation (NAT), and other suitable network packet processing applications, just to name a few.
For instances in which cores P1-P4 are implemented as logic sectors in accelerator circuit 310, each logic sector may be managed using local sector managers, which may in turn be managed using a secure device manager. As shown in
In some instances, the configuration data and accelerator requests may optionally be compressed and encrypted. Thus, secure device manager 402 may include decompression engine 404 and decryption engine 406 for decompressing and decrypting data received from the host processor through hard processing controller 400.
Logic sectors 410 may be individually configurable/programmable. This allows each of logic sectors 410 to independently process different tasks in parallel. The parallel processing enabled by logic sectors 410 may be utilized to perform application acceleration (e.g., in a datacenter) for a variety of tasks or jobs simultaneously by reconfiguring different subsets of the logic sectors to perform said tasks.
In order to efficiently manage application acceleration as new tasks are issued to accelerator circuit 310 from the host processor, it may be necessary to perform real-time reconfiguration on any of logic sectors 410 that will be used to process a given newly received task. In other words, reconfiguration of logic sectors 410 may be performed while accelerator circuit 310 is running and may be performed without interrupting the operation of accelerator circuit 310.
The selection of which of logic sectors 410 are to be used for a given task may be determined by identifying which sectors are idle (e.g., not presently performing a task) and by identifying which sectors are handling lower-priority tasks (e.g., tasks without a fixed time budget) compared to the priority of the given task. Some or all of logic sectors 410 that are identified as being idle or as performing less critical tasks may then be selected, and if necessary, reconfigured to perform operations of the given task. Reassignment of logic sectors 410 that are working on a lower-priority task than the given task in need of sector assignment may be performed based on a load-balancing mechanism. It should be noted that those logic sectors 410 that are identified as already being configured to perform the given task may be given selection priority over any sectors that would need to be reconfigured to perform said task.
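One possible way to express this selection policy is sketched below in Python. The `Sector` record, the numeric priority scale (larger means more important), and the tie-breaking order after the pre-configured preference (idle sectors before sectors running lower-priority tasks) are assumptions for the example; the text above only requires that idle or lower-priority sectors be eligible and that already-configured sectors be preferred.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sector:
    sector_id: int
    persona: Optional[str]        # configuration currently loaded, if any
    task_priority: Optional[int]  # None when idle; larger = more important


def select_sector(
    sectors: List[Sector], persona: str, priority: int
) -> Optional[Tuple[Sector, bool]]:
    """Pick a sector for a task; return (sector, needs_reconfiguration).

    Eligible sectors are idle or running a lower-priority task.
    Returns None when no sector is eligible.
    """
    candidates = [
        s for s in sectors
        if s.task_priority is None or s.task_priority < priority
    ]
    if not candidates:
        return None

    def rank(s: Sector) -> Tuple[bool, bool, int]:
        # False sorts before True: configured first, then idle first,
        # then the least important running task.
        return (s.persona != persona, s.task_priority is not None,
                s.task_priority or 0)

    best = min(candidates, key=rank)
    return best, best.persona != persona
```

In this sketch a sector that already holds the required configuration is always chosen ahead of one that would need to be reconfigured, which avoids a reconfiguration whenever possible.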
Configuration data received by accelerator circuit 310 may be stored in memory on the same circuit package as accelerator circuit 310. As shown in
In some instances interposer 502 may also include decryption/decompression circuitry (not shown) for decrypting and decompressing configuration bit streams (e.g., received from the host processor or external memory). If desired, coprocessor 310 may instead include this decryption/decompression circuitry.
Auxiliary chip 504 may be one of a variety of chips, including a transceiver chip, a volatile memory (e.g., high bandwidth memory) chip, or a non-volatile memory (e.g., 3D XPoint) chip. In instances in which auxiliary chip 504 is a memory chip, chip 504 may be used as a secondary cache for storing configuration bit streams used in reconfiguring logic sectors of coprocessor 310.
Configuration data from host processor 302 may be loaded onto memory elements 506 of interposer 502 (and optionally onto memory elements in auxiliary chip 504) after undergoing processing/routing through secure device manager 402 of coprocessor 310 (e.g., after undergoing decompression and decryption). The configuration data may include one or more sector-level reconfiguration bit streams. When one of sectors 410 is selected to perform a task, if that sector needs to be reconfigured to perform the task (e.g., because the sector is presently configured to perform a different task), then secure device manager 402 may provide the selected sector with a pointer to the location of the necessary configuration bit stream (e.g., persona) required to perform that task in memory elements 506.
In some scenarios, the memory elements 506 may not already have the necessary configuration bit stream stored when said bit stream is needed by the selected sector. In this case, secure device manager 402 may retrieve the necessary configuration bit stream from external memory or from auxiliary chip 504 and may load the retrieved bit stream onto the selected sector and onto memory elements 506.
Coprocessor 310, interposer 502, and auxiliary chip 504 described above in connection with
At step 600, one or more new configuration bit streams may be provided to an integrated circuit package from a host processor (e.g., integrated circuit package 500 of
At step 602, the new configuration bit streams may optionally be cached in an auxiliary chip (e.g., auxiliary chip 504 of
At step 604, if interposer 502 includes decryption/decompression capabilities (e.g., decryption/decompression circuits), the method may proceed to step 606. Otherwise, if interposer 502 does not include decryption/decompression capabilities, the method may proceed to step 608.
At step 606, the new configuration bit streams may be decrypted and decompressed using the decryption/decompression circuits of interposer 502 and the decrypted/decompressed bit streams may be cached at interposer 502.
At step 608, the new configuration bit streams may be passed to coprocessor 310 and decryption/decompression circuitry on coprocessor 310 may decrypt/decompress the bit streams.
At step 610, the decrypted/decompressed bit streams may be passed back to interposer 502 where the bit streams may be subsequently cached in memory elements 506 on interposer 502.
After the completion of either step 606 or step 610, step 612 may be performed. At step 612, at least part of the decrypted/decompressed bit streams may be sequentially loaded into configuration random access memory (CRAM) cells on the coprocessor (e.g., into CRAM cells in one or more of sectors 410 of coprocessor 310 of
At step 614, any remaining unstored decrypted/decompressed bit streams may optionally be cached in auxiliary chip 504. For example, if memory elements 506 in interposer 502 do not have sufficient space to store a portion of the configuration bit streams used to configure the CRAM cells of sectors 410, that portion of the configuration bit streams may be stored in auxiliary chip 504.
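The decode-and-cache decisions of steps 604 through 614 can be summarized with the following Python sketch. It is a simplified model under stated assumptions: `zlib` stands in for the decompression engine, a single-byte XOR stands in for the decryption engine (it is a placeholder, not a real cipher), capacity is counted in bytes, and loading into CRAM cells at step 612 is omitted. All function and parameter names are hypothetical.

```python
import zlib
from typing import Dict, Tuple


def xor_decrypt(data: bytes, key: int) -> bytes:
    """Placeholder cipher for illustration only (XOR with a one-byte key)."""
    return bytes(b ^ key for b in data)


def ingest(
    bitstreams: Dict[str, bytes],
    interposer_capacity: int,
    interposer_can_decode: bool,
    key: int,
) -> Tuple[Dict[str, bytes], Dict[str, bytes], str]:
    """Decode incoming bit streams and decide where each one is cached.

    Returns (interposer_cache, auxiliary_cache, decoded_by).
    """
    interposer_cache: Dict[str, bytes] = {}
    auxiliary_cache: Dict[str, bytes] = {}
    # Step 604: choose where decryption/decompression happens.
    decoded_by = "interposer" if interposer_can_decode else "coprocessor"
    used = 0
    for name, blob in bitstreams.items():
        # Step 606 or 608: decrypt, then decompress (order is an assumption).
        plain = zlib.decompress(xor_decrypt(blob, key))
        if used + len(plain) <= interposer_capacity:
            # Step 606 or 610: cache the decoded stream on the interposer.
            interposer_cache[name] = plain
            used += len(plain)
        else:
            # Step 614: streams that do not fit go to the auxiliary chip.
            auxiliary_cache[name] = plain
    return interposer_cache, auxiliary_cache, decoded_by
```

The sketch keeps the two decode paths identical in effect and differs only in where the work is attributed, which mirrors how steps 606 and 608 to 610 both end with decoded bit streams cached on the interposer.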
Coprocessor 310 and interposer 502 described above in connection with
At step 700, a host processor (e.g., host processor 302 of
At step 702, host processor 302 may send an acceleration request to coprocessor 310. This acceleration request may be received by a secure device manager (e.g., secure device manager 402 of
At step 704, during an execution phase of the instruction cycle, secure device manager 402 may communicate with local sector managers (e.g., local sector managers 412 of
At step 706, if such a pre-configured sector exists, that sector may be selected and used to execute the given task.
At step 708, if such a pre-configured sector does not exist, host processor 302 may provide local sector manager 412 of an available sector with a pointer to the location of the configuration bit stream required for performing the given task that is stored in memory elements in an active interposer (e.g., memory elements 506 in active interposer 502 of
At step 710, if the required configuration data is stored in memory elements 506 of interposer 502 or on the memory elements of auxiliary chip 504 (e.g., if there is a cache hit), the required or desired configuration bit stream may be retrieved from those memory elements and may be used to reconfigure the available sector (e.g., by loading the required configuration bit stream onto the available sector). The configuration image stored in memory elements 506 of interposer 502 or on the memory elements of auxiliary chip 504 may not be encrypted. Memory elements 506 of interposer 502 and the memory elements of auxiliary chip 504 may act as instruction caches from which configuration data (e.g., bit streams) are fetched by the local sector managers for reconfiguring logic sectors 410. If the required configuration bit stream is fetched from the memory elements of auxiliary chip 504, then the required configuration bit stream may be passed to memory elements 506 of interposer 502 before being loaded onto the available logic sector. Alternatively, the required configuration bit stream may be directly loaded from the memory elements of auxiliary chip 504 onto the available logic sector.
At step 712, if the required configuration data is not stored in memory elements 506 of interposer 502 or on the memory elements of auxiliary chip 504 (e.g., if there is a cache miss), local sector manager 412 of the available sector may send a request to host processor 302 asking that host processor 302 provide the required configuration bit stream to memory elements 506 of interposer 502 or to the memory elements of auxiliary chip 504. For example, the required configuration bit stream may be retrieved from off-package directly from host processor 302 or from an external memory. Local sector manager 412 may then load the required configuration bit stream onto the available sector as described in connection with step 710, thereby reconfiguring the available sector. In some scenarios, local sector manager 412 may receive the required configuration bit stream from host processor 302 directly through secure device manager 402, in which case the required configuration bit stream may also be stored on memory elements 506 of interposer 502 or on the memory elements of auxiliary chip 504.
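The hit and miss handling of steps 710 and 712 can be sketched in Python as a two-tier lookup with an off-package fallback. The dictionaries standing in for the interposer and auxiliary chip memory elements, the `fetch_off_package` callback, and the choice to always copy a fetched bit stream into the interposer are assumptions for the example; the text above also allows loading directly from the auxiliary chip.

```python
from typing import Callable, Dict, Tuple


def fetch_bitstream(
    name: str,
    interposer_cache: Dict[str, bytes],
    auxiliary_cache: Dict[str, bytes],
    fetch_off_package: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (bit stream, source) for a requested configuration.

    source is "interposer", "auxiliary", or "off-package".
    """
    # Step 710: cache hit in the interposer memory elements.
    if name in interposer_cache:
        return interposer_cache[name], "interposer"
    # Step 710: cache hit in the auxiliary chip; pass the stream to the
    # interposer before it is loaded (one of the two options described).
    if name in auxiliary_cache:
        interposer_cache[name] = auxiliary_cache[name]
        return interposer_cache[name], "auxiliary"
    # Step 712: cache miss; request the stream from off-package and keep
    # a copy so that later requests hit.
    data = fetch_off_package(name)
    interposer_cache[name] = data
    return data, "off-package"
```

After a miss, a second request for the same bit stream is served from the interposer, which is the latency benefit the caching arrangement is intended to provide.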
A cross-sectional side view of an IC package having a stacked coprocessor and smart active interposer is shown in
Coprocessor 310 and its functions are described in detail in connection with
Active interposer 802 may include an active layer 806 and a bulk semiconductor layer 804 (sometimes referred to as inactive layer 804). The front surface of active layer 806 may sometimes be referred to herein as an active side. The opposing surface of bulk semiconductor layer 804 may sometimes be referred to as a backside. Active layer 806 may include multiple level one (L1) memory elements 808 formed at the active side, which may be used as memory caches for storing configuration bit streams for configuring logic sectors 410 in coprocessor 310. L1 memory elements 808 may correspond to memory elements 506 in form and function described in connection with
In some instances, active layer 806 may optionally include decryption/decompression circuitry for processing encrypted and/or compressed configuration bit streams. Active interposer 802 may be electrically connected to package substrate 832 through solder 822 (sometimes referred to herein as solder bumps 822 or solder balls 822). Inactive layer 804 may include through silicon vias (TSVs) 810, which may connect components such as L1 memory elements 808 in active layer 806 to solder balls 822. For example, L1 caches 808 may receive configuration bit streams from a host processor through solder balls 822 and TSVs 810. By using TSVs 810 in this way, the energy efficiency of transferring signals and power between active layer 806 and package substrate 832 may be improved compared to traditional interposer interconnect arrangements.
By connecting active layer 806 of interposer 802 to package substrate 832 using TSVs 810 in inactive layer 804, active layer 806 of interposer 802 may be arranged facing active layer 392 of coprocessor 310, and may be electrically connected to components in active layer 392 through microbumps 820, which may be smaller and may have smaller pitch widths than solder balls 822. This arrangement may significantly shorten the distance traveled by (and thereby reduce the latency of) signals transmitted between interposer 802 and coprocessor 310. This reduction in latency may advantageously increase the speed of reconfiguration of logic sectors 410 in coprocessor 310.
Auxiliary chip 812 may be placed adjacent to interposer 802 and may be stacked on package substrate 832. Auxiliary chip 812 may include an active layer 816 and a bulk semiconductor layer 814 (sometimes referred to as inactive layer 814). The front surface of active layer 816 may sometimes be referred to herein as an active side. The opposing surface of bulk semiconductor layer 814 may sometimes be referred to as a backside. Active layer 816 may include level two (L2) memory elements 809 formed at the active side, which may be used as memory caches for storing configuration bit streams for configuring logic sectors 410 in coprocessor 310.
For example, configuration bit streams stored in L1 memory elements 808 on interposer 802 may be transferred to L2 memory elements 809 to make room on L1 memory elements 808 for new incoming configuration bit streams (e.g., received at L1 memory elements 808 from a host processor). Active layer 816 of auxiliary chip 812 may be arranged facing package substrate 832 and may be electrically connected to package substrate 832 through solder balls 822 and solder balls 824. Solder balls 824 may be smaller than and may have smaller pitch widths than solder balls 822.
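The transfer of bit streams from L1 memory elements to L2 memory elements to make room for incoming bit streams can be sketched as follows in Python. The class name, the slot-based capacity, and the choice of which bit stream to move (the one stored or refreshed longest ago) are assumptions for illustration; the description above does not specify a replacement policy.

```python
from collections import OrderedDict
from typing import Dict


class TwoLevelStore:
    """Hypothetical model of L1 (interposer) and L2 (auxiliary chip) caches."""

    def __init__(self, l1_slots: int) -> None:
        if l1_slots < 1:
            raise ValueError("L1 needs at least one slot")
        self.l1_slots = l1_slots
        self.l1: "OrderedDict[str, bytes]" = OrderedDict()
        self.l2: Dict[str, bytes] = {}

    def insert(self, name: str, data: bytes) -> None:
        """Store an incoming bit stream in L1, spilling old entries to L2."""
        if name in self.l1:
            # Refresh an existing entry and mark it as most recent.
            self.l1[name] = data
            self.l1.move_to_end(name)
            return
        while len(self.l1) >= self.l1_slots:
            # Move the oldest L1 entry to L2 to make room (assumed policy).
            old_name, old_data = self.l1.popitem(last=False)
            self.l2[old_name] = old_data
        self.l1[name] = data
```

With this policy nothing is discarded on the package: a bit stream pushed out of L1 remains available from L2 without a return trip to the host processor.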
An embedded multi-die interconnect bridge (EMIB) 826 may be used to connect auxiliary chip 812 to interposer 802. EMIB 826 may include interconnects 828 formed in a silicon substrate that is embedded in package substrate 832. Interconnects 828 may electrically connect the portion of solder balls 824 connected to auxiliary chip 812 to the portion of solder balls 824 connected to interposer 802.
Heat sink 830 may be placed in contact with inactive layer 390 of coprocessor 310 and inactive layer 814 of auxiliary chip 812 in order to remove heat from coprocessor 310 and auxiliary chip 812. Active interposer 802 may perform minimal processing and therefore may generate minimal heat. Thus, it may not be necessary to put heat sink 830 in contact with interposer 802.
Alternative arrangements for package 800 are shown in
As shown in
As shown in
As shown in
The embodiments thus far have been described with respect to integrated circuits. The methods and apparatuses described herein may be incorporated into any suitable circuit. For example, they may be incorporated into numerous types of devices such as programmable logic devices, application specific standard products (ASSPs), and application specific integrated circuits (ASICs). Examples of programmable logic devices include programmable array logic (PALs), programmable logic arrays (PLAs), field programmable logic arrays (FPLAs), electrically programmable logic devices (EPLDs), electrically erasable programmable logic devices (EEPLDs), logic cell arrays (LCAs), complex programmable logic devices (CPLDs), and field programmable gate arrays (FPGAs), just to name a few.
The programmable logic device described in one or more embodiments herein may be part of a data processing system that includes one or more of the following components: a processor; memory; IO circuitry; and peripheral devices. The data processing system can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other suitable application where the advantage of using programmable or re-programmable logic is desirable. The programmable logic device can be used to perform a variety of different logic functions. For example, the programmable logic device can be configured as a processor or controller that works in cooperation with a system processor. The programmable logic device may also be used as an arbiter for arbitrating access to a shared resource in the data processing system. In yet another example, the programmable logic device can be configured as an interface between a processor and one of the other components in the system.
The foregoing is merely illustrative of the principles of this invention and various modifications can be made by those skilled in the art. The foregoing embodiments may be implemented individually or in any combination.
This application is a division of U.S. patent application Ser. No. 15/381,981, filed Dec. 16, 2016, which is hereby incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
20190229888 A1 | Jul 2019 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 15381981 | Dec 2016 | US
Child | 16372297 | | US