Examples of the present disclosure generally relate to communicating between data processing engines (DPEs) in an array using shared memory.
A processor, a system on a chip (SoC), and an application specific integrated circuit (ASIC) can include multiple cores for performing compute operations such as processing digital signals, performing cryptography, executing software applications, rendering graphics, and the like. In some examples, the cores may transmit data between each other when performing the compute operations. Typically, transferring data between cores requires the data to pass through a core-to-core interface that adds latency and is an inefficient use of memory.
Techniques for transferring data between data processing engines are described. One example is a method that includes processing data in a first data processing engine in an array of data processing engines disposed in an integrated circuit and storing the processed data in a first memory module in the first data processing engine. The method also includes retrieving the processed data from the first memory module using a direct neighbor connection that directly couples the first memory module in the first data processing engine to a second data processing engine in the array and processing the retrieved data using the second data processing engine.
One example described herein is a SoC that includes a first data processing engine in an array of data processing engines, where the first data processing engine includes a first memory module and is configured to store processed data in the first memory module. The SoC also includes a second data processing engine in the array that is configured to retrieve the processed data from the first memory module, using a direct neighbor connection that directly couples the first memory module in the first data processing engine to the second data processing engine, and to process the retrieved data.
So that the manner in which the above recited features can be understood in detail, a more particular description, briefly summarized above, may be had by reference to example implementations, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical example implementations and are therefore not to be considered limiting of its scope.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements of one example may be beneficially incorporated in other examples.
Various features are described hereinafter with reference to the figures. It should be noted that the figures may or may not be drawn to scale and that the elements of similar structures or functions are represented by like reference numerals throughout the figures. It should be noted that the figures are only intended to facilitate the description of the features. They are not intended as an exhaustive description of the description or as a limitation on the scope of the claims. In addition, an illustrated example need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular example is not necessarily limited to that example and can be practiced in any other examples even if not so illustrated, or if not so explicitly described.
Examples herein describe techniques for transferring data between cores in an array using shared memory. In one embodiment, certain cores in the array have connections to the memory in neighboring cores. For example, each core may have its own assigned memory module which can be accessed directly (e.g., without using a streaming or memory mapped interconnect). In addition, the surrounding cores (referred to herein as the neighboring cores) may also include direct connections to the memory module. Using these direct connections, the cores can read or store data in the neighboring memory modules.
Transferring data using a shared memory may reduce latency relative to using an interconnect network. For example, the array may include a streaming or memory mapped network that permits each of the cores to share data. However, accessing and transmitting data using the interconnect may take more clock cycles than reading or writing to shared memory. Thus, by providing direct connections to memory modules in neighboring cores, the array can improve the speed at which neighboring cores can transmit data.
In one embodiment, the DPEs 110 are identical. That is, each of the DPEs 110 (also referred to as tiles or blocks) may have the same hardware components or circuitry. Further, the embodiments herein are not limited to DPEs 110. Instead, the SoC 100 can include an array of any kind of processing elements, for example, the DPEs 110 could be digital signal processing engines, cryptographic engines, Forward Error Correction (FEC) engines, or other specialized hardware for performing one or more specialized tasks.
In
In one embodiment, the DPEs 110 are formed from non-programmable logic—i.e., are hardened. One advantage of doing so is that the DPEs 110 may take up less space in the SoC 100 relative to using programmable logic to form the hardware elements in the DPEs 110. That is, using hardened or non-programmable logic circuitry to form the hardware elements in the DPE 110 such as program memories, an instruction fetch/decode unit, fixed-point vector units, floating-point vector units, arithmetic logic units (ALUs), multiply accumulators (MAC), and the like can significantly reduce the footprint of the array 105 in the SoC 100. Although the DPEs 110 may be hardened, this does not mean the DPEs 110 are not programmable. That is, the DPEs 110 can be configured when the SoC 100 is powered on or rebooted to perform different functions or tasks.
The DPE array 105 also includes a SoC interface block 115 (also referred to as a shim) that serves as a communication interface between the DPEs 110 and other hardware components in the SoC 100. In this example, the SoC 100 includes a network on chip (NoC) 120 that is communicatively coupled to the SoC interface block 115. Although not shown, the NoC 120 may extend throughout the SoC 100 to permit the various components in the SoC 100 to communicate with each other. For example, in one physical implementation, the DPE array 105 may be disposed in an upper right portion of the integrated circuit forming the SoC 100. However, using the NoC 120, the array 105 can nonetheless communicate with, for example, programmable logic (PL) 125, a processor subsystem (PS) 130, or input/output (I/O) 135, which may be disposed at different locations throughout the SoC 100.
In addition to providing an interface between the DPEs 110 and the NoC 120, the SoC interface block 115 may also provide a connection directly to a communication fabric in the PL 125. In one embodiment, the SoC interface block 115 includes separate hardware components for communicatively coupling the DPEs 110 to the NoC 120 and to the PL 125 that is disposed near the array 105 in the SoC 100.
Although
Referring back to
In one embodiment, the interconnect 205 includes a configurable switching network that permits the user to determine how data is routed through the interconnect 205. In one embodiment, unlike in a packet routing network, the interconnect 205 may form streaming point-to-point connections. That is, the electrical paths and streaming interconnects or switches (not shown) in the interconnect 205 may be configured to form routes from the core 210 and the memory module 230 to the neighboring DPEs 110 or the SoC interface block 115. Once configured, the core 210 and the memory module 230 can transmit and receive streaming data along those routes. In one embodiment, the interconnect 205 is configured using the Advanced Extensible Interface (AXI) 4 Streaming protocol.
In addition to forming a streaming network, the interconnect 205 may include a separate network for programming or configuring the hardware elements in the DPE 110. Although not shown, the interconnect 205 may include a memory mapped interconnect which includes different electrical paths and switch elements used to set values of configuration registers in the DPE 110 that alter or set functions of the streaming network, the core 210, and the memory module 230.
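The configured point-to-point routing described above can be sketched in software. The following is a minimal illustration only, assuming a single row of DPEs with one stream switch each; the port names ("core", "mem", "east", "west"), the `StreamSwitch` class, and the `trace` helper are hypothetical and are not taken from the disclosure or from the AXI4-Stream protocol.

```python
# Hypothetical model: a single row of DPEs, each containing one stream switch.
LOCAL_PORTS = {"core", "mem"}          # endpoints inside a DPE
STEP = {"east": 1, "west": -1}         # how an outgoing port moves along the row
OPPOSITE = {"east": "west", "west": "east"}


class StreamSwitch:
    """Holds the input-port to output-port mapping set at configuration time."""

    def __init__(self):
        self.routes = {}

    def configure(self, in_port, out_port):
        # Stands in for a configuration register write over the memory mapped network.
        self.routes[in_port] = out_port


def trace(switches, start, in_port):
    """Follow a configured route; return (destination index, local port, hops)."""
    idx, port, hops = start, in_port, 0
    while True:
        out = switches[idx].routes[port]
        if out in LOCAL_PORTS:
            return idx, out, hops
        idx += STEP[out]
        port = OPPOSITE[out]   # data leaving east enters the next switch from the west
        hops += 1


# Configure a route from the core in DPE 0 to the memory module in DPE 2.
switches = [StreamSwitch() for _ in range(3)]
switches[0].configure("core", "east")
switches[1].configure("west", "east")
switches[2].configure("west", "mem")
print(trace(switches, 0, "core"))  # (2, 'mem', 2)
```

Once the routes are set, no per-transfer routing decision is made in this model, which mirrors the distinction drawn above between configured streaming connections and a packet routing network.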
The core 210 may include hardware elements for processing digital signals. For example, the core 210 may be used to process signals related to wireless communication, radar, vector operations, machine learning applications, and the like. As such, the core 210 may include program memories, an instruction fetch/decode unit, fixed-point vector units, floating-point vector units, arithmetic logic units (ALUs), multiply accumulators (MAC), and the like. However, as mentioned above, this disclosure is not limited to DPEs 110. The hardware elements in the core 210 may change depending on the engine type. That is, the cores in a digital signal processing engine, cryptographic engine, or FEC may be different.
The memory module 230 includes a direct memory access (DMA) engine 215, memory banks 220, and hardware synchronization circuitry (HSC) 225 or other type of hardware synchronization block. In one embodiment, the DMA engine 215 enables data to be received by, and transmitted to, the interconnect 205. That is, the DMA engine 215 may be used to perform DMA reads and writes to the memory banks 220 using data received via the interconnect 205 from the SoC interface block 115 or other DPEs 110 in the array.
The memory banks 220 can include any number of physical memory elements (e.g., SRAM). For example, the memory module 230 may include 4, 8, 16, 32, etc. different memory banks 220 where each of these banks 220 can include respective arbitration circuitry or logic. In this embodiment, the core 210 has a direct connection 235 to the memory banks 220. Stated differently, the core 210 can write data to, or read data from, the memory banks 220 without using the interconnect 205. That is, the direct connection 235 may be separate from the interconnect 205. In one embodiment, one or more wires in the direct connection 235 communicatively couple the core 210 to a memory interface in the memory module 230 which is in turn coupled to the memory banks 220.
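The disclosure does not state which arbitration policy the per-bank circuitry uses. Purely as an illustration, the sketch below assumes a round-robin policy in which one requester is granted access to a bank per cycle; the `RoundRobinArbiter` class and the requester names are hypothetical.

```python
class RoundRobinArbiter:
    """Illustrative per-bank arbiter; round-robin is an assumed policy."""

    def __init__(self, requesters):
        self.requesters = list(requesters)
        self.next_idx = 0  # position that has priority in the next cycle

    def grant(self, requesting):
        """Return the one requester granted this cycle, or None if none request."""
        n = len(self.requesters)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            candidate = self.requesters[idx]
            if candidate in requesting:
                # Priority moves past the winner so others are served next.
                self.next_idx = (idx + 1) % n
                return candidate
        return None


# Hypothetical requesters for one bank: the local core, two neighboring cores, the DMA engine.
arbiter = RoundRobinArbiter(["local_core", "north_core", "south_core", "dma"])
grants = [arbiter.grant({"local_core", "south_core"}) for _ in range(3)]
print(grants)  # ['local_core', 'south_core', 'local_core']
```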
In one embodiment, the memory module 230 also has direct connections 240 to cores in neighboring DPEs 110. Put differently, a neighboring DPE in the array can read data from, or write data into, the memory banks 220 using the direct neighbor connections 240 without relying on their interconnects or the interconnect 205 shown in
Because the core 210 and the cores in neighboring DPEs 110 can directly access the memory module 230, the memory banks 220 can be considered as shared memory between the DPEs 110. That is, the neighboring DPEs can directly access the memory banks 220 in a similar way as the core 210 that is in the same DPE 110 as the memory banks 220. Thus, if the core 210 wants to transmit data to a core in a neighboring DPE, the core 210 can write the data into the memory bank 220. The neighboring DPE can then retrieve the data from the memory bank 220 and begin processing the data. In this manner, the cores in neighboring DPEs 110 can transfer data using the HSC 225 while avoiding the extra latency introduced when using the interconnects 205. In contrast, if the core 210 wants to transfer data to a non-neighboring DPE in the array (i.e., a DPE without a direct connection 240 to the memory module 230), the core 210 uses the interconnects 205 to route the data to the memory module of the target DPE which may take longer to complete because of the added latency of using the interconnect 205 and because the data is copied into the memory module of the target DPE rather than being read from a shared memory module.
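The difference between the two transfer paths can be sketched as follows. This is a simplified model under stated assumptions: the `DPE` class, its `direct_neighbors` field, and the `send` function are hypothetical names, and a Python dictionary stands in for the memory banks. A neighboring destination reads the data in place from the sender's memory module, while a non-neighboring destination receives a copy in its own memory module.

```python
from dataclasses import dataclass, field


@dataclass
class DPE:
    name: str
    memory: dict = field(default_factory=dict)          # stands in for the memory banks
    direct_neighbors: set = field(default_factory=set)  # DPEs wired directly to this memory


def send(src, dst, key, data):
    """Return (path used, DPE whose memory module holds the data)."""
    if dst.name in src.direct_neighbors:
        # Neighbor case: write locally; the destination core reads it in place.
        src.memory[key] = data
        return "shared-memory", src
    # Non-neighbor case: the interconnect copies the data into the target's memory module.
    dst.memory[key] = list(data)
    return "interconnect", dst


a = DPE("A", direct_neighbors={"B"})
b = DPE("B")
c = DPE("C")
path_ab, holder_ab = send(a, b, "buf0", [1, 2, 3])
path_ac, holder_ac = send(a, c, "buf0", [1, 2, 3])
print(path_ab, holder_ab.name)  # shared-memory A
print(path_ac, holder_ac.name)  # interconnect C
```

The model omits latency and lock handling; it only shows where the data resides after each kind of transfer.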
In
Unlike the DPEs 110A and 110D, in the DPEs 110B and 110C, the cores 210B and 210C are disposed to the right of the memory modules 230B and 230C. As a result, the cores 210B and 210C are disposed directly above and directly below the memory module 230A (i.e., the cores 210B and 210C are north and south of the memory module 230A). Doing so makes establishing the direct neighboring connections 240A and 240C between the shared memory module 230A and the cores 210B and 210C easier than if the cores 210B and 210C were disposed to the left of the memory modules 230B and 230C. Using the arrangement shown in
The arrangement of the DPEs 110 illustrated in
Moreover, although not shown in
In one embodiment, a compiler or user may assign tasks to specific cores 210. The compiler or user may know which cores 210 in the array share memory modules. Referring to
The data flow 400 begins with a SoC block 305 in the SoC transmitting data to the SoC interface block 115. The SoC block 305 could be the NoC which receives data from a PL block, the PS, or I/O module in the SoC which is to be processed by the cores 210. The SoC interface block 115, in turn, uses one or more interconnects 205 to route the data to the memory module 230A where the data is stored.
Once the core 210A is ready to process the data, the core 210A reads the data from the memory module 230A and processes the data. In this example, once finished, the core 210A stores the processed data in the memory module 230A. Because the core 210B has a direct connection to the memory module 230A as shown in
The core 210B can then process the data retrieved from the shared memory module 230A and store the processed data in the memory module 230A. In this manner, the cores 210A and 210B can perform respective tasks in a pipelined application or thread and benefit from the reduced latency offered by using a shared memory module. Although the data flow 400 illustrates using two cores 210, the flow 400 may include any number of cores. For example, instead of the core 210B storing processed data back in the memory module 230A, the core 210B may store the processed data in its memory module 230B. A core in a DPE that neighbors the DPE 110B can then directly read the processed data from the memory module 230B and process the data. That is, the memory module 230B may be shared by multiple cores in neighboring DPEs just like the memory module 230A.
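The two-stage flow described above can be approximated in software as follows. This is a sketch only: the `MemoryModule` class, the buffer names, and the two arithmetic stages are hypothetical placeholders for whatever tasks the cores 210A and 210B actually perform. The access check models the fact that only cores with a direct connection can read or write the module.

```python
class MemoryModule:
    """Hypothetical stand-in for a memory module shared by directly connected cores."""

    def __init__(self, name, connected_cores):
        self.name = name
        self.connected = set(connected_cores)
        self.buffers = {}

    def _check(self, core):
        if core not in self.connected:
            raise PermissionError(f"{core} has no direct connection to {self.name}")

    def write(self, core, buf, data):
        self._check(core)
        self.buffers[buf] = data

    def read(self, core, buf):
        self._check(core)
        return self.buffers[buf]


def run_flow(samples):
    mm_230a = MemoryModule("230A", connected_cores={"210A", "210B"})
    # Input arrives from the SoC interface block over the interconnect.
    mm_230a.buffers["in"] = list(samples)
    # Core 210A: read, process (placeholder task), store back in 230A.
    stage_a = [x * 2 for x in mm_230a.read("210A", "in")]
    mm_230a.write("210A", "stage_a", stage_a)
    # Core 210B: read over its direct neighbor connection, process, store back.
    stage_b = [x + 1 for x in mm_230a.read("210B", "stage_a")]
    mm_230a.write("210B", "out", stage_b)
    return mm_230a


module = run_flow([1, 2, 3])
print(module.buffers["out"])  # [3, 5, 7]
```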
Once the application or thread is complete, the memory module 230A provides the data to the interconnect 205 which routes the data to the SoC interface block 115. In turn, the SoC interface block 115 provides the data to a SoC block 310 (e.g., the NoC or programmable logic). In this manner, hardware components in the SoC can provide data to the DPE array which can be processed by multiple cores 210 using one or more shared memory modules 230.
In one embodiment, the cores 210A and 210B may perform multiple tasks in parallel or sequentially. For example, the core 210A may perform Task A and Task B for the same application and thread. If Task B relies on data processed after Task A, each time the core 210A completes Task A, it can store the processed data in the memory module 230A and then retrieve the data once it is ready to perform Task B. That is, the core 210A can use the memory module 230A as a buffer for storing the data. As such, the core 210A can transmit data between tasks that are being executed in parallel or sequentially in the core 210A using the direct connection to the memory module 230A.
Unlike in data flow 400, in data flow 500, the core 210A transmits the processed data to different memory modules. That is, the core 210A transmits some or all of the processed data to the memory module 230A (i.e., the local memory module), a memory module 230E, and a memory module 230F. Nonetheless, although the memory modules 230E and 230F are on different DPEs, the core 210A uses direct connections to write the processed data into the memory modules 230E and 230F. Put differently, the memory modules 230E and 230F are memory modules that are shared by the core 210A and its neighboring DPEs.
The processed data stored in the memory module 230A is retrieved by the core 210B using the direct connection 240A shown in
The cores 210B, 210E, and 210F can process the data and then store the data in respective memory modules. In this example, the cores 210B and 210F both write the processed data into the memory module 230E, i.e., the local memory module for the core 210E. However, in other embodiments, the cores 210B and 210F can write their data into any memory module that is also shared with the core 210E. Put differently, the cores 210B and 210F can write their data into any memory module to which the cores 210B, 210F, and 210E have a direct connection.
The core 210E retrieves the processed data from the memory module 230E that was provided by the cores 210A, 210B, and 210F. In one embodiment, the cores 210A, 210B, and 210F are allocated different portions or buffers in the memory module 230E. After reading the data, the core 210E may perform an aggregate operation, or use the three sources of data to execute its assigned task or tasks. Once complete, the core 210E stores the data in a memory module 230G which may be a shared memory module in a neighboring DPE. Alternatively, the core 210E may store the processed data back in the memory module 230E.
The data flow 500 illustrates that data can be split and transmitted to different cores and then be aggregated later in one of the cores. Further, the core-to-memory transfers in the data flow 500 can all occur using direct connections, which means the data flow 500 can avoid using the interconnects in the various DPEs. That is, a compiler or user can assign the tasks such that each core-to-core transfer in the data flow 500 occurs using a memory module 230 that is shared between the cores 210. Nonetheless, not all of the cores 210 in the data flow 500 may share the same memory modules 230. For example, the core 210E may not have a direct connection to the memory module 230F.
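One way a compiler or user might check a task assignment is to confirm that every producer and consumer pair has at least one memory module in common. The sketch below is illustrative only: the `plan_transfers` function is hypothetical, and the access map is an assumed example that is consistent with the data flow 500 but is not a complete description of the array. The core "210Z" and module "230Z" are invented solely to show a transfer that would need the interconnect.

```python
def plan_transfers(edges, access):
    """Map each (producer, consumer) edge to the memory modules both can reach directly.

    Edges with no common module are returned separately, since they would
    have to use the interconnect instead of shared memory.
    """
    plan, needs_interconnect = {}, []
    for producer, consumer in edges:
        shared = access[producer] & access[consumer]
        if shared:
            plan[(producer, consumer)] = sorted(shared)
        else:
            needs_interconnect.append((producer, consumer))
    return plan, needs_interconnect


# Assumed direct connections (core -> reachable memory modules).
access = {
    "210A": {"230A", "230E", "230F"},
    "210B": {"230A", "230B", "230E"},
    "210E": {"230E", "230G"},
    "210F": {"230E", "230F"},
    "210Z": {"230Z"},  # invented non-neighboring core
}
edges = [
    ("210A", "210B"), ("210A", "210E"), ("210A", "210F"),
    ("210B", "210E"), ("210F", "210E"), ("210A", "210Z"),
]
plan, needs_interconnect = plan_transfers(edges, access)
print(plan[("210A", "210E")])   # ['230E']
print(needs_interconnect)       # [('210A', '210Z')]
```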
In
At block 710, the core 210 or tile interface block 605 determines whether the lock was granted. If granted, at block 715, the core 210A or tile interface block 605A writes data to the allocated portion (i.e., the locked buffer) of the shared memory 230A. For example, the core 210A may write processed data to the memory module 230A as shown in the data flows 400 and 500 in
Referring to
At block 730, the core 210A, 210B, or tile interface block 605B determines if the HSC 225 has granted the lock. The HSC 225 may stall that request until the core 210A or tile interface block 605A releases the lock.
Once the lock is granted, at block 735, the core 210 or tile interface block 605B can read data from the portion of the shared memory allocated by the HSC 225. In this manner, the HSC 225 can control access to the shared memory to provide core-to-core communication while avoiding higher latency communication networks such as the streaming network in the interconnect 205. Once finished reading the data, at block 740, the core 210 or tile interface block 605B releases the read lock which makes the shared memory module 230A available to the core 210A or tile interface block 605A to write updated data into the module 230A.
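The lock handshake described above (request a lock, write, release, then request a lock, read, release) can be modeled in software with a condition variable. This is a sketch under assumptions: the `HardwareSyncModel` class and its method names are hypothetical, a single buffer is guarded, and Python threads stand in for the writing and reading cores. A lock request that cannot be granted simply blocks, which corresponds to the HSC 225 stalling the request.

```python
import threading


class HardwareSyncModel:
    """Software stand-in for the HSC guarding one buffer (hypothetical interface)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._state = "empty"  # cycles: empty -> writing -> full -> reading -> empty

    def _acquire(self, wanted, taken):
        with self._cond:
            # The request stalls here until the lock can be granted.
            self._cond.wait_for(lambda: self._state == wanted)
            self._state = taken

    def _release(self, new_state):
        with self._cond:
            self._state = new_state
            self._cond.notify_all()

    def acquire_write(self):
        self._acquire("empty", "writing")

    def release_write(self):
        self._release("full")

    def acquire_read(self):
        self._acquire("full", "reading")

    def release_read(self):
        self._release("empty")


def run(items):
    hsc, buffer, received = HardwareSyncModel(), [None], []

    def writer():  # plays the role of the core 210A
        for item in items:
            hsc.acquire_write()
            buffer[0] = item
            hsc.release_write()

    def reader():  # plays the role of the core 210B
        for _ in items:
            hsc.acquire_read()
            received.append(buffer[0])
            hsc.release_read()

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


result = run([10, 20, 30])
print(result)  # [10, 20, 30]
```

Because the writer can only reacquire the lock after the reader releases it, each item is read exactly once and in order, regardless of how the two threads are scheduled.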
Although
The hashing in
One advantage of using three or more buffers in a FIFO to transfer data between the cores 210 is that the cores can operate at different speeds. For example, in
Although
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various examples of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.