Address interleaving for machine learning

Information

  • Patent Grant
  • 11734608
  • Patent Number
    11,734,608
  • Date Filed
    Wednesday, December 23, 2020
  • Date Issued
    Tuesday, August 22, 2023
Abstract
A system includes a memory, an inference engine, and a master. The memory is configured to store data. The inference engine is configured to receive the data and to perform one or more computation tasks of a machine learning (ML) operation associated with the data. The master is configured to interleave an address associated with a memory access transaction for accessing the memory. The master is further configured to provide content associated with the access to the inference engine.
Description
BACKGROUND

Applied Machine Learning (ML) is a booming field that utilizes a cascade of layers of nonlinear processing units and algorithms for feature extraction and transformation with a wide variety of usages and applications. ML typically involves two phases: training, which uses a rich set of training data to train a plurality of machine learning models, and inference, which applies the trained machine learning models to actual applications. Each of the two phases poses a distinct set of requirements for its underlying infrastructures. Various infrastructures may be used, e.g., a graphics processing unit (GPU), a central processing unit (CPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), etc. Specifically, the training phase focuses on, as a non-limiting example, GPU or ASIC infrastructures that scale with the trained models and retraining frequency, wherein the key objective of the training phase is to achieve high performance and reduce training time. The inference phase, on the other hand, focuses on infrastructures that scale with the applications, users, and data, and the key objective of the inference phase is to achieve energy (e.g., performance per watt) and capital (e.g., return on investment) efficiency.


The inference phase of ML is usually very computationally and data intensive. Unfortunately, as the input data and model sizes grow, data movement becomes a bottleneck and data processing increases because, in order to perform even simple processing, three operations or instructions are performed for each piece of data, e.g., load, processing, and store. As the amount of data grows, performing these three operations or instructions becomes burdensome. Moreover, current computing architectures are not scalable and are not well suited for ML and its applications, since much more time is spent loading and storing the data than processing it.


The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent upon a reading of the specification and a study of the drawings.


SUMMARY

Accordingly, a need has arisen to improve memory access and to utilize bandwidth efficiently, thereby alleviating the bottleneck resulting from data movement and memory access. In some nonlimiting examples, memory accesses are interleaved across multiple channels. In other words, the addresses associated with memory accesses are interleaved across multiple channels.


In some nonlimiting embodiments, a system includes a memory, an inference engine, and a master. The memory is configured to store data. The inference engine is configured to receive the data and to perform one or more computation tasks of a machine learning (ML) operation associated with the data. The master is configured to interleave an address associated with a memory access transaction for accessing the memory. The master is further configured to provide content associated with the access to the inference engine.


It is appreciated that in some embodiments the memory is a dynamic random access memory (DRAM). In some embodiments the memory may be a double data rate (DDR) memory.


According to some embodiments, a subset of bits of the interleaved address is used to determine an appropriate channel through which to access the memory. In some embodiments, the interleaving includes moving channel identifier bits within the address to the highest order bits. The channel identifier bits identify an appropriate channel through which to access the memory. The interleaving further includes shifting down the address bits whose bit orders, before the moving, are higher than the bit order of the channel identifier bits. The shifting down is by the same number of bit positions as the number of channel identifier bits. The moving and the shifting down form the interleaved address. According to some embodiments, the system further includes a network interface controller. The network interface controller in some embodiments only supports address interleaving at a granularity greater than a burst length of the address.


These and other aspects may be understood with reference to the following detailed description.





BRIEF DESCRIPTION OF THE DRAWINGS

Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.



FIG. 1 depicts an example of a diagram of a hardware-based programmable architecture configured to support inference acceleration for machine learning according to one aspect of the present embodiments.



FIG. 2 depicts an example of a diagram of a hardware-based programmable architecture configured to interleave addresses for improving data access for machine learning according to one aspect of the present embodiments.



FIG. 3 depicts an example of a diagram of a master component in a programmable architecture for machine learning configured to interleave addresses to improve data access and utilize bandwidth efficiently according to one aspect of the present embodiments.





DETAILED DESCRIPTION

The following disclosure provides many different embodiments, or examples, for implementing different features of the subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.


Before various embodiments are described in greater detail, it should be understood that the embodiments are not limiting, as elements in such embodiments may vary. It should likewise be understood that a particular embodiment described and/or illustrated herein has elements which may be readily separated from the particular embodiment and optionally combined with any of several other embodiments or substituted for elements in any of several other embodiments described herein. It should also be understood that the terminology used herein is for the purpose of describing the certain concepts, and the terminology is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood in the art to which the embodiments pertain.



FIG. 1 depicts an example of a diagram of a hardware-based programmable system/architecture 100 configured to support inference acceleration for machine learning. Although the diagrams depict components as functionally separate, such depiction is merely for illustrative purposes. It will be apparent that the components portrayed in this figure can be arbitrarily combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent that such components, regardless of how they are combined or divided, can execute on the same host or multiple hosts, and wherein the multiple hosts can be connected by one or more networks.


Each of the engines in the architecture 100 is a dedicated hardware block/component including one or more microprocessors and on-chip memory units storing software instructions programmed by a user for various machine learning operations. When the software instructions are executed by the microprocessors, each of the hardware components becomes a special-purpose hardware component for practicing certain machine learning functions as discussed in detail below. In some embodiments, the architecture 100 is on a single chip, e.g., a system-on-chip (SOC).


In the example of FIG. 1, the architecture 100 may include a host 110 coupled to a memory (e.g., Double Data Rate (DDR), Dynamic Random Access Memory (DRAM), high bandwidth memory (HBM), etc.) 120 and a core engine 130 via a PCIe controller and/or a direct memory access (DMA) module 125. The host 110 is a processing unit configured to receive or generate data to be analyzed and/or inferred by the architecture 100 via machine learning. The DDR memory 120 is coupled to a data streaming engine 140 configured to transfer/stream data between the DDR memory 120 and on-chip memory (OCM) 210 of an inference engine 160 discussed below via DDR-to-OCM DMA or DoD. The core 130 is a processing engine configured to receive and interpret a plurality of ML commands from the host 110 into instructions for an ML operation. The core 130 is also configured to process a plurality of performance non-critical operations, e.g., data/instruction preparatory work, data collection, data mapping, etc. The core 130 is coupled to an instruction-streaming engine 150, which accepts instructions destined for the inference engine 160 from the core 130 and distributes the instructions to the appropriate units within the inference engine 160. The inference engine 160 is configured to perform dense and sparse operations on a received stream of data, e.g., to identify a subject in an image, by using the training data and executing the programming instructions received from the instruction-streaming engine 150.


In some embodiments, the inference engine 160 includes a two-dimensional computing array of processing tiles, e.g., tiles 0, . . . , 63, arranged in, e.g., 8 rows by 8 columns. Each processing tile (e.g., tile 0) includes at least one on-chip memory (OCM) e.g., 210, one POD engine (or POD), e.g., 220, and one processing engine/element (PE), e.g., 230. Here, the OCMs in the processing tiles are configured to receive data from the data streaming engine 140 in a streaming fashion. The OCMs enable efficient local access to data per processing tile. The PODs are configured to perform dense or regular computations on the received data in the OCMs, e.g., matrix operations such as multiplication, matrix manipulation, tanh, sigmoid, etc., and the PEs are configured to perform sparse/irregular computations and/or complex data shape transformations of the received data in the OCMs, e.g., memory transpose, addition operation, operations on irregular data structures (such as trees, graphs, and priority queues), respectively. Both the PODs and the PEs can be programmed according to the programming instructions received from the instruction-streaming engine 150. Accordingly, the data is received and processed by each processing tile as an input data stream from the DDR memory 120 and the result is output by each processing tile as a stream of data to the DDR memory 120.


In some embodiments, a plurality of (e.g., four) processing tiles in the inference engine 160 together form a processing block or quad 250, e.g., processing tiles 0-3 form processing block 250, wherein the processing tiles within each processing block 250 are coupled to one another via a routing element 240. In some embodiments, all the routing elements are connected together as a mesh 260 of interconnect to connect the processing blocks in the same row or column as a two-dimensional array. It is appreciated that the number and/or types of components within each processing tile, the formation of the processing blocks, the number of processing tiles in each processing block, and the number of processing blocks in each row and column of the inference engine 160 as shown in FIG. 1 are exemplary and should not be construed as limiting the scope of the embodiments. In some embodiments, the same number of PE and POD may be used for each tile, and the same number of blocks may be used in each row and column in order to provide flexibility and scalability.


Referring now to FIG. 2, an example of a diagram of a hardware-based programmable architecture configured to interleave addresses for improving data access for machine learning according to one aspect of the present embodiments is shown. The system may include the host 110, the PCIe controller/DMA 125, the core 130, the instruction streaming engine 150, and a data streaming engine 140 that operate substantially similarly to those described in FIG. 1. In some nonlimiting examples a network interface controller (NIC) 290 may be coupled to facilitate transactions, e.g., instructions, commands, read requests, write requests, etc., between various components, e.g., the host 110, the PCIe controller/DMA 125, the core 130, the instruction streaming engine 150, the data streaming engine 140, etc., and the DDR memory 120 and/or the OCMs 210 of the inference engine 160. It is appreciated that while the illustrated example is described with respect to a DDR memory, other types of memory components may be used, e.g., DRAM, HBM, etc., and that describing the embodiments with respect to DDR should not be construed as limiting the scope.


As presented above, memory accesses may cause a bottleneck. In order to address the bottleneck resulting from memory access, the bandwidth associated with DRAM, DDR, etc., should be utilized more efficiently. In some nonlimiting examples, memory accesses are interleaved across multiple channels. In other words, the addresses associated with memory accesses are interleaved across multiple channels.


In a low power double data rate (LPDDR) system, the minimum burst length is 16. Thus, the minimum granularity of interleaving is 128 B. Unfortunately, the NIC 290 may not support address interleaving at a granularity below a certain size, e.g., 4 kB. Accordingly, interleaving at a granularity below that size, e.g., below 4 kB, should be performed by each component (also referred to as a master hereinafter), e.g., the host 110, the PCIe controller/DMA 125, the core 130, the instruction streaming engine 150, the data streaming engine 140, etc. In other words, each master may perform an address-bit swizzle at the connectivity level with no logic involved (described in greater detail in FIG. 3). The address interleaving is followed by the master transaction, e.g., ARM Core Complex (ACC), DDR-OCM-DMA (DOD), read, write, etc., to the DDR memory 120.
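The granularity figures above can be related to address bit positions with a little arithmetic. The following is a minimal sketch only: the 8-byte transfer per burst beat is inferred from 128 B divided by a burst length of 16 and is not stated in the description, 4 kB is assumed to mean 4096 bytes, and all constant names are hypothetical.

```python
# Illustrative arithmetic; names are hypothetical and not part of the disclosure.
BURST_LENGTH = 16              # minimum LPDDR burst length given in the text
GRANULE_BYTES = 128            # minimum interleave granularity given in the text
NIC_MIN_INTERLEAVE = 4 * 1024  # example NIC limit, assuming 4 kB == 4096 bytes

# Inferred, not stated: bytes moved per burst beat for a 128 B burst of 16 beats.
bytes_per_beat = GRANULE_BYTES // BURST_LENGTH

# Number of low address bits that select a byte inside one granule.
offset_bits = GRANULE_BYTES.bit_length() - 1
nic_offset_bits = NIC_MIN_INTERLEAVE.bit_length() - 1

print(bytes_per_beat, offset_bits, nic_offset_bits)
```

Under these assumptions, bits [6:0] address a byte within one 128 B burst, so a7 is the lowest bit that can serve as a channel identifier bit, whereas a NIC limited to 4 kB granularity could only interleave on bit a12 and above.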


Referring now to FIG. 3, an example of a diagram of a master component in a programmable architecture for machine learning configured to interleave addresses to improve data access and utilize bandwidth efficiently according to one aspect of the present embodiments is shown. In this nonlimiting example, the master 310 interleaves the address and transmits the interleaved address followed by the transaction to the DDR memory 120. In this illustrative example, the master 310 is coupled to the DDR memory 120 through channels 320, e.g., 4 channels A3, A2, A1, and A0. However, it is appreciated that in other embodiments a different number of channels may be used, e.g., 8 channels, 16 channels, etc. As such, the description of the embodiment with 4 channels is for illustrative purposes only and should not be construed as limiting the scope of the embodiments.


In some examples, the master 310 interleaves the address [a33, a32, a31, . . . , a0] associated with a memory location for a transaction, resulting in an interleaved address 312. In this illustrative example, since there are 4 channels, only 2 of the address bits (also referred to as channel identifier bits) are needed to determine the appropriate channel, e.g., A0, A1, A2, or A3. In this illustrative example, the bits a8 and a7 of the address are used to determine the appropriate communication channel. In some embodiments, 00 may be associated with channel A3, 01 may be associated with channel A2, 10 may be associated with channel A1, and 11 may be associated with channel A0. It is appreciated that using bits a8 and a7 of the address to determine the appropriate channel is for illustrative purposes and that in other examples bits with different orders may be used. In one illustrative example where 8 channels are used, 3 of the address bits are needed to identify the appropriate channel. Similarly, if 16 channels are used, 4 of the address bits are needed to identify the appropriate channel, and so on. It is appreciated that in some embodiments fewer than 4 channels may be used, e.g., 2 channels may be used with one address bit such as a7.
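The relationship between the channel count and the number of channel identifier bits can be sketched as follows. This is an illustration under stated assumptions, not the disclosed hardware: the function names are hypothetical, the channel count is assumed to be a power of two, and the function returns only the numeric value of the identifier bits rather than a channel name.

```python
def channel_id_bit_count(num_channels: int) -> int:
    """Number of address bits needed to select one of num_channels channels."""
    # Assumption: the channel count is a power of two (2, 4, 8, 16, ...).
    if num_channels < 2 or num_channels & (num_channels - 1):
        raise ValueError("num_channels must be a power of two >= 2")
    return num_channels.bit_length() - 1


def channel_id(address: int, low_bit: int = 7, num_channels: int = 4) -> int:
    """Extract the channel identifier bits, starting at bit order low_bit."""
    width = channel_id_bit_count(num_channels)
    return (address >> low_bit) & ((1 << width) - 1)


# 4 channels need 2 bits, 8 need 3, 16 need 4, and 2 need a single bit.
print([channel_id_bit_count(n) for n in (2, 4, 8, 16)])
# With 4 channels, bits a8 and a7 of address 0x180 are both set.
print(channel_id(0x180))
```

With the defaults, the function reads bits a8 and a7, matching the 4-channel example; passing num_channels=2 reads only a7.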


The master 310 interleaves the bits of the address. For example, bits a8 and a7 of the address, which are the 8th and 7th order bits, are moved to be the highest order bits of the address, i.e., the 33rd and 32nd order bits. The address bits a33 . . . a9 are moved to new bit orders, and the address bits a6 . . . a0 remain at the same bit orders as before. In other words, the address bits [8:7] are moved to the highest address bits [33:32] and are used to select the appropriate channel. Original bits [33:9] are shifted down by two bit positions to occupy bits [31:7], and bits [6:0] remain unchanged. It is appreciated that higher order address bits, i.e., bits of order 34 and above, can also remain unchanged. Accordingly, in some embodiments the higher order bits may be used to select the DRAM rank or chip-select bits, thereby supporting higher capacities without a change to the interleaving scheme.
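The bit movement described above can be modeled in software as a minimal sketch. The description states that the master performs the swizzle at the connectivity level with no logic involved, so the code below is only a behavioral model; the function and constant names are hypothetical, and the inverse function is an addition used solely to check that the mapping is reversible.

```python
ADDR_BITS = 34   # address bits a33..a0 from the example
CH_LOW = 7       # lowest channel identifier bit (a7)
CH_BITS = 2      # two identifier bits for 4 channels

LOW_MASK = (1 << CH_LOW) - 1
CH_MASK = (1 << CH_BITS) - 1
ADDR_MASK = (1 << ADDR_BITS) - 1
HIGH_BITS = ADDR_BITS - CH_LOW - CH_BITS   # width of original bits [33:9]


def interleave(addr: int) -> int:
    """Move bits [8:7] to [33:32], shift [33:9] down by two, keep [6:0]."""
    passthrough = addr >> ADDR_BITS          # bits 34 and above stay unchanged
    a = addr & ADDR_MASK
    low = a & LOW_MASK                       # bits [6:0]
    ch = (a >> CH_LOW) & CH_MASK             # bits [8:7]
    high = a >> (CH_LOW + CH_BITS)           # bits [33:9]
    out = (ch << (ADDR_BITS - CH_BITS)) | (high << CH_LOW) | low
    return (passthrough << ADDR_BITS) | out


def deinterleave(addr: int) -> int:
    """Inverse mapping (not in the description; used here as a sanity check)."""
    passthrough = addr >> ADDR_BITS
    a = addr & ADDR_MASK
    low = a & LOW_MASK
    ch = a >> (ADDR_BITS - CH_BITS)
    high = (a >> CH_LOW) & ((1 << HIGH_BITS) - 1)
    out = (high << (CH_LOW + CH_BITS)) | (ch << CH_LOW) | low
    return (passthrough << ADDR_BITS) | out


print(hex(interleave(0x180)))   # a8 and a7 set -> bits 33 and 32 set
print(hex(interleave(0x200)))   # a9 set -> shifted down to bit 7
```

For example, address 0x180 (only a8 and a7 set) maps to 0x300000000, and address 0x200 (only a9 set) maps to 0x80, consistent with bits [33:9] shifting down by two positions.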


Accordingly, the master 310 is used to efficiently access memory, e.g., the DDR memory 120, in an interleaved fashion, thereby alleviating the bottleneck and inefficiencies caused by memory accesses. The DDR memory 120 receives the interleaved address 312 via an appropriate channel. In some illustrative embodiments, the DDR memory 120 may return data 122 associated with the received interleaved address 312 to the master 310 via the appropriate channel, e.g., the same channel through which the interleaved address 312 was received. Accordingly, the bandwidth is utilized more efficiently when accessing the DDR memory 120.
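The bandwidth effect can be illustrated by counting how a sequential stream of 128 B blocks is distributed over 4 channels. This is a sketch under assumptions not stated in the description: the uninterleaved case is assumed to pick the channel from the top two address bits [33:32], the 4 kB sequential region is an arbitrary example, and the function names are hypothetical.

```python
from collections import Counter


def channel_with_interleave(addr: int) -> int:
    # After the swizzle, the channel is chosen by original bits a8 and a7.
    return (addr >> 7) & 0b11


def channel_without_interleave(addr: int) -> int:
    # Assumed baseline: the channel is chosen by the top bits [33:32].
    return (addr >> 32) & 0b11


# 32 consecutive 128 B blocks covering one 4 kB region.
blocks = range(0, 4096, 128)

interleaved = Counter(channel_with_interleave(a) for a in blocks)
baseline = Counter(channel_without_interleave(a) for a in blocks)

print(dict(interleaved))  # every channel identifier value receives 8 blocks
print(dict(baseline))     # all 32 blocks land on a single channel
```

Under these assumptions, consecutive blocks rotate through all four channels, whereas the baseline sends the whole region through one channel, which is the imbalance the interleaving is meant to avoid.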


The foregoing description of various embodiments of the claimed subject matter has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. Embodiments were chosen and described in order to best describe the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the claimed subject matter, the various embodiments and the various modifications that are suited to the particular use contemplated.

Claims
  • 1. A system to support an operation, comprising: an inference engine comprising one or more processing tiles, wherein each processing tile comprises at least one or more of an on-chip memory (OCM) configured to load and maintain data for local access by components in the processing tile; and one or more processing units configured to perform one or more computation tasks of the operation on data in the OCM by executing a set of task instructions; and a data streaming engine configured to stream data between a memory and the OCMs of the one or more processing tiles of the inference engine, wherein the data streaming engine is configured to interleave an address associated with a memory access transaction for accessing the memory, wherein a subset of bits of the interleaved address is used to determine an appropriate communication channel through which to access the memory; and a network interface controller configured to support address interleaving for a burst length greater than a burst length of the address.
  • 2. The system of claim 1, wherein: each processing unit of the processing units in each processing tile includes one or more of a first processing unit configured to perform a dense and/or regular computation operation on the data in the OCM; and a second processing unit/element configured to perform a sparse and/or irregular computation task operation on the data in the OCM and/or from the first processing unit.
  • 3. The system of claim 1, wherein the memory is a dynamic random access memory (DRAM).
  • 4. The system of claim 1, wherein the memory is a double data rate (DDR).
  • 5. The system of claim 1, wherein: the data streaming engine is configured to move one or more communication channel identifier bits within the address to the highest order address bits, wherein the communication channel identifier bits identify an appropriate communication channel through which to access the memory; and shift down the address bits with a bit order higher than a bit order of the communication channel identifier bits before the moving, wherein the shifting down is by a same order as a number of the communication channel identifier bits, and wherein the moving and the shifting down forms the interleaved address.
  • 6. A system comprising: an inference engine configured to receive the data and to perform one or more computation tasks operation associated with the data; a master configured to: interleave an address associated with a memory access transaction for accessing a memory, and wherein the master is further configured to stream a content associated with the accessing to the inference engine, move one or more communication channel identifier bits within the address to the highest order address bits, wherein the communication channel identifier bits identify an appropriate communication channel through which to access the memory, and shift down the address bits with a bit order higher than a bit order of the communication channel identifier bits before the moving, wherein the shifting down is by a same order as a number of the communication channel identifier bits, and wherein the moving and the shifting down forms the interleaved address; and a network interface controller configured to support address interleaving for a burst length greater than a burst length of the address.
  • 7. The system of claim 6, wherein the memory is a dynamic random access memory (DRAM).
  • 8. The system of claim 6, wherein the memory is a double data rate (DDR).
  • 9. The system of claim 6, wherein a subset of bits of the interleaved address is used to determine an appropriate communication channel through which to access the memory.
  • 10. A method, comprising: interleaving an address associated with a memory access transaction for accessing a memory, wherein interleaving of the address is for a burst length greater than a burst length of the address; utilizing a subset of bits of the interleaved address to determine an appropriate communication channel through which to access the memory; streaming data associated with the memory accessing transaction from the memory to an inference engine; and performing one or more computation tasks operation associated with the data via the inference engine.
  • 11. The method of claim 10, wherein: the inference engine comprises a plurality of processing tiles, wherein each processing tile comprises at least one or more of an on-chip memory (OCM) configured to load and maintain data for local access by components in the processing tile; and one or more processing units configured to perform one or more computation tasks of the ML operation on data in the OCM by executing a set of task instructions.
  • 12. The method of claim 10, further comprising: moving one or more communication channel identifier bits within the address to the highest order address bits, wherein the communication channel identifier bits identify an appropriate communication channel through which to access the memory; and shifting down the address bits with a bit order higher than a bit order of the communication channel identifier bits before the moving, wherein the shifting down is by a same order as a number of the communication channel identifier bits, and wherein the moving and the shifting down forms the interleaved address.
  • 13. The method of claim 12, further comprising: identifying an appropriate communication channel to communicate with the memory, wherein the identifying is through the communication channel identifier bits.
  • 14. The method of claim 13, further comprising: transmitting the memory access transaction associated with the address via the appropriate communication channel to the memory.
  • 15. The method of claim 14, further comprising: receiving the data associated with the address from the memory through the appropriate communication channel that the memory access transaction is received from.
  • 16. The method of claim 12, further comprising: maintaining an address bit with a lower bit order than that of the communication channel identifiers before the moving.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/420,078, filed May 22, 2019, which is a continuation-in-part of U.S. patent application Ser. No. 16/226,539, filed Dec. 19, 2018, now U.S. Pat. No. 10,824,433, issued Nov. 3, 2020, and claims the benefit of U.S. Provisional Patent Application No. 62/675,076, filed May 22, 2018, which are incorporated herein in their entirety by reference.

Related Publications (1)
Number Date Country
20210117866 A1 Apr 2021 US
Provisional Applications (1)
Number Date Country
62675076 May 2018 US
Continuations (1)
Number Date Country
Parent 16420078 May 2019 US
Child 17247810 US
Continuation in Parts (1)
Number Date Country
Parent 16226539 Dec 2018 US
Child 16420078 US