A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This disclosure relates to implementing an application in a data processing array and, more particularly, to implementing data flows of the application across a memory hierarchy of target hardware including the data processing array.
Some varieties of integrated circuits (ICs) provide architectures that include multiple compute units. Such ICs are capable of providing significant computational power and a high degree of parallelism. Applications can be created for execution on target hardware having a multi-compute unit architecture using a data flow model of computation. For example, machine learning applications are often specified using a data flow model of computation. Examples of computational data flow models used to create applications include synchronous data flow, cyclo-static data flow, and multi-dimensional data flow. Each of these computational models focuses on data production properties and data consumption properties between computational nodes in a data flow graph used to specify the application. In general, a data flow graph is a collection of nodes and edges in which the nodes represent operations performed on data and the edges represent communication links that convey data among the nodes.
An application intended to execute on a multi-compute unit hardware architecture often consumes significant amounts of data. The ability to efficiently provide data to the compute units and output resulting data from the compute units has a significant effect on runtime performance of the application as executed on the target hardware. Currently available data flow models of computation are unable to define data movements of an application in a manner that accounts for the architecture, including the memory hierarchy, of the target hardware. Further, many modern machine learning applications rely on multi-producer and multi-consumer operation. Available data flow models, however, lack support for multi-producer and multi-consumer semantics.
In one or more example implementations, a method includes receiving a data flow graph specifying an application for execution on a data processing array. The method includes identifying a plurality of buffer objects corresponding to a plurality of different levels of a memory hierarchy of the data processing array and an external memory. The plurality of buffer objects specify data flows. The method includes determining buffer object parameters. The buffer object parameters define properties of the data flows. The method includes generating data that configures the data processing array to implement the data flows among the plurality of different levels of the memory hierarchy and the external memory based on the plurality of buffer objects and the buffer object parameters.
In one or more example implementations, a system includes one or more processors configured to initiate operations. The operations include receiving a data flow graph specifying an application for execution on a data processing array. The operations include identifying a plurality of buffer objects corresponding to a plurality of different levels of a memory hierarchy of the data processing array and an external memory. The plurality of buffer objects specify data flows. The operations include determining buffer object parameters. The buffer object parameters define properties of the data flows. The operations include generating data that configures the data processing array to implement the data flows among the plurality of different levels of the memory hierarchy and the external memory based on the plurality of buffer objects and the buffer object parameters.
In one or more example implementations, a computer program product includes one or more computer-readable storage media, and program instructions collectively stored on the one or more computer-readable storage media. The program instructions are executable by computer hardware to initiate operations. The operations can include receiving a data flow graph specifying an application for execution on a data processing array. The operations include identifying a plurality of buffer objects corresponding to a plurality of different levels of a memory hierarchy of the data processing array and an external memory. The plurality of buffer objects specify data flows. The operations include determining buffer object parameters. The buffer object parameters define properties of the data flows. The operations include generating data that configures the data processing array to implement the data flows among the plurality of different levels of the memory hierarchy and the external memory based on the plurality of buffer objects and the buffer object parameters.
This Summary section is provided merely to introduce certain concepts and not to identify any key or essential features of the claimed subject matter. Other features of the inventive arrangements will be apparent from the accompanying drawings and from the following detailed description.
The inventive arrangements are illustrated by way of example in the accompanying drawings. The drawings, however, should not be construed to be limiting of the inventive arrangements to only the particular implementations shown. Various aspects and advantages will become apparent upon review of the following detailed description and upon reference to the drawings.
This disclosure relates to implementing an application in a data processing array and, more particularly, to implementing data flows of the application across a memory hierarchy of target hardware including the data processing array. In accordance with the inventive arrangements described within this disclosure, a programming model is provided that supports multi-dimensional data and movement of such data throughout a memory hierarchy of a data processing array and an external memory. The programming model provides constructs and application programming interfaces (APIs) that are used within an application to define the movement of the multi-dimensional data across various levels of the memory hierarchy and the external memory of the target hardware.
In one or more example implementations, the programming model supports multi-producer and/or multi-consumer semantics for accessing multi-dimensional data stored at various levels of the memory hierarchy. The programming model, when utilized to specify an application, may be compiled by a compiler to generate data that is loadable into the target hardware to configure the data processing array. The data processing array is configured by the data to implement the application therein and to implement the data flows defined by the application using the programming model.
Further aspects of the inventive arrangements are described below with reference to the figures.
In the example of
A kernel refers to a software data processing element. A kernel may be a user-specified (e.g., custom) data processing element or a data processing element obtained from a standard library of software-based kernels. The kernel may implement any of a variety of different functions including commonly used functions. These functions may be specific to a particular domain such as image processing, communications, cryptography, machine learning, or the like. A kernel may be specified in a high-level programming language such as C/C++ and compiled. Other examples of high-level programming languages include, but are not limited to, Python, JavaScript, Swift, Go, LabVIEW, or Simulink. It should be appreciated that kernels may be specified in any of a variety of different programming languages whether high-level or low-level. The kernel may be compiled into computer-readable program instructions executable by a hardware processor or compiled into circuitry (e.g., implemented using programmable circuitry such as programmable logic).
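For purposes of illustration only, the following is a minimal sketch of a kernel specified in C++. The function name, the raw-pointer signature, and the scaling operation are assumptions chosen for the sketch; an actual kernel conforms to whatever argument types (e.g., stream or buffer handles) the programming model defines.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical kernel: scales each element of an input block and writes the
    // result to an output block.
    void scale_kernel(const int32_t *in, int32_t *out, std::size_t num_elements) {
        for (std::size_t i = 0; i < num_elements; ++i) {
            out[i] = in[i] * 2;
        }
    }

    int main() {
        const int32_t in[4] = {1, 2, 3, 4};
        int32_t out[4] = {};
        scale_kernel(in, out, 4);
        std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // prints 2 4 6 8
        return 0;
    }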
Using graph APIs 104, data flow graph 110 also defines one or more data nodes. Each data node may be specified within data flow graph 110 as a buffer. In one or more examples, buffers may be specified as different types. For example, a data node of data flow graph 110 may be specified as a “shared buffer” or as an “external buffer.” In conventional applications specified as data flow graphs, nodes are limited to specifying computational nodes (e.g., kernels). In contrast to conventional data flow graphs, the inventive arrangements described herein expand the set of graph objects, available APIs, and available semantics for creating a data flow graph 110 to include data nodes. The inclusion of data nodes in the data flow graph with kernels allows an application developer to specify particular data flows using the buffer objects, which correspond to particular levels of memory hierarchy in the data processing array 140.
Within this disclosure, various levels of memory hierarchy are described. It should be appreciated that the memory hierarchy that may be modeled using buffer types as nodes within a data flow graph may be arbitrarily specified to correspond to a particular hardware architecture that is to be used to implement an application represented by the data flow graph. In this regard, the “shared buffer” and the “external buffer” types described herein are provided as examples and are not intended as limitations. In addition, in certain contexts, one or more of the data nodes may not map to an actual data storage device or memory, but rather to a mechanism such as a circuit for accessing the data storage device or memory.
Compiler 106 is capable of compiling data flow graph 110 and graph APIs 104 to generate compiled data flow graph 120. Compiled data flow graph 120 may be specified as binary code. Compiled data flow graph 120 may be loaded into target hardware 130 including data processing array 140 to implement data flow graph 110 in data processing array 140. Compiled data flow graph 120 may specify compiled versions of kernels, a mapping of kernels to different compute tiles of data processing array 140, data establishing stream channels within data processing array 140, a mapping of data nodes to levels of memory hierarchy of data processing array 140 and/or target hardware 130, and/or data that, when loaded into appropriate configuration registers of data processing array 140, implements data flows across the memory hierarchy of data processing array 140. As described in greater detail below, the data flows may involve a global memory that is external to data processing array 140.
Each compute tile 202 can include one or more cores 208, a program memory (PM) 210, a data memory (DM) 212, a DMA circuit 214, and a stream interconnect (SI) 216. In one aspect, each core 208 is capable of executing program code stored in program memory 210. In one aspect, each core 208 may be implemented as a scalar processor, as a vector processor, or as a scalar processor and a vector processor operating in coordination with one another.
In one or more examples, each core 208 is capable of directly accessing the data memory 212 within the same compute tile 202 and the data memory 212 of any other compute tile 202 that is adjacent to the core 208 of the compute tile 202 in the up, down, left, and/or right directions. Core 208 sees data memories 212 within the same tile and in one or more other adjacent compute tiles as a unified region of memory (e.g., as a part of the local memory of the core 208). This facilitates data sharing among different compute tiles 202 in data processing array 140. In other examples, core 208 may be directly connected to data memories 212 in other compute tiles 202.
Cores 208 may be directly connected with adjacent cores 208 via core-to-core cascade connections (not shown). In one aspect, core-to-core cascade connections are unidirectional and direct connections between cores 208. In another aspect, core-to-core cascade connections are bidirectional and direct connections between cores 208. In general, core-to-core cascade connections allow the results stored in an accumulation register of a source core 208 to be provided directly to an input of a target or load core 208. This means that data provided over a cascade connection may be provided among cores directly with less latency since the data does not traverse the stream interconnect 216 and is not written by a first core 208 to data memory 212 to be read by a different core 208.
In an example implementation, compute tiles 202 do not include cache memories. By omitting cache memories, data processing array 140 is capable of achieving predictable, e.g., deterministic, performance. Further, significant processing overhead is avoided since maintaining coherency among cache memories located in different compute tiles 202 is not required. In a further example, cores 208 do not have input interrupts. Thus, cores 208 are capable of operating uninterrupted. Omitting input interrupts to cores 208 also allows data processing array 140 to achieve predictable, e.g., deterministic, performance.
In the example of
In one or more other examples, compute tiles 202 may not be substantially identical. In this regard, compute tiles 202 may include a heterogeneous mix of compute tiles 202 formed of two or more different types of processing elements. As an illustrative and nonlimiting example, different ones of compute tiles 202 may include processing elements selected from two or more of the following groups: digital signal processing engines, cryptographic engines, Forward Error Correction (FEC) engines, or other specialized hardware.
In the example of
Interface tiles 204 form an array interface 222 for data processing array 140. Array interface 222 operates as an interface that connects tiles of data processing array 140 to other resources of the particular IC in which data processing array 140 is disposed. In the example of
In the example of
In addition, external memory 250, as the name suggests, is external to data processing array 140. In one example, external memory 250 represents memory that is implemented in the same IC as data processing array 140. That is, external memory 250 may be an “on-chip” memory whether disposed on the same die as data processing array 140 or on a different die than data processing array 140 but within the same IC package. In another example, external memory 250 may be external to the IC in which data processing array 140 is implemented. In that case, external memory 250 may be disposed on the same circuit board as the IC including data processing array 140. For example, external memory 250 may be implemented as a Double Data Rate, Synchronous Dynamic Random Access Memory (DDR).
In the example of
Referring to
In general, each shared buffer object that is instantiated includes one or more input ports and one or more output ports that represent DMA channels of DMA circuits 220. Referring to line 4, the shared buffer object has one, two, or more input ports. The particular number of input ports may be specified by the numInputs parameter in the constructor of the shared buffer object at lines 13-14. Each input port of the shared buffer object represents a logical stream-to-memory mapped (S2MM) DMA channel of DMA circuit 220 in the memory tile 206. Each S2MM DMA channel of a shared buffer object physically maps to the S2MM DMA channel of the memory tile 206 that receives stream data as input via a stream interface of the DMA channel coupled to stream interconnect 216 and writes such data to memory 218 within the memory tile 206 via a memory-mapped interface of the DMA channel.
Referring to line 5, the shared buffer object has one, two, or more output ports. The particular number of output ports may be specified by the numOutputs parameter in the constructor of the shared buffer object at lines 13-14. Each output port of the shared buffer object represents a logical memory mapped-to-stream (MM2S) DMA channel of DMA circuit 220 in the memory tile 206. Each MM2S DMA channel of a shared buffer object physically maps to the MM2S DMA channel of the memory tile 206 that reads data from memory 218 therein via a memory-mapped interface of the DMA channel and outputs the data as a data stream to stream interconnect 216 via a stream interface of the DMA channel.
The capability of the shared buffer object to have multiple input ports allows the shared buffer object to connect to multiple data producers that are capable of sending data to the shared buffer object for storage therein (e.g., for storage in memory 218 of the memory tile 206). Similarly, the capability of the shared buffer object to have multiple output ports allows the shared buffer object to send data to multiple data consumers (e.g., send data from memory 218 of the memory tile 206 to the multiple data consumer(s)).
In addition, a shared buffer object input port can connect to other data storage buffer objects such as one or more shared buffers and/or one or more external buffers. Similarly, a shared buffer object output port can connect to data storage buffer objects such as one or more shared buffers and/or one or more external buffers. These mechanisms may be used to specify data transfers between memory tiles and/or between a memory tile and external memory in an application.
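For purposes of illustration, the sketch below shows how a shared buffer data node might be declared. The shared_buffer type and its create function are stand-ins (assumptions) modeled on the constructor parameters described above; only the numInputs and numOutputs parameters, and their meaning as S2MM and MM2S DMA channels, come from the description. The example dimensions are hypothetical.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Minimal stand-in for a shared buffer data node of the graph API.
    template <typename T>
    struct shared_buffer {
        std::vector<uint32_t> dims;  // multi-dimensional extent of data in the memory tile
        uint32_t num_inputs;         // S2MM DMA channels writing into memory 218
        uint32_t num_outputs;        // MM2S DMA channels reading from memory 218
        static shared_buffer create(std::vector<uint32_t> dimensions,
                                    uint32_t numInputs, uint32_t numOutputs) {
            return shared_buffer{std::move(dimensions), numInputs, numOutputs};
        }
    };

    int main() {
        // A shared buffer with two producers and two consumers, as in the
        // multi-producer/multi-consumer example discussed later.
        auto mtx = shared_buffer<int32_t>::create({10, 6}, /*numInputs=*/2, /*numOutputs=*/2);
        return (mtx.num_inputs == 2 && mtx.num_outputs == 2) ? 0 : 1;
    }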
Referring to
In general, each external buffer object that is instantiated includes one or more input ports and one or more output ports that represent DMA channels of DMA circuits 224. Referring to line 20, the external buffer object has one, two, or more input ports. The particular number of input ports may be specified by the numInputs parameter in the constructor of the external buffer object at lines 29-30. Each input port of the external buffer object represents a logical stream-to-memory mapped (S2MM) DMA channel of DMA circuit 224 in a particular interface tile 204. Each S2MM DMA channel of an external buffer object physically maps to the S2MM DMA channel of the interface tile 204 that receives stream data as input via a stream interface of the DMA channel coupled to stream interconnect 216 (in interface tile 204) and writes such data to external memory 250 via a memory-mapped interface of the DMA channel.
Referring to line 21, the external buffer object has one, two, or more output ports. The particular number of output ports may be specified by the numOutputs parameter in the constructor of the external buffer object at lines 29-30. Each output port of the external buffer object represents a logical memory mapped-to-stream (MM2S) DMA channel of DMA circuit 224 in the interface tile 204. Each MM2S DMA channel of an external buffer object physically maps to the MM2S DMA channel of the interface tile 204 that reads data from external memory 250 via a memory-mapped interface of the DMA channel and outputs the data as a data stream to stream interconnect 216 (e.g., of the interface tile 204) via a stream interface of the DMA channel.
The capability of the external buffer object to have multiple input ports allows the external buffer object to connect to multiple data producers capable of sending data to the external buffer object for storage therein (e.g., for storage in external memory 250). Similarly, the capability of the external buffer object to have multiple output ports allows the external buffer object to send data to multiple data consumers (e.g., send data from external memory 250 to the data consumer(s)).
In addition, an external buffer object input port can connect to other data storage buffer objects of the memory hierarchy of the data processing array. Similarly, an external buffer object output port can connect to data storage buffer objects of the memory hierarchy of the data processing array. These mechanisms may be used to specify data transfers between external buffer objects and one or more other buffer objects of the memory hierarchy of the data processing array.
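Analogously, and again as a hedged sketch rather than a specific vendor interface, an external buffer data node might be declared as follows. The external_buffer type is an assumption; only the numInputs/numOutputs parameters and the meaning of the ports as interface tile DMA channels come from the description above.

    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Minimal stand-in for an external buffer data node of the graph API.
    struct external_buffer {
        std::vector<uint32_t> dims;  // extent of the data held in external memory 250
        uint32_t num_inputs;         // S2MM DMA channels writing to external memory
        uint32_t num_outputs;        // MM2S DMA channels reading from external memory
        static external_buffer create(std::vector<uint32_t> dimensions,
                                      uint32_t numInputs, uint32_t numOutputs) {
            return external_buffer{std::move(dimensions), numInputs, numOutputs};
        }
    };

    int main() {
        // A pure data source in external memory: no input ports, one output port.
        auto ddr_src = external_buffer::create({1024}, /*numInputs=*/0, /*numOutputs=*/1);
        // A storage-to-storage data flow (external memory -> memory tile) would then be
        // expressed by connecting ddr_src output port 0 to a shared buffer input port.
        std::printf("ddr_src ports: %u in, %u out\n",
                    static_cast<unsigned>(ddr_src.num_inputs),
                    static_cast<unsigned>(ddr_src.num_outputs));
        return 0;
    }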
Referring to
Lines 36-46 specify traversing parameters. The traversing parameters include a dimension parameter that specifies the dimension for inter-data tile traversal; a stride parameter that specifies the distance, in terms of number of data elements in the traversal dimension, between consecutive tile traversing steps; and a wrap parameter that indicates the number of steps taken before wrapping in that dimension and incrementing to the next dimension (dimension+1) for inter-data tile traversal.
Lines 49-71 specify tiling parameters. The tiling parameters include a buffer_dimension parameter that specifies the buffer dimensions in units of the buffer element type, where buffer_dimension[0] is the fastest-moving dimension and is contiguous in memory. The tiling_dimension parameter specifies the tiling dimensions in units of the buffer element type. The offset parameter specifies a multi-dimensional offset with respect to the buffer starting element. The tile_traversal parameter specifies a vector of traversing parameters, where tile_traversal[0], tile_traversal[1], . . . , tile_traversal[N−1] represent loops of a traversing pattern from the innermost loop to the outermost loop. For example, tile_traversal[i].dimension represents the buffer object dimension traversed in the i-th loop of the traversal. The packet_port_id parameter specifies the output port identifier of the connected packet split or the input port identifier of the connected packet merge. The repetition parameter specifies the number of repetitions of the tiling traversal. The phase parameter specifies the phase of the tiling parameter for purposes of resource sharing and execution. The boundary_dimension parameter specifies the real data boundary dimension to be used in performing padding.
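The sketch below restates the traversing and tiling parameters as C++ structures. The member names follow the parameter names above; the structure names and member types are assumptions made for the sketch.

    #include <cstdint>
    #include <vector>

    // One loop of the inter-data tile traversal pattern.
    struct traversing_parameters {
        uint32_t dimension;  // buffer dimension traversed by this loop
        uint32_t stride;     // distance, in data elements, between consecutive tile starts
        uint32_t wrap;       // steps taken before wrapping and advancing the next dimension
    };

    // Access-pattern description for one port of a buffer object.
    struct tiling_parameters {
        std::vector<uint32_t> buffer_dimension;   // buffer extent; index 0 is the
                                                  // fastest-moving, contiguous dimension
        std::vector<uint32_t> tiling_dimension;   // extent of one data tile
        std::vector<int32_t>  offset;             // multi-dimensional offset of the first tile
        std::vector<traversing_parameters> tile_traversal;  // [0] innermost ... [N-1] outermost
        uint32_t packet_port_id;                  // port id of a connected packet split/merge
        uint32_t repetition;                      // number of repetitions of the traversal
        uint32_t phase;                           // phase for resource sharing and execution
        std::vector<uint32_t> boundary_dimension; // real data boundary used for padding
    };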
As noted, the buffer objects may be implemented as data nodes in the data flow graph 110, which expresses the computational model for the application. The various parameters described in connection with
The graph APIs 104 also include access rules that ensure data integrity in multi-producer and/or multi-consumer implementations for shared buffer objects and/or for external buffer objects. These access rules, which are enforced within data flow graph 110, include the following:
Though not described in detail within this disclosure, additional buffer ports representing kernel buffer ports (e.g., as mapped to and implemented in a particular compute tile 202) may be specified within data flow graph 110. These buffer ports represent data stored in data memories 212 (e.g., locally within a compute tile 202). With the addition of the shared buffer objects and the external buffer objects, data flow graph 110 is capable of specifying data and data movements throughout all levels of the memory hierarchy of data processing array 140. Compiler 106 is capable of recognizing the data flows specified between the various types of nodes (e.g., data nodes and/or computation nodes), data access patterns, and multi-producer and/or multi-consumer semantics. The ability to recognize such data flows within data flow graph 110 allows compiler 106 to generate the configuration data needed to program control registers of data processing array 140 to effectuate the specified data movements therein and ensure that the data movements are synchronized across the memory hierarchy in accordance with the access rules indicated above. In one aspect, data movements may be synchronized using locks.
In block 402, compilation system 100 receives data flow graph 110 specifying an application for execution on data processing array 140. Data flow graph 110 includes a plurality of data nodes. The data nodes are specified as buffer objects as described within this disclosure. In one aspect, one or more of the plurality of buffer objects is a shared buffer object representing a memory tile of the data processing array (e.g., multi-dimensional data stored in a memory tile of the data processing array). In another aspect, one or more of the plurality of buffer objects is an external buffer object representing a global memory that is external to the data processing array (e.g., representing multi-dimensional data stored in the global memory).
In block 404, compilation system 100 begins performing a compilation process on data flow graph 110. For example, in block 406, compilation system 100 is capable of identifying a plurality of buffer objects corresponding to a plurality of different levels of a memory hierarchy of data processing array 140 and external memory 250. The plurality of buffer objects specify data flows. In this sense, compilation system 100 effectively detects data flows in data flow graph 110 through identification of the plurality of buffer objects. One or more of the data transfers may be between a buffer object selected from the plurality of buffer objects and multiple data consumers. One or more of the data transfers may be between a buffer object selected from the plurality of buffer objects and multiple data producers.
In one or more example implementations, a kernel, as implemented in a core 208 of a compute tile 202, is an example of a data producer and/or a data consumer. A kernel that receives input from a buffer object (e.g., that reads data from the buffer object) is considered a data consumer. A kernel that provides data to a buffer object (e.g., that writes data to a buffer object) is considered a data producer.
In block 408, compilation system 100 determines buffer object parameters of the plurality of buffer objects. The buffer object parameters define properties of the data flows specified therein. In one or more example implementations, the buffer object parameters specify dimensionality of multi-dimensional data stored by the plurality of buffer objects. That is, the buffer object parameters specify the particular dimensions of the multi-dimensional data structure. In one or more example implementations, the buffer object parameters specify read access patterns and write access patterns for multi-dimensional data stored in the plurality of buffer objects. For example, the read access patterns and the write access patterns specify at least one of tiling parameters or traversal parameters.
As discussed, the graph APIs 104, as provided by compilation system 100, support creation of buffer objects and definition of the various buffer object parameters described within this disclosure.
In block 410, compilation system 100 generates compiled data flow graph 120. Compiled data flow graph 120 may be binary data that is loadable into target hardware, such as an IC that includes data processing array 140, to configure data processing array 140 to implement the application specified by data flow graph 110.
For example, as part of generating compiled data flow graph 120, in block 412, compilation system 100 generates data that configures data processing array 140 to implement the data flows among the plurality of different levels of the memory hierarchy as specified by the plurality of buffer objects and the buffer object parameters. The data generated may be included in, or part of, the binary data constituting compiled data flow graph 120. For example, the data generated that configures data processing array 140 to implement the data transfers among the plurality of different levels of the memory hierarchy configures one or more DMA circuits (e.g., one or more of DMA circuits 224, 220, and/or 214) of tiles of data processing array 140 to implement the data transfers.
In block 414, the compilation system, or another system in communication with target hardware such as an IC that includes data processing array 140, loads the compiled data flow graph 120 into the target hardware. In loading the compiled data flow graph 120 into the target hardware, the compiled kernels, as mapped to particular compute tiles, may be loaded into program memories 210 of the respective compute tiles 202 to which the kernels have been mapped during compilation. In loading the compiled data flow graph 120 into the target hardware, stream interconnects 216 are configured to implement stream channels among the compute tiles 202 executing kernels, the various memory tiles 206 implementing buffers, and the various interface tiles 204 implementing communication pathways. In loading the compiled data flow graph 120 into the target hardware, memories (e.g., external memory 250, memory tiles 206, and/or data memories 212) may be initialized. In addition, the data that configures data processing array 140 to implement the data transfers of data flow graph 110 among the plurality of different levels of the memory hierarchy is loaded. In one or more example implementations, this process entails loading the data into control/configuration registers of the various DMA circuits 214, 220, and/or 224 of data processing array 140.
At line 10, an output port 0 of kernel k1 is connected to input port 0 of the shared buffer mtx. At line 13, an output port 0 of kernel k2 is connected to an input port 1 of the shared buffer mtx. At line 16, an output port 0 of the shared buffer mtx is connected to an input port 0 of kernel k3. At line 19, an output port 1 of the shared buffer mtx is connected to an input port 0 of kernel k4. Thus, in the example of
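The connection topology just described can be summarized as follows. The connection record and the printed form below are stand-ins used only to enumerate the port-to-port links; an actual graph API connects typed kernel and buffer port objects. The access patterns attached to each of these ports are described next.

    #include <cstdio>

    // Stand-in record for one port-to-port link of the data flow graph.
    struct connection { const char *from; int from_port; const char *to; int to_port; };

    int main() {
        // Two producers (k1, k2) write into shared buffer mtx; two consumers (k3, k4) read from it.
        const connection links[] = {
            {"k1",  0, "mtx", 0},  // line 10: k1 output port 0 -> mtx input port 0
            {"k2",  0, "mtx", 1},  // line 13: k2 output port 0 -> mtx input port 1
            {"mtx", 0, "k3",  0},  // line 16: mtx output port 0 -> k3 input port 0
            {"mtx", 1, "k4",  0},  // line 19: mtx output port 1 -> k4 input port 0
        };
        for (const connection &c : links) {
            std::printf("%s port %d -> %s port %d\n", c.from, c.from_port, c.to, c.to_port);
        }
        return 0;
    }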
Lines 11-12 define the write access of kernel k1 to the shared buffer mtx via the input port 0 of the shared buffer mtx. Lines 11-12 define the write access by way of the buffer object parameters listed. These buffer object parameters include tiling parameters defining the dimensionality of the write-data. The tiling parameters include those specified by buffer dimension, tiling dimension, and offset. These buffer object parameters also include traversal parameters defining how to traverse or move through the multi-dimensional data. The traversal parameters include those defined by tiling traversal which specifies dimension, stride, and wrap.
Lines 14-15 define write access of kernel k2 to the shared buffer mtx by way of input port 1 of the shared buffer mtx. Lines 14-15 define the write access by way of the buffer object parameters listed.
Lines 17-18 define the read access of kernel k3 to the shared buffer mtx via the output port 0 of the shared buffer mtx. Lines 17-18 define the read access by way of the buffer object parameters listed. These buffer object parameters include tiling parameters defining the dimensionality of the read-data. The tiling parameters include those specified by buffer dimension, tiling dimension, and offset. These buffer object parameters also include traversal parameters defining how to traverse or move through the multi-dimensional data. The traversal parameters include those defined by tiling traversal which specifies dimension, stride, and wrap.
Lines 20-21 define read access of kernel k4 to the shared buffer mtx by way of output port 1 of the shared buffer mtx. Lines 20-21 define the read access by way of the buffer object parameters listed.
The example of
The write access pattern for kernel k1 is specified at lines 11-12 of
In the example of
The starting point for the write accesses is defined by the offset parameter which specifies a starting point of (0,0) for write access of kernel k1 corresponding to the lower left-hand corner of the data structure illustrated. The first set of parameters corresponding to the inner loop is {.dimension=0, .stride=3, .wrap=2}. The dimension parameter value of 0 indicates that the dimension of tile traversal in the current loop is along the D0-axis. The wrap parameter value of 2 indicates that kernel k1 will write 2 data tiles along the D0-axis before incrementing along the next dimension corresponding to the D1-axis. The stride parameter value of 3 indicates that the starting address of each new data tile along the D0-axis will be 3 data blocks away from the starting point of the prior data tile. For example, the starting point of data tile k1−1 is (0,0). The stride parameter value of 3 indicates that the starting point of data tile k1−2 is (3,0).
The second set of parameters corresponding to the outer loop is {.dimension=1, .stride=2, .wrap=3}. The dimension parameter value of 1 indicates that the dimension of the current loop is along the D1-axis. The wrap parameter value of 3 indicates that kernel k1 will write 3 data tiles along the D1-axis. The stride parameter value of 2 indicates that the starting address of each new data tile along the D1-axis will be 2 data blocks away from the starting point of the prior data tile. For example, the starting point of data tile k1−1 is (0,0). The stride parameter value of 2 indicates that the starting point of data tile k1−3 is (0,2).
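To make the stride and wrap arithmetic concrete, the short program below enumerates the tile starting coordinates produced by the offset (0,0), the inner loop {.dimension=0, .stride=3, .wrap=2}, and the outer loop {.dimension=1, .stride=2, .wrap=3} for kernel k1. It is a worked illustration of the traversal semantics, not part of any API.

    #include <cstdio>

    struct loop_spec { int dimension; int stride; int wrap; };

    int main() {
        const int offset[2] = {0, 0};       // starting point of the first data tile
        const loop_spec inner = {0, 3, 2};  // D0: 2 tiles, starts 3 data blocks apart
        const loop_spec outer = {1, 2, 3};  // D1: 3 tiles, starts 2 data blocks apart

        for (int o = 0; o < outer.wrap; ++o) {
            for (int i = 0; i < inner.wrap; ++i) {
                int start[2] = {offset[0], offset[1]};
                start[inner.dimension] += i * inner.stride;
                start[outer.dimension] += o * outer.stride;
                // Prints (0,0) (3,0) (0,2) (3,2) (0,4) (3,4): the starting points of
                // the six data tiles written by kernel k1 (k1-1, k1-2, k1-3, ...).
                std::printf("(%d,%d)\n", start[0], start[1]);
            }
        }
        return 0;
    }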
In the example of
The starting point for the write accesses is defined by the offset parameter which specifies a starting point of (6,0) for write access of kernel k2. The first set of parameters corresponding to the inner loop is {.dimension=0, .stride=2, .wrap=2}. The dimension parameter value of 0 indicates that the dimension of the current loop is along the D0-axis. The wrap parameter value of 2 indicates that kernel k2 will write 2 data tiles along the D0-axis before incrementing along the next dimension corresponding to the D1-axis. The stride parameter value of 2 indicates that the starting address of each new data tile along the D0-axis will be 2 data blocks away from the starting point of the prior data tile. For example, the starting point of data tile k2−1 is (6,0). The stride parameter value of 2 indicates that the starting point of data tile k2−2 is (8,0).
The second set of parameters corresponding to the outer loop is {.dimension=1, .stride=2, .wrap=3}. The dimension parameter value of 1 indicates that the dimension of the current loop is along the D1-axis. The wrap parameter value of 3 indicates that kernel k2 will write 3 data tiles along the D1-axis. The stride parameter value of 2 indicates that the starting address of each new data tile along the D1-axis will be 2 data blocks away from the starting point of the prior data tile. For example, the starting point of data tile k2−1 is (6,0). The stride parameter value of 2 indicates that the starting point of data tile k2−3 is (6,2).
The read access pattern for kernel k3 is specified at lines 17-18 of
In the example of
The starting point for the read accesses is defined by the offset parameter which specifies a starting point of (0,0) for read access of kernel k3. The tiling parameters are {.dimension=0, .stride=2, .wrap=2}. The dimension parameter value of 0 indicates that the dimension of the current loop is along the D0-axis. The wrap parameter value of 2 indicates that kernel k3 will read 2 data tiles along the D0-axis. The stride parameter value of 2 indicates that the starting address of each new data tile along the D0-axis will be 2 data blocks away from the starting point of the prior data tile. For example, the starting point of data tile k3−1 is (0,0). The stride parameter value of 2 indicates that the starting point of data tile k3−2 is (2,0).
In the example of
The starting point for the read accesses is defined by the offset parameter which specifies a starting point of (4,0) for read access of kernel k4. The set of parameters is {.dimension=0, .stride=3, .wrap=2}. The dimension parameter value of 0 indicates that the dimension of the current loop is along the D0-axis. The wrap parameter value of 2 indicates that kernel k4 will read 2 data tiles along the D0-axis. The stride parameter value of 3 indicates that the starting address of each new data tile along the D0-axis will be 3 data blocks away from the starting point of the prior data tile. For example, the starting point of data tile k4−1 is (4,0). The stride parameter value of 3 indicates that the starting point of data tile k4−2 is (7,0).
In the example, the external buffer ddrin stores a multi-dimensional array having the dimensions (10, 6, 100). The external buffer ddrin includes one read output port. The shared buffer tensor stores a sub-volume of the multi-dimensional array having the dimensions (10,6,10). At line 11, an output port 0 of external buffer ddrin is connected to an input port of shared buffer tensor.
At lines 12-13, read access of external buffer ddrin by way of output port 0 is specified. The read access of external buffer ddrin may be implemented by a DMA circuit disposed in an interface tile 204 and a DMA circuit disposed in a memory tile 206. As illustrated, the read access pattern indicates that the buffer dimension is (10,6,100). Lines 12-13 define the read access of the shared buffer tensor to the external buffer ddrin via the output port 0 of the external buffer ddrin. Lines 12-13 define the read access by way of the buffer object parameters listed. These buffer object parameters include tiling parameters defining the dimensionality of the data. The tiling parameters include those specified by buffer dimension, tiling dimension, and offset. These buffer object parameters also include traversal parameters defining how to traverse or move through the multi-dimensional data. The traversal parameters include those defined by tiling traversal which specifies dimension, stride, and wrap.
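As a hedged illustration, the read access of output port 0 of the external buffer ddrin could be captured with tiling parameters along the following lines. The buffer dimension (10, 6, 100) and the (10, 6, 10) sub-volume size come from the example; the offset and traversal values (dimension 2, stride 10, wrap 10, i.e., stepping through ten consecutive sub-volumes) are assumptions consistent with those dimensions rather than values taken from the example code.

    #include <cstdint>
    #include <vector>

    struct traversing_parameters { uint32_t dimension, stride, wrap; };
    struct tiling_parameters {
        std::vector<uint32_t> buffer_dimension;
        std::vector<uint32_t> tiling_dimension;
        std::vector<uint32_t> offset;
        std::vector<traversing_parameters> tile_traversal;
    };

    int main() {
        tiling_parameters ddrin_read;
        ddrin_read.buffer_dimension = {10, 6, 100};  // full array in external memory
        ddrin_read.tiling_dimension = {10, 6, 10};   // one sub-volume moved per transfer
        ddrin_read.offset           = {0, 0, 0};     // begin at the buffer starting element
        ddrin_read.tile_traversal   = {{2, 10, 10}}; // walk dimension 2 across all sub-volumes
        return static_cast<int>(ddrin_read.buffer_dimension.size()) - 3;  // 0 on success
    }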
The example of
Referring to the example of
Referring to the example of
Referring to line 13 of
In the example of
The example of
The inventive arrangements described herein, as implemented and/or provided by the graph APIs 104 and compiler 106, are capable of supporting one or more other features relating to the architecture of data processing array 140. In one aspect, the graph APIs 104 and compiler 106 support semaphore locks. The semaphore locks may be used to control read and/or write accesses and ensure that read accesses to a particular buffer do not begin until, or responsive to, any writes to that buffer completing. In another example, zero-padding of data may be supported.
Processor 1402 may be implemented as one or more processors. In an example, processor 1402 is implemented as a central processing unit (CPU). Processor 1402 may be implemented as one or more circuits capable of carrying out instructions contained in program code. The circuit may be an integrated circuit or embedded in an integrated circuit. Processor 1402 may be implemented using a complex instruction set computer architecture (CISC), a reduced instruction set computer architecture (RISC), a vector processing architecture, or other known architectures. Example processors include, but are not limited to, processors having an x86 type of architecture (IA-32, IA-64, etc.), Power Architecture, ARM processors, and the like.
Bus 1406 represents one or more of any of a variety of communication bus structures. By way of example, and not limitation, bus 1406 may be implemented as a Peripheral Component Interconnect Express (PCIe) bus. Data processing system 1400 typically includes a variety of computer system readable media. Such media may include computer-readable volatile and non-volatile media and computer-readable removable and non-removable media.
Memory 1404 can include computer-readable media in the form of volatile memory, such as random-access memory (RAM) 1408 and/or cache memory 1410. Data processing system 1400 also can include other removable/non-removable, volatile/non-volatile computer storage media. By way of example, storage system 1412 can be provided for reading from and writing to a non-removable, non-volatile magnetic and/or solid-state media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 1406 by one or more data media interfaces. Memory 1404 is an example of at least one computer program product.
Memory 1404 is capable of storing computer-readable program instructions that are executable by processor 1402. For example, the computer-readable program instructions can include an operating system, one or more application programs, other program code, and program data. Processor 1402, in executing the computer-readable program instructions, is capable of performing the various operations described herein that are attributable to a computer. For example, processor 1402, in executing the computer-readable program instructions is capable of implementing the functions and/or operations described in connection with compilation system 100 of
It should be appreciated that data items used, generated, and/or operated upon by data processing system 1400 are functional data structures that impart functionality when employed by data processing system 1400. As defined within this disclosure, the term “data structure” means a physical implementation of a data model's organization of data within a physical memory. As such, a data structure is formed of specific electrical or magnetic structural elements in a memory. A data structure imposes physical organization on the data stored in the memory as used by an application program executed using a processor.
Data processing system 1400 may include one or more Input/Output (I/O) interfaces 1418 communicatively linked to bus 1406. I/O interface(s) 1418 allow data processing system 1400 to communicate with one or more external devices and/or communicate over one or more networks such as a local area network (LAN), a wide area network (WAN), and/or a public network (e.g., the Internet). Examples of I/O interfaces 1418 may include, but are not limited to, network cards, modems, network adapters, hardware controllers, etc. Examples of external devices also may include devices that allow a user to interact with data processing system 1400 (e.g., a display, a keyboard, and/or a pointing device) and/or other devices such as an accelerator card.
Data processing system 1400 is only one example implementation. Data processing system 1400 can be practiced as a standalone device (e.g., as a user computing device or a server, as a bare metal server), in a cluster (e.g., two or more interconnected computers), or in a distributed cloud computing environment (e.g., as a cloud computing node) where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
As used herein, the term “cloud computing” refers to a computing model that facilitates convenient, on-demand network access to a shared pool of configurable computing resources such as networks, servers, storage, applications, ICs (e.g., programmable ICs) and/or services. These computing resources may be rapidly provisioned and released with minimal management effort or service provider interaction. Cloud computing promotes availability and may be characterized by on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
The example of
While the disclosure concludes with claims defining novel features, it is believed that the various features described within this disclosure will be better understood from a consideration of the description in conjunction with the drawings. The process(es), machine(s), manufacture(s) and any variations thereof described herein are provided for purposes of illustration. Specific structural and functional details described within this disclosure are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the features described in virtually any appropriately detailed structure. Further, the terms and phrases used within this disclosure are not intended to be limiting, but rather to provide an understandable description of the features described.
For purposes of simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numbers are repeated among the figures to indicate corresponding, analogous, or like features.
As defined herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As defined herein, the terms “at least one,” “one or more,” and “and/or,” are open-ended expressions that are both conjunctive and disjunctive in operation unless explicitly stated otherwise. For example, each of the expressions “at least one of A, B, and C,” “at least one of A, B, or C,” “one or more of A, B, and C,” “one or more of A, B, or C,” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
As defined herein, the term “automatically” means without human intervention. As defined herein, the term “user” means a human being.
As defined herein, the term “computer-readable storage medium” means a storage medium that contains or stores program instructions for use by or in connection with an instruction execution system, apparatus, or device. As defined herein, a “computer-readable storage medium” is not a transitory, propagating signal per se. The various forms of memory, as described herein, are examples of computer-readable storage media. A non-exhaustive list of examples of computer-readable storage media include an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of a computer-readable storage medium may include: a portable computer diskette, a hard disk, a RAM, a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an electronically erasable programmable read-only memory (EEPROM), a static random-access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, or the like.
As defined herein, the term “if” means “when” or “upon” or “in response to” or “responsive to,” depending upon the context. Thus, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “responsive to detecting [the stated condition or event]” depending on the context.
As defined herein, the term “responsive to” and similar language as described above, e.g., “if,” “when,” or “upon,” means responding or reacting readily to an action or event. The response or reaction is performed automatically. Thus, if a second action is performed “responsive to” a first action, there is a causal relationship between an occurrence of the first action and an occurrence of the second action. The term “responsive to” indicates the causal relationship.
As defined herein, the term “substantially” means that the recited characteristic, parameter, or value need not be achieved exactly, but that deviations or variations, including for example, tolerances, measurement error, measurement accuracy limitations, and other factors known to those of skill in the art, may occur in amounts that do not preclude the effect the characteristic was intended to provide.
The terms first, second, etc. may be used herein to describe various elements. These elements should not be limited by these terms, as these terms are only used to distinguish one element from another unless stated otherwise or the context clearly indicates otherwise.
A computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the inventive arrangements described herein. Within this disclosure, the term “program code” is used interchangeably with the term “program instructions.” Computer-readable program instructions described herein may be downloaded to respective computing/processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a LAN, a WAN and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge devices including edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
Computer-readable program instructions for carrying out operations for the inventive arrangements described herein may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language and/or procedural programming languages. Computer-readable program instructions may include state-setting data. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some cases, electronic circuitry including, for example, programmable logic circuitry, an FPGA, or a PLA may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the inventive arrangements described herein.
Certain aspects of the inventive arrangements are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer-readable program instructions, e.g., program code.
These computer-readable program instructions may be provided to a processor of a computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the operations specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operations to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the inventive arrangements. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified operations.
In some alternative implementations, the operations noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In other examples, blocks may be performed generally in increasing numeric order while in still other examples, one or more blocks may be performed in varying order with the results being stored and utilized in subsequent or other blocks that do not immediately follow. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, may be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.