The invention relates to a direct memory access controller intended to be placed in a computing node of a system on chip. It also relates to a data processing device of a network-on-chip in which each computing node comprises such a direct memory access controller. It further relates to a data reception and storage method which can be implemented by means of such a direct memory access controller.
Such a controller, termed a DMA (Direct Memory Access) controller, is generally used in a data processing device employing distributed processing requiring numerous data transfers between various computing nodes. The controller makes it possible to transfer data, to be processed or already processed by a processor or accelerator of a computing node, coming from or going to a peripheral, such as another computing node, a communication port, a hard drive, or any kind of memory. The transfer is made between the peripheral and a local memory placed as close as possible to the processor or accelerator in the computing node to which it belongs, without intervention by the processor or accelerator other than initiating and finishing the transfer. In this way, in a distributed architecture in which there are numerous data exchanges, for example in the application field of the Internet of Things (IoT), processors and accelerators dedicated to data processing are freed from transfers.
A DMA controller is, for example, very useful in a system in which repeated accesses to fast peripherals could otherwise freeze up, or at least slow down, the processing performed by the processor or accelerator close to which it is placed. Its presence optimizes the processing time of the software applications executed by the processor or accelerator, since the DMA controller manages the data transfers to and from the local memory on their behalf. It is generally used for data transmission, that is, for a remote transfer from the local memory, but it can also be used for data reception, that is, for receiving data and transferring them to the local memory.
The invention applies more specifically to a direct memory access controller intended to be placed in a computing node of a system on chip, used at least for data reception, thus comprising:
Such a DMA controller is described, for example, in the article by Martin et al. entitled “A microprogrammable memory controller for high-performance dataflow applications”, presented at the ESSCIRC 2009 conference held in Athens (Greece) on Sep. 14-18, 2009. This DMA controller is primarily used for data transmission, so it also comprises an output buffer memory for transmitting processed data packets to the input/output interface of the computing node, and a module for controlling reads in the local address space. Its arithmetic logic unit for executing microprograms is controlled by the read control module and is applied, in data transmission, to reading these data in the local address space. It also makes it possible to manage complex programmed movements of a work pointer, during reading and writing, in an additional buffer memory of the DMA controller, so as to reorder, if necessary, the data coming from the local address space before they are transmitted outside the computing node, in compliance with a predefined sequence. In this way the data can be organized for transmission as a function of the subsequent processing they are to undergo. But some processing cannot anticipate a reorganization of the data at the source, meaning that the reorganization must be done upon receipt. Consequently, when the case arises, the data received by a computing node must generally first be written to a first local address space and then reorganized into a second local address space, thus doubling the amount of address space required. Even if the teaching of the aforementioned Martin et al. document were transposed, by symmetry, to reception, the necessary use of an additional internal buffer memory in the DMA controller would double the amount of storage space occupied in the receiving computing node.
A solution is suggested in the article by Dupont de Dinechin et al. entitled “A clustered manycore processor architecture for embedded and accelerated applications”, presented at the HPEC 2013 conference held in Waltham, Mass. (USA) on Sep. 10-12, 2013. It consists in adding explicit instructions, as metadata, for destination address jumps applicable at least to the initial data of each received packet. But for complex reorganizations, such a method is likely to lead to the transmission of excessively large quantities of metadata in addition to the actual data.
It may therefore be desirable to design a direct memory access controller which can overcome at least some of the aforementioned problems and limitations.
Consequently, a direct memory access controller intended to be placed in a computing node of a system on chip is proposed, comprising:
Consequently, the reorganization of the addresses of data being received is explicitly microprogrammed, that is, defined as close as possible to the processing that needs to be applied to the data, and executed by the arithmetic logic unit of the direct memory access controller under the control of its write control module. This reorganization therefore occurs on the fly and directly in the shared local address space, without data duplication. In other words, it can be done at the “transport” level of the OSI (Open Systems Interconnection) standard communication model, by working directly on the addresses of the data without any need for intermediate storage of the data themselves. Bearing in mind that for a great deal of processing, such as the encoding of images by wavelet transformation, which requires matrix transpositions for horizontal and vertical filtering, data reorganization can take up almost as much memory space and computing time as the data processing itself, about a third of the local memory space is thus saved and the computing time is cut in half.
As an option:
Also as an option:
Also as an option, the execution parameters to be defined comprise image processing parameters including at least one of the elements of the set consisting of:
Also as an option, the write control module is designed to control the execution of several microprograms in parallel on several different execution channels, each of which is identified by a channel identifier included in each packet of data to be processed, and to reserve, in the shared local address space, several address sub-spaces respectively associated with the execution channels.
Also as an option, said at least one microprogram stored in the register comprises instruction lines for arithmetic and/or logical calculation aimed at carrying out a matrix transposition between the data to be processed, as received by the input buffer, and the same data to be processed, as reorganized and written to the shared local address space.
Also as an option, at least two operating modes of said at least one microprogram stored in the register are configurable:
Also as an option, two write modes of the write control module can be configured by an identifier included in a header of each packet of data to be processed:
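As an illustration of the matrix-transposition option mentioned above, the destination-address arithmetic that such a microprogram performs can be sketched as follows. This is a simplified model, not the controller's actual microinstruction set: the function name, the default word size, and the row-major arrival order are assumptions made for the example.

```python
def transposed_addresses(base, rows, cols, elem_size=4):
    """Yield the destination address of each data word as it arrives.

    The words of a rows x cols matrix arrive in row-major order (r, c);
    each is written at the address the transposed matrix assigns to it,
    i.e. position (c, r), so the data land transposed in local memory
    without any intermediate buffer or data duplication.
    """
    for r in range(rows):
        for c in range(cols):
            # Operands are memory addresses only, never the data themselves.
            yield base + (c * rows + r) * elem_size
```

For a 2 x 3 matrix of byte-sized elements starting at offset 0, the six arriving words would be written at offsets 0, 2, 4, 1, 3, 5, which is the transposed layout.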
Also proposed is a network-on-chip data processing device comprising:
Also proposed is a method for receiving and storing data by a computing node of a network-on-chip data processing device, comprising the following steps:
The invention will be better understood through the following description provided solely as an example and given in reference to the appended drawings, in which:
Data processing device 10 shown in
In the example of
Node Ni,j more specifically comprises a computing node Ci,j and a router Ri,j. The function of router Ri,j is to direct each data packet that it receives from computing node Ci,j or from one of the aforementioned four directions to one of the aforementioned four directions or to computing node Ci,j. In a well-known manner, computing node Ci,j comprises a network interface 14i,j for the input/output of data packets, capable of communicating with router Ri,j. It also includes, for example, a set of microprogrammable processors 16i,j and a set of dedicated accelerators 18i,j for processing data contained in the packets received via network interface 14i,j. It also includes a local memory 20i,j shared by processors 16i,j and accelerators 18i,j for carrying out their processing. The data to be processed, contained in the packets received via network interface 14i,j, are intended to be stored therein. This shared local memory 20i,j is shown in the form of a single storage resource in
Computing node Ci,j also comprises a direct memory access controller or DMA controller which is functionally split in two in
All the aforementioned functional elements, 14i,j, 16i,j, 18i,j, 20i,j, 22i,j, and 24i,j, of computing node Ci,j are connected to an internal data transmission bus 26i,j, by which they can communicate with each other, particularly in order to facilitate and accelerate local memory access.
Just like node Ni,j, node Ni,j+1 comprises a computing node Ci,j+1 and a router Ri,j+1, computing node Ci,j+1 comprising a network interface 14i,j+1, a set of microprogrammable processors 16i,j+1, a set of dedicated accelerators 18i,j+1, a shared local memory 20i,j+1, a DMA controller 22i,j+1 for receiving data to be processed, a DMA controller 24i,j+1 for transmitting processed data and an internal bus 26i,j+1. Likewise, node Ni+1,j comprises a computing node Ci+1,j and a router Ri+1,j, computing node Ci+1,j comprising a network interface 14i+1,j, a set of microprogrammable processors 16i+1,j, a set of dedicated accelerators 18i+1,j, a shared local memory 20i+1,j, a DMA controller 22i+1,j for receiving data to be processed, a DMA controller 24i+1,j for transmitting processed data and an internal bus 26i+1,j. Likewise, node Ni+1,j+1 comprises a computing node Ci+1,j+1 and a router Ri+1,j+1, computing node Ci+1,j+1 comprising a network interface 14i+1,j+1, a set of microprogrammable processors 16i+1,j+1, a set of dedicated accelerators 18i+1,j+1, a shared local memory 20i+1,j+1, a DMA controller 22i+1,j+1 for receiving data to be processed, a DMA controller 24i+1,j+1 for transmitting processed data and an internal bus 26i+1,j+1.
Processing device 10 also comprises one or more peripherals connected to one or more nodes of the network on chip. Two are shown in
The general architecture of receiving DMA controller 22i,j of computing node Ci,j will now be described in detail in reference to
Receiving DMA controller 22i,j firstly comprises an input buffer 32i,j for receiving packets of data to be processed transmitted by router Ri,j over network interface 14i,j. Its primary function is to write the data to be processed, contained in the packets it receives, to dedicated address spaces of shared local memory 20i,j. Advantageously, this input buffer 32i,j is as small as possible, which however does not facilitate the management of fluctuations in incoming traffic. The latency between this buffer memory and shared local memory 20i,j is also a critical parameter that it is important to limit. Consequently, in practice it is appropriate to anticipate changes of context between received packets to avoid lost clock cycles, while still opting for an effective and memory-efficient construction.
Receiving DMA controller 22i,j also comprises an output buffer 34i,j for sending various pieces of information complying with implemented communication or read/write protocols and resulting from its write operations in shared local memory 20i,j, to the network on chip from router Ri,j via network interface 14i,j.
It also includes a module 36i,j for managing writing to shared local memory 20i,j. More specifically, the function of this module 36i,j is to extract the data to be processed from each packet received sequentially by input buffer 32i,j and to direct them to a corresponding address space in shared local memory 20i,j for their subsequent processing by at least one of microprogrammable processors 16i,j and accelerators 18i,j. According to a non-limiting embodiment as illustrated in
Receiving DMA controller 22i,j further comprises a storage register 44i,j for storing at least one microprogram. Several microprograms 46i,j(1), 46i,j(2), 46i,j(3), . . . are thus stored in the register and respectively associated with several corresponding identifiers making it possible to distinguish them from each other.
As shown in
As an option, selected microprogram 46i,j(Id) may function according to at least two different configurable modes:
A corresponding operating mode parameter may be defined in packet 48, in primary header 50 or in the secondary header. This parameter may also be directly accessible in register 44i,j, for example via an MMIO (Memory-Mapped I/O) interface, by a user, by microprogrammable processors 16i,j or by accelerators 18i,j.
Also as an option, write control module 36i,j may function according to at least two different write modes:
A corresponding write mode parameter may be defined in packet 48, in primary header 50 or in the secondary header. This parameter may also be directly accessible in register 44i,j by a user, by microprogrammable processors 16i,j, or by accelerators 18i,j.
Note that nothing prevents write control module 36i,j from processing a plurality of data packets at a time, particularly thanks to its pipeline architecture. In this case, it is advantageously designed to control the execution of several microprograms in parallel on several different virtual execution channels, each identified by a channel identifier included in each packet of data to be processed (in the primary header, for example). Indeed, it is sufficient to create a plurality of instruction threads “thd” in parallel, each associated with an execution channel. An input buffer 32i,j and an output buffer 34i,j may also be associated with each execution channel. Likewise, each execution channel is associated with a local address space that is specifically dedicated to it in shared local memory 20i,j.
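The per-channel dispatch described above can be sketched as follows. All names, the dictionary-based channel table, and the packet format are illustrative assumptions, not the controller's actual implementation; the sketch only shows how a channel identifier carried in each packet selects both a microprogram and a dedicated address sub-space.

```python
class ChannelDispatcher:
    """Sketch of per-channel dispatch: each virtual execution channel is
    bound to its own microprogram identifier and to the base address of
    the sub-space reserved for it in shared local memory."""

    def __init__(self):
        self.channels = {}  # channel id -> (microprogram id, sub-space base)

    def configure(self, chan, prog_id, base):
        # Reserve an address sub-space and bind a microprogram to a channel.
        self.channels[chan] = (prog_id, base)

    def dispatch(self, packet):
        # The channel identifier is read from the packet header; the
        # matching microprogram and address sub-space are selected, so
        # several instruction threads can run in parallel, one per channel.
        prog_id, base = self.channels[packet["chan"]]
        return prog_id, base
```

Two packets carrying different channel identifiers would thus be steered to different microprograms and disjoint regions of the shared local memory, without interfering with each other.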
Note also that each microprogram is actually a parametric software kernel which can be triggered upon receipt of each packet and completed, at the time of its execution by an instruction thread, by means of parameters supplied in each packet. It can be precompiled and stored in the form of binary code in register 44i,j. Advantageously, it is implemented in the form of an API (Application Programming Interface) to avoid any risk of inconsistency between microprograms and configurations at the time of their execution.
In accordance with the architecture described in detail above, in reference to
In a first step 100, data packets concerning a given execution channel are received via network interface 14i,j of computing node Ci,j by input buffer 32i,j dedicated to this execution channel. As mentioned above, a plurality of input buffers dedicated respectively to a plurality of execution channels may be provided in receiving DMA controller 22i,j.
During a next step 102, the data to be processed “data” are extracted from each received packet by first loading unit 38i,j, along with the “Id” and “param” parameters used to select the microprogram corresponding to the execution channel in question, to fully parameterize its operation, and to indicate the desired execution or write mode.
During a next step 104, an instruction thread “thd” is created by decoding unit 40i,j based on the “Id” and “param” parameters, and is executed by arithmetic logic unit 42i,j for reorganization F of the destination addresses @d of the data to be processed “data” in shared local memory 20i,j. As indicated earlier, the instruction lines of the selected microprogram concern only arithmetic and/or logical calculations in which the operands are memory addresses of these data “data” in shared local memory 20i,j. The reorganization of the data to be processed “data” is therefore done on the fly without duplication of these data in any address space of receiving DMA controller 22i,j or to which it has access.
Lastly, in final step 106, created instruction thread “thd” writes the data to be processed “data” to the reorganized addresses F(@d) in a dedicated local address space of shared local memory 20i,j.
In light of the pipeline architecture of write control module 36i,j, steps 100 to 106 are actually executed continuously in an interlaced manner.
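Steps 100 to 106 can be summarized, in deliberately simplified sequential form, by the following sketch. All names, the dictionary-based packet format, and the single “transpose” microprogram are illustrative assumptions; an actual controller executes these steps in an interlaced pipeline, as noted above.

```python
# Illustrative register of microprograms: each entry takes the "param"
# value from a packet and returns the address reorganization F, which
# maps the linear arrival offset i to its destination offset. Here
# param = (rows, cols) and F performs a matrix transposition.
MICROPROGRAMS = {
    "transpose": lambda param: (
        lambda i: (i % param[1]) * param[0] + i // param[1]
    ),
}

def receive_and_store(packets, memory):
    """Simplified sequential model of steps 100 to 106."""
    for pkt in packets:                      # step 100: packet reception
        prog = MICROPROGRAMS[pkt["Id"]]      # step 102: extract Id and param
        F = prog(pkt["param"])               # step 104: thread computes F(@d)
        for i, word in enumerate(pkt["data"]):
            memory[F(i)] = word              # step 106: write at F(@d),
                                             # with no duplication of the data
```

Fed with a single packet carrying a 2 x 3 matrix in row-major order, this model writes the six words directly at their transposed positions in the destination memory.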
Also note that these steps can be executed by software execution of one or more computer programs comprising instructions defined for this purpose, with receiving DMA controller 22i,j being able to consist of a “computer” such as a programmable microprocessor. The functions performed by these programs could also be at least in part microprogrammed or micro-hardwired in dedicated integrated circuits. For instance, as a variant, the computing device implementing receiving DMA controller 22i,j could be replaced with an electronic device consisting solely of digital circuits (without a computer program) for completing the same actions.
It is clear that a device such as the one described earlier allows for an effective reorganization of data transmitted by DMA controllers between computing nodes of a network on a chip, upon receipt of these data by the computing nodes that are supposed to process them, and without increasing local storage needs. In image compression processing applications involving, for example, DCT (Discrete Cosine Transform) processing or wavelet processing, the savings in data processor computing time and storage space are very substantial thanks to the implementation of the general principles of the present invention. In particular, the invention has particularly promising applications in distributed compute core devices operating at low computing frequency with low energy consumption, which is characteristic of the IoT audio and video field of application: energy consumption between 100 mW and a few Watts, on-chip memory of less than 1 MB, optional external memory between 256 kB and 1 MB, small integrated circuit. Of these applications, let us mention computer vision, voice recognition, digital photography, face detection or recognition, etc.
In addition, it should be noted that the invention is not limited to the embodiment described above. Indeed, a person skilled in the art could conceive of various modifications to the invention in light of the teaching disclosed above. In the claims which follow, the terms must not be interpreted as limiting the claims to the embodiment presented in the present description, but rather must be interpreted as including all equivalent measures that the claims are intended to cover, in light of their wording, and which can be foreseen by a person skilled in the art through the application of his/her general knowledge to the implementation of the teaching disclosed above.
Number | Date | Country | Kind |
---|---|---|---|
17 57998 | Aug 2017 | FR | national |