In a computer processing system, multiple distinct processing functions may need to be executed on data according to a defined data processing flow, with the output(s) of one function providing the input(s) for the next function. To improve the throughput of the processing system, at any given time, each of the processing functions may be applied to a different data set, or different sub-set of a data set. Such simultaneous or overlapped processing by the various processing functions is referred to as pipelining.
In one example, a system includes a pipeline depth determination circuit and a buffer depth determination circuit. The pipeline depth determination circuit is configured to analyze input-output connections between a plurality of processing nodes specified to perform a processing task, and determine a pipeline depth of the processing task based on the input-output connections. The buffer depth determination circuit is configured to analyze the input-output connections between the plurality of processing nodes, and assign, based on the input-output connections, a depth value to each of a plurality of buffer memories configured to store output of a first of the processing nodes for input to a second of the processing nodes.
In another example, a non-transitory computer-readable medium is encoded with instructions that, when executed, cause a processor to identify input-output connections between a plurality of processing nodes specified to perform a processing task. The instructions also cause the processor to determine a pipeline depth of the processing task based on the input-output connections. The instructions further cause the processor to assign, based on the input-output connections, a depth value to each of a plurality of buffer memories configured to store output of a first of the processing nodes for input to a second of the processing nodes.
In a further example, a method includes identifying, by a processor, input-output connections between a plurality of processing nodes specified to perform a processing task. The method also includes determining, by the processor, a pipeline depth of the processing task based on the input-output connections. The method further includes assigning, by the processor, based on the input-output connections, a depth value to each of a plurality of buffer memories configured to store output of a first of the processing nodes for input to a second of the processing nodes.
For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
The same reference number is used in the drawings for the same or similar (either by function and/or structure) features.
In various types of computing applications (e.g., video, imaging, or vision computing applications), a data processing flow may be represented as a connected graph with the processing nodes (nodes) of the graph representing the processing functions to be executed. Thus, the terms “processing node,” “functional node,” and, more succinctly, “node” may be used to refer to a processing function to be implemented. A heterogeneous computing system, such as a heterogeneous System-on-Chip (SoC), includes a variety of computational resources, such as general-purpose processor cores, digital signal processor (DSP) cores, and function accelerator cores, that may be applied to implement specified processing functions (e.g., to execute the nodes of a graph). To improve throughput of a processing flow, the nodes may be operated as stages of a pipeline. For example, the nodes may be implemented using separate and distinct computational resources, such that the nodes of a processing flow process different portions of a dataset in an overlapped fashion.
To implement pipelined processing based on a graph, for example graph pipelining in accordance with the OpenVX Graph Pipelining Extension, the depth of the pipeline (pipeline depth) and the depth of buffers (buffer depth) provided between pipeline stages must be determined. Pipeline depth and buffer depth are not defined as part of the graph itself. In some graph implementations, these parameters are determined manually as part of the development cycle of the processing flow. Manual selection of these pipeline parameters requires access to expertise and additional development time. For example, selection of optimum pipeline parameters may require multiple cycles of trial and error even with access to graph analysis expertise.
The pipeline processing techniques disclosed herein automatically select values of pipeline depth and buffer depth in a target device, such as a heterogeneous SoC, by analyzing the graph to be executed. Thus, the pipelining manager of the present disclosure determines pipeline parameters without developer assistance while reducing development time and expense. The selected pipeline parameters may also improve the efficiency of computational resource and memory utilization.
The buffer 108 (buffer memory) stores output of the node 102 for input to the node 104. The buffer 110 stores output of the node 104 for input to the node 106. The buffer 112 stores output of the node 106 for input to systems external to the graph 100.
If the graph 100 is executed without pipelining, a new execution of the graph 100 is unable to start until the prior execution of the graph 100 is complete. However, because the computational resources assigned to the nodes of the graph 100 are capable of operating concurrently, executing the graph 100 without pipelining leaves those resources idle for much of each execution, which is inefficient with regard to hardware utilization and throughput.
Pipelining the graph 100 allows for more efficient use of hardware by creating multiple instances of the graph based on the value of the pipeline depth.
At line 2 of pipeline depth routine 1, a topological sort of the nodes of a graph is executed. In lines 5-18 of pipeline depth routine 1, using the sorted nodes, a depth value is assigned to each node. The depth value assigned to a given node is selected based on the number of nodes preceding the given node (e.g., a count of the number of nodes present in a path to the given node). The pipeline depth of the graph is selected to be the highest depth value assigned to a node of the graph.
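The structural depth assignment described above may be sketched in Python as follows. This is an illustrative reconstruction, not the actual source of pipeline depth routine 1; the predecessor-map data structure and node identifiers (taken from the graph 200 example) are assumptions.

```python
from graphlib import TopologicalSorter

def pipeline_depth(preds):
    """preds maps each node to the set of nodes whose output it consumes.

    Returns the pipeline depth implied by the graph structure alone.
    """
    depth = {}
    # Topological sort guarantees every predecessor is visited first.
    for node in TopologicalSorter(preds).static_order():
        # A node's depth is one more than the deepest node preceding it.
        depth[node] = 1 + max((depth[p] for p in preds.get(node, ())), default=0)
    # The pipeline depth is the highest depth assigned to any node.
    return max(depth.values())
```

Applied to a graph shaped like graph 200 (node 201 feeding nodes 202 and 203, both feeding node 204, which feeds node 205), this sketch returns four, matching the result described below.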
Applying the pipeline depth determination of pipeline depth routine 1 to the graph 200, node 201 is assigned a depth value of one. Nodes 202 and 203 are assigned a depth value of two. Node 204 is assigned a depth value of three. Node 205 is assigned a depth value of four. The pipeline depth of the graph 200 based on the structure of the graph 200 is set to four.
At line 2 of pipeline depth routine 2, a topological sort of the nodes of a graph is executed. In lines 7-24 of pipeline depth routine 2, a depth value is assigned to each node. If the computational resource assigned to a given node is not assigned to the node preceding the given node, then the depth of the given node is the depth of the preceding node plus one. If the computational resource assigned to the given node is also assigned to the node preceding the given node, then the depth of the given node is the same as the depth of the preceding node (a same depth value is assigned to the nodes). The pipeline depth of the graph is selected to be the highest depth value assigned to a node of the graph.
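The resource-aware variant may be sketched as follows. Again this is an illustrative reconstruction rather than the actual source of pipeline depth routine 2; the resource map and its string identifiers are assumptions made for the example.

```python
from graphlib import TopologicalSorter

def pipeline_depth_resource_aware(preds, resource):
    """preds maps each node to its predecessor set; resource maps each node
    to its assigned computational resource (e.g., a DSP core identifier).
    """
    depth = {}
    for node in TopologicalSorter(preds).static_order():
        if not preds.get(node):
            depth[node] = 1
            continue
        # A node sharing a resource with a preceding node executes serially
        # with that node, so it does not add a pipeline stage; otherwise the
        # node is one stage deeper than its predecessor.
        depth[node] = max(
            depth[p] if resource[p] == resource[node] else depth[p] + 1
            for p in preds[node]
        )
    return max(depth.values())
```

For a four-node chain in which the two middle nodes share one DSP, the sketch returns three rather than four, reflecting the serialized middle stage.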
Applying the pipeline depth determination of pipeline depth routine 2 to the graph 300, node 301 is assigned a depth value of one. Nodes 302, 303, and 304 are assigned a depth value of two. Node 305 is assigned a depth value of three. The pipeline depth of the graph 300 based on the structure of the graph 300 and the computational resources assigned to the nodes is set to three.
In one example procedure for determining buffer depth, buffer depth is determined based on structure of the graph, without considering the computational resources assigned to the nodes. Buffer depth routine 1 illustrates a software-based implementation of buffer depth determination based on graph structure.
Buffer depth routine 1 sets the depth of each buffer storing output of a given node to one greater than the number of nodes processing the output of the given node. For each node, buffer depth routine 1 identifies an output of the node, and identifies all other nodes that process the output of the node (receive the output of the node as input data). The depth of the buffer receiving the output of the node is initially set to one and incremented with each other node identified as processing the output of the node.
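The buffer depth assignment described above reduces to a simple count, sketched below. This is an illustrative reconstruction of buffer depth routine 1; the consumer-map representation is an assumption.

```python
def buffer_depths(consumers):
    """consumers maps each node to the list of nodes that read its output.

    Returns, per node, the depth of the buffer storing that node's output.
    """
    # One slot for the writing node, plus one slot per reading node, so the
    # writer can fill a new buffer instance while readers drain previously
    # stored output.
    return {node: 1 + len(readers) for node, readers in consumers.items()}
```

For example, a buffer written by one node and read by one other node receives a depth of two, as in the graph 400 example below, and a buffer read by two nodes receives a depth of three, as in the graph 500 example.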
Applying the buffer depth determination of buffer depth routine 1 to the graph 400, the depth of buffer 403 is set to two, allowing a first buffer instance to receive output from the node 401, while a second buffer instance provides previously stored output of node 401 to node 402. Applying the buffer depth determination of buffer depth routine 1 to the graph 500, the depth of buffer 504 is set to three and the depth of buffer 505 is set to two.
In the graph 600 and the graph 700, buffer depth is determined based on structure of the graph, while considering the computational resources assigned to the nodes. Buffer depth routine 2 illustrates an example software-based implementation of buffer depth determination based on graph structure and awareness of computational resource assignments.
Buffer depth routine 2 sets the depth of each buffer storing output of a given node to one greater than the number of nodes processing the output of the given node that are not executed by the same computational resource as the given node. For each node, buffer depth routine 2 identifies an output of the node, initializes the depth of the buffer receiving the output to one, identifies all other nodes that process the output using a different computing resource than the node, and increments the buffer depth for each node receiving output from the buffer that does not use the same computing resource as the node writing to the buffer. Thus, the depth of the buffer is set to one plus the number of nodes reading the buffer that use a different computational resource than the node writing the buffer.
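The resource-aware buffer depth determination may be sketched as follows. As with the earlier sketches, this is an illustrative reconstruction of buffer depth routine 2, with assumed data structures.

```python
def buffer_depths_resource_aware(consumers, resource):
    """consumers maps each node to the nodes reading its output; resource
    maps each node to its assigned computational resource.
    """
    depths = {}
    for node, readers in consumers.items():
        # Readers on the same resource as the writer execute serially with
        # it, so only readers on a different resource need an extra slot.
        extra = sum(1 for r in readers if resource[r] != resource[node])
        depths[node] = 1 + extra
    return depths
```

When writer and reader share a DSP, the buffer between them receives a depth of one, as in the graph 600 and graph 700 examples below; assigning the reader to a different resource restores the depth of two.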
Applying the buffer depth determination of buffer depth routine 2 to the graph 600, the depth of buffer 603 is set to one because nodes 601 and 602 are executed by the same DSP. Applying the buffer depth determination of buffer depth routine 2 to the graph 700, the depth of buffer 704 is set to one and the depth of buffer 705 is set to one because nodes 701, 702, and 703 are executed by the same DSP. Thus, buffer depth routine 2 reduces the amount of memory allocated to the buffers when the same computational resource is applied to serially execute the nodes.
In block 902, a graph to be executed has been downloaded to a heterogeneous computing system, such as the heterogeneous computing system 800. The heterogeneous computing system is configured to execute the graph using pipelining to increase throughput and computing resource utilization. To implement pipelining of the graph, the heterogeneous computing system determines a value of pipeline depth for the graph, and determines a value of buffer depth for each buffer applied to store node output.
To determine the pipeline and buffer depth values, the heterogeneous computing system analyzes the nodes of the graph, and identifies the input-output connections of the nodes. For example, the heterogeneous computing system determines a sequence of nodes connected from graph start to graph end for assigning pipeline depth, and determines, for each node, which other nodes process the output of the node for assigning buffer depth.
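The connection analysis described above might be sketched as follows, building the predecessor and consumer maps used by the depth determinations from an edge list. The edge-list representation is an assumption for illustration.

```python
def analyze_connections(edges):
    """edges is an iterable of (producer, consumer) node pairs.

    Returns a predecessor map (usable for pipeline depth determination) and
    a consumer map (usable for buffer depth determination) describing the
    graph's input-output connections.
    """
    preds, consumers = {}, {}
    for producer, consumer in edges:
        # Record the producer as a predecessor of the consumer, and ensure
        # every node appears in both maps even if it has no entry yet.
        preds.setdefault(producer, set())
        preds.setdefault(consumer, set()).add(producer)
        consumers.setdefault(producer, []).append(consumer)
        consumers.setdefault(consumer, [])
    return preds, consumers
```

The resulting predecessor map feeds the pipeline depth determination of block 904, and the consumer map feeds the buffer depth determination of block 906.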
In block 904, the heterogeneous computing system determines a pipeline depth value based on the interconnection of the nodes between graph start and end. The method 1000, illustrated in FIG. 10, provides an example of the pipeline depth determination of block 904.
In block 906, the heterogeneous computing system determines buffer depth values based on the input-output connections between the nodes, and assigns the buffer depth values to the buffers that store output of the nodes. The method 1100, illustrated in FIG. 11, provides an example of the buffer depth determination of block 906.
The pipelined graph is initialized using the assigned pipeline depth and buffer depth values, and executed by the heterogeneous computing system.
In block 1002, the heterogeneous computing system assigns a pipeline depth value to each node of the graph based on the number of preceding nodes (the number of other nodes connected between the start of the graph and the node).
In block 1004, the heterogeneous computing system identifies the computing resource (the processing circuit) assigned to each node. If two adjacent nodes are implemented using the same computing resource, a same pipeline depth value is assigned to the two adjacent nodes.
In block 1006, the pipeline depth value is set to be a highest node depth value assigned to a node of the graph in block 1004. In some implementations of the method 1000, the pipeline depth value is set to be a highest node depth value assigned to a node of the graph in block 1002.
In block 1102, the heterogeneous computing system assigns, to each buffer receiving output from a node of the graph, a buffer depth value. The buffer depth value is based on a count of nodes that receive input from the buffer (e.g., a count of nodes that process the output of a given node, where the buffer stores the output of the given node). The assigned buffer depth value may be one plus the number of nodes receiving input from the buffer in some implementations.
In block 1104, the heterogeneous computing system identifies adjacent nodes that are to be implemented using a same computing resource (adjacent nodes executed by the same computing resource). Because the processing done by adjacent nodes executed by the same computing resource must be serialized, a buffer depth of one may be assigned to a buffer between such nodes.
The processor platform 1200 includes a processor 1212. The processor 1212 of the illustrated example is hardware. For example, the processor 1212 can be implemented by one or more integrated circuits, logic circuits, microprocessors, or controllers. The processor 1212 may be a semiconductor based (e.g., silicon based) device. The processor 1212 executes instructions for implementing a graph execution framework 1211 that includes a pipelining manager 1234 that configures the heterogeneous computing system to pipeline execution of a graph. The pipelining manager 1234 includes a pipeline depth determination circuit 1236 that determines a pipeline depth for the graph as described herein, and a buffer depth determination circuit 1238 that assigns depth values to the buffers associated with the graph as described herein. The pipeline depth determination circuit 1236 and the buffer depth determination circuit 1238 are formed by execution of the coded instructions 1232 by the processor 1212.
The processor 1212 includes a local memory 1213 (e.g., a cache). The processor 1212 is in communication with a main memory including a volatile memory 1214 and a nonvolatile memory 1216 via a link 1218. The link 1218 may be implemented by a bus, one or more point-to-point connections, etc., or a combination thereof. The volatile memory 1214 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM), Static Random Access Memory (SRAM), and/or any other type of random access memory device. The nonvolatile memory 1216 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory may be controlled by a memory controller.
The processor platform 1200 may also include an interface circuit 1220. The interface circuit 1220 may be implemented according to any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.
One or more input devices 1222 may be connected to the interface circuit 1220. The one or more input devices 1222 permit a user to enter data and commands into the processor 1212. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, a voice recognition system and/or any other human-machine interface. Also, many systems, such as the processor platform 1200, can allow the user to control the computer system and provide data to the computer using physical gestures, such as, but not limited to, hand or body movements, facial expressions, and face recognition.
One or more output devices 1224 may also be connected to the interface circuit 1220. The output devices 1224 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display, a cathode ray tube display (CRT), a touchscreen, a tactile output device, a printer and/or speakers). The interface circuit 1220 may include a graphics driver device, such as a graphics card, a graphics driver chip, or a graphics driver processor.
The interface circuit 1220 may also include a communication device such as a transmitter, a receiver, a transceiver, a modem and/or network interface card to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 1226 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.).
The processor platform 1200 may also include one or more mass storage devices 1228 for storing software and/or data. Examples of mass storage devices 1228 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, RAID (redundant array of independent disks) systems, and digital versatile disk (DVD) drives.
Coded instructions 1232 corresponding to the instructions of pipeline depth routine 1, pipeline depth routine 2, buffer depth routine 1, and/or buffer depth routine 2 may be stored in the mass storage device 1228, in the volatile memory 1214, in the nonvolatile memory 1216, in the local memory 1213 and/or on a removable tangible computer readable storage medium, such as a CD or DVD. The processor 1212 executes the instructions as part of the pipeline depth determination circuit 1236 or the buffer depth determination circuit 1238.
In this description, the term “couple” may cover connections, communications, or signal paths that enable a functional relationship consistent with this description. For example, if device A generates a signal to control device B to perform an action: (a) in a first example, device A is coupled to device B by direct connection; or (b) in a second example, device A is coupled to device B through intervening component C if intervening component C does not alter the functional relationship between device A and device B, such that device B is controlled by device A via the control signal generated by device A.
A device that is “configured to” perform a task or function may be configured (e.g., programmed and/or hardwired) at a time of manufacturing by a manufacturer to perform the function and/or may be configurable (or re-configurable) by a user after manufacturing to perform the function and/or other additional or alternative functions. The configuring may be through firmware and/or software programming of the device, through a construction and/or layout of hardware components and interconnections of the device, or a combination thereof.
A circuit or device that is described herein as including certain components may instead be adapted to be coupled to those components to form the described circuitry or device. For example, a structure described as including one or more semiconductor elements (such as transistors), one or more passive elements (such as resistors, capacitors, and/or inductors), and/or one or more sources (such as voltage and/or current sources) may instead include only the semiconductor elements within a single physical device (e.g., a semiconductor die and/or integrated circuit (IC) package) and may be adapted to be coupled to at least some of the passive elements and/or the sources to form the described structure either at a time of manufacture or after a time of manufacture, for example, by an end-user and/or a third-party.
Circuits described herein are reconfigurable to include additional or different components to provide functionality at least partially similar to functionality available prior to the component replacement.
Modifications are possible in the described embodiments, and other embodiments are possible, within the scope of the claims.
This application claims priority to U.S. Provisional Application No. 63/140,649, filed Jan. 22, 2021, entitled Automated Pipeline Settings for Graph-Based Applications in Heterogeneous SoC's, which is hereby incorporated by reference.