The present technology is directed to the provision of data stream processors configured to execute repetitive or patterned arithmetical/logical operations (such as computer vision or image processing operations) on data, possibly according to stencil processing algorithms. In image processing and computer vision tasks, it is frequently necessary to perform sequences or arrangements of instructions in a patterned or correlated manner—one example of this type of processing is stencil processing.
Stencil processing operations are a widely-used class of data processing operations in which fixed patterns are applied repetitively to subsets of a set of data (for example, using a sliding window pattern for acquiring the data to be processed), typically with some dependencies among the data elements of the subsets and/or correlations among the operations to be executed at each instance of the stencil's application. Stencil operations are well-adapted to take advantage of spatial and temporal locality in data, and can provide advantages in efficiency of processing and in economy of resource consumption, by, for example, reducing the number of memory accesses required to perform a process that features repetitions and correlations.
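By way of non-limiting illustration, the following minimal Python sketch shows the sliding-window stencil pattern described above, using a 3×3 mean filter; the function name and data layout are illustrative assumptions and do not form part of the present technology.

```python
# A minimal sketch of a 2D stencil operation: a 3x3 mean filter applied
# with a sliding window. All names here are illustrative only.

def mean_filter_3x3(image):
    """Apply a 3x3 mean filter to a 2D list of pixel values.

    Each output element depends on its input element and its eight
    neighbours, the dependency pattern characteristic of stencils.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):          # slide the window over rows
        for x in range(1, width - 1):       # ...and over columns
            acc = 0.0
            for dy in (-1, 0, 1):           # the fixed 3x3 stencil pattern
                for dx in (-1, 0, 1):
                    acc += image[y + dy][x + dx]
            out[y][x] = acc / 9.0
    return out

# Example: a 4x4 ramp image; interior pixels are averaged over 3x3 windows.
img = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
print(mean_filter_3x3(img))
```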
A typical example of a processing entity that is capable of performing repetitive or patterned arithmetical/logical operations on data is a Graphics Processing Unit (GPU). Conventional GPUs were designed for the specific purpose of processing inputs in the form of, typically, annotated mathematical (usually vector) representations of images, extracting geometrical forms and their positions from those representations, manipulating and interpreting annotations describing characteristics of elements in the images (such as colour and texture), and providing outputs suitable for controlling the rasterization of a final output image to display buffers ready for display on an output device, such as a display screen or a printer. In performing these functions, GPUs frequently operated in a single instruction, multiple data mode to perform repetitive arithmetical/logical operations on data.
In conventional GPUs, there are sub-units providing the various functions required for the computational processing of graphics, the sub-units having access to a dedicated memory subsystem and also typically having one or more caches used for input and output buffering and for intermediate data storage during processing and usually providing high-speed data load and store operations. The units providing these functions are typically operable in parallel processing pipelines to handle the often very large amounts of data that need to be processed.
Because GPUs are characterised by their ability to process very large sets of data, using massive parallelism, at the very high speeds needed for detailed rendition of still or video graphics on screens, developers have observed that they are also well adapted to other uses, such as processing the very large statistical data sets needed for scientific, medical and pharmacological data analysis and for artificial intelligence inferencing.
It is thus now known in the art to use GPUs to perform other functions—for example, it is known to exploit the built-in parallel processing capabilities of GPUs to perform non-graphics-related computations, such as computations on statistical data sets or machine-learning neural network tensor data. The parallel processing capabilities of GPUs make possible the concept of the general purpose GPU (or GPGPU), operable alongside conventional CPUs to take on workloads that need such parallel processing capabilities. This is typically achieved by using special purpose software that is adapted to exploit the strengths of GPU hardware for these non-graphics-related functions.
Recently, developers have realised that it is also possible to exploit the parallel processing power of GPUs to perform visual data processing, such as image processing, by enabling the sub-units to perform the computations required to process the computer vision or image data, under control of specialised software.
The type of visual data processing or image processing envisioned here is the processing of input data from a camera or other image capture device to prepare the data (typically using image-to-image manipulations, such as image simplification, normalization and transformation) for computational operations such as image recognition, and this clearly differs from the conventional use of GPUs.
In an approach to addressing some difficulties in providing efficient, and possibly low power-consumption, repetitive arithmetical/logical operation processing of data such as image or computer vision data, the present technology provides a data stream processor according to the appended claims.
In other approaches, there may be provided a method of operating a data stream processor according to the present technology, and that method may be realised in the form of a computer program operable to cause a computer system to perform the process of the present technology. As will be clear to one of skill in the art, a hybrid approach may also be taken, in which hardware logic, firmware and/or software may be used in any combination to implement the present technology.
Reference is made in the following detailed description to accompanying drawings, which form a part hereof, wherein like numerals may designate corresponding and/or analogous parts throughout. It will be appreciated that the figures have not necessarily been drawn to scale, for example for simplicity and/or clarity of illustration; dimensions of some aspects may be exaggerated relative to others. Further, it is to be understood that other embodiments may be utilized, and structural and/or other changes may be made without departing from claimed subject matter. It should also be noted that directions and/or references, for example, such as up, down, top, bottom, and so on, may be used to facilitate discussion of drawings and are not intended to restrict application of claimed subject matter.

Thus, seen broadly, the present technology provides a configurable, repetitive or patterned operation data stream processor composed of at least one compute unit and at least one memory unit implemented using dataflow principles. The compute unit comprises arrangements of processing units (typically arranged in array form of rows and columns) that can be configured to operate in different ways based on a setup conveyed by a configuration memory or instructions. The processing units can receive data from input queues in a memory unit and from one another in various arrangements of linkages.
Because of their high performance per Watt of power consumed, GPUs have become desirable computing platforms for implementing computational imaging and vision pipelines. As is known in the art, one estimate is that a GPU-implemented vision processing pipeline can result in approximately five to ten times better performance efficiency (in performance per Watt) than a conventional CPU-based implementation. Typically, the camera, imaging, and computer vision pipelines mapped to the GPU hardware are realised by using the facilities provided by GPU shader software programs. These GPU shader software programs make use of the available GPU hardware resources, typically by using the facilities of a texture unit (TU) for hardware sampling from the image frame buffers, the facilities of an execution unit (EU) for arithmetic data paths (integer, or floating point), and the facilities of a post-processing unit (PPU) for final post-processing tasks like 2D blit (rapid data move/copy in memory) operations, composition operations like alpha compositing, colour space conversions and the like.
Thus the use of a GPU can be an effective alternative to the use of a CPU for solving complex image processing tasks. The performance per Watt of power consumed of optimized image processing solutions on a GPU is much higher than that of the same functions on a CPU. As will be clear to one of ordinary skill in the art, the GPU architecture allows parallel processing of image pixels which, in turn, leads to a reduction of the processing time for a single image and thus reduced latency for the system as a whole.
For image-related tasks that require the use of neural networks (such as image recognition), the provision in GPUs of hardware tensor kernels can significantly improve performance. High performance GPU software can reduce hardware resource usage in such systems, and the high energy efficiency of the GPU hardware reduces power consumption. Thus, a GPU has the flexibility, high performance, and low power consumption required to represent an attractive alternative to highly specialized field programmable gate array and application-specific integrated circuit systems, especially for mobile and embedded image processing applications. In a GPU configured in this way, visual data processing can be performed by the execution and/or texture units, or inside the neural network engine, but this typically leads to monopolisation of the arithmetical/logical capacity of these units, and hence the overall pipeline performance is degraded. In addition, software-controlled adaptation of these processing units to perform visual data processing to some extent detracts from their efficiency, as they are specifically designed for the different requirements of conventional graphics processing tasks.
The uses of computer visual processing are expanding with the developments in the use of, for example, robots and other autonomous systems requiring fast and accurate computer vision, augmented reality devices and applications, and artificial intelligence systems needing large scale learning data that may include visual representations to be provided in a usable form.
With the increased amount and importance of visual image computing, the present technology addresses some of the performance deficiencies encountered in known techniques of using a conventional GPU under high-level software control for classic image processing. It does so by integrating a vision engine into a GPU shader core, where the vision engine can seamlessly interoperate with the execution units and the other graphics-specific units, as well as with the neural network engine when inferencing is required, either to achieve part of the image processing task or to operate on the output of the visual processing engine as a post-processing task.
The processing units of the present technology are particularly well-adapted to perform a limited set of primitive visual processing operators from which any higher-level operators may be constructed, thereby forming a hardware/firmware/software stack implementation of a visual processing architecture arranged according to the following rules:
The visual processing architecture defines a set of primitive operators according to the rules to which higher level operators can be consistently reduced—the present technology provides a base upon which such an architecture can advantageously be implemented.
Each of the processing units in a compute unit according to the present technology is specifically adapted to perform data processing on at least a portion of a data stream according to the primitive operator or combination of operators for a received configuration instruction.
By providing a structure in which sets of processing units designed to perform these primitive operators can be reconfigured in various sequential and parallel structures to perform their operations on visual or image data, the present technology advantageously exploits the performance and efficiency characteristics of GPU architecture. Within a compute unit, the processing units can pass data directly to one another in various arrangements of linkages, or they can pass data via a memory in a memory unit. In one arrangement, the data can pass as a continuous stream through an array of processing units to perform sequences of operations as instructed by the configuration memory or received instructions. In another arrangement, the compute unit may have its processing units operating in various stencil modes, to perform stencil-type operations (operations in which an action on one element of a data set is conditioned by operations on neighbouring or related elements of the data set). In yet a further arrangement, the compute unit may have its processing units configured to perform a hybrid of arrangements of operation types, for example, by varying the linkages by row, column or n-dimensional subset. The processed data from the processing units may be accumulated and post-processed, for example by data reduction, at an accumulator, before being passed to an output queue in the memory unit.
Additional configurability and scale can be achieved at the level of the compute units by arranging their external linkages to form chains, thereby increasing the number of processing units that can be brought to bear on the input data stream.
The present technology is architected to provide native support for the efficient execution of common computer vision and image processing operations. The compute unit is fed data using a specialised memory unit that is operable to look up data in line buffer and circular buffer (or FIFO) modes from memories, such as SRAMs. The common computer vision and image processing operations for which the present technology is particularly adapted comprise:
As will be clear to one of skill in the art, the common computer vision operations can benefit from the characteristics inherent in systems arranged according to the present technology, because the present technology is particularly suited to operations involving the different forms of data locality. One set of computing operations of the sort envisioned here are stencil type operations (where an action taken on an element of a data set affects and/or is affected by the relative positions of the data elements and/or actions taken on neighbouring elements)—the present technology is operable in at least one of its configurations to efficiently exploit spatial locality to perform these operations. This is achieved using a set of processing units configured to operate in a sliding window mode as required to perform, for example, one- and two-dimensional convolution, Resize_Bilinear and Expand operations.
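As a hedged illustration of the sliding window mode, the following Python sketch models a row of processing units, each holding one filter tap, computing a one-dimensional convolution; the class and function names are assumptions made for illustration only, not the interface of the present technology.

```python
# Illustrative sketch: a row of processing units operating in a
# sliding-window mode to compute a one-dimensional convolution.

class ProcessingUnit:
    def __init__(self, coefficient):
        self.coefficient = coefficient  # one filter tap per unit

    def multiply(self, sample):
        return self.coefficient * sample

def convolve_1d(stream, taps):
    """Slide a window of len(taps) over the stream; each processing
    unit multiplies one windowed sample by its tap, and the partial
    products are summed (e.g. by an accumulator)."""
    units = [ProcessingUnit(t) for t in taps]
    out = []
    for i in range(len(stream) - len(taps) + 1):
        window = stream[i:i + len(taps)]    # spatially local samples
        out.append(sum(u.multiply(s) for u, s in zip(units, window)))
    return out

print(convolve_1d([1, 2, 3, 4, 5], [0.25, 0.5, 0.25]))  # -> [2.0, 3.0, 4.0]
```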
In another of its configurations, the present technology is operable to perform, for example, image reduction operations such as Reduce_Interpolation_Channelwise, Reduce_Max_Planewise, Reduce_Sum_Channelwise, and the like.
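The following sketch suggests one plausible interpretation of two of the named reduction operators; the exact semantics of Reduce_Sum_Channelwise and Reduce_Max_Planewise are not specified here, so the channel and plane conventions below are assumptions for illustration.

```python
# Hedged sketch of two named reduction operators, assuming a tensor laid
# out as a list of channel planes, each plane a 2D list of pixels.

def reduce_sum_channelwise(tensor):
    """Sums corresponding pixels across all channel planes into one plane."""
    h, w = len(tensor[0]), len(tensor[0][0])
    return [[sum(plane[y][x] for plane in tensor) for x in range(w)]
            for y in range(h)]

def reduce_max_planewise(tensor):
    """Reduces each channel plane to its single maximum value."""
    return [max(max(row) for row in plane) for plane in tensor]

two_planes = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(reduce_sum_channelwise(two_planes))  # -> [[6, 8], [10, 12]]
print(reduce_max_planewise(two_planes))    # -> [4, 8]
```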
In both the exemplary configurations described above, scalar reductions and efficient stitching of data are required to achieve the desired result.
Turning to
Thus, each compute unit within a data stream processor has a grid (or other layout) of processing units interconnected with streaming interfaces. The processing units process the incoming streamed data and forward either the incoming streamed data or their own output to another processing unit, or forward the output to the accumulator for output to a memory unit.
To address efficient repetitive and stencil-type operations, the present technology incorporates data forwarding techniques that are operable to take advantage of the spatial characteristics of the incoming data and the types of operation that are typically required for visual or image processing.
In the sequential mode of operation, the output of each processing unit in a column is forwarded to the corresponding processing unit in the next column. This mode of operation is adapted to the operations required by many image processing tasks, in which there are regular sequences of operations such as Add followed by Sub and then Mult. Instead of having a dedicated register file to write to and read from, this fabric supports direct shifting of data to the processing units in the next column.
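A minimal sketch of this sequential, column-to-column forwarding, using the Add, Sub, Mult sequence mentioned above, might look as follows; the representation of each column as an (operation, operand) pair is an illustrative assumption.

```python
# Sketch of the sequential mode: each column applies one operation and
# shifts its result directly to the next column, with no intervening
# register file.

import operator

# One (operation, constant operand) pair per column of processing units.
columns = [
    (operator.add, 3),   # column 0: Add 3
    (operator.sub, 1),   # column 1: Sub 1
    (operator.mul, 2),   # column 2: Mult by 2
]

def run_row(sample):
    """Stream one data element through the columns left to right."""
    value = sample
    for op, operand in columns:
        value = op(value, operand)   # result shifts to the next column
    return value

# Each element of the stream flows through Add, Sub, then Mult.
print([run_row(s) for s in [0, 1, 2, 3]])  # -> [4, 6, 8, 10]
```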
The sequential mode of operation or sequential configuration 200 of the data stream processor (100 of
Turning now to
In this mode, initial inputs to the rows comprise individual per-row data from input queue A 106 and a constant 108A passed to each row from input queue B (108 of
Turning now to
In this mode, initial inputs to the rows comprise individual per-row data from input queue A 106 and a constant 108A passed to each row from input queue B (108 of
The two-dimensional configuration of
In both the stencil-type processing configurations, there is a requirement for data lookup support in line buffer mode in the memory. Taking a two-dimensional image as an example, the pixels are defined in memory using a Cartesian co-ordinate system (x,y) where ‘y’ denotes the row or line number and ‘x’ denotes the column number.
If the available bandwidth for the lookup is 4, the memory organization and read capabilities for this type of data lookup can permit the reading of:
The read and write phases are implemented using self-throttling mechanisms, keeping the control logic to a minimum.
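The following sketch models one plausible line-buffer organization with a read bandwidth of 4 and a simple self-throttling read, in which a read of a line that has not yet been written returns nothing; the class shape and the exact read semantics are assumptions for illustration.

```python
# Hedged sketch of a line-buffer lookup with a read bandwidth of 4,
# assuming one lookup returns four consecutive pixels of a line
# addressed in (x, y) form, where y is the line and x the column.

class LineBuffer:
    def __init__(self, lines, width, bandwidth=4):
        self.width = width
        self.bandwidth = bandwidth
        self.storage = [[0] * width for _ in range(lines)]  # e.g. SRAM lines
        self.lines_written = 0

    def write_line(self, pixels):
        """Write one image line into the circular line store."""
        self.storage[self.lines_written % len(self.storage)] = list(pixels)
        self.lines_written += 1

    def read(self, x, y):
        """Read `bandwidth` pixels of line y starting at column x."""
        if y >= self.lines_written:
            return None   # self-throttle: data not yet available
        line = self.storage[y % len(self.storage)]
        return line[x:x + self.bandwidth]

buf = LineBuffer(lines=3, width=8)
buf.write_line(range(8))
print(buf.read(2, 0))   # -> [2, 3, 4, 5]
print(buf.read(0, 1))   # -> None (line 1 not written yet)
```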
In addition to the above-described homogeneously-defined configurations, each column of processing units 104 in the compute unit 102 can be configured to operate in a different mode. One such case is when some columns need to operate in stencil processing mode and some in sequential mode. For implementing colour space conversion, for example, the configuration may have the leftmost two columns configured to perform in a sequential configuration, while the rightmost two columns are configured to perform in a two-dimensional stencil configuration.
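A configuration of this hybrid kind might be expressed, purely illustratively, as one mode word per column; the mode names below are assumptions, not part of the claimed configuration format.

```python
# Illustrative per-column mode selection for a four-column compute unit
# used for colour space conversion: leftmost two columns sequential,
# rightmost two columns in two-dimensional stencil mode.

column_modes = ["sequential", "sequential", "stencil_2d", "stencil_2d"]

def configure_compute_unit(modes):
    """Pretend configuration-memory write: one mode word per column."""
    return {col: mode for col, mode in enumerate(modes)}

print(configure_compute_unit(column_modes))
# -> {0: 'sequential', 1: 'sequential', 2: 'stencil_2d', 3: 'stencil_2d'}
```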
In all the configurations described hereinabove, compute unit 102 is operatively coupled with accumulator 112, which is operable to perform what are, in effect, post-processing actions on the data provided by processing units 104. As will be clear to one of ordinary skill in the art, the data produced by the processing units 104 may require such post-processing to render it into a form suitable for output in output queue 114. For all the above modes of operation, the accumulator 112 provides Reduce_Columnwise, Reduce_Rowwise and Reduce_all capabilities.
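For illustration, the three named accumulator capabilities might behave as in the following sketch, here realised as summations over the grid of processing-unit outputs; the choice of summation as the reduction operation is an assumption of this illustration.

```python
# Sketch of the accumulator's post-processing reductions over the grid
# of processing-unit outputs.

def reduce_columnwise(grid):
    """One result per column: sum down each column of PU outputs."""
    return [sum(row[c] for row in grid) for c in range(len(grid[0]))]

def reduce_rowwise(grid):
    """One result per row: sum along each row of PU outputs."""
    return [sum(row) for row in grid]

def reduce_all(grid):
    """A single scalar over the whole grid."""
    return sum(sum(row) for row in grid)

pu_outputs = [[1, 2, 3, 4],
              [5, 6, 7, 8]]
print(reduce_columnwise(pu_outputs))  # -> [6, 8, 10, 12]
print(reduce_rowwise(pu_outputs))     # -> [10, 26]
print(reduce_all(pu_outputs))         # -> 36
```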
In a further implementation, there may be provided various techniques to further exploit the possibilities inherent in the data stream processor of the present technology. For example, in addition to the internal exploitation of multiple arithmetical/logical processor units inside a compute unit, there may be provided ways of configuring data stream processors at a next higher level of a hierarchy.
Image processing and other computer vision-related operations are achieved using small filter sizes most of the time, but there are outliers. For example, very large convolution filters are required in the case of Bokeh filters, which selectively blur different portions of an image. Building a monolithic compute unit to support these large filters would lead to severe underutilization when smaller kernels are mapped onto the fabric, and this underutilization is clearly undesirable. Hence, the preferable way to build a large filter size stencil processor (4×8, 8×8, and so on) is to use multiple smaller compute units, for example 4×4 processors, and to provide features for reconfiguring them to act in conjunction. Hardware features to support aggregating multiple 4×4 compute units to create a higher logical size of compute unit (4×8, 8×8, etc.) can be provided for the data stream processor according to the present technology.
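A hedged sketch of such aggregation follows: two 4×4 compute units are chained so that the last-column outputs of the left unit become the first-column inputs of the right unit, yielding a logical 4×8 unit; the toy datapath is purely illustrative and not the hardware interface.

```python
# Sketch of aggregating two 4x4 compute units into one logical 4x8 unit
# by forwarding the left unit's last-column outputs into the right unit.

class ComputeUnit4x4:
    def __init__(self, name):
        self.name = name

    def process(self, row_inputs):
        """Toy datapath: four columns, each adding 1 to the value."""
        return [value + 4 for value in row_inputs]   # 4 columns x (+1)

def logical_4x8(row_inputs):
    left, right = ComputeUnit4x4("left"), ComputeUnit4x4("right")
    forwarded = left.process(row_inputs)   # chained via external linkage
    return right.process(forwarded)        # eight columns in total

print(logical_4x8([0, 10, 20, 30]))  # -> [8, 18, 28, 38]
```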
The hardware features to support chaining or clustering of compute units according to the present technology include:
In
The heavy dotted lines LINKAGES 518 in
Turning now to
Using the additional linkages, the four last-column outputs of the leftmost compute units are passed to the respective leftmost processing units of the respective rightmost compute units. The upper rightmost accumulator unit 604 receives input also from its left neighbour accumulator unit 602, as well as from its own columns of processing units, and passes its results to the rightmost lower accumulator unit 608. The lower rightmost accumulator unit 608 receives input also from its left neighbour accumulator unit 606, as well as from its own columns of processing units, and also receives results, as described, from accumulator unit 604. The lower rightmost accumulator unit 608 completes the processing (data reductions and the like) of the processed data from the cluster of compute units and provides its output to output queue 114.
In this way, by providing a structure in which sets of processing units designed to perform a limited set of primitive operators can be reconfigured in various sequential and parallel structures to perform operations on visual or image data, the present technology advantageously exploits the performance and efficiency characteristics of GPU architecture. As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, the present technique may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Where the word “component” is used, it will be understood by one of ordinary skill in the art to refer to any portion of any of the above embodiments.
The present technology may be incorporated into a pipeline arrangement (typically implemented in a GPU) that is operable to perform both visual processing and machine learning neural network processing. For example, there may be provided a stack structure 800 as shown in
Stack structure 800 may comprise software, firmware and hardware elements including user applications 802 that may incorporate program operators from a vision operator set 804—instructions based on primitives specifically tailored for performing operations on visual data—and operators from a machine learning operator set 806—instructions based on primitives specifically tailored for performing operations on machine learning data, typically tensor data. The user application 802 is processed at least in part by the graph compiler 808, which is adapted to compile both vision operators from 804 and machine learning operators from 806 into a unified program processing graph. Graph compiler 808 is arranged in at least intermittent electronic communication with graphics processing unit 810 to provide compiled graph data to control and graph scheduling component 812, which controls and schedules the activities of visual processing engine 815 and machine learning (ML) neural network engine 813. Visual processing engine 815 and machine learning (ML) neural network engine 813 are operable to make use of shared memory 814 (which may comprise on-chip SRAM memory resources) for local memory operations, and to provide data as required via DMA component 816 to system memory 818.
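By way of illustration only, the unified compilation step might be sketched as follows; the operator names and the dictionary-based graph representation are assumptions, not the interface of the graph compiler 808.

```python
# Toy sketch of unified compilation: vision operators and ML operators
# from a user application are compiled into one processing graph, so a
# single scheduler can drive both engines.

VISION_OPS = {"resize_bilinear", "convolve_2d", "colour_convert"}
ML_OPS = {"conv_layer", "relu", "softmax"}

def compile_graph(pipeline):
    """Assign each operator to an engine and return a schedulable graph."""
    graph = []
    for step, op in enumerate(pipeline):
        if op in VISION_OPS:
            engine = "visual_processing_engine"
        elif op in ML_OPS:
            engine = "ml_neural_network_engine"
        else:
            raise ValueError(f"unknown operator: {op}")
        graph.append({"step": step, "op": op, "engine": engine})
    return graph

# Image preparation followed by inferencing, as one unified graph.
for node in compile_graph(["colour_convert", "resize_bilinear",
                           "conv_layer", "relu", "softmax"]):
    print(node)
```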
There is thus provided in this embodiment a single centralised point of control in the control and graph scheduling component 812, which fetches the command stream for the visual processing engine 815 and the ML neural network engine 813 and controls overall processing and data-flow for the compute stages, as defined by the output of the graph compiler.
The present technology thus provides a graph-based programming (software) model for both the ML and non-ML parts of the vision pipeline, thanks to the Vision Processor Graph Compiler incorporating graph-based vision pipeline abstractions that leverage a specifically-designed visual processing instruction set architecture and a specifically-designed machine learning tensor-based instruction set intermediate representation.
In this way, the present technology may achieve improved energy efficiency by way of end-to-end visual and machine-learning pipeline scheduling optimised for keeping data on-chip and maximising utilisation of available hardware resources. This efficiency may combine with improved performance by also avoiding Remote Procedure Calls (RPC) between the host CPU and the visual processing engine. The present technology may further benefit from a reduction in chip area due to increased sharing of hardware resources in the form of common control, SRAM and DMA resources.
Reference throughout this document to "one embodiment," "certain embodiments," "an embodiment," "implementation(s)," "aspect(s)," or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.
The term “or,” as used herein, is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
As used herein, the term “configured to,” when applied to an element, means that the element may be designed or constructed to perform a designated function, or has the required structure to enable it to be reconfigured or adapted to perform that function. Numerous details have been set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The disclosure is not to be considered as limited to the scope of the embodiments described herein.
Those skilled in the art will recognize that the present disclosure has been described by means of examples. The present disclosure could be implemented using hardware component equivalents such as special purpose hardware and/or dedicated processors which are equivalents to the present disclosure as described and claimed.
The present technology further provides processor control code to implement the above-described systems and methods, for example on a general purpose computer system or on a digital signal processor (DSP). Furthermore, the present technique may take the form of a computer program product tangibly embodied in a non-transitory computer readable medium having computer readable program code embodied thereon. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object-oriented programming languages and conventional procedural programming languages. For example, program code for carrying out operations of the present techniques may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language).
The program code may execute entirely on the user's computer, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction-set to high-level compiled or interpreted language constructs.
It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the method, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored using fixed carrier media.
In one alternative, an embodiment of the present techniques may be realized in the form of a computer implemented method of deploying a service comprising steps of deploying computer program code operable to, when deployed into a computer infrastructure or network and executed thereon, cause the computer system or network to perform all the steps of the method. In a further alternative, an embodiment of the present technique may be realized in the form of a data carrier having functional data thereon, the functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable the computer system to perform all the steps of the method.
It will be clear to one skilled in the art that many improvements and modifications can be made to the foregoing exemplary embodiments without departing from the scope of the present disclosure.
Priority application: 2311434.1, filed Jul 2023, GB (national).