This invention relates to the field of data processing systems. More particularly, this invention relates to the identification of data hazards due to data dependency during parallel processing using scoreboard techniques.
It is known within the field of microprocessors to provide a scoreboard used in association with a sequence of operations on resources such as a register bank. This helps to prevent data hazards, such as a read occurring before a write upon which it depends.
It is known to split a video decoder into pipelined stages running on separate processing units to provide a degree of parallel processing. The management of data dependencies can be achieved by using a sequence of simple data queues between the stages such that the processing in one stage is not commenced until the necessary processing in the preceding stage has been completed. Whilst this approach is suitable for avoiding data hazards, it has the disadvantage that each pipelined stage is performing a different operation, such as unpacking, initial decoding, deblocking etc, and it does not allow parallel processing to bear upon an individual processing operation.
An example of a pipelined approach to parallel video decoding is described in the paper “H.264 Baseline Video Implementation on the CT3400 Multiprocessor DSP” by Z Lance Wang of Cradle Technologies.
It is also known to split a video image to be decoded into multiple regions with an individual processor then serving to decode each individual region. In order for this type of processing to be efficiently achieved it is necessary for the data stream to match the type of decoding to be performed, such as containing regions that are independently decodable, e.g. slices as used in video decoding. Often there is no such control over the data stream to be decoded.
It is also known to provide a high level parallel coordination language called LINDA that uses a logical associative memory called “tuplespace” which can store tuples, such as (state, x, y). However, it is inefficient to store (x, y) values with each state data item and it is also inefficient to have to search all these tuples to identify whether any indicates a state which would represent a data hazard for a data processing operation to be performed.
Viewed from one aspect the present invention provides a method of processing data, said method comprising the steps of
performing a plurality of parallel processing operations upon an N-dimensional array of data elements, where N is an integer greater than one;
storing within a scoreboard memory status data indicative of a status of respective data elements within said N-dimensional array of data elements, a location of a data element within said N-dimensional array of data elements being indicative of a storage location within said scoreboard memory of status data corresponding to said data element; and
checking for a data hazard, in respect of processing to be performed upon a given data element within said N-dimensional array of data elements arising from a plurality of other data elements within said N-dimensional array of data elements having respective positions within said N-dimensional array of data elements relative to said given data element and upon which processing for said given data element is dependent, by reading status data for said plurality of other data elements within said N-dimensional array of data elements from said scoreboard memory.
The present technique recognizes that within the context of parallel processing performed upon an N-dimensional array of data elements, it is efficient and advantageous to use a scoreboard memory storing status data for the data elements where the location of the status data for a given data element is indicated by the location of that data element within the N-dimensional array of data elements such that separate location data for the status data need not be stored. Furthermore, the data hazard checking using status data of other data elements can be achieved by knowing their relative position to the given data element to be processed allowing the provision of efficient coding and operation, which is important in achieving high performance. Thus, a memory efficient scoreboarding technique is achieved which is also capable of high performance implementation by deriving the location of the status data within a scoreboard from the location of a data element for which the status data of other data elements is being checked.
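By way of illustration only, the derivation of a status data location from a data element location can be sketched as follows for a two-dimensional array. The row-major layout, the function name `scoreboard_index` and the array size are assumptions made for this sketch and are not taken from the described embodiments.

```python
def scoreboard_index(x, y, width):
    """Return the flat scoreboard offset of the status entry for element (x, y).

    Assumes a row-major layout; the position itself is never stored.
    """
    return y * width + x


WIDTH, HEIGHT = 4, 3                    # hypothetical array size
scoreboard = bytearray(WIDTH * HEIGHT)  # one status byte per element, initially 0

# Mark element (2, 1) as processed. No (x, y) tuple accompanies the status.
scoreboard[scoreboard_index(2, 1, WIDTH)] = 1
```

Because the offset is computed from the position, checking a neighbour at a known relative position requires only an addition to the coordinates and a single indexed read, rather than a search.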
The processing may be performed by multithreading on one or more processors, but is particularly suited to systems having a plurality of processors operating in parallel.
The hazard checking could be performed by one or more of these processors themselves, or alternatively by a separate hazard checking processor. This is particularly useful when the parallel processing is being performed by special purpose data engines.
The position data may optionally include some absolute position specifying data as well as being inferred from relative positions of the data elements.
It will be appreciated that the N-dimensional arrays of data elements could be two-dimensional, three-dimensional or some higher order of dimension. However, many real examples of use of the current technique will be in the processing of two-dimensional arrays of data, such as pixel data, which could be, for example, macroblocks of video data or macroblocks of image data.
The status data and data elements could be stored separately or together in some merged form of array.
The scoreboard memory could store the status data in a variety of different ways. One direct way of storing the data is to use a corresponding N-dimensional array of status data. Thus, an individual data element within the N-dimensional array of data elements will map to an individual status data item within the N-dimensional array of status data.
The status data could be a simple binary flag having two possible states, such as processed or not processed. However, in other embodiments, the status data could take three or more different values indicative, for example, of various levels or stages of processing.
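As a minimal sketch of status data taking three or more values, the following uses an ordered enumeration held in a two-dimensional array corresponding to the array of data elements. The stage names `NOT_STARTED`, `DECODED` and `DEBLOCKED` and the helper names are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical ordered processing stages for one data element."""
    NOT_STARTED = 0
    DECODED = 1
    DEBLOCKED = 2


def make_scoreboard(width, height):
    """Create a 2-D status array matching a width x height element array."""
    return [[Status.NOT_STARTED for _ in range(width)] for _ in range(height)]


def has_reached(scoreboard, x, y, level):
    """Return True if element (x, y) has reached at least the given stage."""
    return scoreboard[y][x] >= level


scoreboard = make_scoreboard(3, 2)
scoreboard[0][1] = Status.DECODED
```

With ordered values, a single comparison indicates whether an element has reached the stage required by a dependent operation.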
The scoreboard memory may also store the status data as a plurality of N-dimensional arrays of status data representing different aspects of the status of a given data element within the N-dimensional array of data elements.
It will be appreciated that the processing of the N-dimensional array of data elements as parallel operations (parallel threads) could be achieved in a variety of different ways depending upon the particular algorithm being used, but a common type of parallel processing that is well suited to the present technique is one in which each processor of the plurality of processors performs processing operations upon a sequence of data elements extending along a processing track, such as one dimension within the N-dimensional array of data elements, with the position in the other dimensions being common between those data elements.
Thus, an individual processor will process a line (row) of data elements in a sequence and then move onto another such line (either adjacent or at some regular spacing therefrom) until the entire processing required upon the N-dimensional data processing array has been performed. The processing workload is thus split in parallel between the different processors, which may all be performing a common processing operation (e.g. all deblocking video data) whilst the data hazards due to data dependencies are managed with reference to the scoreboard memory using its efficient data storage and access mechanisms.
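One possible way of splitting rows between processors at a regular spacing is sketched below. The round-robin assignment and the function name `rows_for_processor` are assumptions for illustration; other assignments of rows to processors are equally possible.

```python
def rows_for_processor(proc_id, num_procs, num_rows):
    """Return the rows handled by one processor under round-robin assignment.

    Processor proc_id takes rows proc_id, proc_id + num_procs, and so on.
    """
    return list(range(proc_id, num_rows, num_procs))


# Hypothetical example: 4 processors sharing 10 rows of macroblocks.
assignment = {p: rows_for_processor(p, 4, 10) for p in range(4)}
```

Each row is assigned to exactly one processor, so each status data entry for that row has a single writer.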
The relationships in position within the N-dimensional array of data elements corresponding to the data hazard dependencies can take a wide variety of different forms, but in many practical uses of the present technique the data dependencies are upon neighbouring data elements in respective dimensions within the array, as these are most likely to influence a given data element in real-life situations.
It will be appreciated that a further refinement in respect of the scoreboard memory is that the scoreboard memory may store only an active window upon the status data, such that status data is not stored for a region if, for that region, either all of the processing has been performed or none of the processing has yet been performed. This is a common situation and this windowing technique advantageously reduces the amount of memory required for the scoreboard.
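The windowing refinement might be sketched as follows for a two-dimensional array processed from top to bottom. The class name `WindowedScoreboard`, the fixed window height and the sliding rule are assumptions made for this sketch, which also assumes a width of at least one element.

```python
class WindowedScoreboard:
    """Stores status flags only for an active window of rows.

    Rows above the window are implied to be fully processed and rows below
    it are implied to be untouched, so neither region needs storage.
    """

    def __init__(self, width, window_rows):
        self.width = width
        self.window_rows = window_rows
        self.first_row = 0  # index of the first row held in the window
        self.flags = [[False] * width for _ in range(window_rows)]

    def is_done(self, x, y):
        if y < self.first_row:
            return True   # above the window: all processing performed
        if y >= self.first_row + self.window_rows:
            return False  # below the window: no processing performed
        return self.flags[y - self.first_row][x]

    def mark_done(self, x, y):
        if not (self.first_row <= y < self.first_row + self.window_rows):
            raise IndexError("row outside the active window")
        self.flags[y - self.first_row][x] = True
        # Slide the window past any leading rows that are now complete.
        while all(self.flags[0]):
            self.flags.pop(0)
            self.flags.append([False] * self.width)
            self.first_row += 1


sb = WindowedScoreboard(width=2, window_rows=2)
sb.mark_done(0, 0)
sb.mark_done(1, 0)  # row 0 is complete, so the window slides down one row
```

In this sketch the memory used depends on the window height rather than on the height of the full array.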
Viewed from another aspect the present invention provides an apparatus for processing data to perform a plurality of parallel processing operations upon an N-dimensional array of data elements, where N is an integer greater than one, said apparatus comprising:
a scoreboard memory storing status data indicative of a status of respective data elements within said N-dimensional array of data elements, a location of a data element within said N-dimensional array of data elements being indicative of a storage location within said scoreboard memory of status data corresponding to said data element; wherein
at least one of said plurality of processors is arranged to check for a data hazard, in respect of processing to be performed upon a given data element within said N-dimensional array of data elements arising from a plurality of other data elements within said N-dimensional array of data elements having respective positions within said N-dimensional array of data elements relative to said given data element and upon which processing for said given data element is dependent, by reading status data for said plurality of other data elements within said N-dimensional array of data elements from said scoreboard memory.
Viewed from a further aspect the present invention provides an apparatus for processing data to perform a plurality of parallel processing operations upon an N-dimensional array of data elements, where N is an integer greater than one, said apparatus comprising:
scoreboard memory means for storing status data indicative of a status of respective data elements within said N-dimensional array of data elements, a location of a data element within said N-dimensional array of data elements being indicative of a storage location of status data corresponding to said data element within said scoreboard memory; wherein
at least one of said plurality of processor means is arranged to check for a data hazard, in respect of processing to be performed upon a given data element within said N-dimensional array of data elements arising from a plurality of other data elements within said N-dimensional array of data elements having respective positions within said N-dimensional array of data elements relative to said given data element and upon which processing for said given data element is dependent, by reading status data for said plurality of other data elements within said N-dimensional array of data elements from said scoreboard memory means.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:
As schematically illustrated in
The processing described above could also be performed by multi-threading on one or more processors. A further example embodiment would use a plurality of data engines each responsible for one processing operation and a separate hazard checking processor for reading the status data and controlling the data engines.
Such a deblocking function is one example of a common processing operation which it is desired to share between the multiple processors 4, 6, 8, 10 so that overall processing is achieved more rapidly. As illustrated, an individual processor 4, 6, 8, 10 is attempting to deblock the macroblock X. In accordance with the MPEG 4 Part 10 data compression standard, macroblock X has a data dependency upon four neighbouring macroblocks with respect to its deblocking. These four neighbouring macroblocks are marked with an “s” in
Also illustrated in
The active area of the scoreboard includes rows of status data values respectively indicating whether an individual corresponding macroblock within the array of data elements either has or has not yet been processed. This status data can then be accessed when checking for a data dependency hazard before commencing deblocking of an individual macroblock by an individual processor.
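A check of this kind, made before deblocking a macroblock, can be sketched as below. The four relative positions used (left, above-left, above and above-right) are an assumption made for this sketch, since the figure marking the four neighbouring macroblocks is not reproduced here. Treating a neighbour outside the array as presenting no hazard is likewise an assumption, as is the name `hazard_free`.

```python
# Assumed relative positions (dx, dy) of the four neighbouring macroblocks:
# left, above-left, above and above-right. Illustrative choice only.
NEIGHBOUR_OFFSETS = [(-1, 0), (-1, -1), (0, -1), (1, -1)]


def hazard_free(done, x, y):
    """Return True if every in-array neighbour of (x, y) is marked processed.

    `done` is a 2-D list of booleans acting as the scoreboard.
    """
    height, width = len(done), len(done[0])
    for dx, dy in NEIGHBOUR_OFFSETS:
        nx, ny = x + dx, y + dy
        inside = 0 <= nx < width and 0 <= ny < height
        if inside and not done[ny][nx]:
            return False  # a dependency has not yet been processed
    return True


done = [
    [True, True, True, False],
    [True, False, False, False],
]
```

Here macroblock (1, 1) may be processed because all four of its assumed neighbours are marked as processed, whereas macroblock (2, 1) may not, because its left neighbour (1, 1) is still outstanding.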
At step 20 a check is made as to whether a given data element at position P̃ is ready to be processed. In a system in which multiple processing steps are performed and data dependencies may exist therebetween, it is first necessary to check that a given data element has reached the required level of processing in itself to commence the next level of processing.
At step 22 the first data element with a given relative position to the data element P to be processed is selected for checking. At step 24 the status data for the selected relative position is read. At step 26 a determination is made as to whether or not the status data read indicates that the data hazard concerned is or is not present, i.e. is it OK to proceed with processing. If the status data at the relative position concerned indicates that it is not appropriate to proceed, then processing returns to step 24 where the status data is read again until the status data does indicate that processing can proceed.
If the determination at step 26 was that processing could proceed, then step 28 determines whether there are more relative positions to check for the given data element. If there are such further positions, then the next of these is selected at step 30 prior to returning processing to step 24. The plurality of relative positions to be checked can take a wide variety of different forms including relative positions in spatial dimensions, temporal dimensions, colour space or some other dimension of the data to be processed.
If the determination at step 28 was that there were no more relative positions to check, then processing proceeds to step 32 at which the given data element at position P̃ is subject to the processing concerned knowing that the data hazards are not present. The scoreboard for the given data element is then marked to indicate that processing of that data element has completed that particular stage. It should be noted that an advantageous aspect of this technique is that only a single processor or thread needs to update, and is able to update, the status data for a given data element. This helps simplify the control since the issue of multiple processors or threads competing to update the same status data can be avoided.
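The flow of steps 22 to 32 can be sketched with threads standing in for the parallel processors. This is a simplified illustration under stated assumptions: the relative positions in `OFFSETS`, the round-robin assignment of rows to threads, the use of polling with `time.sleep(0)` and the function name `run` are all choices made for the sketch, and the "processing" is merely the recording of the element's position. The readiness check of step 20 is omitted because only a single processing stage is modelled.

```python
import threading
import time

# Assumed relative positions (dx, dy) upon which each element depends.
OFFSETS = [(-1, 0), (-1, -1), (0, -1), (1, -1)]


def run(width, height, num_threads):
    """Process every element once, honouring the assumed dependencies."""
    done = [[False] * width for _ in range(height)]  # the scoreboard
    order = []                                       # order of processing

    def wait_for_dependencies(x, y):
        for dx, dy in OFFSETS:              # steps 22, 28 and 30
            nx, ny = x + dx, y + dy
            if not (0 <= nx < width and 0 <= ny < height):
                continue                    # no such neighbour, no hazard
            while not done[ny][nx]:         # steps 24 and 26: re-read status
                time.sleep(0)               # yield to the other threads

    def worker(thread_id):
        # Each thread takes every num_threads-th row, left to right.
        for y in range(thread_id, height, num_threads):
            for x in range(width):
                wait_for_dependencies(x, y)
                order.append((x, y))        # step 32: stand-in for processing
                done[y][x] = True           # only this thread writes this entry

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order, done
```

A call such as `order, done = run(6, 5, 3)` processes a hypothetical 6 by 5 array with three threads. Since each scoreboard entry is written only by the thread that owns the corresponding row, no lock is taken when marking an element as complete in this sketch.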
If the determination at step 36 is that the processing of macroblock (1, −1) is complete, then step 38 processes the macroblock (0, 0). At step 40 the status data in respect of macroblock (0, 0) is marked as complete.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB2006/002555 | 7/11/2006 | WO | 00 | 12/15/2008 |