The invention relates in general to microprocessors and in particular to a processor architecture having an instruction to evaluate and analyze the monotonicity of a series of input values.
Signal smoothness and scale are fundamental qualities in signal processing and allow for analysis and interpretation of digital signals. This also applies to two-dimensional signals such as images. In digital signal processing, e.g., digital image and video processing, such qualities are used to analyze and to improve the quality of the images. In the publication “locally monotonic models for image and video processing,” Acton et al. introduce definitions for locally monotonic images and present algorithms which compute locally monotonic versions of images.
Local monotonicity provides a useful criterion for image smoothing, image scaling and image denoising. Acton et al. provide definitions for the property of local monotonicity for images or video. A one-dimensional signal is called locally monotonic of degree d (LOMO-d) if every interval of length d is monotonic. An image, in turn, is called locally monotonic in a weak sense if every point is LOMO-d in at least one direction, and in a strong sense if every one-dimensional path in the image is LOMO-d.
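By way of illustration only, the one-dimensional LOMO-d definition above can be sketched in Python as follows. The function names `is_monotonic` and `is_lomo` are hypothetical, "monotonic" is taken here to mean non-decreasing or non-increasing, and the handling of signals shorter than d is an assumption of this sketch rather than something the cited publication is known to prescribe.

```python
from typing import Sequence


def is_monotonic(window: Sequence[float]) -> bool:
    """True if the window is non-decreasing or non-increasing."""
    pairs = list(zip(window, window[1:]))
    return all(x <= y for x, y in pairs) or all(x >= y for x, y in pairs)


def is_lomo(signal: Sequence[float], d: int) -> bool:
    """True if every run of d consecutive samples is monotonic."""
    if d < 2:
        raise ValueError("degree d must be at least 2")
    if len(signal) < d:
        # Assumption: a signal shorter than d is checked as a single window.
        return is_monotonic(signal)
    return all(is_monotonic(signal[i:i + d])
               for i in range(len(signal) - d + 1))
```

For the samples 1, 2, 3, 3, 2, 1 the check succeeds for d=3 but fails for d=4, because the window 2, 3, 3, 2 rises and then falls.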
Sophisticated image and video algorithms exploit monotonicity. However, the conventional approach to calculating the monotonicity of a series of pixels requires substantial additional computation, as different cases of monotonicity exist and each case of monotonicity is described by a complex equation. Moreover, this property has to be calculated for each pixel or for a group of pixels within an image in selected directions. Hence, it is necessary to provide a mechanism and an apparatus that allow an efficient evaluation of the monotonicity of a group of pixels.
ALU is an arithmetic logic unit portion of a processor.
Array refers to an arrangement of elements in one or more dimensions. An array can include an ordered set of data items (array elements) which in computer programming languages like Fortran are identified by a single name. In other languages such a name of an ordered set of data items refers to an ordered collection or set of data elements, all of which have identical attributes. A program array has dimensions specified generally by a number or dimension attribute. The declarator of the array may also specify the size of each dimension of the array in some languages. In some languages, an array is an arrangement of elements in a table. In a hardware sense, an array is a collection of structures (functional elements) which are generally identical in a parallel architecture. Array elements in data parallel computing are elements which can each execute independently and in parallel any operations required. Generally, arrays may be thought of as grids of processing elements (PEs). However, data can be indexed or assigned to an arbitrary location in an array.
An array processor uses several processing elements to exploit parallelism. There are two principal types of array processors: multiple instruction multiple data (MIMD) and single instruction multiple data (SIMD). An exemplary embodiment of a processor described herein has other characteristics.
A functional unit is an entity of hardware, software, or both capable of accomplishing a purpose.
GB refers to a billion bytes. GB/s would be a billion bytes per second.
Image processing is defined herein as any kind of information processing for which both an input and output are images. The images are two-dimensional.
MIMD is used to refer to an array processor architecture wherein each processing element in the array has its own instruction stream, thus giving a multiple instruction stream, to execute multiple data streams located one per processing element (PE).
Module is a program unit that is discrete and identifiable or a functional unit of hardware designed for use with other components. Also, a collection of PEs contained in a single electronic chip is called a module.
PE is a processing element. A PE has its own set of registers along with some means for it to receive unique data (such as a data value for a particular pixel in an image) and to execute instructions on these data.
SIMD is a single instruction multiple data array processor architecture wherein all processors in the array are commanded from a single instruction stream to execute multiple data streams located one per processing element.
SISD is an acronym for Single Instruction Single Data.
Video processing is defined herein as a special kind of image processing in which a series of at least two input images is necessary for the calculation of a single output image. A typical application is deinterlacing, which calculates interleaving lines from a series of consecutive images. Video processing is often termed three-dimensional, with the sequence of images forming the third dimension.
VLIW is an acronym for very long instruction word.
A method and processor to evaluate monotonicity of a set of input values are disclosed. The monotonicity of a set of values is defined by a series of monotonicity conditions, where each monotonicity condition identifies a case of monotonicity. Each case of monotonicity can be assigned a monotonicity value. A threshold value can be freely configured to allow an uncertainty of nearly equal values, which is of high importance in digital signal processing.
The processor architecture itself achieves high processing power by means of an arbitrary number of identical or highly similar parallel processing elements. Each processing element allows instruction dependent data paths and makes use of ALU factories which consist of a number of separate arithmetical and logical units (ALUs) arranged in a special kind of matrix. The processor allows parallel evaluation and analysis of the monotonicity of a multitude of sets of values within a single clock cycle.
In an exemplary embodiment, the present invention is a processor architecture used in digital signal processing to efficiently analyze monotonicity of a set of N input values. The processor architecture includes a means for comparing the set of N input values and generating N comparison signals, where each of the N comparison signals indicates a higher value of two different input values from the set of N input values; a means for calculating N absolute differences of the two different input values; a set of N comparators coupled to the means for calculating N absolute differences and configured to determine which of the N absolute differences are greater than a reference value, where each of the set of N comparators is further configured to generate a second comparison signal indicating whether an absolute difference is greater than the reference value; a plurality of logic elements coupled to the set of N comparators and configured to check a plurality of cases of monotonicity, where each logic element of the plurality of logic elements is configured to determine a unique case of monotonicity using one of the N comparison signals and the second comparison signal and to generate a control signal, the control signal indicating whether the unique case of monotonicity of the plurality of cases of monotonicity is valid; and a selection unit coupled to the plurality of logic elements and configured to select a monotonicity output value.
In another exemplary embodiment, the present invention is a processor architecture used in digital signal processing to efficiently analyze monotonicity of a set of N input values. The processor architecture includes a comparison logic circuit configured to compare the set of N input values and generate N comparison signals, where each of the N comparison signals indicates a higher value of two different input values from the set of N input values; a calculation circuit coupled to the comparison logic circuit and configured to calculate N absolute differences of the two different input values; a set of N comparators coupled to the calculation circuit and configured to determine which of the N absolute differences are greater than a reference value, where each of the set of N comparators is further configured to generate a second comparison signal indicating whether an absolute difference is greater than the reference value; a plurality of logic elements coupled to the set of N comparators and configured to check a plurality of cases of monotonicity, where each logic element of the plurality of logic elements is configured to determine a unique case of monotonicity using one of the N comparison signals and the second comparison signal and to generate a control signal, the control signal indicating whether the unique case of monotonicity of the plurality of cases of monotonicity is valid; and a selection unit coupled to the plurality of logic elements and configured to select a monotonicity output value.
In another exemplary embodiment, the present invention is a method of determining monotonicity of a set of N input values. The method includes pairwise comparing the set of N input values to determine a higher value of two different input values from the set of N input values; calculating N absolute differences of the two different input values; determining which of the N absolute differences are greater than a given reference value; checking a plurality of cases of monotonicity, the checking performed using a set of monotonicity conditions evaluated with a result of the step of pairwise comparing and the step of determining which of the N absolute differences are greater, the checking generating control signals indicating which case of monotonicity of the plurality of cases of monotonicity is valid; and using the generated control signals to select a monotonicity output value from a set of output values, the monotonicity output value being a result of a monotonicity instruction.
The appended drawings illustrate exemplary embodiments of the invention and must not be considered as limiting its scope.
In the following description, a new method and apparatus to evaluate the monotonicity of a set of input values is disclosed. An associated processor achieves high processing power by means of an arbitrary number of identical or highly similar parallel processing elements. Each processing element allows instruction dependent data paths and makes use of ALU factories which consist of a number of separate arithmetical and logical units (ALUs) which are arranged in a special kind of matrix. The processor allows parallel evaluation and analysis of the monotonicity of a multitude of sets of values. A threshold value can be freely configured to allow an uncertainty of nearly equal values which is of high importance in digital signal processing.
The memory subsystem 109 receives an incoming video stream, arranges images in an appropriate format in the external data memory 111, and allows external devices (not shown) to access calculated output images. Moreover, the memory subsystem 109 connected to the processor 100 is responsible for providing the correct data for each of the plurality of slices 101 and, hence, acts as a cache for the external data memory 111. The memory subsystem 109 is important both for a scaling algorithm and for complex algorithms like de-interlacing. The memory subsystem 109 caches several lines from the current, previous, and succeeding images of the sequence of the video stream stored in the external data memory 111 and manages reading the input pixels and writing the calculated pixels back to the output memory within the external data memory 111. While one video line is processed, other video lines are loaded in parallel and the caches are switched when a subsequent line has to be processed.
Hence, the actual implementation of the memory subsystem 109 is dependent upon the algorithms used. For instance, de-interlacing algorithms need the current, previous, and succeeding images of a video stream. In contrast, simple image processing algorithms like noise reduction require only the current image. Hence, depending on the application, the memory subsystem 109 can be a complex memory management and caching system or even a simple line cache. However, an architecture of the memory subsystem 109 would be understood by a skilled artisan and is thus not within a scope of the present invention.
The main control unit 103 is a global sequencer which fetches and decodes instruction words and fills and controls the program flow and the instruction pipeline during processing even in case of interrupts, stops, loops, and jumps. The main control unit 103 synchronizes the execution and data flow within each of the plurality of slices 101 according to the program read from the program memory 107.
The plurality of slices 101 are each identical or similar to one another, and the total number of the plurality of slices 101 integrated in the core can be chosen freely according to the processing power requirements of the application. For instance, low power applications may use one or a few slices only whereas high performance solutions may include 40 slices or more. As the processor 100 is a fully scalable architecture, the total number of the plurality of slices 101 does not influence the processor behavior itself as the plurality of slices 101 operate independently from each other. However, the memory subsystem 109 mentioned above has to support the data throughput to and from all of the plurality of slices 101. Thus, the processor 100 architecture is suitable for system-on-chip (SOC) solutions even for a moderate number of slices, for example, 40 or 64 slices. The processor 100 architecture therefore enables high processing power and manufacturing of the processor 100 on a single chip. As an example, selecting the plurality of slices to be 40 results in an achievable I/O bandwidth for the processor 100 of 560 GB/s if operated at 400 MHz.
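To put the stated bandwidth figure into per-cycle terms, the following short Python calculation may be helpful. It only rearranges the numbers given above (560 GB/s, 400 MHz, 40 slices) and uses the definition of GB as a billion bytes; how the traffic is divided between reads and writes, or among individual buses, is not stated and is not assumed here. The function name `bytes_per_cycle` is hypothetical.

```python
GB = 10**9    # a billion bytes, as defined in the glossary above
MHZ = 10**6   # cycles per second in one megahertz


def bytes_per_cycle(bandwidth_gb_per_s: int, clock_mhz: int) -> float:
    """Bytes transferred per clock cycle at the given bandwidth and clock."""
    return bandwidth_gb_per_s * GB / (clock_mhz * MHZ)


total_per_cycle = bytes_per_cycle(560, 400)   # all slices together
per_slice_per_cycle = total_per_cycle / 40    # one slice out of 40
```

Under these figures the processor as a whole moves 1400 bytes per clock cycle, which corresponds to 35 bytes per slice per clock cycle.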
The internal data width of the embodiments depicted in
An ALU factory 240 forms the core of the slice 200. The ALU factory 240 is used as a black box within the slice 200 architecture and is described in detail, below. However, it is of importance to outline some key facts of the ALU factory 240 black box in order to understand the slice 200. At each clock cycle the ALU factory 240 can read data from a plurality of input registers 231 and execute a set of, e.g., mathematical, statistical or logical operations, based on these data. The ALU factory 240 comprises several operational stages. The output of some or all operational stages of the ALU factory 240 can be fed to a slice-internal data bus 260. As an example, in
The data bus 260 in the slice 200 architecture is a broad data bus that comprises the output data of the plurality of input registers 231 and the output data buses of the ALU factory 240 comprising the ALU-A registers out, ALU-B registers out, and ALU-C registers out.
The slice 200 can have a set of x address generators or slice address generation units. Hence, in addition to the global addresses generated by the global address generator 105, each slice 200 can generate and use x addresses for itself. However, the architecture and capabilities of the global address generator 105 are not of importance for the disclosure. Each of the slice address generation units computes a memory address, a slice address pointer SAP, which can be used as a read or write address for its slice to access a memory A 201, a memory B 211, and the memory subsystem 109.
In a specific exemplary embodiment, the memory A 201 and the memory B 211 may be of equal size and capabilities and are controlled in similar fashions. Both the memory A 201 and the memory B 211 are dual-ported, i.e., data can be read and written in a single clock cycle. At each clock cycle a certain number of data words, e.g., 4 data words, can be stored in each of the memory A 201 and the memory B 211, where the data words are selected from the data bus 260 by VLIW-controlled multiplexers 203, 213, respectively. The memory write addresses for the memory A 201 and the memory B 211 are selected from a set of available address pointers by VLIW-controlled multiplexers 205, 215, respectively, where the set of address pointers can comprise the slice address pointers SAPx and immediate address values contained in the VLIW. Moreover, at each clock cycle a certain number of data words, e.g., 2 data words, can be read from each of the memory A 201 and the memory B 211 and are sent to the plurality of multiplexers 233. The memory read addresses for the memory A 201 and the memory B 211 are selected from a set of available address pointers by the VLIW-controlled multiplexers 207, 217, respectively, where the set of address pointers can comprise the slice address pointers SAPx and immediate address values contained in the VLIW.
At each clock cycle the plurality of input registers 231 read values from the plurality of multiplexers 233. The plurality of multiplexers 233 are controlled by the VLIW and allow for each of the plurality of input registers 231 to select one value from the multitude of values provided by the data bus 260, the memory A 201 and the memory B 211, and the memory subsystem 109. Hence, in one clock cycle each of the plurality of input registers 231 can perform one of the following actions: hold its value, read a value from one of the other input registers, read a value from one of the outputs of the ALU factory 240, read a value from one of the memory A 201 and the memory B 211, or read a value from the memory subsystem 109.
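The per-cycle choices available to each input register can be modeled, purely as a behavioral sketch, in Python as shown below. The names `Select`, `CycleInputs`, and `clock` are hypothetical, the VLIW control field is reduced to a (source, index) pair per register, and it is assumed that all registers sample the values present before the clock edge, so that two registers can exchange contents in one cycle. None of these modeling choices is taken from the description.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple


class Select(Enum):
    HOLD = auto()              # keep the current value
    INPUT_REGISTER = auto()    # copy another input register
    ALU_OUTPUT = auto()        # read an output of the ALU factory
    MEMORY_A = auto()          # read a word delivered by memory A
    MEMORY_B = auto()          # read a word delivered by memory B
    MEMORY_SUBSYSTEM = auto()  # read a word from the memory subsystem


@dataclass
class CycleInputs:
    alu_outputs: List[int]
    memory_a: List[int]
    memory_b: List[int]
    memory_subsystem: List[int]


def clock(registers: List[int],
          controls: List[Tuple[Select, int]],
          inputs: CycleInputs) -> List[int]:
    """Return the input register contents after one clock cycle."""
    sources = {
        Select.INPUT_REGISTER: registers,  # values before the clock edge
        Select.ALU_OUTPUT: inputs.alu_outputs,
        Select.MEMORY_A: inputs.memory_a,
        Select.MEMORY_B: inputs.memory_b,
        Select.MEMORY_SUBSYSTEM: inputs.memory_subsystem,
    }
    next_registers = []
    for current, (select, index) in zip(registers, controls):
        if select is Select.HOLD:
            next_registers.append(current)
        else:
            next_registers.append(sources[select][index])
    return next_registers
```

Because the new values are collected in a separate list, a register that copies a neighbor sees the neighbor's old value, which mirrors the usual behavior of clocked registers.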
The slice 200 can provide a read address RC_Addr to the memory subsystem. The read address RC_Addr can be selected by the VLIW-controlled multiplexer 227 from the set of addresses given by the slice address pointers SAPx and the immediate address values IMM contained in the VLIW. With reference again to
The slice 200 can provide a write address WC_Addr to the memory subsystem 109. The write address WC_Addr can be selected by the VLIW-controlled multiplexer 225 from the set of addresses given by the slice address pointers SAPx and the immediate address values IMM contained in the VLIW. As shown in
Referring again to
In the ALU factory 300, the first operational stage has 4 independent ALUs 305 of type ALU-A, the second stage has 4 independent ALUs 315 of type ALU-B, and the third stage has 3 independent ALUs 325 of type ALU-C. All ALUs of type ALU-A have the instruction set IA, all ALUs of type ALU-B have the instruction set IB, and all ALUs of type ALU-C have the instruction set IC.
Each ALU within the ALU factory 300 has at least one input. In the specific exemplary embodiment of
The values computed by the ALUs in the ALU factory 300 are stored in registers. Each ALU can have its own output register. The ALU-A registers 307 store values computed by the ALUs 305 of type ALU-A. The ALU-B registers 317 store the values computed by the ALUs 315 of type ALU-B. The ALU-C registers 327 store the values computed by the ALUs 325 of type ALU-C.
In the structure shown in
One benefit of the structure of the ALU factory 300 is that several data paths exist among the ALUs. The data paths are programmable and all the data paths through the ALU factory 300 are a result of the combination of instructions used in the ALUs. As an example, one ALU of the ALUs 315 of type ALU-B could be used to accumulate the results of all ALUs 305 at each clock cycle while the other ALUs 315 of type ALU-B execute different instructions. In another example, one ALU of the ALUs 305 of type ALU-A contained in the first stage accumulates values loaded in some of the input registers 231 at each clock cycle, while a different ALU in the same stage holds and updates the number of values accumulated so far, and a third ALU in the same first stage calculates the actual mean value, i.e., the accumulated value divided by the number of values.
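The second example above, in which three ALUs of the same stage cooperate to maintain a running mean, can be sketched behaviorally in Python as follows. The function name `run_mean_pipeline` is hypothetical. The sketch assumes that each ALU writes to its own output register and that all three ALUs read the register contents of the previous clock cycle, so the mean lags the accumulator by one cycle; the description does not state this timing, and other arrangements are conceivable.

```python
from typing import Iterable, List, Optional


def run_mean_pipeline(samples: Iterable[float]) -> List[Optional[float]]:
    """Return the content of the mean register after each clock cycle."""
    acc = 0.0                     # output register of the accumulating ALU
    count = 0                     # output register of the counting ALU
    mean: Optional[float] = None  # output register of the dividing ALU
    trace: List[Optional[float]] = []
    for x in samples:
        # All three ALUs operate on the register values of the previous cycle.
        new_acc = acc + x
        new_count = count + 1
        new_mean = acc / count if count else None
        # The clock edge updates all three output registers at once.
        acc, count, mean = new_acc, new_count, new_mean
        trace.append(mean)
    return trace
```

With the samples 2, 4, 6 the mean register holds no valid value after the first cycle, then 2.0 (the mean of the first sample), then 3.0 (the mean of the first two samples).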
As mentioned above, each of the ALUs can perform different operations at a certain clock cycle. The specific exemplary embodiment of the architecture of the ALU factory 300 shown in
The instruction set of the whole ALU factory 300 as described above comprises the instruction sets of all ALU types. Each ALU type of the ALUs shown in
A monotonicity instruction according to the description given herein analyzes its input values and returns a value that determines a correlation of the input values. For example, let's consider five input values a, b, c, d, and e. The extreme situations of monotonicity of a series of monotone increasing values like a<b<c<d<e or a series of monotone decreasing values a>b>c>d>e have to be detected as well as peaks like a<b<c>d>e or a>b>c<d<e. Other cases of monotonicity might be a=b>c=d=e or similar. Depending on the number of input values an arbitrary number of monotonicity cases can be defined. The set of monotonicity cases of choice is dependent upon the application.
Monotonicity sometimes is used to determine if certain input values are higher or lower than others. In other cases monotonicity is used to determine if any combination of the input values matches a monotonicity condition such as a>b=c=d=e.
Although these examples use five input values (a, b, c, d, and e), the same cases could be covered with a monotonicity function that uses only three input values as well. For instance, to determine if a>b>c>d>e is true, one could check for both a>b>c and c>d>e. Hence, the monotonicity of a series of N values can also be determined with several calls to a function that analyzes the monotonicity of M values, where M<N. The lower M is, the more partial monotonicity analyses have to be performed and the more cycles are necessary to combine the partial monotonicity analyses. For example, if M=2 (this is a simple “greater than,” “less than,” or “equal to” operation), four partial analyses (a<b, b<c, c<d, and d<e) and four “AND” operations are necessary to combine these partial monotonicity case analyses into a whole monotonicity case of the five input values for a<b<c<d<e. As discussed above, simple comparator operations like “less than” and “greater than” are not sufficient to efficiently handle evaluation of monotonicity of a series of values. On the other hand, a monotonicity function that analyzes a high number of input values (e.g., seven or more) would result in a complex circuit. Our analyses have shown that an optimal monotonicity function that analyzes a combination of input values should have three to five input values.
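The composition of an N-value check from several M-value checks can be illustrated with the following Python sketch. The names `strictly_decreasing`, `chain_holds`, and `calls_needed` are hypothetical. The sketch uses plain strict comparisons without any tolerance and assumes that consecutive partial analyses share exactly one value, as in the a>b>c and c>d>e example above.

```python
from typing import Callable, Sequence


def strictly_decreasing(window: Sequence[int]) -> bool:
    """Partial analysis: True if every value is greater than its successor."""
    return all(x > y for x, y in zip(window, window[1:]))


def chain_holds(values: Sequence[int], m: int,
                check: Callable[[Sequence[int]], bool]) -> bool:
    """Combine partial analyses of m values each into one overall result.

    Consecutive windows share one value, so every adjacent pair is covered.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    starts = range(0, len(values) - 1, m - 1)
    return all(check(values[s:s + m]) for s in starts)


def calls_needed(n: int, m: int) -> int:
    """Number of m-input partial analyses needed to cover n values."""
    return -(-(n - 1) // (m - 1))  # ceiling of (n - 1) / (m - 1)
```

For five values, `calls_needed` yields four partial analyses for M=2, two for M=3, and one for M=5, which reflects the trade-off between the number of calls and the width of the function.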
Another criterion for a monotonicity function, or its implementation as a monotonicity instruction of a processor's ALU, is its tolerance. In signal processing, two values which are close and vary only slightly are termed “equal,” although mathematically such values differ within a certain tolerance. For example, the values a and b are termed “equal” if abs (a−b)<ref, where “abs (a−b)” denotes the absolute value of the difference and ref denotes a certain threshold. It is, therefore, necessary for digital signal processing to allow for a certain uncertainty of values when evaluating the monotonicity of values.
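The tolerance-based notion of equality can be written down directly, as in the following Python sketch. The function name `nearly_equal` is hypothetical, and the threshold value 7 is chosen only for illustration.

```python
def nearly_equal(a: int, b: int, ref: int) -> bool:
    """True if a and b are termed "equal", i.e. abs(a - b) < ref."""
    return abs(a - b) < ref


REF = 7  # illustrative threshold

ab = nearly_equal(10, 15, REF)  # difference 5
bc = nearly_equal(15, 20, REF)  # difference 5
ac = nearly_equal(10, 20, REF)  # difference 10
```

In this example a and b are "equal" and b and c are "equal", yet a and c are not. Equality within a tolerance is therefore not transitive, which is one reason why a monotonicity evaluation cannot infer the relation of a and c from the two neighboring pairs alone.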
Modules in
In
The second plurality of comparators 405b use the absolute differences to determine if two input values are within a certain tolerance ref, i.e., to determine the equality of two input values. If, for example, the difference abs (a−b) of the input values a and b is “greater than” (or “less than” in other embodiments) a given threshold value ref, the corresponding one of the second plurality of comparators 405b will signal true.
A plurality of combinatorial logic blocks 407 uses the output signals of the plurality of comparators 405 to determine the monotonicity of the input signals a, b, and c according to a case diagram shown in
Hence, the embodiment shown in
With reference to
Each of the boxes 500 graphically shows the values a, b, and c. A stripe in the middle denotes a tolerance defined by a threshold value ref. For instance, the first box 500 with the mono value 0 has the monotonicity condition a==b==c, where all three values a, b, and c are within a certain tolerance ref and, hence, are treated as equal.
The box 500 with the mono value 1 shows strongly monotonically increasing values, and the box 500 with the mono value 6 shows strongly monotonically decreasing values. The boxes 500 with the mono values 2 and 7 show monotonicity cases where a and b are within a certain tolerance and c is higher or lower, respectively. The boxes 500 with the mono values 3 and 8 show monotonicity cases where b and c are within a certain tolerance and a is lower or higher, respectively. The boxes 500 with the mono values 4 and 9 show monotonicity cases where a and c are within a certain tolerance and b is higher or lower, respectively. The boxes 500 with the mono values 5 and 10 show the remaining monotonicity cases where a and c are not within a certain tolerance and b is higher or lower, respectively.
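The eleven cases just described can be modeled in software as a lookup on three pairwise relations, as in the following Python sketch. This is a behavioral illustration written from the prose alone, not a description of the circuit. The names `relation`, `CASES`, and `mono` are hypothetical; "within tolerance" is taken to mean abs(x−y)<ref; and combinations that the prose does not cover (for example a and b equal, b and c equal, but a and c not equal) return None, which is a choice of this sketch.

```python
from typing import Dict, Optional, Tuple


def relation(x: int, y: int, ref: int) -> int:
    """0 if x and y are within tolerance, +1 if y is higher, -1 if y is lower."""
    if abs(x - y) < ref:
        return 0
    return 1 if y > x else -1


# Key: (a versus b, b versus c, a versus c); value: mono value.
CASES: Dict[Tuple[int, int, int], int] = {
    (0, 0, 0): 0,                        # a == b == c
    (1, 1, 1): 1,                        # strongly increasing
    (0, 1, 1): 2,                        # a == b, c higher
    (1, 0, 1): 3,                        # b == c, a lower
    (1, -1, 0): 4,                       # a == c, b higher
    (1, -1, 1): 5, (1, -1, -1): 5,       # a != c, b higher than both
    (-1, -1, -1): 6,                     # strongly decreasing
    (0, -1, -1): 7,                      # a == b, c lower
    (-1, 0, -1): 8,                      # b == c, a higher
    (-1, 1, 0): 9,                       # a == c, b lower
    (-1, 1, 1): 10, (-1, 1, -1): 10,     # a != c, b lower than both
}


def mono(a: int, b: int, c: int, ref: int) -> Optional[int]:
    """Return the mono value for (a, b, c), or None if no listed case applies."""
    key = (relation(a, b, ref), relation(b, c, ref), relation(a, c, ref))
    return CASES.get(key)
```

With ref=7, the inputs (0, 10, 20) give the mono value 1, the inputs (10, 30, 12) give 4, and the inputs (10, 15, 20) match none of the listed cases because only the outer pair differs by more than the tolerance.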
The mono values provided in the upper left corner 502 of the boxes 500 in
Using the ALU factory 300 shown in
In this example, a cycle is represented by a pair of braces. In a first cycle, the threshold value ref is set to 7 using a special instruction MONO.FORMAT. The instruction MONO.FORMAT configures the behaviour of all subsequent calls to the MONO instruction. The instruction MONO is a monotonicity instruction. In this example, three values of the ALU-A registers 307 are analyzed, where in the second to fourth calls to MONO two of them are always compared to a constant value 20. The subsequent handling of the results of the monotonicity instruction in algorithms is not demonstrated in the example above as it is not of relevance.
The above example performs four checks for monotonicity, and subsequent stages of the ALU factory can, for example, use the results of the monotonicity function to check whether the provided input values match the defined quality criteria given by a mono value and defined by a threshold ref.
By assigning different values 502 to the monotonicity cases which are illustrated by the boxes 500 in
An exemplary embodiment for a call of a monotonicity processor instruction is:
ACCUy=(operand1, operand2, operand3)
An exemplary embodiment for a call of a processor instruction to configure the threshold value is:
MONO.FORMAT (threshold)
Another embodiment for a call of a monotonicity processor instruction with an immediate threshold value can be:
ACCUy=(threshold, operand1, operand2, operand3)
An exemplary embodiment for a processor instruction that allows configuration of the monotonicity return values (values 502 in the table shown in
MONO.TABLE=(CaseIndex,ReturnValue)
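The effect of a configurable return value table can be illustrated with the following Python sketch. It is a software model only; the names `default_table`, `mono_table`, and `select_output` are hypothetical. The sketch assumes that the case index runs from 0 to 10 in the order of the mono values enumerated earlier and that each case initially returns its own index; neither assumption is stated in the description.

```python
from typing import List

NUM_CASES = 11  # cases 0 to 10, following the enumeration of mono values


def default_table() -> List[int]:
    """Assumed initial state: each case returns its own index."""
    return list(range(NUM_CASES))


def mono_table(table: List[int], case_index: int, return_value: int) -> None:
    """Model of MONO.TABLE=(CaseIndex,ReturnValue): overwrite one entry."""
    if not 0 <= case_index < NUM_CASES:
        raise IndexError("case index out of range")
    table[case_index] = return_value


def select_output(table: List[int], case_index: int) -> int:
    """Model of the selection step: map the valid case to its return value."""
    return table[case_index]


# Example configuration: return 1 for the strongly increasing case (1) and
# the strongly decreasing case (6), and 0 for every other case.
table = default_table()
for case in range(NUM_CASES):
    mono_table(table, case, 1 if case in (1, 6) else 0)
```

With such a configuration the result of the monotonicity evaluation can be used directly as a flag, without a separate comparison of the returned value in a later stage.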
One advantage of the present method and apparatus is that the monotonicity of a series of values can be evaluated in a single clock cycle. Moreover, the method and apparatus according to the description given herein enables one to set and even to adjust a tolerance value ref which allows an uncertainty in the monotonicity equations. Configurable monotonicity case tables (see
In the foregoing specification, the present invention has been described with reference to specific embodiments thereof. It will, however, be evident to a skilled artisan that various modifications and changes can be made thereto without departing from the broader spirit and scope of the present invention as set forth in the appended claims. For example, particular embodiments describe a number of registers, ALUs, and multiplexers per stage. A skilled artisan will recognize that these numbers are flexible and the quantities shown herein are for exemplary purposes only. Additionally, a skilled artisan will recognize that various numbers of stages may be employed for various array sizes and applications. These and various other embodiments are all within a scope of the present invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
This application claims priority from U.S. Provisional Patent Application Ser. No. 60/867,406 entitled “Method and Apparatus to Efficiently Evaluate Monotonicity,” filed Nov. 28, 2006 and which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
60867406 | Nov 2006 | US