The exemplary and non-limiting embodiments of this invention relate generally to wireless communication systems, methods, devices and computer programs and, more specifically, relate to parallel computation methods and apparatus for implementing same, which are seen to be particularly advantageous for computations in the wireless communications arts.
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section. Whereas both associative computing and distributed arithmetics are summarized in this background description, they are described as independent computational techniques and to the inventors' knowledge it is not known in the art to combine them.
The relevant field of these teachings is massively parallel computation methods. For example, systems supporting a single modem radio standard typically include hardware (HW) accelerators for implementing these types of operations. However, a software defined radio (SDR) system implies support for a large set of radio standards implemented on a shared, flexible, programmable platform. Taking into account the demand for very high computational power, only highly parallel processors are feasible. Fortunately, most of the computationally demanding algorithms in radio standards are potentially parallelizable at a very high level. For example, the digital video broadcast for handheld devices (DVB-H) standard requires implementation of an N-point fast Fourier transform (FFT) of size N=1K, N=2K or N=8K (where K=1024). Implementation of an N-point FFT could be parallelized in a traditional single-instruction stream/multiple-data stream (SIMD) fashion wherein N/2 butterfly operations are implemented in parallel. Each butterfly is, in fact, a product of a 2×2 complex matrix with a 2×1 vector; that is, each butterfly represents four inner products. Therefore, an N-point FFT could potentially be parallelized at a level where 2N inner products are computed in parallel. Unfortunately, existing SIMD processors offer parallelism supporting implementation of at most 32 inner products in parallel, and a much higher level of parallelism from traditional SIMD processors is not seen to be likely in the near future.
Another, even more important set of algorithms involved in all radio standards is finite impulse response (FIR) filters of various sizes. In such algorithms, inner products are computed between a vector of filter coefficients and a very large number of vectors formed as the contents of a window that slides across a very long input signal. The length of the vectors is typically in the range of tens to hundreds, but the length of the input signal, and therefore the number of inner products to be computed, is typically in the range of thousands or tens of thousands. For example, in the front-end of the DVB-H standard, in the 8K mode the number of samples associated with one orthogonal frequency division multiplex (OFDM) symbol is 31.5K. With a proper buffering technique, all the inner products could theoretically be computed in parallel, provided that a processor supporting such vast parallelism is available.
These are but two examples. With the development of the technology, newer applications emerge which on one side demand even higher computational power and on the other side allow even higher levels of parallelism. At the moment, the only processor architecture that appears feasible for supporting such a vast level of parallelism is associative processor array technology. However, few computational algorithms have yet been developed for such processors.
Associative computing (ASC) is a principle used in content-addressable memory (CAM) based associative processors (ASPs) for massively parallel computations. ASPs are powerful tools for implementing massively parallel data processing. Their operation is, in essence, based on a look-up table approach. In this approach, input data is first compared with all possible values that the data may potentially take. If the input data is the same as the value to which it is currently compared, the correct pre-calculated output value is written in the corresponding memory field. Further background with regard to associative computing may be seen, for example, at U.S. Pat. No. 6,195,738 (entitled C
The ASC principle is illustrated at
The ASP 100 of
In addition, an ASP 100 includes a mask register 124 and a pattern register 128. Both registers have a length equal to the total number of columns 118a/b in all CAM arrays 112a/b. Cells 126 of the mask register 124 are associated with CAM bit slices 118a′ and are used solely for enabling/masking the corresponding slices. The pattern register cells 130 are also associated with the CAM bit slices 118a′, so that each cell 130 of the pattern register 128 may be compared with the content of all the bits within the associated CAM bit slice 118a′ in parallel; likewise, each bit of the pattern register 128 can in parallel be written to all those bits of that slice 118a′ which are enabled by corresponding bits of the tags register. The content of the pattern register 128 as well as the content of the mask register 124 cannot be modified by the content of the CAM array 112a/b. They are specified by the program operating the associative processor 100.
In ASC, all arithmetic operations and expressions are implemented based on two elementary operations: “Compare” and “Write”. For both operations, the set of CAM cells 114a/b that participate in the operation is specified by the mask register 124 and by the tags register 120a/b: all and only those cells for which the associated mask bit 126 and the associated tags register bit 122a/b are both 1's will participate in the operation. During one cycle of a compare operation, each activated CAM row 116a/b (enabled by the tags register 120a/b) generates a 1 or 0 value to the bit in the associated tags register cell 122a/b depending on whether or not its content is equal to the content of the pattern register 128 in all activated bit slices 118a′ (which are enabled by the mask register 124). During one cycle of a write operation, the content of the pattern register 128 in all activated bit slices 118a′ is written in parallel into each of the activated CAM rows 116a/b. Note that each arithmetical operation, and even larger expressions, may in this way be implemented. Moreover, many of them may be implemented in parallel.
For example, in order to pairwise add N pairs of m-bit integers (N being less than or equal to the number of CAM rows), the following algorithm may be used. Assume the corresponding pairs are written in CAM memory, one pair per row, and occupy bits 0 to 2m−1. Also assume that the outputs (the pairwise sums) must be written in the same rows as the corresponding input pairs but in the bit slices 2m through 3m. One possible algorithm that pairwise adds all the N pairs in parallel could be as follows. The algorithm executes 2^{2m} steps, each step consisting of two operations. The first operation in each of the steps i, i=0, . . . , 2^{2m}−1, is the compare operation over all the CAM rows. During this operation a next possible 2m-bit input i (say [a(i)b(i)], where a(i) denotes the m bits of the first operand and b(i) denotes the m bits of the second operand) is written in the bits 0 to 2m−1 of the pattern register. This input is compared simultaneously, in one machine cycle, with bits 0 to 2m−1 of each activated associative word. As a result, the tags register bits associated with those rows that happen to contain [a(i)b(i)] will become equal to 1, and all the other tags register bits will become equal to 0. In the second operation of that step, the correct output a(i)+b(i) is written into the field designated for outputs (bits 2m through 3m) of the pattern register, and a write operation is executed in parallel for all the enabled CAM rows. As a result, the correct sum a(i)+b(i) will be written into bits 2m through 3m of those associative words for which the tags register cell was set to 1 (that is, those identified in the first operation of that step as containing the pair [a(i)b(i)] as input). After all the 2^{2m} possible inputs have been tested, each associative word will contain the correct result for the input pair written in it at the beginning. The whole computation thus occupies 2^{2m+1} machine cycles.
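The compare/write procedure just described can be simulated in a few lines of plain Python (a sketch only: the bit layout of rows follows the example above, while the data values and variable names are illustrative):

```python
# Software sketch of ASC pairwise addition: one compare and one write
# per possible 2m-bit input pattern, applied to all CAM rows at once.
m = 4                                   # operand bit-width (toy value)
mask_in = (1 << (2 * m)) - 1            # selects the 2m input bit-slices

# Each CAM row holds the pair [a, b] in bits 0..2m-1; output field empty.
pairs = [(5, 3), (2, 7), (5, 3), (15, 15)]
rows = [a | (b << m) for a, b in pairs]

for i in range(1 << (2 * m)):           # 2^(2m) steps
    # Compare: tag exactly the rows whose input field equals pattern i.
    tags = [(r & mask_in) == i for r in rows]
    # Write: put the pre-calculated sum a(i)+b(i) into bits 2m..3m
    # of every tagged row, "in parallel".
    a_i, b_i = i & ((1 << m) - 1), i >> m
    out = (a_i + b_i) << (2 * m)
    rows = [r | out if t else r for r, t in zip(rows, tags)]

sums = [r >> (2 * m) for r in rows]     # sums == [8, 9, 8, 30]
```

Each pass of the loop is one "step" (one compare over all rows followed by one parallel write), mirroring the 2^{2m+1} machine cycles counted above.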
The algorithm in the above example is only for illustrative purposes. In a sophisticated algorithm, m-bit additions could possibly be reduced to a set of smaller bit-width additions. Breaking the bit-width down to a single bit-slice leads to an algorithm where m bit-slices are added in m iterations, wherein at each iteration three 1-bit numbers are added (two inputs and one carry-in signal). This way, the number of machine cycles to implement m-bit additions may be estimated as 8m.
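The bit-serial refinement can be sketched the same way (again a plain-Python simulation; the loop structure of m iterations with three 1-bit inputs per row mirrors the description, while the operand values are illustrative):

```python
# Bit-serial addition: m iterations, each adding one bit-slice of the
# two operands plus the running carry, for all rows "in parallel".
m = 8
pairs = [(200, 55), (17, 4), (255, 1)]  # arbitrary m-bit operand pairs

carries = [0] * len(pairs)
sums = [0] * len(pairs)
for s in range(m):                      # one bit-slice per iteration
    for i, (a, b) in enumerate(pairs):  # conceptually parallel over rows
        total = ((a >> s) & 1) + ((b >> s) & 1) + carries[i]
        sums[i] |= (total & 1) << s     # write the sum bit
        carries[i] = total >> 1         # carry into the next slice
sums = [v | (c << m) for v, c in zip(sums, carries)]
# sums == [255, 21, 256]
```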
In an even more efficient implementation, this number might be further reduced. For example, according to NeoMagic Corporation of Santa Clara, Calif., USA, the number of cycles to implement 8-bit additions may be as low as 25 machine cycles per addition. It is also known from NeoMagic Corp. that 12-bit multiplications may be implemented in 200 cycles. It is noted that the number of cycles is independent of the number of pairs for which the identical operation is implemented. Thus, up to 8K or even 64K 8-bit additions (or 12-bit multiplications) may be implemented in only 25 (or 200) machine cycles. Even though every single operation is very inefficient, the theoretical possibility of implementing many of them in parallel makes the approach extremely efficient.
At least some of the advantages of the associative computing method are as follows:
These advantages are offset somewhat by at least the following drawbacks:
Distributed arithmetic (DA), which is also based on a look-up table approach, is a very efficient way to implement the inner vector product operation, though in a manner different from the ASP approach. The inner product is a basic operation in many applications, such as digital signal and image processing, communications, etc. One advantage of DA is its ability to provide accelerated computation of inner products of a vector a=[a0, . . . , aN−1] of fixed known coefficients ak, k=0, . . . , N−1, with a large number of input vectors x=[x0, . . . , xN−1]T, y=[y0, . . . , yN−1]T, z=[z0, . . . , zN−1]T, etc.
In distributed arithmetic, computation of an inner product

b = a·x = Σ_{k=0}^{N−1} a_k x_k   (1)

is reduced to the weighted sum of inner products of the vector a=[a0, . . . , aN−1] with m binary vectors, each being one bit-slice of the vector x=[x0, . . . , xN−1]T. Let the two's complement binary representation of xk, k=0, . . . , N−1, be xk=xk,m−1, . . . , xk,1, xk,0. Then

x_k = −2^{m−1} x_{k,m−1} + Σ_{s=0}^{m−2} 2^s x_{k,s}

and the inner product of equation (1) can be rewritten as

b = −2^{m−1} ( Σ_{k=0}^{N−1} a_k x_{k,m−1} ) + Σ_{s=0}^{m−2} 2^s ( Σ_{k=0}^{N−1} a_k x_{k,s} )   (2)
Each sum in braces of equation (2) is basically an inner product of the vector a with a binary vector that is one bit-slice of the vector x. For a fixed vector a there are 2^N possible values, corresponding to the 2^N binary vectors of length N, that these inner products may take. For a reasonably moderate vector length N, all of these 2^N values may be pre-calculated and stored in a look-up table. Then the inner product (2) may be calculated in m iterations of fetch-shift-accumulate operations, where at the j th iteration, j=0, . . . , m−1, the inner product

b_j = Σ_{k=0}^{N−1} a_k x_{k,j}

that corresponds to the j th bit-slice of the vector x is fetched from the look-up table, shifted by 2^j and then accumulated to the previously accumulated binary inner products.
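The fetch-shift-accumulate procedure can be sketched in plain Python (a simulation under the simplifying assumption of nonnegative inputs, so the two's-complement sign term of the most significant slice is omitted; all values are illustrative):

```python
# Table-based DA: pre-compute all 2^N binary inner products with the
# fixed vector a, then do m fetch-shift-accumulate iterations.
N, m = 4, 8
a = [3, -1, 4, 2]                       # fixed coefficient vector
x = [17, 5, 200, 9]                     # one m-bit (nonnegative) input

# Look-up table: entry t holds the inner product of a with the bits of t.
lut = [sum(a[k] for k in range(N) if (t >> k) & 1) for t in range(1 << N)]

acc = 0
for j in range(m):
    # gather the j-th bit-slice of x as a table index
    slice_j = sum(((x[k] >> j) & 1) << k for k in range(N))
    acc += lut[slice_j] << j            # fetch, shift by 2^j, accumulate

# acc == 3*17 - 1*5 + 4*200 + 2*9 == 864
```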
Some of the drawbacks of DA include:
Further background with regard to distributed arithmetic may be seen, for example, at a paper by Stanley A. White entitled A
There are many applications (such as software defined radio [SDR], image video compression/processing, 3rd generation graphics, etc) where implementation of a very large number of inner products in parallel would bring a benefit. What is needed is an efficient method for implementing such large number of inner products in parallel.
Conventional DA implementations for inner product computations are based on a look-up table approach. A traditional ASP implementation of inner products would be based on implementing multiplications. Neither approach is efficient enough, and each carries several of the drawbacks mentioned above. It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions, or multiply-accumulate operations, on traditional multipliers and adders or multiply-accumulate units. What is needed in the art is a more efficient flow of computations to perform inner product calculations, particularly in ASP and similar types of processors.
The foregoing and other problems are overcome, and other advantages are realized, by the use of the exemplary embodiments of this invention.
In accordance with a first exemplary embodiment of this invention there is a method that includes storing subvector slices x(i,r,s) of a first vector x(i) in a bit-parallel word-serial manner; for each of the stored subvector slices and in parallel on bits of said each subvector slice, executing an operation that outputs a pre-calculated inner product result of the said bits and a second vector a; and outputting a result that depends from the executed operation.
In accordance with a second exemplary embodiment of this invention there is a computer readable memory storing a program of instructions that are executable by a processor to take actions. In this embodiment the actions include storing subvector slices x(i,r,s) of a first vector x(i) in a bit-parallel word-serial manner; for each of the stored subvector slices and in parallel on bits of said each subvector slice, executing an operation that outputs a pre-calculated inner product result of the said bits and a second vector a; and outputting a result that depends from the executed operation.
In accordance with a third exemplary embodiment of this invention there is an apparatus that includes a data storage array and a processor. In the data storage array there are subvector slices x(i,r,s) of a first vector x(i) which are stored in a bit-parallel word-serial manner. The processor is configured to execute an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result of the said bits and a second vector a.
In accordance with a fourth exemplary embodiment of this invention there is an apparatus that includes storage means (such as, for example, a CAM array) and processing means (such as, for example, an associative processor). The storage means is for storing subvector slices x(i,r,s) of a first vector x(i) in a bit-parallel word-serial manner. The processing means is for executing an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result of the said bits and a second vector a.
a-e illustrate data organization and transformations for a CAM register and a tags register with respect to the process steps at
a-b are similar to respective
a shows a simplified block diagram of various electronic devices that are suitable for use in practicing the exemplary embodiments of this invention.
b shows a more particularized block diagram of a user equipment such as that shown at
One technical advantage that exemplary embodiments of the invention provide is an efficient method for implementing a very large number of inner products in parallel. Specifically, these teachings detail a new high-performance approach for massively parallel implementation of computations. Examples of where such large matrix-vector computations may be implemented include matrix-vector products, FIR filtering, convolution, and discrete orthogonal transforms, to name a few. More precisely, the approach detailed herein combines two distinct techniques for high-speed computations, associative computing and distributed arithmetic, in a manner that further increases the efficiency of both.
One particular embodiment of these teachings is an implementation of DA on ASPs, in particular for finite impulse response (FIR) filtering (e.g., flexible-size FIR filtering types of operations) and/or cross-correlation operations, which are frequently used, for example, in wireless communication algorithms. One technical advantage of these teachings is that the combined approach detailed herein overcomes the drawbacks of the two separate approaches, DA and ASC, noted above, while synergistically combining their individual advantages.
These teachings may be applied to many fields of information technology where high-speed implementation of matrix-vector operations, in particular inner product computations, is needed. An important application in which these teachings may prove particularly advantageous is digital communications, and more specifically software defined radio (SDR), where several radio standards are to be implemented on a flexible programmable platform under hard real-time constraints. Examples include implementation of the radio modems supporting these standards, in particular physical layer 1 (PHY L1) of the long term evolution (LTE, or 3.9G) of the universal mobile telecommunications system—terrestrial radio access network (UTRAN) and high speed data packet access (HSDPA). Implementations of these radio standards, such as in their related modems, require many matrix-vector operations such as fast Fourier transforms (FFT) and especially FIR filtering and cross-correlation operations of various sizes. The non-limiting examples below are in the context of flexible implementation of FIR filtering types of operations of variable sizes, or in other words, of variable-size moving window inner product operations.
Other examples where these teachings may be employed include image/video processing, pattern recognition, 3D-graphics, etc. For example, in the simplest image compression standard (JPEG) an image is split into blocks of a small size (typically 8×8) and then all the blocks are similarly processed by a series of algorithms (such as color conversion, discrete cosine transform, quantization, pre- or post-filtering), each of which is basically comprised of a set of inner product operations. All of these algorithms could be implemented over all the blocks in parallel. Even for relatively low resolution images, such as 1.3 megapixel images, a very high level of parallelism (approximately 20K blocks) could be achieved if proper processors and proper implementation techniques were developed.
Exemplary aspects of these teachings provide an approach to implement inner products on associative processor arrays. Specifically, it would be desirable to implement DA on ASPs to execute various communication algorithms such as the FIR filtering and FFTs mentioned above. Such a technique would overcome the drawbacks of the two approaches while combining their advantages.
Consider again distributed arithmetic. For the case where there is a very large number N of components in each of the input vectors x(i) that are weighted and summed, to make feasible the direct approach noted in the background above, an N-point inner product may be broken into N/n inner products, each of length n. This is equivalent to splitting the internal sum in (2) into shorter sums:

Σ_{k=0}^{N−1} a_k x_{k,j} = Σ_{r=0}^{N/n−1} ( Σ_{k=nr}^{n(r+1)−1} a_k x_{k,j} )   (3)

Then, instead of one single 2^N-word look-up table, one can use N/n look-up tables of 2^n words each, since the number of possible values that the innermost sum in the braces of equation (3) may take is 2^n. Each inner product of length n is again calculated in m iterations. However, now there are N/n inner products to calculate and to accumulate to each other.
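The length-splitting variant can be sketched as follows (again a plain-Python simulation with nonnegative inputs so the sign term can be ignored; all values are illustrative):

```python
# Length-splitting DA: N/n tables of 2^n entries replace one 2^N table.
N, n, m = 8, 4, 8
a = [1, -2, 3, 4, -1, 2, 0, 5]
x = [12, 7, 255, 0, 33, 90, 1, 64]

# One 2^n-entry table per length-n subvector of a.
luts = [[sum(a[n * r + k] for k in range(n) if (t >> k) & 1)
         for t in range(1 << n)] for r in range(N // n)]

acc = 0
for j in range(m):                      # m iterations, as before
    for r in range(N // n):             # N/n partial products per slice
        slice_j = sum(((x[n * r + k] >> j) & 1) << k for k in range(n))
        acc += luts[r][slice_j] << j

# acc equals the direct inner product: acc == 1230
```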
Consider the opposite problem, where N is too small to make the DA approach noted in the background beneficial. For this instance one can group the m bit-slices of equation (2) into m/p planes of depth p (or “p-planes”):

b = Σ_{j=0}^{m/p−1} 2^{jp} ( Σ_{k=0}^{N−1} a_k ( Σ_{q=0}^{p−1} 2^q x_{k,jp+q} ) )   (4)

Then there are m/p fetch-shift-accumulate iterations to implement instead of m fetch-shift-accumulate iterations. However, now there are 2^{Np} different values for the sum in the braces of equation (4) that need to be pre-calculated and stored.
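The depth-grouping variant admits a similar sketch (a simulation with nonnegative inputs; sizes and values are illustrative):

```python
# Depth-grouping DA: m bit-slices grouped into m/p planes of depth p,
# so only m/p iterations remain, at the price of a 2^(N*p)-entry table.
N, m, p = 3, 8, 2
a = [3, 1, -2]
x = [100, 7, 42]

# Table indexed by the N*p bits of one p-plane.
lut = []
for t in range(1 << (N * p)):
    val = 0
    for k in range(N):
        digit = (t >> (k * p)) & ((1 << p) - 1)  # p bits of component k
        val += a[k] * digit
    lut.append(val)

acc = 0
for j in range(m // p):                 # m/p fetch-shift-accumulate steps
    plane = 0
    for k in range(N):
        plane |= (((x[k] >> (j * p)) & ((1 << p) - 1))) << (k * p)
    acc += lut[plane] << (j * p)

# acc == 3*100 + 1*7 - 2*42 == 223
```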
With these generalizations for how to take the inner products, some of the advantages of the DA approach are then:
Recalling the disadvantages listed in the background for DA, certain of the exemplary embodiments of these teachings can readily overcome those drawbacks when DA is implemented in associative processors.
As an initial matter, first combine equations (3) and (4) into a single general equation for DA so that the end solution is optimized for any size N. This leads to the following equation:

b = Σ_{r=0}^{N/n−1} Σ_{j=0}^{m/p−1} 2^{jp} ( Σ_{k=nr}^{n(r+1)−1} a_k ( Σ_{q=0}^{p−1} 2^q x_{k,jp+q} ) )   (5)

where n and p are DA parameters indicating a working inner product length and a working bit-depth, respectively.
In applications such as the inner products for radio communications noted above, there are many input vectors x(i)=[x0(i), . . . , xN−1(i)]T, i=0, . . . , L−1, for which the inner products

b(i) = Σ_{r=0}^{N/n−1} Σ_{j=0}^{m/p−1} 2^{jp} ( Σ_{k=nr}^{n(r+1)−1} a_k ( Σ_{q=0}^{p−1} 2^q x_{k,jp+q}(i) ) )   (6)

need to be calculated.
Clearly, an explanation on a generic basis may soon become unclear to the reader due to the large number of input vectors being considered, and so a specific example will be used hereinafter: implementation of FIR filtering and cross-correlation types of operations, which exemplify the general description of these teachings. This is also seen to be an embodiment in which the technical advantage of increased computational efficiency is quite pronounced. Specific examples of vectors on which the moving window embodiments may be implemented include interpolation filters or channel filters applied to received wireless communication signals; and pre- or post-filtering of image rows and columns, particularly of video or gaming image data, but also of audio signals and/or for the purpose of de-noising image data. These are exemplary of, and not limiting to, the broad and varied implementations for which these teachings may be employed.
In FIR filtering and cross-correlation types of operations, a vector of known fixed coefficients is multiplied by vectors that are formed by input signal samples entering a window sliding across the long input signal. One can call this type of operation moving window inner products. If, for example, we denote the FIR filter window size by N, the filter coefficient vector by a=[a0, . . . , aN−1], and the input signal by X=x0, x1, . . . , xN−1, xN, xN+1, . . . , xM, then the inner product b(i)=a·x(i) of the vector a with the vector x(i)=[xi, . . . , xi+N−1]T is computed to obtain the i th output, i=0, . . . , L−1 (where L is the number of outputs, which is typically the same as the number of inputs M, but here, without loss of generality, we allow it to be less than M to simplify the equations).
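A direct (non-DA) reference computation of these moving window inner products, with illustrative toy values, is simply:

```python
# Reference computation of the moving-window inner products
# b(i) = a . x(i) defined above.
N = 3
a = [2, -1, 3]                          # filter coefficients
X = [1, 4, 2, 0, 5, 7, 3]               # input signal
L = len(X) - N + 1                      # number of outputs

b = [sum(a[k] * X[i + k] for k in range(N)) for i in range(L)]
# b == [4, 6, 19, 16, 12]
```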
Therefore equation (6) in this case is transformed to:

b(i) = Σ_{r=0}^{N/n−1} Σ_{j=0}^{m/p−1} 2^{jp} ( Σ_{k=nr}^{n(r+1)−1} a_k ( Σ_{q=0}^{p−1} 2^q x_{k+i,jp+q} ) )   (7)

One can see from examining equation (7) that in this case the multiple vectors that participate in inner products with the vector a contain common components. This property may be used for more efficient utilization of the ASP's CAM arrays for representing and processing the bit-slices in the innermost braces of equation (7).
The teachings of this invention, detailed below with particularity, are seen to provide at least four distinct differences over the prior art DA or ASP implementations, summarized below.
First: there is an input data format rearrangement which enables application of distributed arithmetic in the memory of the associative processor array. This is an important step in order to obtain the requisite processing efficiency, and this data format arrangement is especially efficient for implementing FIR filters or other operations involving calculation of inner products of a fixed vector with a plurality of other vectors involved in a window sliding across a long input vector. It is noted that an associative processor array could also be used solely for this purpose. It is well known that distributed arithmetic needs a data format which is not convenient to store in traditional memories. Traditional FIFO-based conversion of the data format to a suitable one is known to be power consuming. This data format conversion is therefore important to achieving the efficiencies possible by these teachings.
Second: the distributed arithmetic technique is applied without a need to store pre-calculated binary inner products in look-up tables. This alone is seen as fundamentally different from the underlying principles of DA.
Third: parallelization of distributed arithmetic. In a traditional look-up table based implementation of distributed arithmetic, the level of parallelization is restricted to the total number of ports of all the look-up tables used, whereas in the associative processor based method the level of parallelization is restricted only by the size of the associative processor's memory.
Fourth: a multiplication-less method of implementing inner products on associative processors. The conventional associative processor-based method for implementing inner products would involve multiplications which are rather slow on associative processors.
With those guideposts in mind, we now detail how computations according to equation (6), and by example for FIR filtering types of operations in particular, computations according to (7), are implemented on associative processors according to an exemplary and non-limiting embodiment. For simplicity, this particular description is provided for associative processors consisting of a single CAM array and a single tags register such as the arrangement shown at
Furthermore, it can be shown that in most of the practical cases of implementing DA on ASPs, the optimal choice for p in equations (6) and (7) is p=1. Therefore, we will use this value of p in describing the preferred embodiments and in the illustrations.
Equations (6) and (7), for the case p=1, may be rewritten as

b(i) = Σ_{r=0}^{N/n−1} Σ_{s=0}^{m−1} 2^s ( a(r)·x(i,r,s) )   (8)

and

b(i) = Σ_{r=0}^{N/n−1} Σ_{s=0}^{m−1} 2^s ( Σ_{k=nr}^{n(r+1)−1} a_k x_{k+i,s} )   (9)

respectively, where we have denoted by x(i,r,s)=[x_{nr,s}(i), x_{nr+1,s}(i), . . . , x_{n(r+1)−1,s}(i)]T the s th, s=0, . . . , m−1, bit-slice of the r th, r=0, . . . , N/n−1, subvector of the vector x(i), which is multiplied by the r th subvector a(r)=[a_{nr}, a_{nr+1}, . . . , a_{n(r+1)−1}]T of the vector a according to (8).
In the case of moving window inner product operations [denoted by equation (9)], x(i,r,s)=[x_{nr+i,s}, x_{nr+i+1,s}, . . . , x_{n(r+1)+i−1,s}]T. Let us note that, in this case, x(i+ln,r,s)=x(i,r+l,s) for any integer l such that 0<i+ln<L−1 and 0<r+l<N/n−1. This in particular means that, once stored in the CAM memory in the needed format, the same subvector may be reused for computation of several outputs.
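The reuse identity can be checked numerically with a short Python sketch (the helper name bit_slice and the toy signal are illustrative):

```python
# Numeric check of the reuse identity x(i+l*n, r, s) == x(i, r+l, s)
# for the moving-window bit slices defined above.
n = 4
X = list(range(40))                     # toy input signal

def bit_slice(i, r, s):
    # s-th bit slice of the r-th length-n window subvector at offset i
    return [(X[n * r + i + k] >> s) & 1 for k in range(n)]

l = 2
same = bit_slice(3 + l * n, 1, 5) == bit_slice(3, 1 + l, 5)
# same == True: a slice stored for one output serves another output.
```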
At the beginning of the actual implementation, we assume that input vectors x(i), i=0, . . . , L−1, are written in the CAM array 402 of the associative processor in the conventional bit-serial manner as shown in
At block 304 of
The iterations are indexed as k=0 . . . n−1 and denote the component of the subvector x(i,r,s) as in equations (8) and (9).
Consider the transforms shown at
The X'd out cells of the CAM array at
As the result of the transform which occurs through the k=n−1 iterations, the input bits are rearranged into an order where each bit-slice of each subvector x(i,r,s), i=0, . . . , L−1, r=0, . . . , N/n−1, s=0, . . . , m−1, participating in computation of one binary product in equation (8), is written in one associative word.
In an arrangement for implementing the moving window FIR type of operation according to this specific example, the input vectors have common components. Arrangement of the bits before block 304 of
Note that there are no X'd out bits/cells for the specific embodiment of
The complexity of block 304 (
Moving now to block 306 of
At each iteration t=0, . . . , 2^n−1, a next possible binary vector t of length n is first compared in parallel to all the binary slices written into the ASP's associative rows 406 at block 304/Step 1 of
Clearly at the end of the 2^n compare-write iterations, the binary products a(r)·x(i,r,s) of all the subvectors x(i,r,s), i=0, . . . , L−1, r=0, . . . , N/n−1, written to the ASP's associative rows 406 at block 304/Step 1, with the corresponding subvectors a(r), will have been computed. Therefore at the end of block 306/Step 2, all the binary products participating in equation (8) (in the general case) or in equation (9) (in the case of moving window inner products) will be computed and stored in corresponding associative rows 406 of the ASP, which is shown specifically at
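This step can be sketched in Python as follows (a simulation only: the per-group write is modeled with an explicit loop over r, and the row contents, subvectors, and names are illustrative):

```python
# Sketch of Step 2: 2^n compare-write iterations compute every binary
# product a(r) . x(i,r,s) held in the associative rows, without any
# stored look-up table.
n = 4
a_sub = {0: [1, -2, 3, 4], 1: [-1, 2, 0, 5]}    # subvectors a(r)

# Each row holds its subvector index r and one bit-slice x(i,r,s).
rows = [(0, [1, 0, 1, 1]), (1, [0, 1, 1, 0]), (0, [1, 1, 1, 1])]
out = [None] * len(rows)

for t in range(1 << n):                 # one compare-write per pattern
    bits = [(t >> k) & 1 for k in range(n)]
    tags = [sl == bits for _, sl in rows]        # compare, all rows
    for r, coeffs in a_sub.items():              # write per group r
        val = sum(c * b for c, b in zip(coeffs, bits))
        out = [val if (tag and rr == r) else o
               for (rr, _), tag, o in zip(rows, tags, out)]

# out == [8, 2, 6]
```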
It follows that the complexity of block 306/Step 2 is 2^{n+1} machine cycles, since each of the 2^n iterations consists of one compare and one write operation.
Now consider block 308/Step 3 of
Note that there are N/n groups, each consisting of m addends (see (8) and (9)). Therefore, there are

⌈log₂(mN/n)⌉

stages of parallel additions to accomplish in order to sum up all the binary inner products of equations (8) or (9). Before implementing each of these stages, one needs to arrange the addends so that the pairs participating in one addition are written in the same associative word of the ASP. It is easy to see that this rearrangement may be implemented in at most 2f machine cycles of shift and write, where f is the number of bits of the binary inner products obtained at block 306/Step 2. Therefore, the complexity of block 308/Step 3 may be estimated as

⌈log₂(mN/n)⌉ (2f + C_add(m̃))

where C_add(m̃) is the complexity of m̃-bit additions, m̃ being the output precision.
Definitely, C_add(m̃) ≦ 8m̃, where the upper bound 8m̃ of the complexity corresponds to the parallel addition procedure detailed for ASP processing in the background above. Thus, the complexity of block 308/Step 3 may be estimated as

⌈log₂(mN/n)⌉ (2f + 8m̃)

machine cycles.
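The stage count can be illustrated with a short Python sketch of the pairwise addition tree (values are illustrative; each pass of the while loop models one stage of parallel associative additions):

```python
# Sketch of Step 3: summing the (already shifted) binary inner products
# in ceil(log2(count)) stages of pairwise parallel additions.
from math import ceil, log2

addends = [3, -1, 4, 1, 5, 9, 2, 6]     # e.g. m*N/n shifted products
stages = 0
while len(addends) > 1:
    # one stage: every adjacent pair is added "in parallel"
    addends = [addends[i] + (addends[i + 1] if i + 1 < len(addends) else 0)
               for i in range(0, len(addends), 2)]
    stages += 1

# stages == 3 == ceil(log2(8)); addends[0] == 29
```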
Now consider that up to Q=T/(mN) inner products of length N may be computed in parallel by the exemplary approach above, where T is the number of rows 406 in the CAM array 402 of the ASP. The total complexity then for these Q inner products may be evaluated as:
In the case of FIR filtering types of operations, the complexity is given by the same formula but for S=T/m=NQ output samples. The comparatively higher performance is achieved due to the higher degree of CAM row utilization noted above, and therefore a higher level of parallelism. The complexity of the above exemplary computational approach per one inner product may be estimated as:
in the general case, and for the specific moving-window case as:
Typically N and m are much smaller than T. For example, in radio modems, typically N<125 and m=8 or m=16, while, as mentioned above, a typical value for T is T=2^16.
As an illustration of the computational efficiency improvements these teachings may offer, the complexity of the above exemplary embodiments is now compared to that of three conventional methods.
First, consider a conventional multiply-accumulate (MAC) based implementation of Q inner products. Assuming an architecture that involves P MAC units, the complexity of the MAC based method per inner product may be estimated as

(N/P)·C_MAC(m)

machine cycles, where C_MAC(m) is the number of machine cycles for an m-bit MAC operation. Since in the exemplary embodiments detailed above the value of T is assumed to be very large (up to 2^16) and the value of n is a parameter that may be optimized, and since most practical architectures contain a moderate number P of MAC units (usually P≦16), a significant complexity reduction may always be achieved by these exemplary embodiments as compared to the MAC-based one.
Next, the exemplary embodiments detailed above are compared to a conventional distributed arithmetic approach. Assume a distributed arithmetic architecture utilizing a memory of the same total size of T words as the assumed associative processor in the exemplary embodiments of these teachings. Then a total of
parallel look-up tables, each of size 2^n, may be utilized to implement computations according to equations (8) or (9) in parallel for
inner products. As an aside, it is noted that the overlapping of input vectors that characterizes FIR filtering type operations is additionally difficult to exploit in DA. Now assuming
adders are available, then
machine cycles are needed in order to implement shift-additions according to equations (8) (or (9)), where C+(m) is the number of machine cycles for one m-bit addition with a conventional adder. Therefore, the complexity per one inner product for the conventional distributed arithmetics technique is estimated as:
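The shift-add flow referred to here can be sketched in software as follows. This is a generic distributed-arithmetics inner product: the N coefficients are split into groups of n, a 2^n-entry table of coefficient partial sums is precomputed per group, and for each of the m bit planes of the inputs the table entries selected by the input bits are accumulated with a shift. The partitioning and table layout shown are generic DA conventions for unsigned inputs, not details taken from equations (8) or (9) themselves.

```python
# Generic distributed-arithmetics (DA) sketch of one inner product
# y = sum_i a[i]*x[i] for unsigned m-bit inputs x[i]. The N coefficients
# are split into groups of n; each group gets a 2**n-entry lookup table
# of coefficient partial sums, consulted once per bit plane.

def build_tables(a, n):
    tables = []
    for g in range(0, len(a), n):
        group = a[g:g + n]
        table = []
        for idx in range(1 << len(group)):
            # partial sum of the group's coefficients selected by idx's bits
            table.append(sum(c for j, c in enumerate(group) if (idx >> j) & 1))
        tables.append(table)
    return tables

def da_inner_product(a, x, m, n):
    tables = build_tables(a, n)
    y = 0
    for b in range(m):                       # one pass per bit plane
        for g, table in enumerate(tables):   # one lookup per table
            bits = 0
            for j, xi in enumerate(x[g * n:(g + 1) * n]):
                bits |= ((xi >> b) & 1) << j
            y += table[bits] << b            # shift-add accumulation
    return y

a = [3, 1, 4, 1, 5, 9]
x = [2, 7, 1, 8, 2, 8]
print(da_inner_product(a, x, m=4, n=2))  # 107, equal to sum(ai * xi)
```

Per bit plane, only one lookup and one shift-addition is needed per table, which is the source of DA's efficiency; the memory cost is the 2^n-entry tables, which is why n is a parameter to be optimized.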
Since T is a large number, a clear complexity gain is again evident.
Finally, the exemplary embodiments detailed above are compared to a conventional associative computing approach. In this case, T/m inner products could be computed in parallel according to equation (1), utilizing the same ASP as in the exemplary embodiments according to these teachings. Assuming that the complexity of one m-bit multiplication on the ASP is Cmpy(m) and the complexity of one m-bit addition on the ASP is Cadd(m), then CASP(T/m)=NCmpy(m)+(N−1)Cadd(m) machine cycles are needed to implement computations according to equation (1). Therefore, the complexity per inner product for the conventional associative processing technique is estimated as:
Since Cmpy(m)=O(m^2) while Cadd(m)=O(m), and since the complexity of the exemplary embodiments according to these teachings can be varied by varying the value of n, a significant gain is again evident, especially in the case of FIR filtering type of operations.
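The asymptotic claim Cmpy(m)=O(m^2) versus Cadd(m)=O(m) can be made concrete with a textbook bit-serial (shift-add) multiplier: one m-bit multiply decomposes into up to m conditional m-bit additions, so a multiply costs on the order of m times one addition. The sketch below is a generic construction for unsigned operands and is not the ASP's actual multiply microcode.

```python
# Textbook bit-serial (shift-add) multiplier: one m-bit multiply performs
# up to m conditional additions of m-bit-wide operands, hence
# Cmpy(m) = O(m**2) bit operations versus Cadd(m) = O(m) for one addition.

def shift_add_multiply(a, b, m):
    """Multiply unsigned m-bit integers a and b in m shift-add steps."""
    acc = 0
    for i in range(m):          # m iterations ...
        if (b >> i) & 1:
            acc += a << i       # ... each at most one O(m)-bit addition
    return acc

print(shift_add_multiply(13, 11, 4))  # 143
```

Replacing these O(m^2) multiplications with DA-style table lookups and shift-additions is precisely the source of the gain claimed for the combined approach.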
By the above comparison, clearly the combination of DA with ASC as detailed herein provides a synergistic gain over either independent prior art approach.
So in summary, some of the advantages offered by specific exemplary embodiments according to these teachings include:
These teachings are seen to be so divergent from what is known to the inventors as being within the prior art that implementation may in some instances require new programming models and possibly new programming skills to exploit advantages of the invention. Further, depending on the data storage format in the main memory of the system and depending on the input/output (I/O) types supported by the ASP, there may be some early adoption difficulties in organizing data in the CAM arrays in the format needed for implementing these teachings in the most efficient manner. This may however be solved by modifications in ASC principles and by introducing some modifications to ASP architectures to fully exploit the computational efficiency and high levels of parallelism that are the potential of this technique.
Embodiments of the invention may be advantageously deployed in elements of a communication system, such as in chips/processors and/or software embodied in a memory of a user equipment or access node of a wireless communication system.
The UE 10 includes a controller, such as a computer or a data processor (DP) 10A, a computer-readable memory medium embodied as a memory (MEM) 10B that stores a program of computer instructions (PROG) 10C, and a suitable radio frequency (RF) transceiver 10D for bidirectional wireless communications with the eNB 12 via one or more antennas. The eNB 12 also includes a controller, such as a computer or a data processor (DP) 12A, a computer-readable memory medium embodied as a memory (MEM) 12B that stores a program of computer instructions (PROG) 12C, and a suitable RF transceiver 12D for communication with the UE 10 via one or more antennas. The eNB 12 is coupled via a data/control path 13 to the NCE 14. The path 13 may be implemented as the S1 interface shown in
At least one of the PROGs 10C and 12C is assumed to include program instructions that, when executed by the associated DP, enable the device to operate in accordance with the exemplary embodiments of this invention, as will be discussed below in greater detail.
That is, the exemplary embodiments of this invention may be implemented at least in part by computer software executable by the DP 10A of the UE 10 and/or by the DP 12A of the eNB 12, or by hardware, or by a combination of software and hardware (and firmware).
For the purposes of describing the exemplary embodiments of this invention, the UE 10 may be assumed to also include an ASP data array 10E, and the eNB 12 may also include its own ASP data array arrangement 12E. Such data array arrangements include at least: a data array 402 with storage units in rows 406 and columns 404; a tags array 410, which may be one or more rows or columns apart from the data array 402; a mask array 124, which may also be one or more rows or columns apart from the data array 402 and from the tags array 410; and a pattern array 128, which may also be one or more rows or columns apart from the data array 402, from the tags array 410 and from the mask array 124. The data array arrangements 10E, 12E may be similar in relevant respects to that shown by example at
In general, the various embodiments of the UE 10 can include, but are not limited to, any of the following exemplary devices which have wireless communication capabilities, and/or image processing (e.g., compression) capabilities: cellular telephones, personal digital assistants (PDAs), portable computers, image capture devices such as digital cameras, gaming devices (particularly those having 3-dimensional image processing capacity), music storage and playback appliances, Internet appliances permitting wireless Internet access and browsing, as well as portable units or terminals that incorporate combinations of such functions.
The computer readable MEMs 10B and 12B may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The DPs 10A and 12A may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multicore processor architecture, as non-limiting examples.
b illustrates further detail of an exemplary UE in both plan view (left) and sectional view (right), and the invention may be embodied in one or some combination of those more function-specific components. At
Within the sectional view of
Signals to and from the camera 28 pass through an image/video processor 44 which encodes and decodes the various image frames. A separate audio processor 46 may also be present controlling signals to and from the speakers 34 and the microphone 24. The graphical display interface 20 is refreshed from a frame memory 48 as controlled by a user interface chip 50 which may process signals to and from the display interface 20 and/or additionally process user inputs from the keypad 22 and elsewhere.
Certain embodiments of the UE 10 may also include one or more secondary radios such as a wireless local area network radio WLAN 37 and a Bluetooth® radio 39, which may incorporate an antenna on-chip or be coupled to an off-chip antenna. Throughout the apparatus are various memories such as random access memory RAM 43, read only memory ROM 45, and in some embodiments removable memory such as the illustrated memory card 47 on which at least some of the various programs 10C may be stored. All of these components within the UE 10 are normally powered by a portable power supply such as a battery 49.
The aforesaid processors 38, 40, 42, 44, 46, 50, if embodied as separate entities in a UE 10 or eNB 12, may operate in a slave relationship to the main processor 10A, 12A, which may then be in a master relationship to them. Embodiments of this invention may be seen at one or multiple components within the UE 10 or eNB 12. For example, embodiments of this invention may be seen at the baseband processor/chip 42 for the case of processing radio-frequency signals, at the video processor/chip 44 for the case of processing still or moving image data that is input from the camera 28 (or image data received over a wireless link 11 via the antennas 36), at the audio processor/chip 46 for the case of processing audio data received over some download link, and at the WLAN processor/chip 37 and/or possibly also at the Bluetooth processor/chip 39 for non-cellular wireless signal processing. It is noted that other embodiments need not be disposed in any of those processors individually but may be disposed across various chips and memories as shown or disposed within another processor that combines some of the functions described above for
Note that the various chips (e.g., 38, 40, 42, etc.) that were described above may be combined into a fewer number than described and, in a most compact case, may all be embodied physically within a single chip.
The various blocks shown in
In general, the various exemplary embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the exemplary embodiments of this invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as nonlimiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
It should thus be appreciated that at least some aspects of the exemplary embodiments of the inventions may be practiced in various components such as integrated circuit chips and modules, and that the exemplary embodiments of this invention may be realized in an apparatus that is embodied as an integrated circuit. The integrated circuit, or circuits, may comprise circuitry (as well as possibly firmware) for embodying at least one or more of a data processor or data processors, a digital signal processor or processors, baseband circuitry and radio frequency circuitry that are configurable so as to operate in accordance with the exemplary embodiments of this invention.
Various modifications and adaptations to the foregoing exemplary embodiments of this invention may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. However, any and all modifications will still fall within the scope of the non-limiting and exemplary embodiments of this invention.
It should be appreciated that the exemplary embodiments of this invention are not limited for use with any one particular wireless protocol (e.g., LTE) or even to communications in general (e.g., can be employed for image processing apart from communicating the image data), but may be used to advantage in other wireless communication systems such as for example WLAN, UTRAN, global system for mobile communications GSM, wideband code division multiple access WCDMA, and the like.
It should be noted that the terms “connected,” “coupled,” or any variant thereof, mean any connection or coupling, either direct or indirect, between two or more elements, and may encompass the presence of one or more intermediate elements between two elements that are “connected” or “coupled” together. The coupling or connection between the elements can be physical, logical, or a combination thereof. As employed herein two elements may be considered to be “connected” or “coupled” together by the use of one or more wires, cables and/or printed electrical connections, as well as by the use of electromagnetic energy, such as electromagnetic energy having wavelengths in the radio frequency region, the microwave region and the optical (both visible and invisible) region, as several non-limiting and non-exhaustive examples.
Furthermore, some of the features of the various non-limiting and exemplary embodiments of this invention may be used to advantage without the corresponding use of other features. As such, the foregoing description should be considered as merely illustrative of the principles, teachings and exemplary embodiments of this invention, and not in limitation thereof.