The present disclosure is directed to a squaring technique that can be implemented as a circuit or as a software algorithm, and more particularly, a squaring technique that uses an arbitrary radix number system.
Squaring is an arithmetic operation used in many digital systems. Squaring circuits can be used for digital signal processing applications, such as image compression, pattern recognition, and others. Squaring is also used as an atomic computation for some cryptography algorithms. Squaring circuit architecture is also commonly incorporated in graphics processors. Several general purpose multiplier circuit designs have also been proposed based on squaring of input operands.
Certain aspects of the present disclosure pertain to methods, circuit elements, and computer program products for squaring a value. A fixed-point value with a fixed word size and a substring size for substrings of the fixed-point value can be identified, wherein the fixed-point value comprises a binary bit string. A square of the fixed-point value can be determined using the fixed-point value, the substring size, and a number of least significant bits of the fixed-point value equal to the substring size.
In some implementations, a square can be determined by iteratively determining squares of substrings of the fixed-point value. Each iteration uses the substring formed by a number of least significant bits of the current operand equal to the substring size. The operand in each iteration comprises a portion of the previous operand and is formed by decatenating those least significant bits from the previous operand.
In some implementations, determining a square of a fixed-point value can include identifying the fixed-point value as an operand. A substring of the operand can be determined as the least significant bits of the operand, where the substring is of a specified substring size. The substring can be decatenated from the operand to form a word. The substring can be squared using the word, the substring, and the substring size. The square of the substring can be added to a result. If a length of the word is greater than zero, the word can be identified as the operand and the determining, decatenating, squaring, and adding steps can be executed again. If the length of the word and substring is zero, one more iteration is undertaken to account for non-zero residual values, and the result is identified as the square of the fixed-point value.
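The loop described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `square_by_substrings` is hypothetical, and the per-step contribution is written in the collapsed form `sub * (2 * A + sub)`, which equals the difference between the square of the operand and the square of the operand with its substring cleared. The sketch runs one pass per substring and does not model the extra final iteration mentioned in the text.

```python
def square_by_substrings(value, word_size, m):
    """Square a non-negative integer by consuming m least significant bits per pass.

    value     : fixed-point value treated as an unsigned bit string
    word_size : number of bits in the value (assumed, for loop bounds)
    m         : substring size in bits (radix beta = 2**m)
    """
    beta = 1 << m
    operand = value
    result = 0
    shift = 0                                # weight of this pass is beta**(2*i)
    passes = (word_size + m - 1) // m        # one pass per m-bit substring
    for _ in range(passes):
        sub = operand & (beta - 1)           # least significant m bits
        word = operand >> m                  # operand with the substring decatenated
        A = word << m                        # operand with the substring replaced by zeros
        partial = sub * (2 * A + sub)        # operand**2 - A**2
        result += partial << shift
        shift += 2 * m                       # next pass contributes two digits higher
        operand = word
    return result
```

For example, `square_by_substrings(0b1011, 4, 2)` processes the substrings `11` and `10` in turn and accumulates 57 and then 4·16, giving 121.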
In some implementations, the following expansion can be calculated:
where A is the word, β is the radix, the substring size is log2 [β], and b is the substring value minus β/2.
The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims. For example, hardware-based squaring circuits, such as those described here, can accommodate the increasing demand for cryptography hardware support in low power, high-speed mobile devices.
The present disclosure describes an iterative squaring technique that produces a 2nm-bit result, $\alpha^2$, based on an input operand (often referred to as a squarand) α that is nm bits in length. The circuit produces 2m bits of the output $\alpha^2$ during each iterative step. By considering an m-bit grouping within the squarand α as representing a single radix-$2^m$ digit, the circuit can be considered a digit-serial implementation that produces two m-bit digits per iteration.
This digit-serial architecture may allow for a tradeoff between bit-serial and parallel architectures by allowing for the digit to be represented by m bits. Because 2m bits of the result are computed in each iterative step, varying m can yield more or less parallelism while inversely affecting required circuit area. Thus, a minimal or otherwise reduced area circuit can be realized when m is small (bit-serial for the case m=1) and a large parallel circuit results at the other extreme when m is set to the wordsize of the squarand. Designers may be able to choose an appropriate value of m such that performance requirements are met while minimizing or otherwise reducing the amount of circuitry required.
Arithmetically, the technique assumes the squarand is represented as a higher-radix digit string where each digit is represented by an m-bit substring. Furthermore, the technique may yield two digits of output squared value during each iterative step; hence, a total of 2m bits of the squared result are computed at each iterative step.
The device 102 includes a squaring module 106. The squaring module 106 (described in more detail in
Network 104 facilitates wireless or wireline communication between device 102 and other devices. Network 104 may be all or a portion of an enterprise or secured network. In another example, network 104 may be a VPN between device 102 and other devices across a wireline or wireless link. Such an example wireless link may be via 802.11a, 802.11b, 802.11g, 802.11n, 802.20, WiMax, and many others. The wireless link may also be via cellular technologies such as 3GPP GSM, UMTS, LTE, etc. While illustrated as a single or continuous network, network 104 may be logically divided into various sub-nets or virtual networks without departing from the scope of this disclosure, so long as at least a portion of network 104 may facilitate communications between senders and recipients of requests and results. In other words, network 104 encompasses any internal and/or external network, networks, sub-network, or combination thereof operable to facilitate communications between various computing components in system 100. Network 104 may communicate, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and other suitable information between network addresses. Network 104 may include one or more local area networks (LANs), radio access networks (RANs), metropolitan area networks (MANs), wide area networks (WANs), all or a portion of the global computer network known as the Internet, and/or any other communication system or systems at one or more locations.
The following notation may be used in the description of the digit-serial fixed-point squaring algorithm:
β represents the radix or base of a number system. β may be in the set of natural numbers, $\beta \in \mathbb{N}$.
The ‘radix polynomial’ form of a value α is written as an n-term polynomial of the form:
$$\alpha = a_{n-1}\beta^{n-1} + a_{n-2}\beta^{n-2} + \ldots + a_2\beta^2 + a_1\beta^1 + a_0\beta^0$$
A value α can also be represented in the radix-β number system in the form of a positional string of n characters denoted by $\alpha=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, a_0]$. For clarity, the character strings denoting the positional digit representations of a value α may be enclosed by square brackets. The digits $a_i$ are the coefficients of the radix-polynomial form and their position within the string inherently denotes the exponent of the radix β.
Each character ai in a positional string representing a value is referred to as a “digit” regardless of the radix of the number system. Binary digits may alternatively be referred to as “bits.”
Digits are restricted to the natural numbers when $\beta \le 10$, and are members of the set:
$\{a_i \in \mathbb{N} \mid 0 \le a_i \le \beta-1\}$.
For the case where β>10, alternative single characters are used to represent a digit such as the characters “A” through “F” for the case of β=16.
Where necessary for clarity, digit strings are subscripted by the radix β of the particular number system being used, $\alpha=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, a_0]_\beta$.
LSD(α,k) and MSD(α,k) are operators that yield the k least significant or most significant digits, respectively, in the digit string representing a value α. LSD(α,1) represents the least significant digit of α, LSD(α,1)=$a_0$. Likewise, the most significant digit is given as MSD(α,1)=$a_{n-1}$.
{A,B,C} denotes concatenation of the content of registers A, B, and C which can be of any size and whose individual sizes may differ.
SHL(A,k,B) denotes the operation of shifting the content of register A to the left by k bits and setting the least significant k bits to the content of register B. A can be of any size greater than or equal to the size of B and B must be of size k.
SHR(A,k,B) denotes the operation of shifting the content of register A to the right by k bits and setting the most significant k bits of A to the content of register B. A can be of any size greater than or equal to the size of B and B must be of size k.
A←B denotes the operation of setting the content of register A with that of register B. A and B can be the same size in some implementations.
The radix-β value A is defined as $A=\alpha-a_0$. Expressed as a positional n-digit string:
$A=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, 0]_\beta$.
Thus, A can be formed by replacing LSD(α,1)=a0 with the zero digit [0]β or as:
A={SHR([α−0]β,1,[0]β),[0]β}.
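The register operators in the notation above can be modeled in software roughly as follows. This is a sketch under assumptions: registers are modeled as Python integers with an explicit bit width, the helper names `shl`, `shr`, `concat`, and `clear_lsd` are hypothetical, and a one-digit shift is modeled as an m-bit shift with $\beta=2^m$. The `clear_lsd` helper simply shifts the least significant digit out and back in as zero; it is one way to obtain A and is not claimed to reproduce the exact register expression given in the text.

```python
def shl(a, width, k, b):
    """SHL(A,k,B): shift a left by k bits within `width` bits, fill the low k bits from b."""
    mask = (1 << width) - 1
    return ((a << k) & mask) | (b & ((1 << k) - 1))

def shr(a, width, k, b):
    """SHR(A,k,B): shift a right by k bits, fill the high k bits of the register from b."""
    return (a >> k) | ((b & ((1 << k) - 1)) << (width - k))

def concat(*regs):
    """{A,B,C}: concatenate (value, width) pairs, first pair most significant."""
    value, width = 0, 0
    for v, w in regs:
        value = (value << w) | (v & ((1 << w) - 1))
        width += w
    return value, width

def clear_lsd(alpha, width, m):
    """Form A = alpha - a0 by replacing the least significant m-bit digit with zero."""
    return shl(shr(alpha, width, m, 0), width, m, 0)
```

For instance, with a 6-bit register and m=2, `clear_lsd(0b111001, 6, 2)` yields `0b111000`.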
The present disclosure describes a circuit and algorithm such that the choice of radix β allows for a trade-off in logic circuit area versus throughput performance in the computation of $\alpha^2$ when α is represented as a binary bit string. Higher values of β allow more bits to be produced per iterative step in the resulting representation of $\alpha^2$. A tradeoff occurs in that the amount of computation or logic required at each iterative step increases for higher radix values.
As the basis of the algorithm stated here, it is assumed that the squarand is in the form of a binary bit string. Intermediate computations can be efficiently implemented when the radix β is of the form $\beta=2^m$, where m is a positive integer, $m \ge 2$. This efficiency results because $\beta=2^m$ allows each higher-radix digit in the string representing α to be equivalent to an m-bit substring within α. In terms of a higher-radix digit string, α is simply the concatenation of the disjoint m-bit substrings of α in binary form, where LSD(α,1) is the least significant m bits, the next more significant higher-radix digit is represented by the next group of m bits to the left of LSD(α,1), and so on.
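The correspondence between m-bit substrings and radix-$2^m$ digits can be illustrated with a short sketch. The helper names `to_radix_digits`, `from_radix_digits`, `lsd`, and `msd` are hypothetical, and digit strings are modeled as Python lists with the most significant digit first.

```python
def to_radix_digits(value, n, m):
    """Split an n*m-bit value into n radix-2**m digits, most significant first."""
    mask = (1 << m) - 1
    return [(value >> (m * i)) & mask for i in reversed(range(n))]

def from_radix_digits(digits, m):
    """Evaluate the radix polynomial for a digit list, most significant first."""
    value = 0
    for d in digits:
        value = (value << m) | d
    return value

def lsd(digits, k):
    """LSD(alpha, k): the k least significant digits."""
    return digits[len(digits) - k:]

def msd(digits, k):
    """MSD(alpha, k): the k most significant digits."""
    return digits[:k]
```

For example, the 6-bit string `111001` read as three radix-4 digits is `[3, 2, 1]`, so `lsd(..., 1)` returns `[1]` and `msd(..., 1)` returns `[3]`.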
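For convenience in specifying the basis of the algorithm, Equation (1) can be written with the restriction that $\beta=2^m$ and some of the individual terms on the right-hand side of the equation can be denoted as $T_1$, $T_2$, and $T_3$. $\alpha^2$ can be written as:
$$\alpha^2 = (A/\beta)^2\, 2^{2m} + T_1 + T_2 + T_3$$
The terms $T_1$, $T_2$, and $T_3$ are explicitly defined as follows:
$$T_1 = (A/\beta)\, 2^{2m} + (\beta/2)^2, \qquad T_2 = 2(A+\beta/2)\, b, \qquad T_3 = b^2$$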
The idea behind the algorithm may be to compute terms $T_1$, $T_2$, and $T_3$ during each iterative step and accumulate them with the previous result. Subsequent iterations use $A/\beta$ from the $(A/\beta)^2$ term in Equation (1) as a squarand. The subsequent operand $A/\beta$ for each iterative step is a digit string containing one fewer digit than the squarand in the previous step, indicating that the iterative algorithm requires O(n/m) iterations to complete. The $2^{2m}$ shifting factor of the first term in Equation (1) illustrates the fact that two digits (2m-bit strings) are produced at each step and they represent digits in $\alpha^2$ that are produced in the order of the lesser significant digits first.
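The decomposition can be checked numerically with a small recursive sketch. The term definitions used here are inferred from the surrounding description of how $T_1$, $T_2$, and $T_3$ are formed ($T_1=(A/\beta)2^{2m}+(\beta/2)^2$, $T_2=2(A+\beta/2)b$, $T_3=b^2$ with $b=a_0-\beta/2$), and the function names `terms` and `square_recursive` are hypothetical.

```python
def terms(alpha, m):
    """Return (A/beta, T1, T2, T3) for one step, with beta = 2**m."""
    beta = 1 << m
    a0 = alpha & (beta - 1)                  # least significant radix-beta digit
    A = alpha - a0                           # alpha with a0 replaced by zero
    b = a0 - beta // 2                       # signed residual digit
    t1 = (A // beta) * beta ** 2 + (beta // 2) ** 2
    t2 = 2 * (A + beta // 2) * b
    t3 = b * b
    return A // beta, t1, t2, t3

def square_recursive(alpha, m):
    """alpha**2 = (A/beta)**2 * 2**(2m) + T1 + T2 + T3, applied recursively."""
    if alpha == 0:
        return 0
    nxt, t1, t2, t3 = terms(alpha, m)
    return square_recursive(nxt, m) * (1 << (2 * m)) + t1 + t2 + t3
```

With α=11 and m=2, one step gives $A/\beta=2$, $T_1=36$, $T_2=20$, $T_3=1$; the recursion on 2 contributes 4·16, and the total is 121.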
Several observations may be used to more efficiently implement the computation of the three terms $T_1$, $T_2$, and $T_3$ in the squaring algorithm. First, the term $A/\beta$ may be efficiently obtained by shifting the digit string representing α one position to the right and discarding $a_0$, $A/\beta=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1]_\beta$. Second, values that are multiplied by a factor of $\beta^k=2^{km}$ may be easily obtained by shifting the value to the left by km bit positions and inserting radix-β zero digits $[0]_\beta$ as placeholders for the vacated least significant digits. Third, the term β/2 is always of the form of a single radix-β digit. Expressed as an m-bit binary string, $\beta/2=[10 \ldots 0]_2$. Finally, the term $(\beta/2)^2$ is always of the form of two radix-β digits with the most significant digit of value β/4 and the least significant digit of value zero. Hence, expressed as a 2m-bit binary string, $(\beta/2)^2=[010 \ldots 0]_2$.
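The four observations translate directly into shift and constant operations, as the following sketch suggests. The helper names are hypothetical and values are modeled as non-negative Python integers with $\beta=2^m$.

```python
def a_over_beta(alpha, m):
    """Observation 1: A/beta is alpha shifted right by one digit (m bits)."""
    return alpha >> m

def times_beta_pow(value, k, m):
    """Observation 2: multiplying by beta**k is a left shift by k*m bits."""
    return value << (k * m)

def half_radix_bits(m):
    """Observation 3: beta/2 as an m-bit string, a one followed by zeros."""
    return format(1 << (m - 1), "0{}b".format(m))

def half_radix_squared_bits(m):
    """Observation 4: (beta/2)**2 as a 2m-bit string; upper digit is beta/4."""
    return format(1 << (2 * m - 2), "0{}b".format(2 * m))
```

For m=3 (β=8), `half_radix_bits(3)` gives `100` and `half_radix_squared_bits(3)` gives `010000`, whose upper 3-bit digit is 2 = β/4.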
Term $T_1$ can be computed in a single operation. Making use of the first and second observations, the value $(A/\beta)2^{2m}$ is obtained by forming the digit string $[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, 0\, 0]_\beta$. Furthermore, based on the fourth observation, $(\beta/2)^2$ can always be expressed as two radix-$2^m$ digits (2m bits) denoted as $[q_1 q_0]_\beta$. Thus, $T_1$ is obtained by forming the string $[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, q_1\, q_0]_\beta$.
Term $T_2$ is computed by first forming a digit string representing $2(A+\beta/2)$ and then multiplying this string with the single radix-β digit b. Relying on the first, second, and third observations, $A=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, 0]_\beta$ and β/2 may be represented as a single unsigned radix-$2^m$ digit (m-bit string). Therefore, $(A+\beta/2)=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, (\beta/2)]_\beta$. To account for the multiplicative factor of 2, this digit string is then shifted by one bit position to the left, resulting in $2(A+\beta/2)$. When a higher-valued radix β is used that is not an integral power of two ($\beta \ne 2^m$), the multiplicative factor of 2 would in general be implemented through an addition operation, $2(A+\beta/2)=(A+\beta/2)+(A+\beta/2)$, since the doubling can then be considered a “fractional digit shift.”
The final step in the formation of term $T_2$ involves the multiplication of $2(A+\beta/2)$, that is, the one-bit left shift of $[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, (\beta/2)]_\beta$, by the signed single radix-$2^m$ digit $b=a_0-\beta/2$. Because b is a single-digit value, this multiplication may be accomplished with a minimal or reduced amount of computation or circuitry as compared to a general purpose multiply operation or circuit. Clearly, as the value m is increased, resulting in a higher-valued radix $2^m$, both computational complexity and overall algorithm throughput may increase. The actual implementation of the multiplication by b may be dependent upon the value m and may be carefully considered for a given realization of the algorithm. Relatively small values of m generally allow for a simple logic circuit or lookup table to be used.
Term $T_3=b^2$ relies on the computation of the square of the residual value b. The implementation of this computation may also be dependent upon the size of m, which dictates the number of bits required to represent a radix-$2^m$ digit. For smaller values of m, the direct calculation of $b^2$ can be very efficiently implemented as a small combinational logic circuit or through a lookup table. As m increases, the computation of $b^2$ becomes more complex and other methods may be employed.
For large values of m, the computation of $T_3=b^2$ can be accomplished in parallel with the computation of the other two terms $T_1$ and $T_2$, since accumulation of $T_1+T_2+T_3$ with the overall result can occur at the end of each iterative step.
After terms $T_1$, $T_2$, and $T_3$ are formulated, they are summed together and accumulated with the previous result. The accumulation takes into account the process of multiplying subsequent iterative operands by $2^{2m}$ and the fact that two independent radix-β digits (or 2m bits) of the final result are produced at each iterative step. This can be implemented in a variety of ways, including using registers. The size of the result register may be 2nm bits, where n is the number of radix-β digits representing α and m is the number of bits in each radix-β digit. The final operation of each iterative step of the algorithm is to shift the result register 2m bits to the right and insert the 2m least significant bits of $T_1+T_2+T_3$ into the most significant positions of the shifted result register. Inserting the two radix-$2^m$ digits into the most significant portion of the result register, instead of performing a multi-bit left shift before adding them to the previously accumulated result, allows the algorithm to be implemented without a multi-bit left shift operation or a barrel shifting circuit in a hardware realization.
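A software model of this accumulate-and-shift scheme might look like the sketch below. Several points are assumptions made for illustration: the function name `square_digit_serial` is hypothetical, the term formulas are inferred from the description of $T_1$, $T_2$, and $T_3$, and the bits of each step's sum above the low 2m bits are held in a `carry` variable and added into the next step, which is one plausible reading of "accumulated with the previous result." The sketch runs exactly n steps and does not model any additional final iteration.

```python
def square_digit_serial(alpha, n, m):
    """Square an n-digit radix-2**m value using a 2nm-bit right-shifting result register."""
    beta = 1 << m
    width = 2 * n * m                        # result register size in bits
    low_mask = (1 << (2 * m)) - 1
    result = 0                               # models the right-shift result register
    carry = 0                                # upper bits carried to the next step (assumed)
    operand = alpha
    for _ in range(n):
        a0 = operand & (beta - 1)
        A = operand - a0
        b = a0 - beta // 2
        t1 = ((A >> m) << (2 * m)) + (beta // 2) ** 2
        t2 = 2 * (A + beta // 2) * b
        t3 = b * b
        total = t1 + t2 + t3 + carry         # never negative: T1+T2+T3 = a0*(2A + a0)
        low = total & low_mask               # two finished radix-beta digits
        carry = total >> (2 * m)
        # Shift right by 2m bits and insert the new digits at the top of the register.
        result = (result >> (2 * m)) | (low << (width - 2 * m))
        operand = A >> m                     # next squarand is A/beta
    return result
```

After n steps the first pair of digits produced has migrated to the least significant end of the register, so no left shift is needed at any point. For α=11, n=2, m=2 the register holds `0b10010000` after the first step and `0b01111001` (121) after the second.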
The algorithm uses an iteration index i to determine if all digits of the squarand have been produced. For an n-digit radix-β squarand, the squared result consists of 2n digits. Because two digits are produced per iterative step, the index i ranges from zero to (n/2)−1. Initially, when i=0, α is the original squarand. During intermediate computations, when 0<i<n/2, the algorithm iterates and sets the intermediate squarand $\alpha=A/\beta$. In the final iterative step, the squarand argument becomes α=0; however, this step is performed to account for circumstances when the residual b is not zero-valued.
Any given implementation of the algorithm should include careful consideration of the manner in which the signed digit b is represented. When explicitly represented using a radix-complement or a signed-magnitude form, m+1 bits are required to account for the sign. Furthermore, depending upon the definition of the residual, b can take on integer values in either of the ranges [−(β/2),(β/2−1)] (as is the case in this formulation) or [−(β/2)+1,(β/2)]. However, because there is a one-to-one relationship between the a0 and b values (since b=a0−β/2), the m-bit string representing a0 can be used as an encoding for the corresponding b value.
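Because b is determined by $a_0$, both b and $b^2$ can be tabulated with the m-bit digit $a_0$ as the index, as the following sketch suggests. The function name `residual_tables` is hypothetical, and the range used is the one from this formulation, [−(β/2), (β/2)−1].

```python
def residual_tables(m):
    """Return lookup tables (b_of, b_squared_of) indexed by the m-bit digit a0."""
    beta = 1 << m
    b_of = [a0 - beta // 2 for a0 in range(beta)]   # b = a0 - beta/2
    b_squared_of = [b * b for b in b_of]            # T3 = b**2
    return b_of, b_squared_of
```

For the quaternary case (m=2) the tables are `[-2, -1, 0, 1]` and `[4, 1, 0, 1]`, so no explicit sign bit is needed when the table is addressed by $a_0$.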
The algorithm formulated in the previous section makes use of several registers. For succinctness, the registers used within the algorithm statement are defined in Table 1, shown below:
A statement of the algorithm is given below. Intermediate locations within the algorithm are denoted by labels in the form “STEP k.” The labels are included for convenience in referring to certain portions of the algorithm and they also indicate clock boundaries in that the results of STEP k−1 are registered before computation occurs in STEP k. As an example, the T2←{AB,B2} operation of STEP 3 must complete before the T2←SHL(T2,1,[0]2) operation of STEP 4 can proceed. Breaking up the computation of term T2 into multiple intermediate registered operations is an example of pipelining the datapath and allows for the overall circuit clock speed to be increased. The steps are described below:
The example algorithm shown above can undergo n/m iterations producing 2m bits of $\alpha^2$ during each iterative step. Therefore, the algorithm has temporal complexity equivalent to O(n/m). In terms of required computational resources, the algorithm requires circuitry to perform shifting, bit-string concatenation, 2nm-bit operand addition, m×2nm-bit multiplication, and m-bit operand squaring. While 2nm-bit operand addition operations are required in STEPs 4 and 6, it is noted that a single 2nm-bit addition circuit can be used since these sums may be formed sequentially, allowing for reuse of the single 2nm-bit adder. The multiplication and single-digit squaring operations can be implemented in a variety of forms, although it is noted that due to the relatively small size of the operands (m bits), very compact and fast circuits such as lookup tables are a practical choice.
The computation of T2←T2×b in STEP 5 of the algorithm is accomplished by using a 4:1 multiplexer as a simple lookup structure with data paths of size 2nm and an m-bit control signal driven by the content of register B. The idea behind this circuit is similar to that of the T3←b×b computation in STEP 3, with the important difference that the possible T2×b values are computed during each iterative step rather than being precomputed and stored before circuit operation. Fortunately, these values are easily and efficiently computed since, for the quaternary implementation, they consist of the value 2(A+β/2) multiplied by only one of b∈{−2,−1,0,1}. Thus, a negated version of 2(A+β/2), namely −[2(A+β/2)], and a single-bit left-shifted version of −[2(A+β/2)] are used, together with 2(A+β/2) and [0 . . . 0], to drive the data inputs of the multiplexer.
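The multiplexer selection can be mimicked in software as a four-entry list indexed by the digit $a_0$. This is a sketch only: the function name `t2_times_b_mux` is hypothetical, signed values are modeled as ordinary Python integers rather than fixed-width two's-complement words, and the mapping of $a_0$ to b follows $b=a_0-\beta/2$ with β=4.

```python
def t2_times_b_mux(A, a0):
    """Select 2(A + beta/2) * b for beta = 4 from four candidate values."""
    x = 2 * (A + 2)            # 2(A + beta/2) with beta/2 = 2
    neg = -x                   # negated version
    inputs = [
        neg << 1,              # a0 = 0, b = -2: negated value shifted left one bit
        neg,                   # a0 = 1, b = -1: negated value
        0,                     # a0 = 2, b =  0: all zeros
        x,                     # a0 = 3, b = +1: value unchanged
    ]
    return inputs[a0]          # a0 acts as the 2-bit select signal
```

Only a negation and a one-bit shift are needed to generate every candidate, so no general multiplier appears in this path.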
The output of the multiplexer 204 is received by combinational logic 206. Combinational logic 206 includes several outputs: one output is coupled to the input of the multiplexer 204. The other outputs of combinational logic 206 are coupled to an adder array 208. Each of the multiplexer 204, the combinational logic 206, and the adder array 208 also includes as inputs control signals from a clocked synchronous controller (not shown).
The combinational logic 206 may be implemented based on simplifications in the formation of the intermediate terms T1, T2, and T3, and their various sums. These simplifications exploit the choice of using β=4 as an implicit operand radix and allow for the computation of the intermediate terms T1, T2, and T3 to be implemented with a reduced and simplified set of register transfer level (RTL) operations.
A single quaternary digit $[a_k]_4$ can, in general, be written as a two-bit binary string $[b_{2k+1} b_{2k}]_2$ where $b_i \in \mathbb{B}$ and $\mathbb{B}=\{0,1\}$. Using this definition, various intermediate terms and their sums can be evaluated for different cases of the least significant digit of the squarand, $a_0 \in \{0,1,2,3\}$. Term $T_1$ is independent of the value of $a_0$ and is always a bit string of length 2n+2 expressed as:
$T_1=[a_{n-1}\, a_{n-2} \ldots a_2\, a_1\, 1\, 0]_4=[b_{2n-1}\, b_{2n-2}\, b_{2n-3}\, b_{2n-4} \ldots b_5\, b_4\, b_3\, b_2\, 0\, 1\, 0\, 0]_2$
Case 1: a0=[0]4 resulting in the residual b=[−2]4, thus T3=b2=[10]4=[0100]2. Term T2 can be expressed as:
Combining the terms:
Case 2: a0=[1]4 resulting in the residual b=[−1]4, thus T3=b2=[01]4=[0001]2. Term T2 can be expressed as:
Combining the terms:
Case 3: a0=[2]4 resulting in the residual b=[0]4, thus T3=b2=[00]4=[0000]2. Term T2 can be expressed as:
Combining the terms:
Case 4: a0=[3]4 resulting in the residual b=[1]4, thus T3=b2=[01]4=[0001]2. Term T2 can be expressed as:
For this case, the sum T2+T3 can be formed directly and it is subsequently combined with term T1 using the addition circuit. T2+T3 is formed as:
Table 3 below contains a summary of the results of the intermediate terms and their various sums in terms of values of the least significant digit of the operand at each iterative step.
The combinational logic 206 makes use of the results in Table 3 and outputs the two (2n+2)-bit values that are summed in the adder array 208, resulting in $T_1+T_2+T_3$ (i.e., the combinational logic includes two outputs: one for each input of the adder array). For the cases $a_0 \in \{0,1,2\}$, $T_1+T_2+T_3$ is formed directly in the combinational logic 206 and is input to the adder array 208 on the left-most input bus with the right-most input set to the (2n+2)-bit string $[00 \ldots 00]_2$. The adder array 208 is used for the case of $a_0=3$, where the left-most input is the bit string $[b_{2n-1}\, b_{2n-2} \ldots b_3\, b_2\, 0100]_2$ and the right-most input is $[0\, b_{2n-1}\, b_{2n-2} \ldots b_3\, b_2\, 101]_2$.
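The two adder inputs for the quaternary case can be modeled as follows. The inputs for $a_0=3$ follow the bit strings given in the text. The closed forms used for $a_0 \in \{0,1,2\}$ are derived here from $T_1+T_2+T_3=a_0(2A+a_0)$ and are assumptions; they are not taken from Table 3. The function name `adder_inputs` and the parameter `hi` (the upper bits $b_{2n-1} \ldots b_2$, i.e. A/4) are hypothetical.

```python
def adder_inputs(hi, a0):
    """Return (left, right) adder inputs whose sum is T1 + T2 + T3 for beta = 4.

    hi : integer value of the bits b_{2n-1} ... b_2 (A / 4)
    a0 : least significant quaternary digit, 0..3
    """
    if a0 == 3:
        left = (hi << 4) | 0b0100      # T1      = [b_{2n-1} ... b_2 0100]
        right = (hi << 3) | 0b101      # T2 + T3 = [0 b_{2n-1} ... b_2 101]
        return left, right
    # Sums formed directly (derived, assumed): 0, 2A + 1, and 4A + 4.
    direct = {
        0: 0,
        1: (hi << 3) | 0b1,
        2: (hi << 4) | 0b100,
    }[a0]
    return direct, 0                   # right-most input is all zeros
```

In each case the sum of the two inputs equals $\alpha^2 - A^2$ for $\alpha = 4\cdot hi + a_0$, which is the amount contributed by the current iterative step.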
Accumulator 210 consists of an internal accumulator register, an internal adder circuit, and a feedback loop that allows for the internal adder output to be stored in the internal accumulator register. Accumulator 210 can receive the output of the adder array 208, where it is added to the previously stored value in the accumulator register and then stored back into the accumulator register. A right shift register 212 can receive the output of the accumulator. The size of the register 212 may be 2nm bits, where n is the number of radix-β digits representing α and m is the number of bits in each radix-β digit. The final operation of each iterative step of the algorithm is to shift the result register 2m bits to the right and insert the 2m least significant bits of $T_1+T_2+T_3$ into the most significant positions of the shifted result register. Inserting the two radix-$2^m$ digits into the most significant portion of the result register, instead of performing a multi-bit left shift before adding them to the previously accumulated result, allows the algorithm to be implemented without a multi-bit left shift operation or a barrel shifting circuit in a hardware realization. After the iterative steps are completed, the square $\alpha^2$ 214 can be output.
$(A/\beta)\, 2^{2m}+(\beta/2)^2$ (514).
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made including portions or the entirety of the implementation in software form. Accordingly, other implementations are within the scope of the following claims.