This disclosure relates to encryption/decryption and, in particular, to the Data Encryption Standard (DES).
The Data Encryption Standard (DES) is described in Federal Information Processing Standards (FIPS) Publication (Pub) 46-3. DES encryption encodes a 64-bit data block through 16 table lookups and associated data swaps. A table lookup and the associated data swaps may be referred to as a “round”. Hence, DES processes the 64-bit data block in 16 rounds. The 3-Data Encryption Standard (3-DES) performs three times the number of rounds performed by DES.
There are two key metrics for evaluating the performance of DES. One metric is the maximum speed at which a single data block can be encrypted. The other metric is the total aggregate bandwidth that can be encrypted, for example, the encryption of a 10 megabits per second (Mbs) data stream. A system may include multiple DES encryption units that operate in parallel in order to achieve the aggregate bandwidth.
Features of embodiments of the claimed subject matter will become apparent as the following detailed description proceeds, and upon reference to the drawings, in which like numerals depict like parts, and in which:
Although the following Detailed Description will proceed with reference being made to illustrative embodiments of the claimed subject matter, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art. Accordingly, it is intended that the claimed subject matter be viewed broadly, and be defined only as set forth in the accompanying claims.
The performance of DES encryption/decryption may be 10 megabits per second (Mbs), 100 Mbs, 1 gigabit per second (Gbs), or 10 Gbs for a unidirectional bit stream. If encrypting/decrypting a full-duplex stream, the bit rate is doubled.
For example, in order to achieve 1 gigabit per second (Gbs) full-duplex 3-DES operation in a system having a clock frequency of 533 megahertz (MHz), twelve cycles are allocated to encode/decode each 64-bit block. Because 3DES requires forty-eight (16*3) rounds per 64-bit block, four rounds must be performed per cycle.
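The arithmetic in this example can be checked with a minimal sketch. The twelve-cycle budget, the 533 MHz clock and the 48 rounds come from the text; the helper names are hypothetical. The second helper only computes how many clock cycles elapse while one 64-bit block arrives at a given line rate; how the twelve-cycle budget was chosen relative to that figure is not stated in the text.

```python
import math

def rounds_per_cycle(total_rounds: int, cycles_per_block: int) -> int:
    """Rounds that must complete in each clock cycle (rounded up)."""
    return math.ceil(total_rounds / cycles_per_block)

def cycles_per_block_time(clock_hz: float, bits_per_second: float,
                          block_bits: int = 64) -> float:
    """Clock cycles that elapse while one block arrives at the given bit rate."""
    return clock_hz * block_bits / bits_per_second

TDES_ROUNDS = 16 * 3        # three DES passes of 16 rounds each
CYCLE_BUDGET = 12           # cycles allocated per 64-bit block (from the text)

print(rounds_per_cycle(TDES_ROUNDS, CYCLE_BUDGET))   # 4
# Full duplex at 1 Gbs is treated as 2 Gbs of aggregate traffic.
print(cycles_per_block_time(533e6, 2e9))
```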
Increasing throughput of an encryption unit has the dual benefit of decreasing the number of encryption units and increasing the maximum throughput of a single encryption/decryption stream.
The system 100 includes a processor 101, a Memory Controller Hub (MCH) 102 and an Input/Output (I/O) Controller Hub (ICH) 104. The MCH 102 includes a memory controller 106 that controls communication between the processor 101 and memory 110. The processor 101 and MCH 102 communicate over a system bus 116.
The processor 101 may be any one of a plurality of processors such as a single core Intel® Pentium IV® processor, a single core Intel Celeron processor, an Intel® XScale processor, or a multi-core processor such as an Intel® Pentium D, Intel® Xeon® processor, or Intel® Core® Duo processor, or any other type of processor.
The memory 110 may be Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate 2 (DDR2) RAM or Rambus Dynamic Random Access Memory (RDRAM) or any other type of memory.
The ICH 104 may be coupled to the MCH 102 using a high speed chip-to-chip interconnect 114 such as Direct Media Interface (DMI). DMI supports 2 Gigabit/second concurrent transfer rates via two unidirectional lanes. The ICH 104 includes a crypto unit 104 which includes functions to perform DES and 3DES symmetric-key ciphers for bulk encryption and decryption. Symmetric ciphers may be used for ensuring privacy of network packets in Virtual Private Network (VPN) gateways and in Transport Layer Security (TLS). The crypto unit may also include functionality for Advanced Encryption Standard (AES), Secure Hash Algorithm (SHA-1) or Hashed Message Authentication Code (HMAC).
The ICH 104 may also include a storage I/O controller 120 for controlling communication with at least one storage device 112 coupled to the ICH 104. The storage device 112 may be, for example, a disk drive, Digital Video Disk (DVD) drive, Compact Disk (CD) drive, Redundant Array of Independent Disks (RAID), tape drive or other storage device. The ICH 104 may communicate with the storage device 112 over a storage protocol interconnect 118 using a serial storage protocol such as Serial Attached Small Computer System Interface (SAS) or Serial Advanced Technology Attachment (SATA).
The DES algorithm as described in Federal Information Processing Standards (FIPS) Publication 46-3 enciphers and deciphers blocks of data consisting of 64 bits under control of a 64-bit key. A 64-bit block to be enciphered is subjected to an initial permutation, then to a complex key-dependent computation using a key schedule generated from the key, and finally to a permutation which is the inverse of the initial permutation. The initial permutation rearranges the bits of the 64-bit block as defined in FIPS Publication 46-3 to produce a permuted input. For example, bit 58 of the 64-bit block is the Most Significant Bit (MSB) of the permuted input, bit 50 of the 64-bit block is the MSB-1 bit and bit 7 of the 64-bit input block is the Least Significant Bit (LSB) of the permuted input. The permuted input is input to the complex key-dependent computation which produces a pre-output block.
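The initial permutation and its inverse can be illustrated with a short sketch. The table is generated from the regular pattern of the FIPS 46-3 initial permutation (each row starts at 58, 60, 62, 64, 57, 59, 61, 63 and steps down by 8) rather than typed out. The function names are hypothetical, and bit 1 is taken to be the leftmost (most significant) bit, following the FIPS numbering.

```python
def build_ip_table():
    """Generate the 64-entry initial permutation table (1-based bit numbers)."""
    row_starts = [58, 60, 62, 64, 57, 59, 61, 63]
    return [start - 8 * col for start in row_starts for col in range(8)]

def invert_table(table):
    """Build the table of the inverse permutation."""
    inverse = [0] * len(table)
    for out_pos, src in enumerate(table, start=1):
        inverse[src - 1] = out_pos
    return inverse

def permute(value, table, in_width):
    """Output bit i (counted from the MSB) is input bit table[i] (1 = MSB)."""
    out = 0
    for src in table:
        out = (out << 1) | ((value >> (in_width - src)) & 1)
    return out

IP = build_ip_table()
IP_INV = invert_table(IP)

block = 0x0123456789ABCDEF
permuted = permute(block, IP, 64)
# Applying the inverse permutation restores the original block.
print(permute(permuted, IP_INV, 64) == block)   # True
```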
The complex key-dependent computation for DES includes sixteen iterations of a cipher function that operates on a 32-bit block and a 48-bit block to produce a 32-bit block. Each iteration may also be referred to as a round. The complex key-dependent computation for 3DES includes 48 rounds.
The inputs to the round are the 64-bit permuted input block, split into a 32-bit Ln block and a 32-bit Rn block, and a 48-bit key Kn+1. The outputs are a 32-bit Ln+1 block and a 32-bit Rn+1 block.
The output block Ln+1 is computed as follows:
Ln+1 = Rn
As shown in
The output block Rn+1 is computed as follows:
Rn+1 = Ln ^ f(Rn, Kn+1)
A composite function “f” 304 is performed on the 32-bit input block Rn and the 48-bit key Kn+1. An Exclusive OR function is performed on the result of the composite function 308 and the 32-bit input block Ln. The output of the Exclusive OR operation 310 is directed on path 310 to the 32-bit output block Rn+1.
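The round structure described above can be sketched as follows. The function `toy_f` is a hypothetical stand-in and is not the DES composite function; it is used only to show that a round of this shape can be undone with the same key regardless of what f computes.

```python
MASK32 = 0xFFFFFFFF

def toy_f(r, k):
    """Hypothetical stand-in for f(R, K); NOT the DES composite function."""
    return ((r * 0x9E3779B1) ^ k ^ (r >> 7)) & MASK32

def feistel_round(l, r, k, f=toy_f):
    """One round: L(n+1) = R(n), R(n+1) = L(n) ^ f(R(n), K(n+1))."""
    return r, l ^ f(r, k)

def feistel_unround(l_next, r_next, k, f=toy_f):
    """Undo one round using the same key and the same f."""
    r = l_next
    l = r_next ^ f(r, k)
    return l, r

keys = [0x1234567890AB, 0x0F0F0F0F0F0F, 0xABCDEF012345]   # arbitrary 48-bit values
l, r = 0x01234567, 0x89ABCDEF
for k in keys:
    l, r = feistel_round(l, r, k)
for k in reversed(keys):
    l, r = feistel_unround(l, r, k)
print(hex(l), hex(r))   # the original halves are recovered
```

Because f is only ever combined with L by an Exclusive OR, f itself does not need to be invertible.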
Thus, the composition function “f” 340 shown in
f = P(sbox(E(R) ^ K))
In order to reduce the amount of logic required to implement the DES algorithm described in FIPS PUB 46-3, the logic required to implement the composition function “f” may be reused multiple times by adding additional state elements and circulating data through the same logic for a plurality of cycles. This requires the addition of a state machine to schedule the key that is used by each cycle and to control the circulation of the data through the associated data-path. In an embodiment, four rounds 314 (
Referring to Table 1, the inputs to Round 1 are 32-bit block L0, 32-bit block R0 and 48-bit key schedule K0. The outputs from Round 1 are 32-bit block R1 and 32-bit block L0 that are computed as discussed earlier in conjunction with
As shown in Table 1, the critical path includes a plurality of Exclusive OR (XOR) operations, with two XOR operations (denoted by the symbol “^”) per round. One XOR operation is performed within the f function “P(sbox(E(R) ^ K[47:0]))” and another XOR operation is performed on the result of the f function and the L data. Thus, the critical path for a cycle in which four rounds are performed includes eight XOR (^) operations, with two XOR operations used to compute each of the four data blocks R1-R4, one data block per round in the four-round cycle. The path that provides the key schedule (K0-K3) is not critical because the key schedule (K0-K3) for the four rounds in the cycle is a fixed value that is stored in memory, with 48 bits of the key schedule used per round.
The cycle for computing a plurality of rounds 500 includes an initial stage 502, a function stage 504 and a final stage 506. The initial stage 502 performs an expansion function E on the 32-bit R input and performs an XOR operation on the 48-bit expanded R input and the 48-bit key schedule. The final stage 506 performs an XOR operation on the result of the L path and the result of the R path to provide a 32-bit R output which is input to the next cycle.
Both the expansion operation (E) and the exclusive OR operation (XOR) are linear functions. The expansion operation distributes over XOR, that is, E(A ^ B) = E(A) ^ E(B), and XOR is associative, that is, (a ^ b) ^ c = a ^ (b ^ c). These properties may be used to decrease the number of XOR operations in the critical path.
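The distributive property can be checked numerically with a small sketch. The expansion table is generated from the regular pattern of the FIPS 46-3 E bit-selection table (rows 32 1 2 3 4 5, 4 5 6 7 8 9, and so on up to 28 29 30 31 32 1); the function name `expand` is hypothetical, and bit 1 is taken to be the most significant bit.

```python
import random

# E bit-selection table, generated from its regular pattern (1-based, 1 = MSB).
E_TABLE = [((4 * row + col - 1) % 32) + 1 for row in range(8) for col in range(6)]

def expand(r32):
    """Expand a 32-bit value to 48 bits by copying bits as listed in E_TABLE."""
    out = 0
    for src in E_TABLE:
        out = (out << 1) | ((r32 >> (32 - src)) & 1)
    return out

rng = random.Random(0)
for _ in range(1000):
    a, b = rng.getrandbits(32), rng.getrandbits(32)
    # E only copies bits, so it distributes over XOR.
    assert expand(a ^ b) == expand(a) ^ expand(b)
print("E(A ^ B) == E(A) ^ E(B) held for 1000 random pairs")
```

The property holds because each output bit of E is a copy of exactly one input bit, so XORing before or after the copy gives the same result.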
These properties are used to perform transformations on a portion of the f function processed by the function stage 500 shown in
Ri_wk = E(Li−1 ^ Ri_i) ^ Ki
Ri_wk = (E(Li−1) ^ E(Ri_i)) ^ Ki
Instead of expanding the result of the XOR operation on the 32-bit L data block and 32-bit R data block, the expansion is performed separately on each of the data blocks. The XOR operation is then performed on the expanded data blocks (L and R).
Ri_wk = (E(Li−1) ^ Ki) ^ E(Ri_i)
An expansion operation to expand the L data block to 48-bits is performed in the non-critical L path. Next, an XOR operation is performed on the expanded L block and the key schedule K in the non-critical L path. The result of the XOR operation is used to perform an XOR operation on the expanded R data block. This results in a reduction of an XOR stage through the critical R path.
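A minimal sketch can confirm that the three forms of the working value are equal. The function and argument names are hypothetical; `l_prev`, `r_i` and `k_i` stand for Li−1, Ri_i and Ki, and the expansion table is generated from the regular pattern of the FIPS 46-3 E table.

```python
import random

E_TABLE = [((4 * row + col - 1) % 32) + 1 for row in range(8) for col in range(6)]

def expand(r32):
    """Expand a 32-bit value to 48 bits (bit 1 = MSB)."""
    out = 0
    for src in E_TABLE:
        out = (out << 1) | ((r32 >> (32 - src)) & 1)
    return out

def form_original(l_prev, r_i, k_i):
    # XOR first, then expand, then add the key: two XORs follow r_i.
    return expand(l_prev ^ r_i) ^ k_i

def form_distributed(l_prev, r_i, k_i):
    # Expansion applied separately to each data block.
    return (expand(l_prev) ^ expand(r_i)) ^ k_i

def form_regrouped(l_prev, r_i, k_i):
    # The L-side term needs only L and K, so it can be ready early.
    l_side = expand(l_prev) ^ k_i
    return l_side ^ expand(r_i)      # a single XOR follows r_i

rng = random.Random(0)
for _ in range(1000):
    l_prev, r_i, k_i = rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(48)
    assert form_original(l_prev, r_i, k_i) == form_distributed(l_prev, r_i, k_i)
    assert form_distributed(l_prev, r_i, k_i) == form_regrouped(l_prev, r_i, k_i)
print("all three forms agree")
```

In the regrouped form the value derived from R passes through one XOR instead of two, which is the saving described above.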
The resulting operations for a 4-round implementation of the DES function that make use of the transformations are shown below in Table 2. As shown, the number of XORs in the critical timing path from “R” to “R4” is reduced from eight to five, that is, there is one XOR per round in the R critical path in each of the four rounds per cycle and one additional XOR per cycle to obtain R4 from R4_i.
The transformations described in conjunction with
An embodiment of the present invention further decreases the number of XOR stages in the critical R-path. In an embodiment, in a four round cycle, the number of XOR stages in the critical path is reduced to four per cycle. In addition, the logic organization is symmetric, which further increases the overall performance of DES and 3DES.
An embodiment of the present invention for a four round cycle removes the overhead of the additional XOR operation per cycle in the final stage 504 shown in
The logic in the initial stage 702 shown in
R_in_w = E(R_in) ^ K0[47:0];
Thus, the input “R” state register (R0_wk) in the initial stage 702 is expanded to 48 bits wide instead of a 32-bit wide state register. Instead of initializing the “R” state register with the R input bits, these bits are expanded to 48 bits and XORed with the initial key (K0) value. As the data loops through the “R” state element (Ri_wk), the “R” state element always contains a pre-computed XOR with the next compression key that will be used. Thus, the “R” state domain (Ri_wk) remains expanded to 48 bits and does not transform back to 32 bits after every 4 rounds. The “R” input to each round cycle other than the initial cycle is L3 ^ R4_i.
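The initialization of the 48-bit “R” state can be sketched as follows. The helper `contract` is a hypothetical addition, not something named in the text: it shows that the original 32 bits remain recoverable from the expanded state, because the middle four bits of every 6-bit group of E(R) are the original bits of R.

```python
E_TABLE = [((4 * row + col - 1) % 32) + 1 for row in range(8) for col in range(6)]

def expand(r32):
    """Expand a 32-bit value to 48 bits (bit 1 = MSB)."""
    out = 0
    for src in E_TABLE:
        out = (out << 1) | ((r32 >> (32 - src)) & 1)
    return out

def init_r_state(r_in, k0):
    """R_in_w = E(R_in) ^ K0[47:0]: the 48-bit pre-keyed state."""
    return expand(r_in) ^ (k0 & 0xFFFFFFFFFFFF)

def contract(w48):
    """Hypothetical helper: keep the middle four bits of each 6-bit group."""
    out = 0
    for group in range(8):
        six = (w48 >> (42 - 6 * group)) & 0x3F
        out = (out << 4) | ((six >> 1) & 0xF)
    return out

r_in, k0 = 0x89ABCDEF, 0x1B2C3D4E5F60      # arbitrary example values
r_in_w = init_r_state(r_in, k0)
# Removing the key and contracting yields the original 32-bit input.
print(contract(r_in_w ^ k0) == r_in)        # True
```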
In an initial stage in a multi-stage (round) cycle for the non-critical L-path, the 32-bit L0 input to the cycle is expanded to a 48-bit L input and the 48-bit L input is input to an XOR stage where an XOR operation is performed on the 48-bit L input and the key schedule to produce a 48-bit L_in_w input.
The logic in the initial stage may be represented by the following pseudo code:
L_in_w = E(L_in) ^ K0[47:0];
Thus, the input “L” state register (L0_wk) in the initial stage is expanded to 48 bits wide instead of a 32-bit wide state register. Instead of initializing the “L” state register with the L input bits, these bits are expanded to 48-bits and XORed with the initial key (K0) value. As the data loops through the “L” state element (Li_wk), the “L” state element always contains a pre-computed XOR with the next compression key that will be used. Thus, the “L” state domain (Li_wk) remains expanded to 48-bits and does not transform back to 32-bits after every 4 rounds.
Pre-computing the initial XOR value into the “R” state element allows one XOR to be removed from the key DES critical path, that is, the R path. In addition, expanding the “R” state element width to 48 bits increases the symmetry of the data path and allows for a higher performance implementation of the initial “sbox” lookup function. In contrast, in the composition function shown in
For higher speed implementations with fewer rounds per cycle, the overall effect of this performance increase will be even more pronounced because one XOR is saved on the critical path regardless of the number of rounds completed per cycle. For example, for a 2 round hardware implementation, the number of XORs in the critical path is reduced from 3 to 2.
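The XOR counts quoted in the text can be summarized in a small sketch. The general formulas (2N, N+1 and N for N rounds per cycle) are an extrapolation from the four-round and two-round figures given in the text, and the function name and dictionary keys are hypothetical.

```python
def critical_path_xors(rounds_per_cycle):
    """XOR stages on the critical R path per cycle, for three data-path forms."""
    n = rounds_per_cycle
    return {
        "baseline": 2 * n,        # two XORs per round
        "transformed": n + 1,     # one XOR per round plus one per cycle
        "precomputed": n,         # one XOR per round, key folded into the state
    }

for n in (4, 2, 1):
    counts = critical_path_xors(n)
    saved = counts["transformed"] - counts["precomputed"]
    print(n, counts, f"relative saving {saved / counts['transformed']:.0%}")
```

The saving is always one XOR per cycle, so its relative share grows as the number of rounds per cycle shrinks.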
The inter-cycle logic 920 includes a multiplexer 906 and R-state register 908 for the critical R-path and a multiplexer 904 and L-state register 906 for the non-critical L path. In the R-path, prior to the initial cycle, the multiplexer 906 allows the initial R state R0_wk through to the R_state register 908 as discussed in conjunction with FIG. A. In the L-path, prior to the initial cycle, the multiplexer 904 allows the initial L state L0_wk through to the L-state register 906.
Pseudo code for the data path logic for n rounds, with four 32-bit input blocks R0-R3, L0-L3 and four 48-bit key schedules (K0-K3) to generate four 32-bit output blocks R1-R4 and L1-L4, is shown below in Table 4.
At block 1000, a 48-bit input R state element (R_in_w) is initialized with an expanded input R (R_in) vector that has been expanded from 32-bits to 48-bits XORed with the 48-bit initial key value (K0[47:0]) as shown below and discussed in conjunction with
R_in_w = E(R_in) ^ K0[47:0];
Processing continues with block 1002.
At block 1002, the state elements operate on the 48-bit R state element, the 48-bit key values and the 32-bit L values to provide a 48-bit R value and a 32-bit L value per round for each of N rounds as shown in Table 3 and discussed in conjunction with
At block 1004, if there are another N rounds to be computed for DES or 3DES, processing continues with block 1002 to compute the next N rounds. If not, processing is complete, with the 32-bit R for the last round of DES/3DES output from ? and the 32-bit L for the last round of DES/3DES computed as shown in Table 4.
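The flow through blocks 1000, 1002 and 1004 can be illustrated with a sketch that compares a conventional round loop against a loop that keeps a 48-bit pre-keyed R state. This is a reconstruction under stated assumptions, not the tables referenced in the text: `toy_sbox` and `toy_p` are hypothetical stand-ins (not the FIPS 46-3 S-boxes or P table), `contract` is a hypothetical helper, and the keys are arbitrary 48-bit values rather than a real key schedule.

```python
import random

MASK32 = 0xFFFFFFFF
E_TABLE = [((4 * row + col - 1) % 32) + 1 for row in range(8) for col in range(6)]

def expand(r32):
    out = 0
    for src in E_TABLE:
        out = (out << 1) | ((r32 >> (32 - src)) & 1)
    return out

def contract(w48):
    """Keep the middle four bits of each 6-bit group (undoes expand)."""
    out = 0
    for g in range(8):
        out = (out << 4) | (((w48 >> (42 - 6 * g)) & 0x3F) >> 1 & 0xF)
    return out

def toy_sbox(w48):
    """Hypothetical 48-to-32-bit substitution; NOT the DES S-boxes."""
    out = 0
    for g in range(8):
        six = (w48 >> (42 - 6 * g)) & 0x3F
        out = (out << 4) | (((six * 7 + 3) ^ (six >> 2)) & 0xF)
    return out

def toy_p(x32):
    """Hypothetical bit permutation (rotate left by 5); NOT the DES P table."""
    return ((x32 << 5) | (x32 >> 27)) & MASK32

def reference(l, r, keys):
    """Conventional loop: two XORs follow R in every round."""
    for k in keys:
        l, r = r, l ^ toy_p(toy_sbox(expand(r) ^ k))
    return l, r

def precomputed(l, r, keys, n=4):
    """Loop that keeps W = E(R) ^ K as a 48-bit state."""
    w = expand(r) ^ keys[0]                       # block 1000: initialize state
    done, r_final = 0, r
    while done < len(keys):                       # block 1004: more rounds?
        for i in range(done, min(done + n, len(keys))):   # block 1002: N rounds
            f_out = toy_p(toy_sbox(w))            # key is already folded into w
            r_prev = contract(w ^ keys[i])        # R of this round, off the R path
            if i + 1 < len(keys):
                # (E(L) ^ next key) depends only on L and K; one XOR follows f_out.
                w = (expand(l) ^ keys[i + 1]) ^ expand(f_out)
            else:
                r_final = l ^ f_out               # last round: leave the 48-bit domain
            l = r_prev
        done += n
    return l, r_final

rng = random.Random(1)
l0, r0 = rng.getrandbits(32), rng.getrandbits(32)
keys = [rng.getrandbits(48) for _ in range(16)]
print(precomputed(l0, r0, keys) == reference(l0, r0, keys))   # True
```

The two loops agree because E distributes over XOR, so E(L ^ f) ^ K can be regrouped as (E(L) ^ K) ^ E(f), which is the same identity used for the transformations described earlier.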
It will be apparent to those of ordinary skill in the art that methods involved in embodiments of the present invention may be embodied in a computer program product that includes a computer usable medium. For example, such a computer usable medium may consist of a read only memory device, such as a Compact Disk Read Only Memory (CD ROM) disk or conventional ROM devices, or a computer diskette, having a computer readable program code stored thereon.
While embodiments of the invention have been particularly shown and described with references to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of embodiments of the invention encompassed by the appended claims.