Embodiments of the present invention may relate to processing encrypted code and data in a secure manner in a processor.
With frequent reports of the hacking of personal information belonging to millions of customers and clients of corporations and governments, data and computer security have become significant issues in computing. Organizations such as the Trusted Computing Group have created a number of standards for secure authentication to decrypt privileged information. HDMI and other standard communication protocols have methods for encrypting and decrypting privileged data as well, but all of these solutions deal with encrypting transmitted or stored data, not the code and data actually used within the computer, which has left a gap where hackers may be able to get access to decrypted information within the computing systems themselves. Goto et al., in U.S. Pat. No. 7,865,733 granted Jan. 4, 2011, suggest securely decrypting data received from any external memory into the processor chip, and encrypting any data being sent off of the processor chip to external memory, and Buer, in U.S. Pat. No. 7,734,932 granted Jun. 8, 2010, suggests a solution that leaves the data and instructions encrypted in main memory, decrypting them when fetched into cache. Furthermore, while Hall, in U.S. Pat. No. 7,657,756 granted Feb. 2, 2010, suggests storing the metadata for decryption in cache, it is stored along with the decrypted data. These may address the problem for single-threaded, single processors residing, with their own cache, on secure integrated circuits (ICs), but that is not the state of all computing today, e.g., cloud computing. Most of today's servers contain multiple ICs, each with multiple processors and multiple levels of shared cache, processing potentially different applications on virtual machines in the same chip. In this environment, one application may snoop another application within the same chip, and may do so well enough to hack it.
Convolution encrypting the source code, while helpful, may still be decrypted by detecting the frequency of the instruction codes. Other techniques, such as decrypting the instruction by applying the XOR of the encrypted instruction and one or more fixed keys, as described by Henry et al., in US Patent Application Publication No. 2011/0296204, published Dec. 1, 2011, are only as good as the keys. A sufficiently robust encryption technique may be needed to be adequately tamper proof. Such a technique should be sufficiently random to be difficult to break. Butler, in U.S. Pat. No. 7,412,468 granted Aug. 12, 2008, suggested using a Multiple Input Shift Register (MISR), also known as a linear feedback shift register (LFSR), for both built-in self test (BIST) and the generation of random keys for encryption, which he suggested may be applied to external messages using a form of Rivest-Shamir-Adleman (RSA) encryption. Unfortunately, Butler's encryption approach may require too much computational overhead for encoding and decoding a processor's instructions and data, as may other software-based encryption techniques, such as that described by Horovitz et al., in US Patent Application Publication No. 2013/0067245, published Mar. 14, 2013; and while Henry et al., in US Patent Application Publication No. 2012/0096282, published Apr. 19, 2012, suggest using XOR operations to decrypt in the “same time” as not decrypting, they still require additional instructions to switch their encryption keys. Therefore, in order to provide an adequate tamper-proofing mechanism for cloud computing, in multi-processor systems with shared cache memory systems, it may be desirable to employ a pseudo-random key based technique for transparently encoding and decoding instructions and data with minimal overhead, within individual processors, such that protected applications and data may remain encrypted in shared memory spaces.
Various embodiments of the invention may relate to hardware and software encryption and decryption techniques using pseudo-random numbers generated by LFSRs for use in both testing a processor's hardware and protecting a processor's data and instructions in shared memory by encrypting and decrypting the data and instructions between the processor and all shared memory.
In one embodiment, a multi-processor system, e.g., on an IC, may have two or more processors, where each processor may include an instruction unit for processing instructions, an execution unit for operating on data, at least one cache memory, at least one interface to a system bus, logic for translating instructions accessed by the instruction unit from a cache memory, and logic between the execution unit and a cache for translating data, where the translating may use pseudo-random numbers. The logic for translating instructions may include logic for decrypting encoded instructions, and the logic for translating data may include logic for decrypting data being accessed by the execution unit and logic for encrypting data being written to the cache. The logic for translating instructions may include an LFSR, and the logic for translating data may include code transformation logic. The logic for translating data may also include logic for selectively encrypting data written to the system bus. The cache memories may include an instruction cache and a data cache. The logic for translating instructions may access the instruction cache, and the logic for translating data may access the data cache.
In another embodiment, a multi-processor system, e.g., on an IC, may have two or more processors, where each processor may include an instruction unit for processing instructions, an execution unit for operating on data, at least one cache memory, at least one interface to a system bus, and logic for translating data and instructions transferred between the system bus and a cache memory. The logic for translating instructions may include logic for decrypting encoded instructions, and the logic for translating data may include logic for decrypting data being accessed by the execution unit and logic for encrypting data being written to the cache.
In another embodiment, a method for encrypting a program's instructions and data may include the steps of:
In another embodiment, debugging unencrypted applications may be performed without recompiling the application or altering the encrypted application's cycle-by-cycle operation, e.g., by using zero translation codes.
In yet another embodiment, instructions to generate the data for transform mask registers, which define the programming of code transformation logic from an LFSR's mask register bits, may be encrypted, appended in front of the encoded application, and may be executed following the loading of the LFSR mask register and initial translation code. The transform mask registers' data may be generated by:
Finally, in another embodiment, an LFSR, code transformation logic, and checksum logic may be used to generate random instructions and data to test the processor prior to normal operation.
Various embodiments of the invention will now be described in connection with the attached drawings, in which:
a and 1b are conceptual diagrams of example multi-processor systems with encryption and decryption translators,
a and 2b are diagrams of examples of instruction decryption using an LFSR in conjunction with instructions,
a, 3b, and 3c are diagrams of examples of data encryption and decryption using offset translated codes associated with base addresses of the data,
a is a high-level diagram of code translation based on the example LFSR,
b is a detailed diagram of the code transformation logic example,
Embodiments of the present invention are now described with reference to
Reference is made to
The architecture shown in
Returning to the system architecture shown in
Reference is now made to
As mentioned above, some encrypted instructions may include translation codes, which may also be encrypted, thereby securing all but the initial translation code for the first instruction. In this manner, the proper translation code for each instruction may be easily obtained in one clock cycle by either loading or clocking the LFSR.
Unfortunately, in a random access memory, data may not be accessed sequentially, thereby requiring a way to directly calculate the proper translation code from the address of the data. So, with regard to the system architecture shown in
In addition to branches, instructions for loading a base register's address may also load the base register's associated code register. In a similar manner, a subroutine call may store the translation code associated with the instruction in the instruction LFSR after saving the prior contents of the LFSR in the code register associated with the base register where it stores its return address. Similarly, a return instruction may branch to the address in the base register while loading the contents of the corresponding code register into the instruction LFSR.
Initial encryption of the instructions in a program and data space may be performed after compilation and before the final load module creation, e.g., by: creating an initial translation code; incrementing the LFSR function to obtain the translation code for each instruction; defining a translation code for each data space; incrementing the LFSR function to obtain the translation code for each predefined data element; appending to selected instructions the translation code corresponding to the value in the address field of those instructions; and encoding each instruction, data and appended translation code with the translation code associated with its address. The instructions requiring appended translation codes may include instructions invoking addresses in base registers, branches and/or subroutine calls.
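The encoding steps above can be illustrated with a minimal Python sketch. It assumes a toy four-bit LFSR, assumes that a word is encoded by XORing it with the current translation code, and uses hypothetical names (`lfsr_step`, `to_int`, `translate`); none of these details is fixed by the description, and appended translation codes and separate data spaces are omitted for brevity.

```python
def lfsr_step(state):
    """Advance a toy 4-bit LFSR by one clock; state is a tuple of bits (a, b, c, d)."""
    a, b, c, d = state
    return (d, a ^ d, b, c)

def to_int(state):
    """Pack the four state bits into an integer translation code."""
    a, b, c, d = state
    return (a << 3) | (b << 2) | (c << 1) | d

def translate(words, initial_code):
    """XOR each word with the translation code for its position.

    The same function encodes and decodes, since XOR is its own inverse.
    """
    state = initial_code
    out = []
    for word in words:
        out.append(word ^ to_int(state))
        state = lfsr_step(state)  # next instruction uses the next code
    return out

program = [0x3, 0xA, 0x7, 0x7, 0x0]   # hypothetical 4-bit "instructions"
initial_code = (1, 0, 0, 0)           # hypothetical initial translation code

encrypted = translate(program, initial_code)
decrypted = translate(encrypted, initial_code)
print(encrypted, decrypted)
```

Because the translation code changes with every clock of the LFSR, the two identical words in `program` encode to different values, and applying the same function with the same initial code restores the original words.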
Reference is now made to
In another embodiment, the contents of the L1 cache may be decrypted in the system architecture depicted in
Furthermore, it is contemplated that an LFSR, starting with a translation code for the first instruction or word of data in a cache line buffer, may be clocked and applied to each subsequent instruction or word of data being read from or written into the cache line buffer. If the read or write is out of order, the translation code may be adjusted by a single transformation function that may “subtract” the buffer size from the LFSR when the data or instructions wrap around the line buffer. Given an LFSR function with M unique values before repeating, “subtraction” of N is equivalent to a transformation function of M−N, where M>N.
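The wrap-around adjustment can be checked with a small Python sketch. It assumes a toy four-bit LFSR with M = 15 non-zero states and a hypothetical line buffer of N = 4 words; the names `lfsr_step` and `advance` are illustrative only. In hardware the M−N adjustment would be a single transformation function rather than the loop used here.

```python
def lfsr_step(state):
    """Advance a toy 4-bit LFSR by one clock; state is a tuple of bits (a, b, c, d)."""
    a, b, c, d = state
    return (d, a ^ d, b, c)

def advance(state, count):
    """Clock the LFSR `count` times (a loop stands in for transformation logic)."""
    for _ in range(count):
        state = lfsr_step(state)
    return state

M = 15  # number of unique non-zero values of this 4-bit LFSR
N = 4   # assumed number of words in the cache line buffer

first_word_code = (1, 0, 0, 0)                  # hypothetical code for word 0
past_end_code = advance(first_word_code, N)     # code after clocking past the last word
wrapped_code = advance(past_end_code, M - N)    # "subtract" N by advancing M - N

print(wrapped_code == first_word_code)
```

Advancing by M−N from the code just past the end of the buffer lands back on the code for the first word, which is the effect the “subtraction” is meant to achieve.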
A simple four-bit example may be used to clarify the structure and functional operation of both an LFSR and its associated code transformation logic. Reference is now made to
Reference is now made to
Reference is now made to
Given that the inputs for Jx are a0, b0, c0 and d0 and the outputs are ax, bx, cx and dx, letting the symbol “<-” represent assignment of the expression of inputs on the right to the output on the left, and letting “+” represent an exclusive-OR operation, and given:
J1 is a1<-d0, b1<-(a0+d0), c1<-b0, d1<-c0;
J2 is a2<-c0, b2<-(c0+d0), c2<-(a0+d0), d2<-b0;
J4 is a4<-(a0+d0), b4<-((a0+d0)+b0), c4<-(b0+c0), d4<-(c0+d0); and
J8 is a8<-(a0+c0), b8<-((b0+d0)+c0), c8<-((a0+c0)+d0), d8<-(b0+d0); then
J15 is a15<-a0, b15<-b0, c15<-c0, d15<-d0; so C15=C0
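The four transformation functions listed above can be written out directly and checked against repeated single clocking. The following Python sketch transcribes J1, J2, J4 and J8 as given; the helper names (`transform`, `clocked`) and the idea of selecting functions by the bits of an offset are illustrative assumptions.

```python
def j1(s):
    a, b, c, d = s
    return (d, a ^ d, b, c)

def j2(s):
    a, b, c, d = s
    return (c, c ^ d, a ^ d, b)

def j4(s):
    a, b, c, d = s
    return (a ^ d, a ^ d ^ b, b ^ c, c ^ d)

def j8(s):
    a, b, c, d = s
    return (a ^ c, b ^ d ^ c, a ^ c ^ d, b ^ d)

def transform(code, offset):
    """Apply J1, J2, J4, J8 according to the set bits of a 4-bit offset."""
    for bit, fn in enumerate((j1, j2, j4, j8)):
        if (offset >> bit) & 1:
            code = fn(code)
    return code

def clocked(code, count):
    """Reference result: apply the single-shift function J1 `count` times."""
    for _ in range(count):
        code = j1(code)
    return code

c0 = (1, 0, 1, 1)  # arbitrary example code
print(transform(c0, 5) == clocked(c0, 5))
print(transform(c0, 15) == c0)  # all four functions together give J15, i.e. C15 = C0
```

An offset of 5 selects J1 and J4, which together have the same effect as five single clocks, so at most four function stages are needed for any four-bit offset.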
In the case where the offset may be larger than the size of the non-repeating numeric sequence of the LFSR, it may be possible to reduce the logic of the higher order transformation functions. Reference is now made to
While the above techniques may provide reasonably strong encryption when using large LFSRs, the encryption may be weaker for smaller LFSRs. One solution may be to expand the number of potential repeating sequences by making the LFSRs and code transformation logic programmable. Reference is now made to
Therefore, in another embodiment, the LFSRs and all decryption may be disabled by loading translation codes of zero. This may be performed, e.g., when exiting an encrypted application.
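The effect of a zero translation code can be seen in a short Python sketch. It assumes the four-bit example LFSR and assumes that translation is an XOR of each word with the current code; `lfsr_step`, `to_int` and `translate` are hypothetical names.

```python
def lfsr_step(state):
    """Advance the 4-bit example LFSR by one clock; state is a tuple of bits (a, b, c, d)."""
    a, b, c, d = state
    return (d, a ^ d, b, c)

def to_int(state):
    """Pack the four state bits into an integer translation code."""
    a, b, c, d = state
    return (a << 3) | (b << 2) | (c << 1) | d

def translate(words, code):
    """XOR each word with the translation code for its position."""
    out = []
    for word in words:
        out.append(word ^ to_int(code))
        code = lfsr_step(code)
    return out

ZERO = (0, 0, 0, 0)
plain = [0x3, 0xA, 0x7]               # hypothetical unencrypted words

stays_zero = lfsr_step(ZERO) == ZERO  # an all-zero LFSR never leaves zero
passthrough = translate(plain, ZERO)  # XOR with zero leaves every word unchanged
print(stays_zero, passthrough == plain)
```

Under these assumptions the translation logic stays in the data path and still takes the same number of cycles, but it has no effect on the words passing through it.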
Reference is now made to
The actual number of unique bits required to program the code transformation logic may be much less than N³. First, it should be noted that the transform mask register bits for the first transformation function, when viewed as an N×N matrix, may be generated by rotating an identity matrix down one row after ORing the N−1 LFSR mask register bits into the first bits of the last column, in a manner that properly simulates one clock shift of the associated LFSR. The second transformation function's matrix may be generated by multiplying modulo 2 the first transformation function's matrix by itself, the third transformation function's matrix may be generated by multiplying modulo 2 the second matrix by itself, and each successive transformation function's matrix may be generated from the matrix of the previous transformation function in the same manner. As such, the N³ programming bits of a programmable code transformation function may be generated from as few as N−1 programmable bits, or may, with appropriate logic, only require N−1 programmable bits.
Assuming a programmable version of the LFSR in
Given the 3 LFSR bits are [1 0 0], the single shift matrix [J1] may be:
The matrix for two shifts [J2] may be:
The matrix for four shifts [J4] may be:
And the matrix for eight shifts [J8] may be:
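The matrix construction described above can be sketched in Python for the mask bits [1 0 0]. The sketch assumes that row r of a matrix lists which inputs are XORed to form output r, and the function names (`shift_matrix`, `matmul_mod2`) are hypothetical; it is an illustration of the stated rule, not a reproduction of the figures.

```python
def shift_matrix(mask_bits):
    """Build the single-shift matrix from the N-1 LFSR mask register bits.

    OR the mask bits into the first N-1 entries of the identity matrix's
    last column, then rotate the rows down by one.
    """
    n = len(mask_bits) + 1
    m = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    for r, bit in enumerate(mask_bits):
        m[r][n - 1] |= bit
    return [m[-1]] + m[:-1]  # last row moves to the top

def matmul_mod2(x, y):
    """Multiply two square bit matrices modulo 2."""
    n = len(x)
    return [[sum(x[r][k] & y[k][c] for k in range(n)) % 2 for c in range(n)]
            for r in range(n)]

J1 = shift_matrix([1, 0, 0])
J2 = matmul_mod2(J1, J1)
J4 = matmul_mod2(J2, J2)
J8 = matmul_mod2(J4, J4)

for name, matrix in (("J1", J1), ("J2", J2), ("J4", J4), ("J8", J8)):
    print(name)
    for row in matrix:
        print(" ", row)
```

With this row convention, the rows of `J1` read d0, a0+d0, b0, c0, which agrees with the J1 function of the four-bit example, and each squaring doubles the number of shifts represented.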
It is further contemplated that the LFSR mask register bits needed for programming the LFSR may not be the bits used to program the transformation functions, thereby providing different encryption algorithms for the instruction and data. Such additional mask register bits may also be included with the initial translation code.
It is also contemplated that the mask register bits may be encrypted with the initial translation code, and prior to executing the encrypted program, the mask register data may be decrypted by loading the initial translation code into the LFSR, using the initial translation code to decode the mask register data without clocking the LFSR, and then loading the LFSR's decrypted mask register data.
It is also contemplated that instructions to generate the data for the transform mask registers from the LFSR's mask register bits may be encrypted, appended in front of the encoded application, and may be executed following the loading of the LFSR mask register and initial translation code. It should be noted that this code may not address data memory, which may require the use of the yet-to-be-programmed code transformation logic. As such, all transform mask registers may be directly addressable by instructions, and all generation of the transform mask register data may be done in situ, thereby avoiding use of addressed data memory.
Furthermore, it is contemplated that the processor's legal instruction codes may be a small fraction of the possible values in the opcode field of an instruction. Upon incorrect decryption, the execution of an illegal instruction may cause an operating system interrupt, thereby allowing the operating system to detect instruction tampering. Similarly, by maintaining legal memory space or spaces that are small relative to the full address space, illegal addresses may also cause operating system interrupts, thereby allowing the operating system to detect data tampering.
Small examples, such as those above, may be useful for illustrating the detailed logic, but in current, more realistic multi-processor environments, a practical example may be a 32-bit RISC processor with 20-bit offset address fields in the instructions and multiple levels of cache. In this example, the instructions and data may remain encrypted within their respective caches, the LFSR may be 32 bits long, and the LFSR mask register may be 31 bits long, both manageable sizes of separately encrypted initial codes. Once loaded, the longest path between flip-flops on the programmable LFSR may be an AND gate followed by an XOR gate, and loading the LFSR may also only take one clock cycle; hence, the decryption of the instructions may easily occur during the instruction unit's fetch cycle. For branch look-ahead techniques or intermediate loop instruction storage, the proper decrypted translation codes for each stream may be stored with the branch predictions or loop instructions.
The data code transformation logic may be much larger. The offset address field may contain a 20-bit offset, which may result in 20 transformation functions, each of which may have 32 bits of 32 AND gates masking the input signals to a 6-level tree of 31 XOR gates. Each of the 20 transformation functions may then contain eight levels of logic (1 AND, 6 XORs and 1 multiplexor), for a total of 1,024 AND gates, 992 XOR gates, 32 multiplexors, and 32 32-bit transform mask registers. The worst-case path in such a structure may be up to 160 gate levels long. This may be reduced where the terms are not needed, but the result may still require many clock cycles. Still, the time needed to calculate the proper cache line translation code may overlap with the time required to process a cache line miss request to either an L2 cache or main memory, which also may take many clock cycles. Upon receiving the externally requested cache line, the translation code may be stored in the L1 data cache with the encrypted cache line. Upon a subsequent cache hit, the translation code may be retrieved to decrypt the data retrieved from the cache or to encrypt the data written to the cache, as shown in
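The gate counts and path length quoted above follow from simple arithmetic, reproduced here as a Python sketch. The variable names are illustrative, and the per-function level count is taken as stated in the text (1 AND, 6 XORs and 1 multiplexor) rather than derived.

```python
WORD_BITS = 32    # width of a translation code
OFFSET_BITS = 20  # one transformation function per offset bit

# Per transformation function:
and_gates = WORD_BITS * WORD_BITS        # 32 outputs, each masking 32 inputs
xor_gates = WORD_BITS * (WORD_BITS - 1)  # 31 XOR gates combine 32 inputs, per output
multiplexors = WORD_BITS                 # one per output bit
levels = 1 + 6 + 1                       # AND, XOR levels, multiplexor (as stated)

# Across the chain of all transformation functions:
worst_case_levels = OFFSET_BITS * levels

print(and_gates, xor_gates, multiplexors, worst_case_levels)
```

The worst case arises when every bit of the offset is set, so that the code passes through all 20 functions in series.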
In yet another embodiment of the present invention, the mask register and code transformation logic may be reduced by limiting the programming to a subset of the bits.
In another embodiment, debugging of applications may be performed without recompiling the application or altering its cycle-by-cycle operation. Unencrypted applications may also be modified before the final load module creation, e.g., by creating a zero initial translation code and appending to the selected instructions a zero translation code. Execution of the unencrypted application may then be performed with all the available transparent debug facilities as may exist in the processor, and with the translation logic enabled. Furthermore, the unencrypted code may then perform in the same cycle-by-cycle manner as the encrypted code. Similarly, when subsequently encrypting the application, or re-encrypting the application, its size and cycle-by-cycle operation may not change.
In another embodiment, the LFSR, code transformation logic, and checksum logic may be used to generate random instructions and data to test the processor prior to normal operation. Reference is now made to
The control signals may include interrupt signals, instruction addresses, and/or other signals generated by the execution of the test and captured by the checksum prior to being disabled. Alternatively, some amount of encoded instructions may be loaded into the I-cache, and encoded data into the D-cache to perform partial or full diagnostic tests. In this manner, the LFSR, transformation logic and checksums may be used to perform processor BIST or to aid in processor diagnostic tests.
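A highly simplified Python sketch of the self-test idea follows. It assumes a toy four-bit LFSR as the pattern source, a hypothetical 16-bit rotate-and-XOR checksum, and a stand-in `unit` function in place of the processor; the actual checksum logic and captured signals are not specified here, so every name below is an assumption.

```python
def lfsr_step(state):
    """Advance a toy 4-bit LFSR by one clock; state is a tuple of bits (a, b, c, d)."""
    a, b, c, d = state
    return (d, a ^ d, b, c)

def to_int(state):
    """Pack the four state bits into an integer test pattern."""
    a, b, c, d = state
    return (a << 3) | (b << 2) | (c << 1) | d

def checksum_update(signature, value):
    """Hypothetical checksum: rotate the 16-bit signature left by one, then XOR in the value."""
    signature = ((signature << 1) | (signature >> 15)) & 0xFFFF
    return signature ^ (value & 0xFFFF)

def run_bist(unit, seed, cycles):
    """Feed LFSR-generated patterns to `unit` and fold its outputs into a signature."""
    state, signature = seed, 0
    for _ in range(cycles):
        signature = checksum_update(signature, unit(to_int(state)))
        state = lfsr_step(state)
    return signature

def good_unit(x):
    """Stand-in for correctly working logic."""
    return (x + 3) & 0xF

def faulty_unit(x):
    """Same logic with output bit 0 stuck at 1."""
    return ((x + 3) & 0xF) | 0x1

SEED = (1, 0, 0, 0)
golden = run_bist(good_unit, SEED, 15)        # expected signature from a known-good model
passed = run_bist(good_unit, SEED, 15) == golden
fault_detected = run_bist(faulty_unit, SEED, 15) != golden
print(passed, fault_detected)
```

Only the seed and the final signature need to be stored for such a test, since the patterns themselves are regenerated by the LFSR on every run.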
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the present invention includes both combinations and sub-combinations of various features described hereinabove as well as modifications and variations which would occur to persons skilled in the art upon reading the foregoing description and which are not in the prior art.
Number | Name | Date | Kind |
---|---|---|---|
7412468 | Butler | Aug 2008 | B2 |
7657756 | Hall | Feb 2010 | B2 |
7734932 | Buer | Jun 2010 | B2 |
7865733 | Goto et al. | Jan 2011 | B2 |
20020107903 | Richter | Aug 2002 | A1 |
20050105738 | Hashimoto | May 2005 | A1 |
20090282178 | Kailas | Nov 2009 | A1 |
20090293130 | Henry et al. | Nov 2009 | A1 |
20090319673 | Peters | Dec 2009 | A1 |
20100192014 | Mejdrich et al. | Jul 2010 | A1 |
20110296204 | Henry et al. | Dec 2011 | A1 |
20120079281 | Lowenstein et al. | Mar 2012 | A1 |
20120096282 | Henry et al. | Apr 2012 | A1 |
20130067245 | Horovitz et al. | Mar 2013 | A1 |
Entry |
---|
Int'l Search Report and Written Opinion issued Nov. 13, 2014 in Int'l Application No. PCT/US2014/031396. |
Number | Date | Country | |
---|---|---|---|
20140317419 A1 | Oct 2014 | US |