This disclosure describes a way to mask (hide or obfuscate) computer data and code against reverse engineering attacks.
Nowadays, there are more and more uses of computer software applications (programs) where the user may also be interested in learning information about the software code which is executed on his computing device. This is, for instance, the case with DRM (Digital Rights Management) applications, used to protect songs, games, applications, or any digital content, against fraud (such as illegal copying, spreading content over P2P networks, etc.). Such a user could be interested in (illegally) trying to copy songs, applications or games in order to redistribute them.
Of course, the role of the content distributor or owner is to protect the content against such malicious users (“hackers”). Various known means are used to achieve this goal. One is code obfuscation. Code obfuscation is a well known technique where source code in a computer programming language is made difficult to understand.
As explained above, there is a known need for code obfuscation to protect content against illegal or unauthorized uses. The present inventors have recognized that tables of data in object (compiled) code (also called arrays in the code) also have to be protected.
The first need is to protect the data content itself, especially for those data tables containing critical information. This is mostly achieved currently by the use of masks and various other processes. Just as an illustrative example, consider a table with data. Instead of storing the data as is, it is stored using a masking function. In a first known solution, the associated unmasking process is only done when a given variable of the table has to be used. In a second known solution, the data is used as masked/protected but the process is adapted to include this kind of mask.
In all cases, retrieving the original data requires reversing the process. Performing a complementary dynamic analysis sometimes speeds this up. This means accessing the code at run time (execution time), since the unmasking operation is only done upon execution.
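The first known solution described above (data stored masked and unmasked only when used) can be sketched in C as follows. This is only an illustration under stated assumptions: the disclosure does not specify the masking function, so an exclusive-OR with a constant is assumed here, and the names MASK, maskedTable and read_unmasked are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MASK 0x5Au  /* hypothetical constant mask (an assumption) */

/* Table stored in masked form: each byte is the clear value XOR MASK.
   The clear values are 1, 2, 3 and 4. */
static const unsigned char maskedTable[4] = {
    0x01u ^ MASK, 0x02u ^ MASK, 0x03u ^ MASK, 0x04u ^ MASK
};

/* Unmask a single entry only at the moment it is used. */
static unsigned char read_unmasked(size_t i)
{
    assert(i < sizeof maskedTable);
    return (unsigned char)(maskedTable[i] ^ MASK);
}
```

In such a scheme the clear values never appear in the stored table, but their positions in the table are unchanged, which is the point addressed by the second need discussed next.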
The second need, and this is a goal of this disclosure, is to protect the table accesses themselves. When a table of data is involved in a process, this protects the data contained inside this table, and also the place/position in the table of each data item or element. This approach can also be combined with and is not exclusive to methods involving the masking of the data itself.
Hence contemplated in accordance with the invention is the method for obfuscating the code, the associated computer software tool that does the obfuscating and is embodied in a computer program product stored in computer memory, the associated programmed computing apparatus that does the obfuscating, and the resulting obfuscated code embodied in a computer program product stored in computer readable memory. Also contemplated is the inverse method of executing the obfuscated code so as to de-obfuscate it, the associated programmed computing apparatus, and the resulting de-obfuscated code embodied in a computer program product stored in a computer readable memory.
Consider a table of data designated T having a number designated tLen of entries of any data type. There are various ways to access the table. The first and simplest is to use the table as T directly. This operation is specified in the C computer language as Table[position], which accesses the entry at the given position. The second one is to define a memory pointer at a given position of T (address of T). Let ptrT designate this memory pointer. In a non-protected implementation, the x+1 element (entry) of T is accessed by reading T[x] (which is equivalent to reading the entry at address ptrT+x, written *(ptrT+x) in C).
In accordance with the invention, an intermediate table is provided which stores the positions of the data elements which need to be used for each data table. This is the principle of modified indexes. Consider the previously described data table T, and denote f as the function used to “shuffle” (transform) the accesses/positions of the table elements. This function has to be invertible and is used to create a bijection between two groups of tLen elements (where, as stated above, tLen is the number of elements in the table). Denote fInv as the inverse function of function f. Table T is transformed (shuffled) into fT with the use of the pre-defined f function.
Accessing the element numbered a+1 inside the data table T is done as (again expressed logically):
b=T[a]
(where b is not the (a+1) element if table T is shuffled.)
But in accordance with the invention, this operation is replaced by an access to the shuffled table fT at the transformed index:
b=fT[fInv(a)]
As seen, there is no particular difficulty in carrying this out in computer software and the resulting change inside the source code is minimal. The advantage is the protection of the data element against a static analysis by an attacker. When the present pointer principle is used in a sub-function where T is defined (using ptrT instead of T directly), this attack technique is much harder because the starting position (address in memory) of the data table is unknown.
The length of the table is unknown in the sub-function and the starting point (address) is also not known. This is however necessary to access the table since the pointer just gives an address in the (logical) memory. Therefore the present approach masks a data table very efficiently and is able to manage pointers.
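The index substitution described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the output of the actual tool: the names T_LEN, f, f_inv, shuffle_table and read_shuffled are hypothetical, f is taken to be multiplication by 3 modulo 10 (one possible bijection), and the shuffled table is assumed to be built as fT[i]=T[f(i)], so that the original element T[a] is read through fInv(a). If the shuffled table were instead built with the element of index i stored at position f(i), the read would use f in place of fInv.

```c
#include <assert.h>
#include <stddef.h>

#define T_LEN 10u  /* hypothetical table length (tLen) */

/* Hypothetical shuffling function f: multiplication by 3 modulo T_LEN.
   It is a bijection on 0..T_LEN-1 because 3 and 10 are co-prime. */
static size_t f(size_t i) { return (3u * i) % T_LEN; }

/* Inverse of f: multiplication by 7 modulo T_LEN, since 3*7 = 21 = 1 (mod 10). */
static size_t f_inv(size_t i) { return (7u * i) % T_LEN; }

/* Build the shuffled table fT from T, assuming the convention fT[i] = T[f(i)]. */
static void shuffle_table(const unsigned char *T, unsigned char *fT)
{
    for (size_t i = 0; i < T_LEN; i++)
        fT[i] = T[f(i)];
}

/* Replacement for the plain access b = T[a]: the index goes through f_inv
   before the shuffled table is read. */
static unsigned char read_shuffled(const unsigned char *fT, size_t a)
{
    assert(a < T_LEN);
    return fT[f_inv(a)];
}
```

The change at each access site is limited to wrapping the index, which is consistent with the statement that the resulting change inside the source code is minimal.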
Given all the data tables (where table also refers here to data arrays) T in a particular piece of software, at the compilation time of the source code of the piece of software, through a software “tool” (program), one generates a table of masks which will later contain the starting addresses of each of the tables T, the length tLen of each of the tables and a description of the functions f used to mask the addresses for each table T. Denote this table of masks as masterTable; it is depicted graphically in terms of its organization in
This process 10 of obfuscating the code is shown in
The software tool 16 processes the source code 12 to be masked in the following manner. When a table pointer requiring obfuscation is detected in the original (unmasked) source code 12 by the tool 16 (via specific annotations provided or present for instance in the original source code), the tool 16 modifies the occurrence of the pointer in the source code 12 with a call to a software handler function. (Handlers are well known generally; they are asynchronous and generic callback subroutines.) The table of masks denoted masterTable is then also updated if needed (one needs to update the masterTable each time a pointer to a new buffer—a memory location—is detected).
As shown in
During the later execution of the obfuscated (masked) software compiled in this way and now in object code (compiled) form, the execution process 30 shown in
Given a data table T (with the corresponding masking information stored in the masterTable), consider a pointer pointing to a given position of the original table T: ptrT. Instead of using this pointer ptrT directly, the masterTable is accessed. Using ptrT and the associated table T length (tLen) recovered from the masterTable, it is possible to determine which element of which table T the pointer ptrT+x points to. Then, using the associated function f also recovered from the masterTable, the access b=T[a] can be replaced so as to access the correct position of the shuffled table.
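The run-time lookup just described can be sketched in C as follows. This is a hedged illustration only: the layout of masterTable is not given in this text, so the MaskEntry structure, its field names and the resolve_masked handler are hypothetical. The function f is assumed to be affine, and the entry with logical index i is assumed to be stored at position f(i) of the shuffled table. Addresses are handled as integers (uintptr_t) and entries are assumed to be one byte long.

```c
#include <stddef.h>
#include <stdint.h>

/* One masterTable row: where a protected table starts, how long it is,
   and a description of its index-shuffling function f.
   Here f is assumed to be affine: f(i) = (mul * i + add) mod len. */
typedef struct {
    uintptr_t start;  /* starting address of the table */
    size_t    len;    /* tLen, number of one-byte entries */
    size_t    mul;    /* multiplier, must be co-prime with len */
    size_t    add;    /* additive constant */
} MaskEntry;

/* Hypothetical handler: maps an address inside a protected table to the
   address where that entry is stored in the shuffled table.
   Addresses that belong to no registered table are returned unchanged. */
static uintptr_t resolve_masked(const MaskEntry *masterTable, size_t nTables,
                                uintptr_t addr)
{
    for (size_t t = 0; t < nTables; t++) {
        const MaskEntry *e = &masterTable[t];
        if (addr >= e->start && addr - e->start < e->len) {
            size_t index = (size_t)(addr - e->start);  /* logical position */
            size_t shuffled = (e->mul * index + e->add) % e->len;
            return e->start + shuffled;
        }
    }
    return addr;
}
```

The range test uses the stored starting address and length to decide which table an address belongs to, which is why both values are kept in each row.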
Consider an example in the following data table T, in which table T is a 10-byte-long table starting at the offset value 0x1234 (first entry, top row). This means the table occupies offsets 0x1234 through 0x123D and ends just before offset value 0x123E. Let the f function (for transforming the pointer indexes) be the multiplication by 3 modulo the length of the table T. Function fInv is therefore defined as the multiplication by 7 modulo 10, since 3 times 7 is 21, which equals 1 modulo 10. This means the second element of the table (at 0x1235, second element top row) is stored at 0x1237 (second element, bottom row). This table T illustrates this, by showing the link between the original starting address (top row) and the modified starting address (bottom row). Note that this table does not represent the values stored at each address:
Suppose the pointer ptrT is pointing to the third element of table T, and the process is only working with the table elements from the third to the tenth. If the pointer in the original source code was pointing to the element 0x1236, with the masking of the addresses it now points to element 0x123A. Given a pointer, it is always possible to determine the starting address because the f function is a bijection over a finite group with a number of elements equal to the length of the table T.
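The address arithmetic of this example can be checked with the following C sketch. The constants come from the example itself (starting offset 0x1234, length 10, multiplication by 3 and by 7 modulo 10); the names BASE, T_LEN, mask_address and unmask_address are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BASE  0x1234u  /* starting offset of table T in the example */
#define T_LEN 10u      /* T is 10 bytes long */

/* Masked address of an original address, with f(i) = 3*i mod 10. */
static uint32_t mask_address(uint32_t original)
{
    assert(original >= BASE && original - BASE < T_LEN);
    return BASE + (3u * (original - BASE)) % T_LEN;
}

/* Original address of a masked address, with fInv(i) = 7*i mod 10. */
static uint32_t unmask_address(uint32_t masked)
{
    assert(masked >= BASE && masked - BASE < T_LEN);
    return BASE + (7u * (masked - BASE)) % T_LEN;
}
```

With these definitions, mask_address(0x1235) gives 0x1237 and mask_address(0x1236) gives 0x123A, matching the two cases stated above, and unmask_address reverses the mapping for every address of the table.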
In one embodiment, the unmasking of the addresses can be done on-the-fly at execution time. As explained above, during the code generation process the software tool has been used to change a simple access into a more complicated process that accesses a masked address based on the f function and the masterTable.
With this solution, each time a data table or a part of a data table is accessed, the code is modified (before or during the source code compilation) to access the right position. All this can be done without changing the original source code on the developer (source code) side and only by the dedicated tool used at or immediately before the compilation into object code. Note also this tool can be used to modify the static constants (data) stored in the data tables by rewriting the data where needed.
A typical transformation (shuffling) function f for modifying the addresses is an affine transformation defined modulo the number of elements of the table and applied to the position index, the multiplicative element of the affine transformation being co-prime with the number of elements of the table. This is expressed logically as:
elementNumber_index=originalAddress+f(index)
Another suitable such function is a permutation table, for example.
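The two kinds of transformation function mentioned above can be sketched in C as follows. This is an illustrative sketch under assumptions: the names gcd, affine_f, element_address and perm_f are hypothetical, the affine form is written as f(index) = (mul*index + add) mod tLen, and the values are assumed small enough that the multiplication does not overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). */
static size_t gcd(size_t a, size_t b)
{
    while (b != 0) {
        size_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Affine shuffling f(index) = (mul * index + add) mod tLen.
   mul must be co-prime with tLen for f to be a bijection. */
static size_t affine_f(size_t index, size_t mul, size_t add, size_t tLen)
{
    assert(tLen > 0 && index < tLen && gcd(mul, tLen) == 1);
    return (mul * index + add) % tLen;
}

/* Address of the element number index: originalAddress + f(index). */
static size_t element_address(size_t originalAddress, size_t index,
                              size_t mul, size_t add, size_t tLen)
{
    return originalAddress + affine_f(index, mul, add, tLen);
}

/* Alternative: f given by an explicit permutation table of tLen entries. */
static size_t perm_f(const size_t *perm, size_t index, size_t tLen)
{
    assert(index < tLen);
    return perm[index];
}
```

The co-primality check is what guarantees that no two indexes map to the same position; a multiplier sharing a factor with the table length would make several entries collide.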
Another advantage of this solution is the ability to copy a data table into another data table of the same length. This is carried out in two steps. The first one is to recopy the table from its first element to its end, for instance. (The first element in the table is accessed as T[0] in the C computer language.) Then, a simple update of the masterTable is all that is necessary. It is also sometimes needed to recopy a table inside another table and to complete it for future treatments (appending data). This can be done easily and efficiently by first recopying the existing table at the beginning (this is not restrictive) and then completing the remaining part of the table using another transformation function, by updating the table only for the appended part. This means the new table is just considered as a set of two consecutive tables with specific dedicated modified addresses. Such an operation is impossible with prior solutions. Note also that table T can be enlarged and the added entries filled with padding (meaningless) values.
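The idea of treating an extended table as two consecutive tables, each with its own transformation function, can be sketched in C as follows. This is only one possible reading of the passage above: the Segment structure and the two_segment_pos function are hypothetical names, and both transformation functions are assumed to be affine with a multiplier co-prime with the segment length.

```c
#include <assert.h>
#include <stddef.h>

/* One segment of a table: its length and an affine shuffle
   (mul must be co-prime with len). */
typedef struct {
    size_t len;
    size_t mul;
    size_t add;
} Segment;

/* Physical position of the logical entry `index` in a table made of two
   consecutive, independently shuffled segments. */
static size_t two_segment_pos(const Segment *first, const Segment *second,
                              size_t index)
{
    assert(index < first->len + second->len);
    if (index < first->len)
        return (first->mul * index + first->add) % first->len;
    index -= first->len;  /* position relative to the appended part */
    return first->len + (second->mul * index + second->add) % second->len;
}
```

Because each segment is shuffled only within its own range, appending data leaves the positions of the already copied entries untouched.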
Another advantage of the present solution is the complete freedom in choosing the transformation function f associated with each pointer address and length. This function can be quite complex, and can differ from one table to another. One can also use a specific function to assemble two data tables (concatenation). Suppose the shuffling (transformation) of two data tables T1 and T2 is affine. Then it is easy to find a function such that the table T1 elements are shuffled over the odd positions, whereas the table T2 elements are shuffled over the even positions. It can, of course, be more complex.
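The assembly of two tables over odd and even positions can be sketched in C as follows. This is an illustration under assumptions not found in the text: both tables are taken to have the same length N, the two affine functions f1 and f2 are arbitrary examples, and the names pos_t1, pos_t2 and interleave are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 5u  /* hypothetical common length of T1 and T2 */

/* Two hypothetical affine shuffles modulo N (2 and 3 are co-prime with 5). */
static size_t f1(size_t i) { return (2u * i + 1u) % N; }
static size_t f2(size_t i) { return (3u * i + 4u) % N; }

/* Position of T1[i] in the combined table: always odd. */
static size_t pos_t1(size_t i) { assert(i < N); return 2u * f1(i) + 1u; }

/* Position of T2[i] in the combined table: always even. */
static size_t pos_t2(size_t i) { assert(i < N); return 2u * f2(i); }

/* Fill the combined table `out` (2*N entries) from T1 and T2. */
static void interleave(const int *T1, const int *T2, int *out)
{
    for (size_t i = 0; i < N; i++) {
        out[pos_t1(i)] = T1[i];
        out[pos_t2(i)] = T2[i];
    }
}
```

Since f1 and f2 are bijections on 0..N-1, the odd and even positions are each covered exactly once and no entry of one table overwrites an entry of the other.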
Given this description, coding the above described software tool in any convenient computer language would be routine, as would be combining that tool with conventional source code compilation. The tool is used prior to conventional compilation, or is part of the compiler. The resulting compiled code would then execute conventionally, resulting in transparency to the ultimate end user of the obfuscated software application, and no special software or hardware modifications are needed to that end user's computing device.
The following shows for one embodiment variables and parameters used in the above software tool:
Computing system 60 can also include a main memory 68 (equivalent to memories 52, 56, 58), such as random access memory (RAM) or other dynamic memory, for storing information and instructions to be executed by processor 64. Main memory 68 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 64. Computing system 60 may likewise include a read only memory (ROM) or other static storage device coupled to bus 62 for storing static information and instructions for processor 64.
Computing system 60 may also include information storage system 70, which may include, for example, a media drive 72 and a removable storage interface 80. The media drive 72 may include a drive or other mechanism to support fixed or removable storage media, such as a hard disk drive, a floppy disk drive, a magnetic tape drive, an optical disk drive, a compact disk (CD) or digital versatile disk (DVD) drive (R or RW), flash memory, or other removable or fixed media drive. Storage media 78 may include, for example, a hard disk, floppy disk, magnetic tape, optical disk, CD or DVD, flash memory or other fixed or removable medium that is read by and written to by media drive 72. As these examples illustrate, the storage media 78 may include a computer-readable storage medium having stored therein particular computer software or data.
In alternative embodiments, information storage system 70 may include other similar components for allowing computer programs or other instructions or data to be loaded into computing system 60. Such components may include, for example, a removable storage unit 82 and an interface 80, such as a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory module) and memory slot, and other removable storage units 82 and interfaces 80 that allow software and data to be transferred from the removable storage unit 82 to computing system 60.
Computing system 60 can also include a communications interface 84 (equivalent to port 32 in
In this disclosure, the terms “computer program product,” “computer-readable medium” and the like may be used generally to refer to media such as, for example, memory 68, storage device 78, or storage unit 82. These and other forms of computer-readable media may store one or more instructions for use by processor 64, to cause the processor to perform specified operations. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system 60 to perform functions of embodiments of the invention. Note that the code may directly cause the processor to perform specified operations, be compiled to do so, and/or be combined with other software, hardware, and/or firmware elements (e.g., libraries for performing standard functions) to do so.
In an embodiment where the elements are implemented using software, the software may be stored in a computer-readable medium and loaded into computing system 60 using, for example, removable storage drive 74, drive 72 or communications interface 84. The control logic (in this example, software instructions or computer program code), when executed by the processor 64, causes the processor 64 to perform the functions of embodiments of the invention as described herein.
This disclosure is illustrative and not limiting; further modifications will be apparent to those skilled in the art in light of this disclosure and are intended to fall within the scope of the appended claims.