The disclosure relates to software programs and/or applications and to protection of such applications against reverse engineering and protection of data manipulated or included in such applications.
Reverse engineering of an application consists of an analysis of the application in order to gather information about its operation and the data it manipulates. In summary, the aim of reverse engineering is to reconstruct the original source code of the application from its distributed binary form.
In parallel, in recent years, the number of software applications that can be downloaded to mobile devices from public “application stores” or “markets” has exploded. Applications such as Android applications are distributed using a file called an Android™ Package Kit (APK), comprising the code and data that the application needs in order to be installed on an Android operating system and to operate properly.
To produce an APK file, a program designed for the Android platform is first compiled, and then all of its parts are assembled into one APK package file including all program executable code (such as .dex files), resource files, asset files, certificate files, and the manifest file. An APK file is usually a ZIP-formatted package file based on the Java Archive (JAR) file format. An APK file is an archive that usually contains the following files and folders:
Each APK file has one or more program files “classes.dex”, “classesN.dex”, which contain the entire compiled program code. For portability reasons, programs for Android devices are commonly written in Java and compiled to bytecode (which is contained in program .class files). The compiled bytecode is then converted from Java Virtual Machine-compatible .class files to a Dalvik-compatible .dex (Dalvik Executable) file in order to enable installation on a computing device. The bytecode in the .dex file has a binary format and represents the instructions that can be executed by a Dalvik virtual machine.
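Because an APK file is a ZIP archive, the program files named above can be located with a standard ZIP reader. The following is a minimal sketch in Python, not part of the described method; the function names `list_dex_entries` and `build_demo_apk` and the demo archive content are assumptions introduced for illustration only.

```python
import io
import re
import zipfile

# Matches "classes.dex", "classes2.dex", ... at the archive root.
DEX_NAME = re.compile(r"^classes\d*\.dex$")


def list_dex_entries(apk):
    """Return the names of the Dalvik executable entries found in an APK.

    `apk` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(apk) as archive:
        return sorted(name for name in archive.namelist() if DEX_NAME.match(name))


def build_demo_apk():
    """Build a tiny in-memory ZIP mimicking an APK layout (hypothetical demo data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("AndroidManifest.xml", "classes.dex",
                     "classes2.dex", "res/layout/main.xml"):
            archive.writestr(name, b"")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_dex_entries(build_demo_apk()))
```

The sketch only enumerates archive entries; it does not parse or disassemble the .dex content.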
An APK file can be disassembled using a reverse engineering tool such as APKTOOL which produces bytecode program files called “classes” and data files arranged in directories and sub-directories called “packages”. A class such as a Java class is an extensible program-code-template for creating objects. A class represents an object category and has members, methods and attributes that are common to a set of objects. The attributes of a class define properties of the objects belonging to the class. An object is an instance of a class. The methods of a class define behaviors or functions of the objects belonging to the class. A method is a set of program instructions manipulating the attributes of an object of the class.
A lot of information present in a Java source code file will frequently be contained in the corresponding bytecode file. This includes information about classes, methods, class attributes, and source code which can be present in a “LineNumberTable”. This table links instructions to specific lines of the source code and another table called “StackMapTable” is used to verify variable types.
Reverse engineering tools such as decompilation tools exploit this additional embedded information to increase the output quality of the decompiled source code.
Some applications include, store or manipulate potentially sensitive or private information. This sensitive information can be retrieved by disassembling the application package file and by manually analyzing the files thus obtained, and/or by analyzing the executed code and/or the manipulated data at runtime. Techniques and tools such as obfuscation techniques have been developed to prevent such an analysis or make it more difficult and time consuming. Some tools transform code portions of the application that have to be hidden to make them complex and/or confusing. The purpose of obfuscation techniques is to make a code portion less transparent and less readable and to increase the complexity of the data and control flow of the code portion. Obfuscation can be applied to code, data, control flow, and layout. Obfuscation can also prevent decompilation.
However, when considering an application, it is difficult to evaluate or measure the tamper resistance of the application, or the quality of the obfuscation applied to it, in particular when the application after disassembling comprises hundreds or thousands of class files. Evaluating the quality of obfuscation requires determining which code portions are obfuscated, how they are obfuscated, how resilient the obfuscation is, and what the obfuscation helps to hide from the outside world. Indeed, security-sensitive code portions of an application may need to be more obfuscated than non-sensitive code portions. In addition, obfuscation of a security-sensitive code portion may aim to hide security-sensitive information.
There already exist some software metrics, used in software development, for measuring the maintainability of software. Some of these metrics use characteristics such as transparency, readability, and data and control flow complexity of a code portion, to evaluate comprehension of the latter. Transparency reflects how straightforwardly algorithmic concepts can be extracted from a code portion. Readability reflects the contribution of the names of variables, functions and classes to the comprehension of the logic of the code portion. Control flow defines which conditions control the operations of the code portion, and in which sequence they are tested. Data flow defines the data inputs to the code portion, and how these inputs are transformed and output. Data and control flow complexity reflect interdependencies between parts of the code portion and how they communicate with each other at runtime.
Consequently, there is a need to quantify obfuscation of program files. Obfuscation quantification may then be used to sort or select files that should be more obfuscated. There is also a need to select program files that are insufficiently obfuscated regarding current needs, and optionally to process the selected files to improve their protection using obfuscation.
A method is described for evaluating the obfuscation level of an application and for selecting or sorting files as a function of the evaluated obfuscation quality. The method of assessing obfuscation of an executable software application, may comprise: generating, by a processor, program text files from the executable software application; for each program text file of the software application, computing at least one syntactical metrics or program complexity metrics, computing, by the processor, a respective score for each program text file based on the computed metrics; performing, by the processor, a sorting operation of the program text files as a function of their respective computed scores; and providing, by the processor, a result of the sorting operation, or generating a new executable software application, by applying an obfuscation processing to a respective source code file corresponding to each identified program text file, to generate a modified source code file for each identified program text file, the new executable software application being obtained from source code files corresponding to the program text files with the source code file of each identified program text file being replaced by the corresponding modified source code file.
According to an embodiment, the method further comprises: extracting program code files from the executable application file; and converting the program code files into the program text files.
According to an embodiment, the method further comprises: comparing each computed score with a threshold value, and selecting program text files as a function of results of the comparisons.
According to an embodiment, the method further comprises displaying the computed scores, each score being displayed as a tile having a color depending on the score value and being associated with a name of a corresponding program text file.
According to an embodiment, the displayed tiles are arranged in rows and columns.
According to an embodiment, the method further comprises extracting program code and data files from the executable software application file, the program code and data files being distributed in folders of a folder tree structure, each folder corresponding to a package of the executable software application, and each program code file corresponding to a class in one of the packages.
According to an embodiment, the syntactical metrics are applied to character strings of the program text files and comprise at least one of: ratios of character string numbers by size of character string with respect to a total number of character strings, a distribution of these ratios, and an average size of character string, ratio of a number of character strings having at least one non-ASCII character, with respect to the total number of character strings, ratio of a number of character strings having at least one non-alphabetic character, with respect to the total number of character strings, ratio of a number of character strings having a first encoding and character strings having a second encoding, with respect to the total number of character strings, ratio of a number of character strings having at least one Unicode character, with respect to the total number of character strings, ratio of a number of character strings having at least one special character, with respect to the total number of character strings, ratio of a number of character strings having a non-printable character, with respect to the total number of character strings, and ratio of a number of character strings having an unknown encoding, with respect to the total number of character strings.
According to an embodiment, the program complexity metrics comprise at least one of: a total number of lines of code, numbers of packages, classes and functions or methods, numbers of instructions per function or method, a distribution of Application Programming Interface calls, distribution of classes, methods, class attributes, constants, Halstead's metrics, and McCabe cyclomatic complexity analysis.
According to an embodiment, the method further comprises: generating a control flow graph for each of the program text files, and/or generating a call graph including all the methods or functions of the application.
According to an embodiment, the method further comprises comparing each computed score with a respective threshold value, selecting the scores as a function of the comparison with their respective threshold value, and computing a global score for each program text file by adding coefficients defined respectively for each selected score as a function of the metrics corresponding to the selected score.
According to an embodiment, the method further comprises computing a global score for a file set of program text files by applying weighting coefficients to the global scores computed for the program text files of the file set, and/or computing a global score for the software application by applying weighting coefficients to the scores computed from each program text file of the software application.
According to an embodiment, the generation of a new executable software application comprises: selecting an obfuscation processing, applying the selected obfuscation processing to the source code file of each identified program text file; and compiling and assembling the source code files, the new executable software application being further tested to identify program text files that are insufficiently obfuscated.
According to an embodiment, the global scores computed for the program text files are based on one or more metrics selected by a user, and the obfuscation processing is selected as a function of the selected metrics.
Embodiments may also relate to a computer system for selecting executable program files of an executable software application, the computer system being configured to implement the method as previously defined.
Embodiments may also relate to a computer program product loadable into a computer memory and comprising code portions which, when carried out by one or more computers, configure the one or more computers to carry out the method as previously defined.
The method and/or device may be better understood with reference to the following drawings and description. Non-limiting and non-exhaustive descriptions are described with the following drawings. In the figures, like referenced signs may refer to like parts throughout the different figures unless otherwise specified.
As used hereinafter, the following terms have the following meanings, except when specifically indicated otherwise. The term “metrics” refers to values that express a degree to which the examined code satisfies some evaluation criterion.
The term “obfuscation” refers to a transformation of a program code in order to hide the original intent of the code by, for example, increasing complexity. Obfuscations may be measured by applying suitable metrics to the obfuscated code.
The module PPRC disassembles the application file APF into package files or source files and/or separate files CF, DF which can be stored for example in a database APDB. The files in the database APDB comprise program code files CF and data files DF. The module PPRC can extract other information from the application packages or from the extracted files to facilitate obfuscation analysis of the application.
The module OAM comprises metrics modules MT1, MT2, . . . MTn that have access to the stored information that can be in the database APDB. Each of the metrics modules MT1, MT2, . . . MTn provides a respective metrics score QT1, QT2, . . . QTn for each program code file CF of the application file APF in the database APDB.
The module MTAM is configured to compute a respective global obfuscation score FMT for each program code file CF extracted from the application file APF, on the basis of the metrics scores QT1-QTn computed for the file CF. The module MTAM is further configured to compute a respective global score PMT for each of the packages extracted from the application file APF, on the basis of the global scores FMT computed for the program code files of the package. The module MTAM is further configured to compute a global score AMT for the application on the basis of the global scores PMT computed for the packages of the application.
The system 1 can further comprise a graphical user interface WUI enabling the user to input commands into the system 1 and displaying obfuscation analysis results.
The user interface WUI can be configured to enable the user to define obfuscation threshold values to be applied to the global scores FMT, PMT and AMT. The definition of an obfuscation threshold value for the file scores FMT triggers the generation of a list of insufficiently obfuscated files SFL. The list SFL designates code files CF that could be advantageously submitted to an obfuscation process.
The disassembling module DASM disassembles the application file APF into different files enabling the application to be decompiled. In the case of an Android application package, the disassembling module may produce data files, resource files and a binary file containing the executable code. The binary file is further decompiled into a text file.
The analyzing module SCA performs parsing of the text file generated by the module DASM. The parsing operation extracts lists AF of classes and for each class:
For each method, the module SCA provides a list of arguments and a call tree between methods of the same class or different classes.
The information extracted by the module SCA is stored into files AF in the database APDB. The module SCA can receive and process interactive commands from the user. These commands include search commands for searching for classes, properties, methods and method calls, using search filters. Search commands consist of a single query or multiple queries that may be combined in one. A query corresponds to a single piece of information that the user wants to fetch or know. Multiple queries can be combined in one in order to form cross-queries. This enables the user to better filter the result(s) he expects by combining multiple queries. The analyzing module SCA may also construct a call graph showing all method calls between the class methods of one file or one package or the whole application.
The decompilation module DCMP converts the text file containing the executable code into source files comprising class files CF dispatched into folders and subfolders FD arranged in a tree structure, each folder or subfolder FD corresponding to a package or sub-package. The folder tree structure with the files CF, DF in it is stored in the database APDB.
The control flow graph module CFGS is configured to analyze the text file containing the executable code produced by the disassembling module DASM, and to generate a control flow graph CFG for each method in each text code file CF. A code box in a control flow graph represents a sequential instruction code block, an arrow between two code blocks represents a program unconditional jump, and a test box (diamond box) with three arrows represents a conditional jump which links three code blocks. Each generated control flow graph CFG may be stored in the database APDB in association with the corresponding source code file CF. The control flow graph module CFGS may be further configured to analyze the generated control flow graphs to determine their respective complexity level.
In the case of an Android application package, the disassembling module DASM disassembles the binary file into a human-readable text file including instructions in an assembly language such as SMALI, or the like. The text file may be further decompiled into the Java programming language (which approximately corresponds to the original source code syntax). The disassembling module DASM can be realized using a software tool such as APKTOOL. The analyzing module can be realized using a software tool such as SMALISCA. The decompilation module DCMP can be realized using a software tool such as DEX2JAR converting SMALI source code files into Java source code files. The control flow graph generation module CFGS can be realized using the software module CFGScanDroid.
Some of the metrics modules MT1-MTn of the module OAM can perform character string analysis and/or program code analysis.
Classes, methods or functions, class attributes, variables and constants are usually identified in the files CF by a name formed of a character string composed of a certain number of characters. Also some constants can have a character string as a value.
In the case of Android, the assembly code generally uses SMALI syntax (as illustrated at https://www.quora.com/What-is-smali-inAndroid). SMALI syntax is very verbose, such that a lot of information is still present after decompilation, and may provide valuable information for reverse engineering. Class, method or function, attribute, variable and constant names are examples of valuable information. Some obfuscation techniques consist of altering those names. In their non-obfuscated version, character strings are usually meaningful, understandable and/or intelligible. Character strings are easy to obfuscate, and obfuscating them represents the simplest way to make the reverse engineering process harder, as it removes a lot of meaningful information that can be used to understand the function of a program code portion and how the latter operates. This is because character strings have a size, a character distribution and an encoding, and sometimes use a human language.
Generally, meaningful strings have 6 to 15 characters. One of the simplest obfuscations, used by the most widely used obfuscator (ProGuard https://www.guardsquare.com/en/proguard), modifies all those names so that they finally comprise only one character. This is the simplest form of string obfuscation. Obfuscated character strings can also have a lot of characters. Consequently, long character strings, character strings with a distinguishable character distribution, and/or character strings using several character encodings can also be signs of obfuscation.
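As an illustration of how the size, character distribution and encoding of names could be turned into numbers, the following Python sketch computes a few ratios over a list of names. It is a simplified sketch under assumptions: the function name `string_metrics`, the choice of ratios and the dictionary keys are introduced here for illustration and do not reproduce the exact metrics of the described modules.

```python
def string_metrics(names):
    """Compute a few ratio-style metrics over a list of identifier strings.

    Each ratio is the number of strings having a given property divided by
    the total number of strings. The selection of properties is illustrative.
    """
    total = len(names)
    if total == 0:
        return {"average_length": 0.0, "non_ascii_ratio": 0.0,
                "non_alphabetic_ratio": 0.0, "non_printable_ratio": 0.0}
    return {
        "average_length": sum(len(n) for n in names) / total,
        # Strings containing at least one non-ASCII character.
        "non_ascii_ratio": sum(1 for n in names if not n.isascii()) / total,
        # Strings containing at least one non-alphabetic character.
        "non_alphabetic_ratio": sum(
            1 for n in names if any(not c.isalpha() for c in n)) / total,
        # Strings containing at least one non-printable character.
        "non_printable_ratio": sum(
            1 for n in names if not n.isprintable()) / total,
    }


if __name__ == "__main__":
    demo = ["getUserName", "a", "b1", "\u0430\u0431", "x\x00y"]
    for key, value in string_metrics(demo).items():
        print(key, value)
```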
The metrics resulting from character string analysis can also compute in each of the text code files CF:
To determine whether a character string is obfuscated or not, it may be considered which human language is used for programming. In the case of the English language, the average word length is 5 characters. In addition, the programmers often use names merging at most 3 or 4 words. Therefore a threshold value to decide whether a character string is obfuscated or not can be set to 2×5 when the English language is used. Below this threshold value, it can be considered that the character string may be obfuscated, which requires a further analysis to determine if it is really obfuscated.
The metrics in the form of a number can be used by comparing the metrics number with a threshold value, the score of the metrics being 0 or 1 depending on whether the metrics number is lower or greater than the threshold value.
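The length threshold and the 0-or-1 scoring described above can be sketched as follows in Python. The names `possibly_obfuscated`, `short_name_ratio` and `binary_score` are hypothetical, and the convention that a value above the threshold yields a score of 1 is an assumption, since the text leaves the direction open.

```python
AVERAGE_WORD_LENGTH = 5                      # English, as stated in the text
LENGTH_THRESHOLD = 2 * AVERAGE_WORD_LENGTH   # the 2x5 threshold


def possibly_obfuscated(name, threshold=LENGTH_THRESHOLD):
    """True when the name is shorter than the threshold.

    This only marks the name as a candidate that requires further analysis.
    """
    return len(name) < threshold


def short_name_ratio(names, threshold=LENGTH_THRESHOLD):
    """Fraction of names that fall below the length threshold."""
    if not names:
        return 0.0
    return sum(1 for n in names if possibly_obfuscated(n, threshold)) / len(names)


def binary_score(value, threshold):
    """Score of a numeric metric: 1 above the threshold, else 0 (assumed convention)."""
    return 1 if value > threshold else 0


if __name__ == "__main__":
    names = ["a", "b", "getUserName", "onCreate"]
    ratio = short_name_ratio(names)
    print(ratio, binary_score(ratio, 0.5))
```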
The metrics in the form of a distribution can be used by comparing the obtained distribution with a reference distribution or using criteria, the score of the metrics being 0 or 1 depending on the result of the comparison or the application of a criterion.
It can be deduced that a histogram with high peaks can be an indication of obfuscated software entity names. When such high peaks are observed for names of one or two characters or 2×M characters, M being the average word length in the used language, it can be said that software entity names are obfuscated. In contrast, a histogram having a Gaussian shape cannot be exploited to determine if the software entity names are or are not obfuscated, thus requiring further investigation. Additional investigations may also be carried out in the previous cases to confirm first analysis results.
The computation of two standard deviations, respectively applied to the names having fewer than five characters and to the names having more than five characters, may be used to reveal obfuscation. When the standard deviation is lower than an appropriate threshold value, it can be supposed that there is no obfuscation. Above this threshold value, the standard deviation can reveal obfuscation, or indicate that the quality of obfuscation is better.
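A histogram of name lengths and the two standard deviations mentioned above can be sketched in Python as follows. This is an illustrative sketch: the function names are hypothetical, and it assumes that each standard deviation is taken over the name lengths of the respective group, which the text does not state explicitly.

```python
from collections import Counter
from statistics import pstdev


def length_histogram(names):
    """Map each name length to the number of names having that length."""
    return dict(sorted(Counter(len(n) for n in names).items()))


def split_deviations(names, pivot=5):
    """Population standard deviation of name lengths below and above `pivot`.

    Names of exactly `pivot` characters belong to neither group. An empty
    group yields 0.0.
    """
    short = [len(n) for n in names if len(n) < pivot]
    long_ = [len(n) for n in names if len(n) > pivot]

    def deviation(values):
        return pstdev(values) if values else 0.0

    return deviation(short), deviation(long_)


if __name__ == "__main__":
    demo = ["a", "b", "ab", "abcdefgh", "abcdefghij"]
    print(length_histogram(demo))
    print(split_deviations(demo))
```

A histogram with a dominant bar at length 1 or 2, as in the demo data, corresponds to the high-peak case discussed above.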
The metrics resulting from program code analysis can compute in each of the code files CF:
Each of these metrics has a value which may be compared with a threshold value (that may be defined by the user), the score of the metrics being 0 or 1 depending on the result of the comparison.
Halstead's metrics are computed using the number n1 of distinct operators, the number n2 of distinct operands, the total number N1 of operators, and the total number N2 of operands. When applied to a method or function, these numbers are used to compute:
The programming effort can be considered as representative of a complexity of the program code, and compared with a threshold value, the score of the metrics being 0 or 1 depending on the result of the comparison.
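For reference, the commonly cited Halstead quantities can be derived from n1, n2, N1 and N2 as in the Python sketch below. The formulas are the usual textbook definitions (vocabulary, length, volume, difficulty, effort); that the described module uses exactly these quantities is an assumption, and the function name `halstead` is hypothetical.

```python
import math


def halstead(n1, n2, N1, N2):
    """Usual Halstead quantities from operator and operand counts.

    n1, n2: numbers of distinct operators and distinct operands.
    N1, N2: total numbers of operators and operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    effort = difficulty * volume
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty, "effort": effort}


if __name__ == "__main__":
    print(halstead(n1=4, n2=4, N1=8, N2=8))
```

The resulting effort value is the number that would then be compared with a threshold to obtain a 0-or-1 score.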
McCabe cyclomatic complexity analysis determines complexity and density of a program code of a particular function by analyzing its control flow graph CFG. This analysis is based on the computation of a number of linearly independent paths LIPN in a control flow graph CFG. The number LIPN can be computed using the following equation:
LIPN=E−N+2 (1)
where E is the number of arrows and N is the number of nodes (i.e. program code blocks without jump) in the graph CFG.
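Equation (1) can be evaluated directly on a control flow graph given as a list of arrows. The Python sketch below is illustrative only: the function name and the node labels are hypothetical, and it assumes a single connected graph in which every node appears in at least one arrow.

```python
def linearly_independent_paths(edges):
    """Return LIPN = E - N + 2 for a control flow graph.

    `edges` is a list of (source, target) pairs, one per arrow. The node set
    is derived from the arrows, so isolated nodes are not counted.
    """
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
    return len(edges) - len(nodes) + 2


if __name__ == "__main__":
    # A single if/else: the entry block branches to two blocks that rejoin.
    if_else = [("entry", "then"), ("entry", "else"),
               ("then", "exit"), ("else", "exit")]
    print(linearly_independent_paths(if_else))

    # Straight-line code: three blocks executed in sequence.
    sequence = [("a", "b"), ("b", "c")]
    print(linearly_independent_paths(sequence))
```

With E = 4 and N = 4, the if/else graph gives 2, and the straight-line sequence with E = 2 and N = 3 gives 1.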
Other metrics performing program code analysis can compute in each of the code files CF:
The module MTAM computes a global obfuscation score FMT for each of the program code files CF extracted from the application file APF. The character string analysis metrics may be applied to the whole file CF. The program code analysis metrics may be applied to each of the functions or methods in the file CF. Then, code analysis metrics scores can be computed for a file CF using the scores provided by the metrics applied to the methods or functions of the file CF. A global score FMT can be computed for each of the program code files CF extracted from the application file APF, from all metrics scores computed for the file CF. The global score FMT for a file CF can be computed as a linear combination of the metrics scores QT1-QTn computed for the file CF, each metrics score QT1-QTn being weighted by a respective weighting coefficient. The weighting coefficients can be defined such that the computed global scores FMT have a value between 0 and 1. When the metrics scores are numbers equal to 0 or 1, the global score FMT for a file CF is the sum of the weighting coefficients associated with the metrics scores equal to 1.
According to an embodiment, the module MTAM also computes a global score PMT for each package (corresponding to a folder FD) by adding the global scores FMT of all the program files CF belonging to the package (i.e. located in the folder FD and subfolders of the package). The module MTAM can further compute a global score AMT for the application by adding the global scores PMT of all separate packages of the application or adding the global scores FMT computed for each of all the program files CF of the application. The global scores PMT and AMT can be also obtained by computing weighted averages using weighting coefficients from the global scores FMT and PMT, respectively. The weighting coefficients can be determined and adjusted using machine learning.
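The aggregation of metrics scores into the scores FMT, PMT and AMT can be sketched in Python as follows. This is a sketch under assumptions: the function names, the nested dictionary layout, the package names and the coefficient values are hypothetical, and equally weighted averages are used for PMT and AMT as one of the options mentioned above.

```python
def weighted_average(values, weights=None):
    """Weighted average of `values`; equal weights when none are given."""
    values = list(values)
    if not values:
        return 0.0
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def file_score(metric_scores, coefficients):
    """FMT: linear combination of the metrics scores QT1..QTn of one file."""
    return sum(q * c for q, c in zip(metric_scores, coefficients))


def application_scores(packages, coefficients):
    """Return (FMT per file, PMT per package, AMT).

    `packages` maps a package name to {file name: [QT1, ..., QTn]}.
    """
    fmt, pmt = {}, {}
    for package, files in packages.items():
        fmt[package] = {name: file_score(scores, coefficients)
                        for name, scores in files.items()}
        pmt[package] = weighted_average(fmt[package].values())
    amt = weighted_average(pmt.values())
    return fmt, pmt, amt


if __name__ == "__main__":
    coefficients = [0.5, 0.3, 0.2]  # illustrative values summing to 1
    packages = {
        "com.example.ui": {"A": [1, 0, 1], "B": [0, 1, 0]},
        "com.example.net": {"C": [1, 1, 1]},
    }
    print(application_scores(packages, coefficients))
```

Choosing coefficients that sum to 1 keeps every FMT between 0 and 1 when the metrics scores are 0 or 1.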
By positioning a pointer (e.g. a mouse pointer) on a tile TL, the name of the corresponding code file CF and the position thereof in the tree structure FTS may be displayed for example in a separate window, together with the computed obfuscation score FMT of the code file. The position of a code file CF is defined by the names of the (nested) packages to which the corresponding class belongs. The displayed window can also display APIs (Application Programming Interfaces) types (cryptography, database access, network access, . . . ) that are called in the file.
The module WUI can further display a list of packages of the application, each package being displayed associated with the global obfuscation score PMT computed by the module MTAM. The module WUI can further display the global obfuscation score AMT computed for the application.
According to an embodiment, the module MTAM is configured to allow the user to set a score threshold value and to extract a list SFL of code files CF having an obfuscation score FMT lower than the score threshold value. The tiles TL of the image OMP can also be displayed with two different colors or grey levels depending on whether or not the corresponding obfuscation score is greater than the threshold value.
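The extraction of the list SFL from the file scores can be sketched in Python as below. The function name and the sample scores are hypothetical, and ordering the list from the lowest score upward is an illustrative choice rather than a requirement of the described system.

```python
def insufficiently_obfuscated(file_scores, threshold):
    """Return the names of files whose score FMT is lower than `threshold`.

    `file_scores` maps a file name to its global score FMT. The result is
    sorted from the lowest score upward.
    """
    selected = [name for name, score in file_scores.items() if score < threshold]
    return sorted(selected, key=lambda name: file_scores[name])


if __name__ == "__main__":
    scores = {"A": 0.7, "B": 0.2, "C": 0.4}  # hypothetical FMT values
    print(insufficiently_obfuscated(scores, threshold=0.5))
```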
According to an embodiment, the module MTAM is configured to allow the user to select a set of one or more metrics and to compute obfuscation scores for the program code files CF extracted from the application file APF, for the packages of the application and for the application itself, taking into account only the results provided by the selected metrics. When the user selects a set of at least one metrics, the image OMP can be constructed on the basis of the scores computed for the files CF by the module MTAM using the selected metrics. In this way, the user can focus on a particular metrics set, so that the list SFL designates code files CF having an insufficient score with respect to the selected metrics set.
The code files CF that are thus selected as being insufficiently obfuscated or the corresponding source code files can be modified to improve their obfuscation. For this purpose, there exist obfuscation tools corresponding to the obfuscation metrics MT1-MTn, that can process the corresponding source files to improve the metrics scores that can be computed from the resulting files CF.
One of the obfuscation techniques can rename some or all the names included in a code file CF, using arbitrary or random names. Other examples of known obfuscation techniques that can be implemented by the module OBM are disclosed in the following documents, which are incorporated herein by reference:
Cloakware/Transcoder™: The core of Cloakware Code Protection™ (Cloakware product overview advertising material),
U.S. Pat. No. 6,594,761 (Chow et al.),
U.S. Pat. No. 6,668,325 (Collberg et al.),
U.S. Pat. No. 6,779,114 (Chow et al.),
The obfuscation module OBM receives the list SFL of files CF selected by the system 1 and the source code files SCF of the software application. The obfuscation module OBM can be configured to successively select a source code file SCF to process as a function of the files CF as set out in the list SFL. In the case of a software application written in an object-oriented programming language, each file in the list SFL corresponds to a class. The corresponding source code file SCF to process is the one defining this class. If the corresponding source code file SCF contains the definition of more than one class, the module OBM may process only the class corresponding to the selected file CF in the source code file SCF. The obfuscation parameters or techniques to apply to the selected source code file SCF can be selected by the user, or for example, as a function of a set of one or more metrics if the selection of the file by the system 1 is based on a particular set of metrics.
When all source code files SCF corresponding to a file in the list SFL are processed, the module CAM compiles the source code files SCF and assembles them to produce a modified application file APF1 that can be input to the system 1. The methods implemented by the system 1 and the obfuscation tool 2 can be performed several times on an application file APF and then on a resulting modified application file APF1, using different obfuscation parameter(s) or technique(s), until the global score AMT or identified package global scores PMT reach an expected value, or until the file list SFL is empty or only contains files that do not need to be obfuscated.
One or more of the above-described techniques can be implemented in or involve one or more computer systems.
The computing environment CMP can include additional features. For example, the computing environment CMP may include storage MST, one or more input devices CM, one or more output devices DSP, and/or one or more communication connections COM. An interconnection mechanism, such as a bus, controller, or network interconnects the components of the computing environment CMP. Typically, operating system software or firmware (not shown) provides an operating environment for any other software executed in the computing environment CMP, and coordinates activities of the components of the computing environment CMP.
The storage MST may be removable or non-removable, and may include magnetic and/or optic disks, or any other medium which can be used to store information and which can be accessed within the computing environment CMP. The storage MST may store instructions related to the software.
The input device(s) CM may be a touch input device such as a keyboard, mouse, pen, trackball, touch screen, or game controller, a voice input device, a scanning device, a digital camera, remote control, or another device that provides input to the computing environment CMP. The output device(s) DSP may be a display, television, monitor, printer, speaker, or another device that provides output from the computing environment CMP.
The communication connection(s) COM enable communication over a communication medium to another computing entity. The communication medium conveys information such as the application file APF, audio or video information, or other data.
The methods and computer systems previously disclosed can process applications written in an object-oriented language as well as in other programming languages.
The described embodiment can be modified in arrangement and detail without departing from principles defined in the appended claims. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computing environment, unless indicated otherwise. Various types of general purpose or specialized computing environments may be used with or perform operations in accordance with the teachings described herein. Elements of the described embodiment shown in software may be totally or partially implemented in hardware and vice versa.
The illustrations described herein are intended to provide a general understanding of the structure of various embodiments. These illustrations are not intended to serve as a complete description of all of the elements and features of apparatus, processors and systems that utilize the structures or methods described herein. Many other embodiments or combinations thereof may be apparent to those of ordinary skill in the art upon reviewing the disclosure by combining the disclosed embodiments. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure.
Further, the disclosure and the illustrations are to be considered as illustrative rather than restrictive, and the appended claims are intended to cover all such modifications, enhancements and other embodiments, or combinations thereof, which fall within the true spirit and scope of the description. Therefore, the scope of the following claims is to be determined by the broadest permissible interpretation of the claims and their equivalents, and shall not be restricted or limited by the foregoing description.
Number | Date | Country | Kind |
---|---|---|---|
17194988.6 | Oct 2017 | EP | regional |
This application is a bypass continuation of PCT Application No. PCT/EP2018/076218, filed Sep. 27, 2018, which claims priority to European Application No. 17194988.6, filed Oct. 5, 2017, the disclosures of which are incorporated herein by reference in their entireties.
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2018/076218 | Sep 2018 | US |
Child | 16836099 | | US |