Application software suites such as Microsoft® Office® and Adobe® Acrobat® allow the end user to edit complex documents that contain text, tables, charts, pictures, videos, sounds, hyperlinks, interactive objects, etc. Some of these rich content features rely on the application software suites' support of scripting languages, such as Visual Basic® for Applications (abbreviated VBA) for the Microsoft® Office® suite and JavaScript® (abbreviated JS) for the Adobe® Acrobat® suite.
Cybercriminals have leveraged the support of scripting languages in these application software files, writing malicious code that performs malicious actions such as installing malware (ransomware, spyware, trojans, etc.) on the end user's device, re-directing the end user to a phishing website, etc. As security vendors have started to develop technologies to detect malicious VBA and JS scripts, cybercriminals have increased the sophistication of their cyberattacks using different techniques, such as source code obfuscation.
Source code obfuscation is the deliberate act of creating source code that is difficult for humans to understand. Source code obfuscation is widely used in the software industry, mainly to protect source code and to deter reverse engineering for security and intellectual property reasons. Source code obfuscation, however, is very rarely used in benign VBA and JS scripts embedded in Microsoft® Office® and Adobe® Acrobat® files, as those scripts are usually simple and many do not have any intellectual property value.
The detection of obfuscated code, therefore, can be a useful tool in identifying potentially malicious code.
In the context of malicious code, obfuscation has one main purpose: to bypass security vendors' filtering technologies. More precisely:
The following lists a few common JS obfuscation techniques used by cybercriminals to obfuscate malicious code:
The aforementioned list of obfuscation techniques is not exhaustive, and these techniques may be combined with one another and/or other techniques to achieve even higher levels of obfuscation.
Similar obfuscation techniques exist in VBA.
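To make the evasion concrete, the following is a hypothetical illustration (not taken from the source) of one such technique, string splitting, where a sensitive keyword is rebuilt at runtime and therefore never appears literally in the source text, so a naive signature scan misses it. The `naive_scan` function and the example scripts are assumptions for illustration only.

```python
SIGNATURE = "eval"

def naive_scan(source: str) -> bool:
    """Return True if the signature appears literally in the source text."""
    return SIGNATURE in source

# A plain call: the keyword is visible to a substring scan.
plain_script = 'eval(payload)'

# The same call, obfuscated: the keyword is reassembled from fragments
# at runtime, so the literal substring "eval" never occurs in the source.
obfuscated_script = 'this["e" + "v" + "a" + "l"](payload)'
```

A signature scan flags the plain script but not the obfuscated one, which is precisely the gap that a statistical obfuscation detector aims to close.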
According to one embodiment, a function called EvaluateFile may be defined, in which:
The EvaluateFile function and its use are shown relative to
The following data is defined:
In the highlighted steps below, computer-implemented methods for determining whether code is obfuscated according to embodiments are detailed with reference to
Step 1: A getType function may be called to identify the type Tf of the file f. If Tf is not null, then Tf identifies the type of application software suite and the EvaluateFile function proceeds to the next step. Otherwise, if Tf equals null, the EvaluateFile function exits and returns NoCode, as shown at B803 in
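One way Step 1 might be sketched is by inspecting a file's leading magic bytes; the function name `get_type` and the returned type tags are assumptions for illustration, not the source's definitions. Returning `None` corresponds to Tf equaling null.

```python
def get_type(data: bytes):
    """Identify the application software suite file type from magic bytes.
    Returns a type tag, or None when the file is not a supported type
    (corresponding to Tf == null in Step 1)."""
    if data.startswith(b"%PDF"):
        return "AdobeAcrobat"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "MicrosoftOfficeLegacy"   # OLE2 container (.doc, .xls, ...)
    if data.startswith(b"PK\x03\x04"):
        return "MicrosoftOfficeOOXML"    # ZIP container (.docx, .xlsm, ...)
    return None
```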
Extraction of Scripts
The following data is defined:
Step 2: As shown at B804 in
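The extraction step might be sketched as follows for the PDF case, where JavaScript commonly appears as the string operand of a /JS entry. This is a rough assumption-laden illustration (a real extractor must fully parse the container format, handle streams, encodings, and nested objects); the function name `extract_scripts` is hypothetical.

```python
import re

def extract_scripts(file_type: str, data: bytes):
    """Return the list of scripts Sf embedded in file f (simplified sketch).
    Only a crude PDF path is shown; other container types would need
    their own parsers (e.g., the VBA project inside an Office file)."""
    scripts = []
    if file_type == "AdobeAcrobat":
        # Very rough: pull simple string operands of /JS entries.
        for m in re.finditer(rb"/JS\s*\((.*?)\)", data, re.DOTALL):
            scripts.append(m.group(1).decode("latin-1"))
    return scripts
```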
Whitelisting of Benign Scripts
Files created with application software suites such as Microsoft® Office® and Adobe® Acrobat® may contain benign scripts. For example,
Another example of a benign script is shown in
One embodiment defines an applyWhitelist function. The following data is defined:
Step 3: As shown at B806 in
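The whitelisting step might be sketched by hashing each extracted script and dropping those whose hash matches a known benign script; the names `apply_whitelist` and `sha256_hex`, and the use of SHA-256 specifically, are assumptions for illustration.

```python
import hashlib

def sha256_hex(script: str) -> str:
    """Stable fingerprint of a script's exact text."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()

def apply_whitelist(scripts, whitelist_hashes):
    """Drop scripts whose fingerprint matches a known benign script,
    returning the list of suspect scripts S'f (Step 3)."""
    return [s for s in scripts if sha256_hex(s) not in whitelist_hashes]
```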
Size Condition on Suspect Scripts
At this point of execution of the present computer-implemented method according to an embodiment, a non-zero list of suspect scripts S′f={s′f,1, . . . , s′f,p} has been extracted from file f. The algorithm should be provided with sufficient data to determine, with the requisite degree of accuracy, whether the code is obfuscated or not. Indeed, if there is insufficient data, a sufficiently accurate statistical representation of the suspect scripts may not be obtained.
The following data may be defined:
Step 4: As suggested at B810 in
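The size condition might be sketched as a simple total-byte-count check; the constant value 256 and the function names are illustrative assumptions, not values from the source.

```python
MIN_SUSPECT_SCRIPTS_SIZE = 256  # bytes; illustrative value only

def suspect_scripts_size(scripts):
    """Total size of the suspect scripts S'f, in bytes."""
    return sum(len(s.encode("utf-8")) for s in scripts)

def has_enough_data(scripts):
    """Step 4: proceed only when there is enough data to build a
    sufficiently accurate statistical representation."""
    return suspect_scripts_size(scripts) >= MIN_SUSPECT_SCRIPTS_SIZE
```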
Determination of Scripting Language
The following data may be defined:
Step 5: If the SuspectScriptsSize is sufficiently large, the scripting language Lf may be identified, as suggested at B812 in
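One naive way Step 5 could be sketched is a keyword-vote heuristic between VBA and JS; the keyword sets and the tie-breaking behavior are assumptions (a production system would likely rely on the container type or a trained classifier).

```python
def identify_language(scripts):
    """Step 5 sketch: keyword-vote heuristic to decide between VBA and JS.
    Returns 'VBA', 'JS', or None when undetermined."""
    text = "\n".join(scripts)
    vba_score = sum(text.count(k) for k in ("Sub ", "End Sub", "Dim ", "Set "))
    js_score = sum(text.count(k) for k in ("function", "var ", "=>", ";"))
    if vba_score == js_score:
        return None  # undetermined
    return "VBA" if vba_score > js_score else "JS"
```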
Statistical Modeling of Scripting Languages
Code obfuscation techniques, such as those presented in
The following data is defined:
For each scripting language L, a non-obfuscated code model corpus ModelCorpusL may be built. For example:
One or several discrete probability distribution models ML={ML,1, . . . , ML,q} may be generated from the parsing and analysis of ModelCorpusL, examples of which are provided in
Table 1 shows MJS,3, i.e., the discrete probability distribution of character unigrams of special characters of ModelCorpusJS.
Similarly, one or several discrete probability distributions PL,f={PL,f,1, . . . , PL,f,q} may be generated from the parsing and analysis of the list of suspect scripts S′f={s′f,1, . . . , s′f,p}.
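One of these distributions, the special-character unigram model, might be computed as below. The special-character set and the function name are illustrative assumptions; the same routine can serve both the model corpus (ML) and the suspect scripts (PL,f).

```python
from collections import Counter

# ASCII punctuation/special characters (illustrative feature set).
SPECIAL = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def special_char_distribution(text: str):
    """Discrete probability distribution of special-character unigrams,
    as a {character: probability} dict (empty if no special characters)."""
    counts = Counter(ch for ch in text if ch in SPECIAL)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}
```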
Distances Computation Between Models and Suspect Scripts
Step 6: As shown at B816 in
Considering now the previously-presented obfuscation techniques and the discrete probability distribution models presented relative to
Table 2 shows the discrete probability distribution of character unigrams of special characters of the obfuscated script presented at 104.
The computation of distances between ML={ML,1, . . . , ML,q} and PL,f={PL,f,1, . . . , PL,f,q} is helpful in characterizing and detecting many obfuscation techniques, as long as the models are carefully defined and constructed. For example, if the Jensen-Shannon distance JSD with base 2 logarithm between the probability distributions of Table 1 and Table 2 is computed, then JSD=0.650, where JSD is rounded to three decimal places.
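The base-2 Jensen-Shannon distance between two such distributions can be computed directly from its definition (the square root of the Jensen-Shannon divergence); the implementation below is a self-contained sketch over {symbol: probability} dicts.

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base-2 logarithm) between two discrete
    probability distributions given as {symbol: probability} dicts.
    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint ones."""
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint support.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl(a):
        # Kullback-Leibler divergence KL(a || m), restricted to a's support.
        return sum(a.get(x, 0.0) * math.log2(a.get(x, 0.0) / m[x])
                   for x in support if a.get(x, 0.0) > 0.0)

    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))
```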
The following data are defined:
Step 7: Compute distances between ML and PL,f: D=Dist(ML,PL,f), as shown at B816 in
Evaluation of Distances Between Probability Distributions
Finally, according to one embodiment, the distance D is evaluated with the EvaluateDist function defined below:
In order to set the threshold to a value yielding satisfactory detection results, several methods may be applied. In one embodiment, the threshold may be set by considering the bounds of the distance algorithm used. For example, considering the Jensen-Shannon distance with base 2 logarithm, EvaluateDistThreshold could be set to 0.5, as the Jensen-Shannon distance with base 2 logarithm between two probability distributions P and Q has the following property: 0≤JSD(P∥Q)≤1.
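With the bound-based setting, EvaluateDist might be sketched as a plain threshold comparison; since the source's exact definition of EvaluateDist is not reproduced here, the comparison direction and return labels below are assumptions.

```python
EVALUATE_DIST_THRESHOLD = 0.5  # bound-based setting for the base-2 JSD

def evaluate_dist(d: float) -> str:
    """EvaluateDist sketch: classify the computed distance D against the
    threshold. Larger distances from the non-obfuscated model corpus are
    taken to indicate obfuscation."""
    return "Obfuscated" if d >= EVALUATE_DIST_THRESHOLD else "NonObfuscated"
```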
In one embodiment, the threshold may be set to a dynamically-determined value by applying the EvaluateFile function on a test corpus TestCorpusL constructed beforehand for this purpose. TestCorpusL may include t application software files FNonObf={fNonObf,1, . . . , fNonObf,t} with non-obfuscated code and t application software files FObf={fObf,1, . . . , fObf,t} with obfuscated code, where code is written in scripting language L. Then, the following algorithm may be applied:
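The dynamic determination might be sketched as a threshold sweep over the distances produced for the labeled test corpus, keeping the candidate that maximizes accuracy. The function name, the 0.01 step, and the accuracy criterion are illustrative assumptions about the elided algorithm.

```python
def best_threshold(non_obf_distances, obf_distances, step=0.01):
    """Sweep candidate thresholds in [0, 1] and return (threshold, accuracy)
    maximizing accuracy on the labeled test corpus: non-obfuscated files
    should fall below the threshold, obfuscated files at or above it."""
    total = len(non_obf_distances) + len(obf_distances)
    best_t, best_acc = 0.0, -1.0
    t = 0.0
    while t <= 1.0:
        correct = (sum(d < t for d in non_obf_distances)
                   + sum(d >= t for d in obf_distances))
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = t, acc
        t = round(t + step, 10)  # avoid floating-point drift
    return best_t, best_acc
```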
Step 8: Finally, as shown at B818 in
Use Case Example: Email Received by a MTA
As shown in
Note that
Furthermore, in the case where at least one email attachment of the email contains potentially malicious code, alternative defensive policies may be applied including, for example, deleting the email, removing each potentially malicious attachment from the email and delivering the sanitized email to the end user's inbox, performing a behavioral analysis of each potentially malicious attachment with a sandboxing technology, and delegating the delivery decision (to deliver or not to deliver the email and/or its attachment) to the sandboxing technology, to name but a few of the possibilities. Another defensive action that may be taken if the extracted attachment is determined to contain obfuscated code may include disabling a functionality of the obfuscated code before delivery to the end user. Note that, in one embodiment, the EvaluateFile function may be provided as an HTTP-based API, as shown in
In other embodiments, the computer-implemented method may further comprise applying a whitelist of known, non-obfuscated scripts against the extracted script(s) and the distance may be computed only on those extracted scripts (if any) having no counterpart in the whitelist. The method may also comprise determining the scripting language of the extracted script(s). The computer-implemented method may further comprise computing a probability distribution of the one or more features (variable names, function names, comments, alphanumeric characters and/or special characters, for example) of the extracted script(s). In that case, the computed distance measure may comprise a computed distance between the computed probability distribution of the one or more features of the extracted script(s) and a previously-computed probability distribution of the corresponding one or more selected features of scripts of a model corpus of non-obfuscated script files. For example, the computed distance may be a Jensen-Shannon distance or a Wasserstein distance.
In one embodiment, the defensive action may include delivering the received electronic message to a predetermined folder (such as a spam folder, for example), deleting the electronic message and/or its attachment, and/or delivering a sanitized version of the attachment, without the obfuscated code, to an end user. When the extracted script(s) is determined to not comprise obfuscated code, the method may further comprise forwarding the electronic message and the attachment to an end user. The computer-implemented method, in one embodiment, may be at least partially performed by a MTA.
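An MTA-side decision loop over these policies might be sketched as follows; the function name `filter_email`, the policy labels, and the assumed return values of the evaluation function ('Obfuscated', 'NonObfuscated', 'NoCode') are illustrative assumptions.

```python
def filter_email(attachments, evaluate_file, policy="quarantine"):
    """Sketch of an MTA-side defensive policy: run the evaluation function
    on each (name, bytes) attachment and decide how to handle the email.
    Returns (action, flagged_attachment_names)."""
    flagged = [name for name, data in attachments
               if evaluate_file(data) == "Obfuscated"]
    if not flagged:
        return "deliver", []          # no obfuscated code found
    if policy == "delete":
        return "delete", flagged      # drop the whole email
    if policy == "sanitize":
        return "deliver_sanitized", flagged  # strip flagged attachments
    return "quarantine", flagged      # e.g., route to a spam folder
```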
As shown, the storage device 1207 may include direct access data storage devices such as magnetic disks 1230, non-volatile semiconductor memories (EEPROM, Flash, etc.) 1232, and/or a hybrid data storage device comprising both magnetic disks and non-volatile semiconductor memories, as suggested at 1231. References 1204, 1206 and 1207 are examples of tangible, non-transitory computer-readable media having data stored thereon representing sequences of instructions which, when executed by one or more computing devices, implement the computer-implemented methods described and shown herein. Some of these instructions may be stored locally in a client computing device, while others of these instructions may be stored (and/or executed) remotely and communicated to the client computing device over the network 1226. In other embodiments, all of these instructions may be stored locally in the client or other standalone computing device, while in still other embodiments, all of these instructions are stored and executed remotely (e.g., in one or more remote servers) and the results communicated to the client computing device. In yet another embodiment, the instructions (processing logic) may be stored on another form of a tangible, non-transitory computer readable medium, such as shown at 1228. For example, reference 1228 may be implemented as an optical (or some other storage technology) disk, which may constitute a suitable data carrier to load the instructions stored thereon onto one or more computing devices, thereby re-configuring the computing device(s) to one or more of the embodiments described and shown herein. In other implementations, reference 1228 may be embodied as an encrypted solid-state drive. Other implementations are possible.
Embodiments of the present invention are related to the use of computing devices to implement novel detection of obfuscated code. Embodiments provide specific improvements to the functioning of computer systems by defeating mechanisms implemented by cybercriminals to obfuscate code and evade detection of their malicious code. Using such an improved computer system, URL scanning technologies such as disclosed in commonly-assigned U.S. patent application Ser. No. 16/368,537 filed on Mar. 28, 2019, the disclosure of which is incorporated herein by reference in its entirety, may remain effective to protect end-users by detecting and blocking cyberthreats employing obfuscated code. According to one embodiment, the methods, devices and systems described herein may be provided by one or more computing devices in response to processor(s) 1202 executing sequences of instructions, embodying aspects of the computer-implemented methods shown and described herein, contained in memory 1204. Such instructions may be read into memory 1204 from another computer-readable medium, such as data storage device 1207 or another (optical, magnetic, etc.) data carrier, such as shown at 1228. Execution of the sequences of instructions contained in memory 1204 causes processor(s) 1202 to perform the steps and have the functionality described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the described embodiments. Thus, embodiments are not limited to any specific combination of hardware circuitry and software. Indeed, it should be understood by those skilled in the art that any suitable computer system may implement the functionality described herein. The computing devices may include one or a plurality of microprocessors working to perform the desired functions.
In one embodiment, the instructions executed by the microprocessor or microprocessors are operable to cause the microprocessor(s) to perform the steps described herein. The instructions may be stored in any computer-readable medium. In one embodiment, they may be stored on a non-volatile semiconductor memory external to the microprocessor or integrated with the microprocessor. In another embodiment, the instructions may be stored on a disk and read into a volatile semiconductor memory before execution by the microprocessor.
Portions of the detailed description above describe processes and symbolic representations of operations by computing devices that may include computer components, including a local processing unit, memory storage devices for the local processing unit, display devices, and input devices. Furthermore, such processes and operations may utilize computer components in a heterogeneous distributed computing environment including, for example, remote file servers, computer servers, and memory storage devices. These distributed computing components may be accessible to the local processing unit by a communication network.
The processes and operations performed by the computer include the manipulation of data bits by a local processing unit and/or remote server and the maintenance of these bits within data structures resident in one or more of the local or remote memory storage devices. These data structures impose a physical organization upon the collection of data bits stored within a memory storage device and represent electromagnetic spectrum elements.
A process, such as the computer-implemented detection of obfuscated code in application software files methods described and shown herein, may generally be defined as being a sequence of computer-executed steps leading to a desired result. These steps generally require physical manipulations of physical quantities. Usually, though not necessarily, these quantities may take the form of electrical, magnetic, or optical signals capable of being stored, transferred, combined, compared, or otherwise manipulated. It is conventional for those skilled in the art to refer to these signals as bits or bytes (when they have binary logic levels), pixel values, words, values, elements, symbols, characters, terms, numbers, points, records, objects, images, files, directories, subdirectories, or the like. It should be kept in mind, however, that these and similar terms should be associated with appropriate physical quantities for computer operations, and that these terms are merely conventional labels applied to physical quantities that exist within and during operation of the computer.
It should also be understood that manipulations within the computer are often referred to in terms such as adding, comparing, moving, positioning, placing, illuminating, removing, altering and the like. The operations described herein are machine operations performed in conjunction with various input provided by a human or artificial intelligence agent operator or user that interacts with the computer. The machines used for performing the operations described herein include local or remote general-purpose digital computers or other similar computing devices.
In addition, it should be understood that the programs, processes, methods, etc. described herein are not related or limited to any particular computer or apparatus nor are they related or limited to any particular communication network architecture. Rather, various types of general-purpose hardware machines may be used with program modules constructed in accordance with the teachings described herein. Similarly, it may prove advantageous to construct a specialized apparatus to perform the method steps described herein by way of dedicated computer systems in a specific network architecture with hard-wired logic or programs stored in nonvolatile memory, such as read only memory.
While certain example embodiments have been described, these embodiments have been presented by way of example only and are not intended to limit the scope of the embodiments disclosed herein. Thus, nothing in the foregoing description is intended to imply that any particular feature, characteristic, step, module, or block is necessary or indispensable. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the embodiments disclosed herein.