The present invention relates to computer security.
Byte-distribution analysis is a statistical analysis technique that has been used to classify digital data. Byte-distribution analysis generally involves examining a binary file in terms of its byte constituents. I.e., a binary file is a sequence of bytes with values i, ranging from i=0 to i=255, and each byte value has a frequency of occurrence, fi, within the file. Byte-distribution analysis uses the histogram of frequencies fi, 0≦i≦255, to classify a file.
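By way of illustration only, the histogram of frequencies fi may be computed along the following lines. This is a minimal sketch in Python; the function name byte_histogram is hypothetical and is not taken from the described embodiments.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-entry list; entry i is the number of bytes with value i."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return [counts.get(i, 0) for i in range(256)]


# Example: two bytes with value 0 and one byte with value 255.
hist = byte_histogram(b"\x00\x00\xff")
```

The list always has 256 entries, so byte values that never occur are represented by a frequency of zero rather than being omitted.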
Byte analysis is described in Abou-Assaleh, T., Cercone, N., Keselj, V. and Sweidan, R., N-gram based Detection of New Malicious Code, Proceedings of the 28th Annual International Computer Software and Applications Conference, IEEE, 2004. N-gram analysis is a generalization of byte-distribution analysis to sequences of N consecutive bytes (i1, i2, . . . , iN).
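The generalization to N-grams can be sketched as follows, assuming a sliding window that advances one byte at a time. The function name ngram_counts is hypothetical, and the sketch is not drawn from the cited reference.

```python
from collections import Counter


def ngram_counts(data: bytes, n: int) -> Counter:
    """Count every run of n consecutive bytes (i1, ..., iN) in data."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window slides one byte at a time, so windows overlap.
    return Counter(data[k:k + n] for k in range(len(data) - n + 1))


# With n = 1 this reduces to the plain byte-distribution histogram.
bigrams = ngram_counts(b"abab", 2)
```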
Prior art implementations of byte-distribution analysis for security analysis of files have not been sufficiently robust and accurate to make their way into commercial products. Such implementations suffer from false negatives and false positives. False negatives are malicious files that elude detection, and false positives are non-malicious files that are reported as being malicious. It is thus desirable to find an implementation of byte-distribution analysis whose rates of false negatives and false positives are low enough to warrant its commercial use.
The present invention concerns a method and system for scanning files for potential security threats, using a form of byte-distribution analysis, which is commercially viable. The present invention is based on the discovery that for files of certain mime types, including inter alia media files, if known spikes are removed from their byte-distribution histogram, then the remaining parts of the histograms are approximately uniformly distributed. The locations of the spikes are designated in a byte exclusion list.
For legitimate non-malicious files of a mime type amenable to byte-distribution analysis, removal of the excluded bytes from their histograms results in an approximately uniform distribution. For malicious files, however, the histogram for the non-excluded bytes exhibits spikes. Thus, a subject file can be classified as potentially malicious if its byte-distribution histogram for the non-excluded bytes deviates substantially from a uniform distribution.
There is thus provided in accordance with an embodiment of the present invention a method for scanning files for security, including receiving an unfamiliar file for scanning, generating a histogram of frequencies of occurrence of bytes within a buffer of file data from the unfamiliar file, excluding a designated set of bytes, and if the generated histogram of frequencies of occurrence of the non-excluded bytes deviates substantially from a reference distribution, then signaling that the unfamiliar file is potentially malicious.
There is additionally provided in accordance with an embodiment of the present invention a system for scanning files for security, including a histogram generator for building a histogram of frequencies of occurrences of bytes within a buffer of file data from an unfamiliar file, excluding frequencies of a designated set of bytes, and a threshold detector for detecting if the frequencies of the non-excluded bytes deviate substantially from a reference distribution.
There is further provided in accordance with an embodiment of the present invention a computer-readable storage medium storing program code for causing a computing device to receive an unfamiliar file for scanning, to generate a histogram of frequencies of occurrence of bytes within the buffer of file data, excluding a designated set of bytes, and if the generated histogram of frequencies of occurrence of the non-excluded bytes deviates substantially from a reference distribution, then to signal that the unfamiliar file is potentially malicious.
The present invention will be more fully understood and appreciated from the following detailed description, taken in conjunction with the drawings in which:
The present invention concerns analysis of files for the presence of malicious embedded code. The present invention uses byte-distribution analysis and achieves sufficient accuracy to make it commercially viable.
Reference is now made to
The algorithm described herein below is designed to recognize these differences, and to issue a warning signal accordingly. Briefly, the algorithm identifies mime types for which there is approximate uniformity in their histograms, and then uses this uniformity to recognize potentially malicious files.
Reference is now made to
From inspection of
The training algorithm shown in
Generally, a byte i is classified as an outlier if the frequency fi differs from the average of all of the frequencies by more than a percentage of the average, such as 30% of the average.
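The outlier rule just described may be sketched as follows, assuming the 30% figure given above and taking the average over all 256 frequencies. The names find_outliers and fraction are hypothetical.

```python
def find_outliers(freqs: list[int], fraction: float = 0.30) -> set[int]:
    """Return byte values whose frequency differs from the average of all
    frequencies by more than `fraction` of that average."""
    avg = sum(freqs) / len(freqs)
    return {i for i, f in enumerate(freqs) if abs(f - avg) > fraction * avg}


# A flat histogram with a single spike at byte value 0.
freqs = [100] * 256
freqs[0] = 1000
outliers = find_outliers(freqs)
```

One way to cumulate the set E over a training set of good files would be to take the union of the sets returned for each file.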
At step 215 the average, AVG, of the non-excluded frequencies is calculated; namely,
AVG = (1/n) Σ fi, summed over i ∉ E, (1)
where n is the number of non-excluded bytes. At step 220 a threshold, τ, is defined by
τ = max |fi−AVG| / AVG, (2)
i.e., τ is the largest absolute ratio |fi−AVG| / AVG over the non-excluded bytes i, for all good files. It is noted that the threshold, τ, is being cumulatively generated. I.e., as each good file is processed, the threshold τ is increased if an absolute ratio |fi−AVG| / AVG exceeds the current value of τ. It will be appreciated by those skilled in the art that alternatively the threshold τ may be computed in a separate loop over the good files, after the loop with steps 205-215 is completed and the set E, of excluded bytes, has been completely cumulated.
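The cumulative generation of the threshold τ may be sketched as follows, reading EQUATION 1 as the mean of the non-excluded frequencies and EQUATION 2 as the largest ratio of absolute deviation to that mean. The function names, and the handling of an empty or all-zero histogram, are assumptions made for illustration.

```python
def average_non_excluded(freqs: list[int], excluded: set[int]) -> float:
    """Mean of the frequencies fi for byte values i not in the excluded set."""
    kept = [f for i, f in enumerate(freqs) if i not in excluded]
    return sum(kept) / len(kept) if kept else 0.0


def update_threshold(tau: float, freqs: list[int], excluded: set[int]) -> float:
    """Raise tau if some non-excluded byte of this good file has a larger
    ratio |fi - AVG| / AVG than any seen so far."""
    avg = average_non_excluded(freqs, excluded)
    if avg == 0:
        return tau  # assumption: a file with no usable data leaves tau unchanged
    ratios = [abs(f - avg) / avg for i, f in enumerate(freqs) if i not in excluded]
    return max(tau, max(ratios))


# Cumulate tau over two hypothetical good files, with byte 0 excluded.
excluded = {0}
tau = 0.0
for freqs in ([1000] + [100] * 255, [0, 110, 90] + [100] * 253):
    tau = update_threshold(tau, freqs, excluded)
```

Because τ only ever increases, the order in which the good files are processed does not affect its final value.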
Processing then returns to step 205. If all of the good files have been processed, then processing advances to step 225 and the bad files are processed. At step 225 a decision is made whether there are more bad files to process. If so, then at step 230 the average in EQUATION 1 is calculated for the next bad file to be processed, where fi is the frequency of occurrence of byte i in the bad file. At step 235 byte numbers, i, are determined for which
|fi−AVG|>AVG*τ. (3)
where τ is the threshold parameter determined from the good files, as above in EQUATION 2. The frequencies satisfying EQUATION 3 are considered as violating approximate uniformity, and in turn such violation signals that the file is potentially malicious. If none of the frequencies, fi, satisfy EQUATION 3, then the bad file being processed has eluded the test, and is considered a false negative.
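The test of EQUATION 3 may be sketched as follows. The function name violating_bytes is hypothetical; an empty result corresponds to a file that has eluded the test.

```python
def violating_bytes(freqs: list[int], excluded: set[int], tau: float) -> list[int]:
    """Return non-excluded byte values i with |fi - AVG| > AVG * tau."""
    kept = {i: f for i, f in enumerate(freqs) if i not in excluded}
    if not kept:
        return []
    avg = sum(kept.values()) / len(kept)  # average over non-excluded bytes only
    return [i for i, f in kept.items() if abs(f - avg) > avg * tau]


# A histogram with one spike at byte value 65 violates approximate uniformity.
freqs = [100] * 256
freqs[65] = 400
spikes = violating_bytes(freqs, excluded=set(), tau=0.5)
```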
Processing then returns to step 225. If all of the bad files have been processed, then processing advances to step 240 where the percentage of false negatives is calculated; namely, the number of bad files that eluded the test, divided by the total number of bad files that were tested. At step 245 a decision is made whether the percentage of false negatives is greater than a pre-designated percentage, PERC; for example, PERC=50%. If not, then at step 250 the specific mime type being tested by the training algorithm is designated as suitable for byte-distribution analysis. Otherwise, if the percentage of false negatives is greater than PERC, then at step 255 the specific mime type being tested by the training algorithm is designated as unsuitable for byte-distribution analysis.
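The suitability decision of steps 240 through 255 may be sketched as follows, with PERC expressed as a fraction. The function names and the representation of results as a list of Boolean flags are assumptions made for illustration.

```python
def false_negative_rate(eluded: list[bool]) -> float:
    """Fraction of bad files that eluded the test (True means eluded)."""
    return sum(eluded) / len(eluded)


def mime_type_is_suitable(eluded: list[bool], perc: float = 0.50) -> bool:
    """The mime type is suitable unless the false negative rate exceeds PERC."""
    return not (false_negative_rate(eluded) > perc)


# One of four hypothetical bad files eluded detection: rate 25%, below PERC.
suitable = mime_type_is_suitable([True, False, False, False])
```

A rate exactly equal to PERC is treated as suitable, since the decision at step 245 asks whether the rate is greater than PERC.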
After processing the training algorithm of
Reference is now made to
At step 300 a decision is made whether the mime type of the file to be analyzed is one of the types deemed suitable for byte-distribution analysis in the training phase. If not, then processing advances to step 370 and no conclusion can be made.
Otherwise, if the file is of a type deemed suitable for byte-analysis, then at step 310 the bytes i ∈ E are excluded, where E is the list of excluded bytes determined in the training phase. At step 320 a buffer of a designated size of bytes from the file is received. The size of the buffer is a parameter, BUFFER_SIZE. It will be appreciated by those skilled in the art that use of a fixed size buffer for byte-distribution analysis has several advantages. It serves to control the size of the data stream being statistically analyzed, since files input to the scanning algorithm may be of arbitrary sizes.
At step 330 a byte-distribution histogram of frequencies, fi, i ∉ E, is generated. Generally, steps 330 and 340 are repeated until the entire file is processed. Alternatively, if the file is very large, then steps 330 and 340 may be repeated until a designated number of bytes have been processed; or in some instances, depending on the size of BUFFER_SIZE, steps 330 and 340 may be performed only once, without repetition.
At step 340 the average in EQUATION 1 is calculated. At step 350 a decision is made whether any of the frequencies fi, i ∉ E, satisfies EQUATION 3, thereby violating the approximate uniformity. If so, then at step 360 the file is deemed potentially malicious. Otherwise, if none of the frequencies fi, i ∉ E, satisfy EQUATION 3, then no conclusion is made.
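The scanning steps may be sketched as follows. The sketch assumes that each buffer is tested independently and that scanning stops at the first violating buffer; the description above leaves open whether frequencies are accumulated across buffers. The function name scan_stream, the result strings, and the default buffer size are hypothetical.

```python
import io


def scan_stream(stream, excluded: set[int], tau: float,
                buffer_size: int = 4096, suitable: bool = True) -> str:
    """Scan a binary stream one buffer at a time using EQUATION 3."""
    if not suitable:
        return "inconclusive"  # mime type not amenable to the analysis
    while True:
        buf = stream.read(buffer_size)
        if not buf:
            break
        freqs = [0] * 256
        for b in buf:
            freqs[b] += 1
        kept = [f for i, f in enumerate(freqs) if i not in excluded]
        if not kept:
            continue
        avg = sum(kept) / len(kept)
        if any(abs(f - avg) > avg * tau for f in kept):
            return "potentially malicious"
    return "inconclusive"


# A buffer that is half uniform data and half a run of byte value 0x41.
spiky = bytes(range(256)) * 8 + b"\x41" * 2048
result = scan_stream(io.BytesIO(spiky), excluded=set(), tau=0.5)
```

Note that the sketch never reports a file as safe: when no frequency satisfies EQUATION 3, the result is inconclusive, as in the description above.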
The following is an example of a configuration file used in the scanning phase for GIF image files, with parameters that were determined in the training phase.
Generally, each mime type has a unique configuration file. The parameters DefaultHeaderSize and DefaultTrailerSize are header and trailer sizes of histograms that are treated as outliers. A DefaultHeaderSize of 20 indicates that bytes 0-19 are treated as outliers, and a DefaultTrailerSize of 15 indicates that bytes 241-255 are treated as outliers.
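The mapping from the DefaultHeaderSize and DefaultTrailerSize parameters to a set of excluded byte values may be sketched as follows. The function name excluded_from_sizes is hypothetical; only the two parameter names and the example values come from the description above.

```python
def excluded_from_sizes(header_size: int, trailer_size: int) -> set[int]:
    """Byte values treated as outliers: the first header_size values and
    the last trailer_size values of the range 0..255."""
    header = set(range(header_size))
    trailer = set(range(256 - trailer_size, 256))
    return header | trailer


# DefaultHeaderSize = 20 and DefaultTrailerSize = 15, as in the example above.
excluded = excluded_from_sizes(20, 15)
```

With these values the set contains bytes 0-19 and 241-255, which is 35 byte values in total.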
Reference is now made to
The parameter designation component includes a processor 400 for processing a training set of good, i.e., non-malicious files, and a processor 410 for processing a training set of bad, i.e., malicious files. The two training sets include files of a specific mime type, such as MPEG image files, or MP3 audio files, or MP4 video files.
Processor 400 includes a spike filter 420, for identifying spikes in a histogram of byte frequencies for a file from the training set of good files. Spike filter 420 generates a list of bytes to be excluded, in order that the remaining bytes have an approximately uniform distribution.
Processor 400 further includes an average calculator 430, for calculating an average frequency, AVG, for the non-excluded bytes. Average calculator 430 uses EQUATION 1 above to calculate the value of AVG.
Processor 400 further includes a threshold calculator 440, for calculating a threshold, τ, according to EQUATION 2 above.
The output of processor 400 includes a list of excluded bytes and a threshold, which in turn are inputs to processor 410.
Processor 410 includes an average calculator 450, for calculating an average frequency, AVG, for a file from the training set of bad files. Average calculator 450 uses EQUATION 1 above to calculate the value of AVG. Processor 410 also includes a false negative calculator for checking whether or not a frequency of occurrence of a non-excluded byte in the file deviates from AVG substantially, according to EQUATION 3. If not, then the bad file being processed has eluded the byte-distribution test, and represents a false negative.
The output of processor 410 includes an indication of whether or not the mime type of the files being tested is deemed suitable for byte-distribution analysis. Thus, after the parameter designation component of
The real-time scanning component of
Processor 470 generally does not operate on the entire input file. Instead, a fixed length buffer of data from the input file is analyzed. Processor 470 includes an average calculator 480, for calculating an average, AVG, of frequencies of occurrences of bytes, for the non-excluded byte values, for the data in the buffer. Average calculator 480 uses EQUATION 1 to calculate the value of AVG. Processor 470 also includes a threshold detector 490, for determining if any of the frequencies of occurrence of a non-excluded byte deviates from AVG according to EQUATION 3. If so, the subject file is signaled as being potentially malicious. If not, the result of the scan is inconclusive.
It will be appreciated by those skilled in the art that in some circumstances it may be advantageous to pre-process a file by transforming the file, prior to scanning by processor 470. Thus, (i) files such as Java applets, which include byte code, may be disassembled prior to processing; and (ii) files that are encoded may be decoded prior to processing. Moreover, (iii) files of a specific mime type that generally have a substantially non-uniform byte distribution, such as a normal distribution, may be pre-processed by transforming them to files with a substantially uniform byte distribution; specifically, the individual byte values are transformed to other byte values, so that the resulting histogram has a substantially uniform distribution.
In reading the above description, persons skilled in the art will realize that there are many apparent variations that can be applied to the methods and systems described. Thus it will be appreciated that the methods described apply to general hypothesis analysis of files, including inter alia security analysis, type analysis and author analysis.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made to the specific exemplary embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Number | Date | Country | |
---|---|---|---|
20080276320 A1 | Nov 2008 | US |