This application claims priority to UK Application No. GB 2007055.3, filed May 13, 2020, under 35 U.S.C. § 119(a). The above-referenced patent application is incorporated by reference in its entirety.
The present disclosure relates to the field of software fingerprinting. It finds application in the software field in general. More particularly it relates to generating a fingerprint of a software file in order to identify the software file and thereby compare it with another software file. It may for example be used to fingerprint and thereby compare program files, which are also known as executables.
Software files, for example program files or executables (e.g. *.exe, *.dll), are conventionally identified as pertaining to a particular program, and more specifically to a version thereof.
Historically, it was feasible to identify executables or program files on a computer by reference to file metadata or to other attributes that are stored with the file. These techniques can be used even if the file names or even file contents are slightly different, e.g. due to being different versions of the same program. For example, the metadata may identify the originator: “Adobe®”, the program: “Acrobat®”, and its version: “11.0”. It would then be easy to identify another program with metadata “Adobe®”: “Acrobat®”: “11.1” as being a later version of the same program, even though the file contents would tend to differ, without recourse to any other kind of analysis.
Having this knowledge conveniently enables computer system managers to manage installed software across large estates of computers. For example, the knowledge can be used to audit and track installed software or cleanse computer systems, for instance by removing old versions of software. In other instances, checks can be made to ensure that all installed software is correctly licensed, by comparing license information (e.g. we have a license for v11.0 of some software) with the software that is installed on a computer (e.g. anyone found to be running v11.1 is not licensed to do so).
However, there remains room for improvement in identifying and comparing software files. Program files relating to Open Source software may lack file metadata or other attributes, making it difficult to use such known techniques to identify and compare programs. The ability to reliably identify or compare files may also be useful in cases where it is possible to fake the file metadata or other attributes so that a program with a virus appears to be legitimate. These problems are particularly acute for organisations wishing to manage and optimise large estates of computers.
Thus, a need exists for improved techniques for identifying and comparing software files.
According to a first aspect of the present disclosure a method of comparing a candidate file with an exemplar file is provided. The method includes:
According to a second aspect of the present disclosure a method of generating a candidate file fingerprint representing a candidate file is provided. This method includes:
A similar method may be used to generate the exemplar file fingerprint of the exemplar file.
In accordance with the nomenclature used herein the term “exemplar” file refers to a reference, or authentic version of a file, and against which a sample file, i.e. the “candidate” file is compared. The terms “candidate” and “exemplar” are therefore purely labels used to distinguish between these files.
In some examples of the present disclosure the candidate file and the exemplar file are described as being a program file; a program file being defined herein as a file comprising software code used to run a program. The software code may be un-compiled, or it may have been compiled. In other words it may be source code or machine code. A program file is also commonly referred to as an executable file. Executable files are ubiquitous in the Microsoft® Windows® operating system and typically have the file extension “*.exe”. However, the present disclosure also finds application with other types of program files such as, and without limitation, Dynamic Link Library (*.DLL) files that are used in conjunction with such executable files. It is therefore to be appreciated that the candidate file and the exemplar file may in general be any software file. The present disclosure may therefore be used with files having different file extensions to *.exe, and *.DLL, for example with data files or document files, as well as with files that have no file extension at all. It is also noted that the present disclosure finds application with different operating systems to Microsoft® Windows®. Non-limiting examples of alternative operating systems in which the present disclosure also finds application include: Linux®, macOS (formerly OS X), iOS and Android.
As described in more detail below, the methods described herein may be implemented by a computer. The methods may therefore be carried out by a combination of software and hardware. Such a combination may for instance include one or more processors and one or more memories that store instructions corresponding to the method, which instructions, when executed by the one or more processors, cause the one or more processors to carry out the described method.
Further features and advantages of the present disclosure will become apparent from the following description, which is made with reference to the accompanying drawings.
Some examples described herein provide a method of generating a candidate file fingerprint representing a candidate file. Other examples described herein relate to a method of comparing a candidate file with an exemplar file using the candidate file fingerprint. One example relates to a computer program product. It is to be appreciated that features described in relation to one example may equally be used in another example and that all features are not necessarily duplicated in each example for the sake of brevity.
The above method is illustrated in more detail in
With reference to
After receiving the candidate file CF, the candidate file data CFD is processed to generate a candidate file fingerprint CFF representing the candidate file CF. The candidate file fingerprint CFF includes a plurality of fingerprint strings FPS1 . . . m, each representing a portion of the candidate file data CFD. The processing involves applying a rolling hash function RHF to the candidate file data CFD in order to generate a sequence of strings SOS.
Broadly speaking, a hash function maps a string of input data elements to a string of output data elements. A string of data elements is a sequence of characters of an alphabet, such as 1's and 0's, or other characters. The string of output data elements generated by the hash function is sometimes termed a “hash string” or simply a “hash”. Hash functions are typically chosen on the basis that the chance of “collisions”, i.e. the mapping of different strings of input data elements to the same string of output data elements, is negligible. In so doing, the hash can be thought of as providing a near unique identifier of the string of input data elements.
In the method of the present disclosure, a rolling hash function RHF is applied to portions of the candidate file data CFD, i.e. to strings of input data elements, in order to generate the sequence of strings SOS, i.e. strings of output data elements. A rolling hash function is used in particular because rolling hash functions can generate hash strings that are characteristic of the strings of input data elements in a computationally-efficient manner. This is now described with reference to
In the upper part of
Various rolling hash functions are suitable for generating each string of output data elements in the sequence of strings SOS. One example is the polynomial rolling hash, H:
$H = c_1 a^{m-1} + c_2 a^{m-2} + c_3 a^{m-3} + \dots + c_m a^{0}$ Equation 1
Here, a is a constant and c_1 . . . c_m are the m input data elements to which the hash is applied. The value of H may be computed modulo p, wherein p may be a prime number. In order to reduce the chance of collisions, p may be a large prime number and/or a may be larger than the size of the alphabet of possible input data elements.
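By way of a non-limiting illustration, Equation 1 and its incremental update may be sketched as follows. The function names, the byte-oriented input, the constant a = 257 and the modulus p = 2^61 − 1 are assumptions chosen for the sketch and are not taken from the present disclosure. The sketch shows why a rolling hash is computationally efficient: after the first window, each subsequent hash is obtained by removing the contribution of the outgoing element and adding the incoming element, rather than by re-evaluating the whole polynomial.

```python
def poly_hash(window: bytes, a: int, p: int) -> int:
    """Evaluate Equation 1 modulo p for one window (Horner's rule)."""
    h = 0
    for c in window:
        h = (h * a + c) % p
    return h


def rolling_hashes(data: bytes, m: int, a: int = 257, p: int = (1 << 61) - 1):
    """Yield the hash of every m-element window of data.

    a and p are illustrative choices: a exceeds the byte alphabet size (256)
    and p is a large prime.
    """
    if len(data) < m:
        return
    top = pow(a, m - 1, p)            # weight a^(m-1) of the outgoing element
    h = poly_hash(data[:m], a, p)
    yield h
    for i in range(m, len(data)):
        h = (h - data[i - m] * top) % p   # remove c_1 * a^(m-1)
        h = (h * a + data[i]) % p         # shift remaining terms, add new element
        yield h
```

Each value yielded by `rolling_hashes` equals the value that `poly_hash` would return for the same window, so the incremental form changes only the cost of the computation and not its result.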
Other types of rolling hash may alternatively be used, including the Rabin fingerprint, and the Cyclic polynomial. In one implementation, the Rabin-Karp Rolling Hash algorithm is used. This is described in document: “Efficient randomized pattern-matching algorithms”; IBM Journal of Research and Development, Volume: 31, Issue: 2, March 1987.
Returning to the above method, as indicated in
With reference to the decision box in
In some implementations, the substring from the sequence of strings SOS that is added to the candidate file fingerprint CFF in the above method is the entire string of output data elements that is generated by the rolling hash function RHF at the window position at which the determination is made. However, a reduction in the size of the candidate file fingerprint CFF may be achieved by including in the candidate file fingerprint CFF only a portion, i.e. not the whole, of the string of output data elements that is generated by the rolling hash function RHF at the window position at which the determination is made. In particular, it is noted that the predetermined string pattern PSP within each string of output data elements generated by the rolling hash function RHF that triggers the inclusion of a substring in the candidate file fingerprint CFF, “triggering string”, is the same for each triggering string. The predetermined string pattern PSP part of each triggering string therefore has only a minor contribution to the distinctiveness of each fingerprint. In order to reduce the size of a fingerprint, the predetermined string pattern PSP part, or another selection of data in the triggering string, may therefore be omitted from each fingerprint string FPS1 . . . m.
The predetermined string pattern PSP that is used in the above-described determination corresponds to a selection of one or more characters of each string of output data elements generated by the rolling hash function RHF. By way of an example implementation, a string of output data elements generated by the rolling hash function RHF may for instance have 64 bits and the predetermined string pattern PSP may correspond to the lowest 10 bits of the string each having a zero, “0”, value. With this implementation, a portion or all of a string of output data elements generated by the rolling hash function RHF would be included in the candidate file fingerprint CFF each time the lowest 10 bits of the string are all 0's. Different predetermined string patterns, for example patterns that make different selections of the characters in each string of output data elements generated by the rolling hash function RHF, or patterns having different values to the example 0 values above, may alternatively be used to trigger the inclusion of a fingerprint string FPS1 . . . m in the candidate file fingerprint CFF in a similar manner.
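The example pattern above, together with the optional omission of the pattern bits from each fingerprint string, may be sketched as follows. The names `PATTERN_BITS`, `is_triggering` and `strip_pattern` are hypothetical, and the hash value is assumed to be held as a non-negative integer.

```python
PATTERN_BITS = 10                      # example pattern length from the text
PATTERN_MASK = (1 << PATTERN_BITS) - 1  # selects the lowest 10 bits


def is_triggering(hash_value: int) -> bool:
    """True when the lowest PATTERN_BITS bits of the hash are all zero."""
    return (hash_value & PATTERN_MASK) == 0


def strip_pattern(hash_value: int) -> int:
    """Drop the pattern bits, which are identical in every triggering string."""
    return hash_value >> PATTERN_BITS
```

Because the discarded bits are the same for every triggering string, removing them shortens each fingerprint string without reducing its ability to distinguish one portion of the file data from another.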
In some implementations, rather than including in the candidate file fingerprint CFF a substring from the string of output data elements generated by the rolling hash function RHF at the window position in which the predetermined string pattern PSP appears, it may alternatively be a substring from another string of output data elements generated by the rolling hash function RHF that is included in the candidate file fingerprint CFF when the predetermined string pattern PSP appears in the sequence of strings SOS. It may for instance be a substring from a string of output data elements generated by the rolling hash function RHF that is near to, i.e. within approximately ±1-10 window positions of, the string of output data elements generated by the rolling hash function RHF in which the predetermined string pattern PSP appears, that is included in the candidate file fingerprint CFF.
Summarising the above, a fingerprint string FPS1 . . . m comprising a substring from the sequence of strings SOS is added to the candidate file fingerprint CFF when a predetermined string pattern PSP appears in the sequence of strings SOS.
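The summarised method may be sketched end to end as follows. This is an illustrative sketch only: the function name, the window width of 48 data elements, the constants a and p, and the choice to store the entire hash as the fingerprint string are assumptions made for the example.

```python
from typing import List


def generate_fingerprint(data: bytes, window: int = 48, a: int = 257,
                         p: int = (1 << 61) - 1,
                         pattern_bits: int = 10) -> List[int]:
    """Return the hashes whose lowest pattern_bits bits are all zero."""
    mask = (1 << pattern_bits) - 1
    fingerprint: List[int] = []
    if len(data) < window:
        return fingerprint
    top = pow(a, window - 1, p)        # weight of the outgoing element
    h = 0
    for c in data[:window]:            # hash of the first window position
        h = (h * a + c) % p
    if (h & mask) == 0:
        fingerprint.append(h)
    for i in range(window, len(data)):
        # Slide the window by one element and update the hash incrementally.
        h = ((h - data[i - window] * top) * a + data[i]) % p
        if (h & mask) == 0:            # predetermined pattern appears
            fingerprint.append(h)
    return fingerprint
```

Setting `pattern_bits` to zero makes every window position a trigger, which is a convenient way to check the sketch: the returned list then has one entry per window position.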
As mentioned above, the use of a rolling hash function RHF in the above-described method is computationally efficient at generating hashes. The use of a rolling hash function is also computationally efficient in generating the candidate file fingerprint CFF because it provides a mechanism for quickly determining at each window position whether or not to include a substring from the sequence of strings SOS in the candidate file fingerprint CFF.
After the candidate file fingerprint CFF has been generated, it may be stored in a memory or database, for example as an array, and/or linked to the candidate file CF. For example, the candidate file fingerprint CFF may be linked to the candidate file CF by providing the file fingerprint CFF with a pointer that points to the candidate file CF. The candidate file fingerprint CFF may alternatively or additionally be reported in combination with the name of the candidate file CF.
Candidate file fingerprints generated using the above method have advantageously been found to have only modest data storage requirements. Candidate file fingerprints CFF generated in accordance with some examples of the present disclosure have been on the order of 0.25% of the size of the candidate file CF. This value may be increased or decreased by varying the length of the predetermined string pattern PSP. The modest data storage requirements arise from only including fingerprint strings FPS1 . . . m in the candidate file fingerprint CFF when a predetermined string pattern PSP appears in the sequence of strings SOS. More particularly, it is because each substring that is included in the candidate file fingerprint is (a portion of) a string of output data elements that are generated by the rolling hash function RHF. Thus, the method of the present disclosure contrasts with other methods in which hashes of all the data in a file are included in a file fingerprint. Candidate file fingerprints generated in accordance with examples of the present disclosure have also been found to require only modest processing time. In some tests, around 4000 fingerprints per minute were generated. This makes the present disclosure particularly suitable for implementation across large estates of computers. In some examples, fingerprints may be generated on a single core of a processor, thereby avoiding interruptions to a user, or to other processor processes.
A further advantage offered by examples of the method of the present disclosure, specifically relating to the use of strings of output data elements generated by a rolling hash function RHF to trigger the inclusion of a substring in the candidate file fingerprint CFF, is that it provides fingerprints that are relatively robust to trivial data insertions or deletions to candidate file data CFD. Such changes tend to have a minor impact on the candidate file fingerprint CFF because they typically only affect fingerprint strings FPS1 . . . m that are local to the change. Specifically, a fingerprint string FPS1 . . . m is typically only altered, or removed, if a change occurs at a position at which a boundary position BP1 . . . k would have been generated in the candidate file data CFD, or if the change generates a new boundary position BP1 . . . k in the candidate file data CFD.
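The locality of such changes may be illustrated with the following sketch, which inserts a single data element into pseudo-random test data and compares the sets of fingerprint strings before and after. All names, the window width of 8, the 4-bit pattern and the test data generator are assumptions made for the illustration. For brevity, each window hash is evaluated directly from Equation 1 rather than by a rolling update, which yields the same values.

```python
from typing import Set


def pseudo_random_bytes(n: int, seed: int = 1) -> bytes:
    """Deterministic test data from a simple linear congruential generator."""
    x, out = seed, bytearray()
    for _ in range(n):
        x = (x * 1103515245 + 12345) % (1 << 31)
        out.append((x >> 16) & 0xFF)
    return bytes(out)


def fingerprint_set(data: bytes, window: int = 8, a: int = 257,
                    p: int = (1 << 61) - 1, pattern_bits: int = 4) -> Set[int]:
    """Set of window hashes whose lowest pattern_bits bits are all zero."""
    mask = (1 << pattern_bits) - 1
    found = set()
    for i in range(len(data) - window + 1):
        h = 0
        for c in data[i:i + window]:   # direct evaluation of Equation 1
            h = (h * a + c) % p
        if (h & mask) == 0:
            found.add(h)
    return found


original = pseudo_random_bytes(4000)
modified = original[:2000] + b"\x00" + original[2000:]   # one-element insertion
lost = fingerprint_set(original) - fingerprint_set(modified)
gained = fingerprint_set(modified) - fingerprint_set(original)
```

Only windows that span the insertion point can differ between the two files. With a window width of 8, at most 7 windows of the original data span that point and at most 8 windows of the modified data contain the inserted element, so `lost` and `gained` are bounded by those counts regardless of the file size.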
Referring again to
for each of n positions P1 . . . n in the submask SM, comparing a value in the submask SM with a corresponding value in each string in the sequence of strings SOS, and adding to the candidate file fingerprint CFF a fingerprint string FPS1 . . . m comprising a substring from the sequence of strings SOS if every value in the submask SM is identical to its corresponding value in the string in the sequence of strings SOS.
This is illustrated in more detail in
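The position-by-position comparison with the submask SM may be sketched as follows. For the purposes of the sketch, each string of output data elements is assumed to be held as a text string of “0” and “1” characters, and the submask is assumed to be a list of (position, value) pairs; both representations, and the function name, are hypothetical.

```python
from typing import Sequence, Tuple


def matches_submask(string: str, submask: Sequence[Tuple[int, str]]) -> bool:
    """True if, at every submask position, the string holds the submask value."""
    return all(string[position] == value for position, value in submask)


# Example submask with n = 3 positions: the last three characters must be "0".
example_submask = [(4, "0"), (5, "0"), (6, "0")]
```

With this example submask, the string “1011000” satisfies the comparison at every position and would trigger the addition of a fingerprint string, whereas “1011010” would not.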
In general, the likelihood of the predetermined string pattern PSP appearing in the sequence of strings SOS decreases as the length of the predetermined string pattern PSP increases. Increasing the length of the predetermined string pattern PSP therefore reduces the number of fingerprint strings FPS1 . . . m that are added to the candidate file fingerprint CFF. In some examples, distinctive file fingerprints may be generated with between 100 and 200 fingerprint strings. A tradeoff may therefore be made between the number of fingerprint strings in a candidate file fingerprint, the length of the predetermined string pattern PSP, and the distinctiveness of the fingerprint.
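The tradeoff may be estimated with a short calculation, under the idealising assumption (not stated in the present disclosure) that each bit of each string of output data elements is equally likely to be 0 or 1 and independent of the others. A predetermined string pattern that fixes k bits then appears at any given window position with probability 2^{-k}, so for candidate file data of N data elements and a window width of w, the expected number of fingerprint strings is approximately:

$$\mathbb{E}[m] \approx \frac{N - w + 1}{2^{k}}$$

For example, taking the hypothetical values N = 1,000,000 and k = 13, with w much smaller than N, this gives approximately 1,000,000 / 8192 ≈ 122 fingerprint strings. Each additional bit in the pattern halves the expected number, and each bit removed doubles it.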
In some candidate files there can be large amounts of similar data. This may be due to the presence of long strings of identical characters or due to large gaps between different sections of a file. When a rolling hash function is applied to such data it will tend to also produce identical strings, particularly when the width of the strings of identical data exceeds the width of the window applied to the input data. Including identical strings in the candidate file fingerprint CFF adds to its size but contributes little to its distinctiveness. In order to reduce the size of the candidate file fingerprint CFF, it may therefore be beneficial to only include distinct strings in the candidate file fingerprint CFF. In order to do this, in some implementations, only distinct fingerprint strings are added to the candidate file fingerprint. In other words, adding to the candidate file fingerprint CFF a fingerprint string FPS1 . . . m comprising a substring from the sequence of strings SOS when a predetermined string pattern PSP appears in the sequence of strings SOS, may comprise:
only adding to the candidate file fingerprint CFF a fingerprint string FPS1 . . . m comprising a substring from the sequence of strings SOS if said fingerprint string is distinct from every other fingerprint string already included in the candidate file fingerprint CFF.
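A minimal sketch of this distinctness check is given below. The function name is hypothetical, and fingerprint strings are assumed to be held as integers; a set is used to record the strings already included so that each check takes constant time on average.

```python
from typing import Iterable, List


def add_distinct(triggered: Iterable[int]) -> List[int]:
    """Keep each fingerprint string only the first time it occurs."""
    seen = set()
    fingerprint: List[int] = []
    for fingerprint_string in triggered:
        if fingerprint_string not in seen:   # distinct from all strings so far
            seen.add(fingerprint_string)
            fingerprint.append(fingerprint_string)
    return fingerprint
```

For instance, if the triggering strings arrive in the order 5, 5, 7, 5, 9, 7, the resulting fingerprint holds 5, 7 and 9 once each, in order of first appearance.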
Using the above-described method, one or more additional candidate file fingerprints may also be generated from the same candidate file in a similar manner, each using a different rolling hash function. Advantageously the file fingerprints may be generated simultaneously in order to save time. This is illustrated in
In order to generate such a second fingerprint, the above-described method of generating a candidate file fingerprint can further include:
processing the candidate file data CFD to generate a second candidate file fingerprint SCFF representing the candidate file CF, the second candidate file fingerprint SCFF comprising a plurality of fingerprint strings each representing a portion of the candidate file data CFD;
wherein, processing the candidate file data CFD to generate a second candidate file fingerprint SCFF representing the candidate file CF, comprises: applying a second rolling hash function SRHF to the candidate file data CFD to generate a second sequence of strings SSOS, and adding to the second candidate file fingerprint SCFF a fingerprint string comprising a substring from the second sequence of strings SSOS when a second predetermined string pattern SPSP appears in the second sequence of strings SSOS; and
wherein the second candidate file fingerprint SCFF is generated simultaneously with the candidate file fingerprint CFF.
The second rolling hash function SRHF is different to the rolling hash function RHF. As with the rolling hash function RHF described above, various rolling hash functions may be used for the second rolling hash function SRHF. With reference to Equation 1, the second rolling hash function SRHF may for instance use a different value for constant a to rolling hash function RHF. In one implementation the Rabin-Karp Rolling Hash algorithm is used.
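One way in which two or more fingerprints might be generated in a single pass over the candidate file data is sketched below. The sketch assumes that the rolling hash functions differ only in the constant a of Equation 1 (here 257 and 263, both illustrative), that the same predetermined string pattern is used for every fingerprint, and that "simultaneously" is realised by interleaving the updates in one loop rather than by parallel hardware. The function name and default parameters are hypothetical.

```python
from typing import List, Sequence


def multi_fingerprint(data: bytes, bases: Sequence[int] = (257, 263),
                      window: int = 48, p: int = (1 << 61) - 1,
                      pattern_bits: int = 10) -> List[List[int]]:
    """Return one fingerprint per base, all generated in a single pass."""
    mask = (1 << pattern_bits) - 1
    prints: List[List[int]] = [[] for _ in bases]
    if len(data) < window:
        return prints
    tops = [pow(a, window - 1, p) for a in bases]
    hashes = [0] * len(bases)
    for c in data[:window]:                      # first window, every base
        hashes = [(h * a + c) % p for h, a in zip(hashes, bases)]
    for i in range(window, len(data) + 1):
        for k, h in enumerate(hashes):           # test the pattern per hash
            if (h & mask) == 0:
                prints[k].append(h)
        if i == len(data):
            break
        outgoing, incoming = data[i - window], data[i]
        hashes = [((h - outgoing * t) * a + incoming) % p
                  for h, a, t in zip(hashes, bases, tops)]
    return prints
```

Because the candidate file data is read only once, the additional cost of the second fingerprint is limited to the extra arithmetic per window position.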
The above-described candidate file fingerprint CFF finds particular application in comparing the candidate file CF with an exemplar file EF. The method may for instance be used to determine how closely the two files match. The exemplar file EF may for example be an authentic version of a program file such as Adobe Acrobat version 11.1 and the method may be used to determine whether the candidate file CF is indeed the same version as the exemplar file EF based on the closeness of the match.
To this end, a method of comparing a candidate file CF with an exemplar file EF includes:
This method is illustrated with reference to
A value indicative of the similarity of the comparison may also be computed. This may subsequently be stored, or reported to a user.
The exemplar file fingerprint EFF representing the exemplar file EF is generated in a similar manner as the aforementioned candidate file fingerprint CFF; specifically by:
In the method of comparing a candidate file CF with an exemplar file EF, the comparison between the candidate file fingerprint CFF and the exemplar file fingerprint EFF may for instance be determined based on the proportion of fingerprint strings FPS1 . . . m in the candidate file fingerprint CFF that correspond to fingerprint strings in the exemplar file fingerprint EFF. In one implementation, comparing the candidate file fingerprint CFF with an exemplar file fingerprint EFF representing the exemplar file EF, comprises:
calculating a Jaccard similarity index across the fingerprint strings of the candidate file fingerprint CFF and the exemplar file fingerprint EFF.
The Jaccard similarity index J(X, Y) may be computed from the fingerprint strings X in the candidate file fingerprint CFF and the fingerprint strings Y in the exemplar file fingerprint EFF using Equation 2:
J(X,Y)=|X∩Y|/|X∪Y| Equation 2
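Equation 2 may be computed as in the following sketch, in which each fingerprint is treated as a set of fingerprint strings. The function name is hypothetical, and the handling of two empty fingerprints (returning 1.0) is an assumption, since Equation 2 is undefined in that case.

```python
from typing import Hashable, Iterable


def jaccard_index(x: Iterable[Hashable], y: Iterable[Hashable]) -> float:
    """Jaccard similarity |X ∩ Y| / |X ∪ Y| of two fingerprints."""
    xs, ys = set(x), set(y)
    union = xs | ys
    if not union:
        return 1.0   # assumption: two empty fingerprints are treated as identical
    return len(xs & ys) / len(union)
```

For example, fingerprints holding the strings {1, 2, 3} and {2, 3, 4} share two strings out of four distinct strings in total, giving an index of 0.5.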
It may also be useful to indicate whether a match between the candidate file CF and the exemplar file EF has been obtained. Comparing the candidate file fingerprint CFF with an exemplar file fingerprint EFF representing the exemplar file EF, may therefore comprise:
An exact match may for instance be represented by 1.0 and the predetermined threshold may for instance be 0.85 such that if the value indicative of the similarity of the comparison is greater than or equal to 0.85 the candidate file CF matches the exemplar file EF.
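The threshold test described above may be sketched as follows; the function name is hypothetical and the default threshold of 0.85 is the example value given in the text.

```python
def is_match(similarity: float, threshold: float = 0.85) -> bool:
    """True when the similarity value meets or exceeds the threshold."""
    return similarity >= threshold
```

With the example threshold, similarity values of 1.0 and 0.85 indicate a match, whereas a value of 0.84 does not.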
It may also be useful to determine whether a match exists between multiple candidate files and the exemplar file EF. In this case the method of comparing the candidate file CF with the exemplar file EF may include:
The comparison between the candidate file CF and the exemplar file EF as described in accordance with examples of the present disclosure has been found to be reliable because the fingerprints used in the comparison are determined by analysing data throughout the candidate file CF and the exemplar file EF. By contrast, techniques used to compare files based purely on file header information or a name of a file extension may be subject to malicious attempts to mask their appearance. Moreover, the file fingerprints generated in accordance with examples of the present disclosure and which are used in the comparison can be generated quickly and have a small size. This simplifies the processing and memory requirements of systems that are used to compare candidate files with exemplar files. Thus, the methods described herein enable systems managers to reliably manage installed software across large estates of computers. For example, the knowledge can be used to audit and track installed software or cleanse computer systems, for instance by removing old versions of software. In other instances, checks can be made to ensure that all installed software is correctly licensed, by comparing licence information (e.g. we have a licence for v11.0 of some software) with the software that is installed on a computer (e.g. anyone found to be running v11.1 is not licensed to do so).
Examples of the methods described herein may be provided in the form of a non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform the method.
Examples of the present disclosure may also be provided in the form of a computer program product. The functions of the computer program product can be provided by dedicated hardware, or by hardware capable of running software in association with appropriate software. When provided by a processor, these functions can be provided by a single dedicated processor, a single shared processor, or multiple individual processors, some of which can be shared. Moreover, the explicit use of the terms “processor” or “controller” should not be interpreted as exclusively referring to hardware capable of running software, and can implicitly include, but is not limited to, digital signal processor “DSP” hardware, read only memory “ROM” for storing software, random access memory “RAM”, flash memory, a nonvolatile storage device, and the like.
Furthermore, examples of the present disclosure can take the form of a computer program product accessible from a computer usable storage medium or a computer readable storage medium, the computer program product providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable storage medium or computer-readable storage medium can be any apparatus that can comprise, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device or propagation medium. Examples of computer readable media include semiconductor or solid state memories, magnetic tape, removable computer disks, random access memory “RAM”, read only memory “ROM”, rigid magnetic disks, a Redundant Array of Independent Disks “RAID”, and optical disks. Current examples of optical disks include compact disk-read only memory “CD-ROM”, compact disk-read/write “CD-R/W”, Blu-Ray™, and DVD.
The above implementations and examples are to be understood as illustrative examples of the disclosure. Further implementations and examples of the disclosure are also envisaged. It is to be understood that any feature described in relation to any one implementation may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other implementation, or any combination of the implementations. Any reference signs in the claims should not be construed as limiting the scope. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the disclosure, which is defined in the accompanying claims.
Number | Date | Country | Kind |
---|---|---|---|
2007055.3 | May 2020 | GB | national |