The present invention relates to the field of file format identification, in particular to a method and system for content-based, automatic file format identification.
File format identification is a salient feature of each software program, and is performed while conducting multiple operations. The operations vary from loading and identifying files on a host computer, to downloading and streaming files in a network. The growth in the customized software market has introduced myriad software-specific file formats. This increase in the number of file formats has made file format identification even more complex for software programs.
Many techniques have been developed to handle the problem of the increasing number of file formats. A conventional way of solving this problem is to identify and use a standard file format. In fact, most software programs still support a particular set of file formats. Such software programs have limitations, since they can only read file formats they recognize. Moreover, these software programs give an error message when directed to process a file of an unsupported file format.
Another common approach is to statically associate a file extension with a particular application, a form of external file format meta-data. This solution is familiar to users of the Microsoft Windows™ operating system. Variants of this solution include downloading the mappings over a network at login time, or even the runtime registration of an application to a format, or vice versa. In all of these cases, the know-how that maps a data format to an application is statically defined, and so the mapping will not be registered if the specific application is not installed on the target machine.
In addition, these static format-application mappings are error-prone due to inconsistencies in implementation and a lack of standards. Moreover, when files are delivered as streams, the application is forced to process the incoming data with the assumption that it adheres to the expected format. For example, consider a case in which a compound file format (a compound file being a file that contains one or more files) is acceptable to an application, but the format of a file within the compound file is not. In such a case, the application assumes that the format of the file in the compound file is acceptable and processes it accordingly. At a later stage, this may lead to an error in processing the file.
Improved techniques for file format identification include statistical analysis of known file formats. This technique is described in the research paper by Mason McDaniel and M. Hossain Heydari, titled ‘Content Based File Type Detection Algorithm’. The paper was published in the 36th Annual Hawaii International Conference on System Sciences, on Jan. 6, 2003. The paper relates to a threefold approach to file format identification. In the first approach, the paper proposes a statistical analysis of all the known file types. This statistical analysis is based on the frequency of the occurrence of each byte value in a particular file type. The technique generates normalized histograms for each file type and identifies a file type by matching the byte frequency histogram of the unknown file with that of the known file types. In the second approach, a correlation is established between characters used in a particular file format. For example, in an HTML document, the frequency of the occurrence of the character [<] is the same as that of the character [>]. This correlation enables more efficient file format identification. Finally, the paper also proposes a byte frequency histogram for the header and footer bytes of a file format. For file format identification, this technique compares the byte frequency histogram of the header and the footer of unknown files with that of known file formats.
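By way of illustration only, the byte-frequency matching idea summarized above may be sketched as follows. This is not the cited paper's algorithm: the normalization (frequencies summing to 1), the distance measure (sum of absolute differences), and the names byte_histogram, histogram_distance and closest_type are assumptions chosen for brevity.

```python
from collections import Counter


def byte_histogram(data):
    """Return a 256-bin list of byte frequencies, normalized to sum to 1."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)          # iterating bytes yields integers 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def histogram_distance(a, b):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_type(sample, fingerprints):
    """Return the fingerprint name whose histogram is nearest to the sample's."""
    hist = byte_histogram(sample)
    return min(fingerprints,
               key=lambda name: histogram_distance(hist, fingerprints[name]))


# Hypothetical fingerprints built from tiny training samples.
fingerprints = {
    "html": byte_histogram(b"<p>hi</p><b>x</b>"),
    "zeros": byte_histogram(b"\x00" * 16),
}
```

Because every byte of the unknown file contributes to the histogram, a sketch of this kind has to read the whole file before it can report a result.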
However, the above-mentioned approach must be trained on a large number of sample files to work efficiently. Moreover, to identify an unknown file, the approach parses the whole file. This makes the process of format identification both time-consuming and less accurate. For example, the file format identification process mentioned in the research paper may have a problem distinguishing between ‘xml’, ‘sgml’, ‘html’ and ‘xhtml’ file formats. This is because these formats use the same characters, which give identical frequency distributions for the methods mentioned in the research paper.
Therefore, there is a need for an efficient method and system that does not depend on the meta-data for format identification. There is also a need for a method and system that does not parse the whole file for its identification.
An object of the present invention is to provide a method and system that selectively uses the content of a file and the external information linked to the file to determine the format of the file.
Another object of the present invention is to dynamically select a set of bytes for byte-pattern matching.
The file format identification system of the present invention performs content-based, automatic file format identification. The system also dynamically selects a set of bytes from a file for byte-pattern recognition.
The file format identification system of the present invention comprises a selection unit, a comparison unit, a verification unit, a detection unit, a data format identifier, an extraction unit, and a plurality of text file-based parsers.
The method for byte pattern recognition begins with checking relevant file format information in the meta-data linked to the input file. If relevant file format information is available, it is extracted from the meta-data. The selection unit identifies the file formats that match the relevant file format information and calculates the length (in bytes) of the longest data signature. A set of bytes is selected at the corresponding location in the input file and is compared to the corresponding data signature of the selected file formats. If relevant file format information is not available, the selection unit selects the length of bytes from a set of known file formats.
The method described above is also used for content-based, automatic file format identification. The file format identification begins by selecting a set of bytes at the beginning of the input file. The set of bytes is chosen by a process identical to the byte pattern recognition method described above. After the bytes have been selected, the comparison unit matches the set of selected bytes with the data signature of the known/selected file formats. The file formats for which the comparison is successful are verified by the verification unit, which performs verification by comparing the data structure of the file with that of the known/selected file formats. The mode selected for verification is chosen, based on the set of file format(s) for which the matching is successful. The detection unit then checks the file format that is verified for the presence of a compound file format. If the file format is identified to be compound, the file format identification system finds the format of the files present in the identified compound file format, otherwise the file format is returned as the format that represents the file.
However, if the matching is unsuccessful, or if the verification does not produce any relevant file format, the selection unit chooses a set of bytes at the end of the file and compares it with the corresponding data signature of the known file formats. The file formats for which the data signature matches the selected set of bytes are chosen, and verified. The matching and verification processes followed for the bytes at the end of the file, are the same as followed for the bytes at the beginning of the file. The detection unit then compares the file format verified with a list of known compound file formats. If the file format is identified to be compound, the file formats present in it are recursively identified, otherwise the file format is identified as the format that represents the file. If the comparison of the set of bytes at the end of the file is unsuccessful, or if verification does not yield at least one file format, the data format identifier checks the language and character set of the input file, to identify a textual file format. If the data type of the input file is identified to be textual, the extraction unit compares the file format-specific syntax and characters. This step is performed to select a list of possible representative textual file formats. Meta-data available with the file may be used to determine the language and character set of the file.
In the next step, parsers corresponding to the text file formats that match the content of the file parse the file. The file format for which the corresponding parser successfully parses the maximum length of the file is selected as the format of the file. If the parsing is unsuccessful, meta-data is applied to the input file for file format identification. If, on the other hand, the data type of the input file is not textual, the file format identification system applies meta-data to the file to identify the corresponding file format. The step of applying meta-data to identify the format of the input file is only performed if meta-data has not been used previously to constrain the search space. If meta-data has been used previously, the meta-data and the file format selected are invalidated and file format detection is performed over a set of known file formats.
Once the file format is identified, the detection unit checks whether the file format is compound. If the file format is compound, the file formats present in the identified compound file format are identified. The result of the file format identification process is returned as a vector containing a full recursive description of the file formats detected.
If no file format matches the input file, and the data type of the file is identified to be textual, the file format identification unit returns the input file as an unknown simple text file with no embedded control or markup instructions. If, on the other hand, the data type is identified to be non-textual, the file format identification unit returns the file format of the input file as unknown, and recommends a file format that best represents the input file.
The preferred embodiments of the invention will hereinafter be described in conjunction with the appended drawings provided to illustrate and not to limit the invention, wherein like designations denote like elements, and in which:
The present invention relates to a method and system for content-based, automatic file format identification. It aims at detecting a file format that represents an input file in the best possible manner. The input file may be a binary or text data type. The binary data type includes card data types, word documents and video types, while text data type includes print data types and XML. The invention also relates to a method and system for dynamically selecting a set of bytes from the input file for byte-pattern recognition. This byte-pattern recognition is further used in the method for content-based, automatic file format identification.
Computing device 100 comprises a file format identification system 104, a microprocessor 106, a memory device 108, an operating system 110, a network adaptor 112 for interacting with the network, and a display unit 212 for displaying the data. Computing device 100 may receive data either from memory device 108 or from the network. Memory device 108 may be a magnetic or optical storage medium, such as a hard disk, a tape drive, a compact disc (CD), or a memory chip, etc.
File format identification system 104 in one of its embodiments comprises sub-components, as described in
The steps involved in dynamically selecting a set of bytes for file format identification are described further with the help of
In step 301, if relevant file format information is available with the meta-data linked to the file, step 303 is performed. In step 303, file format identification system 104 extracts the relevant file format information from the meta-data linked to the file. The most general file information provided by the meta-data is the file extension itself. The file information may be extracted based on data extraction techniques known in the art. Once the file information is extracted, step 305 is performed.
In step 305, file format identification system 104 compares the known file formats to the relevant file format information provided by the meta-data. In the present invention, meta-data is used in an advisory fashion to select a set of known file formats that match the file information provided by the meta-data. The relevant file format information provided by the meta-data is used to constrain the sample set of file formats that are used for file format identification. For example, consider a situation when external MIME meta-data is available with a binary image file downloaded from the Internet, and the external meta-data indicates that the binary image file can run on the Microsoft Imaging™ application. The present pattern recognition algorithm selects the file formats supported by the Microsoft Imaging™ application (jpeg, bmp, and tiff file formats) for byte-pattern recognition.
In step 305, if a file format matches the file information provided by the meta-data, the file format is selected in step 307, for comparing its data signature to the byte-pattern of the input file. Otherwise, the file format is rejected in step 309. In step 311, file format identification system 104 performs a check if all known file formats have been compared to the file information provided by the meta-data. If the operation has been performed for all known file formats, the selected file formats of step 307 proceed to step 313, otherwise file format identification system 104 performs step 305, and compares the relevant file format information with the remaining file formats.
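For illustration only, the loop of steps 305 to 311 may be sketched as below, taking the file extension as the relevant file format information. The registry KNOWN_FORMATS, its contents, and the function name select_candidates are hypothetical and are not part of the described system.

```python
# Hypothetical registry: format name -> file extensions associated with it.
KNOWN_FORMATS = {
    "jpeg": {".jpg", ".jpeg"},
    "bmp": {".bmp"},
    "tiff": {".tif", ".tiff"},
    "pdf": {".pdf"},
}


def select_candidates(extension_hint):
    """Return the formats whose signatures will be compared to the file.

    With no hint, every known format is a candidate; with a hint, only
    the formats that match it are kept.
    """
    if extension_hint is None:
        return list(KNOWN_FORMATS)
    hint = extension_hint.lower()
    selected = []
    for name, extensions in KNOWN_FORMATS.items():  # visit every known format
        if hint in extensions:
            selected.append(name)                   # analogous to step 307
        # otherwise the format is rejected, analogous to step 309
    return selected
```

The hint is used only to narrow the candidate set; the content of the file is still compared to the data signatures of the formats that remain.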
In step 313, selection unit 202 identifies the length of the longest data signature from the selected file formats. The data signature may be present at the beginning or at the end of the known file formats. The data signature of a file format represents the expected byte values at specific locations relative to the start of the file, or relative to other expected locations. For example, consider a case when the data signature at the beginning of a selected file format is 100 bytes long, whereas the corresponding data signatures of other selected file formats are less than 100 bytes. Selection unit 202 selects 100 bytes from the beginning of the file for which byte-pattern matching has to be performed. These 100 bytes are then compared to the data signature of the selected file formats for file format identification.
In case the relevant file format information is not available in the meta-data linked to the file, step 315 is performed directly. In step 315, selection unit 202 identifies the length of the longest data signature from a set of known file formats.
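The selection of the number of bytes described in steps 313 and 315 may be sketched as follows, purely as an example. The sketch assumes that every signature sits at offset zero, which is a simplification of the data signature described above; the table HEAD_SIGNATURES and the function names are hypothetical.

```python
# Hypothetical table: format name -> expected bytes at the start of the file.
HEAD_SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "zip": b"PK\x03\x04",
}


def longest_signature_length(candidates):
    """Length in bytes of the longest signature among the candidate formats."""
    return max((len(HEAD_SIGNATURES[name]) for name in candidates), default=0)


def match_head(data, candidates):
    """Compare only the first n bytes of data to each candidate signature."""
    n = longest_signature_length(candidates)
    head = data[:n]                      # no byte beyond n is examined
    return [name for name in candidates
            if head.startswith(HEAD_SIGNATURES[name])]
```

Restricting the candidates, for example to "pdf" and "gif", lowers n from 8 to 5, so fewer bytes are read from the input file when the search space has been constrained.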
The steps involved in dynamically selecting the set of bytes from a file can be used for content-based, automatic file format detection. The steps involved in content-based, automatic file format identification are further described with the help of a flowchart in
In
After determining the value of ‘n’, comparison unit 204 performs step 403. In step 403, comparison unit 204 chooses the first ‘n’ bytes of the file for which the file format identification is performed. Comparison unit 204 then matches this set of bytes with the data signature of the known/selected file formats. File types that are common are checked before obscure file types. This prioritized list of file formats is maintained by keeping an account of the file types frequently encountered in the past. The following example represents a data signature:
The data signature of each known/selected file format is checked in an iterative fashion. The following pseudo-code describes the steps performed by comparison unit 204 to compare the data signature mentioned above with the selected number of bytes at the beginning of the input file:
The pseudo-code given above refers to an iterative loop that checks the hexadecimal data signature previously mentioned till the nth hexadecimal digit. Steps 405 to 421 are illustrated in
In step 407, verification unit 206 verifies the file formats for which the data signature matches the byte-pattern of the file. Verification unit 206 verifies the selected file formats by comparing their data structure with that of the file. The verification is performed based on the file formats for which the matching is successful in step 403. For example, in the case of a ‘pdf’ file format, the verification is performed by navigating the contents of the file. In step 409, verification unit 206 checks if the verification of a file format is successful. The verification process is successful if the data structure of the file matches that of at least one file format. If the verification is successful, the file format identification proceeds to step 411. Steps 411 and 413 are illustrated in
In step 411, detection unit 208 compares the file format verified with a list of known compound file formats. If the compound file format is identified, the file format identification system 104 performs step 301, and iteratively identifies the sub-file formats within the compound file format. In step 411, if the file format is not compound, file format identification system 104 performs step 413. In step 413, file format identification system 104 returns the verified file format as the format of the file. The file format identified is returned as a vector. For example, a file identified as a Microsoft Word™ file may be represented as {Word [6]}.
In step 409, if the file format verification is not successful, or in step 405, if the matching is unsuccessful, selection unit 202 performs step 415 (
Where ‘b’ denotes a known/selected file format, b[ ] denotes the location of the data signature, ‘h’ denotes hexadecimal representation, and n′ denotes the number of last n′ bytes selected for file format identification.
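As a non-limiting sketch, matching the last n′ bytes of a file against trailing data signatures might look like the following. The table TAIL_SIGNATURES and the function name match_tail are hypothetical, and the sketch assumes that the signature occupies the very last bytes of the file; real files may carry additional bytes, such as an end-of-line, after such a marker.

```python
# Hypothetical table: format name -> expected bytes at the end of the file.
TAIL_SIGNATURES = {
    "pdf": b"%%EOF",
    "jpeg": b"\xff\xd9",
}


def match_tail(data, candidates):
    """Compare the last n' bytes of data to each candidate's tail signature."""
    usable = [name for name in candidates if name in TAIL_SIGNATURES]
    if not usable:
        return []
    # n' is the length of the longest trailing signature among the candidates.
    n_prime = max(len(TAIL_SIGNATURES[name]) for name in usable)
    tail = data[-n_prime:]
    return [name for name in usable if tail.endswith(TAIL_SIGNATURES[name])]
```

Candidates without a known trailing signature are skipped rather than rejected, since they cannot be confirmed or ruled out from the end of the file.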
In step 418, comparison unit 204 checks if the matching is successful. The matching process is successful if the data signature of at least one file format matches the byte-pattern of the file. If the matching is successful in step 418, verification unit 206 verifies the selected file formats in step 419. Verification unit 206 performs this verification by matching the data structure of the file with that of known file formats. The verification process is performed by a method identical to the process described in step 407. In step 421, verification unit 206 computes the success of the verification process. The verification process is identified to be successful if there is at least one file format for which the data structure matches with that of the file. In step 421, if the verification is successful, the detection unit 208 performs step 411. In step 411, detection unit 208 compares the file format verified with a list of known compound file formats. In step 411, if the file format is not identified as compound, file format identification system 104 performs step 413. In step 413, file format identification system 104 returns the file format verified as the format of the file. For example, an exemplary zip file and its sub-file formats may be represented as follows:
The file format identified is returned as a vector. If, however, the file format matches a compound file format in step 411, file format identification system 104 performs step 301 and iteratively identifies the file formats of files within the compound file.
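The recursive treatment of compound file formats may be illustrated with the sketch below, which uses ZIP as the only compound format and a nested tuple as a stand-in for the returned vector. The tuple layout and the function name describe are assumptions; the sketch performs no verification step and has no limit on nesting depth.

```python
import io
import zipfile


def describe(data, name="input"):
    """Return a nested (name, format, children) description of data."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):                 # compound format detected
        children = []
        with zipfile.ZipFile(buffer) as archive:
            for member in archive.namelist():
                if member.endswith("/"):           # skip directory entries
                    continue
                # Identify each contained file by the same procedure.
                children.append(describe(archive.read(member), member))
        return (name, "zip", children)
    if data.startswith(b"%PDF-"):                  # a single simple format
        return (name, "pdf", [])
    return (name, "unknown", [])
```

A file that is not compound yields an empty list of children, so simple and compound results share one shape.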
In case in step 421 no file format is verified, or if matching performed in step 418 is unsuccessful, data format identifier 210 performs step 423. Steps 423 to 425 are illustrated in
In step 427, parsers 214, each corresponding to a text file format identified by the extraction unit in step 425, parse the text file. The file is parsed based on known text-parsing algorithms. If a specific parser successfully parses the content of the file, it is assumed that the file matches the file format associated with that parser. A specific embodiment of this step may contain parsers for many known document formats, ranging from NROFF and HTML to Applix Words™. After parser 214 parses the file, file format identification unit 202 performs step 429.
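A minimal sketch of trying one parser per candidate text format is given below. JSON and XML are used only because parsers for them ship with the Python standard library; they stand in for the document formats named above. The table PARSERS and the function names are hypothetical, and the sketch reports only success or failure rather than the length of the file that was parsed.

```python
import json
import xml.etree.ElementTree as ET


def _parses_as_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:          # json.JSONDecodeError is a ValueError
        return False


def _parses_as_xml(text):
    try:
        ET.fromstring(text)
        return True
    except (ET.ParseError, ValueError):
        return False


# Hypothetical table: candidate text format -> parser that accepts or rejects it.
PARSERS = {"json": _parses_as_json, "xml": _parses_as_xml}


def formats_that_parse(text, candidates):
    """Return the candidate formats whose parser accepts the whole text."""
    return [name for name in candidates if PARSERS[name](text)]
```

An empty result corresponds to the case in which no parser succeeds, so that the file format is not yet identified.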
If in step 425 the data type of the file is not identified to be textual, step 429 is performed. At this stage the input file may be a binary file, noise, or an unidentified file.
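The decision between textual and non-textual data may be approximated, for illustration only, by attempting to decode the bytes under a short list of character sets. The function name looks_textual, the list of encodings, and the printable-character test are assumptions; the sketch does not attempt the language detection mentioned in the summary.

```python
def looks_textual(data, encodings=("ascii", "utf-8")):
    """Return the first encoding under which data decodes to printable text.

    Returns None when no listed encoding fits, that is, when the data is
    treated as non-textual by this simplified test.
    """
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue                      # not valid in this character set
        # Accept printable characters plus common whitespace controls.
        if text and all(ch.isprintable() or ch in "\t\r\n" for ch in text):
            return encoding
    return None
```

Meta-data, where available, could be used to order or shorten the list of encodings that is tried.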
In step 429, it is checked if the file format is identified. If the file format of the input file is not identified, file format identification unit 202 performs step 431. In step 431, it is checked if meta-data has been used previously to constrain the search space of file formats. If the meta-data has been used previously, file format identification system 104 rejects the set of file formats selected by the meta-data and performs step 401. In this case, in step 401, the value of ‘n’ and n′ is selected from a set of known file formats. File format identification system 104 then iteratively performs steps 401 to 429 to identify the file format.
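The two-pass behaviour described above, in which a search constrained by meta-data is abandoned in favour of the full set of known file formats, may be sketched as follows. The table SIGNATURES, the helper detect, and the function identify_with_fallback are hypothetical, and signature matching at the start of the file stands in for the whole of steps 401 to 429.

```python
# Hypothetical table: format name -> expected bytes at the start of the file.
SIGNATURES = {"pdf": b"%PDF-", "gif": b"GIF8", "zip": b"PK\x03\x04"}


def detect(data, candidates):
    """Return the first candidate whose signature matches, else None."""
    for name in candidates:
        if data.startswith(SIGNATURES[name]):
            return name
    return None


def identify_with_fallback(data, hinted_formats):
    """Try the meta-data-constrained set first, then every known format."""
    if hinted_formats:
        found = detect(data, hinted_formats)
        if found is not None:
            return found, "constrained"
    # The hint was absent or misleading: discard it and search all formats.
    return detect(data, list(SIGNATURES)), "unconstrained"
```

A file carrying a misleading hint, such as a PDF labelled as a GIF, is thus still identified from its content on the second pass.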
In step 429, if the file format has been identified, the document is checked to determine if it is a compound document in step 411, in the manner described earlier. If the pass was not constrained by meta-data then file format identification system 104 proceeds to step 433.
In step 433, two possible cases may exist. In the first case, if meta-data was not available for a textual file, file format identification system 104 returns the input file as an unknown simple text file with no embedded control or markup instructions. An example of a return vector in this case is {Unknown [Text [ ]]}. In the case of a non-textual file, by contrast, file format identification system 104 returns that the file cannot be identified.
In the second case, if meta-data was available with the file (textual or non-textual), file format identification system 104 applies the meta-data for format detection. The meta-data linked to the file is compared against a set of identifiers of known file formats, and the format indicated by the meta-data is returned as the format of the file. For example, for an HTML file the meta-data may read <META http-equiv="Content-Type" content="text/html">. In this case file format identification system 104 reads the meta-data and returns the format of the file as {Unknown [text [HTML [ ]]]}.
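Reading a content type from a META element of the kind shown in the example above may be sketched with the HTML parser of the Python standard library. The class name MetaContentType and the function name declared_content_type are hypothetical; mapping the returned string to a known file format is left out of the sketch.

```python
from html.parser import HTMLParser


class MetaContentType(HTMLParser):
    """Record the content attribute of the first Content-Type META element."""

    def __init__(self):
        super().__init__()
        self.content_type = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta" or self.content_type is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.content_type = attributes.get("content")


def declared_content_type(html_text):
    """Return the declared content type, or None if no declaration is found."""
    parser = MetaContentType()
    parser.feed(html_text)
    return parser.content_type
```

The value obtained in this way is advisory, in keeping with the use of meta-data described earlier.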
The algorithm used by file format identification system 104 of the present invention enables it to be used as a stand-alone program, or a program operating as the module of a larger program or an operating system, such as the Windows™ operating system.
The set of instructions may include various instructions that instruct the processing machine to perform specific tasks, such as the steps that constitute the disclosed method. The set of instructions may be in the form of a program or software. The software may be in various forms such as system software or application software. Further, the software might be in the form of a collection of separate programs, or a program module within a larger program or a portion of a program module. The software might also include modular programming in the form of object-oriented programming. The processing of input data by the processing machine may be in response to user commands, to the results of previous processing, or to a request made by another processing machine.
The file format identification system 104, as described in the present invention, or any of its components, may be embodied in the form of a processing machine. Typical examples of a processing machine include a general-purpose computer, a programmed microprocessor, a micro-controller, a peripheral integrated circuit element, and other devices or arrangements of devices which are capable of implementing the steps that constitute the disclosed invention.
A person skilled in the art can appreciate that it is not necessary that the various processing machines and/or storage elements be physically located at the same geographical location. The processing machines and/or storage elements may be located in geographically distinct locations and be connected to each other to enable communication. Various communication technologies may be used to enable communication between the processing machines and/or storage elements. Such technologies include the connection of the processing machines and/or storage elements in the form of a network. The network can be an intranet, an extranet, the Internet, or any client-server model that enables communication. Such communication technologies may use various protocols such as TCP/IP, UDP, ATM or OSI.
While the preferred embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention, as described in the claims.