Embodiments of the invention generally relate to the field of computer systems and, more particularly, to a method and apparatus for computer file processing.
In computer operations, computer files, including Java class files, may contain various different types of elements. For example, there may be many different variations in data classifications within the files, such as instances of particular classes. Further, there may be elements such as annotations or other elements that contain additional data or metadata.
In certain circumstances, it may be necessary to identify which data elements are contained within the computer files. Because the data elements generally would not be indexed or otherwise identified in the files, it may be necessary to process the class files to search for the desired elements.
However, the structure of the files may make processing difficult. In one example, a set of class files may be in the form of a file hierarchy, which may not be easily searchable in an efficient manner. In another example, the files may be contained in an archive, which requires additional effort to expand the files and obtain the archived data. As a result, the identification of elements within the computer files may require significant processing time.
A method and apparatus are provided for computer file processing.
In a first aspect of the invention, an embodiment of a method includes receiving a serial data stream input, where the serial data stream input represents a set of computer files. The serial data stream input is scanned to extract selected data elements occurring in the set of computer files, and the selected data elements are output in a serial data stream output.
In a second aspect of the invention, an embodiment of a system includes a data input, where the data input is to receive a serial data stream input, the serial data stream input representing a set of computer files. The system further includes a processing module, where the processing module is to scan the serial data stream input to identify one or more elements in the set of computer files. The system also includes a data output, with the data output providing an extracted serial data stream output representing the identified elements of the input data stream.
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
FIG. 2a is an illustration of a file hierarchy including an archive that may be processed in an embodiment of the invention;
FIG. 2b is an illustration of an archive that may be processed in an embodiment of the invention;
Embodiments of the invention are generally directed to computer file processing.
As used herein:
“Computer file” means any file structure used in a computer system. Computer files include files with specific required structures, including Java class files.
“Class file” means a Java class file. A Java class file is a defined format for compiled Java code, which may then be loaded and executed by any Java virtual machine. The format and structure for a Java class file are provided in JSR (Java Specification Request)-000202, Java Class File Specification Update (Oct. 2, 2006) and subsequent specifications.
“Traversal” means a process for progressing through the elements of a computer system, including a process for progressing through the elements of a computer archive.
“Archive” means a single file that may contain one or more separate files. An archive may also be empty. The files within an archive are extracted, or separated, from the archive for use by a computer program. The files contained within an archive are commonly compressed, and the compressed files are decompressed prior to use. An archive may further include data required to extract the files from the archive. “Archive” may also refer to the act of transferring one or more files into an archive.
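By way of illustration only, and not by way of limitation, the fixed header of a class file may be read directly from a stream. The class file format begins with a four-byte magic number (0xCAFEBABE) followed by two-byte minor and major version numbers. The class name `ClassFileHeader` and its `read` method in the following sketch are hypothetical and are not part of any described embodiment.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: reads only the fixed-size header of a class file. */
final class ClassFileHeader {
    /** Magic number that starts every class file. */
    static final int MAGIC = 0xCAFEBABE;

    final int minorVersion;
    final int majorVersion;

    private ClassFileHeader(int minorVersion, int majorVersion) {
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
    }

    /** Reads magic, minor_version, and major_version from the start of the stream. */
    static ClassFileHeader read(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            int magic = data.readInt();               // u4 magic
            if (magic != MAGIC) {
                throw new IllegalArgumentException("not a class file: bad magic number");
            }
            int minor = data.readUnsignedShort();     // u2 minor_version
            int major = data.readUnsignedShort();     // u2 major_version
            return new ClassFileHeader(minor, major);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the header is read sequentially, the same call may be applied to a file, an archive entry, or any other serial source. A class file compiled for Java 5 carries major version 49.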
“Compression” means the conversion of data into a form that requires less storage space. The term “compression” includes the use of any known compression algorithm. “Compression” also may commonly be referred to as “zipping” a file. The reverse process to compression is decompression, or expansion, of the compressed data back into a usable form. Compressed data is decompressed (expanded or unzipped) prior to use. Compression includes both lossy compression, in which data is lost in the process of compression, and lossless compression, in which no data is lost in the process of compression.
In an embodiment of the invention, computer files are processed in the form of a data stream. The computer files may include, but are not limited to, Java class files. In an embodiment, the computer files are converted to a serial data stream input for processing, and the processing of the computer files is conducted with the data remaining in the data stream form.
In an embodiment of the invention, a set of computer files is processed in a single pass as a serial data stream. In an embodiment, the serial data stream form is maintained both on input and output, thereby allowing further processing of class files without further file conversion. In an embodiment, the same data format is used for the data input and the data output. In an embodiment, the data stream conversion allows processing without any dependency on random access files, and broadens the applicable scope of the process for the input. In an embodiment, the processing of class files as a data stream allows processing without requiring use of, for example, Java library utilities that may normally be required to conduct the file processing.
In an embodiment of the invention, a system includes a serial data processing module for scanning received data. In an embodiment, the processing module receives computer files in the form of a serial data stream, and outputs an extracted data stream. In an embodiment, the processing module processes the files in a single pass, without requiring multiple readings of the file data.
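By way of illustration only, a single-pass scan over a serial data stream may be sketched as follows. The record framing used here (a name, a payload length, the payload bytes, and an empty name as an end marker) is an assumption made for the example and is not a format defined in this description. The names `StreamScanner`, `scan`, `encode`, and `writeRecord` are likewise hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical single-pass scanner over an assumed record framing. */
final class StreamScanner {
    /** Copies records whose name is selected from in to out; returns the number copied. */
    static int scan(InputStream in, OutputStream out, Predicate<String> selector) {
        try {
            DataInputStream src = new DataInputStream(new BufferedInputStream(in));
            DataOutputStream dst = new DataOutputStream(new BufferedOutputStream(out));
            int selected = 0;
            while (true) {
                String name = src.readUTF();
                if (name.isEmpty()) break;            // assumed end-of-stream marker
                int length = src.readInt();
                if (selector.test(name)) {
                    byte[] payload = new byte[length];
                    src.readFully(payload);
                    writeRecord(dst, name, payload);  // output uses the input framing
                    selected++;
                } else {
                    skipFully(src, length);           // never revisited: one pass only
                }
            }
            dst.writeUTF("");                         // terminate the output stream
            dst.flush();
            return selected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds an input stream image from name -> payload pairs, in iteration order. */
    static byte[] encode(Map<String, byte[]> records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (Map.Entry<String, byte[]> record : records.entrySet()) {
                writeRecord(out, record.getKey(), record.getValue());
            }
            out.writeUTF("");
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeRecord(DataOutputStream out, String name, byte[] payload)
            throws IOException {
        out.writeUTF(name);
        out.writeInt(payload.length);
        out.write(payload);
    }

    private static void skipFully(DataInputStream in, int count) throws IOException {
        int remaining = count;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped == 0) {
                in.readByte();                        // throws EOFException at end of data
                skipped = 1;
            }
            remaining -= skipped;
        }
    }
}
```

Because the output in this sketch uses the same framing as the input, the output of one scan may be supplied as the input of another scan without conversion.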
In an embodiment of the invention, the conversion of computer files to a data stream allows for the use of a protocol for both the data producer (the computer file processor) and the data consumer without creating a complete file representation, thereby simplifying the data structure. In an implementation for Java class files, the processing system operates with a class file data model, without requiring the addition of any major abstraction for data processing.
In an embodiment, the conversion of computer files to a serial data format may include, but is not limited to, the operation of a traversal of a hierarchical data structure or of a data archive as provided respectively in U.S. patent application Ser. No. 11/648,065, entitled “Computer File System Traversal”, filed Dec. 30, 2006. Other processes for conversion of a set of files to a serial data stream may also be utilized in embodiments of the invention.
In an embodiment of the invention, for the processing of computer files it is assumed that processing occurs on an inner loop for critical processing stages. In an embodiment, a system provides high performance for inner loop class file processing.
In an embodiment of the invention, processing is designed to provide sufficient performance for overall computer file processing. For example, in an embodiment a system includes stream buffering to buffer data as it is obtained and processed. In addition, an embodiment of the invention provides a compact internal file state in the data stream, thereby minimizing the amount of data that will be required in the process of transferring and processing the computer files.
In an embodiment of the invention, a dedicated, independent processing module is provided. In an embodiment, the processing module may be utilized to identify type dependency data or annotation data in a serial data stream. A similar design and implementation may be utilized for either type of data.
In an embodiment of the invention, a file processor may be provided in multiple implementations, depending on the system requirements. In one example, native processing implementations may be provided for a computer file processor, with the native implementations being based upon relevant Java standards. In another example, a non-native implementation may be provided, as required. A particular non-native implementation may include a BCEL (Byte Code Engineering Library) implementation, with the BCEL API being a toolkit for the static analysis and dynamic creation or transformation of Java class files.
In an embodiment of the invention, a data consumer is a main framework expansion point for which neutral utility implementations might be required. In an embodiment of the invention, a file processor (the data producer) operates using the same data protocol as the data consumer protocol. In an embodiment of the invention, the data consumer may have control over the data to be provided to the data consumer. In an embodiment, the data producer and the data consumer may cooperate to agree on the data to be provided from the serial data stream. In an embodiment of the invention, a system may include complexity control, including configuring the file processor to deliver the data of interest. In an embodiment, the data of interest includes data meeting a certain degree of detail, or certain types of data. In an embodiment of the invention, the structure of the data processing may allow for a system to be utilized with loose semantics and implementation constraints. For example, the technical framework and protocol data types may be defined. However, there may be leeway for implementation characteristics, such as the result order sequence and analysis capabilities.
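By way of illustration only, the cooperation between a data producer and a data consumer may be sketched as a small protocol in which the consumer declares the kinds of data it requires and the producer delivers only those kinds. All type and method names below (`ElementKind`, `ElementConsumer`, `ElementProducer`, `CollectingConsumer`) are hypothetical and are introduced solely for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical kinds of data a consumer may request. */
enum ElementKind { ANNOTATION, TYPE_DEPENDENCY }

/** Hypothetical consumer-side protocol: declare interest, then receive matching elements. */
interface ElementConsumer {
    Set<ElementKind> kindsOfInterest();

    void accept(ElementKind kind, String owner, String value);
}

/** Hypothetical producer: forwards only the kinds of data the consumer asked for. */
final class ElementProducer {
    private final ElementConsumer consumer;
    private final Set<ElementKind> wanted;

    ElementProducer(ElementConsumer consumer) {
        this.consumer = consumer;
        this.wanted = consumer.kindsOfInterest();   // agreed once, before scanning starts
    }

    /** Called for each element found in the stream; filtered before delivery. */
    void emit(ElementKind kind, String owner, String value) {
        if (wanted.contains(kind)) {
            consumer.accept(kind, owner, value);
        }
    }
}

/** Hypothetical consumer that records what it receives, in arrival order. */
final class CollectingConsumer implements ElementConsumer {
    final List<String> received = new ArrayList<>();
    private final Set<ElementKind> kinds;

    CollectingConsumer(Set<ElementKind> kinds) {
        this.kinds = kinds;
    }

    @Override
    public Set<ElementKind> kindsOfInterest() {
        return kinds;
    }

    @Override
    public void accept(ElementKind kind, String owner, String value) {
        received.add(kind + ":" + owner + ":" + value);
    }
}
```

In this sketch only the data types of the protocol are fixed; the order in which elements are delivered is left to the producer, consistent with the loose semantics described above.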
In an embodiment of the invention, file processing may be included within a set of tools that are provided to search files. The tools may, for example, provide for the conversion of files into serial form by a traversal process, the scanning of data for desired elements, and other related processes.
In a particular embodiment of the invention, a process is applied to Java class files contained within a hierarchical file structure or within a Java archive (or JAR file), including class files for J2EE systems (Java 2, Enterprise Edition). In an embodiment, the output of the process for traversal of the hierarchical file structure or archive is a class file data stream. An embodiment may utilize Java under the JDK (Java Development Kit) 5.0, including the JSR-175 recommendation regarding code annotations. In an embodiment of the invention, the class files may contain elements such as annotations or occurrences of particular class types, and there may be a need to extract these elements from the class files. In an embodiment of the invention, the class files are converted to a serial data stream format for the input to a processing module, and the processing module scans the serial data stream and extracts the desired elements.
In an embodiment of the invention, the traversal module 120 walks through the file structure or archive. Using only the names of the elements, the traversal module 120 may make a determination whether to process or skip each element of the archive. In an embodiment of the invention, the archive traversal module processes only portions of interest, and does not process any element more than once.
In an embodiment of the invention, the traversal module 120 then outputs a serial data stream 130 representing the elements of interest in the file structure or archive 110. In an embodiment, the data stream 130 may be used for any purpose, including the provision of the data to a data stream processing module 140. In an embodiment of the invention, the processing module 140 may be intended to process the archive in a serial form to, for example, search for certain elements in the portions of interest in the archive. The processing module 140 may then produce a data output 150 that, for example, includes information regarding elements that were found in the archive.
FIGS. 2a and 2b illustrate, respectively, a file hierarchy and an archive that may be converted to a serial data stream, such as by the operation of the traversal module 120.
In addition, a branch may be an archive 212, the archive containing one or more files (unless the archive is empty). The file hierarchy 200 may include any number of archives, with the archives existing at any point in the hierarchy. In an embodiment of the invention, the file hierarchy may be subject to processing for a file system. In an embodiment, the operation may be transferred to a separate processing to address the computer archive when it is encountered. In an embodiment, after completion of processing of the archive the operation may return to the original processing.
FIG. 2b is an illustration of an archive that may be processed in an embodiment of the invention. The archive 220 may, in one example, be an archive encountered in the processing of a file system, such as archive 212 in the processing of the hierarchical file system 200. A non-empty archive will contain one or more files or archives (each such archive being an archive within an archive, or an inner archive). In this illustration, archive 220 contains file 222, file 232, and file 242, but also contains inner archive 224 and inner archive 234. Archive 224 contains one or more files or archives, which are shown here as file 226, file 228, and file 230. Archive 234 contains one or more files or archives, which are shown here as file 236, file 238, and file 240.
In an embodiment of the invention, the contents of file hierarchy 200 or archive 220 are traversed, with the outcome of the traversal being a data stream representing selected portions of such contents. In an embodiment of the invention, the traversal addresses each element of the file hierarchy or archive no more than once. In an embodiment of the invention, the selection of elements to process is based upon the names of the elements, thus preventing the need to enter archived elements, such as to decompress such elements, if the elements will not be processed.
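By way of illustration only, a name-based traversal of an archive, including inner archives, may be sketched with the standard `java.util.zip` classes. The class `ArchiveWalker`, its methods, the `!/` path separator for inner archives, and the selection rules (class files are selected, `.jar` and `.zip` entries are entered, everything else is skipped) are assumptions made for the example. The sketch is not the traversal of the referenced applications, and `ZipInputStream` still reads past the data of a skipped entry even though the entry is never opened or handed to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical walker that selects archive entries by name only. */
final class ArchiveWalker {
    /** Returns the paths of selected class files, visiting each entry at most once. */
    static List<String> walk(InputStream archive) {
        List<String> found = new ArrayList<>();
        try {
            walk(new ZipInputStream(archive), "", found);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    private static void walk(ZipInputStream zip, String prefix, List<String> found)
            throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName();
            if (name.endsWith(".class")) {
                // Element of interest; a fuller traversal would emit its bytes here.
                found.add(prefix + name);
            } else if (name.endsWith(".jar") || name.endsWith(".zip")) {
                // Inner archive: read it from the current entry's data.
                // The inner stream is not closed, since that would close the outer one.
                walk(new ZipInputStream(zip), prefix + name + "!/", found);
            }
            // Any other name is skipped; getNextEntry() moves on to the next entry.
        }
    }

    /** Helper for the example: builds an in-memory archive from name -> content pairs. */
    static byte[] zip(Map<String, byte[]> entries) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream out = new ZipOutputStream(bytes)) {
                for (Map.Entry<String, byte[]> item : entries.entrySet()) {
                    out.putNextEntry(new ZipEntry(item.getKey()));
                    out.write(item.getValue());
                    out.closeEntry();
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The walker reads the archive strictly front to back, so it requires no random access to the underlying file and may be applied to any serial source.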
In an embodiment of the invention, the serial data stream 310 then is provided to a file processor or scanner 315, which processes the data, including scanning the data stream for data elements of interest. The file processor 315 may contain multiple modules or sub-modules, depending on the particular embodiment. The file processor 315 outputs an extracted data stream 320, which represents elements of the data stream that have been selected by the file processor 315. The extracted data stream 320 then is eventually provided to a consumer 325, which may be an entity or agent that requires the result of the scanning operation. The consumer 325 may receive additional reports or data processing as required for the needs of the consumer 325.
In an embodiment of the invention, the operation of the computer file processing system 400 is directed by certain inputs and settings. The operation of the file processor 410 may be directed by a scanner configuration 425. In addition, a data mode configuration 430 affects both the file processor 410 and the consumer 420. The file processor 410 also may include one of multiple implementations. In particular embodiments, the implementation may be a native implementation 435 or a BCEL (Byte Code Engineering Library) implementation. The BCEL implementation may include the Apache BCEL process 445, as developed by the Apache Software Foundation. In addition, the consumer 420 may utilize a framework utility 450 and a framework extension 455 in the operation of the computer file processing.
In the interface layer, the code walk interfaces 680 may include a class file annotation value interface module 682, a class file program element interface module 684, a class file annotation handler interface module 686, a class file annotation scanner interface module 688, a class file dependency scanner interface module 690, and a class file dependency listener interface module 692. The file walk interfaces then may include a file condition interface module 612, a file name classifier interface module 614, a directory walker handler interface module 616, a directory walker interface module 618, a zip walker handler interface module (“zip” indicating use for archives) 620, a zip walker interface module 622, and a file notification interface module 624.
In an embodiment of the invention, the code processing 650 may provide for parsing types from class file descriptors. Code processing 650 may include a class file format helper module 652 and a class file descriptor parser module. The code walk implementation 660 for class file processing may include a class file annotation record module 662, a class file element record module 664, a class file annotation filter 666, a class file annotation for native elements 668, a class file dependencies module for native elements 670, a class file dependencies module for BCEL (Byte Code Engineering Library) elements 672, a class file dependency concentrator module 674, and a class file dependency filter 676.
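By way of illustration only, the parsing of types from a class file descriptor may be sketched as follows. In the class file format, a class type appears in a descriptor as `L`, followed by the slash-separated class name, followed by `;`, while primitive types, `void`, array dimensions, and parameter list delimiters are single characters. The class `DescriptorParser` and its method are hypothetical, and the sketch checks only the characters used, not the full descriptor grammar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser that extracts class names from a field or method descriptor. */
final class DescriptorParser {
    /** Single-character descriptor symbols: delimiters, array marker, primitives, void. */
    private static final String SINGLE_CHAR_SYMBOLS = "()[BCDFIJSZV";

    /** Returns the referenced class names in order of appearance, using dotted form. */
    static List<String> referencedClasses(String descriptor) {
        List<String> classes = new ArrayList<>();
        int i = 0;
        while (i < descriptor.length()) {
            char c = descriptor.charAt(i);
            if (c == 'L') {
                int end = descriptor.indexOf(';', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated class name: " + descriptor);
                }
                // Convert the internal form java/lang/String to java.lang.String.
                classes.add(descriptor.substring(i + 1, end).replace('/', '.'));
                i = end + 1;
            } else if (SINGLE_CHAR_SYMBOLS.indexOf(c) >= 0) {
                i++;
            } else {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' in descriptor: " + descriptor);
            }
        }
        return classes;
    }
}
```

For example, the method descriptor `(ILjava/lang/String;[[D)Ljava/util/List;` references the classes `java.lang.String` and `java.util.List`, which are the type dependencies that a dependency scanner would report for that method.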
In an embodiment of the invention, the file processing 655 may include a comma separated value (CSV) formatter and a CSV scanner. The file walk implementation 630 for locating files may include a simple file condition module 632, a basic file name classifier module 634, a directory finder module 636, a directory walker implementation module 638, a walk recorder module 640, a zip (archive) condenser module 642, and a zip walker implementation module 644.
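By way of illustration only, a comma separated value formatter may be sketched as follows. The quoting rule used here (a field containing a comma, a double quote, or a line break is enclosed in double quotes, with embedded double quotes doubled) is a common CSV convention and is an assumption of the example, as is the class name `CsvFormatter`. The description above does not specify the quoting rules of the CSV formatter.

```java
import java.util.List;

/** Hypothetical formatter that writes one record as a single CSV line. */
final class CsvFormatter {
    /** Joins the fields with commas, quoting any field that needs it. */
    static String formatRecord(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(fields.get(i)));
        }
        return line.toString();
    }

    /** Quotes a field only when it contains a comma, a double quote, or a line break. */
    private static String escape(String field) {
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
}
```

A line-oriented text format of this kind keeps the extracted data in serial form, so that each record may be written as soon as the corresponding element is found.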
As illustrated in
Memory 710 is or includes the main memory of the computer system 700. Memory 710 represents any form of random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such devices. Memory 710 stores, among other things, the operating system 715 of the computer system 700.
Also connected to the processors 705 through the bus system 720 are one or more mass storage devices 725 and a network adapter 735. Mass storage devices 725 may be or may include any conventional medium for storing large volumes of instructions and data 730 in a non-volatile manner, such as one or more magnetic or optical based disks. In an embodiment of the invention, the mass storage devices may include storage of files or an archive 732 that requires processing. In an embodiment of the invention, the processors 705 may operate to traverse the files or archive 732, the traversal of the files or archive 732 resulting in output of a serial data stream representing selected elements of the archive. The processors 705 may scan the serial stream for desired data elements within the computer files. In another embodiment the computer system 700 may provide for the conversion of the computer files into a serial data stream, while another system or systems is responsible for scanning the data stream for desired data elements.
The network adapter 735 provides the computer system 700 with the ability to communicate with remote devices over a network 740 and may be, for example, an Ethernet adapter. In one embodiment, the network adapter may be utilized to output data including, for example, an extracted serial data stream representing selected elements of the files or archive 732.
Client systems 805-815 may execute multiple applications or application interfaces. Each instance of an application or application interface may constitute a user session. Each user session may generate one or more requests to be processed by server 830. The requests may include instructions or code to be executed on a runtime system, such as virtual machine 845 on server 830.
In the description above, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
The present invention may include various processes. The processes of the present invention may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor or logic circuits programmed with the instructions to perform the processes. Alternatively, the processes may be performed by a combination of hardware and software.
Portions of the present invention may be provided as a computer program product, which may include a computer-readable medium having stored thereon computer program instructions, which may be used to program a computer (or other electronic devices) to perform a process according to the present invention. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disk read-only memory), and magneto-optical disks, ROMs (read-only memory), RAMs (random access memory), EPROMs (erasable programmable read-only memory), EEPROMs (electrically-erasable programmable read-only memory), magnetic or optical cards, flash memory, or other types of media/machine-readable medium suitable for storing electronic instructions. Moreover, the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer.
Many of the methods are described in their most basic form, but processes can be added to or deleted from any of the methods and information can be added or subtracted from any of the described messages without departing from the basic scope of the present invention. It will be apparent to those skilled in the art that many further modifications and adaptations can be made. The particular embodiments are not provided to limit the invention but to illustrate it. The scope of the present invention is not to be determined by the specific examples provided above but only by the claims below.
It should also be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature may be included in the practice of the invention. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims are hereby expressly incorporated into this description, with each claim standing on its own as a separate embodiment of this invention.
This application is related to and claims priority to U.S. provisional patent application 60/953,933, filed Aug. 3, 2007. This application is further related to: U.S. patent application Ser. No. 11/648,065, entitled “Computer File System Traversal”, filed Dec. 30, 2006; U.S. patent application Ser. No. ______, entitled “Computer Archive Traversal”, attorney docket 6570P472, filed Aug. 1, 2008, claiming priority to U.S. provisional application 60/953,932, filed Aug. 3, 2007; U.S. patent application Ser. No. ______, entitled “Annotation Processing of Computer Files”, attorney docket 6570P474, filed Aug. 1, 2008, claiming priority to U.S. provisional application 60/953,935, filed Aug. 3, 2007; U.S. patent application Ser. No. ______, entitled “Annotation Data Filtering of Computer Files”, attorney docket 6570P475, filed Aug. 1, 2008, claiming priority to U.S. provisional application 60/953,937, filed Aug. 3, 2007; U.S. patent application Ser. No. ______, entitled “Annotation Data Handlers for Data Stream Processing”, attorney docket 6570P476, filed Aug. 1, 2008, claiming priority to U.S. provisional application 60/953,938, filed Aug. 3, 2007; U.S. patent application Ser. No. ______, entitled “Dependency Processing of Computer Files”, attorney docket 6570P492, filed Aug. 1, 2008, claiming priority to U.S. provisional application 60/953,963, filed Aug. 3, 2007; and U.S. patent application Ser. No. ______, entitled “Data Listeners for Type Dependency Processing”, attorney docket 6570P493, filed Aug. 1, 2008, claiming priority to U.S. provisional application 60/953,964, filed Aug. 3, 2007.
Number | Date | Country
---|---|---
60953933 | Aug 2007 | US
60953932 | Aug 2007 | US
60953935 | Aug 2007 | US
60953937 | Aug 2007 | US
60953938 | Aug 2007 | US
60953963 | Aug 2007 | US
60953964 | Aug 2007 | US