Embodiments of the invention relate generally to file systems, and more particularly to traversal of a file system tree.
The recording medium for the storage device, e.g., disks 73, contains a number of files of different types including directory files, i.e., files which identify other files, and non-directory files, for example, data or application files. Typically, these files are organized according to a structure known as a directory tree. The number of files that can be stored on the hard disk drive 51 depends on the capacity of the disks 73. Typically, a disk drive with capacity C can hold N files with an average file size Savg, where N=C/Savg. Disk drives now typically have a capacity C of up to 750 Gigabytes, and the average file size may be as small as 10-100 bytes for files that contain SMS messages or 100-1000 bytes for typical emails.
Accordingly, a typical 400 Gigabyte disk drive can hold just under 100 million files having an average size of 4096 bytes, for example, and these files must be managed efficiently to keep response times small and to optimize the use of the storage device.
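The relationship N=C/Savg can be checked with a short calculation. The following is a minimal sketch only; the function name `max_files` and the use of a decimal Gigabyte (1,000,000,000 bytes) are assumptions made for illustration, and file system metadata overhead is ignored.

```python
GIGABYTE = 1_000_000_000  # assumed decimal Gigabyte


def max_files(capacity_bytes: int, avg_file_size_bytes: int) -> int:
    """Upper bound N = C / Savg on the number of files, ignoring metadata overhead."""
    return capacity_bytes // avg_file_size_bytes


# 400 Gigabyte drive, 4096-byte average file size
print(max_files(400 * GIGABYTE, 4096))  # 97656250, i.e., just under 100 million
```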
With such a large number of small-sized files in a file system, the average number of files in a single directory can be very large. The average number of files in a single directory depends on how deep the directory tree is. For instance, the average number of files in a single directory may vary from about 100 (if the directory tree is four levels deep) to about 465 (if the directory tree is three levels deep) to about 10,000 (if the directory tree is two levels deep). With such a large number of small files, file operations, such as file system backup, that traverse the directory tree and access each file's data can take a very long time. Backup of a disk drive and similar operations involve traversing the file system tree and reading the data of each file in the order of the traversal. The slowdown is particularly pronounced if the files were created in a random order, i.e., when a file's location in the directory tree is not correlated with its physical location on disk. Of course, disk drive backup represents just one example from a more general class of disk workloads to which the problem applies.
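The per-directory figures above follow from taking the d-th root of the total file count for a tree that is d levels deep. The sketch below assumes a uniformly balanced tree in which every directory holds the same number of entries; the function name `entries_per_directory` is hypothetical.

```python
def entries_per_directory(total_files: int, depth: int) -> float:
    """Entries per directory in a balanced tree: f such that f ** depth == total_files."""
    return total_files ** (1.0 / depth)


for depth in (4, 3, 2):
    print(depth, round(entries_per_directory(100_000_000, depth)))
```

For 100 million files this yields roughly 100, 464, and 10,000 entries per directory for depths of four, three, and two levels, in line with the approximate figures given above.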
Modern disk drives can access data at a sequential rate of 40-100 Megabytes per second (millions of bytes per second). This rate of data access is determined in great part by the product of bytes per track and rotations per second. At the rate of 40 Megabytes (1 Megabyte=1,000,000 bytes) per second it takes roughly 10,000 seconds to access all the data a 400 Gigabyte disk may contain. Seek time is the time period to position the actuator 71 (
At block 111, the application accesses files in the directory in the order returned by the call to the file system. At blocks 121 and 131, for each entry, the method 100 determines if the entry is itself a directory. If so, then control returns to block 101. Otherwise, if the entry is not a directory and is a file on disk, at block 141, the method 100 seeks to the file on disk and at block 151, reads the content of the file.
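The conventional traversal of method 100 can be sketched as a simple recursive walk. This is an illustrative sketch rather than the described implementation: the function name `traverse_in_directory_order` and the `visit` callback are hypothetical, and the mapping of code lines to block numbers is an interpretation of the description above.

```python
import os


def traverse_in_directory_order(root, visit):
    """Walk the tree, reading files in the order the file system returns them."""
    # Obtain the directory entries, in whatever order the file system provides.
    with os.scandir(root) as it:
        entries = list(it)
    # Block 111: access entries in the returned order.
    for entry in entries:
        # Blocks 121/131: is the entry itself a directory?
        if entry.is_dir(follow_symlinks=False):
            traverse_in_directory_order(entry.path, visit)
        elif entry.is_file(follow_symlinks=False):
            # Blocks 141/151: seek to the file and read its content.
            with open(entry.path, "rb") as f:
                visit(entry.path, f.read())
```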
On average, the time taken to seek between two random locations on the disk (approximately 10-20 ms) is much larger than the time taken to actually read a file (approximately 0.1 ms). Therefore, the time taken to traverse a file system tree with a large number of files can be dominated by the seek operations and can be 100 to 200 times greater than the time needed to read the same data sequentially.
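The figures above can be reproduced with a back-of-the-envelope calculation. The sketch below uses only the example numbers already given (a 400 Gigabyte disk, a 40 Megabyte per second sequential rate, 4096-byte files, and 10-20 ms random seeks); the function names are hypothetical, and the model ignores caching, queuing, and metadata reads.

```python
CAPACITY_BYTES = 400 * 10**9   # 400 Gigabyte disk
RATE_BYTES_PER_S = 40 * 10**6  # 40 Megabytes per second sequential rate
FILE_SIZE_BYTES = 4096         # example average file size


def sequential_read_seconds(capacity_bytes, rate):
    """Time to read the whole disk sequentially."""
    return capacity_bytes / rate


def slowdown_factor(seek_seconds, file_size_bytes, rate):
    """Ratio of (seek + read) time per file to the read time alone."""
    read_seconds = file_size_bytes / rate  # about 0.1 ms for a 4096-byte file
    return (seek_seconds + read_seconds) / read_seconds


print(sequential_read_seconds(CAPACITY_BYTES, RATE_BYTES_PER_S))  # 10000.0 seconds
for seek in (0.010, 0.020):
    print(seek, slowdown_factor(seek, FILE_SIZE_BYTES, RATE_BYTES_PER_S))
```

Under these assumptions the per-file cost with random seeks comes out roughly 100 to 200 times the pure read time.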
One solution to speed up traversal of a file system tree is to perform block level operations that access data sequentially. Such block level operations take up to a few hours, and thus, are significantly faster. However, due to issues relating to user convenience and flexibility, file mode, in which the directory tree is traversed and each file in each directory is accessed, is more desirable than block mode.
A method for traversing a file system tree on a storage device, such as a disk drive, includes obtaining a list of entries within a directory of the file system tree on the storage device. The list of entries is sorted in order of the file locations on the storage device. The entries within the list of entries are accessed for tree traversal in the order in which they are sorted.
Embodiments of the present invention are described in conjunction with systems, methods, and machine-readable media of varying scope. In addition to the aspects of the embodiments of the invention described in this summary, further aspects of the embodiments of the invention will become apparent by reference to the drawings and by reading the detailed description that follows.
One or more embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:
A method and system for improving performance of file system tree traversal to access files on a storage device are described herein. Files located in a single directory are read in the order of their physical locations on the storage device rather than in the order in which the file entries are kept in the directory structure. Accordingly, the seek distances, and therefore the average seek time, between consecutive read requests are reduced. Consequently, the total elapsed time for file system tree traversal is significantly reduced, especially for a file system tree with a very large number of files.
At block 411, the list of file entries is sorted in the order of the file locations on the storage device. For each file, the file system maintains a list of blocks that contain data of such a file. For small files all the data blocks are typically consecutive because they occupy only one or a few blocks (disk blocks, for example, are typically 512 bytes). In one embodiment, the file system can sort directory entries according to the logical block addresses of the first block used by each file. In another embodiment, the file system sorts the list of file entries based on the track number and/or sector number of the location of the file on the disk drive.
Accordingly, block 411 utilizes the concept that most modern storage device technologies, such as disk drive technologies, use logical block addresses (LBAs) that number available data blocks in a consecutive way. An LBA is used to address a specific location on a disk, or within a stack of multiple disks, for example, and is mapped by the disk controller to a cylinder or track, a head number indicating a particular head in a multi-disk system, and a sector. For example, block ‘0’ is typically located at the beginning of a first track on a first cylinder, and the block with the highest available number is the last block on a last track on a last cylinder.
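One way to picture the mapping from an LBA to a cylinder, head, and sector is the classic fixed-geometry conversion sketched below. This is an idealized illustration only: the function name `lba_to_chs` and the parameters `heads_per_cylinder` and `sectors_per_track` are assumptions, and actual disk controllers typically use zoned recording and internal remapping that this formula does not capture.

```python
def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Idealized LBA -> (cylinder, head, sector) mapping; sectors are numbered from 1."""
    cylinder = lba // (heads_per_cylinder * sectors_per_track)
    head = (lba // sectors_per_track) % heads_per_cylinder
    sector = (lba % sectors_per_track) + 1
    return cylinder, head, sector


# Block 0 lands at the start of the first track on the first cylinder.
print(lba_to_chs(0, heads_per_cylinder=16, sectors_per_track=63))  # (0, 0, 1)
```

Because consecutive LBAs map to the same or neighboring cylinders under such a scheme, sorting by LBA tends to place physically close blocks next to each other in the access order.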
At blocks 421 and 431, for each entry in the directory, the method 400 determines if the entry is itself a directory. If so, then control returns to block 401. Otherwise, if the entry is not a directory and is a file on the storage device, at block 441, the method 400 seeks to the file on the storage device and at block 451, reads the content of the file.
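The sorted traversal of method 400 might be sketched as follows. This is a hedged illustration, not the described implementation: the names `traverse_in_location_order`, `visit`, and `location_key` are hypothetical. Portable Python has no call that returns the first-block LBA of a file, so the default key uses the inode number as a rough stand-in for physical location. That substitution is an assumption and is not the LBA-based sort described above; a platform-specific key that returns the first block address could be passed in its place.

```python
import os


def inode_key(entry):
    """Assumed proxy for on-disk location; NOT the first-block LBA itself."""
    return entry.inode()


def traverse_in_location_order(root, visit, location_key=inode_key):
    """Walk the tree, visiting each directory's entries sorted by location_key."""
    # Obtain the entries, then (block 411) sort them by storage location.
    with os.scandir(root) as it:
        entries = sorted(it, key=location_key)
    for entry in entries:
        # Blocks 421/431: is the entry itself a directory?
        if entry.is_dir(follow_symlinks=False):
            traverse_in_location_order(entry.path, visit, location_key)
        elif entry.is_file(follow_symlinks=False):
            # Blocks 441/451: seek to the file and read its content.
            with open(entry.path, "rb") as f:
                visit(entry.path, f.read())
```

The only difference from a conventional walk is the sort step, so the approach stays in file mode while changing the order in which files are read.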
Thus, because the time taken to seek between two nearby locations on the disk (approximately 2 ms) is smaller than the time taken to seek between two random locations on the disk (approximately 10-20 ms), the time taken to traverse the files in the directory of the file system tree is reduced. In the example case of a hard disk drive embodiment, the disk head would not need to travel to a distant portion of the disk to read a first file and then back to another portion to read a next file.
The seek time between files is smaller after sorting because, in the case of a hard disk drive, a seek between two disk locations consists of a radial seek (comprising an actuator move) and a rotational seek. Actuator movements between nearby cylinders can take as little as 1-2 ms, while movements between distant cylinders can take 10-20 ms. Also, rotational seeks between locations on the same or nearby cylinders can take less than half of a rotation. Thus, seek times between locations sorted according to their LBAs can be much shorter than average seek times for a given disk type.
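The effect of sorting on head movement can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: it measures total travel in LBA units as a proxy for radial seek distance, ignores rotational latency and the non-linear relationship between distance and seek time, and uses randomly generated addresses rather than real file locations. The function name `total_travel` is hypothetical.

```python
import random


def total_travel(positions):
    """Sum of distances between consecutive accesses, in LBA units."""
    return sum(abs(b - a) for a, b in zip(positions, positions[1:]))


rng = random.Random(0)  # fixed seed so the run is repeatable
lbas = [rng.randrange(10**9) for _ in range(10_000)]  # 10,000 files in one directory

unsorted_travel = total_travel(lbas)
sorted_travel = total_travel(sorted(lbas))  # a single sweep: max(lbas) - min(lbas)

print(unsorted_travel / sorted_travel)  # how many times less travel after sorting
```

In this model the sorted order reduces travel to a single sweep across the occupied address range. The improvement in actual seek time is smaller than the reduction in travel distance, since even short seeks incur settling time and rotational delay.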
To illustrate, if a list of about 465 files (an average number of files if the directory tree is three levels deep) to about 10,000 files (an average number of files if the directory tree is two levels deep) is sorted in the order of their disk locations, then the average seek time between consecutive locations can be reduced by a factor of approximately 5-10. While seek operations will still dominate over read operations in terms of time, the overall time to access the data will be approximately 5-10 times smaller. Accordingly, process 400 may be used to improve performance of traversal of large file systems that have a very large number of files that are small in size. Further, process 400 may be applied to multiple file systems and to various existing and future storage devices in which seek time between close locations is much shorter than between distant locations.
In the foregoing description, the invention has been described with reference to magnetic disk based storage devices. However, the invention applies to any storage device in which seek time between two locations with distant addresses takes substantially more time than a seek between two addresses that are close by. For instance, the invention can be used to traverse a file system tree on a storage device that is tape-based storage, has a rotating disk or employs MEMS-based storage.
In practice, the method 400 may constitute one or more programs made up of machine-executable instructions. Describing the method with reference to the flowchart in
The following description of
It will be appreciated that the computer system 52 is one example of many possible computer systems which have different architectures. For example, personal computers based on an Intel microprocessor often have multiple buses, one of which can be an input/output (I/O) bus for the peripherals and one that directly connects the processor 55 and the memory 59 (often referred to as a memory bus). The buses are connected together through bridge components that perform any necessary translation due to differing bus protocols.
It will also be appreciated that the computer system 52 is controlled by operating system software which includes a file management system, such as a disk operating system, which is part of the operating system software. The file management system is typically stored in the non-volatile storage 65 and causes the processor 55 to execute the various acts required by the operating system to input and output data and to store data in memory, including storing files on the non-volatile storage 65.
The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise an electronic tester selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are accordingly to be regarded in an illustrative sense rather than a restrictive sense.