This application is related to U.S. patent application Ser. No. 10/271,633, entitled ZERO COPY WRITES THROUGH USE OF MBUFS, by Douglas Santry, et al., the teachings of which are expressly incorporated herein by reference.
The present invention relates to file systems and, more specifically, to a technique for writing files within a file system onto a storage medium.
A file server is a computer that provides file service relating to the organization of information on storage devices, such as disks. The file server or filer may be embodied as a storage system including a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of disk blocks configured to store information, such as text, whereas the directory may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application, such as a file system protocol, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the filer by issuing file system protocol messages, usually in the form of packets, to the filer over the network.
As used herein, the term storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and client access requests and may implement file system semantics in implementations involving filers. In this sense, the Data ONTAP™ storage operating system, available from Network Appliance, Inc. of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL™) file system, is an example of such a storage operating system implemented as a microkernel within an overall protocol stack and associated disk storage. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
The disk storage is typically implemented as one or more storage volumes that comprise physical storage disks, defining an overall logical arrangement of storage space. Currently available filer implementations can serve a large number of discrete volumes (150 or more, for example). Each volume is associated with its own file system and, for purposes hereof, volume and file system shall generally be used synonymously. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of parity information with respect to the striped data. As described herein, a volume typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation.
Additionally, a filer may be made more reliable and stable in the event of a system shutdown or other unforeseen problem by employing a backup memory consisting of a non-volatile random access memory (NVRAM) as part of its architecture. An NVRAM is typically a large-volume solid-state memory array (RAM) having either a back-up battery, or other built-in last-state-retention capabilities (e.g. a FLASH memory), that holds the last state of the memory in the event of any power loss to the array.
Packets of information are received at a filer by a network subsystem comprising one or more integrated software layers that provide data paths through which clients access information stored in the filer. The received information is typically copied into memory buffer data structures or mbufs in the filer's “in-core” memory. The mbufs organize the received information in a standardized format that can be manipulated within the network subsystem. For instance, each mbuf typically comprises header and data portions of a predetermined size and format. Information stored in an mbuf may include a variety of different data types including, inter alia, source and destination addresses, socket options, user data and file access requests. Further, an mbuf also may reference information contained in a separate mbuf data section. Mbufs can be used as elements of larger data structures, e.g. linked lists, and are particularly useful in dynamically changing data structures since they can be created or removed “on the fly.” A general description of mbuf data structures is provided in TCP/IP Illustrated, Volume 2 by Wright et al. (1995), which is incorporated herein by reference.
A filer often receives information from a network as data packets of various lengths. Accordingly, these packets are received by the filer's network subsystem and are copied into variable length “chains” (e.g., linked lists) of mbufs. However, the filer's storage subsystem (e.g., including its file system) usually operates on fixed-sized blocks of data. For instance, the WAFL file system is configured to operate on data stored in contiguous 4 kilobyte (kB) blocks. Therefore, data received by the filer must be converted from the variable length mbufs used within the filer's network subsystem to fixed-sized data blocks that may be manipulated within its storage subsystem.
Conventionally, the process of converting data stored in mbufs to fixed-sized blocks involves copying the contents of the mbufs into one or more fixed-sized data buffers in the filer's memory. While the network subsystem maintains “ownership” (i.e., control) of the mbuf data structures, the fixed-sized data buffers instead are managed within the storage subsystem. After the mbufs' data has been copied into the fixed-sized data buffers, the network subsystem typically de-allocates the mbufs. More specifically, the de-allocated (“free”) mbufs may be placed in a list or “pool” of free mbufs that later may be used to store new data packets received at the filer.
Problems often arise because a significant amount of time and system resources, such as memory and central processing unit (CPU) cycles, is typically required to copy received data from variable length mbufs to fixed-sized data buffers. Moreover, the consumption of time and resources becomes particularly noticeable when the contents of a large number of mbufs must be copied to fixed-sized data buffers before their contained data may be written to disk storage. It would therefore be desirable to minimize the number of times received data is copied from mbuf data structures into fixed-sized data buffers, e.g., in the process of writing the received data to disk storage.
The present invention provides techniques for managing ownership (i.e., control) of one or more memory buffer (mbuf) data structures within a network subsystem and a storage subsystem of a storage operating system implemented in a storage system. Data to be written to a storage medium in the storage system is received at the network subsystem and stored in one or more variable-length chains of mbufs. Unlike conventional approaches, the received data is not subsequently copied out of the mbufs into fixed-sized data buffers for use by the storage subsystem. Instead, the storage subsystem can directly manipulate the received data in the mbufs. By eliminating the steps of copying data out of the mbufs and into fixed-sized data buffers, the invention reduces the amount of time and system resources consumed by the storage system when writing the received data to disk storage. As a result, the storage system utilizes its memory more efficiently and increases the throughput of its write operations.
In the illustrative embodiments, a “write path” of the storage operating system is modified so the storage subsystem may directly manipulate data stored in mbuf data structures. However, the operating system's other file-access request paths, such as its “read path,” remain unaltered. As used herein, the write path defines the code used by the storage operating system to process requests to WRITE data to a storage medium, and the read path defines the code used by the storage operating system to process requests to READ data from a storage medium. Those skilled in the art will appreciate that the write and read paths can be implemented as software code, hardware or some combination thereof.
Operationally, a request to WRITE user data to a storage medium is received at an interface of the storage system. The network subsystem copies the received data into one or more mbuf data structures, which may be stored in non-contiguous regions in the storage system's “in-core” memory. Because the storage system received a WRITE request, the received data is logically partitioned into contiguous, e.g., 4 kilobyte, data blocks that may be written to the storage medium. Advantageously, the received data is partitioned directly in the mbufs without copying the data into newly allocated data buffers. To that end, the storage subsystem generates sets of one or more buffer pointers that address various portions of the received data. The union of the data portions referenced by a set of buffer pointers defines a contiguous data block that may be written to the storage medium. Each set of buffer pointers is stored in a corresponding buffer header. According to the invention, mbuf data sections containing the partitioned data blocks are not de-allocated by the storage operating system until their data blocks are written to the storage medium.
In a first illustrative embodiment, the storage subsystem “shares” ownership of the mbuf data sections with the network subsystem. That is, both the storage subsystem and the network subsystem control when the mbuf data sections may be de-allocated. The network subsystem de-allocates the mbuf data sections based on the values of their respective “mbuf reference counts.” Here, an mbuf data section's associated mbuf reference count indicates the number of mbufs referencing data stored in the data section. Accordingly, the network subsystem de-allocates an mbuf data section when the value of its associated mbuf reference count equals zero. The storage subsystem controls when each mbuf data section is de-allocated by preventing the mbuf data section's associated mbuf reference count from equaling zero until after the data section's contained data is written to the storage medium.
According to the first embodiment, for each buffer pointer generated by the storage subsystem, the storage subsystem also generates a “file-system proxy mbuf” through which the buffer pointer references a portion of the received data. The file-system proxy mbufs ensure that the values of the mbuf reference counts associated with the mbuf data sections storing the received data are greater than or equal to one. In this manner, the generated file-system proxy mbufs prevent their referenced mbuf data sections from being de-allocated by the network subsystem. Every time a contiguous block of the received data is written to the storage medium, the storage subsystem de-allocates the buffer header, buffer pointers and file-system proxy mbufs associated with the written data block. After the file-system proxy mbufs have been de-allocated, the network subsystem may then de-allocate those mbuf data sections whose mbuf reference counts equal zero.
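The shared-ownership scheme of the first embodiment can be sketched in a few lines. This is a minimal illustration only, not the actual storage operating system code: the class names `DataSection` and `Mbuf` and the `freed` flag are hypothetical, and a single proxy mbuf stands in for the one-proxy-per-buffer-pointer arrangement described above.

```python
class DataSection:
    """Hypothetical mbuf data section holding received data."""

    def __init__(self, payload):
        self.payload = payload
        self.ref_count = 0      # the "mbuf reference count"
        self.freed = False      # set when the section is de-allocated


class Mbuf:
    """Hypothetical mbuf; creating one references the data section."""

    def __init__(self, section):
        self.section = section
        section.ref_count += 1

    def free(self):
        # De-allocating an mbuf drops its reference; the section is
        # de-allocated only when no mbuf references it any longer.
        self.section.ref_count -= 1
        if self.section.ref_count == 0:
            self.section.freed = True


section = DataSection(b"x" * 2048)
network_mbuf = Mbuf(section)   # created by the network subsystem on receipt
proxy_mbuf = Mbuf(section)     # file-system proxy mbuf (storage subsystem)

network_mbuf.free()            # network subsystem is finished with the data
freed_before_write = section.freed

# ... the data block is written to the storage medium here ...
proxy_mbuf.free()              # storage subsystem releases its proxy
freed_after_write = section.freed
```

In this sketch the proxy mbuf keeps the count at one after the network subsystem releases its own mbuf, so the data section survives until the write completes.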
In a second illustrative embodiment, control of the mbuf data structures storing the received data is transferred from the network subsystem to the storage subsystem. In this embodiment, the storage subsystem allocates a “reference-tracking” data structure comprising a reference count that is incremented every time the storage subsystem generates a set of buffer pointers defining a contiguous block of the received data. The reference count is decremented after every time a contiguous block of the received data is written to disk storage. Accordingly, the storage subsystem does not de-allocate the one or more mbufs storing the received data nor de-allocate their associated reference-tracking data structure until the reference count in their associated reference-tracking data structure equals zero.
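The reference-tracking scheme of the second embodiment may be sketched as follows. The class name `ReferenceTracker`, its method names, and the `released` flag are assumptions made for illustration; the sketch models only the counting rule, not the transfer of ownership itself.

```python
class ReferenceTracker:
    """Hypothetical reference-tracking structure for one mbuf chain."""

    def __init__(self, mbuf_chain):
        self.mbuf_chain = mbuf_chain   # now owned by the storage subsystem
        self.ref_count = 0
        self.released = False

    def block_defined(self):
        # Called each time a set of buffer pointers defining a
        # contiguous block of the received data is generated.
        self.ref_count += 1

    def block_written(self):
        # Called after a contiguous block has been written to disk.
        self.ref_count -= 1
        if self.ref_count == 0:
            self.mbuf_chain = None     # de-allocate the mbufs
            self.released = True       # and the tracker itself


# Two blocks reference data held in the same mbuf chain.
tracker = ReferenceTracker(mbuf_chain=[b"a" * 2048, b"b" * 3072])
tracker.block_defined()
tracker.block_defined()

tracker.block_written()
released_after_first_write = tracker.released
tracker.block_written()
```

The mbufs remain allocated after the first write because a second block still references them; they are released only when the count returns to zero.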
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identically or functionally similar elements, of which:
In the illustrative embodiment, the memory 124 comprises storage locations for data and processor instructions that are addressable by the processor 122 and adapters 126 and 128. A portion of the memory is organized as a buffer cache 135 having buffers used by the file system to store data associated with, e.g., write requests. The processor and adapters may, in turn, comprise additional processing elements, memory and/or logic circuitry configured to execute software code and manipulate data structures. The operating system 200, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the filer by, inter alia, invoking storage operations in support of a file service implemented by the filer. It will be apparent to those skilled in the art that other processing and memory means may be used for storing and executing program instructions pertaining to the inventive technique described herein. For example, memory 124 may comprise any combination of volatile and non-volatile computer readable media, and processor 122 may offload some processing to other specialized processors that may or may not be resident within filer 120.
Notably, the filer 120 includes an NVRAM 121 that provides fault-tolerant backup of data, enabling the integrity of filer transactions to survive a service interruption based upon a power failure, or other fault. The size of the NVRAM is variable. It is typically sized sufficiently to log a certain time-based chunk of transactions (for example, several seconds' worth). An NVLOG 123 in the NVRAM 121 is filled after each client request is completed (for example, file service requests to LOAD, MODIFY, etc.), but before the result of the request is returned to the requesting client. The NVLOG contains a series of ordered entries corresponding to discrete client messages requesting file transactions such as “WRITE,” “CREATE,” “OPEN,” and the like. These entries are logged in the particular order completed. In other words, each request is logged to the NVLOG at the time of completion, when the results of the requests are about to be returned to the client. The use of the NVLOG for system backup and crash recovery operations is generally described in commonly assigned application Ser. No. 09/898,894, entitled System and Method for Parallelized Replay of an NVRAM Log in a Storage Appliance by Steven S. Watanabe et al., which is expressly incorporated herein by reference.
In an illustrative embodiment, the disk shelf 132 is arranged as a plurality of separate disks 130. The disk shelf 132 may include, in some embodiments, dual connectors for redundant data paths. The disks 130 are arranged into a plurality of volumes, each having a file system associated therewith. Each volume includes one or more disks 130. In one embodiment, the physical disks 130 are configured into RAID groups so that some disks store striped data and at least one disk stores separate parity for the data, in accordance with a preferred RAID 4 configuration. However, other configurations (e.g. RAID 5 having distributed parity across stripes) are also contemplated. In this embodiment, a minimum of one parity disk and one data disk is employed. However, a typical implementation may include three data and one parity disk per RAID group, and a multiplicity of RAID groups per volume.
The network adapter 126 comprises the mechanical, electrical and signaling circuitry needed to connect the filer 120 to a client 110 over a computer network 140, which may comprise a point-to-point connection or a shared medium, such as a local area network. The client 110 may be a general-purpose computer configured to execute applications 112, such as a database application. Moreover, the client 110 may interact with the filer 120 in accordance with a client/server model of information delivery. That is, the client may request the services of the filer, and the filer may return the results of the services requested by the client, by exchanging packets 150 encapsulating, e.g., the Common Internet File System (CIFS) protocol or Network File System (NFS) protocol format over the network 140.
The storage adapter 128 cooperates with the operating system 200 executing on the filer to access information requested by the client. The information may be stored on disks 130 attached to the filer via the storage adapter. The storage adapter 128 includes input/output (I/O) interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, Fibre Channel serial link topology. The information is retrieved by the storage adapter and, if necessary, processed by the processor 122 (or the adapter 128 itself) prior to being forwarded over the system bus 125 to the network adapter 126, where the information is formatted into a packet and returned to the client 110.
A local user 160 or administrator may also interact with the storage operating system 200 or with the one or more applications 129 executing on filer 120. The local user can input information to filer 120 using a command line interface (CLI) or graphical user interface (GUI) or other input means known in the art and appropriate I/O interface circuitry, e.g. a serial port. Thus, operating system 200 can receive file requests not only from a remote client 110, but also from a local user 160 and local applications 129.
Again to summarize, as used herein, the term “storage operating system” generally refers to the computer-executable code operable on a storage system that implements file system semantics (such as the previously-referenced WAFL) and manages data access. In this sense, Data ONTAP™ software is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
The organization of a storage operating system for the exemplary filer is now described briefly. However, it is expressly contemplated that the principles of this invention can be implemented using a variety of alternate storage operating system architectures. As shown in
Bridging the disk software layers with the network subsystem is a file system layer 230 of the storage operating system 200. Here, the file system layer and the disk software layers may be collectively referred to as a storage subsystem. Generally, the layer 230 implements a file system having an on-disk format representation that is block-based using, e.g., 4-kilobyte (KB) data blocks and using inodes to describe the files. An inode is a data structure used to store information about a file, such as ownership of the file, access permission for the file, size of the file, name of the file, location of the file, etc. In response to file access requests, the file system generates operations to load (retrieve) the requested data from disks 130 if it is not resident “in-core”, i.e., in the buffer cache 135. If the information is not in buffer cache, the file system layer 230 indexes into an inode file using an inode number to access an appropriate entry and retrieve a logical volume block number. The file system layer 230 then passes the logical volume block number to the disk storage (RAID) layer 224, which maps that logical number to a disk block number and sends the latter to an appropriate driver (for example, an encapsulation of SCSI implemented on a Fibre Channel disk interconnection) of the disk driver layer 226. The disk driver accesses the disk block number from disks 130 and loads the requested data in memory 124 for processing by the filer 120. Upon completion of the request, the filer (and storage operating system) returns a reply, e.g., a conventional acknowledgement packet defined by the CIFS specification, to the client 110 over the network 140.
It should be noted that the software “path” 250 through the storage operating system layers described above needed to perform data storage access for the client request received at the filer may alternatively be implemented in hardware or a combination of hardware and software. That is, in an alternate embodiment of the invention, the storage access request path 250 may be implemented as logic circuitry embodied within a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). This type of hardware implementation increases the performance of the file service provided by filer 120 in response to a file system request packet 150 issued by client 110. Moreover, in another alternate embodiment of the invention, the processing elements of network and storage adapters 126 and 128 may be configured to offload some or all of the packet processing and storage access operations, respectively, from processor 122 to thereby increase the performance of the file service provided by the filer. Further, in a multiprocessor implementation, different software layers of the operating system 200 (or portions thereof) may concurrently execute on a plurality of processors 122.
Referring again to
In general, mbufs are used to organize received information in a standardized format that can be passed among the different layers of a storage operating system. However, not every storage operating system uses the same mbuf structure.
Mbuf 300 comprises header portion 310 and data portion 320 typically having a fixed combined length, e.g. 128 bytes. Header portion 310 comprises a plurality of fields used to describe the contents of the mbuf which may include data (such as a packet) stored in data portion 320. The fields in an mbuf can be tailored to specific applications, although one representative set of fields 301-309 is included in mbuf 300. The NEXT field 301 and NEXTPKT field 302 contain pointers to other data structures, such as another mbuf. When data is received in packets, the NEXT pointer usually links mbufs storing data from the same received packet, whereas the NEXTPKT pointer usually links different data packets associated with the same overall data transmission. When data is stored in the data portion 320 of mbuf 300, the DATAPTR field 304 comprises a pointer to locate the data and LENGTH field 303 keeps track of the amount of data stored. The TYPE field 305 can indicate the type of data stored in the mbuf, e.g. packet header data and/or user data, and the FLAGS field 306 is used to indicate anomalies and special attributes, e.g. the presence of an mbuf data section or packet header, associated with the mbuf. When a received data packet spans multiple mbufs, the PKTLEN field 307 can be used to store the overall packet length.
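By way of illustration, the relationship between the NEXT, LENGTH and PKTLEN fields described above can be sketched as follows. The `Mbuf` class here is a simplified, hypothetical model that carries only a subset of fields 301-309, and the helper names `chain_length` and `build_packet` are assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mbuf:
    next: Optional["Mbuf"] = None      # NEXT: next mbuf of the same packet
    nextpkt: Optional["Mbuf"] = None   # NEXTPKT: first mbuf of the next packet
    length: int = 0                    # LENGTH: bytes stored in this mbuf
    data: bytes = b""                  # the data that DATAPTR would locate
    pktlen: int = 0                    # PKTLEN: overall packet length


def chain_length(head):
    """Sum the LENGTH fields of every mbuf reachable through NEXT."""
    total, node = 0, head
    while node is not None:
        total += node.length
        node = node.next
    return total


def build_packet(pieces):
    """Link a non-empty list of byte strings into a chain of mbufs and
    record the overall packet length in the leading mbuf."""
    head = None
    for piece in reversed(pieces):
        head = Mbuf(next=head, length=len(piece), data=piece)
    head.pktlen = chain_length(head)
    return head


packet = build_packet([b"hdr", b"user-data-1", b"user-data-2"])
```

Here the leading mbuf stores the overall packet length, while the NEXT pointer of the last mbuf in the chain is left unset.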
When the data in a received packet is too large to fit in data portion 320, the mbuf may extend its data portion through use of an mbuf data section 330. The EXTPTR field 308 can store a pointer that references an mbuf data section whose size can be stored in EXTSIZE field 309. Usually, mbuf data sections are a fixed size, e.g. 1 kilobyte or 2 kilobytes, although EXTPTR could point to mbuf data sections of any arbitrary size. Because an mbuf data section extends the data portion of an mbuf, DATAPTR field 304 may point directly to an address in the data section 330 instead of an address in data portion 320.
Furthermore, an mbuf data section may be used by more than one mbuf, so DATAPTR pointers from a plurality of mbufs could point to different addresses within the same data section 330. To keep track of the number of DATAPTR pointers referencing data stored in an mbuf data section, the data section may include an mbuf REFERENCE COUNT 335 that is incremented, e.g., by a network layer in a storage operating system, every time a new mbuf references data in the data section. Similarly, the count 335 is decremented when an mbuf no longer references data in the mbuf data section 330. In this way, the storage operating system may de-allocate and “free” the mbuf data section when its reference count is decremented to zero.
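The reference-counting behavior just described may be sketched as follows. The names `DataSection`, `Mbuf`, `attach` and `release` are hypothetical, and the `freed` flag merely stands in for returning the data section to a free pool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSection:
    """Hypothetical stand-in for mbuf data section 330."""
    data: bytearray
    ref_count: int = 0      # plays the role of mbuf REFERENCE COUNT 335
    freed: bool = False


@dataclass
class Mbuf:
    """Hypothetical mbuf referencing a region of a shared data section."""
    section: Optional[DataSection]
    offset: int             # plays the role of DATAPTR
    length: int             # plays the role of LENGTH


def attach(section, offset, length):
    """Create an mbuf that references `section`, incrementing its count."""
    section.ref_count += 1
    return Mbuf(section, offset, length)


def release(mbuf):
    """Drop the mbuf's reference; free the section when the count is zero."""
    section = mbuf.section
    mbuf.section = None
    section.ref_count -= 1
    if section.ref_count == 0:
        section.freed = True


section = DataSection(bytearray(2048))
first = attach(section, 0, 1024)       # two mbufs point at different
second = attach(section, 1024, 1024)   # addresses in the same section

release(first)
still_live = not section.freed         # one reference remains
release(second)
```

Releasing the first mbuf leaves the data section allocated; it is freed only when the second mbuf is released and the count reaches zero.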
Although
To form the linked list of mbufs 400, the NEXT fields 411 and 431 point to adjacent mbufs, although NEXT field 451 is set to NULL since it resides in the last mbuf of the chain. The NEXTPKT fields 412, 432 and 452 are all set to NULL since there is only a single data packet. The amount of data in each respective mbuf is stored by LENGTH fields 413, 433 and 453 and the DATAPTR fields 414, 434 and 454 locate their stored data. The PKTLEN field 417 in “leading” mbuf 410 stores the overall length of data packet 400 and subsequent PKTLEN fields 437 and 457 are set to NULL, although each mbuf in the chain could store the overall packet length. The TYPE field 415 and FLAGS field 416 indicate the data in mbuf 410 is header data whereas the TYPE fields 435 and 455 indicate mbufs 430 and 450 store user data.
Packet header data 425 is small enough to fit in data portion 420; however, user data 447 and 467 are too large to fit in their respective data portions 440 and 460 and therefore require use of the mbuf data sections 445 and 465. Each of the data sections stores user data for only one mbuf, as indicated by their respective MBUF REFERENCE COUNTS 448 and 468. FLAGS fields 436 and 456 indicate mbuf data sections are being used by the mbufs 430 and 450, and EXTPTR fields 438 and 458 point to the beginning of the data sections 445 and 465. Each of the mbuf data sections in
Broadly stated, information is usually received by a filer as a plurality of data packets that may be of varying sizes. The filer's storage operating system receives each packet and stores it in a chain of mbufs as shown in
In summary, when a filer receives data from a network, its network subsystem allocates memory buffers from the buffer cache 135 to store the received data. The memory buffers used to store incoming data from a network can subsequently be used within the different layers of a storage operating system. Since data is often received from a network in packets of unequal sizes, linked lists of memory buffers may be employed to store a received data packet. Typically, a plurality of data packets will be associated with a single data transmission, and a linked list of memory buffer chains may store the overall data transmission.
Storage subsystems often organize data in fixed-sized data blocks and represent files as a sequence of these data blocks. For instance, the Data ONTAP™ storage operating system, available from Network Appliance, Inc. of Sunnyvale, Calif., implements a Write Anywhere File Layout (WAFL™) file system that stores files in 4 kilobyte data blocks. However, when a network subsystem receives a file as a series of packets having various lengths, the file is usually stored in a linked list of mbuf chains, each chain storing one of the data packets. Therefore, the received data typically must be converted from the variable length mbuf chains to fixed block sizes used in the storage subsystem.
In accordance with the invention,
As shown in
If it is assumed the file system manipulates data in 4 kB blocks, the 8 kB data transmission 600 can be partitioned into two fixed-sized data blocks having buffer headers 640 and 650. Although the fixed blocks illustrated are assumed to be 4 kB, the file system could partition the received file into blocks of an arbitrary predetermined size. Each buffer header comprises one or more buffer pointers that define a single fixed size, e.g., 4 kilobyte, block of data. For example, buffer header 640 is associated with a 4 kB block defined by buffer pointers 642 and 644. Buffer pointer 642 addresses 2 kB of data in mbuf chain 610, and buffer pointer 644 addresses the first 2 kB of data in mbuf chain 620. Buffer header 650 is associated with another 4 kB block of data defined by buffer pointers 652 and 654. Buffer pointer 652 addresses the remaining 3 kB of data in mbuf chain 620 and buffer pointer 654 addresses 1 kB of data stored in mbuf chain 630.
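The partitioning just described can be illustrated with a short sketch that generates buffer pointers without copying any data. The function name `partition` and the representation of a buffer pointer as a `(chain_index, offset, length)` tuple are assumptions made for illustration; chain indices 0, 1 and 2 stand in for mbuf chains 610, 620 and 630.

```python
KB = 1024
BLOCK_SIZE = 4 * KB


def partition(chain_lengths, block_size=BLOCK_SIZE):
    """Split consecutive mbuf chains into fixed-sized blocks.

    Returns a list of blocks, each a list of buffer pointers of the
    form (chain_index, offset, length). No data is copied.
    """
    blocks, current, room = [], [], block_size
    for index, length in enumerate(chain_lengths):
        offset = 0
        while offset < length:
            take = min(room, length - offset)
            current.append((index, offset, take))
            offset += take
            room -= take
            if room == 0:               # block is full; start a new one
                blocks.append(current)
                current, room = [], block_size
    if current:                         # trailing partial block, if any
        blocks.append(current)
    return blocks


# Chains of 2 kB, 5 kB and 1 kB make up the 8 kB transmission.
blocks = partition([2 * KB, 5 * KB, 1 * KB])
```

With these inputs the first block is defined by 2 kB from the first chain plus the first 2 kB of the second chain, and the second block by the remaining 3 kB of the second chain plus 1 kB from the third, matching buffer headers 640 and 650.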
Because each fixed-sized data block in
Next, at step 708, the IP and TCP layers of the storage operating system strip network header information from the received data and forward the resultant mbufs to a file system protocol layer. In some implementations, the IP and TCP layers may additionally perform other functions such as data integrity checks and/or cryptographic functions conventionally known in the art. At step 710, the file system protocol layer determines the type of file access request that has been received. In many cases, such as a request to READ or REMOVE a file, the received data does not need to be partitioned into fixed block sizes. However, at step 712, if the file system protocol layer determines a file WRITE request has been received, the storage operating system must “divide” the received data into fixed block sizes its file system can manipulate. Similarly, step 712 may also check for file access requests besides WRITE requests that require the received data to be partitioned into equal sized segments. If the file access request does not require the data to be partitioned, the file system can process the file access request as normal at step 714.
If a file WRITE request is received, the file system layer of the present invention generates one or more buffer pointers that define consecutive fixed-sized data blocks at step 716. The one or more generated buffer pointers can address data directly in the mbuf data structures and, according to an aspect of the invention, obviate the need to copy the contents of the mbufs into memory for partitioning purposes. Thus, the file system layer can generate a plurality of buffer headers, each buffer header storing one or more pointers that define a fixed-sized data block. In the event a procedure not in the write path, i.e., a file READ request, attempts to access a fixed-sized data block of the present invention before it is written to disk, the storage operating system may convert, e.g., via a conventional scatter/gather method, the one or more buffer pointers that define the data block to a contiguous “in-core” data block that may be accessed by the procedure.
Typically, after the received data has been divided into fixed-sized blocks as defined by a set of generated buffer headers, a RAID layer receives the data blocks, or alternatively receives the buffer headers that define the data blocks, at step 718. For RAID implementations that store parity information, such as RAID-4, the RAID layer can be modified to use the buffer pointers generated at step 716 to locate consecutive fixed-sized data blocks defined by a set of buffer pointers. The RAID layer then performs exclusive-or (XOR) operations on each data block and associates a computed parity value with the data block. The RAID layer may also assign physical block numbers to each data block. At step 720, a disk driver layer, e.g. SCSI layer, receives information from the RAID layer. According to the inventive method, the disk storage layer uses a conventional scatter/gather or equivalent method to convert the set of buffer pointers that address data in one or more mbufs into contiguous fixed-sized data blocks at step 722. That is, the disk storage layer performs a translation procedure that conforms, e.g. a larger “address space” in a plurality of mbufs to a smaller address space in a fixed-sized data block.
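A minimal sketch of the gather and parity operations follows. It assumes buffer pointers of the hypothetical form `(buffer_index, offset, length)`, scales the block size down to 4 bytes for readability, and computes parity across the blocks of a stripe in the manner of a conventional RAID-4 arrangement; the function names are illustrative and do not come from the specification.

```python
def gather(buffers, pointers):
    """Assemble one contiguous block from scattered buffer regions."""
    return b"".join(bytes(buffers[i][off:off + n]) for i, off, n in pointers)


def xor_parity(blocks):
    """Return the bytewise XOR of a list of equal-length blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for position, byte in enumerate(block):
            parity[position] ^= byte
    return bytes(parity)


# Three variable-length buffers standing in for mbuf data (2, 5 and 1 bytes).
buffers = [b"ab", b"cdefg", b"h"]

# Two sets of buffer pointers, each defining one 4-byte block.
block_a = gather(buffers, [(0, 0, 2), (1, 0, 2)])
block_b = gather(buffers, [(1, 2, 3), (2, 0, 1)])

parity = xor_parity([block_a, block_b])

# XOR parity allows a lost block to be rebuilt from the survivors.
rebuilt_a = xor_parity([parity, block_b])
```

The gather step is the only point at which data is copied into contiguous form, which is consistent with deferring the conversion until the blocks are about to be written.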
At step 724, the disk driver layer uses information from the RAID layer, such as block numbers and parity data, to write the newly formed contiguous data blocks to disk. In an illustrative embodiment, the disk storage layer may transfer the partitioned blocks of data to a storage medium at predetermined time intervals, or consistency points (CP). More specifically, just before step 716, the received data may be copied to an NVRAM 121 and the WRITE request logged in an NVLOG 123 until a system backup operation is performed at a CP. When such a system backup occurs, the data blocks stored in the “in-core” mbuf data structures are transferred to disk, as per steps 716-724, and the contents of the NVRAM and NVLOG are cleared or reset. By periodically writing data to disk in this manner, the state of the filer at the last saved CP may be restored, e.g., after a power failure or other fault, by replaying the file access requests logged in the NVLOG.
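The interplay of the NVLOG and consistency points can be sketched with a toy model. The class `ToyFiler` and its methods are hypothetical; the sketch assumes each logged WRITE carries its own data (standing in for the copy held in NVRAM) and that replay simply re-applies the logged writes on top of the state saved at the last CP.

```python
class ToyFiler:
    """Toy model of logging WRITE requests between consistency points."""

    def __init__(self):
        self.disk = {}     # on-disk state as of the last consistency point
        self.in_core = {}  # volatile blocks waiting for the next CP
        self.nvlog = []    # non-volatile log of WRITEs since the last CP

    def write(self, block_no, data):
        # Log the request before acknowledging it, then keep the data in core.
        self.nvlog.append((block_no, data))
        self.in_core[block_no] = data

    def consistency_point(self):
        # Transfer in-core blocks to disk, then clear the log.
        self.disk.update(self.in_core)
        self.in_core.clear()
        self.nvlog.clear()

    def crash_and_recover(self):
        # Volatile memory is lost; the log survives and is replayed in order.
        self.in_core.clear()
        for block_no, data in self.nvlog:
            self.in_core[block_no] = data

    def read(self, block_no):
        # Prefer the newest in-core copy, else fall back to the on-disk copy.
        return self.in_core.get(block_no, self.disk.get(block_no))
```

After `crash_and_recover`, reads return the same data as before the fault, even though nothing written since the last CP had reached disk.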
Once the data blocks have been transferred to disk storage, at step 726, the disk driver layer sends an acknowledgement back to the file system layer if the WRITE is successful, and the file system layer can then “free” the memory occupied by the mbuf chains. The mbufs included in the mbuf chains can be de-allocated and returned to a pool of “free” memory buffers in the filer's memory, at step 728, and the file system can de-allocate all buffer headers that reference the mbufs' data contents.
In accordance with the invention, ownership (i.e., control) of one or more mbufs may be transferred or shared within a network subsystem and a storage subsystem of a storage operating system. Further, an mbuf data section is not de-allocated in the storage operating system until after the data section's contained data is written to a storage medium. In a first implementation, the storage subsystem shares control of the mbufs' data sections with the network subsystem by allocating one or more “file-system proxy” mbufs. In contrast to conventional techniques, the network subsystem cannot de-allocate an mbuf data section until after the storage subsystem de-allocates the file-system proxy mbufs referencing data stored in the mbuf data section. In a second implementation, ownership of one or more mbufs is transferred from the network subsystem to the storage subsystem. In this embodiment, the storage subsystem maintains ownership of the one or more mbufs by allocating a “reference-tracking” data structure for each mbuf or chain of mbufs transferred from the network subsystem. The storage subsystem does not de-allocate the mbufs until a reference count stored in the reference-tracking data structure is decremented to zero.
(i) File-System Proxy Mbufs
According to the first illustrative embodiment, a file system allocates a “file-system proxy” mbuf for each buffer pointer generated by the file system. As used herein, a file-system proxy mbuf is a conventional mbuf that serves as a means of indirection through which a buffer pointer references data stored in an mbuf data section. Because a file-system proxy mbuf is a conventional memory buffer data structure, an mbuf data section's MBUF REFERENCE COUNT is incremented when its contents are referenced by a file-system proxy mbuf. In other words, the mbuf data sections referenced by the allocated file-system proxy mbufs are guaranteed to have non-zero reference counts until the file system de-allocates the file-system proxy mbufs. Thus, unlike previous file system implementations, the file-system proxy mbufs enable the file system in the present invention to maintain control of data transferred from the networking layers without having to copy the data out of one or more mbuf data sections and into newly allocated data buffers.
In operation, the file system preferably de-allocates the file-system proxy mbufs after the data stored in the mbuf data sections is successfully copied to a storage medium. When the storage operating system is configured to transfer data from “in-core” mbuf data structures to disk storage at predetermined consistency points, the file system can de-allocate file-system proxy mbufs that reference transferred data immediately after execution of a CP. Once all the file-system proxy mbufs referencing data in an mbuf data section have been de-allocated, the data section's reference count may be decremented to zero, and the data section may be de-allocated and placed back into a list or “pool” of free (i.e., unused) mbuf data sections.
In accordance with the invention, the file system partitions the data stored in the chain of mbufs without having to copy the data into fixed-sized data buffers. Instead, the file system allocates buffer headers, such as headers 840 and 850, each of which comprises a set of buffer pointers that defines a fixed-sized block of the data stored in the chain of mbufs 800. For instance, the buffer header 840 stores buffer pointers 842 and 844 that reference a fixed-sized block of data stored in the mbuf data sections 810 and 820. Similarly, buffer pointers 852 and 854 in buffer header 850 reference another fixed-sized block of data stored in the mbuf data sections 820 and 830. While two buffer pointers are illustrated in each buffer header in
As shown, each of the buffer pointers references data in the mbuf data sections through a corresponding file-system proxy mbuf. Namely, the pointers 842, 844, 852 and 854 reference data in the chain of mbufs 800 through their respective file-system proxy mbufs 860, 870, 880 and 890. Advantageously, once the file system allocates the file-system proxy mbufs 860-890, the original mbufs 802-808 may be subsequently de-allocated, and the value stored in the MBUF REFERENCE COUNT fields 812, 822 and 832 each may be decremented by one. Because the file-system proxy mbufs ensure the reference counts stored in mbuf data sections 810-830 are greater than or equal to one, the data stored in the data sections is not de-allocated even when the mbufs 802-808 are. Here, it is assumed the mbuf data sections may only be de-allocated when their reference counts equal zero. Thus, by implementing file-system proxy mbufs in this manner, the mbufs 802-808 may be reused to store new network data packets, such as file WRITE requests, even though the mbuf data sections 810-830 may not.
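The reference-counting behavior of file-system proxy mbufs can be sketched as follows. The classes `DataSection` and `Mbuf` are hypothetical stand-ins, and the `freed` flag stands in for returning a data section to the free pool. The sketch assumes, as the text does, that a data section is de-allocated only when its reference count reaches zero.

```python
class DataSection:
    """Stand-in for an mbuf data section with an MBUF REFERENCE COUNT field."""

    def __init__(self, payload: bytes):
        self.payload = payload
        self.ref_count = 0
        self.freed = False  # True once the section is back in the free pool


class Mbuf:
    """Conventional mbuf: referencing a data section increments its count."""

    def __init__(self, section: DataSection):
        self.section = section
        self.released = False
        section.ref_count += 1

    def free(self) -> None:
        if self.released:  # guard against a double free in this sketch
            return
        self.released = True
        self.section.ref_count -= 1
        if self.section.ref_count == 0:
            self.section.freed = True  # last reference gone, section reusable
```

A proxy here is simply a second `Mbuf` created by the file system over the same `DataSection`. Freeing the network's original mbuf then leaves the count at one, so the data remains available until the proxy is freed, e.g., after a CP.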
(ii) Reference-tracking Data Structure
According to the second illustrative embodiment, the file system allocates a “reference-tracking” data structure that enables the file system to gain ownership of the one or more mbufs storing the received data, e.g., associated with a WRITE request. The reference-tracking data structure may comprise a reference count that is incremented every time the file system generates a set of buffer pointers defining a fixed-sized block of the received data. The reference count is decremented every time a fixed-sized data block of the received data is written to disk storage. Accordingly, the file system does not de-allocate the one or more mbufs storing the received data nor their associated reference-tracking data structure until the reference count in their associated reference-tracking data structure equals zero.
In accordance with this embodiment, one or more network layers are modified so they cannot de-allocate mbufs transferred to the file system until a reference count in the mbufs' associated reference-tracking data structure equals zero. That is, while an mbuf is conventionally de-allocated when the value of its MBUF REFERENCE COUNT field equals zero, the second illustrative embodiment instead de-allocates an mbuf when the reference count in its associated reference-tracking data structure equals zero. In addition, each buffer header in this embodiment is modified to include a pointer that references the header's associated reference-tracking structure. Operationally, when the file system generates a new buffer header, it stores a pointer in the buffer header to reference an associated reference-tracking data structure and increments a reference count in that referenced reference-tracking data structure.
Further to this illustrative embodiment, the file system partitions the data stored in the chain of mbufs 1000 without having to copy the data into fixed-sized data buffers. Instead, the file system allocates buffer headers, such as headers 1040 and 1050, each of which comprises a set of generated buffer pointers defining a fixed-sized block of the data stored in the chain of mbufs 1000. For instance, the buffer header 1040 stores buffer pointers 1044 and 1046 that reference data stored in the mbuf data sections 1010 and 1020. Similarly, buffer pointers 1054 and 1056 in buffer header 1050 reference data stored in the mbuf data sections 1020 and 1030. While two buffer pointers are illustrated in each buffer header in
The file system may generate a reference-tracking data structure associated with one or more mbufs transferred from the network layers of the storage operating system. Each generated reference-tracking structure includes a REFERENCE COUNT field that stores the number of buffer headers that contain buffer pointers referencing data stored in the reference-tracking structure's associated mbuf(s). Put another way, since each buffer header is associated with a fixed-sized data block that may be passed within layers of the storage operating system, the value of the REFERENCE COUNT field indicates the number of fixed-sized data blocks stored in the reference-tracking structure's associated mbuf(s).
For example, a reference-tracking data structure 1060 having a REFERENCE COUNT field 1065 is associated with the mbuf chain 1000. As shown, the value stored in the field 1065 equals two since two buffer headers 1040 and 1050 contain buffer pointers that reference data stored in the mbuf chain 1000. Each of the buffer headers 1040 and 1050 comprises a respective pointer 1042 and 1052 that references the reference-tracking structure 1060. Thus, the value stored in the REFERENCE COUNT field 1065 is incremented by the file system every time a pointer in a buffer header references the reference-tracking structure 1060. Likewise, the value is decremented when a pointer in a buffer header no longer references the structure.
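A minimal sketch of the reference-tracking structure follows. The names `RefTracker` and `TrackedBufferHeader` are hypothetical, and the `chain_freed` flag stands in for de-allocating the mbuf chain. The sketch assumes one tracker per transferred mbuf chain, with the count incremented when a buffer header starts referencing the tracker and decremented when that header's block has been written to disk.

```python
class RefTracker:
    """Stand-in for a reference-tracking structure with a REFERENCE COUNT field."""

    def __init__(self, mbuf_chain):
        self.mbuf_chain = mbuf_chain  # mbufs transferred from the network layers
        self.ref_count = 0            # buffer headers currently referencing us
        self.chain_freed = False      # True once the chain may be de-allocated


class TrackedBufferHeader:
    """Buffer header that also points at its associated reference tracker."""

    def __init__(self, tracker: RefTracker, pointers):
        self.pointers = pointers      # buffer pointers defining one data block
        self.tracker = tracker
        tracker.ref_count += 1        # a new header references the tracker

    def written_to_disk(self) -> None:
        if self.tracker is None:      # already released in this sketch
            return
        tracker, self.tracker = self.tracker, None
        tracker.ref_count -= 1
        if tracker.ref_count == 0:
            tracker.chain_freed = True  # no block still depends on the chain
```

With two headers over one chain, as in the example above, the count is two after both headers are created, and the chain becomes eligible for de-allocation only after both blocks have been written.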
Next, at step 1106, the file system generates sets of one or more buffer pointers used to identify fixed-sized blocks of data stored in the transferred mbufs. The file system may store a generated set of buffer pointers, defining a fixed-sized data block, in a corresponding buffer header. For each buffer header that comprises at least one buffer pointer referencing data stored in the one or more mbufs, the file system sets a pointer in the header to reference the generated reference-tracking data structure. At step 1108, the file system updates the value of the reference-tracking structure's REFERENCE COUNT field based on the number of buffer header pointers that reference the reference-tracking structure. The sequence ends at step 1110.
The foregoing has been a detailed description of an illustrative embodiment of the invention. Various modifications and additions can be made without departing from the spirit and scope of the invention. While this description has been written in reference to filers and file servers, the principles are equally pertinent to all types of computers, including those configured for block-based storage systems (such as storage area networks (SAN)), file-based storage systems (such as network attached storage (NAS) systems), combinations of both types of storage systems, and other forms of computer systems. Moreover, the invention may be implemented in a storage system that functions as a network caching device configured to transmit and receive data to/from other network nodes according to a predetermined network communication protocol, such as the network file system (NFS) protocol, hypertext transfer protocol (HTTP), etc.
Although the illustrative embodiments describe writing data to a storage medium as fixed-sized data blocks, e.g., 4 kilobytes, it is also expressly contemplated that the invention is equally applicable to storage systems that write variable-sized data blocks to a storage medium. Thus, sets of buffer pointers stored in different buffer headers in the illustrative embodiments may reference variable-sized blocks of received data. For example, buffer headers 840 and 1040 may store sets of buffer pointers defining contiguous data blocks that are a different size than the contiguous data blocks defined by the buffer pointers stored in the buffer headers 850 and 1050.
Further, it is expressly contemplated that the teachings of this invention can be implemented as software, including a computer-readable medium having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is meant to be taken only by way of example and not to otherwise limit the scope of the invention.