1. Field of the Invention
This invention relates to computer systems and, more particularly, to off-host virtualization of bootable devices within storage environments.
2. Description of the Related Art
Many business organizations and governmental entities rely upon mission-critical applications that access large amounts of data, often exceeding a terabyte. Such data is often stored on many different storage devices, which may be heterogeneous in nature, encompassing many device types from many manufacturers.
Configuring individual applications that consume data, or application server systems that host such applications, to recognize and directly interact with each different storage device that may possibly be encountered in a heterogeneous storage environment would be increasingly difficult as the environment scaled in size and complexity. Therefore, in some storage environments, specialized storage management software and hardware may be used to provide a more uniform storage model to storage consumers. Such software and hardware may also be configured to present physical storage devices as virtual storage devices (e.g., virtual SCSI disks) to computer hosts, and to add storage features not present in individual storage devices to the storage model. For example, features to increase fault tolerance, such as data mirroring, snapshot/fixed image creation, or data parity, as well as features to increase data access performance, such as disk striping, may be implemented in the storage model via hardware or software. The added storage features may be referred to as storage virtualization features, and the software and/or hardware providing the virtual storage devices and the added storage features may be termed “virtualizers” or “virtualization controllers”. Virtualization may be performed within computer hosts, such as within a volume manager layer of a storage software stack at the host, and/or in devices external to the host, such as virtualization switches or virtualization appliances. Such external devices providing virtualization may be termed “off-host” virtualizers, and may be utilized in order to offload processing required for virtualization from the host. Off-host virtualizers may be connected to the external physical storage devices for which they provide virtualization functions via a variety of interconnects, such as Fibre Channel links, Internet Protocol (IP) networks, and the like.
In many corporate data centers, as the application workload increases, additional hosts may need to be provisioned to provide the required processing capabilities. The internal configuration (e.g., file system layout and file system sizes) of each of these additional hosts may be fairly similar, with just a few features unique to each host. Booting and installing each newly provisioned host manually may be a cumbersome and error-prone process, especially in environments where a large number of additional hosts may be required fairly quickly. A virtualization mechanism that allows hosts to boot and/or install operating system software off a virtual bootable target device may be desirable to support consistent booting and installation for multiple hosts in such environments. In addition, in some storage environments it may be desirable to be able to boot and/or install off a snapshot volume or a replicated volume, for example in order to be able to re-initialize a host to a state as of a previous point in time (e.g., the time at which the snapshot or replica was created).
Various embodiments of a system and method for external encapsulation of a volume into a logical unit (LUN) to allow booting and installation on a complex volume are disclosed. According to a first embodiment, a system may include a host, one or more physical storage devices, and an off-host virtualizer. The off-host virtualizer (i.e., a device external to the host, capable of providing block virtualization functionality) may be configured to aggregate storage within the one or more physical storage devices into a logical volume and to generate metadata to emulate the logical volume as a bootable target device. The off-host virtualizer may make the metadata accessible to the host, allowing the host to boot off the logical volume, e.g., off a file system resident in the logical volume.
The metadata generated by the off-host virtualizer may include such information as the layouts or offsets of various boot-related partitions that the host may need to access during the boot process, for example to load a file system reader, an operating system kernel, or additional boot software such as one or more scripts. The metadata may be operating system-specific; i.e., the location, format and contents of the metadata may differ from one operating system to another. In one embodiment, a number of different logical volumes, each associated with a particular boot-related partition or file system, may be emulated as part of the bootable target device. In another embodiment, the off-host virtualizer may be configured to present an emulated logical volume as an installable partition (i.e., a partition in which at least a portion of an operating system may be installed). In such an embodiment, the host may also be configured to boot installation software (e.g., off external media), install at least a portion of the operating system on the installable partition, and then boot from a LUN containing the encapsulated volume.
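As an illustrative, non-limiting sketch of the kind of operating system-specific metadata described above, the following Python fragment builds a minimal MBR-style boot sector containing a partition table and boot signature. The MBR layout shown (16-byte partition entries at byte offset 446, 0x55AA signature at offset 510) applies to x86-style hosts; the partition parameters used in the example are hypothetical.

```python
import struct

SECTOR_SIZE = 512

def make_mbr(partitions):
    """Build a 512-byte MBR-style boot sector holding up to four
    partition table entries -- one possible form of the boot metadata
    that an off-host virtualizer could generate and emulate."""
    mbr = bytearray(SECTOR_SIZE)
    for i, (bootable, ptype, lba_start, num_sectors) in enumerate(partitions[:4]):
        entry = struct.pack(
            "<B3sB3sII",
            0x80 if bootable else 0x00,  # status byte: 0x80 marks the active (boot) partition
            b"\xfe\xff\xff",             # CHS start placeholder (LBA addressing assumed)
            ptype,                       # partition type code, e.g. 0x83 for a Linux partition
            b"\xfe\xff\xff",             # CHS end placeholder
            lba_start,                   # first sector of the partition (little-endian LBA)
            num_sectors,                 # partition length in sectors
        )
        mbr[446 + 16 * i : 446 + 16 * (i + 1)] = entry
    mbr[510:512] = b"\x55\xaa"           # boot signature checked by early boot code
    return bytes(mbr)

# Hypothetical example: one bootable Linux-type partition starting at sector 2048.
header = make_mbr([(True, 0x83, 2048, 1_048_576)])
assert len(header) == SECTOR_SIZE and header[510:512] == b"\x55\xaa"
```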
The logical volume aggregated by the off-host virtualizer may support a number of different virtualization features in different embodiments. In one embodiment, the logical volume may be a snapshot volume (i.e., a point-in-time copy of another logical volume) or a replicated volume. The logical volume may span multiple physical storage devices, and may be striped, mirrored, or configured as a virtual RAID volume. In some embodiments, the logical volume may include a multi-layer hierarchy of logical devices, for example implementing mirroring at a first layer and striping at a second layer below the first. In one embodiment, the host may be configured to access the logical volumes directly (i.e., without using the metadata) subsequent to an initial phase of the boot process. For example, during a later phase of the boot process, a volume manager or other virtualization driver may be activated at the host. The volume manager or virtualization driver may be configured to obtain configuration information for the logical volumes (such as volume layouts), e.g., from the off-host virtualizer or some other volume configuration server, to allow direct access.
FIG. a is a block diagram illustrating the mapping of blocks within a logical volume to a virtual LUN according to one embodiment.
FIG. b is a block diagram illustrating an example of a virtual LUN including a plurality of partitions, where each partition is mapped to a volume, according to one embodiment.
While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
The process of booting a host 101 may include several distinct phases. In a first phase, for example, the host 101 may be powered on or reset, and may then perform a series of “power on self test (POST)” operations to test the status of various constituent hardware elements, such as processor 110, memory 112, peripheral devices such as a mouse and/or a keyboard, and storage devices including bootable target device 120. In general, memory 112 may comprise a number of different memory modules, such as a programmable read only memory (PROM) module containing boot code 114 for early stages of boot, as well as a larger random access memory for use during later stages of boot and during post-boot or normal operation of host 101. One or more memory caches associated with processor 110 may also be tested during POST operations. In traditional systems, bootable target device 120 may typically be a locally attached physical storage device such as a disk, or in some cases a removable physical storage device such as a CD-ROM. In systems employing the Small Computer System Interface (SCSI) protocol to access storage devices, for example, the bootable target device may be associated with a SCSI “logical unit” identified by a logical unit number or LUN. (The term LUN may be used herein to refer to both the identifier for a SCSI target device and the SCSI target device itself.) During POST, one or more SCSI buses attached to the host may be probed, and SCSI LUNs accessible via the SCSI buses may be identified.
In some operating systems, a user such as a system administrator may be allowed to select a bootable target device from among several choices as a preliminary step during boot, and/or to set a particular target as the device from which the next boot should be performed. If the POST operations complete successfully, boot code 114 may proceed to access the designated bootable target device 120. That is, boot code 114 may read the operating system-specific boot metadata 122 from a known location in bootable target device 120. The specific location and format of boot-related metadata may vary from system to system; for example, in many operating systems, boot metadata 122 is stored in the first few blocks of bootable target device 120.
Operating system specific boot metadata 122 may include the location or offsets of one or more partitions (e.g., in the form of a partition table), such as partitions 130A-130N (which may be generically referred to herein as partitions 130), to which access may be required during subsequent phases of the boot process. In some environments the boot metadata 122 may also include one or more software modules, such as a file system reader, that may be required to access one or more partitions 130. The file system reader may then be read into memory at the host 101 (such as memory 112), and used to load one or more additional or secondary boot programs (i.e., additional boot code) from a partition 130. The secondary boot programs may then be executed, resulting for example in an initialization of an operating system kernel, followed by an execution of one or more scripts in a prescribed sequence, ultimately leading to the host reaching a desired “run level” or mode of operation. Various background processes (such as network daemon processes in operating systems derived from UNIX, volume managers, etc.) and designated application processes (e.g., a web server or a database management server configured to restart automatically upon reboot) may also be started up during later boot phases. When the desired mode of operation is reached, host 101 may allow a user to log in and begin desired user-initiated operations, or may begin providing a set of preconfigured services (such as web server or database server functionality). The exact nature and sequence of operations performed during boot may vary from one operating system to another.
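To make the early steps of this sequence concrete, the sketch below shows how boot code might parse MBR-style boot metadata to locate partitions. The x86 MBR format (signature at offset 510, four 16-byte partition entries starting at offset 446) is assumed purely for illustration; as noted above, the actual location and format of boot metadata vary by operating system.

```python
import struct

def read_partition_table(sector0: bytes):
    """Parse the four MBR-style partition entries that early boot code
    (such as boot code 114) reads to locate boot-related partitions 130.
    Assumes x86 MBR formatting; other operating systems use different
    on-disk layouts (e.g., disk labels or GPT)."""
    if sector0[510:512] != b"\x55\xaa":
        raise ValueError("missing boot signature; not a bootable device")
    partitions = []
    for i in range(4):
        entry = sector0[446 + 16 * i : 446 + 16 * (i + 1)]
        lba_start, num_sectors = struct.unpack("<II", entry[8:16])
        if num_sectors:                      # skip empty table slots
            partitions.append({
                "bootable": entry[0] == 0x80,
                "type": entry[4],
                "lba_start": lba_start,
                "sectors": num_sectors,
            })
    return partitions
```

Applied to the emulated header of a virtual bootable target, such a routine would yield the partition offsets that later boot phases use to load a file system reader and secondary boot programs.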
If host 101 is a newly provisioned host without an installed operating system, or if host 101 is being reinstalled or upgraded with a new version of its operating system, the boot process may be followed by installation of desired portions of the operating system. For example, the boot process may end with a prompt being displayed to the user or administrator, allowing the user to specify a device from which operating system modules may be installed, and to select from among optional operating system modules. In some environments, the installation of operating system components on a newly provisioned host may be automated—e.g., one or more scripts run during (or at the end of) the boot process may initiate installation of desired operating system components from a specified device.
As noted above, traditionally, computer hosts 101 have been configured to boot off a local disk (i.e., a disk attached to the host) or local removable media. For example, hosts configured to use a UNIX™-based operating system may be configured to boot off a “root” file system on a local disk, while hosts configured with a version of the Windows™ operating system from Microsoft Corporation may be configured to boot off a “system partition” on a local disk. However, in some storage environments it may be possible to configure a host 101 to boot off a virtual bootable target device, that is, a device that has been aggregated from one or more backing physical storage devices by a virtualizer or virtualization coordinator, where the backing physical storage may be accessible via a network instead of being locally accessible at the host 101. The file systems and/or partitions expected by the operating system at the host may be emulated as being resident in the virtual bootable target device.
In general, virtualization refers to a process of creating or aggregating logical or virtual devices out of one or more underlying physical or logical devices, and making the virtual devices accessible to device consumers for storage operations. The entity or entities that perform the desired virtualization may be termed virtualizers. Virtualizers may be incorporated within hosts (e.g., in one or more software layers within host 101) or at external devices such as one or more virtualization switches, virtualization appliances, etc., which may be termed off-host virtualizers.
In addition to aggregating storage into logical volumes, off-host virtualizer 210 may also be configured to emulate storage within one or more logical volumes 240 as a bootable target device 250. That is, off-host virtualizer 210 may be configured to generate operating system-specific boot metadata 122 to make a range of storage within the one or more logical volumes 240 appear as a bootable partition (e.g., a partition 130) and/or file system to host 101. The generation and presentation of operating system specific metadata, such as boot metadata 122, for the purpose of making a logical volume appear as an addressable storage device (e.g., a LUN) to a host may be termed “volume tunneling”. The virtual addressable storage device presented to the host using such a technique may be termed a “virtual LUN”. Volume tunneling may be employed for other purposes in addition to the emulation of bootable target devices, e.g., to support dynamic mappings of logical volumes to virtual LUNs, to provide an isolating layer between front-end virtual LUNs and back-end or physical LUNs, etc.
FIG. a is a block diagram illustrating the mapping of blocks within a logical volume to a virtual LUN according to one embodiment. In the illustrated embodiment, a source logical volume 305 comprising N blocks of data (numbered from 0 through (N−1)) may be encapsulated or tunneled through a virtual LUN 310 comprising (N+H) blocks. Off-host virtualizer 210 may be configured to logically insert operating system specific boot metadata in a header 315 comprising the first H blocks of the virtual LUN 310, and the remaining N blocks of virtual LUN 310 may map to the N blocks of source logical volume 305. A host 101 may be configured to boot off virtual LUN 310, for example by setting the boot target device for the host to the identifier of the virtual LUN 310. Metadata contained in header 315 may be set up to match the format and content expected by boot code 114 at a LUN header of a bootable device for a desired operating system, and the contents of logical volume 305 may include, for example, the contents expected by boot code 114 in one or more partitions 130. In some embodiments, the metadata and/or the contents of the logical volume may be customized for the particular host being booted: for example, some of the file system contents or scripts accessed by the host 101 during various boot phases may be modified to support requirements specific to the particular host 101. Examples of such customization may include configuration parameters for hardware devices at the host (e.g., if a particular host employs multiple Ethernet network cards, some of the networking-related scripts may be modified), customized file systems, or customized file system sizes. In general, the generated metadata required for volume tunneling may be located at a variety of different offsets within the logical volume address space, such as within a header 315, a trailer, at some other designated offset within the virtual LUN 310, or at a combination of locations within the virtual LUN 310. The number of data blocks dedicated to operating system specific metadata (e.g., the length of header 315), as well as the format and content of the metadata, may vary with the operating system in use at host 101.
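The block-address arithmetic implied by this mapping is straightforward. The following sketch illustrates how a read of the virtual LUN might be served, with an assumed header length H and a hypothetical volume_read callback standing in for the virtualizer's back-end I/O path:

```python
HEADER_BLOCKS = 16   # H: number of metadata blocks in the emulated header (assumed value)
BLOCK_SIZE = 512     # bytes per block (assumed)

def serve_vlun_read(vlun_block: int, header: bytes, volume_read):
    """Serve a one-block read of a virtual LUN such as virtual LUN 310.

    Blocks 0..H-1 are satisfied from the generated metadata of header
    315; blocks H..N+H-1 are translated to blocks 0..N-1 of the source
    logical volume 305. volume_read(block_no) is a hypothetical helper
    representing the off-host virtualizer's back-end read path."""
    if vlun_block < HEADER_BLOCKS:
        offset = vlun_block * BLOCK_SIZE
        return header[offset:offset + BLOCK_SIZE]
    return volume_read(vlun_block - HEADER_BLOCKS)  # shift past the emulated header

# Example: virtual LUN block H maps to block 0 of the source volume.
fake_volume = {0: b"v" * BLOCK_SIZE}
assert serve_vlun_read(HEADER_BLOCKS, b"h" * (HEADER_BLOCKS * BLOCK_SIZE),
                       fake_volume.__getitem__) == b"v" * BLOCK_SIZE
```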
The metadata inserted within virtual LUN 310 may be stored in persistent storage, e.g., within some blocks of a physical storage device 220 or at off-host virtualizer 210, in some embodiments, and logically concatenated with the mapped blocks 320. In other embodiments, the metadata may be generated on the fly, whenever a host 101 accesses the virtual LUN 310. In some embodiments, the metadata may be generated by an external agent other than off-host virtualizer 210. The external agent may be capable of emulating metadata in a variety of formats for different operating systems, including operating systems that may not have been known when the off-host virtualizer 210 was deployed. In one embodiment, off-host virtualizer 210 may be configured to support more than one operating system; i.e., off-host virtualizer 210 may logically insert metadata blocks corresponding to any one of a number of different operating systems when presenting virtual LUN 310 to a host 101, thereby allowing hosts intended to use different operating systems to share virtual LUN 310. In some embodiments, a plurality of virtual LUNs emulating bootable target devices, each corresponding to a different operating system, may be set up in advance, and off-host virtualizer 210 may be configured to select a particular virtual LUN for presentation to a host for booting. In large data centers, a set of relatively inexpensive servers (which may be termed “boot servers”) may be designated to serve as a pool of off-host virtualizers dedicated to providing emulated bootable target devices for use as needed throughout the data center. Whenever a newly provisioned host in the data center needs to be booted and/or installed, a bootable target device presented by one of the boot servers may be used, thus supporting consistent configurations at the hosts of the data center as the data center grows.
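One way to organize the selection among pre-built virtual LUNs described above is a simple registry mapping each supported operating system to a corresponding emulated bootable target. The sketch below is purely illustrative; the class, method names, and LUN identifiers are all hypothetical.

```python
class BootServer:
    """Illustrative registry of pre-built virtual LUNs, one per
    operating system, from which a bootable target is selected when a
    newly provisioned host requests a boot device."""

    def __init__(self):
        self._vluns = {}                      # os_name -> virtual LUN identifier

    def register(self, os_name: str, vlun_id: str) -> None:
        self._vluns[os_name] = vlun_id

    def bootable_target_for(self, os_name: str) -> str:
        try:
            return self._vluns[os_name]
        except KeyError:
            raise LookupError(f"no emulated bootable target for {os_name!r}")

# Hypothetical identifiers:
pool = BootServer()
pool.register("linux", "vlun-boot-linux-01")
pool.register("solaris", "vlun-boot-solaris-01")
assert pool.bootable_target_for("linux") == "vlun-boot-linux-01"
```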
For some operating systems, off-host virtualizer 210 may emulate a number of different boot-related volumes using a plurality of partitions within the virtual LUN 310.
As noted earlier, the boot process at host 101 may include several phases. During each successive phase, additional modules of the host's operating system and/or additional software modules may be activated, and various system processes and services may be started. During one such phase, in some embodiments a virtualization driver or volume manager capable of recognizing and interacting with logical volumes may be activated at host 101. In such embodiments, after the virtualization driver or volume manager is activated, it may be possible for the host to switch to direct interaction with the logical volumes 240 (block 455).
As noted previously, a number of different virtualization functions may be implemented at a logical volume 240 by off-host virtualizer 210 in different embodiments. In one embodiment, a logical volume 240 may be aggregated from storage at multiple physical storage devices 220, e.g., by striping successive blocks of data across multiple physical storage devices, by spanning multiple physical storage devices (i.e., concatenating physical storage from multiple physical storage devices into the logical volume), or by mirroring data blocks at two or more physical storage devices. In another embodiment, a logical volume 240 that is used by off-host virtualizer 210 to emulate a bootable target device 250 may be a replicated volume. For example, the logical volume 240 may be a replica or copy of a source logical volume that may be maintained at a remote data center. Such a technique of replicating bootable volumes may be useful for a variety of purposes, such as to support off-site backup or to support consistency of booting and/or installation in distributed enterprises where hosts at a number of different geographical locations may be required to be set up with similar configurations. In some embodiments, a logical volume 240 may be a snapshot volume, such as an instant snapshot or a space-efficient snapshot, i.e., a point-in-time copy of some source logical volume. Using snapshot volumes to boot and/or install systems may support the ability to revert a host back to any desired previous configuration from among a set of configurations for which snapshots have been created. Support for automatic roll back (e.g., to a desired point in time) on boot may also be implemented in some embodiments. In one embodiment, a logical volume 240 used to emulate a bootable target device may be configured as a virtual RAID (“Redundant Array of Independent Disks”) device or RAID volume, where parity-based redundancy computations are implemented to provide high availability. Physical storage from a plurality of storage servers may be aggregated to form the RAID volume, and the redundancy computations may be implemented via a software protocol. A bootable target device emulated from a RAID volume may be recoverable in the event of a failure at one of its backing storage servers, thus enhancing the availability of boot functionality supported by the off-host virtualizer 210. A number of different RAID levels (e.g., RAID-3, RAID-4, or RAID-5) may be implemented in the RAID volume.
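The parity-based redundancy mentioned above reduces, at its core, to XOR arithmetic across the blocks of a stripe. The following sketch shows why a bootable target emulated from such a RAID volume can survive the loss of one backing device; the block contents are arbitrary test data.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(data_blocks):
    """Compute the XOR parity block for one stripe, as in RAID-3/4/5."""
    return reduce(xor_blocks, data_blocks)

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single missing block of a stripe from the surviving
    data blocks and the parity block."""
    return reduce(xor_blocks, surviving_blocks, parity_block)

# Arbitrary example stripe of three data blocks:
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(stripe)
# Lose the middle block, then recover it from the survivors plus parity.
assert reconstruct([stripe[0], stripe[2]], p) == stripe[1]
```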
In some embodiments, a logical volume 240 may include multiple layers of virtual storage devices.
After host 101 has booted, logical volume 240 may be configured to be mounted within a file system or presented to an application or other volume consumer. Each block device within logical volume 240 that maps to or includes another block device may include an interface whereby the mapping or including block device may interact with the mapped or included device. For example, this interface may be a software interface whereby data and commands for block read and write operations are propagated from lower levels of the virtualization hierarchy to higher levels and vice versa.
Additionally, a given block device may be configured to map the logical block spaces of subordinate block devices into its logical block space in various ways in order to realize a particular virtualization function. For example, in one embodiment, logical volume 240 may be configured as a mirrored volume, in which a given data block written to logical volume 240 is duplicated, and each copy of the duplicated data block is stored in a respective block device. In one such embodiment, logical volume 240 may be configured to receive an operation to write a data block from a consumer, such as an application running on host 101. Logical volume 240 may duplicate the write operation and issue the write operation to both logical block devices 504 and 506, such that the block is written to both devices. In this context, logical block devices 504 and 506 may be referred to as mirror devices. In various embodiments, logical volume 240 may read a given data block stored in duplicate in logical block devices 504 and 506 by issuing a read operation to one mirror device or the other, for example by alternating devices or defaulting to a particular device. Alternatively, logical volume 240 may issue a read operation to multiple mirror devices and accept results from the fastest responder.
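A minimal sketch of this mirroring behavior follows; the in-memory dictionaries merely stand in for mirror devices such as logical block devices 504 and 506, and the read policy shown (random choice) is just one of the alternatives mentioned above.

```python
import random

class MirroredVolume:
    """Duplicate each write to every mirror device; serve each read
    from any one mirror. The dict-based stores are stand-ins for real
    block devices."""

    def __init__(self, mirrors):
        self.mirrors = mirrors                  # e.g., two dicts for a two-way mirror

    def write(self, block_no: int, data: bytes) -> None:
        for mirror in self.mirrors:             # duplicate the write to each mirror
            mirror[block_no] = data

    def read(self, block_no: int) -> bytes:
        # Any mirror holds a current copy; alternating, defaulting to one
        # device, or racing all mirrors and taking the fastest responder
        # are equally valid read policies.
        return random.choice(self.mirrors)[block_no]

vol = MirroredVolume([{}, {}])
vol.write(0, b"boot block")
assert vol.read(0) == b"boot block"
```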
In some embodiments, it may be the case that underlying physical block devices 220A-C have dissimilar performance characteristics; specifically, devices 220A-B may be slower than device 220C. In order to balance the performance of the mirror devices, in one embodiment, logical block device 504 may be implemented as a striped device in which data is distributed between logical block devices 508 and 510. For example, even- and odd-numbered blocks of logical block device 504 may be mapped to logical block devices 508 and 510 respectively, each of which may be configured to map in turn to all or some portion of physical block devices 220A-B respectively. In such an embodiment, block read/write throughput may be increased over a non-striped configuration, as logical block device 504 may be able to read or write two blocks concurrently instead of one. Numerous striping arrangements involving various distributions of blocks to logical block devices are possible and contemplated; such arrangements may be chosen to optimize for various data usage patterns such as predominantly sequential or random usage patterns. In another aspect illustrating multiple layers of block virtualization, in one embodiment physical block device 220C may employ a different block size than logical block device 506. In such an embodiment, logical block device 512 may be configured to translate between the two physical block sizes and to map the logical block space defined by logical block device 506 to the physical block space defined by physical block device 220C.
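The even/odd striping and block-size translation described in this example come down to simple index arithmetic, sketched below; the two-column layout and the 4096/512-byte block sizes are assumptions for illustration.

```python
def stripe_map(block_no: int, num_columns: int = 2):
    """Map block b of a striped device (like logical block device 504)
    to (column, block-within-column); with two columns, even blocks go
    to one device (508) and odd blocks to the other (510)."""
    return block_no % num_columns, block_no // num_columns

assert stripe_map(7) == (1, 3)     # odd block 7 -> column 1, block 3 of that column

def translate_block(block_no: int, logical_size: int = 4096, physical_size: int = 512):
    """Block-size translation in the spirit of logical block device 512:
    each logical block is backed by logical_size // physical_size
    smaller physical blocks (sizes assumed for illustration)."""
    ratio = logical_size // physical_size
    first = block_no * ratio
    return range(first, first + ratio)   # physical blocks backing this logical block

assert list(translate_block(2)) == [16, 17, 18, 19, 20, 21, 22, 23]
```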
The technique of volume tunneling to emulate a bootable target device may be implemented using a variety of different storage and network configurations in different embodiments.
In some embodiments, host 101 may be configured to boot from an emulated volume using a first network type such as iSCSI, and to then switch to directly accessing the volume using a second network type such as Fibre Channel. iSCSI-based booting may be less expensive and/or easier to configure than Fibre Channel-based booting in some embodiments. An off-host virtualizer 210 that uses iSCSI (such as an iSCSI boot appliance) and at the same time accesses Fibre Channel-based storage devices may allow such a transition between the network type that is used for booting and the network type that is used for subsequent I/O (e.g., for I/Os requested by production applications).
As noted above, an off-host virtualizer 210 may comprise a number of different types of hardware and software entities in different embodiments. In some embodiments, an off-host virtualizer 210 may itself be a host with its own processor, memory, peripheral devices and I/O devices, running an operating system and a software stack capable of providing the block virtualization features described above. In other embodiments, the off-host virtualizer 210 may include one or more virtualization switches and/or virtualization appliances. A virtualization switch may be an intelligent Fibre Channel switch, configured with sufficient processing capacity to perform desired virtualization operations in addition to supporting Fibre Channel connectivity. A virtualization appliance may be an intelligent device programmed to perform virtualization functions, such as providing mirroring, striping, snapshot capabilities, etc. Appliances may differ from general purpose computers in that their software is normally customized for the function they perform, pre-loaded by the vendor, and not alterable by the user. In some embodiments, multiple devices or systems may cooperate to provide off-host virtualization; e.g., multiple cooperating virtualization switches may form a single off-host virtualizer. In one embodiment, the aggregation of storage within physical storage devices 220 into logical volumes 240 may be performed by one off-host virtualizing device or host, while another off-host virtualizing device may be configured to emulate the logical volumes as bootable target devices and present the bootable target devices to host 101.
Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
Number | Date | Country | Kind
---|---|---|---
PCT/US04/39306 | Nov 2004 | WO | international
This application is a continuation-in-part of U.S. patent application Ser. No. 10/722,614, entitled “SYSTEM AND METHOD FOR EMULATING OPERATING SYSTEM METADATA TO PROVIDE CROSS-PLATFORM ACCESS TO STORAGE VOLUMES”, filed Nov. 26, 2003.
Relation | Number | Date | Country
---|---|---|---
Parent | 10722614 | Nov 2003 | US
Child | 11156636 | Jun 2005 | US