Embodiments of the present invention relate to a computer system that hosts virtual machines, and more specifically, to storage allocation in a virtual machine system.
Efficient storage allocation is critical to the performance of a virtual machine system. In any file system, files are frequently created, modified and deleted. When data is added to a file, storage blocks must be allocated to hold it. To conserve resources, storage can be allocated on a need-to-use basis, a method sometimes referred to as thin provisioning. In file systems, files allocated in this manner are referred to as sparse files. When a write operation is performed on a sparse file, blocks are allocated to store the added data.
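For illustration only, the following user-space sketch shows thin provisioning at the file level on a POSIX system: seeking far past the end of a new file allocates nothing, and blocks are allocated only when data is actually written. The file name and sizes below are arbitrary and chosen solely for the example.

```c
/* Sketch: demonstrate thin provisioning at the file level (a sparse file).
 * Assumes a POSIX system; error handling is abbreviated for clarity. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.img", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    /* Seek 1 GiB into the file without writing: no blocks are allocated yet. */
    lseek(fd, 1024L * 1024 * 1024, SEEK_SET);

    /* Write a single byte: only now is a block allocated for this region. */
    write(fd, "x", 1);

    struct stat st;
    fstat(fd, &st);
    /* st_size reports the logical size; st_blocks reports the 512-byte blocks
     * actually allocated, which is far smaller for a sparse file. */
    printf("logical size: %lld bytes, allocated: %lld bytes\n",
           (long long)st.st_size, (long long)st.st_blocks * 512);

    close(fd);
    return 0;
}
```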
A virtual machine's “hard drive” is implemented as a file or a block device, usually referred to as an “image.” Conventionally, image files tend to inflate unnecessarily in size. This is because data blocks of an image file that are deleted by a virtual machine cannot easily be reused by the host of the virtual machine; the backing disk storage is unaware of file deletions that occur inside the virtual machine. Thus, in a conventional virtual machine system, the size of the images can continue to grow, thereby eliminating a major benefit of thin provisioning.
One conventional approach uses a utility in the virtual machine that periodically writes zeros to deallocated blocks. The hypervisor of the virtual machine system “catches” these write operations and detects that the written blocks contain only zeros. The hypervisor then redirects the written blocks to point to a shared “zero” block, to which they remain linked. All of the written blocks that are linked to the “zero” block are freed and can be reused. With this approach, free blocks are regained only periodically, and image files can still inflate in the interim. Further, the hypervisor needs to check every written block and compare its contents to zero. These checking and comparing operations are inefficient and, as a result, reduce the performance of the system.
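The overhead of this conventional approach can be made concrete with a minimal sketch of the zero check the hypervisor must perform on every written block; the block size and function name below are illustrative and are not taken from any particular hypervisor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096  /* illustrative block size */

/* Returns true if every byte of the written block is zero.
 * In the conventional approach, the hypervisor must run a scan like this
 * on every write it catches, which is the source of the overhead
 * described above. */
static bool block_is_zero(const uint8_t *block)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        if (block[i] != 0)
            return false;
    }
    return true;
}
```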
The present invention is illustrated by way of example, and not by way of limitation, and can be more fully understood with reference to the following detailed description when considered in connection with the figures.
Described herein is a method and system for efficient block deallocation in a virtualized environment. In one embodiment, a computer system hosting a virtual machine includes an I/O device driver in the guest operating system of the virtual machine. The I/O device driver intercepts an operation performed by the guest operating system that causes a data block to be deallocated in the virtual machine. The I/O device driver informs a hypervisor of the computer system that the data block is to be deallocated. The hypervisor then instructs the data storage to deallocate the data block for reuse.
Embodiments of the present invention utilize a paravirtualized mechanism to deallocate a data block in the data storage. The guest operating system communicates with the hypervisor regarding block deallocation via an I/O device driver in the guest operating system and a corresponding backend device driver in the hypervisor. Operations that cause block deallocation are intercepted as they take place, and the hypervisor is informed of the block deallocation right away. As a result, the data blocks that are deallocated in a virtual machine can also be deallocated (i.e., “freed”) in the data storage for reuse without delay.
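As a rough, illustrative sketch of the guest side of such a mechanism (all identifiers below are invented for illustration and are not part of the described embodiments), a guest I/O driver hook might package a freed range of logical blocks into a request and pass it to the hypervisor as soon as the deallocation is intercepted:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical request describing a range of guest logical blocks
 * that the guest OS has just deallocated. */
struct deallocation_request {
    uint64_t first_block;   /* first logical block address in the range */
    uint32_t num_blocks;    /* number of contiguous blocks freed */
};

/* Stub for the paravirtualized channel to the backend driver; a real
 * driver would enqueue the request on a shared ring and notify the
 * hypervisor instead of printing. */
static void notify_hypervisor(const struct deallocation_request *req)
{
    printf("deallocate %u blocks starting at LBA %llu\n",
           req->num_blocks, (unsigned long long)req->first_block);
}

/* Hook invoked by the guest I/O driver when it intercepts an operation
 * (e.g., a file deletion) that frees blocks inside the virtual machine. */
void guest_blocks_freed(uint64_t first_block, uint32_t num_blocks)
{
    struct deallocation_request req = {
        .first_block = first_block,
        .num_blocks  = num_blocks,
    };
    notify_hypervisor(&req);   /* inform the hypervisor without delay */
}
```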
The term “data block” (also referred to as “block”) hereinafter refers to a basic unit of data storage. A block may be addressed by a guest operating system using a logical block address, and can also be addressed by a hypervisor, a host operating system, or a data storage (e.g., disks) using a physical block address. A block addressed by a logical block address can be referred to as a “logical block,” and a block addressed by a physical block address can be referred to as a “physical block.”
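To make the distinction concrete, the following sketch shows one simple way a logical block address might be translated to a physical block address through a per-image mapping table; the table layout is purely illustrative and is not mandated by the description above.

```c
#include <stdint.h>

#define UNMAPPED UINT64_MAX   /* sentinel: logical block has no physical block */

/* Illustrative per-image translation table: the index is the guest's logical
 * block address, the value is the physical block address in the data storage. */
struct block_map {
    uint64_t *physical;    /* physical[lba] == physical block, or UNMAPPED */
    uint64_t  num_blocks;  /* number of logical blocks in the image */
};

/* Translate a logical block address to a physical block address.
 * Returns UNMAPPED if the logical block has never been allocated
 * (e.g., a hole in a thinly provisioned image). */
static uint64_t lba_to_pba(const struct block_map *map, uint64_t lba)
{
    if (lba >= map->num_blocks)
        return UNMAPPED;
    return map->physical[lba];
}
```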
In the following description, numerous details are set forth. It will be apparent to one skilled in the art, however, that the present invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The computer system 100 also runs a host OS 160 to manage system resources. In one embodiment, the computer system 100 runs a hypervisor 125 to emulate underlying host hardware 190, making the use of the virtual machine 130 transparent to the guest OS 140 and the user of the computer system 100. The hypervisor 125 may also be known as a virtual machine monitor (VMM) or a kernel-based hypervisor. In some embodiments, the hypervisor 125 may be part of the host OS 160.
The computer system 100 also includes one or more physical central processing units (CPUs), memory, I/O devices and other hardware components. The computer system 100 may also be coupled to a data storage 180, which may include mass storage devices, such as magnetic or optical storage based disks, tapes or hard drives.
According to one embodiment of the present invention, the computer system 100 implements a paravirtualization scheme for data block deallocation. Before further describing data block deallocation, some concepts relating to paravirtualization are explained as follows. In a paravirtualization environment, the guest OS is aware that it is running on a hypervisor and includes code to make guest-to-hypervisor transitions more efficient. By contrast, in full virtualization, the guest OS is unaware that it is being virtualized and can work with the hypervisor without any modification to the guest OS. The hypervisor in full virtualization traps device access requests from the guest OS, and emulates the behaviors of physical hardware devices. However, without the help of the guest OS, emulation in full virtualization can be much more complicated and inefficient than emulation in paravirtualization.
In one embodiment, the computer system 100 implements paravirtualization by including an I/O device driver 142 (also referred to as “guest I/O device driver”) in each guest OS 140 and a corresponding backend device driver 126 in the hypervisor 125. The I/O device driver 142 communicates with the backend device driver 126 regarding the deallocation of data blocks in the data storage 180. By having the device drivers 142 and 126, the guest OS 140 can provide information to the hypervisor 125 as a data deallocation operation takes place, without having the hypervisor 125 trap every device access request from the guest OS 140. The I/O device driver 142 and the backend device driver 126 may reside in the memory or the data storage 180 accessible by the computer system 100.
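On the host side, once the backend device driver 126 learns which guest blocks have been freed, the hypervisor can release the corresponding backing storage. As a hedged sketch, assuming the image is a regular file on a Linux host (rather than, for example, a raw block device), one way to do this is to punch a hole over the affected byte range with fallocate(2); the function name and block size below are illustrative.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

#define GUEST_BLOCK_SIZE 4096   /* illustrative block size */

/* Backend-side sketch: release the backing storage for a range of guest
 * blocks by punching a hole in the image file. FALLOC_FL_KEEP_SIZE keeps
 * the logical file size unchanged while the underlying blocks are freed
 * for reuse by the host. Returns 0 on success, -1 on error. */
int release_guest_blocks(int image_fd, uint64_t first_block, uint32_t num_blocks)
{
    off_t offset = (off_t)first_block * GUEST_BLOCK_SIZE;
    off_t length = (off_t)num_blocks * GUEST_BLOCK_SIZE;

    return fallocate(image_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, length);
}
```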
In one embodiment, the I/O device driver 142 uses the Virtual I/O (VIRTIO) application programming interface (API) to communicate with the backend device driver 126 in the hypervisor 125. The VIRTIO API is a standardized interface originally developed for the Linux® kernel. The VIRTIO API defines a protocol that allows a guest OS 140 to communicate with the hypervisor 125, utilizing paravirtualization to facilitate device emulation with increased efficiency. Although VIRTIO is described herein, it is understood that other interfaces can also be used.
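For context, later revisions of the VIRTIO block device specification standardized a discard request that carries this kind of range information from guest to host. The simplified layout below is loosely modeled on that request and is shown only to illustrate the general shape of a paravirtualized deallocation message, not as the exact interface used by the embodiments described herein.

```c
#include <stdint.h>

/* Simplified, illustrative layout of a paravirtualized "discard" request,
 * loosely modeled on the virtio block device's discard segment. The guest
 * driver fills one of these in and places it on the shared virtqueue; the
 * backend driver reads it and releases the named sectors on the host. */
struct discard_segment {
    uint64_t sector;       /* first 512-byte sector to deallocate */
    uint32_t num_sectors;  /* number of sectors in the range */
    uint32_t flags;        /* reserved / option bits */
};
```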
The exemplary computer system 500 includes a processing device 502, a main memory 504 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 506 (e.g., flash memory, static random access memory (SRAM), etc.), and a secondary memory 518 (e.g., a data storage device), which communicate with each other via a bus 530.
The processing device 502 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device 502 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. The processing device 502 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 502 is configured to execute block deallocation logic 522 for performing the operations and steps discussed herein.
The computer system 500 may further include a network interface device 508. The computer system 500 also may include a video display unit 510 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and a signal generation device 516 (e.g., a speaker).
The secondary memory 518 may include a machine-readable storage medium (or, more specifically, a computer-readable storage medium) 531 on which is stored one or more sets of instructions (e.g., block deallocation logic 522) embodying any one or more of the methodologies or functions described herein (e.g., the I/O device driver 142 and/or the backend device driver 126 described above).
The machine-readable storage medium 531 may also be used to store the block deallocation logic 522 persistently. While the machine-readable storage medium 531 is shown in an exemplary embodiment to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine that causes the machine to perform any one or more of the methodologies of the present invention. The term “machine-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
The computer system 500 may additionally include block deallocation modules 528 for implementing the functionalities of the I/O device driver 142 and/or the backend device driver 126 described above.
Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “intercepting”, “informing”, “instructing”, “sending” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Embodiments of the present invention also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic disk storage media, optical storage media, flash memory devices, other types of machine-accessible storage media, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. Although the present invention has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.