This invention relates to data storage, including but not limited to databases which store material to provide a mobile search service.
Data storage systems are often required to be both scalable and fault-tolerant. A scalable storage system is one where the components implementing the system can be arranged in such a way that the total capacity available can be expanded by deploying additional hardware (typically consisting of servers and hard disks). In contrast, a non-scalable storage system would not be able to take advantage of additional hardware and would have capacity fixed at its originally deployed size. A fault-tolerant storage system is one where the system can tolerate the software or hardware failure of a subset of its individual parts. Such tolerance typically involves implementing redundancy of those parts such that for any one part that fails, there is at least one other part still functioning and providing the same service. In other words, at least two replicas of each unit of data are stored on distributed hardware.
A key challenge to the implementation of these fault-tolerant systems is how to manage repair following a failure: if a unit of hardware such as a hard disk fails and its data is lost, the problem is how to resynchronise its data and bring it back online. An easy solution is to take the entire system offline and perform the synchronisation of the replicas manually, safe in the knowledge that the surviving data is not being modified during this process. However, the obvious drawback to this approach is the required downtime, which may not be permissible in some applications. So the challenge then becomes how to manage the repair following a failure whilst maintaining the live service. This challenge boils down to how to re-synchronise a replica of a unit of data whilst the surviving data continues to receive updates and thus complicates the re-synchronisation process.
A common solution to this problem is to use journaling: first, a snapshot of the surviving data is made available to the recovery process; while the copying of the snapshot data proceeds, all new update (write/delete) requests are logged to a journal. When the copying of the snapshot has finished, the system is locked (all update requests are temporarily blocked) while the additional changes stored in the journal are replayed on to the newly copied data, thus bringing it completely up to date with the surviving data and all the changes that happened during the (presumably lengthy) snapshot copy process. The system can then be unlocked again and normal operation restored. However, the drawback to this approach is the need to implement the snapshot mechanism. As a result, either the data storage format itself needs to support fast snapshots, or at least three replicas of the data are required such that in the face of one failure, one copy can continue to serve live requests and the second copy can be taken offline to serve as a static snapshot.
According to one aspect there is provided a data storage system comprising a plurality of storage devices, each storage device comprising a plurality of storage nodes and each storage node comprising a plurality of logical partitions such that there are at least Q copies of a particular logical partition in the storage system, wherein each logical partition is divided into a plurality of sub-ranges which are individually lockable to both data input and data output whereby data in a particular logical partition is synchronisable sub-range by sub-range with the other copies of said particular logical partition.
Said particular logical partition may be a failed logical partition or a newly declared copy logical partition on a new, additional storage device. Each sub-range may be individually lockable in the sense that the sub-range may be locked to read requests and/or write requests, may be made unavailable, or may be subject to a combination of both locking and being made unavailable.
According to another aspect there is provided a data storage system comprising a plurality of storage devices, each storage device comprising a plurality of storage nodes and each storage node comprising a plurality of logical partitions such that there are at least Q copies of a particular logical partition in the storage system, wherein each logical partition is divided into a plurality of sub-ranges which are individually lockable to both data input and data output whereby in the event of a failure of a logical partition, data is recoverable sub-range by sub-range in said failed logical partition from said copies of the failed logical partition.
According to another aspect there is provided a data storage system comprising a plurality of storage devices, each storage device comprising a plurality of storage nodes and each storage node comprising a plurality of logical partitions such that there are at least Q copies of a particular logical partition in the storage system, and at least one further storage device having a plurality of storage nodes with a plurality of logical partitions, wherein each logical partition is divided into a plurality of sub-ranges which are individually lockable to both data input and data output whereby data is synchronisable sub-range by sub-range between a logical partition in the at least one further storage device and corresponding copy logical partitions in the plurality of storage devices.
According to another aspect of the invention, there is provided a system for a user to store and retrieve data, said system comprising a storage system as described above and at least one user device connected to the storage system, whereby when data is to be stored on the storage system said at least one user device is configured to input said data to an appropriate logical partition on said storage system and said storage system is configured to copy said data to all copies of said appropriate logical partition, and when data is to be retrieved from the storage system said at least one user device is configured to send a request to at least one of the logical partitions storing said data to output said data from the storage system to the user device.
In other words, the present invention solves the live recovery problem by arranging for incremental availability of recovering partitions without using journaling, snapshots or any system-wide locking. This is achieved by treating all partitions as collections of smaller partitions of varying data size, where each smaller partition is small enough to be resynchronised (copied) within a time period for which it is acceptable to block (delay) a fraction of the live write requests.
According to another aspect of the invention, there is provided a method of maintaining a fault-tolerant data storage system comprising providing a data storage system comprising a plurality of storage devices, each storage device comprising a plurality of storage nodes and each storage node comprising a plurality of logical partitions, configuring the plurality of logical partitions so that there are at least Q copies of any logical partition in the storage system, dividing each logical partition into a plurality of sub-ranges which are individually lockable to both data input and data output, whereby data in a particular logical partition is synchronisable sub-range by sub-range with the other copies of said particular logical partition.
Maintaining a data store may include creating and updating data in the data store, recovering from failure of an element of the data store and/or increasing capacity in the data store.
According to another aspect of the invention, there is provided a method of data recovery in a fault-tolerant storage system, said data storage system comprising a plurality of storage devices, each storage device comprising a plurality of storage nodes and each storage node comprising a plurality of logical partitions such that there are at least Q copies of any logical partition in the storage system and each logical partition is divided into a plurality of sub-ranges which are individually lockable to both data input and data output, the method comprising locking all sub-ranges of a failed logical partition, selecting a single sub-range of the failed logical partition to be synchronised, locking said selected single sub-range in all copies of said failed logical partition, synchronising data in said single sub-range of said failed logical partition with said single sub-range in all copies of said failed logical partition, unlocking said selected single sub-range in all copies of said failed logical partition, including said failed logical partition, and repeating the selecting to unlocking steps until all sub-ranges are synchronised and unlocked.
According to another aspect of the invention, there is provided a method of increasing data storage in a fault-tolerant storage system comprising a plurality of storage devices, each storage device comprising a plurality of storage nodes and each storage node comprising a plurality of logical partitions such that there are at least Q copies of any logical partition in the storage system and each logical partition is divided into a plurality of sub-ranges which are individually lockable to both data input and data output, the method comprising providing at least one further storage device having a plurality of storage nodes with a plurality of logical partitions, defining a logical partition in said further storage device as a new copy of a logical partition in the storage system, locking all sub-ranges of said defined logical partition in said further storage device, selecting a single sub-range to be synchronised, locking said selected single sub-range in all copies of said defined logical partition, synchronising data in said single sub-range of said defined logical partition with said single sub-range in all copies of said defined logical partition, unlocking said selected single sub-range in all copies of said defined logical partition, including said defined logical partition, and repeating the selecting to unlocking steps until all sub-ranges are unlocked.
The invention further provides processor control code to implement the above-described methods, in particular on a data carrier such as a disk, CD- or DVD-ROM, programmed memory such as read-only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the invention may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate such code and/or data may be distributed between a plurality of coupled components in communication with one another.
Various aspects of the invention are set out in the independent claims. Any additional features can be added, and any of the additional features can be combined together and combined with any of the above aspects. Other advantages will be apparent to those skilled in the art, especially over other prior art. Numerous variations and modifications can be made without departing from the claims of the present invention. Therefore, it should be clearly understood that the form of the present invention is illustrative only and is not intended to limit the scope of the present invention.
How the present invention may be put into effect will now be described by way of example with reference to the appended drawings, in which:
The overall topology is illustrated in
The logical partitions may be termed “buckets” for storing data objects. Each bucket is replicated such that there are usually at least Q copies of a particular bucket available to the storage subsystem, e.g. bucket 114a is a replica of bucket 14. The location of a copy of a bucket can be determined by maintaining a lookup table 17 that lists the current buckets and their associated identifier ranges. As shown in
Each bucket is responsible for a sequential range of integer object identifiers, e.g. 0-9999, 10000-19999 etc. The identifiers used to determine which bucket an object is within do not have to be unique across all objects in the system. If more than one object has the same identifier then all such objects will reside in the same bucket and additional means of specifying which object is actually being referred to will be required (e.g. by passing a normal filename along with the integer identifier).
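Merely as an illustrative sketch of the lookup table described above (the class and method names here are hypothetical and not part of the invention), a table mapping contiguous identifier ranges to buckets might locate the bucket responsible for a given object identifier as follows:

```python
import bisect

class BucketLookupTable:
    """Maps an integer object identifier to the bucket whose
    identifier range contains it (illustrative sketch only)."""

    def __init__(self, ranges):
        # ranges: sorted list of (start, end, bucket_id) tuples, e.g.
        # [(0, 9999, "bucket-A"), (10000, 19999, "bucket-B")]
        self._starts = [r[0] for r in ranges]
        self._ranges = ranges

    def bucket_for(self, object_id):
        # Find the last range whose start is <= object_id
        i = bisect.bisect_right(self._starts, object_id) - 1
        if i < 0:
            raise KeyError(object_id)
        start, end, bucket_id = self._ranges[i]
        if not (start <= object_id <= end):
            raise KeyError(object_id)
        return bucket_id
```

Because the ranges are sequential, a binary search over the range starts suffices; any equivalent mapping structure could be used.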
To handle write requests and maintain the synchronisation of the replicas of a bucket, the range of a bucket is divided up into sub-ranges. At step S104, a write request is sent from the client application and received at the appropriate bucket. At step S106, the bucket may deny the write request (as explained in more detail below), in which case the write request is sent to another bucket and step S104 is repeated; the system loops in this way until the write request is allowed and proceeds to step S108, where a write-lock is obtained and used to protect access to the relevant sub-range, i.e. the sub-range including the integer object identifier. At step S110, the data is written to the bucket and at step S112 it is copied to the plurality of replica buckets which are also responsible for the appropriate range, e.g. 12000-12999. The write-lock is then released at step S114. In this way, there is no need for the client application to perform write requests to each and every replica bucket.
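The write path of steps S108 to S114 might be sketched as follows. This is a simplified, single-process illustration in which replica buckets are plain in-memory objects; the names (Bucket, SUB_RANGE_SIZE) and the fixed sub-range width are assumptions for illustration, not part of the invention:

```python
import threading

SUB_RANGE_SIZE = 1000  # assumed uniform width of each sub-range

class Bucket:
    def __init__(self, range_start, range_end, replicas=()):
        self.range_start, self.range_end = range_start, range_end
        n = max(1, (range_end - range_start + 1) // SUB_RANGE_SIZE)
        self._locks = [threading.Lock() for _ in range(n)]  # one per sub-range
        self._objects = {}
        self.replicas = list(replicas)

    def _sub_range_index(self, object_id):
        return (object_id - self.range_start) // SUB_RANGE_SIZE

    def write(self, object_id, data):
        # S108: obtain the write-lock for the sub-range containing the id
        lock = self._locks[self._sub_range_index(object_id)]
        with lock:
            self._objects[object_id] = data        # S110: local write
            for replica in self.replicas:          # S112: copy to replicas
                replica._objects[object_id] = data
        # S114: the write-lock is released on exiting the with-block
```

Only the single sub-range containing the object identifier is locked, so writes to other sub-ranges of the same bucket proceed concurrently.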
The bucket which is selected by the client application and which receives the original integer object identifier may be regarded as the master bucket and the replica buckets which receive copies may be regarded as slaves. The client application maintains knowledge of which is the master bucket. If the client application sends a write request to the wrong bucket, the write request is denied and an error message is returned to the client application with the details of the master bucket. The client application then resends the write request to the master bucket. Any bucket may act as a master bucket but only one bucket may be master at any one time. Communication between buckets is by any standard mechanism. It is noted that a bucket may contain zero or more stored objects, depending on whether any objects have been stored with an identifier falling within that bucket's range.
The use of a write-lock is similar to other storage solutions using either fine-grained or more coarse-grained locking. The use of a write-lock is applicable to both blocking-I/O uses and non-blocking-I/O uses (which simply affects whether a write request is delayed (blocked) or denied (failed) when occurring before a previous write request to the same sub-range has completed). In other words, a write-lock may be applied as explained with reference to
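The distinction between the blocking and non-blocking treatments of a held write-lock can be sketched with two hypothetical helper functions sharing a single lock (illustrative only; a real system would hold one lock per sub-range, as described above):

```python
import threading

lock = threading.Lock()  # stands in for the write-lock of one sub-range

def write_blocking(do_write):
    # Blocking-I/O style: the request is delayed until the lock is free.
    with lock:
        do_write()
    return True

def write_nonblocking(do_write):
    # Non-blocking-I/O style: the request is denied (failed) immediately
    # if a previous write to the same sub-range has not completed.
    if not lock.acquire(blocking=False):
        return False
    try:
        do_write()
        return True
    finally:
        lock.release()
```

Which behaviour is appropriate depends on whether the client prefers latency (waiting) or an explicit failure it can retry against another bucket.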
The main benefit to using sub-ranges within a bucket is for use in recovering from a failure situation which is illustrated in
At step S304, a write-lock is obtained for a single sub-range of the previously failed bucket and for that sub-range across all replicas. At step S306, the state of the data in the sub-range in the previously failed bucket is compared and re-synchronised if necessary. At step S308, the write-lock is released for this sub-range for all buckets, including the previously failed bucket. Thus at step S310, this sub-range in the previously failed bucket is made available for read and write requests. At step S312, the system determines whether or not there are any additional sub-ranges to be synchronized and if so, loops through steps S304 to S312. In this way, the sub-ranges of a bucket are each brought back into active service (made online) one-by-one, thereby incrementally bringing the whole bucket back online. Any read or write requests that arrive at the bucket for an online sub-range are carried out, and any read or write requests for a still offline sub-range are denied.
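The recovery loop of steps S304 to S312 might be sketched as follows. Buckets are modelled as plain dictionaries mapping object identifiers to data, and the function signature is a hypothetical illustration rather than the invention's interface:

```python
import threading

def recover_bucket(recovering, survivors, num_sub_ranges, sub_range_of):
    """Bring a previously failed bucket replica back online one
    sub-range at a time (sketch of steps S304-S312).

    recovering / survivors: dict-like replicas (object id -> data).
    sub_range_of: function mapping an object id to its sub-range index.
    """
    locks = [threading.Lock() for _ in range(num_sub_ranges)]
    for i in range(num_sub_ranges):          # S312: loop over sub-ranges
        with locks[i]:                       # S304: write-lock sub-range i
            source = survivors[0]            # any surviving replica
            for oid, data in source.items():
                if sub_range_of(oid) == i and recovering.get(oid) != data:
                    recovering[oid] = data   # S306: compare and re-sync
        # S308/S310: lock released; sub-range i now serves reads/writes
```

Only one sub-range is locked at a time, so the surviving replicas continue to serve requests for every other sub-range throughout the recovery.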
This scheme therefore permits the recovery of a partition of data (bucket) without the need to implement a snapshot-journal-replay solution. Instead, recovery of partitions can happen during full live service and the necessary locking is made fine-grained to minimise the system impact on continued write requests. Further, because the recovering bucket is incrementally made available, it begins to support the system load (particularly read requests that only need be handled by a single replica) as soon as the (potentially time consuming) recovery process has begun.
As shown in
At step S404, a write-lock is obtained for a single sub-range of the new replica bucket and for that sub-range across all copies on the existing machines and any other new replicas already created on the new machine. At step S406, the data in the sub-range in the new replica bucket is synchronised with the existing buckets. At step S408, the write-lock is released for this sub-range for all buckets, including the new replica bucket. Thus at step S410, this sub-range in the new replica bucket is made available for read and write requests. At step S412, the system determines whether or not there are any additional sub-ranges to be synchronized and if so, loops through steps S404 to S412. The process thus populates the new replicas by treating them as completely out-of-date and synchronising each sub-range in turn until the replica bucket is fully online at step S414. Once the new replicas have been made, the replicas residing on the near-full machines can be taken offline and deleted as at optional step S416. Again, the pattern of sequentially locking each sub-range in turn avoids the need to implement a more heavy-weight snapshot-journal-replay solution whilst still maintaining full system availability.
In each embodiment, the number of sub-ranges (and therefore the data storage size associated with each sub-range) is configurable and tuned to make an optimal compromise between the speed of bucket recovery/data migration versus the length of time any one sub-range is blocked. The larger the data stored in a sub-range, the longer that sub-range will take to resynchronise and therefore the longer any pending write requests will have to be blocked. The sub-range sizing to select depends on many factors including the performance characteristics of the disk hardware, network and server processors. Merely as an example, recovery of a whole bucket of say 10 Mb may take one minute or longer but by using appropriately sized sub-ranges, of say 200 small files or 1 Mb, blocking of read/write requests for each sub-range may be reduced to milliseconds.
Similarly, in each embodiment, the distribution of sub-ranges within a range can be uniform or non-uniform and the sub-range pattern used in one bucket does not need to match the pattern used by a different bucket. The only constraint is that the sub-ranges between bucket replicas are identical to allow for consistent locking of these sub-ranges for write and recovery operations. The size of sub-ranges can be modified dynamically if suitable support is implemented to synchronise these changes across the replicas of a bucket. Such resizing can be used to maintain a reasonably constant amount of data stored within each sub-range—otherwise, the system will depend on the uniform distribution of object identifiers to keep the number of objects (and their total size) similar across all sub-ranges. It is desirable to keep the data size associated with each sub-range similar, or at least capped to a tunable maximum in order to guarantee a maximum block time during write or recovery operations.
There are many topologies for deploying this arrangement of storage components. The minimum system that still provides for fault-resilience requires a single machine with two disks, each disk storing a single bucket consisting of a single sub-range. However, this minimum configuration becomes equivalent to a simple mirrored disk solution such as RAID-1 (although with different recovery algorithms). The real benefit to this arrangement is realised when the data stored per disk is large such that it takes non-trivial time to copy an entire disk between machines, and when the total data capacity of the storage subsystem requires multiple disks on multiple machines. Further, the number of replicas of each bucket (partition) does not need to be consistent across the system; the only constraint is that the system maintains knowledge of how many replicas exist for each bucket and where they reside. The size of each bucket can also vary and does not need a system-defined limit. Multiple buckets can share the same physical storage medium (e.g. hard disk partition) and grow until their total size reaches the capacity of the physical storage.
The storage mechanism used within each bucket may be any suitable mechanism. The requirement is that an object can be created, updated and read from. The simple underlying storage requirements also mean that no specialised storage formatting is required. This scheme can be layered on top of any file system or database allowing for convenient copying of objects or collections of objects. These simple requirements also mean that no meta-data about each sub-range needs to be stored (and synchronised) other than the current object identifier range limits that each sub-range is responsible for. However, this lack of meta-data requires that every sub-range is at least considered for resynchronisation during recovery, which is an operation that might take considerable time. This time is not necessarily a problem as the system is still serving client requests while the recovery proceeds (potentially slowly) in the background.
If recovery time and network load are important factors to minimise then there are useful optimisations to be made if additional meta-data is stored per sub-range, such as the modification time or a modification sequence number. When such modification information is available, the copy operations required to resynchronise a bucket can be limited to copying only the data in the sub-ranges that have changed since a bucket replica failed.
A convenient means to arrange this modification information is to maintain, per bucket (and its replicas), an operation sequence number. This number is incremented on every update operation (write or delete request), and stored in the meta-data (on each replica) of the relevant sub-range. In this way each replica of each sub-range knows the operation number that last modified it. When a bucket replica has been offline and needs to be restored, it can compare its last operation sequence number with the latest operation numbers of its other replicas, and only needs to copy, from a surviving bucket, those sub-ranges whose operation numbers lie between the recovering replica's last number and the latest number reached by a surviving replica.
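The selection of sub-ranges to copy using per-sub-range operation sequence numbers might be sketched as follows (hypothetical function and parameter names, illustrating the comparison described above):

```python
def sub_ranges_to_copy(recovering_seq, surviving_seqs):
    """Decide which sub-ranges must be copied during recovery.

    recovering_seq: the last operation sequence number seen by the
    recovering replica before it went offline.
    surviving_seqs: dict mapping sub-range index -> latest operation
    number recorded in that sub-range's meta-data on a surviving replica.
    """
    # Only sub-ranges modified after the recovering replica went offline
    # (i.e. carrying a newer operation number) need to be copied; all
    # others are already up to date and can be skipped entirely.
    return [i for i, seq in sorted(surviving_seqs.items())
            if seq > recovering_seq]
```

Sub-ranges untouched since the failure are skipped, which is where the reduction in recovery time and network load comes from.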
Mobile devices that are capable of accessing content on the world wide web are becoming increasingly numerous. Some of the problems of known mobile search services are addressed in US 2007/00278329, US 2007/0067267, US 2007/0067304, US 2007/0067305 and US2007/0208704 to the present applicants and the contents of these applications are herein incorporated by reference. The overall topology of such a system is illustrated in
The search results sent to the users by the query server can be tailored to preferences of the user or to characteristics of their device. When conducting a search, the indexer builds a database of documents of numerous different types, e.g. images, music files, restaurant reviews, Wikipedia™ pages. For each type of document, various score data is also obtained using type-specific methods, e.g. restaurant reviews documents might have user supplied ratings, web pages have traffic and link-related metrics, music links often have play counts etc. Each of the above storage systems may be used to create, modify or otherwise maintain a database of searched material for use in such mobile search services.
A mobile device may be any kind of mobile computing device, including laptop and hand held computers, portable music players, portable multimedia players and mobile phones. Users can use mobile devices such as phone-like handsets communicating over a wireless network, or any kind of wirelessly-connected mobile devices including PDAs, notepads, point-of-sale terminals, laptops etc. Each device typically comprises one or more CPUs, memory, I/O devices such as keypad, keyboard, microphone, touchscreen, a display and a wireless network radio interface. These devices can typically run web browsers or microbrowser applications, e.g. Openwave™, Access™, Opera™, Mozilla™ browsers, which can access web pages across the Internet. These may be normal HTML web pages, or they may be pages formatted specifically for mobile devices using various subsets and variants of HTML, including cHTML, WML, DHTML, XHTML, XHTML Basic and XHTML Mobile Profile. The browsers allow the users to click on hyperlinks within web pages which contain URLs (uniform resource locators) which direct the browser to retrieve a new web page.
Such mobile search services may also comprise a database that stores detailed device profile information on mobile devices and desktop devices, including information on the device screen size, device capabilities and in particular the capabilities of the browser or microbrowser running on that device. Such a database may also be created, modified or otherwise maintained as described above.
The client applications and servers can be implemented using standard hardware. The hardware components of any server typically include: a central processing unit (CPU), an Input/Output (I/O) Controller, a system power and clock source; display driver; RAM; ROM; and a hard disk drive. A network interface provides connection to a computer network such as Ethernet, TCP/IP or other popular protocol network interfaces. The functionality may be embodied in software residing in computer-readable media (such as the hard drive, RAM, or ROM). A typical software hierarchy for the system can include a BIOS (Basic Input Output System) which is a set of low level computer hardware instructions, usually stored in ROM, for communications between an operating system, device driver(s) and hardware. Device drivers are hardware specific code used to communicate between the operating system and hardware peripherals. Applications are software applications written typically in C/C++, Java, assembler or equivalent which implement the desired functionality, running on top of and thus dependent on the operating system for interaction with other software code and hardware. The operating system loads after BIOS initializes, and controls and runs the hardware. Examples of operating systems include Linux™, Solaris™, Unix™, OSX™, Windows XP™ and equivalents.
No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.
This application claims the benefit of earlier filed provisional application Ser. No. 61/019,610 filed Jan. 8, 2008 entitled “Method of Recovering from Node Failure in Distributed Fault-tolerant Data Store”.