Entities increasingly rely on “cloud” storage services provided by cloud storage vendors, and many applications have therefore been designed to employ application program interfaces (“APIs”) provided by these vendors. Presently, a commonly used cloud storage service is AMAZON's Simple Storage Service (“S3”). A second commonly employed cloud storage service is MICROSOFT AZURE.
Although entities desire to use these applications that are designed to function with one or more cloud service APIs, they also sometimes want more control over how and where the data is stored. As an example, many entities prefer to use data storage systems that they have more control over, e.g., data storage servers commercialized by NetApp, Inc., of Sunnyvale, Calif. Such data storage systems have met with significant commercial success because of their reliability and sophisticated capabilities that remain unmatched, even among cloud service vendors. Entities typically deploy these data storage systems in their own data centers or at “co-hosting” centers managed by a third party.
Data storage systems provide their own protocols and APIs that are different from the APIs provided by cloud service vendors and so applications designed to be used with one often cannot be used with the other. Thus, some entities are interested in using applications designed for use on cloud storage services but with data storage systems they can exercise more control over.
Technology is disclosed for prefix matching using distributed tables for storage services compatibility (“disclosed technology”). In various embodiments, the disclosed technology supports capabilities for enabling a data storage system to provide aspects of a cloud data storage service API. The technology may employ an eventually consistent database for storing metadata relating to stored objects. The metadata can indicate various attributes relating to data that is stored separately. These attributes can include a mapping from data stored at a data storage system to how that data may be represented at a cloud data storage service, e.g., in an object storage namespace. For example, data may be stored in a file in the data storage system, but retrieved using an object identifier (e.g., similar to a uniform resource locator) provided by a cloud storage service.
A commercialized example of an eventually consistent database is “Cassandra,” but the technology can function with other databases. Such databases are capable of handling large amounts of data without a single point of failure, and are generally known in the art. These databases have partitions that can be clustered. Each partition can be stored in a separate computing device (“node”) and each row has an associated partition key that is the first part of the primary key for the table storing the row. Rows are clustered by the remaining columns of the key. Data that is stored at nodes is “eventually consistent,” in that other locations may be informed of the additional data (or changed data) over time.
Because data is partitioned and stored at different nodes, it can be difficult to retrieve the data in sorted order. Each partition can return its own data in sorted form, but the results from the various partitions can arrive at different times and in different orders. Thus, returning sorted data quickly is difficult. In various embodiments, the technology employs key prefixes and full keys (or prefixes and suffixes together). A prefix identifies a partition and a suffix (or full key) can be used to retrieve data from the partition in a sorted manner.
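The prefix and full-key arrangement described above can be illustrated with a minimal sketch. It assumes a hypothetical fixed-length prefix function (`key_prefix`, with `PREFIX_LEN` chosen arbitrarily) and uses an in-memory dictionary as a stand-in for database partitions; neither detail is specified by the disclosure.

```python
from bisect import insort

PREFIX_LEN = 2  # assumption: fixed-length prefixes; the disclosure does not fix a scheme


def key_prefix(key: str) -> str:
    """Hypothetical prefix function: the first PREFIX_LEN characters of the key."""
    return key[:PREFIX_LEN]


class PartitionedKeys:
    """In-memory stand-in for partitions, each keeping its keys in sorted order."""

    def __init__(self) -> None:
        self.partitions: dict[str, list[str]] = {}

    def put(self, key: str) -> None:
        # The prefix selects the partition; the full key orders rows inside it.
        part = self.partitions.setdefault(key_prefix(key), [])
        if key not in part:
            insort(part, key)

    def list_sorted(self) -> list[str]:
        # Visit partitions in prefix order and concatenate their sorted keys.
        # No cross-partition merge is needed, since every key in a partition
        # shares that partition's prefix.
        out: list[str] = []
        for prefix in sorted(self.partitions):
            out.extend(self.partitions[prefix])
        return out
```

Because partitions are visited one at a time in prefix order, the caller never has to interleave results that arrive from different partitions.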
In various embodiments, the technology creates and employs a “key_by_bucket” table to associate “buckets” of a cloud storage service provider with keys in the eventually consistent database. The key_by_bucket table can include a bucket_id column, a key_prefix column, a generation column, a key column, and a metadata column. The bucket_id column identifies a bucket identifier as would be associated with a cloud storage provider. The key_prefix column stores key prefixes that identify a partition, as explained above. The generation column can be used to indicate which stored data is newest. For example, when data is updated, the data may merely be added without replacing older data, and the generation for the added data may be incremented from the generation for the previously stored data. The key column can store the full key for each row. The metadata column stores the actual metadata that can be used to map a file stored at a data storage system to an object identifier. The primary key for this table can be a combination of the bucket_id, key_prefix, generation, and the key.
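As an illustration only, the columns of the key_by_bucket table and the use of the generation column can be sketched as follows. The row type, the `newest_per_key` helper, and the shape of the metadata value (a dictionary) are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import NamedTuple


class KeyByBucketRow(NamedTuple):
    # The first four fields mirror the primary key described above.
    bucket_id: str
    key_prefix: str
    generation: int
    key: str
    metadata: dict  # assumption: metadata modeled as a plain dictionary


def newest_per_key(rows):
    """Return, for each (bucket_id, key), the row with the highest generation."""
    newest = {}
    for row in rows:
        ident = (row.bucket_id, row.key)
        if ident not in newest or row.generation > newest[ident].generation:
            newest[ident] = row
    # Sort by bucket and full key so callers receive rows in key order.
    return sorted(newest.values(), key=lambda r: (r.bucket_id, r.key))
```

Under this sketch an update is an append with a larger generation, so older rows remain in place until some later cleanup removes them.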
The disclosed technology can also create a key_prefix_by_bucket table to associate buckets of a storage service with key prefixes. This table can include a bucket_id column, a key_prefix column, a generation column, an active column, and a splitting column. The bucket_id column, key_prefix column, and generation column store information as described above. The active column and the splitting column can store Boolean values indicating whether a row corresponds to active data and/or has a key prefix that is being split, and are described in further detail below. The primary key for this table can be a combination of the bucket_id, key_prefix, and the generation. In various embodiments, all key prefixes for a bucket are stored in a single partition. Doing so enables ordered retrieval because it guarantees that all key prefixes are retrieved in sorted order prior to the key query and “roll-up.”
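A minimal sketch of the prefix lookup followed by the roll-up is given below. It assumes that the active prefixes of a bucket do not overlap (no active prefix is the beginning of another active prefix), and it replaces the per-partition query with a plain dictionary lookup; the names `PrefixRow`, `active_prefixes`, and `list_bucket` are hypothetical.

```python
from typing import NamedTuple


class PrefixRow(NamedTuple):
    # Columns of the key_prefix_by_bucket table described above.
    bucket_id: str
    key_prefix: str
    generation: int
    active: bool
    splitting: bool


def active_prefixes(prefix_rows, bucket_id):
    """Sorted, de-duplicated active prefixes for one bucket."""
    return sorted({r.key_prefix for r in prefix_rows
                   if r.bucket_id == bucket_id and r.active})


def list_bucket(prefix_rows, keys_by_partition, bucket_id):
    """Roll up keys one partition at a time, in prefix order.

    keys_by_partition maps (bucket_id, key_prefix) to an already-sorted list
    of keys, standing in for a per-partition query against key_by_bucket.
    """
    result = []
    for prefix in active_prefixes(prefix_rows, bucket_id):
        result.extend(keys_by_partition.get((bucket_id, prefix), []))
    return result
```

Since the prefixes are read in sorted order from a single partition before any key query is issued, the per-partition results can simply be concatenated.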
Thus, the disclosed technology is able to provide bucket ordering when using an eventually consistent database without relying on locking features of the underlying database and without interleaving results from multiple partitions.
Several embodiments of the technology are described in more detail with reference to the Figures. The computing devices on which the described technology may be implemented may include one or more central processing units, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), storage devices (e.g., disk drives), and network devices (e.g., network interfaces). The memory and storage devices are computer-readable storage media that may store instructions that implement at least portions of the described technology. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer-readable media can comprise computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.
While
Those skilled in the art will appreciate that the logic illustrated in
If the technology receives updates to a row while the row's keys are being split, the update can go to both the old prefix and the new prefix. Doing so can help mitigate or eliminate race conditions. Queries (e.g., SELECTs) can retrieve data associated with the original prefix until the splitting is complete. In various embodiments, the new prefixes are set to active before the old prefixes are set to inactive. That way, the new data, now active, are returned instead of the old data. Thus, queries can return the highest generation active prefix. Cleanup of deletions can occur at a later time.
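The ordering of these steps can be sketched as follows. The sketch assumes that a prefix is split into longer prefixes, that the new prefixes receive a generation one higher than the prefix being split, and that a single bucket's prefix rows fit in an in-memory dictionary; the class and method names are hypothetical.

```python
class PrefixTable:
    """In-memory stand-in for the key_prefix_by_bucket rows of one bucket."""

    def __init__(self):
        # key_prefix -> {"generation": int, "active": bool, "splitting": bool}
        self.rows = {}

    def add(self, prefix, generation, active, splitting=False):
        self.rows[prefix] = {"generation": generation, "active": active,
                             "splitting": splitting}

    def begin_split(self, old, new_prefixes):
        # Mark the old prefix as splitting; add new prefixes as inactive rows.
        self.rows[old]["splitting"] = True
        gen = self.rows[old]["generation"] + 1  # assumption: generation bump
        for p in new_prefixes:
            self.add(p, gen, active=False)

    def activate_new(self, new_prefixes):
        # Step 1 of completion: new prefixes become active first.
        for p in new_prefixes:
            self.rows[p]["active"] = True

    def retire_old(self, old):
        # Step 2 of completion: only then is the old prefix made inactive.
        self.rows[old]["active"] = False
        self.rows[old]["splitting"] = False

    def write_targets(self, key):
        """Prefixes that receive an update: active ones plus pending new ones."""
        matching = [(p, r) for p, r in self.rows.items() if key.startswith(p)]
        floor = max((r["generation"] for _, r in matching if r["active"]),
                    default=-1)
        return sorted(p for p, r in matching
                      if r["active"] or r["generation"] > floor)

    def read_prefix(self, key):
        """Highest-generation active prefix that matches the key, if any."""
        candidates = [(r["generation"], p) for p, r in self.rows.items()
                      if r["active"] and key.startswith(p)]
        return max(candidates)[1] if candidates else None
```

In this sketch, reads keep using the original prefix until `activate_new` runs, and there is no moment at which a key has no active prefix, because `retire_old` is called only afterwards.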
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Accordingly, the invention is not limited except as by the appended claims.