Technical Field
This patent application relates to data storage systems, and more particularly to methods and systems for implementing an in-situ data management solution.
Background Information
The growth and management of unstructured data is perceived to be one of the largest issues for businesses that purchase and deploy data storage. Unstructured data is anticipated to grow at a rate of 40-60% per year for the next several years, driven by the proliferation of content generation in various file formats and by that content being copied multiple times throughout data centers.
Enterprises are already starting to feel the pain of this rapid data growth, and are looking for ways to store, manage, protect and migrate their unstructured data in a cost effective manner without needing to manage increasing volumes of hardware.
Conventionally, these enterprises purchase data storage assets in an appliance form factor, often needing to migrate data from one set of monolithic appliances to another as their data needs grow and scale. This approach is capital intensive, as storage appliances cost in the thousands of dollars per terabyte, and data migration projects routinely overrun their intended timeframes and incur additional service costs as a result.
The Public Cloud would seem to be one place to look for relief from such challenges, but a variety of objections stand between legacy storage appliances and Public Cloud adoption: data privacy concerns, data lock-in, data egress costs, complexity of migration, and the inability to make an all-or-none architecture decision across a diverse set of applications and data.
The in-situ cloud data management solution(s) described herein offer the ability to decouple applications and data from legacy storage infrastructure on a granular basis, migrating the desired data to a cloud architecture as policies, readiness, and needs dictate, and in a non-disruptive fashion. In so doing, the in-situ data management solution(s) allow what are typically a half dozen or more separate products and data management functions to be consolidated into a single system, scaling on demand, and shifting capital expenditure (CAPEX) outlays to operational expenditures (OPEX) while substantially reducing total cost of ownership (TCO).
In one implementation, an in-situ cloud data management solution may comprise one or more application or file servers, each running one or more software connector components, which in turn are connected to one or more data management nodes. The data management nodes are in turn connected to one or more data storage entities.
The connector components may be installed into application or file servers, and execute software instructions that intercept input/output (I/O) requests from applications or file systems and forward them to one or more data management nodes. The I/O requests may be file system operations or block-addressed operations to access data assets such as files, directories or blocks. The I/O intercepts may be applied as a function of one or more policies. The policies may be defined by an administrative user, or may be automatically generated based on observed data access patterns.
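By way of a non-limiting illustration, the following Python sketch shows one way a connector component might decide, as a function of a policy, whether to intercept and forward an I/O request or leave it to be processed locally. The InterceptPolicy fields (path prefix, name patterns, idle time) are assumptions made for illustration and are not a prescribed policy format.

```python
import fnmatch
import os
import time
from dataclasses import dataclass, field

@dataclass
class InterceptPolicy:
    """Illustrative policy describing which data assets the connector forwards."""
    path_prefix: str = "/"                                       # only assets under this directory
    name_patterns: list = field(default_factory=lambda: ["*"])   # glob patterns on the file name
    min_idle_seconds: float = 0.0                                # only assets not modified recently

    def matches(self, path: str) -> bool:
        if not path.startswith(self.path_prefix):
            return False
        if not any(fnmatch.fnmatch(os.path.basename(path), p) for p in self.name_patterns):
            return False
        try:
            idle = time.time() - os.path.getmtime(path)
        except OSError:
            return False
        return idle >= self.min_idle_seconds

def route_request(path: str, policies: list) -> str:
    """Return 'forward' if any policy claims the asset, else 'local'."""
    return "forward" if any(p.matches(path) for p in policies) else "local"

# Hypothetical policy: forward I/O for log files that have been idle for more than a day.
policies = [InterceptPolicy(path_prefix="/data", name_patterns=["*.log"], min_idle_seconds=86400)]
print(route_request("/data/archive/old.log", policies))  # prints "forward" for a qualifying file
```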
The data management nodes execute software instructions implementing application and file system translation layers capable of interpreting requests forwarded from software connectors. The data management nodes also may include a database of object records for both data and meta-data objects, persistent storage for meta-data objects, a data object and meta-data cache, a storage management layer, and policy engines.
The data management nodes may store file system and application meta-data in a database as objects. Along with the conventional attributes commonly referred to as file system meta-data (e.g., the Unix stat structure), the database may be used to associate i-node numbers with file names and directories using a key-value pair schema.
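As a minimal sketch of such a key-value pair schema, the fragment below associates i-node numbers with names in a parent directory and keeps a stat-like attribute record per i-node. The key layout and the in-memory dictionary standing in for the database are assumptions made purely for illustration.

```python
# Hypothetical key-value layout: directory entries map (parent_inode, name) -> child_inode,
# and each i-node maps to its stat-like attribute record. Any key-value store could back this.
meta_db = {}

def put_dirent(parent_inode: int, name: str, child_inode: int) -> None:
    meta_db[("dirent", parent_inode, name)] = child_inode

def put_attrs(inode: int, attrs: dict) -> None:
    meta_db[("attrs", inode)] = attrs

def lookup(parent_inode: int, name: str) -> dict:
    """Resolve a name within a directory to its attribute record."""
    child = meta_db[("dirent", parent_inode, name)]
    return meta_db[("attrs", child)]

# Example: /reports/q3.pdf with root directory i-node 2.
put_dirent(2, "reports", 100)
put_attrs(100, {"mode": 0o755, "type": "dir", "size": 0})
put_dirent(100, "q3.pdf", 101)
put_attrs(101, {"mode": 0o644, "type": "file", "size": 482_113})
print(lookup(100, "q3.pdf"))
```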
Files, directories, file systems, users and application and file servers may have object records in the database, each of which may be uniquely identified by a cryptographic hash or monotonically increasing number.
Contents of files may be broken into variable sized chunks, ranging from 512 bytes to 10 MB, and those chunks are also assigned to object records, which may be uniquely identified by a cryptographic hash of their respective contents. The chunks themselves may be considered to be data objects. Data objects are described by object records in the database, but are not themselves stored in the database. Rather, the data objects are stored in one or more of the data storage entities.
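By way of a non-limiting illustration, the sketch below breaks a byte stream into bounded, variable-sized chunks and names each chunk by the SHA-256 hash of its contents. The simple rolling-sum boundary test, window size, and mask are assumptions; any content-defined or fixed chunking scheme within the stated size bounds could be substituted.

```python
import hashlib

MIN_CHUNK = 512                # 512 bytes
MAX_CHUNK = 10 * 1024 * 1024   # 10 MB
WINDOW = 48
MASK = 0x1FFF                  # expected boundary roughly every 8 KB

def chunk(data: bytes):
    """Yield (chunk_id, chunk_bytes) pairs for variable-sized, content-addressed chunks."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling = (rolling + byte - (data[i - WINDOW] if i - start >= WINDOW else 0)) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (rolling & MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            piece = data[start:i + 1]
            yield hashlib.sha256(piece).hexdigest(), piece
            start = i + 1
            rolling = 0
    if start < len(data):
        piece = data[start:]
        yield hashlib.sha256(piece).hexdigest(), piece

# Identical chunks produce identical identifiers, so duplicate content maps
# to a single data object regardless of which file it came from.
for chunk_id, piece in chunk(b"example payload " * 100):
    print(chunk_id[:16], len(piece))
```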
File system meta-data in the database points to the data object(s) described by that meta-data via the unique identifiers of those objects.
The data storage entities may typically include cloud storage services (e.g., Amazon S3 or other Public Cloud Infrastructure as a Service (IaaS) platforms in Regional Clouds), third party storage appliances, or in some implementations, one or more solid state disks, or one or more hard drives. The data management nodes may communicate with each other, and therefore, by proxy, with more than one data storage entity.
The database accessible to the data management node may contain a record for each object consisting of the object's unique name, object type, reference count, logical size, physical size, a list of storage entity identifiers consisting of {the storage entity identifier, the storage container identifier (LUN), and the logical block address}, a list of children objects, and/or a set of custom object attributes pertaining to the categories of performance, capacity, data optimization, backup, disaster recovery, retention, disposal, security, cost and/or user-defined meta-data.
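A minimal sketch of what such an object record might look like as a data structure follows; the exact field names and the Python representation are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class StorageLocation:
    """Where one copy of the object's contents physically resides."""
    storage_entity_id: str       # which data storage entity
    container_id: str            # storage container identifier (e.g., a LUN)
    logical_block_address: int   # offset within the container

@dataclass
class ObjectRecord:
    unique_name: str                     # cryptographic hash or monotonically increasing number
    object_type: str                     # "file", "directory", "chunk", "user", ...
    reference_count: int = 0
    logical_size: int = 0
    physical_size: int = 0
    locations: List[StorageLocation] = field(default_factory=list)
    children: List[str] = field(default_factory=list)   # unique names of child objects
    # Custom attributes grouped by category: performance, capacity, data optimization,
    # backup, disaster recovery, retention, disposal, security, cost, user-defined.
    custom_attributes: Dict[str, Dict[str, Any]] = field(default_factory=dict)

record = ObjectRecord(
    unique_name="9f2c...e1",   # truncated hash shown for illustration
    object_type="file",
    reference_count=2,
    logical_size=1_048_576,
    physical_size=524_288,
    locations=[StorageLocation("s3-us-east-1", "bucket-7", 0)],
    custom_attributes={"performance": {"max_latency_ms": 20},
                       "retention": {"keep_days": 2555}},
)
```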
The custom object attributes in the database contain information that is represented as object requirements and/or object policies.
The database may also contain storage classification information, representing characteristics of the data storage entities accessible to data management nodes for any of the aforementioned custom object attributes.
Object requirements may be gathered during live system operation by monitoring and recording information pertaining to meta-data and data access within the file system for any of the aforementioned custom attribute categories. In this case, object attributes may be journaled to an object attribute log in real-time and subsequently processed to determine object requirements and the extent to which those requirements and/or policies are being satisfied.
Object requirements may also be gathered by user input to the system for any of the aforementioned attribute categories.
Object policies may be defined by user input to the system for any of the aforementioned attribute categories, and may also be learned by interactions between the software connector(s) and data management node(s), wherein the data management node may perform its own analysis of the requirements found within the custom object attribute information.
Requirements and policies may be routinely analyzed by a set of policy engines to create marching orders. Marching orders reflect the implementation of a policy with respect to its requirements for any object or set of objects described by the database.
When the data storage entities are unable to meet the requirements and/or fulfill the policies, the data management node may describe and/or provision specific data storage entities that are additionally required to meet those needs.
If the required data storage entities to meet those needs are virtual entities, such as data volumes in a Public Cloud, or data volumes on a third party storage appliance (IaaS or otherwise), the data management node can provision such virtual entities via an Application Programming Interface (API), and the capacity and performance of those entities is immediately brought online and is usable by the data management node.
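Purely as a hedged illustration, provisioning such a virtual entity might be performed through a public cloud API such as the boto3 S3 client shown below. The bucket naming, region handling, and error handling are placeholders; the actual provisioning call would depend on the target storage entity and its API.

```python
import boto3
from botocore.exceptions import ClientError

def provision_s3_container(bucket_name: str, region: str = "us-east-1") -> bool:
    """Provision a new S3 bucket to act as a virtual data storage entity.

    Returns True when the bucket exists and is usable at the time the call
    returns, which is what allows its capacity to be brought online immediately.
    """
    s3 = boto3.client("s3", region_name=region)
    try:
        if region == "us-east-1":
            s3.create_bucket(Bucket=bucket_name)
        else:
            s3.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        return True
    except ClientError as err:
        # A bucket that already exists and is owned by this account is also usable.
        if err.response["Error"]["Code"] == "BucketAlreadyOwnedByYou":
            return True
        return False
```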
Objects may be managed, replicated, placed within, or removed from data storage entities as appropriate via the marching orders to accommodate the requirements and policies associated with those objects.
File system disaster recovery and redundancy features may also be implemented, such as snapshots, clones and replicas. The definition of data objects in the system enables the creation of disaster recovery and redundancy policies at a fine granularity, specifically sub-snapshot, sub-clone, and sub-replica levels, including on a per-file basis.
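Because data objects are immutable and content-addressed, a per-file snapshot can amount to little more than a copy of the file's meta-data record plus reference-count increments on the chunks it points to. The sketch below illustrates that idea; the dictionary-based record layout is an assumption, not a prescribed format.

```python
import copy
import time

# Hypothetical in-memory object database: unique name -> record.
object_db = {
    "file:report.doc": {"type": "file", "chunks": ["sha:aa", "sha:bb"]},
    "sha:aa": {"type": "chunk", "refcount": 1},
    "sha:bb": {"type": "chunk", "refcount": 1},
}

def snapshot_file(db: dict, file_key: str) -> str:
    """Create a per-file snapshot without copying any data objects.

    Only the file's meta-data record is duplicated; the immutable,
    content-addressed chunks are shared and their reference counts bumped.
    """
    snap_key = f"{file_key}@{int(time.time())}"
    db[snap_key] = copy.deepcopy(db[file_key])
    for chunk_key in db[file_key]["chunks"]:
        db[chunk_key]["refcount"] += 1
    return snap_key

snap = snapshot_file(object_db, "file:report.doc")
print(snap, object_db["sha:aa"]["refcount"])   # the chunk is now referenced twice
```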
Features and Advantages:
The disclosed system has a number of advantageous features, it being understood that not all embodiments described herein necessarily implement all described features.
The disclosed system may operate in-situ of legacy application and file servers, enabling all described functionality with simple software installations.
The disclosed system may allow for an orderly and granular adoption of cloud architectures for legacy application and file data with no disruption to those applications or files.
The disclosed system may decouple file system meta-data and user data in a singularly managed cloud data management solution.
The disclosed system may allow for the creation and storage of custom meta-data associated with users, application and file servers, files and file systems, enabling the opportunity to create data management policies as a function of that custom meta-data.
The disclosed system may store requirements and service level agreements for users, application and file servers, files, and file systems, and can implement policies to accommodate them at equivalent granularities and custom subsets of granularities.
The disclosed system may enable data storage entities that are classically used for the storage of application and file data to be deployed independently of the entities used for data management and file system meta-data storage.
The disclosed system may allow for the mobility of meta-data required for applications or users to access data independent of the location of storage entities housing the actual data.
The disclosed system may create a truly granular pay-as-you-grow consumption model for data storage entities by allowing for the flexible deployment of one or more data storage entities in an independent manner.
The disclosed system may create the opportunity to dispose of legacy data storage entities and replace them with more cost effective, enterprise quality, commodity components, whether physical or virtual, at a greatly reduced total cost of ownership (TCO).
The disclosed system may allow for mobility of data objects across different data storage assets, including those in various clouds, in the most cost-effective possible manner that meets prescribed requirements and service level agreements.
The disclosed system may eliminate the need for data storage migration projects.
The disclosed system may free enterprises from the concept of vendor lock-in with their data storage assets, whether physical or virtual.
The disclosed system may allow enterprises to optimize their data sets, via technologies such as deduplication and compression, globally, across all data storage entities being used for the storage of their data.
The disclosed system may enable fine-grained data management policies on backup copies of data, specifically at sub-snapshot, sub-clone, and sub-replica granularity, creating the opportunity to optimize storage requirements and costs for backup data.
The disclosed system may collapse several data storage and data management products into a single, software only offering.
The description below refers to the accompanying drawings.
The following is a detailed description of an in-situ data management solution with reference to one or more preferred embodiments. It will be understood however by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention(s) sought to be protected by this patent.
Connector components 112 may reside as software executing on one or more application 110 or file servers 111. Connector components 112 may exist as filter drivers or kernel components within the operating system of the application 110 or file servers 111, and may, for example, run on either Windows or Linux operating systems. The connector components 112 may intercept block level or file system level requests and forward them to the data management node 120 for processing. Connector components 112 preferably only forward requests for data assets that the data management node 120 has taken ownership of, either through explicit administrator action or policy.
When software connector components 112 are first installed into an application 110 or file server 111, they typically do not initially intercept or interfere with any Input/Output (I/O) requests on that application or file server. Only a subsequent action taken on a data management node 120, set via administrator or policy, indicates to the connector 112 that it should take over ownership of a “data asset” such as a file, directory, file set, or application's data. Upon doing so, the connector 112 may make use of an existing file system mechanism in the operating system, such as an NTFS reparse point, to redirect the I/O request to the data management node.
Ownership of an asset may be indicated to the connector component 112 in a number of ways. In one example, the connector component 112 receives a command from the data management node 120 to assume ownership of, and thus responsibility for processing access requests to, one or more data assets accessible to the application server 110 or file server 111. The connector component 112 may then store a persistent cookie, or some other data, associated with the one or more affected data assets. Thus, upon subsequent processing of an access request related to a specific data asset, if a persistent cookie is found, the connector component 112 knows to forward the access request to the data management node 120. Otherwise, the application or file server will process the access request locally, without forwarding the access request to the data management node.
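A minimal sketch of this ownership check is shown below, assuming the persistent marker is stored as a small sidecar file next to the asset; the actual mechanism (such as an NTFS reparse point) would be platform-specific, and the marker naming is hypothetical.

```python
import os

MARKER_SUFFIX = ".dmn-owned"   # hypothetical persistent-cookie file name

def assume_ownership(asset_path: str, node_id: str) -> None:
    """Record that a data management node now owns this data asset."""
    with open(asset_path + MARKER_SUFFIX, "w") as marker:
        marker.write(node_id)

def handle_request(asset_path: str, operation: str):
    """Forward the request if an ownership marker exists, else serve it locally."""
    marker_path = asset_path + MARKER_SUFFIX
    if os.path.exists(marker_path):
        with open(marker_path) as marker:
            node_id = marker.read().strip()
        return ("forward", node_id, operation)
    return ("local", None, operation)
```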
Multiple data management nodes 120 are typically present for the purposes of redundancy. Data management nodes 120 can exist in a cluster 122 or as standalone entities. If in a cluster 122, data management nodes 120 may communicate with each other via a high speed, low latency interconnect, such as 10 Gb Ethernet.
Data management node(s) 120 also connect to one or more data storage entities 130.
Data storage entities 130 may include any convenient hardware, software, local, remote, physical, virtual, cloud or other entity capable of reading and writing data including but not limited to individual hard disk drives (HDD's), solid state drives (SSD's), directly attached Just a Bunch of Disks (JBOD) enclosures 134 thereof, third party storage appliances 133 (e.g., EMC, SwiftStack), file servers, and cloud storage services (e.g., Amazon S3 131, Dropbox, OneDrive, IaaS 132, etc.).
Data storage entities 130 can be added to the data management solution 100 without any interruption of service to application 110 or file 111 servers, and with immediate availability of the capacity and performance capabilities of those entities 130.
Data storage entities 130 can also be targeted for removal from the data management solution 100. After data is transparently migrated off of those data storage entities (typically onto one or more other data storage entities), the data storage entities 130 targeted for removal can be disconnected from the data management solution without any interruption of service to application or file servers. For example, the content of JBOD entities 134 may be migrated to Amazon S3 131 or cloud storage 132 using the techniques described herein.
A data management node 120 may be a data processor (physical, virtual, or more advantageously, a cloud server) that contains various translation layers—for example, one layer for each supported file system—that interpret a stream of native I/O requests routed from a file 111 or application 110 server via one or more connector components 112.
A data management node 120 may be accessed via a graphical user interface (GUI) 202. This GUI 202 may be used to perform administrative functions 203, such as system configuration, establishing relationships with file 111 and application 110 servers that have software connector components 112 installed, integrating with data storage entities 130 (whether cloud services or third party storage appliances or otherwise), and setting and configuring policies with respect to data management functions.
A data management node 120 may contain an object cache and a meta-data cache 204, which contain non-persistent copies of recently or frequently accessed data objects and/or file system meta-data objects.
A data management node 120 may also contain a database 206 which contains file system meta-data and object records for files and data objects known to the data management node 120, and storage classification data for the connected data storage entities 130. Contents of meta-data objects may persistently reside in local storage associated with the database 206; contents of data objects, however, persistently reside on data storage entities 130 that are connected to the data management node(s). Thus, the storage of file system meta-data and the storage of file data are decoupled in the context of the data management node 120.
File system meta-data in one embodiment may refer to a standard set of file attributes such as those found in POSIX compliant file systems. A meta-data object record in this embodiment may refer to a custom data structure, an example of which is shown in
Storage classification 210C refers to a set of information on a per data storage entity 130 basis that reflects characteristics pertinent to the custom object attribute categories. In one example, these may be performance 317, capacity 318, security 324, cost 325, and user-defined 326, although myriad other storage classifications 210C may be defined.
A data management node 120 may contain an object attribute log 208 which represents activity relative to the object requirements 210A and object policies 210B, including custom object attribute categories, as recorded by the data management node 120. This object attribute log 208 is used to create or update custom object attribute meta-data within the database 206. Example attribute log entries may pertain to access speed, as expressed in latency, or to frequency of access or modification, as expressed in number of accesses or modifications per unit of time. Entries within the object attribute log 208 pertaining to a single object and single attribute category may be consolidated into a single attribute log entry so as to optimize database update operations. Processed object attribute log information stored in the database 206 may subsequently be fed to the policy engines 212 as object requirements 210A or object policies 210B.
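As one hedged illustration of consolidating raw attribute log entries for the same object and attribute category before the database is updated, consider the sketch below. The entry format and the aggregation rules (averaging latency samples, summing counters) are assumptions made for illustration.

```python
from collections import defaultdict

# Raw journal entries: (object_id, attribute_category, metric, value)
attribute_log = [
    ("obj-17", "performance", "latency_ms", 4.2),
    ("obj-17", "performance", "latency_ms", 6.1),
    ("obj-17", "performance", "access_count", 1),
    ("obj-17", "performance", "access_count", 1),
    ("obj-42", "capacity", "bytes_written", 8192),
]

def consolidate(entries):
    """Collapse entries per (object, category) into one record each,
    averaging latency samples and summing counters."""
    samples = defaultdict(list)
    for obj, category, metric, value in entries:
        samples[(obj, category, metric)].append(value)

    consolidated = defaultdict(dict)
    for (obj, category, metric), values in samples.items():
        if metric.endswith("_ms"):
            consolidated[(obj, category)][metric] = sum(values) / len(values)
        else:
            consolidated[(obj, category)][metric] = sum(values)
    return dict(consolidated)

# One database update per (object, attribute category) instead of one per raw entry.
print(consolidate(attribute_log))
```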
A data management node 120 may also generate object requirements 210A and object policies 210B based on custom object attributes that are stored in the database 206. Object requirements 210A are either gathered via user input or generated automatically by the data management node. When generated automatically, object requirements 210A may be gathered during live system operation by monitoring and recording information pertaining to meta-data and data access within the file system. Object policies 210B may be defined by user input to the data management node, and may also be learned by the data management node 120 performing its own analysis of the requirements found within the custom object attribute information.
A data management node 120 may also contain a set of policy engines 212 which take object requirements 210A, object policies 210B and storage classifications 210C as input and generate a set of marching orders 214 by performing analysis of said inputs. Storage classifications 210C may consist of both real-time measured and statically obtained data reflecting capabilities of the underlying data storage entities. For instance, performance capabilities of a given storage entity 130 may be measured in real-time, and utilization calculations are then performed to determine the performance capabilities of said entity. Capacity information, however, may be obtained statically by querying the underlying storage entity 130 via publicly available API's, such as OpenStack. Marching orders 214 may reflect the implementation of a policy with respect to its requirements for any object or set of objects described in the database 206. More specifically, marching orders 214 may describe where and how objects should be managed or placed within, or removed from, the data management solution 100.
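By way of a non-limiting illustration, the sketch below shows how a policy engine might compare object requirements against storage classifications and emit marching orders. The record layouts, the single latency dimension, and the cost-based tie-breaking are simplifying assumptions rather than a prescribed algorithm.

```python
def policy_engine(requirements, placements, classifications):
    """Emit marching orders: move any object whose current storage entity cannot
    satisfy its latency requirement to the cheapest entity that can.

    requirements:    object_id -> {"max_latency_ms": float}
    placements:      object_id -> current storage entity id
    classifications: entity_id -> {"latency_ms": float, "cost_per_gb": float}
    """
    orders = []
    for obj, req in requirements.items():
        current = placements[obj]
        if classifications[current]["latency_ms"] <= req["max_latency_ms"]:
            continue   # requirement already satisfied in place
        candidates = [e for e, c in classifications.items()
                      if c["latency_ms"] <= req["max_latency_ms"]]
        if not candidates:
            orders.append({"op": "notify_admin", "object": obj})
            continue
        target = min(candidates, key=lambda e: classifications[e]["cost_per_gb"])
        orders.append({"op": "move", "object": obj, "from": current, "to": target})
    return orders

orders = policy_engine(
    requirements={"obj-17": {"max_latency_ms": 5.0}},
    placements={"obj-17": "s3-standard"},
    classifications={"s3-standard": {"latency_ms": 40.0, "cost_per_gb": 0.023},
                     "local-ssd":   {"latency_ms": 0.5,  "cost_per_gb": 0.10}},
)
print(orders)   # one marching order moving obj-17 from s3-standard to local-ssd
```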
The data management node 120 may also contain a storage management layer 216 which manages the allocation and deallocation of storage space on data storage entities within the solution 100. The storage management layer 216, therefore, is responsible for implementing the marching orders it is presented with by the policy engines. In order to fulfill this obligation, the storage management layer has access to information concerning the characteristics and capabilities of the underlying data storage entities 130 with respect to the custom object attribute categories previously defined, and creates storage classifications on those dimensions, which in turn are stored in the database 206. Thus, marching orders 214 clearly describe an object or object set, the operation, and the associated data storage entity or entities to be targeted by each operation.
By way of example, a description of an object requirement is shown in
More particularly, step 504 may determine if a latency policy already exists. If no latency policy exists, then a threshold is set in step 505. If a latency policy already exists, then step 506 modifies its threshold per the input from steps 502 and/or 503. Next, in step 508, the database 206 is updated with this new requirement. Step 510 analyzes the latency requirement and current performance available from the data storage entities 130. If the requirements are satisfied, step 512 marks them as such and then step 514 ends this process. Otherwise, step 515 assesses available storage entities and if the requested performance is available in step 516, step 522 moves the objects associated with the access request to the new appropriate entities and then ends this process in step 530. If step 516 cannot find appropriate entities, then an administrator can be notified in step 517 before ending this process.
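The following sketch walks the same flow in code form; the step numbers from the example appear as comments, and the data structures and thresholds are assumed purely for illustration.

```python
def apply_latency_policy(db, obj_id, requested_latency_ms, storage):
    """Illustrative walk of steps 504-530 for a single object's latency policy.

    db:      object_id -> {"latency_policy_ms": float, "entity": str}
    storage: entity_id -> measured latency in milliseconds
    """
    record = db[obj_id]
    if "latency_policy_ms" not in record:                       # step 504: no policy yet
        record["latency_policy_ms"] = requested_latency_ms      # step 505: set threshold
    else:
        record["latency_policy_ms"] = requested_latency_ms      # step 506: modify threshold
    # step 508: the database record now holds the new requirement.

    current = storage[record["entity"]]
    if current <= record["latency_policy_ms"]:                  # steps 510/512: satisfied
        return "satisfied"

    # steps 515/516: assess available storage entities for the requested performance.
    candidates = [e for e, lat in storage.items()
                  if lat <= record["latency_policy_ms"]]
    if candidates:                                              # step 522: move the objects
        record["entity"] = min(candidates, key=storage.get)
        return f"moved to {record['entity']}"
    return "notify administrator"                               # step 517

db = {"obj-9": {"entity": "s3-standard"}}
storage = {"s3-standard": 40.0, "local-ssd": 0.5}
print(apply_latency_policy(db, "obj-9", 5.0, storage))          # moved to local-ssd
```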
By way of example, a method to routinely assess object requirements and provision data storage entities to accommodate those requirements is shown in
Further extending the above example, a method to dispose of a legacy data storage entity is shown in
From step 706, migration may be performed completely transparently to any of the application or file servers in question. Upon completion of the migration, the user could be notified in step 707 that the legacy storage appliance could be unplugged and removed from the solution with no interruption of service to any of the application or file servers.
By way of example, a method for using custom, user-defined meta-data to define and fulfill a data management policy is shown in
By way of example, a method for using the disclosed system to mobilize meta-data and enable data to be accessed in another location without moving the associated data is shown in
Thus, in a first step 902, the system identifies a need to mobilize meta-data to permit access to existing data objects by a new server in Boston, Mass., such as via user input or via automated analysis of access requirements. In step 903, a new data management node is deployed in the new region. In step 904, a new connector component is installed on the new server. Step 905 joins the new data management node to the existing data management node cluster in Nashua, N.H. Step 906 replicates the meta-data between data management nodes—but the data objects themselves remain in the data storage entities. Steps 907, 908, and 909 then instantiate a file system on the new server (again, without copying actual data objects or files).
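A minimal sketch of the replication step is shown below: only meta-data records are copied between data management nodes, while the data objects remain where they are in the data storage entities. The node class and record layout are illustrative assumptions.

```python
class DataManagementNode:
    """Toy model of a node holding meta-data records locally; the data objects
    referenced by those records live in shared data storage entities."""

    def __init__(self, name: str):
        self.name = name
        self.meta_db = {}   # object_id -> meta-data record

    def replicate_metadata_to(self, peer: "DataManagementNode") -> int:
        """Copy meta-data records to a peer node; no data objects are moved."""
        copied = 0
        for object_id, record in self.meta_db.items():
            if object_id not in peer.meta_db:
                peer.meta_db[object_id] = dict(record)
                copied += 1
        return copied

# Existing node in the original region and a newly deployed node in the new region.
nashua = DataManagementNode("nashua")
nashua.meta_db["file:plan.doc"] = {"chunks": ["sha:aa", "sha:bb"],
                                   "stored_in": "s3-us-east-1"}
boston = DataManagementNode("boston")
print(nashua.replicate_metadata_to(boston))          # 1 meta-data record copied
# The new node can now resolve file:plan.doc and fetch its chunks directly from
# the same data storage entity; the chunks themselves were never copied.
print(boston.meta_db["file:plan.doc"]["stored_in"])
```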
The illustrated data management solution 1000 comprises one or more network clients 1010 connected to one or more data management nodes 1020, and data management nodes connected to one or more data storage entities 1030.
Multiple data management nodes 1020 may be present for the purposes of redundancy.
Network clients 1010 may connect to the data management nodes 1020 via standard network data protocols, such as but not limited to NFS, SMB, iSCSI, or object storage protocols such as Amazon S3.
Data management nodes 1020 can exist in a cluster 1022 or as standalone entities. If in a cluster 1022, data management nodes 1020 communicate with each other via a high speed, low latency interconnect, such as Infiniband or 10 Gigabit Ethernet.
Data management nodes 1020 also connect to one or more data storage entities 1030.
Data storage entities 1030 may include any convenient hardware, software, local, remote, physical, virtual, cloud or other entity capable of reading and writing data objects including but not limited to individual hard disk drives (HDD's), solid state drives (SSD's), directly attached JBOD enclosures thereof, third party storage appliances (e.g., EMC), file servers, and cloud storage services (e.g., Amazon S3, Dropbox, OneDrive, etc.).
Data storage entities 1030 can be added to the data management solution 1000 without any interruption of service to connected network clients 1010, and with immediate availability of the capacity and performance capabilities of those entities.
Data storage entities 1030 can be targeted to be removed from the data management system, and after data is transparently migrated off of those data storage entities, can then be removed from the system without any interruption of service to network clients.
The data storage methods and systems described herein provide for decoupling of data from related meta-data for the purpose of improved and more efficient access to cloud-based storage entities.
The methods and systems described herein also enable replacement of legacy storage entities, such as third party storage appliances, with cloud based storage in a transparent online data migration process.
Specific data access requirements, service levels (SLAs) and policies needed to implement them are also supported. These requirements, service levels, and policies are also expressed as meta-data maintained within a database in the data management node. The system also provides the ability to measure and project growing data requirements and to identify and deploy the data storage entities required to fulfill those requirements. User-defined meta-data may also be stored with the system-generated meta-data and exposed for further use in applying the policies and/or otherwise as the user may determine.
In other aspects, the systems and methods enable global migration of objects across heterogeneous storage entities.
The present application claims priority to U.S. Provisional Patent Application No. 62/270,338, filed on Dec. 21, 2015 by Gregory J. McHale for a “FLEXIBLY DEPLOYABLE STORAGE ENTITIES IN A POLICY AND REQUIREMENTS AWARE DATA MANAGEMENT ECOSYSTEM WITH DECOUPLED FILE SYSTEM META-DATA AND USER DATA”, the contents of which are incorporated by reference herein in their entirety.