Technical Field
This patent application relates to data-aware storage systems.
Background Information
Most data storage systems embody tradeoffs between performance, capacity, reliability, price, and other factors. Heterogeneous systems employ diverse types of storage to optimize these tradeoffs, which succeeds to the extent that the tradeoffs are nonlinear. For example, a few fast disks combined with many cheap disks can be nearly as fast as using fast disks alone, and nearly as cheap as using cheap disks alone.
Tiering generally refers to heterogeneous systems wherein data is placed on one type of storage versus another. Caching can be viewed as a heterogeneous system wherein data temporarily resides on one type of storage and is then moved to another type of storage. These techniques, which are all well-known in the art, are sometimes referred to as Hierarchical Storage Management (HSM). An assumption of hierarchy typically underlies these techniques, where storage tier 0 is “better” in terms of performance, and costlier, than storage tier 1, which is in turn better and costlier than storage tier 2, and so on.
The storage management systems and methods described herein are data-aware, and consider more than performance criteria in how they manage data. In particular, data is mapped to appropriate storage entities based on a classification for the data that is determined from analytics. A set of policies is also determined that defines how data in a particular class is to be assigned, or mapped, to appropriate storage entities.
The approach provides a number of unique advantages. The analytics provide in-depth pattern matching, search, and discovery to automatically analyze data and assign it to a class, for example, a sensitive data class that may be at risk and must be protected. Policies defined for the data class are then applied to ensure the data is housed in a storage entity appropriate for that class.
In one embodiment, an active data-aware storage management system includes at least a data classifier and a mapper. The data classifier and mapper have access to data stored in one or more logical storage entities.
The logical storage entities may include local or on-premises storage arrays, network attached storage, storage appliances (remote or local), data storage services such as public or private cloud services, or any other storage entity accessible to the system.
The classifier of the active data-aware storage management system serves as a way to intelligently classify data of many types, including data objects, data streams, or even data that may be embedded in other structures such as virtual machine files. The data-aware classifier can potentially recognize a wide variety of characteristics of the data, limited only by available computing power rather than hardcoded conventions.
In some embodiments, the data-aware classifier may discover characteristics of the data by analyzing its content, encoding schemes, or format, by monitoring access patterns, by sampling the behavior of a data object or data stream, by observing relationships between objects, and so on. Such a classifier may also determine that some objects have sensitivity (e.g., they contain information which should be protected, or to which access should be restricted), or functional importance (e.g., system files), that data objects have semantic importance (e.g., an annual report), or that some data objects have both functional and semantic importance (e.g., an index), or that some objects, while not important at all, should nonetheless be given high retention guarantees (e.g., an old email).
The classifier may also infer relationships between these characteristics. In one instance, relating content attributes, workload attributes, and schedule attributes might reveal that objects containing the keyword “virtualization” are write-intensive on weekdays.
The mapper associates or maps a data class to a logical data storage entity, which is to say that it intelligently selects a set of policies that apply to that data class, and identifies a logical storage entity (such as an on-premises storage subsystem or a remote data storage service) that matches the desired policies to at least some degree. The mapping function is responsible for computing the optimality of a storage entity for a data class. An implementor or user may be responsible for defining the policies, such as in the form of an objective function or, more simply, in the form of a static graph relating appropriate classes to appropriate storage entities.
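By way of illustration, the simpler static-graph form of such a policy definition might be sketched as follows in Python; the class names, entity names, and default value are hypothetical:

    # Hypothetical static graph relating data classes to appropriate logical
    # storage entities; an implementor or user would define these associations.
    STATIC_CLASS_MAP = {
        "sensitive":  "encrypted_on_premises_array",
        "high_churn": "ssd_backed_store",
        "archival":   "cloud_object_store",
    }

    def map_class(data_class_label, default="general_purpose_store"):
        """Return the logical storage entity associated with a data class."""
        return STATIC_CLASS_MAP.get(data_class_label, default)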
An optional data mover may operate on the results of the mapping function to enact movement of classified data from one storage entity to another. In particular, the data mover may systematically perform subsequent analytics on previously classified data. If a policy violation is found, or some other indication is evident that a different mapping of logical storage entity(ies) would be a better match, then the mapping for that data is changed by moving the data.
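Such a remediation pass might be sketched as follows, assuming hypothetical classify, map_class, and move interfaces:

    # Hypothetical data-mover pass: re-run analytics on previously classified
    # data and migrate any object whose class no longer maps to the entity in
    # which the object currently resides.
    def remediation_pass(objects, classify, map_class, move):
        for obj in objects:
            new_class = classify(obj)           # subsequent analytics
            desired = map_class(new_class)      # entity the class should map to
            if desired != obj.current_entity:   # mismatch, i.e., policy violation
                move(obj, desired)              # enact the new mapping
                obj.current_entity = desired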
An optional provisioner provides a way to create diverse storage entities, custom-tailored to each data class determined by the classifier. Provisioned storage entities could then be used by the mapper and/or mover.
The data-aware provisioner can define the properties of storage entities (such as storage sub-systems or storage services), which it then uses to assemble custom logical storage entities suitable for the data classes discovered by the classifier. Such policy elements might include a variety of cache intake policies, a variety of cache eviction policies, a variety of RAID policies, a variety of priority policies, a variety of protection policies, a variety of compliance policies, and so on.
Accordingly, the provisioner operates under a novel definition of provisioning: to provision a storage service not merely to allocate resources, but to dynamically instantiate a selected set of policies into a logical storage entity that is appropriate for an observed data class. Such instantiation may encapsulate the allocation of storage resources or other aspects of conventional storage provisioning.
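Under this definition, provisioning might be sketched as the instantiation of a selected policy set into a logical storage entity object; the names below are hypothetical:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LogicalStorageEntity:
        """A logical storage entity is an instantiated set of policies."""
        name: str
        policies: frozenset

    def provision(name, policies):
        # Instantiating the selected policies is itself the provisioning step;
        # any allocation of underlying storage resources is encapsulated here.
        return LogicalStorageEntity(name=name, policies=frozenset(policies))

    entity = provision("dictation_store",
                       {"raid_triple_parity", "object_encryption", "no_compression"})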
The description below refers to the accompanying drawings.
A particular embodiment includes an intelligence module and a mapper. The intelligence module has access to one or more data storage entities that ingest data in the form of data objects and/or data streams. The intelligence module collects real-time intelligence by performing analytics on the data, extracting characteristics such as its sensitivity, importance, activity history, and the like. The extracted analytics are then used to classify the data. Once classified, the mapper then uses one or more policies to assign an appropriate logical storage entity to subsequently handle the data.
The storage system(s) 300 may provide information concerning the attributes of the various logical storage entities 310 (such as latency, sequential vs. random read/write performance, fragmentation, queue depth, cache size), as well as functional characteristics (backup frequency, replication, mean time between failure metrics, on-premises or in-cloud, etc.) to be used by the intelligence node 200 as described in more detail below. In some implementations, the different storage systems may have different purposes or attributes. For example, if storage system 300-1 is performing as a primary data store, and storage system 300-2 performs replication 302 of the primary in storage system 300-1, storage system 300-2 may have different attributes and different types of logical storage entities 310-3 than the logical storage entities 310-1, 310-2 in storage system 300-1. In some implementations, the attributes of the logical storage entities 310 and storage systems 300 may be measured or otherwise inferred by the intelligence node.
The intelligence node 200 may be maintained and managed separately from the elements of the storage system(s) 300. That is, the logical storage entities 310 may be provided as stand-alone, on-premises storage sub-systems, storage networks, storage appliances, and the like, or may be provided as remote storage services such as public cloud (for example, Amazon S3 or Rackspace) or private cloud storage services. However, in other embodiments, the intelligence node and storage system may be provided as a single integrated system that merges primary data storage, data protection, and intelligence functions as described in the co-pending U.S. patent application Ser. No. 14/499,886 referenced above. In such an integrated system, intelligence 200 is provided through in-line data analytics, and data intelligence and analytics are gathered on protected data and prior analytics, and stored in discovery points, all without impacting performance of the primary storage.
The intelligence module 200 maintains a change catalog 210, discovery points 220, and other information for data that it is responsible for. The intelligence module also includes additional functions such as analytics 260, a classifier 230, a mapper 240, and an optional provisioner 270, described in more detail below.
Regardless of how the logical storage entities 310 are provided, the notion is that the intelligence node 200 connects to each logical storage entity 310 to perform analysis 260 on the data stored therein. In some implementations, a client 400 may access the data to provide information (indirectly) about how often data is accessed, by whom, and so forth. Arrow 510 indicates that the intelligence node 200 extracts information about content from the data objects stored in the logical storage entities; the thinner arrow 520 indicates client 400 access of the objects, and may include explicit information about object usage. The dotted arrow 530 from client 400 to intelligence node 200 represents another embodiment wherein the client 400 sends access information directly to the intelligence node 200.
The intelligence node 200 may also capture data snapshots in the form of discovery points 220, and extract intelligence from the data in the logical storage entities 310 or the discovery points 220. The storage system may also collect conventional input/output (I/O) characteristics as well as proprietary operational intelligence.
As used in this document, a data object is defined as any collection of data: a file, a directory, a set of files, a block, a range, an object, and so on. A data stream is any collection of events involving data objects; for example, a stream may be identified with a session (e.g., file open/close), a connection, a host, a user, an application, a file, a target, and so on.
The classifier 230 collects and analyzes whatever sorts of intelligence are presented by the analytics 260. The classifier 230 may include a plurality of classifier subsystems, each subsystem responsible for interpreting the available analytics in a particular fashion, such as a subsystem that determines “importance” (as more fully explained below). For any given data object or data stream, the classifier 230 may output at least one labeled tuple of bounded integers, where the label identifies a class for the data object or data stream, and the tuple encodes the findings of the analytics 260. A subsystem that determines importance might output a single tuple containing a single integer representing importance normalized to a range of 0 to 100, and this tuple would be labeled “importance.” These labeled tuples will be called attributes. A data class in this embodiment is then defined as an unordered set of attributes.
So defined, data classes can stand in set-theoretic relations to each other, where a superset is a subclass, a subset is a superclass, a union is a multiclass, and so on, along with the usual transitivity and associativity properties of such relations. Richer relations are possible by selecting or filtering on attribute labels. A practitioner of ordinary skill in the art will recognize that this definition facilitates a modest form of automated reasoning in the classifier 230, and will also recognize when to prefer a more expressive classification formalism, such as may be required by more powerful reasoning engines.
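By way of a minimal sketch, the labeled tuples and their set-theoretic relations might be represented as follows; all labels and values are hypothetical:

    # An attribute is a labeled tuple of bounded integers; a data class is an
    # unordered set of attributes. All labels and values here are hypothetical.
    importance = ("importance", (90,))    # normalized to the range 0-100
    schedule   = ("schedule",   (2,))     # e.g., accesses per week

    audio_class    = frozenset({("type", (3,)), importance, schedule})
    important_data = frozenset({importance})

    # A superset of attributes is a subclass (more specific), so a subclass
    # test is simply a superset test; a union of classes is a multiclass.
    assert audio_class >= important_data   # audio_class is a subclass
    multiclass = audio_class | important_data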
The classifier 230 may include, but is not limited to, subsystems that output attributes relating to importance, content types (for example, “contains personally identifiable information”), access schedules, access locality, and so forth. These will be described in greater detail below.
A classification strategy implemented by the analytics 260 and classifier 230 could be as simple as a static list of categories (i.e., the classifier's job is to decide which of several buckets each object belongs to), or it may be as complex as an entity-relationship model or a formal ontology (i.e., an ontology language like OWL is used to provide a set of nouns/nodes and verbs/edges, and the classifier's job is to summarize each object in terms of these descriptors).
Accordingly, an example classification strategy may be selected, under the following constraints:
For example, a static classifier trivially satisfies these constraints, whereas a textual classifier that measures grammatical correctness would generally not be considered to have a well-defined mapping to any variety of storage services. The illustrative embodiment will disclose an example classification strategy.
The mapper 240 maps or associates a data class to a logical storage entity, which may be recursively represented as a tuple wherein each element is either a policy or a tuple. Policies 570 may include one or more protection policies, security policies, encryption policies, intelligence-related policies, caching policies, prefetching policies, and so on. Such a tuple may be construed as defining a data path or data flow through the components of the system, but this construal is limiting and the concept is more general: it may be any set of available policies that can be assembled, or can self-assemble, to map data objects to a logical storage entity. The mapper works on a “best fit” basis if it cannot find a logical storage entity that is an exact match to a policy.
A data class, being described as a set of attributes, is thus mapped to one or more logical storage entities 310 by piecewise mapping each attribute to one or more policies associated with that data class. The policies 570 that specify attribute mappings may be statically defined, obtained from a user via a graphical user interface, or dynamically defined.
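Such piecewise, best-fit mapping might be sketched as follows, reusing the hypothetical attribute representation and provision() helper from the earlier sketches (the importance threshold of 75 anticipates the simple example below):

    # Hypothetical piecewise rules mapping attribute labels to desired policies.
    ATTRIBUTE_POLICIES = {
        "importance": lambda v: {"raid_triple_parity"} if v >= 75 else set(),
        "sequential": lambda v: {"demand_prefetch"},
        "pii":        lambda v: {"object_encryption", "drive_encryption"},
    }

    def desired_policies(data_class):
        wanted = set()
        for label, values in data_class:
            rule = ATTRIBUTE_POLICIES.get(label)
            if rule:
                wanted |= rule(values[0])
        return wanted

    def best_fit(data_class, entities):
        """If no entity matches exactly, pick the one satisfying the most policies."""
        wanted = desired_policies(data_class)
        return max(entities, key=lambda e: len(wanted & e.policies))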
Once data objects 550-1, 550-2, 550-3 in the logical storage subsystem(s) 310 have had their content analyzed by the analytics 260, the mapper 240 is then used to implement the mapping from data objects 550 (which have their own properties) onto logical storage 310. Multiple logical storage entities 310 may satisfy the data-aware mapping constraints; the intent is to discover which have their constraints violated (or requirements left unmet).
It may also be the case that the storage management system does not have control over where the data is initially placed, such that data is initially found in the wrong type of storage entity. In that case, the data mover 280 may also be used to ensure that a class is determined for the data, and an appropriate storage entity mapped.
Once the mapper 240 (and/or data mover 280) finds policy violations, it may suggest or even automatically implement repair actions that can be undertaken, or interact directly with the storage system 300 to enact repair. Schedule-driven prefetching of data, such as warming caches by pre-reading data, is another possible repair action.
In one simple example, a policy may specify that an importance attribute in the range of 75-100 is preferentially mapped to high-reliability logical storage. Even this relatively simple mapping scheme can express a rich concept: “a high-importance low-performance text document is preferably mapped to a storage entity comprising on-premises HDDs (physical storage policy), triple parity RAID (high-reliability low-performance policy), and a demand prefetch cache (text document policy).”
The mapper 240 can also express negative assertions. In one example, a “don't cache streaming reads” policy 570 may specify that data objects belonging to a “streaming read” superclass are simply mapped to a storage entity 310 that has no cache.
Some other examples of policies might include:
As mentioned above, it is also possible to operate on tuples or sets of policies. For example, an audio file that is owned by the CFO and that mentions a client's street address and birthday could be analyzed to have a set of attributes including “PII”, “important”, “owned by Finance”, and “streamed”.
Policies may also specify logical storage choices based on their physical characteristics. For example, a certain logical storage entity 310 may include a replicated HDD RAID array or an all-SSD datastore, and the mapper could choose to assign the file to the HDD array because it meets more of the requirements in the policies.
From the discussion above, it is understood that the system both (a) maps a set of policies to a logical storage entity, and (b) maps a data class to storage entities.
If a precise logical storage entity 310 is not already available to the system 100, the optional provisioner 270 may be invoked to create one.
The classifier 230 subsystems are now described in further detail.
To further elaborate on the above description, the policies may include:
Additional examples demonstrate how analytics-derived data classification is used to select a logical storage entity for handling data.
In one specific example, the classifier 230 may encounter a medical dictation audio file and assign a data class comprising these attributes: MP3 audio file, HIPAA-level security, importance level 90, sequential access, two accesses per week. This class would be mapped to a logical storage entity as a result of mapping these policies: copy-on-write file store (mapped from MP3), no compression (mapped from MP3), object encryption (mapped from HIPAA), drive encryption (mapped from HIPAA), RAID with triple parity (mapped from importance), demand prefetch (mapped from sequential streaming). If this exact logical storage entity is not available in the system 100, the mapper finds the closest match.
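This worked example might be rendered as a simple attribute-to-policy table; the identifiers are hypothetical:

    # The medical-dictation class mapped piecewise onto policies, per the text.
    DICTATION_ATTRIBUTE_MAP = {
        "mp3_audio":             {"copy_on_write_store", "no_compression"},
        "hipaa_security":        {"object_encryption", "drive_encryption"},
        "importance_90":         {"raid_triple_parity"},
        "sequential_2_per_week": {"demand_prefetch"},
    }

    # The target logical storage entity is the union of all mapped policies;
    # if no exact match exists, best_fit() above selects the closest entity.
    target_policies = set().union(*DICTATION_ATTRIBUTE_MAP.values())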
In another example, the classifier 230 encounters the medical dictation file again as part of scheduled or some other subsequent analysis of data that has previously been classified. However, based on new information from the intelligence module 200, the classifier 230 removes the attribute “two accesses per week” from the object's class and assigns the attribute “dormant.” The assignment of this new class signals a mismatch between the object's class and the logical storage entity in which it currently resides; the object's new class now maps to an entirely different storage entity comprising: cloud file storage, object encryption, no compression. The data mover 280 thus periodically attempts to move these now mismatched objects to their appropriate storage entity. In some instances, the data mover may invoke the provisioner 270 for provisioning a new storage entity if indicated and if available.
In another example, the classifier 230 determines that groups of files are always accessed together, which become identified as locality groups. The “locality prefetch” policy thus functions as a content-aware prefetch service. So a storage entity 310 can be mapped such that every time someone accesses file #1, the system will also prefetch files #2 through #50. This is possible because accessing any object also fetches that object's data class; if the object belongs to a locality group, its class will contain a locality attribute that contains the locality group ID. The function of the “locality prefetch” policy is to extract the locality group ID from any object access, ask the classifier 230 for the full list of objects in that locality group, and then prefetch all of those objects.
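The locality-prefetch behavior might be sketched as follows, assuming hypothetical class_of, objects_in_group, and prefetch interfaces:

    # Hypothetical locality-prefetch policy: any access to an object whose class
    # carries a locality-group attribute triggers a prefetch of the whole group.
    def on_access(obj, classifier, storage):
        for label, values in classifier.class_of(obj):
            if label == "locality_group":
                group_id = values[0]
                for member in classifier.objects_in_group(group_id):
                    storage.prefetch(member)   # e.g., files #2 through #50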
Each logical storage entity may also be defined by a set of one or more policies (physical, caching, intelligence, or other policies). The storage service definitions may be associated with a user or process that provides the data as input—thus optimizing the assignment of a storage class based on user identity. In one example, if the classifier 230 knows from past history that a particular user is data intelligence intensive, and that user typically needs access to the data immediately, the provisioner can also assign an appropriate cache policy.
In another example, the policies may be defined for two types of caches (say, Least Recently Used and Most Recently Used). In this example, the desire to forward data objects or data streams to one or the other cache type may be accomplished by mapping logical storage entities with one or the other cache policy, and storing each object in its desired storage service. The mapping of each object's class to its desired storage may be based on some other policy or group of policies, or on other attributes derivable from the data object or data stream.
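The two-cache mapping might be sketched as follows, reusing the hypothetical provision() helper; the eviction labels and the scan heuristic are illustrative assumptions:

    # Two hypothetical entities differing only in cache eviction policy.
    lru_store = provision("lru_store", {"cache_evict_lru"})
    mru_store = provision("mru_store", {"cache_evict_mru"})

    def choose_cache_entity(data_class):
        # e.g., large sequential scans often favor MRU eviction, since the most
        # recently read blocks are the least likely to be re-read soon.
        labels = {label for label, _ in data_class}
        return mru_store if "sequential_scan" in labels else lru_store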
A further example could provide optimized storage based on file type. In one example, the system may be asked to ingest a large number of video files with a high streaming rate. For a video file with a high streaming rate, a logical storage entity with a relatively large RAID chunk size may be desirable for initial ingest. While this is extremely good for video streaming, it may be quite bad for many other things, such as frame-by-frame image pattern recognition processing. The system can thus dynamically adjust, mapping to the larger RAID chunk size for initial ingest, and mapping to a smaller RAID chunk size for later intelligence processing. In another example with video files, the system may determine that it is better to store video files on small, slow, legacy HDDs rather than faster HDDs or SSDs.
Another example can optimize for backup processing. The system might default to keeping backups for three (3) weeks. But based on intelligence data, it appears that many users are actually reviewing files in the backup even on the very day they are deleted. Upon detecting this pattern, the system may decide to modify an object's data class to signify a backup retention of six (6) weeks instead of three. Thus a storage class having a particular backup (retention) policy may be mapped to a certain logical storage entity 310 based on actual activity. The object's new data class maps to a logical storage entity that is completely identical to the object's old storage entity except for a new backup policy. Since the new storage entity does not physically differ from the prior one, the new storage entity preferably reuses the underlying resources already allocated, and the object preferably remains without undergoing a physical migration. More generally, a system that has many hundreds of available logical storage entities does not necessarily need to have many hundreds of, say, physical filesystem instances.
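The retention remap might be sketched as follows under the same hypothetical helpers; only the retention policy changes, so no physical migration is triggered:

    # Hypothetical retention remap: swap the backup-retention policy and reuse
    # the underlying resources, leaving the object's data in place.
    def remap_retention(obj, old_entity, weeks=6):
        policies = {p for p in old_entity.policies if not p.startswith("retain_")}
        policies.add(f"retain_{weeks}_weeks")
        new_entity = provision(old_entity.name, policies)
        # The new entity is physically identical except for retention, so the
        # object is reassigned without undergoing a physical migration.
        obj.current_entity = new_entity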
In another example, intelligence may determine that the data object is a large file that should be prefetched, but that only the initial part is accessed sequentially by the user, and other parts randomly. For example, the file may be a large text file for which only the first 10 pages ever get read on a regular basis. Thus the storage class appropriate for this file may define that the initial parts of the file are stored on a logical storage entity optimized for sequential access, and other parts on a logical storage entity optimized for random access. To be more specific, when data extraction is performed in the intelligence phase, the system can index the file to determine where the section titles are, and then create a policy such that prefetch is only performed up to byte number “x”. In yet another example, a database having content related to personal data for a large number of persons may be defined by database keys. If the application is using the database for facial recognition pattern matching, the policy may specify prefetching only the records for persons whose first name is “Bob” when “we are only looking for someone named Bob.”
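The byte-limited prefetch might be sketched as follows, assuming a hypothetical prefetch_range interface:

    # Hypothetical partial-prefetch policy: prefetch only the sequentially read
    # head of the file, up to a byte offset determined in the intelligence phase.
    def make_head_prefetch_policy(prefetch_limit_bytes):
        def on_open(obj, storage):
            storage.prefetch_range(obj, start=0, end=prefetch_limit_bytes)
        return on_open

    # e.g., index the file for section titles and prefetch only the first
    # 10 pages' worth of bytes on every open.
    first_pages_policy = make_head_prefetch_policy(prefetch_limit_bytes=40_000)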
It should be understood that the storage services, when provisioned by the provisioner 270, can be provisioned in advance, or “on the fly” in response to real-time intelligence.
This application claims priority to co-pending U.S. Patent Application Ser. No. 62/305,011 filed Mar. 8, 2016 entitled “Active Data-Aware Provisioning Technology”, and co-pending U.S. Patent Application Ser. No. 62/340,219 filed May 23, 2016 entitled “Active Data-Aware Provisioning Technology”, and is related to co-pending U.S. patent application Ser. No. 14/499,886 filed Sep. 29, 2014 entitled “System and Method of Data Intelligent Storage”, the entire contents of each of which are hereby incorporated by reference.