A portion of the disclosure of this patent document may contain command formats and other computer language listings, all of which are subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This disclosure relates generally to big data storage analytics.
Businesses and other enterprises generate large amounts of information, which must be stored in a cost-effective manner while ensuring acceptable levels of availability, security, and accessibility. Currently, stored information is managed through a set of manual, automatic, or semi-automatic policies, procedures, and practices, which are applied in a variety of ways to a variety of data and data storage systems.
A method, a computer program product, and a system for analyzing heterogeneous storage system data, the method comprising receiving metadata from storage systems; analyzing the metadata; and based on the analyzed metadata, providing recommendations to the storage systems.
Objects, features, and advantages of embodiments disclosed herein may be better understood by referring to the following description in conjunction with the accompanying drawings. The drawings are not meant to limit the scope of the claims included herewith. For clarity, not every element may be labeled in every figure. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments, principles, and concepts. Thus, features and advantages of the present disclosure will become more apparent from the following detailed description of exemplary embodiments thereof taken in conjunction with the accompanying drawings in which:
Traditionally, workload data is captured from a data storage system for different reasons. Typically, one reason workload data is captured from a data storage system is for troubleshooting analysis. Generally, workload data is also captured from a data storage system for performance-related issues. Typically, a problem encountered in trace processing and analysis is that a huge amount of information may be contained in a captured trace. Typically, a trace size for several minutes of data collection may reach hundreds of megabytes. Generally, because of size constraints, an analysis program may not be able to hold all relevant data in computer memory.
Traditionally, the size of a trace file depends on the events being traced, the number of Inputs/Outputs (IOs) traced, and the trace duration. Generally, accessed data in the form of a trace file may be made ready for analysis. Usually, a trace file may contain information about IO activity, also referred to as workload data, on the data storage systems from which the trace file was accessed.
In many embodiments, exemplary data storage systems for which workload data may be captured and analyzed may be Symmetrix® Integrated Cache Disk Arrays, Celerra® Data Access Real Time (DART) file system, and ViPR® software-defined storage system, all available from EMC Corporation of Hopkinton, Mass. In other embodiments, the techniques of the current disclosure may be applied to data storage systems in general.
In many embodiments, there may be an analytics platform that may receive information of input/output (IO) activity produced in different data storage systems. In some embodiments, the information received by an analytics platform may relate to load characteristics of the different storage systems. In certain embodiments, there may be a software framework that stores, processes, and analyzes big data. In other embodiments, there may be an analytics platform that may process and analyze information received from different storage systems and stored on a software framework. In many embodiments, there may be an analytics platform that may analyze trace data. In certain embodiments, there may be an analytics platform that may provide recommendations to different data storage systems to change their configurations based on a load prediction ascertained from analyzed information received from different storage systems. In some embodiments, there may be an analytics platform that may provide recommendations to an end user of a storage system to change the configuration of that storage system based on a load prediction ascertained from analyzed information received from different storage systems. In other embodiments, an analytics platform may provide recommendations to a storage system to move data from one storage array to another storage array. In certain embodiments, an analytics platform may provide recommendations to a storage system recommending a more efficient storage system based on the current storage system's usage of data. In many embodiments, an analytics platform may provide recommendations to a storage system recommending a more efficient storage system based on a prediction of the storage system's future load characteristics.
In many embodiments, the overall system may have the following components: a software agent, a software framework, and an analytics platform. In some embodiments, a software agent may activate automatically. In certain embodiments, a software agent may be in a wait status on a storage system. In many embodiments, a software agent in a wait status may be awaiting recommendations from an analytics platform. In other embodiments, a software agent may be in a run status on a storage system. In certain embodiments, a software agent in a run status may be executing recommendations received from an analytics platform. In many embodiments, big data may be referred to as a high volume of information. In many embodiments, a software agent may receive a high volume of information from an analytics platform. In some embodiments, a software agent may send metadata from storage systems to an analytics platform. In certain embodiments, a software agent may receive storage system recommendations from an analytics platform to change storage system configurations. In many embodiments, a change in a storage system configuration may include moving data within a storage system to increase performance of the storage system.
In many embodiments, a software agent may collect information relating to load characteristics of a storage system. In some embodiments, a software agent may collect information pertaining to which port an IO was received on. In certain embodiments, a software agent may collect information pertaining to which logical unit received an IO, the offset of the IO, and the length of the IO. In other embodiments, a software agent may collect information pertaining to the time the IO began and ended. In many embodiments, a software agent may collect information pertaining to the current placement of data in a storage system.
In many embodiments, a software agent may collect information in a file system pertaining to which file received an IO. In some embodiments, a software agent may collect information in a file system pertaining to which file was opened for IO.
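By way of a non-limiting illustration, the fields described above (port, logical unit, offset, length, start and end times, and, for file-system traces, the file involved) may be pictured as a single record type. The field names below are hypothetical:

```python
# Illustrative record type for the metadata a software agent might collect
# per IO. Field names are hypothetical; actual agents vary from storage
# system to storage system.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IORecord:
    port: int              # front-end port on which the IO was received
    logical_unit: int      # logical unit that received the IO
    offset: int            # starting offset of the IO, in blocks
    length: int            # length of the IO, in blocks
    start_time: float      # time the IO began (seconds since epoch)
    end_time: float        # time the IO ended
    file_path: Optional[str] = None  # file involved, for file-system traces

    @property
    def latency(self) -> float:
        """Elapsed service time for this IO."""
        return self.end_time - self.start_time
```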
In certain embodiments, a software agent may collect information varying from storage system to storage system. In some embodiments, a software agent may send information chronologically to an analytics platform. In many embodiments, a software agent may periodically send information to an analytics platform.
In many embodiments, a software agent may send information through various protocols. In some embodiments, a software agent may send information to an analytics platform using Fibre Channel (FC) protocol.
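By way of a non-limiting illustration, a software agent may be sketched as a small state machine: it buffers collected IO metadata, periodically flushes it upstream in chronological order, and toggles between a wait status and a run status while executing recommendations. The class and method names are hypothetical, and the transport (e.g., an FC link) is abstracted as a plain callable:

```python
# Hypothetical agent sketch: collect IO metadata, send it to the analytics
# platform in chronological batches, and track a wait/run status while
# recommendations are pending or being executed.
from enum import Enum

class AgentStatus(Enum):
    WAIT = "wait"   # awaiting recommendations from the analytics platform
    RUN = "run"     # executing recommendations received from the platform

class SoftwareAgent:
    def __init__(self, send):
        self.send = send            # callable that ships a batch upstream
        self.status = AgentStatus.WAIT
        self._buffer = []

    def record_io(self, timestamp, metadata):
        self._buffer.append((timestamp, metadata))

    def flush(self):
        """Periodically called: send buffered records in chronological order."""
        self._buffer.sort(key=lambda rec: rec[0])
        self.send(self._buffer)
        self._buffer = []

    def apply_recommendations(self, recommendations):
        self.status = AgentStatus.RUN
        for rec in recommendations:
            rec()                   # e.g. move data within the storage system
        self.status = AgentStatus.WAIT
```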
In many embodiments, a software framework may be Hadoop®. In other embodiments, a software framework may store and process big data. In some embodiments, a software framework may distribute data and associated computing (e.g., execution of application tasks). In certain embodiments, a software framework may provide the ability to reliably store huge amounts of data.
In many embodiments, an analytics platform may receive information from different storage systems. In some embodiments, an analytics platform may preprocess data received from different storage systems. In certain embodiments, an analytics platform may store information, or data, to a software framework. In many embodiments, an analytics platform may be built on different big data platforms to run big data analysis.
In many embodiments, an analytics platform may allow the running of pluggable applications [e.g., storage planner, fully automated storage tiering (FAST)] using data in a software framework utilizing an application program interface (API). In certain embodiments, an analytics platform may enable running of pluggable applications using data in cached information utilizing an API. In some embodiments, an API may be MapReduce (MR) jobs that may run on top of a software framework. In other embodiments, an API may be Python/CPP libraries that may enable access to information in a software framework cluster. In some embodiments, an API may be Python/CPP libraries that may enable access to information in a platform cache (near real-time data). In some embodiments, an analytics platform may provide storage systems with information resulting from using data in a software platform and cached information utilizing an API.
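By way of a non-limiting illustration, the shape of an MR job over trace data may be imitated in-process: a map step emits (key, value) pairs and a reduce step folds the values per key. A real MR job would run distributed on the software framework; the example below only demonstrates the programming model, with made-up record contents:

```python
# Minimal in-process imitation of an MR job over trace records: a map step
# emits (key, value) pairs and a reduce step folds values per key.
from collections import defaultdict

def map_io(record):
    """record: (lun, length_in_blocks); emit blocks transferred keyed by LUN."""
    lun, length = record
    yield lun, length

def reduce_blocks(key, values):
    """Total blocks transferred for one logical unit."""
    return key, sum(values)

def run_job(records, mapper, reducer):
    """Shuffle mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())
```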
In certain embodiments, an analytics platform may develop a prediction algorithm relating to load characteristics of different storage systems. In many embodiments, having access to a storage system's history may allow running of prediction algorithms. In other embodiments, an analytics platform may develop one or more predictive algorithms that determine where data should be placed or where data will be placed. In some embodiments, an analytics platform may periodically analyze data received from storage systems. In many embodiments, an analytics platform may make decisions based on data received from storage systems. In other embodiments, an analytics platform may send metadata about a decision made by an analytics platform to a storage system. In other embodiments, an analytics platform may send data about a decision made by an analytics platform to a storage system.
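By way of a non-limiting illustration, having a storage system's history allows even a simple prediction algorithm to be run: the sketch below fits a least-squares trend line to per-interval IO counts and extrapolates one interval ahead. A real platform might use far richer models; this only shows how history enables prediction:

```python
# Illustrative load predictor: fit a least-squares line to a history of
# per-interval IO counts and extrapolate one interval ahead.
def predict_next_load(history):
    """history: list of IOs-per-interval samples; returns forecast for next."""
    n = len(history)
    if n < 2:
        return float(history[-1]) if history else 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept + slope * n   # extrapolate to the next interval
```

A decision layer could compare such a forecast against a capacity threshold and, when the predicted load exceeds it, emit a recommendation (or its metadata) back to the storage system.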
In many embodiments, an analytics platform may utilize a FAST algorithm. In certain embodiments, an analytics platform may make a recommendation as to where to move data within a storage system to increase performance of the storage system based on the FAST algorithm.
In many embodiments, an analytics platform may enable auto-tiering of information between different storage systems. In certain embodiments, an analytics platform may enable control of auto-tiering between different storage systems. In other embodiments, auto-tiering of storage systems may be based on an examination of metadata of IOs' activities over long periods of time. In certain embodiments, auto-tiering may move stored data based on predicted future behavior of an IO by tracking IO patterns in storage systems. In some embodiments, an analytics platform may control the auto-tiering of data between different storage systems. In certain embodiments, auto-tiering may utilize an advanced or complex big data analysis methodology. In other embodiments, auto-tiering may utilize machine learning techniques. In other embodiments, auto-tiering control may provide efficient data placement across different storage arrays. In many embodiments, an analytics platform may impact auto-tiering of different storage systems through hinting. In some embodiments, hinting may impact auto-tiering by sending hints to a storage system indicating on which storage system data should be placed and on which tier data should be placed within a given storage system. In certain embodiments, an analytics platform may create smart hinting. In many embodiments, smart hinting may diagnose a storage system problem and provide recommendations on how storage system configurations may be changed to improve storage system performance. In other embodiments, an analytics platform may enable cache hinting to improve cache performance. In certain embodiments, improved cache performance may include plugging a flash card into a specific host in the system to enable more IOs to be stored on the read cache of the host system.
In many embodiments, an analytics platform may create algorithms that may be used for auto-tiering based on an IO pattern analysis in different storage systems. In some embodiments, an analytics platform may create algorithms that may be used to control tiering in different storage systems. In certain embodiments, an analytics platform may create algorithms to create hints for storage system tiering. In other embodiments, algorithms may be created for predicting periodic IO activity behavior that may enable auto-tiering. In many embodiments, an analytics platform may create algorithms that may predict hot spots in a cache, which may enable better cache hits. In many embodiments, an analytics platform may create algorithms for defragmentation of file systems based on expected IO patterns, while not performing defragmentation on locations that are not expected to be accessed.
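By way of a non-limiting illustration, a tiering-hint algorithm may rank logical units by access frequency over a window and hint that the hottest fraction be placed on a faster tier. The tier names and the 20% threshold below are made up for illustration:

```python
# Hypothetical hinting sketch: count accesses per logical unit over a window
# and emit hints placing the hottest units on a faster tier.
from collections import Counter

def tiering_hints(accesses, hot_fraction=0.2):
    """accesses: iterable of LUN ids touched; returns {lun: suggested_tier}."""
    counts = Counter(accesses)
    ranked = [lun for lun, _ in counts.most_common()]  # hottest first
    n_hot = max(1, int(len(ranked) * hot_fraction))
    return {lun: ("flash" if i < n_hot else "sata")
            for i, lun in enumerate(ranked)}
```

Such hints could be shipped to a software agent in a wait status, which would then execute the placement changes in a run status.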
In many embodiments, an analytics platform may enable large-scale impact prediction. In some embodiments, large-scale impact prediction may enable access to storage system information to be logged. In other embodiments, large-scale impact prediction may enable IO to be located within a storage system with exactness. In certain embodiments, large-scale impact prediction may enable logged access information to be used to predict the impact of adding more cache to a storage system. In many embodiments, addition of more cache to a storage system may increase the performance of a SATA solid-state drive (SSD) cache on storage systems.
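By way of a non-limiting illustration, logged access information may be replayed through a cache model at two capacities to estimate the benefit of adding cache before any hardware is purchased. The sketch below uses a simple LRU model; the access stream and capacities are toy examples:

```python
# Illustrative impact prediction: replay a logged access stream through an
# LRU model at two cache sizes and compare hit rates, estimating the
# benefit of adding more cache.
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache of the given capacity."""
    cache, hits = OrderedDict(), 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return hits / len(accesses)

def added_cache_benefit(accesses, current, proposed):
    """Predicted hit-rate improvement from growing the cache."""
    return lru_hit_rate(accesses, proposed) - lru_hit_rate(accesses, current)
```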
In other embodiments, an analytics platform may utilize a product advisor. In other embodiments, a product advisor may analyze storage system information. In some embodiments, a product advisor may perform a storage system examination. In certain embodiments, a product advisor may collect information during storage system examination to be used to advise a storage system about expected data growth that may result in processing delays. In many embodiments, a product advisor may suggest an alternative storage system equipped to handle an increase in data growth experienced by a storage system. In certain embodiments, a product advisor may utilize a product simulator to simulate data flow on alternative storage systems and examine the efficiency of alternative storage systems.
In certain embodiments, an analytics platform may enable federation of storage systems. In some embodiments, federation of storage systems may enable an analysis of data obtained from various storage systems by a single analytics platform. In many embodiments, federation of storage systems may determine which logical unit may be moved between two storage arrays due to increased usage to improve performance of the logical unit.
In many embodiments, an analytics platform may enable IO replay. In some embodiments, IO replay may enable storage system usage information pertaining to logical units (LUNs) to be replayed. In certain embodiments, an analytics platform may collect traces continuously to record IO patterns within a storage system. In other embodiments, collection of traces may enable replay of IO patterns that occurred before a storage system crashed.
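By way of a non-limiting illustration, a replay routine may re-issue recorded IOs in timestamp order, optionally reproducing the original inter-arrival gaps. The `issue` callback below is a hypothetical stand-in for actually submitting IO to a storage system:

```python
# Hypothetical replay sketch: re-issue recorded IOs in timestamp order,
# optionally preserving the original inter-arrival gaps.
import time

def replay(trace, issue, preserve_timing=False):
    """trace: list of (timestamp, io) pairs; issue: callable taking an io."""
    ordered = sorted(trace, key=lambda rec: rec[0])
    prev_ts = None
    for ts, io in ordered:
        if preserve_timing and prev_ts is not None:
            time.sleep(ts - prev_ts)   # reproduce the recorded gap
        issue(io)
        prev_ts = ts
```

Replaying the IOs recorded just before a crash against a test system could help reproduce, and therefore diagnose, the failure.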
The logic for carrying out the method may be embodied as part of the aforementioned system, which is useful for carrying out a method described with reference to the example embodiments shown herein.
The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the above description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured. Accordingly, the above implementations are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.