Various embodiments of the present disclosure generally relate to monitoring and remediation of the health of information technology (IT) equipment, clusters thereof, and/or services deployed within a private or public cloud, for example, running on virtual machines (VMs) or containers (or pods) managed by a container orchestration platform. In particular, some embodiments relate to an auto-healing feature that monitors events within one or more clusters of nodes each representing a distributed data management storage system and facilitates automated remediation of noncompliance with best practices by identifying corresponding appropriate courses of action.
Data is the lifeblood of every business and must flow seamlessly to enable digital transformation, but companies can extract value from data only as quickly as the underlying infrastructure can manage it. Data centers and the applications they support are becoming increasingly complex. Issues arising in an on-premises or public cloud-based data management storage solution can have an adverse effect on an organization and can cause loss of revenue as a result of downtime. Troubleshooting issues (e.g., deviations from best practices) and fixing them is often time consuming and exhausting, and it distracts users from other business objectives and customer-service-related tasks.
Systems and methods are described for automated remediation of deviations from best practices in the context of a data management storage system. According to one embodiment, after receiving a notification regarding a rule-evaluation trigger event, a determination is made regarding an existence of a deviation from a best practice by a data storage system by: (i) identifying a set of one or more rules associated with the rule-evaluation trigger event, in which the set of one or more rules define one or more conditions that are indicative of a root cause of the deviation; and (ii) evaluating the set of one or more rules with respect to one or more of historical data and a current state of the data storage system. Based on the set of one or more rules, a determination is made regarding whether a remediation associated with the deviation that addresses or mitigates the deviation is available.
Other features of embodiments of the present disclosure will be apparent from the accompanying drawings and from the detailed description that follows.
The present disclosure is best understood from the following detailed description when read with the accompanying figures.
The drawings have not necessarily been drawn to scale. Similarly, some components and/or operations may be separated into different blocks or combined into single blocks for the purposes of discussion of some embodiments of the present technology. Moreover, while the technology is amenable to various modifications and alternate forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the technology to the particular embodiments described or shown. Rather, the technology is intended to cover all modifications, equivalents, and alternatives.
Systems and methods are described for automated remediation of deviations from best practices in the context of a data management storage system. At present, some storage equipment and/or data management storage solution vendors monitor customer clusters using automated support (“ASUP”). ASUP is often used to proactively monitor the health of the storage system and automatically send messages to the vendor, internal support teams, or support partners. These messages can include telemetry data, configuration details, system status, performance metrics, system events, as well as other data that may be useful to proactively detect and avoid potential issues.
In some cases, tens or hundreds of thousands of deployed assets (e.g., storage controllers) of a particular data management storage solution vendor may deliver ASUP telemetry data to the vendor on a regular basis. As a result, the volume of data collected by the ASUP back-end system (e.g., ASUP records including events, configuration details, logs, performance data, counter information, etc.) contains a wealth of system information. Many of these messages may be benign event messages indicative of a healthy system, while a few of the messages may be indicative of a problem. For example, relatively simple call-home messages can create a significant volume of data while mostly indicating the deployed assets are able to communicate with the ASUP back-end system. Moreover, the deployed assets will have different configurations (e.g., hardware and software configurations) running different applications. As such, effectively and efficiently sorting through, identifying, and interpreting the large volume of data reported to the ASUP back-end system presents significant challenges.
Some storage equipment and/or data management storage solution vendors may allow administrative users of customers to log in (e.g., via cloud-based portals) to check for issues associated with their installation and then proceed with manual fixes based on the community wisdom. Unfortunately, each customer has to primarily rely on their own expertise and resources to identify and resolve problems with the storage solutions. While vendor support personnel are typically available to assist, they often have to manually parse through the ASUP data to identify actual operational problems and failures, particularly where a large number of benign event messages are produced.
Even where it is determined that one or more actions should be taken with respect to the storage system, traditional ASUP functionality does not automatically take responsive action. Instead, separate tools have been used to manually initiate change within the storage solutions where an issue has been identified. One historical reason for this is that security and other operational concerns usually mandate that changes to the storage system be tightly controlled. Thus, allowing the ASUP back-end to initiate manipulation of the storage system or components thereof, without being at the direction or under control of the front-end system, is generally not acceptable to the users or operators of the storage system. As a result, the support personnel who may have identified a problem generally need to work with front-end management personnel in order to identify a solution and directly manipulate system changes (e.g., through storage administrator use of a management application).
One drawback with these traditional approaches is time. The time taken to identify and fix an issue may be quite long. The length can depend on a number of factors including, but not limited to, the time between check-ins by the administrative user, the knowledge of the administrative user, the type of issue, the severity of the problem, and the like. With respect to best practices, in particular, the volume and technical nature of documentation (e.g., best practice guides and/or technical reports provided by a vendor of the storage system) may be difficult for customer personnel (e.g., an administrative user of the storage system) to digest. Additionally, the descriptions of particular best practices in their current form may not be easily translated by customer personnel into operational terms, whether for evaluating compliance or non-compliance or for performing appropriate remediation(s).
Various embodiments of the present technology allow for an intelligent data infrastructure that can proactively monitor system data from multiple deployed storage solutions, identify various insights by learning from system data, and provide auto-healing functionality. For example, in one embodiment, rules may be executed by a data management storage solution to identify deviations from best practices. When a deviation is identified, a corresponding remediation may be identified and potentially automatically implemented to bring the configuration or operation of the data management storage solution into compliance with the best practice at issue.
In some embodiments, the received ASUP telemetry data can be added to a multi-petabyte data lake and processed through one or more machine-learning (ML) classification models to perform predictive analytics and arrive at “community wisdom” derived from the vendor's installed user base. Various embodiments described herein seek to provide an insight-based approach to risk detection and remediation including more proactively addressing issues before they turn into more serious problems. For example, by continuously learning from the community wisdom and making it available for use by cognitive computing co-located with a customer's cluster, insights may be extracted from this data to deliver actionable intelligence.
The general idea behind some embodiments is to offer storage consumers insights (e.g., a set of words, phrases, visual cues, or other indicators providing a level of understanding or discernment), guidance, and actions into issues that are affecting their environment rather than an endless list of cryptic error events. Such insights, guidance, and actions (collectively referred to as actionable intelligence) can lead to higher availability, improved security, and simplified storage operations. When derived based at least in part from information (e.g., telemetry data, interactions with support staff, and the like) received from the vendor's consumer base, the actionable intelligence may be referred to as community wisdom. In various embodiments described herein, the actionable intelligence may be operationalized by locally triggering automated evaluation of a set of one or more rules to identify the existence of a particular risk to which a customer's distributed data management storage system (e.g., in the form of a cluster of nodes) is exposed. In response to identifying a particular risk to which the distributed data management storage system is exposed, associated insights, guidance, and actions may be presented via a system manager dashboard as part of an alert to an administrative user that will facilitate maintaining the health and resiliency of the customer's cluster.
By moving rich ML models and/or rule sets (which may individually or collectively be referred to as rules, a rule set, or rule sets) local to customer clusters, the provision of proactive and real-time health analysis, notifications to customers, and automated remediation (auto healing) is facilitated. As described further below, in one embodiment, various rule sets for identifying the existence of various risks to which the customer's cluster may be exposed and various remediation sets, including remediation actions or scripts for mitigating the various identified risks, may be proactively delivered to the customer's cluster by an artificial intelligence for IT operations (AIOps) platform that derives the rule sets and remediation sets based on community wisdom collected from a vendor's consumer base. Some remediations and rules may be generally applicable to all of the products/services of a vendor, whereas other remediations and rules may be more narrowly applicable to only a subset of the products/services of the vendor. In some examples, only those rules and remediations derived from community wisdom that are deemed to be relevant or applicable (e.g., based on the cluster being of a same or similar class and/or type of data storage system as the community wisdom) may be delivered to the customer's cluster. Based on the rule sets received by a customer's cluster, monitoring may be performed to identify a risk to which the customer's cluster is exposed (via inferencing performed by a rule set provided in the form of a rich ML classification model and/or by analysis performed by a rule engine on a rule set provided in the form of conditional logic). After identifying a risk to which the customer's cluster is exposed, a corresponding remediation may be identified that mitigates or addresses the risk.
In some examples, a predefined or configurable set of event management system (EMS) events may be used to trigger a deep analysis (e.g., via a local ML classification model and/or via a rule engine, as the case may be) to identify the existence of a risk to the cluster or a node thereof. When such a risk is determined to exist, an alert may be raised and presented via a system manager dashboard associated with the cluster. Alternatively, or in addition to the triggering of rule evaluation by a rule engine responsive to certain EMS events, some rules may be run (or evaluated) by the rule engine on a periodic schedule. For example, a scheduler/job manager may execute rules on a schedule specified by the rules themselves. In this manner, active risks may be checked on a periodic basis (by re-running an associated rule) to determine if the risk condition still exists or has been resolved. Risks that are known to arise as a result of periodic changes may be good candidates for checking on a periodic schedule.
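For purposes of illustration only, the following Python sketch shows one way schedule-based rule evaluation might be organized; the Rule data structure, run_due_rules function, and field names are hypothetical and do not correspond to any particular rule engine implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Rule:
    """Hypothetical schedule-based rule: evaluate() returns True when the risk exists."""
    name: str
    evaluate: Callable[[Dict], bool]        # receives a snapshot of system state
    interval_seconds: Optional[int] = None  # None => trigger-based only
    last_run: float = field(default=0.0)


def run_due_rules(rules, state, now=None):
    """Evaluate any schedule-based rules whose interval has elapsed."""
    now = now if now is not None else time.time()
    active_risks = []
    for rule in rules:
        if rule.interval_seconds is None:
            continue  # trigger-based rules are driven by EMS events instead
        if now - rule.last_run >= rule.interval_seconds:
            rule.last_run = now
            if rule.evaluate(state):
                active_risks.append(rule.name)
    return active_risks


# Example: re-check volume fullness every 15 minutes.
rules = [Rule("volume-nearly-full",
              evaluate=lambda s: s.get("volume_used_pct", 0) >= 90,
              interval_seconds=900)]
print(run_due_rules(rules, {"volume_used_pct": 95}))  # ['volume-nearly-full']
```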
In one embodiment, auto-heal functionality may be enabled by monitoring a data management storage solution or a data storage system thereof (e.g., the Data ONTAP storage operating system available from NetApp, Inc. of San Jose, CA) for key events via a publisher/subscriber pattern (e.g., a Pub/Sub bus) and signaling an analytic engine when an issue is identified based on an event. Identified issues may be further analyzed using the rich community wisdom and such analysis may be mapped to known rules to facilitate determination of a root cause and a corresponding appropriate course of action. An administrative user of the data management storage solution may then be notified via an EMS of the issue (e.g., a risk, an error, or a failure) and potential corrective action (e.g., a remediation). Alerts may be provided in the form of an EMS stateful event (e.g., an EMS event that contains state information). The state information may include a corrective action identified for the issue at hand. In some embodiments, the state information may include sufficient information for external infrastructure or a cloud-based service (e.g., a third-party cloud-based workflow automation platform, such as ServiceNow or the like, or a cloud-based service of the vendor of the storage system) to remotely initiate performance of remediations. For example, the state information may include information regarding the API (e.g., exposed by the storage system or by the auto-healing service) to call as well as any information needed to make the call, for example, authentication information.
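As a purely illustrative example, the state information carried by such an EMS stateful event might resemble the following Python structure; the field names and endpoint path shown are hypothetical and are not an actual EMS schema or API.

```python
# Hypothetical shape of the state information carried by an EMS stateful event.
alert_event = {
    "event_name": "autoheal.risk.detected",        # illustrative event name
    "severity": "alert",
    "state": {
        "risk_id": "nas-best-practice-0042",       # risk identified by the rule engine
        "summary": "Volume export policy deviates from a best practice",
        "corrective_action": "Apply the recommended export-policy configuration",
        "remediation": {
            # enough information for external infrastructure or a cloud-based
            # service to remotely initiate performance of the remediation
            "api": "/api/private/autoheal/remediations/nas-best-practice-0042/apply",
            "method": "POST",
            "auth": "certificate",                 # e.g., mutual TLS or token-based auth
            "requires_user_approval": True,
        },
    },
}
```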
Depending upon the particular implementation, some issues may be automatically remediated, while others may be proactively brought to the attention of the administrative user and remediated upon receipt of authorization from the administrative user. Preferences relating to the desired type of remediation (e.g., automated vs. user activated) for various types of identified issues arising within the data management storage solution may be configured by the administrative user, learned from historical interactions (e.g., dismissal of similar issues or approving automated application of a remediation for similar issues) with the administrative user, and/or based on community wisdom. For example, the administrative user may select automated remediation for issues/risks known to arise as a result of periodic changes to the environment in which the data management storage solution operates and/or to the configuration of the data management storage solution. Auto-healing data management storage solution nodes and/or the cluster adds customer value by monitoring and fixing (or at least mitigating) issues before they become more serious problems, thereby freeing administrative users from researching and implementing remediations and instead allowing them to spend time on more strategic objectives.
While for purposes of explanation, various specific examples of events (e.g., Network Attached Storage (NAS) events), risks (e.g., deviation from a particular best practice), and corresponding remediations are described herein, it is to be appreciated the methodologies described herein are broadly applicable to other types of events (e.g., storage area network (SAN) events, security issues, performance issues, capacity issues, other best practices, and/or compliance issues). More broadly speaking, and as described further below, the methodologies described herein are applicable to any signaling event (e.g., a manually initiated check, an event management system (EMS) event, expiration of a timer, or the like) that can be associated with a rule that does the analysis, for example, to identify the existence or non-existence of a risk to the storage system, and that, when the risk associated with the rule is determined to exist, further determines an associated corrective action. For example, the described approach can be applied to misconfiguration issues, environmental issues, security issues, performance issues, capacity issues, deviations from best practices, and/or compliance issues. While, for simplicity, in the context of various examples a rule may be said to be caused to be evaluated after a given signaling event, it is to be appreciated certain rules may be grouped together in various combinations and all of such rules in the group or set may be evaluated (e.g., in series or in parallel). For example, at certain predefined time intervals or responsive to other events arising in the storage system, a given set of multiple rules (e.g., relating to capacity fullness of all volumes of the storage system, best practices relating to security objectives for the storage system, such as confidentiality, integrity, and availability, or other related or unrelated predefined or configurable groupings of rules) may all be evaluated.
Various embodiments of the present technology provide for a wide range of technical effects, advantages, and/or improvements to computing systems and components. For example, various embodiments may include one or more of the following technical effects, advantages, and/or improvements: 1) use of non-routine and unconventional operations to facilitate operationalization of community wisdom in the form of rule sets and remediation sets derived therefrom that may be proactively distributed by an artificial intelligence for IT operations (AIOps) platform to a data management storage solution; 2) use of an auto-healing service architecture for coordination of execution of rules and remediations for a data management storage solution; 3) providing an insight-based approach to risk detection and remediation, including more proactively addressing issues before they turn into more serious problems; 4) cross-platform integration of system monitoring capabilities with machine learning and artificial intelligence to automatically monitor and heal (e.g., repair or reallocate) storage solutions in a timely and efficient manner; 5) use of non-routine and unconventional operations and system configurations to analyze the health of storage solutions to improve the speed of the diagnosis and resolution of customer issues; 6) provide an integrated monitoring platform that uses non-routine and unconventional techniques to both reactively and proactively detect, correct, and/or avoid potential issues with storage solutions; 7) use of a distributed architecture with local cognitive computing co-located with customer storage using non-routine and unconventional operations to analyze data and submit issues and solutions to global ASUP platform for additional analysis and integration; and 8) facilitating a more intelligent data infrastructure by continuously learning from community wisdom and making rules and remediations derived therefrom available for use by cognitive computing co-located with a customer's storage cluster to facilitate auto remediation (auto-healing) functionality.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, to one skilled in the art that embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
Brief definitions of terms used throughout this application are given below.
A “computer” or “computer system” may be one or more physical computers, virtual computers, or computing devices. As an example, a computer may be one or more server computers, cloud-based computers, cloud-based cluster of computers, virtual machine instances or virtual machine computing elements such as virtual processors, storage and memory, data centers, storage devices, desktop computers, laptop computers, mobile devices, or any other special-purpose computing devices. Any reference to “a computer” or “a computer system” herein may mean one or more computers, unless expressly stated otherwise.
The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed therebetween, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.
If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.
As used in the description herein, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The phrases “in an embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present disclosure and may be included in more than one embodiment of the present disclosure. Importantly, such phrases do not necessarily refer to the same embodiment.
As used herein, the term “storage operating system” generally refers to computer-executable code operable on a computer to perform a storage function that manages data access and may, in the case of a storage system (e.g., a node of a cluster representing a distributed storage system), implement data access semantics of a general purpose operating system. The storage operating system can also be implemented as a microkernel, an application program operating over a general-purpose operating system, such as UNIX or Windows NT, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein. A non-limiting example of a storage operating system that may implement one or more of the various file system, Redundant Array of Independent Disks (RAID), storage, auto-healing, rule evaluation, ML model training and inferencing, remediation, and other functionality described herein is the ONTAP data management software available from NetApp, Inc.
As used herein a “cloud” or “cloud environment” broadly and generally refers to a platform through which cloud computing may be delivered via a public network (e.g., the Internet) and/or a private network. The National Institute of Standards and Technology (NIST) defines cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” P. Mell, T. Grance, The NIST Definition of Cloud Computing, National Institute of Standards and Technology, USA, 2011. The infrastructure of a cloud may be deployed in accordance with various deployment models, including private cloud, community cloud, public cloud, and hybrid cloud. In the private cloud deployment model, the cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units), may be owned, managed, and operated by the organization, a third party, or some combination of them, and may exist on or off premises. In the community cloud deployment model, the cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations), may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and may exist on or off premises. In the public cloud deployment model, the cloud infrastructure is provisioned for open use by the general public, may be owned, managed, and operated by a cloud provider (e.g., a business, academic, or government organization, or some combination of them), and exists on the premises of the cloud provider. The cloud service provider may offer a cloud-based platform, infrastructure, application, or storage services as-a-service, in accordance with a number of service models, including Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), and/or Infrastructure-as-a-Service (IaaS). In the hybrid cloud deployment model, the cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
As used herein “AutoSupport” or “ASUP” generally refers to a telemetry mechanism that proactively monitors the health of a cluster of nodes (e.g., implemented in physical or virtual form) and/or individual nodes of a distributed computing system. A non-limiting example of a distributed computing system is a distributed data management storage solution (or a distributed storage system), for example, in the form of a cluster of nodes.
As used herein “community wisdom” generally refers to data received from and/or derived from a user base of one or more products/services of a vendor. A non-limiting example of such a product/service is a distributed storage system. Community wisdom may be collected to acquire a deep knowledge base to which predictive analytics and cognitive computing may be applied to derive insight-driven rules for identifying exposure to particular risks and insight-driven remediations for addressing or mitigating such risks. In the context of the enterprise-data-storage market, even a one to two percent market share represents a massive user base from which billions of data points may be gathered by a vendor on a daily basis from potentially hundreds of thousands of data management storage solutions. Insights may be extracted from this data by or on behalf of the vendor with cloud-based analytics that combine predictive analytics and proactive support to deliver actionable intelligence. Community wisdom may be said to be relevant to or applicable to a particular data storage system when such community wisdom was received from or derived from a same or similar class (e.g., entry-level, midrange, or high-end), and/or type (e.g., on-premise, cloud, or hybrid) of data storage system. Other classifications may include, but are not limited to workload type (e.g., high throughput, read only, etc.), features that are enabled (e.g., snapshot, replication, data reduction, Internet small computer system interface (iSCSI) protocol), applications running on the storage controllers, hardware (e.g., serial-attached SCSI (SAS), serial advanced technology attachment (SATA), non-volatile memory express (NVMe) disks, cache adapter installed, network adapters, and so on), system-defined performance service level (e.g., extreme performance (extremely high throughput at a very low latency), performance (high throughput at a low latency), value (high storage capacity and moderate latency), extreme for database logs (maximum throughput at the lowest latency), extreme for database shared data (very high throughput at the lowest latency), extreme for database data (high throughput at the lowest latency)).
As used herein, a “best practice” or “recommended practice” generally refers to a standard or a guideline that provides the best course of action in a given situation. In the realm of technology, a best practice may refer to a method, a technique, a configuration, or the like that is accepted as superior because it produces results that are better than those achieved by other means. In the context of various examples described herein, when planning and optimizing a storage system deployment, for example, within different ecosystems or with different protocols, there may be a variety of best practices for making use of certain features and capabilities of a storage system of a particular family, model, type, and/or class. For example, best practices may be related to how to optimally carry out a particular task within a cluster of nodes representing a distributed storage system, or how to optimally configure the cluster or an individual node of the cluster when making use of particular functions/features of the storage operating system. Non-limiting examples of classes or groups of best practices, which may be granular with respect to a particular family, model, type and/or class of storage system, the current version of the storage operating system, and/or functions/features enabled within the storage operating system, may relate to one or more of the following:
Implementation and usage of data protection and disaster recovery mechanisms (e.g., based on data replication solutions, such as NetApp SnapMirror storage and data replication software available from NetApp, Inc.), which may be available in an asynchronous and/or a synchronous replication configuration,
As used herein, “on-box” generally refers to or describes one or more functions, processes, services, or features implemented local to or on a data management storage solution (e.g., a node of nodes of a physical or virtual storage system), whereas “off-box” generally refers to or describes one or more functions, processes, services, or features implemented remote from or external to the data management storage solution.
As used herein, a “risk” may identify an issue within a cluster of nodes and/or individual nodes of a distributed computing system (e.g., data management storage solution). A risk may be communicated to an auto-heal system as an alert (e.g., an EMS event that contains state information (an EMS stateful event)). In some embodiments, the state information contained within an EMS stateful event may include an associated corrective action (e.g., a remediation). In one embodiment, risk identification may be triggered responsive to a predefined or configurable set of EMS events, which may be referred to herein as key EMS events. Risk identification may additionally or alternatively be performed responsive to rules that are run on a periodic schedule, responsive to configuration changes made to the distributed computing system, or on demand (e.g., responsive to a request made by an administrative user of the cluster). A deviation from a best practice is a non-limiting example of a risk, for example, to be addressed and/or brought to the attention of support or administrative personnel.
As described herein, a “remediation” generally represents one or more corrective actions that may be used to resolve an identified risk. In some embodiments, in order to facilitate auto-healing, remediations may comprise Python code. In other cases, remediations may be provided in the form of detailed directions (e.g., similar to the type of guidance and/or direction that might be received via level 1 (L1) or level 2 (L2) technical support) to allow an administrative user to perform remediations manually. Non-limiting examples of remediation actions include configuration recommendations for a data management storage solution or node thereof and command recommendations to be issued to a data management storage solution or node thereof, for example, via a command-line interface (CLI), a REST API, or a graphical user interface (GUI).
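The following minimal sketch, offered only as an illustration, shows one hypothetical way a remediation might bundle manual guidance with an optional automated corrective action; the Remediation class, apply_fix function, and risk identifier are assumptions of this example rather than an actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def apply_fix() -> bool:
    # Placeholder for an automated corrective action; a real remediation might
    # issue a CLI command or a REST call against the storage system here.
    print("Applying recommended configuration change for risk 'nfs-udp-enabled'")
    return True


@dataclass
class Remediation:
    """Hypothetical container pairing an identified risk with corrective actions."""
    risk_id: str
    manual_steps: List[str]                      # L1/L2-style guidance for manual recovery
    apply: Optional[Callable[[], bool]] = None   # optional automated fix (auto-heal)


remediation = Remediation(
    risk_id="nfs-udp-enabled",
    manual_steps=[
        "Review NFS clients that still mount over UDP.",
        "Disable UDP for the NFS server on the affected storage VM.",
    ],
    apply=apply_fix,
)
```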
As described herein, “rules” may be used to identify risks within a cluster of nodes and/or individual nodes of a distributed computing system (e.g., a data management storage solution). In some examples, the rules may be represented in the form of self-contained Python file(s) that contain code to identify a given issue (risk). For example, a rule may include one or more conditions or conditional expressions involving the current or historical state (e.g., configuration and/or event data) of the cluster or individual nodes that, when true, are indicative of the cluster or an individual node being exposed to the given risk. In some embodiments, rules may be hierarchically organized in parent-child relationships, for example, with zero or more child rules depending from a parent rule. A rule may contain or otherwise be associated with information as to whether it can be remediated. If so, the rule may also contain or be associated with steps for remediating the issue and/or explaining how the issue can be remediated. In one embodiment, rules can be executed based on a trigger or a schedule. In the context of trigger-based rules, a publisher/subscriber bus message, for example, identifying the occurrence of a key EMS event may represent the source of a trigger and may be associated with one or more rules to be executed. In the context of schedule-based rules, a scheduler or job manager may execute a given rule in accordance with a schedule associated with the given rule. In other examples, rules or rule sets may be represented in the form of a machine-learning (ML) algorithm or model, for example, an ML classification model or a deep learning model, such as a Recurrent Neural Network (RNN).
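Purely by way of example, a self-contained rule file might take a shape along the following lines; the event name, module attributes, remediation identifier, and check signature are hypothetical and are not drawn from any particular rule engine or EMS catalog.

```python
"""Hypothetical layout of a self-contained rule file (illustrative names only)."""

# Trigger events and/or a schedule that cause the rule engine to evaluate this rule.
TRIGGER_EVENTS = ["volume.space.nearlyFull"]   # hypothetical key EMS event name
SCHEDULE = "hourly"                            # optional periodic re-check

# Whether this rule can be remediated, and by which remediation.
REMEDIABLE = True
REMEDIATION_ID = "grow-volume-by-percentage"


def check(current_state: dict, historical_samples: list) -> bool:
    """Return True when the condition indicative of the risk's root cause holds."""
    used_pct = current_state.get("volume_used_pct", 0)
    # Example condition: usage above 90% and trending upward over recent samples.
    trending_up = (len(historical_samples) >= 2
                   and historical_samples[-1] > historical_samples[0])
    return used_pct >= 90 and trending_up
```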
As used herein, a “publisher/subscriber bus,” a “publisher-subscriber bus,” a “pub/sub bus,” a “pub-sub bus” and the like generally refer to a messaging queue system that facilitates communication among publishers and subscribers. Publishers generally represent systems, components, or applications that produce or generate events or data and subscribers generally represent systems, components, or applications that desire to be made aware of the availability of data produced by one or more publishers or the occurrence of certain events or data relating to one or more publishers. A pub/sub bus eliminates the need for subscribers to poll for data from publishers (e.g., via an application programming interface exposed by a publisher) and instead implements a subscription model. For example, subscribers may subscribe to the data or topic(s) of interest (e.g., the occurrence of a particular event or type of event within a data storage management solution) generated by or otherwise associated with one or more publishers via Application Programming Interfaces (APIs) (e.g., Representational State Transfer (REST) APIs) exposed by a storage operating system of a data storage management solution for use by authorized internal and/or external entities. Non-limiting examples of a pub/sub bus include NetApp ONTAP Pub/Sub. Non-limiting examples of message brokers that may be used to facilitate implementation of a pub/sub bus include Apache Qpid, ActiveMQ, and RabbitMQ.
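The following toy sketch illustrates the publish/subscribe pattern itself; it is an in-process stand-in only (a production bus would rely on a message broker such as those named above), and the topic and event names are hypothetical.

```python
from collections import defaultdict


class PubSubBus:
    """Toy in-process publisher/subscriber bus illustrating the pattern only."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register interest in a topic instead of polling a publisher."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


bus = PubSubBus()
# The auto-healing service subscribes to a topic of interest.
bus.subscribe("ems.key_events", lambda msg: print("analyze:", msg["event_name"]))
# A publisher (e.g., the event management system) emits an event on that topic.
bus.publish("ems.key_events", {"event_name": "volume.space.nearlyFull"})
```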
As used herein, a “storage volume” or “volume” generally refers to a container in which applications, databases, and file systems store data. A volume is a logical component created for the host (e.g., a client) to access storage of the underlying primary storage tier associated with a storage system. A volume may be created from the capacity available in a storage pod, a pool, or a volume group. A volume has a defined capacity. Although a volume might consist of more than one storage drive, a volume appears as one logical component to the host (e.g., a client). Non-limiting examples of a volume include a flexible volume and a flexgroup volume.
As used herein, a “flexible volume” generally refers to a type of storage volume that may traditionally be efficiently distributed across multiple storage devices. A flexible volume may be capable of being resized to meet changing business or application requirements. In some embodiments, a storage system may provide one or more aggregates and one or more storage volumes distributed across a plurality of nodes interconnected as a cluster. Each of the storage volumes may be configured to store data such as files and logical units. A flexible volume may be contained within a storage aggregate (e.g., representing a set of storage devices (disks)) and includes at least one storage device. The storage aggregate may be abstracted over a RAID plex where each plex comprises a RAID group. Moreover, each RAID group may comprise a plurality of storage disks. As such, a flexible volume may comprise data storage spread over multiple storage disks or devices. A flexible volume may be loosely coupled to its containing aggregate (e.g., a file system aggregate, such as a WAFL aggregate). A flexible volume can share its containing aggregate with other flexible volumes. Thus, a single aggregate can be the shared source of all the storage used by all the flexible volumes contained by that aggregate. A non-limiting example of a flexible volume is a NetApp ONTAP FlexVol volume.
As used herein, a “flexgroup volume” generally refers to a single namespace that is made up of multiple constituent/member volumes. A non-limiting example of a flexgroup volume is a NetApp ONTAP FlexGroup volume that can be managed by storage administrators, and which acts like a flexible volume. In the context of a flexgroup volume, “constituent volume” and “member volume” are interchangeable terms that refer to the underlying volumes (e.g., flexible volumes) that make up the flexgroup volume.
“AIOps” is an umbrella term for the use of big data analytics, ML, and/or other artificial intelligence (AI) technologies to automate the identification and resolution of common IT issues or risks. Separate rule sets 121 may be generated for different types and/or classes of data storage systems or based on features enabled within the data storage systems. Similarly, separate remediation sets 122 may be created for different types and/or classes of data storage systems or based on features enabled within the data storage systems. A non-limiting example of the AIOps platform 120 is the NetApp Active IQ Digital Advisor available from NetApp, Inc.
As shown in
As illustrated in
Some data may be sensitive and/or capable of identifying a customer. Such data may be excluded or scrubbed by default (or when requested by a customer) before being sent to the AIOps platform. Examples of potentially sensitive data include, but are not limited to: IP addresses, MAC addresses, URIs, DNS names, e-mail addresses, port numbers, node names, SVM names, cluster names, aggregate names, volume names, junction paths, policy names, user IDs, and the like.
Given the volume of data and different configurations of storage systems, the next step is to intelligently sift “signals” out of the “noise” to identify significant events and patterns related to the existence of potential risks, application performance, and/or availability issues to which a given data management storage solution 130 may be exposed (block 124). This can be done, in accordance with various embodiments, using various machine learning techniques or classifiers. Examples of potential ML classifiers that could be used include, but are not limited to, the following: Decision Tree, Random Forest, Naive Bayes Classifier, K-Nearest Neighbors, Support Vector Machines, Artificial Neural Networks, and the like. A non-limiting example of an ML classification model in the form of a neural network model is shown and described below with reference to
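As a purely illustrative sketch, one of the listed classifiers could be trained as follows; this assumes the scikit-learn and NumPy libraries are available, uses synthetic data with hypothetical feature names, and is not representative of any vendor's actual models.

```python
# Illustrative only: a random forest classifier trained on synthetic "telemetry"
# features to separate records indicative of a risk from benign records.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: [volume_used_pct, iops_utilization, error_event_rate]
X = rng.random((1000, 3)) * [100, 100, 10]
y = ((X[:, 0] > 85) & (X[:, 2] > 5)).astype(int)   # synthetic labeling rule

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```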
Once the significant events and patterns are identified, the next step is to diagnose root causes (e.g., using Lean Six Sigma, statistical analysis tools, hierarchical clustering, and/or data-mining solutions) and report them to technical support staff, IT, and/or DevOps for rapid response and/or development of appropriate remediations that may be deployed to relevant portions of the customer base (block 125). Alternatively, in some cases, the AIOps platform 120 may automatically propose remediations without human intervention.
According to one embodiment, the AIOps platform 120 represents a big data platform that aggregates community wisdom (e.g., community wisdom 111a-b) received from or derived from multiple sources (e.g., customers'/users' interactions with technical support staff (e.g., technical support 115), support case histories, and events associated with operation of data management storage solutions of participating customers having a feedback/reporting feature enabled) (block 126).
Community wisdom may start in the form of support troubleshooting workflows, knowledge-base articles, and pattern matching within a given customer environment and/or across multiple customer environments. At block 127, the community wisdom may then be broken into multiple segments. For example, in some embodiments, the community wisdom may include a trigger event segment (e.g., the “signals” or portions thereof from above), an analysis segment, and a recovery segment. In some embodiments, the trigger events may represent significant events and patterns related to the existence of potential risks, application performance, and/or availability issues to which a given data management storage solution 130 may be exposed. The existence or occurrence of such trigger events may be used as an indicator to start an analysis process. For example, as described further below, one or more key event management system (EMS) events may be used as trigger events to evaluate corresponding sets of one or more rules to confirm or refute the existence of potential risks, application performance, and/or availability issues to which the given data management storage solution 130 may be exposed. The analysis segment may be used to determine whether an actual issue associated with the trigger event has occurred. As described further below, the analysis may include the evaluation by a rules engine of one or more rules, for example, written in Python. The recovery segment of the community wisdom, at least in some embodiments, may include the logic (or remediation) that actually corrects the identified issue. Like the rules, the remediations may also be written in Python. A given remediation may be associated with one or more rules.
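For illustration only, a single unit of community wisdom broken into these three segments might be represented as follows; the event name, rule identifier, remediation identifier, and file paths are hypothetical.

```python
# Hypothetical packaging of one unit of community wisdom into its three segments.
wisdom_entry = {
    "trigger": {"ems_event": "volume.space.nearlyFull"},          # starts the analysis
    "analysis": {"rule_id": "volume-capacity-risk",               # confirms or refutes the issue
                 "rule_module": "rules/volume_capacity_risk.py"},
    "recovery": {"remediation_id": "grow-volume-by-percentage",   # corrects the issue
                 "remediation_module": "remediations/grow_volume.py"},
}
```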
The community wisdom and the telemetry data (e.g., ASUP telemetry data 131) from which it is derived may include, among other data:
Based on the community wisdom, the AIOps platform 120 may apply focused analytics and ML capabilities to, among other things:
In one embodiment, one or more of the remediations of remediation sets 122 may be generated based on support troubleshooting workflows developed by technical support staff to identify and address problems/issues/risks observed in numerous customer cases. For example, the support troubleshooting workflows may be turned into code modules that perform deep analysis and provide automated recovery. As described further below, a non-limiting example of a potential remediation to address a capacity issue (e.g., a risk of imminently filling the storage capacity of a storage container, for example, a LUN, a volume, and/or an aggregate) causes the storage capacity of the storage container at issue to be increased by a predetermined or configurable percentage.
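A minimal sketch of such a capacity remediation is shown below, assuming the requests library, a hypothetical REST endpoint path, and bearer-token authentication; it is not a definitive implementation of any particular storage system API.

```python
import requests  # assumed available; the endpoint and payload below are hypothetical


def grow_volume(base_url: str, volume_uuid: str, current_size_bytes: int,
                growth_pct: int = 10, token: str = "") -> int:
    """Increase a volume's capacity by a configurable percentage (default 10%)."""
    new_size = int(current_size_bytes * (1 + growth_pct / 100))
    response = requests.patch(
        f"{base_url}/api/storage/volumes/{volume_uuid}",   # illustrative REST endpoint
        json={"size": new_size},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return new_size


# Example: a 1 TiB volume at risk of filling would grow to roughly 1.1 TiB.
# grow_volume("https://cluster.example.com", "uuid-1234", 1 << 40, growth_pct=10)
```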
The code modules may analyze, among other issues:
According to one embodiment, during operation of the data management storage solution 130, a single node called the “primary node,” which may be responsible for coordinating cluster-wide activities, may collect and report telemetry data (e.g., ASUP telemetry data 131) to the AIOps platform 120. The telemetry data may be collected by, among other mechanisms, performance-monitoring tools running on the data management storage solution 130, and service ticketing systems, for example, utilized by technical support staff.
When received from the data management storage solution 130, the AIOps platform 120 may store the telemetry data in an ASUP data lake 110 to allow the raw data to be transformed into structured data that is ready for SQL analytics, data science, and/or ML with low latency. For example, the telemetry data may be processed by one or more analytical models to create the community wisdom that may be stored within ASUP data lake 110. The collection and reporting of the telemetry data by a telemetry mechanism (not shown) may be performed periodically and/or responsive to trigger events. The telemetry mechanism may proactively monitor the health of a particular data storage system or cluster with which it is associated and automatically send information regarding configuration, status, performance, and/or system updates relating to the particular data storage system or cluster to the vendor. This information may then be used by technical support personnel and/or the AIOps platform 120 to speed the diagnosis and resolution of issues (e.g., step-by-step or automated remediations). For example, when predetermined or configurable events are observed within an individual node of a given data management storage solution or at the cluster level, when manually triggered by a customer, when manually triggered by the vendor, or on a periodic basis (e.g., daily, weekly, etc.), ASUP telemetry data 131 (e.g., in the form of an ASUP payload), including, among other things, information indicative of the class and type of the data management system(s) at issue, the configuration (e.g., features that are enabled/disabled) of the data management system(s) at issue, and the version of storage operating system software being run by the data management system may be generated and transmitted to the AIOps platform 120.
In addition to automatically reported telemetry data (e.g., ASUP telemetry data 131), data collected by technical support personnel (e.g., technical support 115) in connection with troubleshooting customer issues may be used to derive community wisdom. In one embodiment, customers of a vendor of the data management storage solution 130 may report potential issues they are experiencing with the data management storage solution 130 to technical support personnel via text, chat, email, phone, or other communication channels. Information collected by technical support 115, for example, regarding a given reported issue, including, among other data, the class and type of data management system, the configuration of the data management system, and the version of storage operating system software being run by the data management system may be provided in near real-time to the AIOps platform 120.
Depending upon the particular implementation, updates (e.g., update 119) may be provided to groups of clusters based on their similarity in terms of class and/or type of data storage systems. For example, a given update may include an updated rule set (e.g., including new and/or updated rules, for example, in the form of conditional logic or in the form of an ML model) and/or an updated remediation set (e.g., including new and/or updated remediations) for use by a particular class and/or a particular type of data storage system. Alternatively, an update may be unique to a particular cluster. According to one embodiment, updates may be performed in accordance with a predefined or configurable schedule (e.g., daily, weekly, monthly, etc.) and/or responsive to manual direction from the vendor. Given a typical feature release schedule for software of a data storage system might be on the order of once or twice per calendar year, the ability to deliver such updates out-of-cycle with the release schedule provides enormous benefit. For example, customers obtain the advantages and results of enhanced risk identification and/or remediation capabilities without having to wait for the next feature release. As will be appreciated, for dark sites (e.g., government or military sites having no Internet connectivity) that may employ one or more data storage systems utilizing features associated with various embodiments, updates to rule sets 121 and/or remediation sets 122 may be delivered via “sneaker net” (e.g., on a computer-readable medium) concurrently with or separate from updates or patches to the software of the data storage systems.
In the context of the present example, the ML classification model is shown as a network of nodes (or “neurons”) which are organized in layers (e.g., an input layer 152, one or more hidden layers 154, and an output layer 156). Based on the predictors (or inputs) provided to the input layer 152, forecasts (or outputs) are emitted by the output layer 156. Coefficients (not shown) associated with each of the predictors are generally referred to as weights. The forecasts are obtained by a combination (in this case, a non-linear combination) of the inputs. The weights may be selected using a learning algorithm that minimizes a cost function (e.g., mean absolute error, mean squared error, root mean squared error, etc.). The example ML classification model 150 depicted in
In general, ML classification algorithms may be used to predict a discrete outcome (y) using independent variables (x). ML has a variety of use-cases in different domains. Subscription-based media streaming platforms like Netflix and Spotify, for instance, use ML to recommend content to users based on their respective activity on the platform. In the context of various embodiments described herein, an ML classification model (e.g., ML classification model 150) may be trained remotely by the AIOps platform, for example, based on community wisdom and applied locally by a particular data storage system, for example, by an auto-healing service to predict whether the particular data storage system is exposed to a particular issue or risk based on a state of the particular data storage system (e.g., one or more of events occurring within the particular data storage system, results of periodically scheduled checks, and historical data) as inputs (e.g., one of input1 to inputn) to the input layer 152. As described further below, responsive to identification of the particular issue or risk, the auto-healing service may further identify a corresponding remediation, to be manually approved or automatically applied to the data storage system, that is known to address or mitigate a root cause of the particular issue or risk, for example, based on analysis of community wisdom performed by the AIOps platform.
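To make the layered structure described above concrete, the following NumPy sketch computes a single forward pass through a small network with one hidden layer; the weights, inputs, and feature interpretations are arbitrary placeholder values for illustration, not trained parameters of any actual model.

```python
# Minimal forward pass of a small classification network (one hidden layer).
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical inputs: normalized state of the data storage system
# (e.g., capacity used, error-event rate, latency).
x = np.array([0.95, 0.40, 0.10])

W_hidden = np.array([[0.8, -0.2, 0.1],
                     [0.3,  0.7, -0.5]])   # 2 hidden neurons x 3 inputs
b_hidden = np.array([0.0, 0.1])
W_out = np.array([[1.2, -0.9]])            # 1 output neuron x 2 hidden neurons
b_out = np.array([-0.3])

hidden = sigmoid(W_hidden @ x + b_hidden)           # non-linear combination of the inputs
risk_probability = sigmoid(W_out @ hidden + b_out)  # forecast emitted by the output layer
print(f"predicted risk probability: {risk_probability[0]:.2f}")
```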
While in the context of the present example, only one ML classification model is shown, it is to be appreciated multiple different ML classification models may be employed. According to one embodiment, a different ML classification model may be trained by the AIOps platform for respective target groups of data storage systems of the same or similar class and type of data storage system based on community wisdom derived from information (e.g., telemetry data, interactions with technical support staff, support case histories, and/or the like) collected from the vendor's consumer base that are of the same or similar class and type as the target group. For example, a first ML classification model may be trained by the AIOps platform for virtual storage systems deployed within a particular public cloud (e.g., Amazon Web Services (AWS)), a second ML classification model may be trained by the AIOps platform for virtual storage systems deployed within another public cloud (e.g., Google Cloud Platform (GCP)), and a third ML classification model may be trained by the AIOps platform for virtual storage systems deployed within yet another public cloud (e.g., Microsoft Azure). Similarly, separate ML classification models may be trained by the AIOps platform for more performant virtual storage systems versus less performant virtual storage systems or more performant physical storage systems versus less performant physical storage systems. Those skilled in the art will appreciate there are numerous other potential groupings/classifications/types of data storage systems, for example, based on features that are enabled on the data storage systems, applications running on the data storage systems, the performance service levels for which the data storage systems are configured, the type or nature of the storage media employed by the data storage systems, and the hardware configuration of the data storage systems.
Nodes 202 may service read requests, write requests, or both received from one or more clients (e.g., clients 205). In one or more embodiments, one of nodes 202 may serve as a backup node for the other should the former experience a failover event. Nodes 202 are supported by physical storage 208. In one or more embodiments, at least a portion of physical storage 208 is distributed across nodes 202, which may connect with physical storage 208 via respective controllers (not shown). The controllers may be implemented using hardware, software, firmware, or a combination thereof. In one or more embodiments, the controllers are implemented in an operating system within the nodes 202. The operating system may be, for example, a storage operating system (OS) that is hosted by the distributed storage system. Physical storage 208 may be comprised of any number of physical data storage devices. For example, without limitation, physical storage 208 may include disks or arrays of disks, solid state drives (SSDs), flash memory, one or more other forms of data storage, or a combination thereof associated with respective nodes. For example, a portion of physical storage 208 may be integrated with or coupled to one or more nodes 202.
In some embodiments, nodes 202 connect with or share a common portion of physical storage 208. In other embodiments, nodes 202 do not share storage. For example, one node may read from and write to a first portion of physical storage 208, while another node may read from and write to a second portion of physical storage 208.
Should one of the nodes 202 experience a failover event, a peer high-availability (HA) node of nodes 202 can take over data services (e.g., reads, writes, etc.) for the failed node. In one or more embodiments, this takeover may include taking over a portion of physical storage 208 originally assigned to the failed node or providing data services (e.g., reads, writes) from another portion of physical storage 208, which may include a mirror or copy of the data stored in the portion of physical storage 208 assigned to the failed node. In some cases, this takeover may last only until the failed node returns to being functional, online, or otherwise available.
The data center 330 may represent an enterprise data center (e.g., an on-premises customer data center) that is built, owned, and operated by a company or the data center 330 may be managed by a third party (or a managed service provider) on behalf of the company, which may lease the equipment and infrastructure. Alternatively, the data center 330 may represent a colocation data center in which a company rents space of a facility owned by others and located off the company premises. The data center 330 is shown including a distributed storage system (e.g., cluster 335). Those of ordinary skill in the art will appreciate additional information technology (IT) infrastructure would typically be part of the data center 330; however, discussion of such additional IT infrastructure is unnecessary to the understanding of the various embodiments described herein.
Turning now to the cluster 335 (which may be analogous to data management storage solution 130 and/or cluster 201), it includes multiple nodes 336a-n and data storage nodes 337a-n (which may be analogous to nodes 202 and which may be collectively referred to simply as nodes) and an Application Programming Interface (API) 338. In the context of the present example, the nodes are organized as a cluster and provide a distributed storage architecture to service storage requests issued by one or more clients (e.g., clients 305) of the cluster. The data served by the nodes may be distributed across multiple storage units embodied as persistent storage devices, including but not limited to hard disk drives, solid state drives, flash memory systems, or other storage devices. A non-limiting example of a node is described in further detail below with reference to
The API 338 may provide an interface through which the cluster 335 is configured and/or queried by external actors. Depending upon the particular implementation, the API 338 may represent a REST or RESTful API that uses Hypertext Transfer Protocol (HTTP) methods (e.g., GET, POST, PATCH, DELETE, and OPTIONS) to indicate its actions. Depending upon the particular embodiment, the API 338 may provide access to various telemetry data (e.g., performance, configuration and other system data) relating to the cluster 335 or components thereof. As those skilled in the art will appreciate, various types of telemetry data may be made available via the API 338, including, but not limited to, measures of latency, utilization, and/or performance at various levels (e.g., the cluster level, the node level, or the node component level). The telemetry data available via API 338 may include ASUP telemetry data (e.g., ASUP telemetry data 131) or the ASUP telemetry data may be provided to an AIOps platform (e.g., AIOps platform 120) separately.
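For purposes of illustration only, the following is a minimal sketch of how telemetry data might be retrieved over such a REST API using HTTP GET; the endpoint path, query parameters, and bearer-token authentication shown are illustrative assumptions rather than the API of any particular product.

```python
# Minimal sketch (illustrative assumptions only): query hypothetical
# cluster-level telemetry over a REST API using HTTP GET.
import requests


def fetch_cluster_metrics(base_url: str, token: str) -> dict:
    """Return hypothetical latency/utilization metrics for a cluster."""
    response = requests.get(
        f"{base_url}/api/cluster/metrics",                 # hypothetical endpoint
        params={"fields": "latency,iops,utilization"},     # hypothetical field names
        headers={"Authorization": f"Bearer {token}"},      # assumed auth scheme
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```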
In this example, the virtual storage system 410a makes use of storage (e.g., hyperscale disks 425) provided by the hyperscaler, for example, in the form of solid-state drive (SSD) backed or hard-disk drive (HDD) backed disks. The cloud disks (which may also be referred to herein as cloud volumes, storage devices, or simply volumes or storage) may include persistent storage (e.g., disks) and/or ephemeral storage (e.g., disks), which may be analogous to physical storage 208.
The virtual storage system 410a (which may be analogous to a node of data management storage solution 130, one of nodes 202, and/or one of nodes 336a-n) may present storage over a network to clients 405 (which may be analogous to clients 205 and 305) using various protocols (e.g., small computer system interface (SCSI), Internet small computer system interface (iSCSI), fibre channel (FC), common Internet file system (CIFS), network file system (NFS), hypertext transfer protocol (HTTP), web-based distributed authoring and versioning (WebDAV), or a custom protocol). Clients 405 may request services of the virtual storage system 410 by issuing Input/Output requests 406 (e.g., file system protocol messages (in the form of packets) over the network). A representative client of clients 405 may comprise an application, such as a database application, executing on a computer that “connects” to the virtual storage system 410 over a computer network, such as a point-to-point link, a shared local area network (LAN), a wide area network (WAN), or a virtual private network (VPN) implemented over a public network, such as the Internet.
In the context of the present example, the virtual storage system 410a is shown including a number of layers, including a file system layer 411 and one or more intermediate storage layers (e.g., a RAID layer 413 and a storage layer 415). These layers may represent components of data management software or storage operating system (not shown) of the virtual storage system 410. The file system layer 411 generally defines the basic interfaces and data structures in support of file system operations (e.g., initialization, mounting, unmounting, creating files, creating directories, opening files, writing to files, and reading from files). A non-limiting example of the file system layer 411 is the Write Anywhere File Layout (WAFL) Copy-on-Write file system (which represents a component or layer of ONTAP software available from NetApp, Inc.).
The RAID layer 413 may be responsible for encapsulating data storage virtualization technology for combining multiple hyperscale disks 425 into RAID groups, for example, for purposes of data redundancy, performance improvement, or both. The storage layer 415 may include storage drivers for interacting with the various types of hyperscale disks 425 supported by the hyperscaler 420. Depending upon the particular implementation, the file system layer 411 may persist data to the hyperscale disks 425 using one or both of the RAID layer 413 and the storage layer 415.
The various layers described herein, and the processing described below may be implemented in the form of executable instructions stored on a machine readable medium and executed by one or more processing resources (e.g., one or more of a microcontroller, a microprocessor, central processing unit core(s), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), and the like) and/or in the form of other types of electronic circuitry. For example, the processing may be performed by one or more virtual or physical computer systems of various forms (e.g., servers, blades, network storage systems or appliances, and storage arrays, such as the computer system described with reference to
In some examples, the system manager may represent an HTML5-based graphical management interface that enables an administrative user to use a web browser to manage the distributed storage system and associated storage objects (e.g., disks, volumes, and storage tiers) and perform common management tasks related to storage systems. Using the system manager dashboard, the administrator may be provided with at-a-glance information about, among other things, various types of alerts and notifications, the efficiency and capacity of storage tiers and volumes, the nodes that are available in a cluster, the status of the nodes in a high-availability (HA) pair, the most active applications and objects, and the performance metrics of a cluster or a node.
With the system manager, the administrator may be able to perform many common tasks, such as:
In the context of the present example, the system manager dashboard includes respective sections relating to health, capacity, performance, management actions, and network. As shown in the management actions section, a DNS lookup failure event has occurred as indicated by 510. An administrative user may view details associated with this event by selecting the “Details” button 511.
As explained further below, the DNS lookup failure event may represent a key EMS event that triggered analysis of one or more rules (e.g., of rule sets 121), for example, by an auto-healing service running on the distributed storage system. Selection of the “Details” button 511 may reveal the particular risk or issue underlying the failure that was identified as a result of analysis of the one or more rules triggered by the key EMS event. As noted above with reference to
As explained further below, the CIFS share offline event may represent a key EMS event that triggered analysis of one or more rules (e.g., of rule sets 121), for example, by an auto-healing service running on the distributed storage system. Selection of the “Details” button 711 may reveal the particular risk or issue underlying the event that was identified as a result of analysis of the one or more rules triggered by the key EMS event. As noted above with reference to
While for purposes of explanation, two specific examples of NAS events and corresponding remediations have been described above with reference to
In the context of the present example, the major components that make up the auto-healing service 900 include a rule/remediation coordinator 940, a cluster-wide task table 912, an auto-healing REST API 910, an event management system (EMS) service 920, a publisher/subscriber (pub/sub)/EMS topic 930, a rules table 911, a pub/sub/auto-heal topic 950, a rules evaluator 960, and a task execution engine 970.
The rule/remediation coordinator 940 may be responsible for coordinating the execution of rules and remediations for the cluster (e.g., data management storage solution 130 of
In the context of the present example, the rule/remediation coordinator 940 includes an event digest 942 and a thread pool 943. The rule/remediation coordinator 940 may utilize the event digest 942 to communicate with a pub/sub bus. For example, the event digest 942 may include functions to subscribe to one or more pub/sub/EMS topics (e.g., pub/sub/EMS topic 930) and/or publish to one or more pub/sub/auto-heal topics (e.g., pub/sub/auto-heal topic 950). According to one embodiment, based on a set of rules derived from community wisdom and distributed to the cluster by an AIOps platform (e.g., AIOps platform 120), the rule/remediation coordinator 940 may create a mapping between key EMS events (e.g., a CIFS share offline event, a DNS server lookup failure, etc.) and associated rules, the evaluation of which may be able to identify underlying root causes (e.g., potential risks or issues to which the cluster may be exposed) and may subscribe to the EMS topics corresponding to the key EMS events. Thereafter, responsive to being notified regarding the occurrence of a particular EMS event to which the rule/remediation coordinator 940 has subscribed, the rule/remediation coordinator 940 may cause the rules evaluator 960 to evaluate a set of one or more rules to which the particular EMS event is mapped. For example, an EMS event indicative of a CIFS share offline event may be associated with a set of rules that check for one or more of client authentication issues, directory service issues, and/or client connectivity issues. Similarly, an EMS event indicative of a DNS server lookup failure may be associated with a set of rules that check for one or more of external networking issues and/or client connectivity issues. Other EMS events may be associated with a set of rules that check for one or more of client authentication issues, capacity exhaustion issues, external networking issues, directory services issues, and/or connectivity issues as appropriate.
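The following is a minimal sketch, provided for purposes of illustration only, of how such a coordinator might maintain a mapping between key EMS events and rules and subscribe to the corresponding topics; the event identifiers, rule identifiers, and pub/sub bus interface are assumptions.

```python
# Minimal sketch (illustrative assumptions only): map key EMS events to rules
# and subscribe to the corresponding EMS topics on a pub/sub bus.
from collections import defaultdict

# Hypothetical mapping, e.g., as distributed via a rule-set update.
EVENT_TO_RULES = {
    "cifs.share.offline": ["rule.client.auth", "rule.directory.service", "rule.client.connectivity"],
    "dns.server.lookup.failure": ["rule.external.network", "rule.client.connectivity"],
}


class RuleRemediationCoordinator:
    def __init__(self, bus):
        self.bus = bus                                   # assumed bus with subscribe()/publish()
        self.event_to_rules = defaultdict(list, EVENT_TO_RULES)

    def register(self):
        # Subscribe only to the key EMS events for which rules exist.
        for event_id in self.event_to_rules:
            self.bus.subscribe(f"ems/{event_id}", self.on_ems_event)

    def on_ems_event(self, event):
        # Publish a rule-evaluation request for each rule mapped to the event.
        for rule_id in self.event_to_rules.get(event["id"], []):
            self.bus.publish("auto-heal/evaluate", {"rule_id": rule_id, "event": event})
```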
In the context of the present example, the thread pool 943 may represent a collection of polymorphic threads, which allows any of the individual threads to be shared across functions. Thread pools are a software design pattern for achieving concurrency of execution, for example, by maintaining multiple threads waiting for tasks to be allocated for concurrent execution. Creating and destroying threads and their associated resources can be an expensive process in terms of time. One benefit of making use of thread pools over creating a new thread for each task is that thread creation and destruction overhead is restricted to the initial creation of the pool, which may result in better performance and better system stability. Polymorphic means the ability to take different forms. In one embodiment, the thread pool 943 represents a generic thread pool of a predetermined or configurable size that may be used by any of the rule/remediation coordinator 940, the rules evaluator 960, or the task execution engine 970. For example, on an as-needed basis, as new tasks are allocated to the threads within the thread pool 943, they may inherit different attributes and behaviors, thereby making the threads available for use by any of a rules evaluator (e.g., rules evaluator 960), a task execution engine (e.g., task execution engine 970), or a rule/remediation coordinator (e.g., rule/remediation coordinator 940). In this manner, the availability and usage of threads may be optimized and wasting of resources may be avoided.
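A minimal sketch of the shared thread pool idea follows, assuming a fixed-size pool created once and reused by the coordinator, rules evaluator, and task execution engine; it illustrates the pattern only and is not tied to any particular implementation of thread pool 943.

```python
# Minimal sketch (pattern illustration only): one shared, fixed-size thread
# pool whose threads take on whatever work is submitted to them.
from concurrent.futures import ThreadPoolExecutor


class SharedThreadPool:
    def __init__(self, size: int = 8):
        # Thread creation cost is paid once, when the pool is created.
        self._executor = ThreadPoolExecutor(max_workers=size)

    def submit(self, fn, *args, **kwargs):
        # Any component (coordinator, evaluator, or execution engine) may
        # submit any callable; the same threads serve all of them.
        return self._executor.submit(fn, *args, **kwargs)


pool = SharedThreadPool(size=8)
future = pool.submit(sum, [1, 2, 3])   # example task shared through the pool
assert future.result() == 6
```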
According to one embodiment, the rule/remediation coordinator 940 oversees the detection and scheduling of rules and remediations. The rule/remediation coordinator 940 may use a distributed Saga design pattern (e.g., Saga pattern 941) for managing failures and recovery, where each action has a compensating action for roll-back/roll-forward. For example, the distributed Saga design pattern may be used as a mechanism to manage data consistency across multiple services (e.g., microservices) in distributed transaction scenarios. A saga is a sequence of transactions that updates each service and publishes a message or event to trigger the next transaction step. If a step fails, the saga executes compensating transactions that counteract the preceding transactions.
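For purposes of illustration, the following minimal sketch shows the saga idea in isolation: each step pairs a forward action with a compensating action, and a failure rolls back the completed steps in reverse order. The step structure is an assumption and is not intended to depict any particular implementation of Saga pattern 941.

```python
# Minimal sketch (pattern illustration only): a saga as an ordered list of
# steps, each with a compensating action for roll-back on failure.
class SagaStep:
    def __init__(self, name, action, compensate):
        self.name = name
        self.action = action            # forward transaction
        self.compensate = compensate    # compensating transaction


def run_saga(steps):
    completed = []
    try:
        for step in steps:
            step.action()
            completed.append(step)
    except Exception:
        # Undo the already-completed steps in reverse order, then re-raise.
        for step in reversed(completed):
            step.compensate()
        raise
```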
While in the context of the present example, the auto-healing service 900 is described as running on a single node within the cluster, it is to be appreciated that, if a node running the auto-healing service 900 fails, another node of the cluster may be elected to run the auto-healing service 900.
The cluster-wide task table 912 may be responsible for logging the steps of a given rule execution and/or a given remediation execution. In the case of a failure of the rule/remediation coordinator 940, the cluster-wide task table 912 may be used to restart execution of running rules/remediations from the point at which they were interrupted by the failure.
The auto-healing REST API 910 provides an interface through which requests for remediation execution may be received from an administrative user of the cluster, for example, as a result of interactions with a user interface presented by a system manager dashboard, such as selection of a “Fix It” button (e.g., “Fix It” button 611 or 811).
The EMS service 920 may represent an event system that includes monitoring and create, read, update, and delete (CRUD)-based alerting. The EMS service 920 may collect and log event data from different parts of the storage operating system kernel and provide event forwarding mechanisms to allow the events to be reported as EMS events. For example, the EMS service 920 may be used to create and modify EMS messages (with stateful attributes).
In one embodiment, a pub/sub bus (including, for example, pub/sub/EMS topic 930 and pub/sub/auto-heal topic 950) is provided to facilitate the exchange of messages among components of the auto-healing service 900. In one embodiment, a topic may be specified by the source component when it publishes a message and subscribers may specify the topic(s) (e.g., pub/sub/EMS topic 930 and/or pub/sub/auto-heal topic 950) for which they want to receive publications.
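The following is a minimal, in-process sketch of a topic-based pub/sub bus of the kind described above; it is intended only to illustrate the subscribe/publish/notify pattern, and the topic and rule names used are assumptions.

```python
# Minimal sketch (pattern illustration only): subscribers register callbacks
# for a topic and are notified when a message is published to that topic.
from collections import defaultdict
from typing import Callable


class PubSubBus:
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver the message to every subscriber of the topic; subscribers
        # never need to poll the publisher.
        for callback in self._subscribers[topic]:
            callback(message)


bus = PubSubBus()
bus.subscribe("auto-heal/evaluate", lambda msg: print("evaluate", msg["rule_id"]))
bus.publish("auto-heal/evaluate", {"rule_id": "rule.dns.check"})   # hypothetical topic/rule
```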
The pub/sub/EMS topic 930 may be used to listen for (e.g., register to be notified regarding) key EMS messages (e.g., those to which the rule/remediation coordinator is subscribed) and used to trigger execution of rule(s) and/or remediations by the rules evaluator 960 and the task execution engine 970, respectively.
The rules table 911 may be used to store and retrieve information about the mapping between EMS events and associated rules (e.g., rules that are part of rule sets 121) to be executed as well as information regarding scheduled risk checks. For example, an EMS event indicative of a CIFS share offline event may be associated with a set of rules that check for one or more of client authentication issues, directory service issues, and/or client connectivity issues. Similarly, an EMS event indicative of a DNS server lookup failure may be associated with a set of rules that check for one or more of external networking issues and/or client connectivity issues. Other EMS events may be associated with a set of rules that check for one or more of client authentication issues, capacity exhaustion issues, external networking issues, directory services issues, and/or connectivity issues as appropriate.
The pub/sub/auto-heal topic 950 may be used for communication between the rule/remediation coordinator 940 and the rules evaluator 960 and between the rule/remediation coordinator 940 and the task execution engine 970.
The rules evaluator 960 may be responsible for overseeing the execution of rules and the detection of risks. Depending on the needs of the particular deployment, the auto-healing service 900 may be scaled by running multiple instances of the rules evaluator 960 on other nodes of the cluster. The rules evaluator 960 may build the dependencies between the rules used for triage and the rules to be run for remediation. The rules evaluator 960 may perform triaging using the triage rules and may dispatch the resulting inputs for remediation execution.
In the context of the present example, the rules evaluator 960 is shown including a logic controller 961, utilities 962, a collector module 964, an open rule platform (ORP) 965, an event digest module 963, and a thread pool 966, which may also represent a polymorphic thread pool like thread pool 943. The logic controller 961 may be responsible for the rules evaluator logic. For example, the rules evaluator 960 may be responsible for binding the rules to be run. The logic controller 961 may handle mapping rules to collectors and parsers (not shown) as well as executing the rules using the ORP 965. Additionally, the logic controller 961 may be responsible for causing the required data sections to be collected using the collector module 964. In one embodiment, the logic controller 961 may use the thread pool 966 to execute the rules evaluator logic. The logic controller 961 may also perform error and exception handling. The logic controller 961 may utilize the event digest 963 to communicate with the pub/sub bus. Depending on the form in which the rules are represented, the execution of the rules may involve, for example, inferencing by an ML model or execution of conditional logic (e.g., represented by Python code).
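As a rough illustration of the rule-execution flow just described, the sketch below gathers the data sections a rule declares it needs and then evaluates the rule's conditional logic; the rule format, section names, and condition are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only): collect the sections a rule
# requires, then evaluate the rule's conditional logic against them.
def execute_rule(rule: dict, collector) -> dict:
    # Gather only the data sections this rule declares it needs.
    sections = {name: collector.collect(name) for name in rule["required_sections"]}
    # Evaluate the rule's condition (here plain Python; could be an ML model).
    triggered = rule["condition"](sections)
    return {"rule_id": rule["id"], "triggered": triggered}


# Hypothetical rule: flag a risk when any SVM has no DNS servers configured.
dns_config_rule = {
    "id": "rule.dns.config",
    "required_sections": ["svm_dns_config"],
    "condition": lambda s: any(not cfg["servers"] for cfg in s["svm_dns_config"]),
}
```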
The utilities 962 may represent helper functions needed for the functioning of the rules evaluator 960. In one embodiment, the utilities 962 may be shared across the rules evaluator 960 and the task execution engine 970.
In one embodiment, the collector module 964 represents a wrapper class for running collection needs, including collecting information from various services within the storage cluster. For example, data may be retrieved from an SMF database (e.g., an SQL collector) by using the DOT SQL package to run collection from the SMF database. The collector module 964 may use the thread pool 966 for asynchronous functionality of the collector 964. The collector 964 may be generic, for example, by accepting instructions, executing the instructions, and returning values.
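A minimal sketch of such a generic collector wrapper follows: it accepts a named instruction, dispatches it to a registered collection function, and returns the value, optionally asynchronously via a shared thread pool. The instruction names and handler registration are assumptions rather than the actual collector interface.

```python
# Minimal sketch (illustrative assumptions only): a generic collector that
# accepts instructions, executes them, and returns values.
from concurrent.futures import Future, ThreadPoolExecutor


class Collector:
    def __init__(self, pool: ThreadPoolExecutor):
        self._pool = pool
        self._handlers = {}                      # instruction name -> collection function

    def register(self, name: str, handler) -> None:
        self._handlers[name] = handler

    def collect(self, name: str, **params):
        # Synchronous collection: run the named instruction and return its value.
        return self._handlers[name](**params)

    def collect_async(self, name: str, **params) -> Future:
        # Asynchronous collection using the shared thread pool.
        return self._pool.submit(self._handlers[name], **params)
```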
The ORP 965 may provide the rules that are executed along with the infrastructure to execute the rules. The ORP 965 may be updated with the latest rules on a periodic basis or on demand from the vendor. For example, an update (e.g., update 119) received by the data management storage from an AIOps service (e.g., AIOps 120) may include a new rule set containing updated rules and/or additional rules, in either case, in the form of conditional logic, code, or an ML model, to be used by the auto-healing service 900 to determine the existence of a risk to which the data management storage solution is exposed.
The event digest module 963 may be a generic module used for communication. In one embodiment, the event digest module 963 is used to register and communicate with the pub/sub bus. For example, the event digest module 963 may include functions to subscribe or publish to auto-heal topics via the pub/sub/auto-heal topic 950.
The task execution engine 970 may be responsible for overseeing the execution of remediations and other tasks that may be distributed across the cluster. Depending on the needs of the particular deployment, the auto-healing service 900 may be scaled by running multiple instances of the task execution engine 970 on one or more other nodes of the cluster.
In the context of the present example, the task execution engine 970 is also shown including a logic controller 971, utilities 972, a collector module 974, an open rule platform (ORP) 975, an event digest module 973, and a thread pool 976, which may also represent a polymorphic thread pool like thread pool 943.
The logic controller 971 may handle the remediation logic. For example, the logic controller 971 may be responsible for mapping rules to collectors and parsers. The logic controller 971 may also take care of executing remediation actions (e.g., issuing storage commands, using an ML model to make predictions, and/or executing a remediation script), including getting the required inputs. The logic controller 971 may use the thread pool 976 to execute the remediation logic. Additionally, the logic controller 971 may take care of error and exception handling. The logic controller 971 may make use of the event digest module 973 to communicate back to the pub/sub bus.
As noted above, the utilities 972 may represent helper functions needed for the functioning of the rules evaluator 960 and/or the task execution engine 970. In one embodiment, the utilities 972 may be shared between the rules evaluator 960 and the task execution engine 970.
In one embodiment, the collector module 974 represents a wrapper class for running collection needs, including collecting information from various services within the storage cluster. For example, data may be retrieved from an SMF database (e.g., an SQL collector) by using the DOT SQL package to run collection from the SMF database. The collector module 974 may use the thread pool for asynchronous functionality of the collector 974. The collector 974 may be generic, for example, by accepting instructions, executing the instructions, and returning values.
The ORP 975 may provide the rules that are executed along with the infrastructure to execute the rules. The ORP 975 may be updated with the latest remediations on a periodic basis or on demand from the vendor. For example, an update (e.g., update 119) received by the data management storage solution from an AIOps service (e.g., AIOps 120) may include a new remediation set containing updated remediations and/or additional remediations to be used by the auto-healing service 900 to mitigate or address risks detected by the rules evaluator 960 automatically or responsive to receipt of manual approval by an administrative user.
The event digest module 973 may be a generic module used for communication. In one embodiment, the event digest module 973 is used to register and communicate with the pub/sub bus. For example, the event digest module 973 may include functions to subscribe or publish to auto-heal topics via the pub/sub/auto-heal topic 950.
In the context of the present example, the thread pool 976 generally represents a collection of polymorphic threads, which allows any of the individual threads to be shared across functions.
Returning to the rule/remediation coordinator 940, it may be responsible for overseeing one or more of the following activities:
While in the context of the present example, the various components of the auto-healing service 900 are described as being implemented “on-box” (e.g., local to the data management storage solution), given the use of REST APIs, for example, it is to be appreciated in alternative embodiments one or more or all of the rule/remediation coordinator 940, rules evaluator 960, and task execution engine 970 may be implemented “off-box” (external to the data management storage solution). For example, a cloud service may provide an auto-healing service (e.g., auto-healing service 900) on behalf of an individual cluster (e.g., data management storage solution 130), by subscribing to topics of interest that are managed by a pub/sub bus (e.g., pub/sub bus 931) implemented by the cluster and instead of AIOps updates (e.g., update 119) being delivered to the cluster (e.g., as described with reference to
Thereafter, the pub/sub bus 931 notifies the rules/remediation coordinator 940, the rules evaluator 960, and task execution engine 970 when a message is posted to a topic (e.g., pub/sub EMS topic 930 and/or pub/sub/auto-heal topic 950) to which they have subscribed. For example, following completion of the subscription requests, upon occurrence of a key EMS event to which the rules/remediation coordinator 940 is subscribed, the pub/sub bus 931 is shown notifying the rules/remediation coordinator 940 regarding the occurrence of the subscribed EMS event within the data storage system. Advantageously, in this manner, the need for such internal or external subscribers to poll for data from publishers may be eliminated and timely notifications may automatically be delivered to subscribers.
In the context of the present example, based on the notification regarding the subscribed EMS event received from the pub/sub bus 931, the rules/remediation coordinator 940 identifies a rule to which the EMS event is mapped (e.g., via rules table 911) and posts (or publishes) a rule evaluation request to the pub/sub bus 931 that is to be carried out by the rules evaluator 960. Responsive to receipt of the rule evaluation request from the rules/remediation coordinator 940 and after determining the existence of one or more subscribers (in this case, the rules evaluator 960) to the request, the pub/sub bus 931 issues a notification regarding the rule evaluation request to the rules evaluator 960.
Upon completion of the rule evaluation requested by the rules/remediation coordinator 940, the rules evaluator 960 posts (or publishes) a rule evaluation result (in this case, confirming the existence of a particular issue or risk to which the data storage system is exposed) to the pub/sub bus 931, which causes the pub/sub bus 931 to notify the rules/remediation coordinator 940.
Based on the notification regarding the rule evaluation result received from the rules evaluator 960 via the pub/sub bus 931, the rules/remediation coordinator 940 identifies a remediation (e.g., of the remediation sets 122) corresponding to the particular issue or risk, for example, that has been determined by an AIOps platform (e.g., AIOps platform 120) to address or mitigate the particular issue or risk. The rules/remediation coordinator 940 then posts (or publishes) a remediation execution request (e.g., identifying the remediation ID of the identified remediation) to be carried out by the task execution engine 970.
Responsive to receipt of the remediation execution request from the rules/remediation coordinator 940 and after determining the existence of one or more subscribers (in this case, the task execution engine 970) to the request, the pub/sub bus 931 issues a notification regarding the remediation execution request to the task execution engine 970. Upon receipt of the notification from the pub/sub bus 931, the task execution engine 970 executes the requested remediation and posts (or publishes) the result of the remediation execution (e.g., completion, success, failure, etc.).
While in the context of this simplified example, only a single rule evaluation request and a single remediation execution request are shown, it is to be appreciated during operation of the auto-healing service 900 many rule evaluations and remediation executions may be performed depending on the number of occurrences of subscribed EMS events and/or the manner in which various rules are related or grouped.
In the context of the present example, a set of rules is shown organized hierarchically with a parent rule ID (rule ID 101) at the root and three child rule IDs (rule IDs 10001, 10002, and 10003). In this manner, a complex conditional expression may be broken down into a series of less complex conditional expressions in which those rules having dependencies on other rules need not be evaluated until their respective pre-conditions have been confirmed. Those skilled in the art will appreciate the rules and remediations may be organized in various other ways.
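To make the hierarchy concrete, the following minimal sketch evaluates a parent rule and, only when its pre-condition holds, its child rules; the rule IDs mirror the example above, but the conditions and state keys are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only): hierarchical rule evaluation
# in which child rules run only after the parent rule's condition is satisfied.
RULES = {
    101:   {"condition": lambda s: s["cifs_share_offline"], "children": [10001, 10002, 10003]},
    10001: {"condition": lambda s: not s["client_authenticated"], "children": []},
    10002: {"condition": lambda s: not s["directory_service_up"], "children": []},
    10003: {"condition": lambda s: not s["client_connectivity_ok"], "children": []},
}


def evaluate(rule_id: int, state: dict, matched=None) -> list:
    matched = [] if matched is None else matched
    rule = RULES[rule_id]
    if rule["condition"](state):
        matched.append(rule_id)
        # Child rules are evaluated only once the parent condition is confirmed.
        for child_id in rule["children"]:
            evaluate(child_id, state, matched)
    return matched


state = {"cifs_share_offline": True, "client_authenticated": True,
         "directory_service_up": False, "client_connectivity_ok": True}
print(evaluate(101, state))   # -> [101, 10002]
```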
As those skilled in the art will appreciate, it may be preferable to perform event-based triggering when available as they may provide reduced overhead and complexity; however, some types of checks (e.g., best practices and performance checks) lend themselves well to scheduling. For example, if an administrative user wants to check whether a given cluster is complying with SAN best practices (e.g., as defined by the vendor), the administrator may schedule one or more rules associated with SAN best practices to run periodically (e.g., once a month). Similarly, the administrator may schedule one or more rules associated with security and/or performance checks to be performed on a periodic basis.
In one embodiment, a given rule may contain the ID(s) of the trigger events (e.g., the EMS event(s)) it is looking for. The trigger event ID information can be inferred by scanning all the active rules or by a catalog that is maintained. A coordinator (e.g., rule/remediation coordinator 940) may register with a pub/sub bus (e.g., pub/sub/EMS topic 930) for the event IDs of interest. In this manner, an auto-healing service (e.g., auto-healing service 900) may avoid listening to all events.
At block 1110, the existence of a risk to which the data storage system is exposed is determined. The risk might represent a misconfiguration of the data storage system, an environmental issue (e.g., a DNS change or network reconfiguration) that might impact the data storage system, a security issue relating to the data storage system, a performance issue relating to the data storage system, or a capacity issue relating to the data storage system. The exposure to a particular risk may be determined by evaluating one or more conditions associated with a set of one or more rules (e.g., of rule sets 121) that are indicative of a root cause of the risk. As noted above, the rules and corresponding remediations (e.g., of remediation sets 122) may have been derived based on community wisdom by an AIOps platform (e.g., AIOps platform 120) and delivered to the data storage system to facilitate automated identification of issues or risks to which the data storage system may be exposed as well as mitigation thereof via performance of the corresponding remediations.
The one or more rules may be associated with a trigger event (e.g., the occurrence of a key EMS event or a predetermined or configurable schedule). According to one embodiment, a rules evaluator (e.g., rules evaluator 960) may be directed (e.g., via a pub/sub pattern) to evaluate (execute) a set of one or more rules (e.g., organized hierarchically with a parent rule at the root and zero or more child rules) by a coordinator (e.g., rule/remediation coordinator 940). Non-limiting examples of pub/sub processing, coordinator processing, and rule execution are described further below with reference to
At block 1120, a remediation associated with the risk determined in block 1110 is identified that addresses or mitigates the risk. According to one embodiment, a given rule (e.g., a parent rule) may include information regarding or a reference to a remediation, for example, a remediation ID that may be used to look up the remediation action(s) or remediation script within a remediation table. Assuming the existence of an associated remediation, a task execution engine (e.g., task execution engine 970) may be directed (e.g., via a pub/sub pattern) to carry out (implement) a set of one or more remediation actions or a remediation script by the coordinator.
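For purposes of illustration only, the sketch below resolves a remediation from a rule result by looking up the remediation ID in a remediation table; the IDs, table layout, and action names are assumptions.

```python
# Minimal sketch (illustrative assumptions only): resolve a remediation from a
# rule result via a remediation-ID lookup in a remediation table.
from typing import Optional

REMEDIATION_TABLE = {
    "rem.dns.add_server": {
        "description": "Add a reachable DNS server to the SVM configuration",
        "actions": ["validate_dns_server", "update_svm_dns_config", "verify_lookup"],
    },
}


def lookup_remediation(rule_result: dict) -> Optional[dict]:
    remediation_id = rule_result.get("remediation_id")
    return REMEDIATION_TABLE.get(remediation_id)
```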
At block 1130, the set of one or more remediation actions are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950) the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1120. A non-limiting example of remediation execution is described further below with reference to
Examples of more specific details associated with automated remediation of a risk identified responsive to evaluation of rules associated with a trigger event and responsive to evaluation of rules on a periodic schedule are described further below with reference to
Depending on the particular implementation, the cloud-based service 1140 may be hosted within a private (e.g., a data center of the vendor of the storage clusters 1130a-n) or a public cloud (e.g., AWS, Microsoft Azure, Google Cloud Platform, or the like).
At block 1150, a rule set for a set of best practices for a data storage system is received. The rule set may be received as part of an update (e.g., update 119) distributed to the data storage system directly (as in the example of
At block 1160, after a rule evaluation has been triggered, for example, due to a rule-evaluation trigger event (e.g., occurrence of a key EMS event, a scheduled event associated with a particular rule, or an event representing that an on-demand rule-evaluation has been requested, for example, by an administrative user of the data storage system), it is determined whether a risk exists to which the data storage system is exposed by evaluating one or more rules associated with the rule-evaluation trigger event. In this example, the risk represents a deviation from a best practice by the data storage system. As noted above, the one or more rules (e.g., of rule sets 121) and corresponding remediations (e.g., of remediation sets 122) may have been derived based on community wisdom by an AIOps platform (e.g., AIOps platform 120) and delivered to the data storage system to facilitate automated identification of issues or risks to which the data storage system may be exposed as well as mitigation thereof via performance of the corresponding remediations. As also noted above, depending on the form in which the rules are represented, the execution of the rules may involve, for example, inferencing by an ML model or execution of conditional logic (e.g., represented by Python code).
According to one embodiment, a rules evaluator (e.g., rules evaluator 960) may be directed (e.g., via a pub/sub pattern) to evaluate (execute) a set of one or more rules (e.g., organized hierarchically with a parent rule at the root and zero or more child rules) by a coordinator (e.g., rule/remediation coordinator 940). Non-limiting examples of pub/sub processing, coordinator processing, and rule execution are described further below with reference to
At block 1170, it is determined whether a remediation associated with the risk (i.e., the deviation from the best practice) determined in block 1160 exists that addresses or mitigates the risk. For example, the remediation may be identified based on a rule, of the one or more rules, whose conditions have been satisfied. According to one embodiment, the rule may include information regarding or a reference to a remediation, for example, a remediation ID that may be used to look up the remediation action(s) or remediation script within a remediation table. Assuming the existence of an associated remediation, a task execution engine (e.g., task execution engine 970) may be directed (e.g., via a pub/sub pattern) to carry out (implement) a set of one or more remediation actions or a remediation script by the coordinator.
At block 1180, the set of one or more remediation actions are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950) the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1170. A non-limiting example of remediation execution is described further below with reference to
Examples of more specific details associated with automated remediation of a risk identified responsive to evaluation of rules associated with a trigger event and responsive to evaluation of rules on a periodic schedule are described further below with reference to
For purposes of illustration and without limiting the general applicability of the proposed auto-healing service to deviations from best practices, a brief description of a concrete use case relating to DNS best practices is now provided. In some implementations, a data storage system may be virtualized, for example, to support multitenancy. In such a case, there may be multiple SVMs (e.g., one for each tenant within the customer's organization). In one example, a rule may represent a best practice of each SVM being associated with a DNS meeting certain conditions indicative of having a particular operational status. The rule may be capable of running based on incoming telemetry data from the data storage system, data collected from the data storage system, or data provided through an EMS message. During evaluation of the rule, appropriate data may be gathered and various conditions may be evaluated relating to the current state of the data storage system as compared to the expected or desired state of the data storage system (represented by the best practice). For example, the rule may iterate through or otherwise evaluate all SVMs of the data storage system, each of which has its own network configuration, to check the DNS status (e.g., it is configured appropriately and is reachable) of each SVM. If the DNS status is not satisfactory for a given DNS of an SVM, it may be added to a list for subsequent remediation. Upon completion of the rule evaluation, a list of those DNSs recommended for remediation (e.g., addition of a new name to address a misconfiguration as a result of an SVM host name change) may be presented (e.g., via a system manager dashboard of the data storage system) to an administrative user of the data storage system for approval of the proposed remediation or the remediation may be automatically performed depending on the configuration of the rule at issue.
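The following is a minimal sketch of the SVM DNS check described above; the SVM representation, the port-53 reachability probe, and the notion of a satisfactory DNS status are simplified assumptions for illustration only.

```python
# Minimal sketch (illustrative assumptions only): iterate the SVMs of a data
# storage system and collect those whose DNS status deviates from the best
# practice (no servers configured, or no configured server reachable).
import socket


def check_svm_dns(svms: list) -> list:
    needs_remediation = []
    for svm in svms:                                   # each svm is assumed to be a dict
        servers = svm.get("dns_servers", [])
        reachable = False
        for server in servers:
            try:
                # Crude TCP probe of port 53; a real check would issue an
                # actual lookup against the SVM's configured DNS domain.
                socket.create_connection((server, 53), timeout=2).close()
                reachable = True
                break
            except OSError:
                continue
        if not servers or not reachable:
            needs_remediation.append(svm["name"])
    return needs_remediation
```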
While in the context of the present example, a best practice is called out as a specific type of risk to which a cluster may be exposed, it is to be appreciated that best practices may generally be treated like other risks described herein. Therefore, the other rule evaluation (or ML inferencing) and remediation activities or tasks and infrastructure relating to the various other types of risks to which a data storage system might be exposed are generally applicable to the identification of deviations from best practices and remediation or mitigation thereof.
At decision block 1210, an event indicative of a type of message published to the pub/sub bus is determined. When no message has been published, processing loops back to decision block 1210.
Responsive to a subscription request, processing continues with block 1220 at which the requester is added as a subscriber to a topic specified by the subscription request. For example, a coordinator (e.g., rule/remediation coordinator 940) may subscribe to particular EMS events of which it would like to be notified by making a subscription request to the pub/sub bus for a corresponding EMS topic (e.g., pub/sub/EMS topic 930). Similarly, a rules evaluator (e.g., rules evaluator 960) and a task execution engine (e.g., task execution engine 970) may subscribe to rule execution requests and remediation execution requests, respectively, by making subscription requests to the pub/sub bus for corresponding auto-heal topics (e.g., pub/sub/auto-heal topic 950).
Responsive to a new EMS event, processing continues with block 1230 to notify the coordinator of the new EMS event. Responsive to the new EMS event notification, the coordinator may identify a set of one or more rules (e.g., of rule sets 121) to be evaluated based on a mapping between EMS events and corresponding rules (e.g., rules tables 911) and may cause the rules evaluator to perform the evaluation, for example, by publishing a rule execution request to the pub/sub bus.
Responsive to a rule execution request message published (e.g., to the pub/sub/auto-heal topic 950 of
Responsive to a rule evaluation result message published (e.g., to the pub/sub/auto-heal topic 950 of
Responsive to a remediation execution request message published (e.g., to the pub/sub/auto-heal topic 950 of
Responsive to a remediation complete message published (e.g., to the pub/sub/auto-heal topic 950 of
At block 1305, the coordinator may upon initialization subscribe to desired EMS events. For example, the coordinator may post a subscription request message specifying the pub/sub EMS topic so as to be automatically notified by the pub/sub bus of subsequent messages posted to this topic.
At decision block 1310, it is determined whether a new EMS event has been received. If so, processing continues with decision block 1315; otherwise processing loops back to decision block 1310.
At decision block 1315, one or more rule execution pre-checks may be performed. If all pre-checks pass, processing continues with block 1320; otherwise, processing loops back to decision block 1310 to await receipt of another EMS event. In one embodiment, the one or more rule pre-checks may include performing a check regarding whether a mapping exists for the event ID of the EMS event at issue to a corresponding rule ID of a rule to be executed. If no mapping is found, the pre-checks may be treated as having failed. Alternatively, or additionally the one or more rule pre-checks may include performing a check to determine whether an entry exists in a task table (e.g., the cluster-wide task table 912 of
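A minimal sketch of these pre-checks follows, assuming simple in-memory representations of the event-to-rule mapping and the task table; the field names and status values are assumptions.

```python
# Minimal sketch (illustrative assumptions only): rule-execution pre-checks.
def rule_precheck(event_id: str, event_to_rule: dict, task_table: list) -> bool:
    # Pre-check 1: a mapping from the event ID to a rule ID must exist.
    if event_id not in event_to_rule:
        return False
    # Pre-check 2: skip if an entry for the same event is already in flight.
    for task in task_table:
        if task["event_id"] == event_id and task["status"] not in ("complete", "failed"):
            return False
    return True
```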
At block 1320, the rule(s) to be run are extracted. For example, the coordinator may determine the rule ID to which the event ID of the EMS event at issue maps.
At block 1325, details (e.g., the rule ID and the event ID) may be logged in the task table and a rule execution request message (including the rule ID and optionally the node ID to which the rule execution is being delegated if the rule execution is not to be performed by the primary node) may be posted/published to a pub/sub topic (e.g., a pub/sub “evaluate” topic) to trigger execution of the rules associated with the rule ID by a rules evaluator (e.g., the rules evaluator 960 of
At decision block 1330, it is determined whether a rule evaluation result message (e.g., a reply) has been received (e.g., from the rules evaluator). If so, processing continues with block 1335; otherwise processing loops back to decision block 1330 to await the rule evaluation result. As noted above, in one embodiment, a timeout mechanism may be used when a request is sent to the rule evaluator. If a request times out, the coordinator may perform a roll-back or roll-forward as appropriate.
At block 1335, the appropriate next step is determined based on the rule evaluation result and a checkpoint is created in a cluster-wide log to facilitate failure recovery. With respect to determining the appropriate next step, when a risk has been identified as being associated with the EMS event at issue by the rule evaluator and a remediation has been returned as part of the rule evaluation result (e.g., as part of a stateful EMS event), then the event and the associated corrective action may be brought to the attention of an administrative user of the cluster by creating an EMS alert that is displayed via a user interface of a system manager dashboard (e.g., as described and illustrated with reference to
At decision block 1345, it is determined whether remediation execution is to be performed. In the context of the present example, while no indication is received regarding remediation execution, processing loops back to decision block 1345. Responsive to the administrative user dismissing the alert displayed via the system manager dashboard (e.g., by selecting the “Dismiss” button), resulting in invocation of an auto-healing REST API (e.g., the auto-healing REST API 910 of
At decision block 1350, one or more remediation pre-checks may be performed. If all pre-checks pass, processing continues with block 1355; otherwise, processing loops back to decision block 1310 to await receipt of another EMS event. In one embodiment, the one or more remediation pre-checks may include performing a check regarding whether a mapping exists for the rule ID to a corresponding remediation ID of a remediation to be executed. If no mapping is found, the pre-checks may be treated as having failed. Alternatively, or additionally the one or more remediation pre-checks may include performing a check to determine whether an entry exists in a task table (e.g., the cluster-wide task table 912 of
At block 1355, the remediation(s) to be executed are extracted. For example, the coordinator may determine (e.g., with reference to the rules tables) the remediation ID to which the rule ID of the EMS event at issue maps.
At block 1360, a checkpoint may be created within the cluster-wide log (e.g., including the rule ID, the event ID, and the remediation ID) and execution of the remediation(s) may be requested, for example, by posting a remediation execution request message (including the remediation ID, and optionally the node ID to which the remediation execution is being delegated if the remediation execution is not to be performed by the primary node) to a pub/sub topic (e.g., a pub/sub “remediate” topic) to trigger execution of the remediation actions associated with the remediation ID by a task execution engine (e.g., the task execution engine 970 of
At decision block 1365, it is determined whether a remediation status update has been received (e.g., from the task execution engine). If so, processing continues with block 1370; otherwise, processing loops back to decision block 1365 to await the remediation status update. As noted above, in one embodiment, a timeout mechanism may be used when a request is sent to the task execution engine. If a request times out, the coordinator may perform a roll-back or roll-forward as appropriate.
At block 1370, responsive to the remediation status update, the status of the remediation is updated within the cluster-wide task table and via an EMS service (e.g., the EMS service 920 of
At decision block 1375, it is determined whether a remediation reply has been received from the task execution engine that is indicative of completion of a given remediation execution. If so, processing continues with block 1380; otherwise, processing loops back to decision block 1375 to await the remediation reply. As noted above, in one embodiment, a timeout mechanism may be used when a request is sent to the task execution engine. If a request times out, the coordinator may perform a roll-back or roll-forward as appropriate.
At block 1380, responsive to the remediation reply, the status of the remediation is updated (e.g., to a terminal state) within the cluster-wide task table and via the EMS service.
At decision block 1410, a determination is made regarding whether the rule ID contained within the rule execution request message exists. If so, processing continues with block 1430; otherwise, processing branches to block 1420 in which an error may be published. In one embodiment, the rule evaluator may consult a rules table (e.g., the rules tables 911 of
At block 1430, execution of a sequence of rules is initiated by finding child rules of the rule ID at issue. For example, rule execution logic (e.g., the logic controller 961 of
At decision block 1440, it is determined whether any specified rule conditions are satisfied for a given child rule. If all rule conditions are satisfied, processing continues with block 1460; otherwise, processing branches to block 1450 to skip the current child rule and move on to the next child rule after looping back to decision block 1440.
At block 1460, the associated remediation is identified, for example, with reference to a remediation ID associated with the rule ID at issue as indicated in the rules table.
At decision block 1470, it is determined whether the remediation was found. If so, processing continues with block 1490; otherwise, processing branches to block 1480 in which an error may be published. In one embodiment, the rule evaluator may consult the rules tables to make this determination and/or arrive at this determination as a result of a corresponding folder or file for the remediation at issue being missing or corrupted.
At block 1490, the remediation is prepared for publication, for example, by creating the remediation URL and associated parameters; and then, the “reply” (e.g., to the rule execution request message that initiated this process) may be posted/published, for example, in the form of a stateful EMS event via the EMS service to communicate to the coordinator the results of the rule evaluation.
At decision block 1510, a determination is made regarding whether the remediation ID contained within the remediation execution request message exists. If so, processing continues with decision block 1530; otherwise, processing branches to block 1520 in which an error may be published. In one embodiment, the task execution engine may consult rules tables (e.g., rules tables 911 of
At decision block 1530, a determination may be made regarding whether the issue still exists. If so, then processing continues with block 1550; otherwise, processing branches to block 1540 in which the status of the event/issue may be updated in a cluster-wide task table (e.g., the cluster-wide task table 912 of
At block 1550, the associated remediation is identified. For example, task execution logic (e.g., the logic controller 971 of
At decision block 1560, it is determined whether the remediation was found. If so, processing continues with block 1580; otherwise, processing branches to block 1570 in which an error may be published.
At block 1580, a remediation plan is created, a remediation script is executed, and the status may be updated in the cluster-wide task table as well as via the EMS service at each step of the script.
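For purposes of illustration only, the following minimal sketch runs a remediation plan step by step and reports the status after each step, for example to a task table and an EMS-style event sink; the status callback and step functions are assumptions.

```python
# Minimal sketch (illustrative assumptions only): execute a remediation plan
# step by step, reporting status after each step.
def execute_remediation(remediation_id: str, steps, report_status) -> str:
    for index, step in enumerate(steps, start=1):
        report_status(remediation_id, step=index, status="running")
        try:
            step()                                   # run one remediation action
        except Exception as exc:
            report_status(remediation_id, step=index, status=f"failed: {exc}")
            return "failure"
        report_status(remediation_id, step=index, status="complete")
    report_status(remediation_id, step=None, status="success")
    return "success"
```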
Network environment 1600, which may take the form of a clustered network environment, includes data storage apparatuses 1602a-n that are coupled over a cluster or cluster fabric 1604 that includes one or more communication network(s) and facilitates communication between data storage apparatuses 1602a-n (and one or more modules, components, etc. therein, such as node computing devices 1606a-n, for example), although any number of other elements or components can also be included in network environment 1600 in other examples. This technology provides a number of advantages including methods, non-transitory computer-readable media, and computing devices that implement the techniques described herein.
In this example, node computing devices 1606a-n may be representative of primary or local storage controllers or secondary or remote storage controllers that provide client devices 1608a-n (which may also be referred to as client nodes and which may be analogous to clients 205, 305, and 405) with access to data stored within data storage nodes 1610a-n (which may also be referred to as data storage devices) and cloud storage node(s) 1636 (which may also be referred to as cloud storage device(s) and which may be analogous to hyperscale disks 425). The node computing devices 1606a-n may be implemented as hardware, software (e.g., a storage virtual machine), or a combination thereof.
Data storage apparatuses 1602a-n and/or node computing devices 1606a-n of the examples described and illustrated herein are not limited to any particular geographic areas and can be clustered locally and/or remotely via a cloud network, or not clustered in other examples. Thus, in one example data storage apparatuses 1602a-n and/or node computing devices 1606a-n can be distributed over multiple storage systems located in multiple geographic locations (e.g., located on-premise, located within a cloud computing environment, etc.); while in another example a network can include data storage apparatuses 1602a-n and/or node computing devices 1606a-n residing in the same geographic location (e.g., in a single on-site rack).
In the illustrated example, one or more of client devices 1608a-n, which may be, for example, personal computers (PCs), computing devices used for storage (e.g., storage servers), or other computers or peripheral devices, are coupled to the respective data storage apparatuses 1602a-n by network connections 1612a-n. Network connections 1612a-n may include a local area network (LAN) or wide area network (WAN) (i.e., a cloud network), for example, that utilize TCP/IP and/or one or more Network Attached Storage (NAS) protocols, such as a Common Internet Filesystem (CIFS) protocol or a Network Filesystem (NFS) protocol to exchange data packets, a Storage Area Network (SAN) protocol, such as Small Computer System Interface (SCSI) or Fiber Channel Protocol (FCP), an object protocol, such as simple storage service (S3), and/or non-volatile memory express (NVMe), for example.
Illustratively, client devices 1608a-n may be general-purpose computers running applications and may interact with data storage apparatuses 1602a-n using a client/server model for exchange of information. That is, client devices 1608a-n may request data from data storage apparatuses 1602a-n (e.g., data on one of the data storage nodes 1610a-n managed by a network storage controller configured to process I/O commands issued by client devices 1608a-n), and data storage apparatuses 1602a-n may return results of the request to client devices 1608a-n via the network connections 1612a-n.
The node computing devices 1606a-n of data storage apparatuses 1602a-n can include network or host nodes that are interconnected as a cluster to provide data storage and management services, such as to an enterprise having remote locations, cloud storage (e.g., a storage endpoint may be stored within cloud storage node(s) 1636), etc., for example. Such node computing devices 1606a-n can be attached to the cluster fabric 1604 at a connection point, redistribution point, or communication endpoint, for example. One or more of the node computing devices 1606a-n may be capable of sending, receiving, and/or forwarding information over a network communications channel, and could comprise any type of device that meets any or all of these criteria.
In an example, the node computing devices 1606a-n may be configured according to a disaster recovery configuration whereby a surviving node provides switchover access to the storage devices 1610a-n in the event a disaster occurs at a disaster storage site (e.g., the node computing device 1606a provides client device 1608n with switchover data access to data storage nodes 1610n in the event a disaster occurs at the second storage site). In other examples, the node computing device 1606n can be configured according to an archival configuration and/or the node computing devices 1606a-n can be configured based on another type of replication arrangement (e.g., to facilitate load sharing). Additionally, while two node computing devices are illustrated in
As illustrated in network environment 1600, node computing devices 1606a-n can include various functional components that coordinate to provide a distributed storage architecture. For example, the node computing devices 1606a-n can include network modules 1614a-n and disk modules 1616a-n. Network modules 1614a-n can be configured to allow the node computing devices 1606a-n (e.g., network storage controllers) to connect with client devices 1608a-n over the network connections 1612a-n, for example, allowing client devices 1608a-n to access data stored in network environment 1600.
Further, the network modules 1614a-n can provide connections with one or more other components through the cluster fabric 1604. For example, the network module 1614a of node computing device 1606a can access the data storage node 1610n by sending a request via the cluster fabric 1604 through the disk module 1616n of node computing device 1606n when the node computing device 1606n is available. Alternatively, when the node computing device 1606n fails, the network module 1614a of node computing device 1606a can access the data storage node 1610n directly via the cluster fabric 1604. The cluster fabric 1604 can include one or more local and/or wide area computing networks (e.g., cloud networks) embodied as Infiniband, Fibre Channel (FC), or Ethernet networks, for example, although other types of networks supporting other protocols can also be used.
Disk modules 1616a-n can be configured to connect data storage nodes 1610a-n, such as disks or arrays of disks, SSDs, flash memory, or some other form of data storage, to the node computing devices 1606a-n. Often, disk modules 1616a-n communicate with the data storage nodes 1610a-n according to a SAN protocol, such as SCSI or FCP, for example, although other protocols can also be used. Thus, as seen from an OS on node computing devices 1606a-n, the data storage nodes 1610a-n can appear as locally attached. In this manner, different node computing devices 1606a-n, etc. may access data blocks, files, or objects through the OS, rather than expressly requesting abstract files.
While network environment 1600 illustrates an equal number of network modules 1614a-n and disk modules 1616a-n, other examples may include a differing number of these modules. For example, there may be a plurality of network and disk modules interconnected in a cluster that do not have a one-to-one correspondence between the network and disk modules. That is, different node computing devices can have a different number of network and disk modules, and the same node computing device can have a different number of network modules than disk modules.
Further, one or more of client devices 1608a-n can be networked with the node computing devices 1606a-n in the cluster, over the network connections 1612a-n. As an example, respective client devices 1608a-n that are networked to a cluster may request services (e.g., exchanging of information in the form of data packets) of node computing devices 1606a-n in the cluster, and the node computing devices 1606a-n can return results of the requested services to client devices 1608a-n. In one example, client devices 1608a-n can exchange information with the network modules 1614a-n residing in the node computing devices 1606a-n (e.g., network hosts) in data storage apparatuses 1602a-n.
In one example, data storage apparatuses 1602a-n host aggregates corresponding to physical local and remote data storage devices, such as local flash or disk storage in the data storage nodes 1610a-n, for example. One or more of the data storage nodes 1610a-n can include mass storage devices, such as disks of a disk array. The disks may comprise any type of mass storage devices, including but not limited to magnetic disk drives, flash memory, and any other similar media adapted to store information, including, for example, data and/or parity information.
The aggregates may include volumes 1618a-n in this example, although any number of volumes can be included in the aggregates. The volumes 1618a-n are virtual data stores or storage objects that define an arrangement of storage and one or more filesystems within network environment 1600. Volumes 1618a-n can span a portion of a disk or other storage device, a collection of disks, or portions of disks, for example, and typically define an overall logical arrangement of data storage. In one example, volumes 1618a-n can include stored user data as one or more files, blocks, or objects that may reside in a hierarchical directory structure within the volumes 1618a-n.
Volumes 1618a-n are typically configured in formats that may be associated with particular storage systems, and respective volume formats typically comprise features that provide functionality to the volumes 1618a-n, such as providing the ability for volumes 1618a-n to form clusters, among other functionality. Optionally, one or more of the volumes 1618a-n can be in composite aggregates and can extend between one or more of the data storage nodes 1610a-n and one or more of the cloud storage node(s) 1636 to provide tiered storage, for example, and other arrangements can also be used in other examples.
In one example, to facilitate access to data stored on the disks or other structures of the data storage nodes 1610a-n, a filesystem (e.g., file system layer 411) may be implemented that logically organizes the information as a hierarchical structure of directories and files. In this example, respective files may be implemented as a set of disk blocks of a particular size that are configured to store information, whereas directories may be implemented as specially formatted files in which information about other files and directories are stored.
Data can be stored as files or objects within a physical volume and/or a virtual volume, which can be associated with respective volume identifiers. The physical volumes correspond to at least a portion of physical storage devices, such as the data storage nodes 1610a-n (e.g., a RAID system, such as RAID layer 413) whose address, addressable space, location, etc. does not change. Typically, the location of the physical volumes does not change in that the range of addresses used to access them generally remains constant.
Virtual volumes, in contrast, can be stored over an aggregate of disparate portions of different physical storage devices. Virtual volumes may be a collection of different available portions of different physical storage device locations, such as some available space from disks, for example. It will be appreciated that since the virtual volumes are not “tied” to any one particular storage device, virtual volumes can be said to include a layer of abstraction or virtualization, which allows them to be resized and/or to be flexible in some regards.
Further, virtual volumes can include one or more LUNs, directories, Qtrees, files, and/or other storage objects, for example. Among other things, these features, but more particularly the LUNs, allow the disparate memory locations within which data is stored to be identified, for example, and grouped as a data storage unit. As such, the LUNs may be characterized as constituting a virtual disk or drive upon which data within the virtual volumes is stored within an aggregate. For example, LUNs are often referred to as virtual drives, such that they emulate a hard drive, while they actually comprise data blocks stored in various parts of a volume.
In one example, the data storage nodes 1610a-n can have one or more physical ports, wherein each physical port can be assigned a target address (e.g., SCSI target address). To represent respective volumes, a target address on the data storage nodes 1610a-n can be used to identify one or more of the LUNs. Thus, for example, when one of the node computing devices 1606a-n connects to a volume, a connection between the one of the node computing devices 1606a-n and one or more of the LUNs underlying the volume is created.
Respective target addresses can identify multiple LUNs, such that a target address can represent multiple volumes. The I/O interface, which can be implemented as circuitry and/or software in a storage adapter or as executable code residing in memory and executed by a processor, for example, can connect to volumes by using one or more addresses that identify the one or more LUNs.
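For purposes of illustration only, the following simplified sketch models the relationships described above among aggregates, volumes, LUNs, and SCSI target addresses. The class and field names are hypothetical and are not intended to reflect any particular on-disk layout or product API.

```python
# Conceptual model only; names and structure are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LUN:
    lun_id: str
    size_bytes: int


@dataclass
class Volume:
    name: str
    luns: List[LUN] = field(default_factory=list)  # virtual drives backing the volume


@dataclass
class Aggregate:
    name: str
    volumes: List[Volume] = field(default_factory=list)  # volumes carved from the aggregate


@dataclass
class TargetPort:
    # A single target address may expose LUNs belonging to multiple volumes.
    target_address: str
    lun_map: Dict[str, LUN] = field(default_factory=dict)

    def map_lun(self, lun: LUN) -> None:
        self.lun_map[lun.lun_id] = lun


if __name__ == "__main__":
    vol = Volume(name="vol1", luns=[LUN("lun-0", 1 << 30)])
    aggr = Aggregate(name="aggr1", volumes=[vol])
    port = TargetPort(target_address="iqn.example:target0")
    for lun in vol.luns:
        port.map_lun(lun)
    print(f"{aggr.name}/{vol.name} exposes {len(port.lun_map)} LUN(s) via {port.target_address}")
```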
The present embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. Accordingly, it is understood that any operation of the computing systems of the network environment 1600 and the distributed storage system may be implemented by a computing system using corresponding instructions stored on or in a non-transitory computer-readable medium accessible by a processing system. For the purposes of this description, a non-transitory computer-usable or computer-readable medium can be any apparatus that can store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may include non-volatile memory including magnetic storage, solid-state storage, optical storage, cache memory, and RAM.
At decision block 1710, it is determined whether a trigger event has been received. If so, processing continues with block 1720; otherwise, processing loops back to decision block 1710. According to one embodiment, the auto-healing service, for example, via a rule/remediation coordinator (e.g., rule/remediation coordinator 940) may have previously subscribed to a set of one or more key EMS events (trigger events) with a publisher/subscriber bus (e.g., Pub/Sub/EMS Topic 930) so as to be automatically notified upon the occurrence of any of the set of one or more key EMS events in the context of the distributed storage system.
At block 1720, a set of one or more rules is identified for evaluation. According to one embodiment, the auto-healing service, for example, via the rule/remediation coordinator, may maintain a mapping of the set of one or more key EMS events (trigger events) to respective sets of one or more rules to be evaluated responsive to occurrence of a given key EMS event (trigger event). As noted above, in the context of a capacity example, a non-limiting example of a key EMS event may be an EMS event indicative of a volume being X % (e.g., 80%) full. This EMS event may be mapped to an associated rule that causes a forecast to be performed and evaluated.
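By way of non-limiting illustration, the following sketch shows one way a rule/remediation coordinator might maintain the trigger-event-to-rules mapping described above. The event name, rule body, and coordinator interface are assumptions made solely for this example and do not correspond to an actual EMS or pub/sub API.

```python
# Hypothetical rule/remediation coordinator; event names and rules are
# illustrative assumptions only.
from collections import defaultdict
from typing import Callable, Dict, List

# A rule receives event/system context and returns True when a risk is indicated.
Rule = Callable[[dict], bool]


class RuleRemediationCoordinator:
    def __init__(self) -> None:
        self._event_to_rules: Dict[str, List[Rule]] = defaultdict(list)

    def register_rule(self, event_name: str, rule: Rule) -> None:
        # Maintain the key-EMS-event (trigger event) -> rules mapping.
        self._event_to_rules[event_name].append(rule)

    def subscribed_events(self) -> List[str]:
        # The trigger events the coordinator would subscribe to on the pub/sub bus.
        return list(self._event_to_rules)

    def rules_for_event(self, event_name: str) -> List[Rule]:
        # Invoked when a subscribed EMS event is published; returns the rules
        # that should now be evaluated.
        return list(self._event_to_rules.get(event_name, []))


def volume_nearly_full_rule(context: dict) -> bool:
    # Placeholder rule body; a real rule might trigger the capacity forecast.
    return context.get("used_pct", 0) >= 80


if __name__ == "__main__":
    coordinator = RuleRemediationCoordinator()
    coordinator.register_rule("volume.nearly.full", volume_nearly_full_rule)
    event_context = {"volume": "vol1", "used_pct": 82}
    for rule in coordinator.rules_for_event("volume.nearly.full"):
        print(rule.__name__, "->", rule(event_context))  # -> volume_nearly_full_rule -> True
```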
At block 1730, the set of one or more rules is evaluated with respect to one or more of historical data and a current state of the data storage system. For example, continuing with the capacity example, the associated rule may cause a forecast to be performed based on the current state of the data storage system (e.g., a particular volume is X % full) and historical data (e.g., usage of the particular volume over time) to determine when the particular volume will be at Y % (e.g., 100%) full. A risk may be identified when the forecasted fullness date is within N (e.g., 3) months.
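Continuing with the capacity example, the following purely illustrative sketch shows one simple way such a forecast might be computed, namely a least-squares linear fit over historical used-capacity samples; the forecasting technique actually employed may differ.

```python
# Illustrative capacity forecast: least-squares linear fit of used fraction over
# time; flags a risk if the projected 100%-full date falls within N months.
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def forecast_full_date(samples: List[Tuple[datetime, float]]) -> Optional[datetime]:
    """Fit used fraction (0..1) vs. time; return the estimated date at which the
    trend reaches 1.0, or None if usage is flat or shrinking."""
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() for ts, _ in samples]
    ys = [used for _, used in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
    if slope <= 0:
        return None  # flat or shrinking usage: no projected fullness date
    intercept = mean_y - slope * mean_x
    return t0 + timedelta(seconds=(1.0 - intercept) / slope)


def volume_fullness_risk(samples: List[Tuple[datetime, float]], horizon_months: int = 3) -> bool:
    full_date = forecast_full_date(samples)
    return full_date is not None and full_date <= datetime.now() + timedelta(days=30 * horizon_months)


if __name__ == "__main__":
    now = datetime.now()
    # Monthly samples growing roughly 7% per month; the volume is currently 80% full.
    history = [(now - timedelta(days=30 * k), 0.80 - 0.07 * k) for k in range(6, -1, -1)]
    print("risk within 3 months:", volume_fullness_risk(history))  # -> True
```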
At decision block 1740, it is determined whether a risk has been identified. If so, processing continues with block 1750; otherwise, processing loops back to decision block 1710. For example, continuing with the capacity example, if the forecasted fullness date is within N months from the current date, this condition may be indicative of a volume fullness risk.
At block 1750, the availability of a remediation that addresses or mitigates the identified risk is determined. For example, continuing with the capacity example, the remediation associated with the risk may be one that causes the volume size to be increased by M % (e.g., 20%).
At block 1760, an administrative user of the data storage system may be notified of the risk and the proposed automated remediation. According to one embodiment, the administrative user may be notified via a system manager dashboard (e.g., system manager dashboard 500 of
At decision block 1770, it is determined whether performance of the remediation is authorized. According to one embodiment, authorization of the remediation may be received via interaction of the administrative user with a dialog box (e.g., dialog box 600) that provides the administrative user with the option of dismissing the event (e.g., by selecting the “Dismiss” button 610) or allowing the auto-healing service to perform the proposed remediation (e.g., by selecting the “Fix It” button 611).
At block 1780, one or more remediation actions that implement the remediation are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950), the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1750.
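For purposes of illustration, the following sketch shows one possible form of such a task execution engine, in which an in-process queue stands in for the pub/sub topic and the resize call is a placeholder for the storage system's actual volume-resize operation.

```python
# Illustrative task execution engine; the queue stands in for the auto-heal
# topic and resize_volume() is a placeholder, not a real product API.
import queue
from dataclasses import dataclass


@dataclass
class RemediationRequest:
    volume: str
    current_size_gib: int
    grow_pct: int = 20  # M% from the capacity example


def resize_volume(volume: str, new_size_gib: int) -> None:
    # Placeholder for the storage system's actual volume-resize operation.
    print(f"resizing {volume} to {new_size_gib} GiB")


def task_execution_engine(auto_heal_topic: "queue.Queue[RemediationRequest]") -> None:
    # Drain pending remediation execution requests and apply the remediation.
    while not auto_heal_topic.empty():
        req = auto_heal_topic.get()
        new_size = req.current_size_gib + (req.current_size_gib * req.grow_pct) // 100
        resize_volume(req.volume, new_size)
        auto_heal_topic.task_done()


if __name__ == "__main__":
    topic: "queue.Queue[RemediationRequest]" = queue.Queue()
    topic.put(RemediationRequest(volume="vol1", current_size_gib=500))
    task_execution_engine(topic)  # -> resizing vol1 to 600 GiB
```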
While in the context of the present example, a set of one or more rules is identified for evaluation responsive to an EMS event indicative of a volume being X % full, it is to be appreciated that the rules to be evaluated may be identified based on various other EMS events and/or responsive to a periodic schedule associated with the set of one or more rules.
While in the context of the present example, authorization to perform a remediation is described as being authorized by an administrative user via a dialog box presented to the administrative user via a system manager dashboard, it is to be appreciated that, in other examples, the administrative user may configure certain remediations for automated performance without requiring such authorization. For example, as noted above, preferences relating to the desired type of remediation (e.g., automated vs. user activated) for various types of identified issues arising within the distributed storage system may be configured by the administrative user, learned from historical interactions with the administrative user, and/or based on community wisdom. The administrative user may select automated remediation for issues/risks known to arise as a result of periodic changes to the environment in which the distributed storage system operates and/or to the configuration of the distributed storage system.
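By way of non-limiting illustration, such remediation preferences might be represented as a simple per-issue-type policy, as in the following sketch; the issue-type keys and the default policy shown are assumptions.

```python
# Illustrative representation of remediation preferences; keys and defaults are
# assumptions, not a documented schema.
from enum import Enum


class RemediationMode(Enum):
    USER_ACTIVATED = "user_activated"  # require "Fix It" confirmation
    AUTOMATED = "automated"            # perform remediation without prompting


# Preferences could be set explicitly by the administrative user, learned from
# past interactions, or seeded from community wisdom.
REMEDIATION_PREFERENCES = {
    "volume_fullness": RemediationMode.AUTOMATED,
    "snapshot_policy_deviation": RemediationMode.USER_ACTIVATED,
}


def requires_authorization(issue_type: str) -> bool:
    mode = REMEDIATION_PREFERENCES.get(issue_type, RemediationMode.USER_ACTIVATED)
    return mode is RemediationMode.USER_ACTIVATED


if __name__ == "__main__":
    print(requires_authorization("volume_fullness"))            # False: auto-remediate
    print(requires_authorization("snapshot_policy_deviation"))  # True: prompt the user
```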
At decision block 1810, it is determined whether a set of one or more rules is due for evaluation in accordance with an associated periodic schedule. If so, processing continues with block 1820; otherwise, processing loops back to decision block 1810. According to one embodiment, the auto-healing service, for example, via a scheduler/job manager, may launch a task for a given set of one or more rules that periodically wakes up based on a schedule associated with the given set of one or more rules and performs an evaluation of the given set of one or more rules.
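For purposes of illustration only, the following minimal sketch shows one way a scheduler/job manager might periodically wake up and evaluate the rules associated with a given schedule; the thread-based approach and interval handling shown are assumptions.

```python
# Illustrative thread-based scheduler; interval handling and rule callables are
# assumptions made for this sketch.
import threading
import time
from typing import Callable, List


def schedule_rule_evaluation(rules: List[Callable[[], bool]],
                             interval_seconds: float,
                             stop: threading.Event) -> threading.Thread:
    def worker() -> None:
        # Sleep until the next evaluation is due, then evaluate each rule.
        while not stop.wait(interval_seconds):
            for rule in rules:
                if rule():
                    print("risk identified; looking up associated remediation")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    stop = threading.Event()
    # In practice the interval would be daily/weekly/monthly; 1 second keeps the demo short.
    schedule_rule_evaluation([lambda: True], interval_seconds=1.0, stop=stop)
    time.sleep(2.5)
    stop.set()
```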
At block 1820, the set of one or more rules associated with the periodic schedule is evaluated with respect to one or more of historical data and a current state of the data storage system. For example, as noted above, in the context of a capacity example, a rule may be run on a daily, weekly, or monthly basis to evaluate whether any volume utilized by the distributed storage system, or a particular volume thereof, is X % (e.g., 80%) full. This rule may cause a forecast to be performed and evaluated based on the current state of the data storage system (e.g., the particular volume is X % full) and historical data (e.g., usage of the particular volume over time) to determine when the particular volume will be at Y % (e.g., 100%) full. A risk may be identified when the forecasted fullness date is within N (e.g., 3) months.
At decision block 1830, it is determined whether a risk has been identified. If so, processing continues with block 1840; otherwise, processing loops back to decision block 1810. For example, continuing with the capacity example, if the forecasted fullness date is within N months from the current date, this condition may be indicative of a volume fullness risk.
At block 1840, the availability of a remediation that addresses or mitigates the identified risk is determined. For example, continuing with the capacity example, the remediation associated with the risk may be one that causes the volume size to be increased by M % (e.g., 20%).
At block 1850, an administrative user of the data storage system may be notified of the risk and the proposed automated remediation. According to one embodiment, the administrative user may be notified via a system manager dashboard (e.g., system manager dashboard 500 of
At decision block 1860, it is determined whether performance of the remediation is authorized. According to one embodiment, authorization of the remediation may be received via interaction of the administrative user with a dialog box (e.g., dialog box 600) that provides the administrative user with the option of dismissing the event (e.g., by selecting the “Dismiss” button 610) or allowing the auto-healing service to perform the proposed remediation (e.g., by selecting the “Fix It” button 611).
At block 1870, one or more remediation actions that implement the remediation are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950), the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1840.
While in the context of the present example, a set of one or more rules is identified for evaluation responsive to a particular periodic schedule (e.g., of a day, a week, or a month) associated with the set of one or more rules, it is to be appreciated that the rules to be evaluated may be identified based on other periodic schedules, an EMS event indicative of a volume being X % full, and/or various other EMS events associated with the set of one or more rules.
While in the context of the present example, authorization to perform a remediation is described as being authorized by an administrative user via a dialog box presented to the administrative user via a system manager dashboard, it is to be appreciated that, in other examples, the administrative user may configure certain remediations for automated performance without requiring such authorization. For example, as noted above, preferences relating to the desired type of remediation (e.g., automated vs. user activated) for various types of identified issues arising within the distributed storage system may be configured by the administrative user, learned from historical interactions with the administrative user, and/or based on community wisdom. The administrative user may select automated remediation for issues/risks known to arise as a result of periodic changes to the environment in which the distributed storage system operates and/or to the configuration of the distributed storage system.
While in the context of the examples described with reference to
Various components of the present embodiments described herein may include hardware, software, or a combination thereof. Accordingly, it is to be understood that operation of a distributed storage management system (e.g., data management storage solution 130, cluster 201, cluster 335, or a cluster of one or more of virtual storage systems 410a-c) or one or more components thereof may be implemented using a computing system via corresponding instructions stored on or in a non-transitory computer-readable medium accessible by a processing system. For the purposes of this description, a tangible computer-usable or computer-readable medium can be any apparatus that can store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may include non-volatile memory including magnetic storage, solid-state storage, optical storage, cache memory, and RAM.
The various systems and subsystems (e.g., file system layer 411, RAID layer 413, and storage layer 415), and/or nodes 102 (when represented in virtual form) of the distributed storage system described herein, and the processing described herein may be implemented in the form of executable instructions stored on a machine-readable medium and executed by one or more processing resources (e.g., one or more of, or a combination of one or more of, a microcontroller, a microprocessor, central processing unit core(s), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), and the like) and/or in the form of other types of electronic circuitry. For example, the processing may be performed by one or more virtual or physical computer systems (e.g., servers, network storage systems or appliances, blades, etc.) of various forms, such as the computer system described with reference to
Embodiments of the present disclosure include various steps, which have been described above. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause one or more processing resources (e.g., one or more general-purpose or special-purpose processors) programmed with the instructions to perform the steps. Alternatively, depending upon the particular implementation, various steps may be performed by a combination of hardware, software, firmware and/or by human operators.
Embodiments of the present disclosure may be provided as a computer program product, which may include a non-transitory machine-readable storage medium embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, semiconductor memories (such as ROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), and flash memory), magnetic or optical cards, or other types of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).
Various methods described herein may be practiced by combining one or more non-transitory machine-readable storage media containing the code according to embodiments of the present disclosure with appropriate special purpose or standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present disclosure may involve one or more computers (e.g., physical and/or virtual servers) (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps associated with embodiments of the present disclosure may be accomplished by modules, routines, subroutines, or subparts of a computer program product.
Computer system 1900 also includes a main memory 1906, such as a random-access memory (RAM) or other dynamic storage device, coupled to bus 1902 for storing information and instructions to be executed by processor(s) 1904. Main memory 1906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor(s) 1904. Such instructions, when stored in non-transitory storage media accessible to processor(s) 1904, render computer system 1900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 1900 further includes a read only memory (ROM) 1908 or other static storage device coupled to bus 1902 for storing static information and instructions for processor(s) 1904. A storage device 1910, e.g., a magnetic disk, optical disk or flash disk (made of flash memory chips), is provided and coupled to bus 1902 for storing information and instructions.
Computer system 1900 may be coupled via bus 1902 to a display 1912, e.g., a cathode ray tube (CRT), Liquid Crystal Display (LCD), Organic Light-Emitting Diode Display (OLED), Digital Light Processing Display (DLP) or the like, for displaying information to a computer user. An input device 1914, including alphanumeric and other keys, is coupled to bus 1902 for communicating information and command selections to processor(s) 1904. Another type of user input device is cursor control 1916, such as a mouse, a trackball, a trackpad, or cursor direction keys for communicating direction information and command selections to processor(s) 1904 and for controlling cursor movement on display 1912. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Removable storage media 1940 can be any kind of external storage media, including, but not limited to, hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk-Read Only Memory (DVD-ROM), USB flash drives and the like.
Computer system 1900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware or program logic which in combination with the computer system causes or programs computer system 1900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1900 in response to processor(s) 1904 executing one or more sequences of one or more instructions contained in main memory 1906. Such instructions may be read into main memory 1906 from another storage medium, such as storage device 1910. Execution of the sequences of instructions contained in main memory 1906 causes processor(s) 1904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media or volatile media. Non-volatile media includes, for example, optical, magnetic or flash disks, such as storage device 1910. Volatile media includes dynamic memory, such as main memory 1906. Common forms of storage media include, for example, a flexible disk, a hard disk, a solid-state drive, a magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor(s) 1904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 1900 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 1902. Bus 1902 carries the data to main memory 1906, from which processor(s) 1904 retrieve and execute the instructions. The instructions received by main memory 1906 may optionally be stored on storage device 1910 either before or after execution by processor(s) 1904.
Computer system 1900 also includes a communication interface 1918 coupled to bus 1902. Communication interface 1918 provides a two-way data communication coupling to a network link 1920 that is connected to a local network 1922. For example, communication interface 1918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 1920 typically provides data communication through one or more networks to other data devices. For example, network link 1920 may provide a connection through local network 1922 to a host computer 1924 or to data equipment operated by an Internet Service Provider (ISP) 1926. ISP 1926 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 1928. Local network 1922 and Internet 1928 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1920 and through communication interface 1918, which carry the digital data to and from computer system 1900, are example forms of transmission media.
Computer system 1900 can send messages and receive data, including program code, through the network(s), network link 1920 and communication interface 1918. In the Internet example, a server 1930 might transmit a requested code for an application program through Internet 1928, ISP 1926, local network 1922 and communication interface 1918. The received code may be executed by processor(s) 1904 as it is received, or stored in storage device 1910, or other non-volatile storage for later execution.
All examples and illustrative references are non-limiting and should not be used to limit any claims presented herein to specific implementations and examples described herein and their equivalents. For simplicity, reference numbers may be repeated between various examples. This repetition is for clarity only and does not dictate a relationship between the respective examples. Finally, in view of this disclosure, particular features described in relation to one aspect or example may be applied to other disclosed aspects or examples of the disclosure, even though not specifically shown in the drawings or described in the text.
The foregoing outlines features of several examples so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the examples introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202241043049 | Jul 2022 | IN | national |
This application is a continuation-in-part of U.S. patent application Ser. No. 18/392,807, filed Dec. 21, 2023, which is a continuation-in-part of U.S. patent application Ser. No. 18/301,091, filed on Apr. 14, 2023, which claims the benefit of Indian Provisional Application No. 202241043049, filed on Jul. 27, 2022. All of the aforementioned applications are hereby incorporated by reference in their entirety for all purposes.
Relationship | Number | Date | Country
---|---|---|---|
Parent | 18392807 | Dec 2023 | US
Child | 18646119 | | US
Parent | 18301091 | Apr 2023 | US
Child | 18392807 | | US