Various embodiments of the present disclosure generally relate to monitoring and remediation of the health of information technology (IT) equipment, clusters thereof, and/or services deployed within a private or public cloud, for example, running on virtual machines (VMs) or containers (or pods) managed by a container orchestration platform. In particular, some embodiments relate to an auto-healing feature that monitors events within one or more clusters of nodes each representing a distributed data management storage system and facilitates automated remediation of noncompliance with best practices by identifying corresponding appropriate courses of action.
Data is the lifeblood of every business and must flow seamlessly to enable digital transformation, but companies can extract value from data only as quickly as the underlying infrastructure can manage it. Data centers and the applications they support are becoming increasingly complex. Issues arising in an on-premises or public cloud-based data management storage solution can have an adverse effect on an organization and can cause loss of revenue as a result of downtime. Troubleshooting issues (e.g., deviations from best practices) and fixing them is often time consuming and exhausting, and it distracts users from other business objectives and customer-service-related tasks.
Systems and methods are described for automated remediation of deviations from best practices in the context of a data management storage system. According to one embodiment, after receiving a notification regarding a rule-evaluation trigger event, a determination is made regarding an existence of a deviation from a best practice by a data storage system by: (i) identifying a set of one or more rules associated with the rule-evaluation trigger event, in which the set of one or more rules define one or more conditions that are indicative of a root cause of the deviation; and (ii) evaluating the set of one or more rules with respect to one or more of historical data and a current state of the data storage system. Based on the set of one or more rules, a determination is made regarding whether a remediation associated with the deviation that addresses or mitigates the deviation is available.
Other features of embodiments of the present disclosure will be apparent from the accompanying drawings and from the detailed description that follows.
The present disclosure is best understood from the following detailed description when read with the accompanying figures.
The drawings have not necessarily been drawn to scale. Similarly, some components and/or operations may be separated into different blocks or combined into single blocks for the purposes of discussion of some embodiments of the present technology. Moreover, while the technology is amenable to various modifications and alternate forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the technology to the particular embodiments described or shown. Rather, the technology is intended to cover all modifications, equivalents, and alternatives.
Systems and methods are described for automated remediation of deviations from best practices in the context of a data management storage system. At present, some storage equipment and/or data management storage solution vendors monitor customer clusters using automated support (“ASUP”). ASUP is often used to proactively monitor the health of the storage system and automatically send messages to the vendor, internal support teams, or support partners. These messages can include telemetry data, configuration details, system status, performance metrics, system events, as well as other data that may be useful to proactively detect and avoid potential issues.
In some cases, tens or hundreds of thousands of deployed assets (e.g., storage controllers) of a particular data management storage solution vendor may deliver ASUP telemetry data to the vendor on a regular basis. As a result, the volume of data collected by the ASUP back-end system (e.g., ASUP records including events, configuration details, logs, performance data, counter information, etc.) contains a wealth of system information. Many of these messages may be benign event messages indicative of a healthy system, while a few of the messages may be indicative of a problem. For example, relatively simple call-home messages can create a significant volume of data while mostly indicating the deployed assets are able to communicate with the ASUP back-end system. Moreover, the deployed assets will have different configurations (e.g., hardware and software configurations) running different applications. As such, effectively and efficiently sorting through, identifying, and interpreting the large volume of data reported to the ASUP back-end system presents significant challenges.
Some storage equipment and/or data management storage solution vendors may allow administrative users of customers to log in (e.g., via cloud-based portals) to check for issues associated with their installation and then proceed with manual fixes based on the community wisdom. Unfortunately, each customer has to primarily rely on their own expertise and resources to identify and resolve problems with the storage solutions. While vendor support personnel are typically available to assist, they often have to manually parse through the ASUP data to identify actual operational problems and failures, particularly where a large number of benign event messages are produced.
Even where it is determined that one or more actions should be taken with respect to the storage system, traditional ASUP functionality does not automatically take responsive action. Instead, separate tools have been used to manually initiate change within the storage solutions where an issue has been identified. One historical reason for this is that security and other operational concerns usually mandate that changes to the storage system be tightly controlled. Thus, allowing the ASUP back-end to initiate manipulation of the storage system or components thereof, without being at the direction or under control of the front-end system, is generally not acceptable to the users or operators of the storage system. As a result, the support personnel who may have identified a problem generally need to work with front-end management personnel in order to identify a solution and directly manipulate system changes (e.g., through storage administrator use of a management application).
One drawback with these traditional approaches is time. The time taken to identify and fix an issue may be quite long. The length can depend on a number of factors including, but not limited to, the time between check-ins by the administrative user, the knowledge of the administrative user, the type of issue, the severity of the problem, and the like. With respect to best practices, in particular, the volume and technical nature of documentation (e.g., best practice guides and/or technical reports provided by a vendor of the storage system) may be difficult for customer personnel (e.g., an administrative user of the storage system) to digest. Additionally, the descriptions of particular best practices in their current form may not be easily translated by customer personnel into operational terms, whether for evaluating compliance or non-compliance or for performing appropriate remediation(s).
Various embodiments of the present technology allow for an intelligent data infrastructure that can proactively monitor system data from multiple deployed storage solutions, identify various insights by learning from system data, and provide auto-healing functionality. For example, in one embodiment, rules may be executed by a data management storage solution to identify deviations from best practices. When a deviation is identified, a corresponding remediation may be identified and potentially automatically implemented to bring the configuration or operation of the data management storage solution into compliance with the best practice at issue.
In some embodiments, the received ASUP telemetry data can be added to a multi-petabyte data lake and processed through one or more machine-learning (ML) classification models to perform predictive analytics and arrive at “community wisdom” derived from the vendor's installed user base. Various embodiments described herein seek to provide an insight-based approach to risk detection and remediation including more proactively addressing issues before they turn into more serious problems. For example, by continuously learning from the community wisdom and making it available for use by cognitive computing co-located with a customer's cluster, insights may be extracted from this data to deliver actionable intelligence.
The general idea behind some embodiments is to offer storage consumers insights (e.g., a set of words, phrases, visual cues, or other indicators providing a level of understanding or discernment), guidance, and actions into issues that are affecting their environment rather than an endless list of cryptic error events. Such insights, guidance, and actions (collectively referred to as actionable intelligence) can lead to higher availability, improved security, and simplified storage operations. When derived based at least in part from information (e.g., telemetry data, interactions with support staff, and the like) received from the vendor's consumer base, the actionable intelligence may be referred to as community wisdom. In various embodiments described herein, the actionable intelligence may be operationalized by locally triggering automated evaluation of a set of one or more rules to identify the existence of a particular risk to which a customer's distributed data management storage system (e.g., in the form of a cluster of nodes) is exposed. In response to identifying a particular risk to which the distributed data management storage system is exposed, associated insights, guidance, and actions may be presented via a system manager dashboard as part of an alert to an administrative user that will facilitate maintaining the health and resiliency of the customer's cluster.
By moving rich ML models and/or rule sets (which may individually or collectively be referred to as rules, a rule set, or rule sets) local to customer clusters, the provision of proactive and real-time health analysis, notifications to customers, and automated remediation (auto healing) is facilitated. As described further below, in one embodiment, various rule sets for identifying the existence of various risks to which the customer's cluster may be exposed and various remediation sets, including remediation actions or scripts for mitigating the various identified risks, may be proactively delivered to the customer's cluster by an artificial intelligence for IT operations (AIOps) platform that derives the rule sets and remediation sets based on community wisdom collected from a vendor's consumer base. Some remediations and rules may be generally applicable to all of the products/services of a vendor, whereas other remediations and rules may be more narrowly applicable to only a subset of the products/services of the vendor. In some examples, only those rules and remediations derived from community wisdom that are deemed to be relevant or applicable (e.g., based on the cluster being of a same or similar class and/or type of data storage system as the community wisdom) may be delivered to the customer's cluster. Based on the rule sets received by a customer's cluster, monitoring may be performed to identify a risk to which the customer's cluster is exposed (via inferencing performed by a rule set provided in the form of a rich ML classification model and/or by analysis performed by a rule engine on a rule set provided in the form of conditional logic). After identifying a risk to which the customer's cluster is exposed, a corresponding remediation may be identified that mitigates or addresses the risk.
In some examples, a predefined or configurable set of event management system (EMS) events may be used to trigger a deep analysis (e.g., via a local ML classification model and/or via a rule engine, as the case may be) to identify the existence of a risk to the cluster or a node thereof. When such a risk is determined to exist, an alert may be raised and presented via a system manager dashboard associated with the cluster. Alternatively, or in addition to the triggering of rule evaluation by a rule engine responsive to certain EMS events, some rules may be run (or evaluated) by the rule engine on a periodic schedule. For example, a scheduler/job manager may execute rules on a schedule specified by the rules themselves. In this manner, active risks may be checked on a periodic basis (by re-running an associated rule) to determine if the risk condition still exists or has been resolved. Risks that are known to arise as a result of periodic changes may be good candidates for checking on a periodic schedule.
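For purposes of illustration only, the following Python sketch shows one way schedule-based rule evaluation might be organized; the Rule data structure, run_due_rules function, and field names are hypothetical and do not correspond to any particular rule engine implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Rule:
    """Hypothetical schedule-based rule: evaluate() returns True when the risk exists."""
    name: str
    evaluate: Callable[[Dict], bool]        # receives a snapshot of system state
    interval_seconds: Optional[int] = None  # None => trigger-based only
    last_run: float = field(default=0.0)


def run_due_rules(rules, state, now=None):
    """Evaluate any schedule-based rules whose interval has elapsed."""
    now = now if now is not None else time.time()
    active_risks = []
    for rule in rules:
        if rule.interval_seconds is None:
            continue  # trigger-based rules are driven by EMS events instead
        if now - rule.last_run >= rule.interval_seconds:
            rule.last_run = now
            if rule.evaluate(state):
                active_risks.append(rule.name)
    return active_risks


# Example: re-check volume fullness every 15 minutes.
rules = [Rule("volume-nearly-full",
              evaluate=lambda s: s.get("volume_used_pct", 0) >= 90,
              interval_seconds=900)]
print(run_due_rules(rules, {"volume_used_pct": 95}))  # ['volume-nearly-full']
```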
In one embodiment, auto-heal functionality may be enabled by monitoring a data management storage solution or a data storage system thereof (e.g., the Data ONTAP storage operating system available from NetApp, Inc. of San Jose, CA) for key events via a publisher/subscriber pattern (e.g., a Pub/Sub bus) and signaling an analytic engine when an issue is identified based on an event. Identified issues may be further analyzed using the rich community wisdom and such analysis may be mapped to known rules to facilitate determination of a root cause and a corresponding appropriate course of action. An administrative user of the data management storage solution may then be notified via an EMS of the issue (e.g., a risk, an error, or a failure) and potential corrective action (e.g., a remediation). Alerts may be provided in the form of an EMS stateful event (e.g., an EMS event that contains state information). The state information may include a corrective action identified for the issue at hand. In some embodiments, the state information may include sufficient information for external infrastructure or a cloud-based service (e.g., a third-party cloud-based workflow automation platform, such as ServiceNow or the like, or a cloud-based service of the vendor of the storage system) to remotely initiate performance of remediations. For example, the state information may include information regarding the API (e.g., exposed by the storage system or by the auto-healing service) to call as well as any information needed to make the call, for example, authentication information.
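As a purely illustrative example, the state information carried by such an EMS stateful event might resemble the following Python structure; the field names and endpoint path shown are hypothetical and are not an actual EMS schema or API.

```python
# Hypothetical shape of the state information carried by an EMS stateful event.
alert_event = {
    "event_name": "autoheal.risk.detected",        # illustrative event name
    "severity": "alert",
    "state": {
        "risk_id": "nas-best-practice-0042",       # risk identified by the rule engine
        "summary": "Volume export policy deviates from a best practice",
        "corrective_action": "Apply the recommended export-policy configuration",
        "remediation": {
            # enough information for external infrastructure or a cloud-based
            # service to remotely initiate performance of the remediation
            "api": "/api/private/autoheal/remediations/nas-best-practice-0042/apply",
            "method": "POST",
            "auth": "certificate",                 # e.g., mutual TLS or token-based auth
            "requires_user_approval": True,
        },
    },
}
```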
Depending upon the particular implementation, some issues may be automatically remediated, while others may be proactively brought to the attention of the administrative user and remediated upon receipt of authorization from the administrative user. Preferences relating to the desired type of remediation (e.g., automated vs. user activated) for various types of identified issues arising within the data management storage solution may be configured by the administrative user, learned from historical interactions (e.g., dismissal of similar issues or approving automated application of a remediation for similar issues) with the administrative user, and/or based on community wisdom. For example, the administrative user may select automated remediation for issues/risks known to arise as a result of periodic changes to the environment in which the data management storage solution operates and/or to the configuration of the data management storage solution. Auto-healing data management storage solution nodes and/or the cluster adds customer value by monitoring and fixing (or at least mitigating) issues before they become more serious problems, thereby freeing administrative users from researching and implementing remediations and instead allowing them to spend time on more strategic objectives.
While for purposes of explanation, various specific examples of events (e.g., Network Attached Storage (NAS) events), risks (e.g., deviation from a particular best practice), and corresponding remediations are described herein, it is to be appreciated the methodologies described herein are broadly applicable to other types of events (e.g., storage area network (SAN) events, security issues, performance issues, capacity issues, other best practices, and/or compliance issues). More broadly speaking, and as described further below, the methodologies described herein are applicable to any signaling event (e.g., a manually initiated check, an event management system (EMS) event, expiration of a timer, or the like) that can be associated with a rule that does the analysis, for example, to identify the existence or non-existence of a risk to the storage system, and that, when the risk associated with the rule is determined to exist, further determines an associated corrective action. For example, the described approach can be applied to misconfiguration issues, environmental issues, security issues, performance issues, capacity issues, deviations from best practices, and/or compliance issues. While, for simplicity, in the context of various examples a rule may be said to be caused to be evaluated after a given signaling event, it is to be appreciated certain rules may be grouped together in various combinations and all of such rules in the group or set may be evaluated (e.g., in series or in parallel). For example, at certain predefined time intervals or responsive to other events arising in the storage system, a given set of multiple rules (e.g., relating to capacity fullness of all volumes of the storage system, best practices relating to security objectives for the storage system, such as confidentiality, integrity, and availability, or other related or unrelated predefined or configurable groupings of rules) may all be evaluated.
Various embodiments of the present technology provide for a wide range of technical effects, advantages, and/or improvements to computing systems and components. For example, various embodiments may include one or more of the following technical effects, advantages, and/or improvements: 1) use of non-routine and unconventional operations to facilitate operationalization of community wisdom in the form of rule sets and remediation sets derived therefrom that may be proactively distributed by an artificial intelligence for IT operations (AIOps) platform to a data management storage solution; 2) use of an auto-healing service architecture for coordination of execution of rules and remediations for a data management storage solution; 3) providing an insight-based approach to risk detection and remediation, including more proactively addressing issues before they turn into more serious problems; 4) cross-platform integration of system monitoring capabilities with machine learning and artificial intelligence to automatically monitor and heal (e.g., repair or reallocate) storage solutions in a timely and efficient manner; 5) use of non-routine and unconventional operations and system configurations to analyze the health of storage solutions to improve the speed of the diagnosis and resolution of customer issues; 6) provide an integrated monitoring platform that uses non-routine and unconventional techniques to both reactively and proactively detect, correct, and/or avoid potential issues with storage solutions; 7) use of a distributed architecture with local cognitive computing co-located with customer storage using non-routine and unconventional operations to analyze data and submit issues and solutions to global ASUP platform for additional analysis and integration; and 8) facilitating a more intelligent data infrastructure by continuously learning from community wisdom and making rules and remediations derived therefrom available for use by cognitive computing co-located with a customer's storage cluster to facilitate auto remediation (auto-healing) functionality.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, to one skilled in the art that embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
Brief definitions of terms used throughout this application are given below.
A “computer” or “computer system” may be one or more physical computers, virtual computers, or computing devices. As an example, a computer may be one or more server computers, cloud-based computers, cloud-based cluster of computers, virtual machine instances or virtual machine computing elements such as virtual processors, storage and memory, data centers, storage devices, desktop computers, laptop computers, mobile devices, or any other special-purpose computing devices. Any reference to “a computer” or “a computer system” herein may mean one or more computers, unless expressly stated otherwise.
The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed therebetween, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.
If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.
As used in the description herein, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The phrases “in an embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present disclosure and may be included in more than one embodiment of the present disclosure. Importantly, such phrases do not necessarily refer to the same embodiment.
As used herein, the term “storage operating system” generally refers to computer-executable code operable on a computer to perform a storage function that manages data access and may, in the case of a storage system (e.g., a node of a cluster representing a distributed storage system), implement data access semantics of a general purpose operating system. The storage operating system can also be implemented as a microkernel, an application program operating over a general-purpose operating system, such as UNIX or Windows NT, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein. A non-limiting example of a storage operating system that may implement one or more of the various file system, Redundant Array of Independent Disks (RAID), storage, auto-healing, rule evaluation, ML model training and inferencing, remediation, and other functionality described herein is the ONTAP data management software available from NetApp, Inc.
As used herein a “cloud” or “cloud environment” broadly and generally refers to a platform through which cloud computing may be delivered via a public network (e.g., the Internet) and/or a private network. The National Institute of Standards and Technology (NIST) defines cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” P. Mell, T. Grance, The NIST Definition of Cloud Computing, National Institute of Standards and Technology, USA, 2011. The infrastructure of a cloud may be deployed in accordance with various deployment models, including private cloud, community cloud, public cloud, and hybrid cloud. In the private cloud deployment model, the cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units), may be owned, managed, and operated by the organization, a third party, or some combination of them, and may exist on or off premises. In the community cloud deployment model, the cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations), may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and may exist on or off premises. In the public cloud deployment model, the cloud infrastructure is provisioned for open use by the general public, may be owned, managed, and operated by a cloud provider (e.g., a business, academic, or government organization, or some combination of them), and exists on the premises of the cloud provider. The cloud service provider may offer a cloud-based platform, infrastructure, application, or storage services as-a-service, in accordance with a number of service models, including Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), and/or Infrastructure-as-a-Service (IaaS). In the hybrid cloud deployment model, the cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
As used herein “AutoSupport” or “ASUP” generally refers to a telemetry mechanism that proactively monitors the health of a cluster of nodes (e.g., implemented in physical or virtual form) and/or individual nodes of a distributed computing system. A non-limiting example of a distributed computing system is a distributed data management storage solution (or a distributed storage system), for example, in the form of a cluster of nodes.
As used herein “community wisdom” generally refers to data received from and/or derived from a user base of one or more products/services of a vendor. A non-limiting example of such a product/service is a distributed storage system. Community wisdom may be collected to acquire a deep knowledge base to which predictive analytics and cognitive computing may be applied to derive insight-driven rules for identifying exposure to particular risks and insight-driven remediations for addressing or mitigating such risks. In the context of the enterprise-data-storage market, even a one to two percent market share represents a massive user base from which billions of data points may be gathered by a vendor on a daily basis from potentially hundreds of thousands of data management storage solutions. Insights may be extracted from this data by or on behalf of the vendor with cloud-based analytics that combine predictive analytics and proactive support to deliver actionable intelligence. Community wisdom may be said to be relevant to or applicable to a particular data storage system when such community wisdom was received from or derived from a same or similar class (e.g., entry-level, midrange, or high-end), and/or type (e.g., on-premise, cloud, or hybrid) of data storage system. Other classifications may include, but are not limited to workload type (e.g., high throughput, read only, etc.), features that are enabled (e.g., snapshot, replication, data reduction, Internet small computer system interface (iSCSI) protocol), applications running on the storage controllers, hardware (e.g., serial-attached SCSI (SAS), serial advanced technology attachment (SATA), non-volatile memory express (NVMe) disks, cache adapter installed, network adapters, and so on), system-defined performance service level (e.g., extreme performance (extremely high throughput at a very low latency), performance (high throughput at a low latency), value (high storage capacity and moderate latency), extreme for database logs (maximum throughput at the lowest latency), extreme for database shared data (very high throughput at the lowest latency), extreme for database data (high throughput at the lowest latency)).
As used herein, a “best practice” or “recommended practice” generally refers to a standard or a guideline that provides the best course of action in a given situation. In the realm of technology, a best practice may refer to a method, a technique, a configuration, or the like that is accepted as superior because it produces results that are better than those achieved by other means. In the context of various examples described herein, when planning and optimizing a storage system deployment, for example, within different ecosystems or with different protocols, there may be a variety of best practices for making use of certain features and capabilities of a storage system of a particular family, model, type, and/or class. For example, best practices may be related to how to optimally carry out a particular task within a cluster of nodes representing a distributed storage system, or how to optimally configure the cluster or an individual node of the cluster when making use of particular functions/features of the storage operating system. Non-limiting examples of classes or groups of best practices, which may be granular with respect to a particular family, model, type and/or class of storage system, the current version of the storage operating system, and/or functions/features enabled within the storage operating system, may relate to one or more of the following:
Implementation and usage of data protection and disaster recovery mechanisms (e.g., based on data replication solutions, such as NetApp SnapMirror storage and data replication software available from NetApp, Inc.), which may be available in an asynchronous and/or a synchronous replication configuration,
As used herein, “on-box” generally refers to or describes one or more functions, processes, services, or features implemented local to or on a data management storage solution (e.g., a node of nodes of a physical or virtual storage system), whereas “off-box” generally refers to or describes one or more functions, processes, services, or features implemented remote from or external to the data management storage solution.
As used herein, a “risk” may identify an issue within a cluster of nodes and/or individual nodes of a distributed computing system (e.g., data management storage solution). A risk may be communicated to an auto-heal system as an alert (e.g., an EMS event that contains state information (an EMS stateful event)). In some embodiments, the state information contained within an EMS stateful event may include an associated corrective action (e.g., a remediation). In one embodiment, risk identification may be triggered responsive to a predefined or configurable set of EMS events, which may be referred to herein as key EMS events. Risk identification may additionally or alternatively be performed responsive to rules that are run on a periodic schedule, responsive to configuration changes made to the distributed computing system, or on demand (e.g., responsive to a request made by an administrative user of the cluster). A deviation from a best practice is a non-limiting example of a risk, for example, to be addressed and/or brought to the attention of support or administrative personnel.
As described herein, a “remediation” generally represents one or more corrective actions that may be used to resolve an identified risk. In some embodiments, in order to facilitate auto-healing, remediations may comprise Python code. In other cases, remediations may be provided in the form of detailed directions (e.g., similar to the type of guidance and/or direction that might be received via level 1 (L1) or level 2 (L2) technical support) to allow an administrative user to perform remediations manually. Non-limiting examples of remediation actions include configuration recommendations for a data management storage solution or node thereof and command recommendations to be issued to a data management storage solution or node thereof, for example, via a command-line interface (CLI), a REST API, or a graphical user interface (GUI).
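The following minimal sketch, offered only as an illustration, shows one hypothetical way a remediation might bundle manual guidance with an optional automated corrective action; the Remediation class, apply_fix function, and risk identifier are assumptions of this example rather than an actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def apply_fix() -> bool:
    # Placeholder for an automated corrective action; a real remediation might
    # issue a CLI command or a REST call against the storage system here.
    print("Applying recommended configuration change for risk 'nfs-udp-enabled'")
    return True


@dataclass
class Remediation:
    """Hypothetical container pairing an identified risk with corrective actions."""
    risk_id: str
    manual_steps: List[str]                      # L1/L2-style guidance for manual recovery
    apply: Optional[Callable[[], bool]] = None   # optional automated fix (auto-heal)


remediation = Remediation(
    risk_id="nfs-udp-enabled",
    manual_steps=[
        "Review NFS clients that still mount over UDP.",
        "Disable UDP for the NFS server on the affected storage VM.",
    ],
    apply=apply_fix,
)
```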
As described herein, “rules” may be used to identify risks within a cluster of nodes and/or individual nodes of a distributed computing system (e.g., a data management storage solution). In some examples, the rules may be represented in the form of self-contained Python file(s) that contain code to identify a given issue (risk). For example, a rule may include one or more conditions or conditional expressions involving the current or historical state (e.g., configuration and/or event data) of the cluster or individual nodes that, when true, are indicative of the cluster or an individual node being exposed to the given risk. In some embodiments, rules may be hierarchically organized in parent-child relationships, for example, with zero or more child rules depending from a parent rule. A rule may contain or otherwise be associated with information as to whether it can be remediated. If so, the rule may also contain or be associated with steps for remediating the issue and/or explaining how the issue can be remediated. In one embodiment, rules can be executed based on a trigger or a schedule. In the context of trigger-based rules, a publisher/subscriber bus message, for example, identifying the occurrence of a key EMS event may represent the source of a trigger and may be associated with one or more rules to be executed. In the context of schedule-based rules, a scheduler or job manager may execute a given rule in accordance with a schedule associated with the given rule. In other examples, rules or rule sets may be represented in the form of a machine-learning (ML) algorithm or model, for example, an ML classification model or a deep learning model, such as a Recurrent Neural Network (RNN).
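Purely by way of example, a self-contained rule file might take a shape along the following lines; the event name, module attributes, remediation identifier, and check signature are hypothetical and are not drawn from any particular rule engine or EMS catalog.

```python
"""Hypothetical layout of a self-contained rule file (illustrative names only)."""

# Trigger events and/or a schedule that cause the rule engine to evaluate this rule.
TRIGGER_EVENTS = ["volume.space.nearlyFull"]   # hypothetical key EMS event name
SCHEDULE = "hourly"                            # optional periodic re-check

# Whether this rule can be remediated, and by which remediation.
REMEDIABLE = True
REMEDIATION_ID = "grow-volume-by-percentage"


def check(current_state: dict, historical_samples: list) -> bool:
    """Return True when the condition indicative of the risk's root cause holds."""
    used_pct = current_state.get("volume_used_pct", 0)
    # Example condition: usage above 90% and trending upward over recent samples.
    trending_up = (len(historical_samples) >= 2
                   and historical_samples[-1] > historical_samples[0])
    return used_pct >= 90 and trending_up
```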
As used herein, a “publisher/subscriber bus,” a “publisher-subscriber bus,” a “pub/sub bus,” a “pub-sub bus” and the like generally refer to a messaging queue system that facilitates communication among publishers and subscribers. Publishers generally represent systems, components, or applications that produce or generate events or data and subscribers generally represent systems, components, or applications that desire to be made aware of the availability of data produced by one or more publishers or the occurrence of certain events or data relating to one or more publishers. A pub/sub bus eliminates the need for subscribers to poll for data from publishers (e.g., via an application programming interface exposed by a publisher) and instead implements a subscription model. For example, subscribers may subscribe to the data or topic(s) of interest (e.g., the occurrence of a particular event or type of event within a data storage management solution) generated by or otherwise associated with one or more publishers via Application Programming Interfaces (APIs) (e.g., Representational State Transfer (REST) APIs) exposed by a storage operating system of a data storage management solution for use by authorized internal and/or external entities. Non-limiting examples of a pub/sub bus include NetApp ONTAP Pub/Sub. Non-limiting examples of message brokers that may be used to facilitate implementation of a pub/sub bus include Apache Qpid, ActiveMQ, and RabbitMQ.
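The following toy sketch illustrates the publish/subscribe pattern itself; it is an in-process stand-in only (a production bus would rely on a message broker such as those named above), and the topic and event names are hypothetical.

```python
from collections import defaultdict


class PubSubBus:
    """Toy in-process publisher/subscriber bus illustrating the pattern only."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register interest in a topic instead of polling a publisher."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


bus = PubSubBus()
# The auto-healing service subscribes to a topic of interest.
bus.subscribe("ems.key_events", lambda msg: print("analyze:", msg["event_name"]))
# A publisher (e.g., the event management system) emits an event on that topic.
bus.publish("ems.key_events", {"event_name": "volume.space.nearlyFull"})
```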
As used herein, a “storage volume” or “volume” generally refers to a container in which applications, databases, and file systems store data. A volume is a logical component created for the host (e.g., a client) to access storage of the underlying primary storage tier associated with a storage system. A volume may be created from the capacity available in a storage pod, a pool, or a volume group. A volume has a defined capacity. Although a volume might consist of more than one storage drive, a volume appears as one logical component to the host (e.g., a client). Non-limiting examples of a volume include a flexible volume and a flexgroup volume.
As used herein, a “flexible volume” generally refers to a type of storage volume that may traditionally be efficiently distributed across multiple storage devices. A flexible volume may be capable of being resized to meet changing business or application requirements. In some embodiments, a storage system may provide one or more aggregates and one or more storage volumes distributed across a plurality of nodes interconnected as a cluster. Each of the storage volumes may be configured to store data such as files and logical units. A flexible volume may be contained within a storage aggregate (e.g., representing a set of storage devices (disks)) and includes at least one storage device. The storage aggregate may be abstracted over a RAID plex where each plex comprises a RAID group. Moreover, each RAID group may comprise a plurality of storage disks. As such, a flexible volume may comprise data storage spread over multiple storage disks or devices. A flexible volume may be loosely coupled to its containing aggregate (e.g., a file system aggregate, such as a WAFL aggregate). A flexible volume can share its containing aggregate with other flexible volumes. Thus, a single aggregate can be the shared source of all the storage used by all the flexible volumes contained by that aggregate. A non-limiting example of a flexible volume is a NetApp ONTAP FlexVol volume.
As used herein, a “flexgroup volume” generally refers to a single namespace that is made up of multiple constituent/member volumes. A non-limiting example of a flexgroup volume is a NetApp ONTAP FlexGroup volume that can be managed by storage administrators, and which acts like a flexible volume. In the context of a flexgroup volume, “constituent volume” and “member volume” are interchangeable terms that refer to the underlying volumes (e.g., flexible volumes) that make up the flexgroup volume.
“AIOps” is an umbrella term for the use of big data analytics, ML, and/or other artificial intelligence (AI) technologies to automate the identification and resolution of common IT issues or risks. Separate rule sets 121 may be generated for different types and/or classes of data storage systems or based on features enabled within the data storage systems. Similarly, separate remediation sets 122 may be created for different types and/or classes of data storage systems or based on features enabled within the data storage systems. A non-limiting example of the AIOps platform 120 is the NetApp Active IQ Digital Advisor available from NetApp, Inc.
As shown in
As illustrated in
Some data may be sensitive and/or capable of identifying a customer. Such data may be excluded or scrubbed by default (or when requested by a customer) before being sent to the AIOps platform. Examples of potentially sensitive data include, but are not limited to: IP addresses, MAC addresses, URIs, DNS names, e-mail addresses, port numbers, node names, SVM names, cluster names, aggregate names, volume names, junction paths, policy names, user IDs, and the like.
Given the volume of data and different configurations of storage systems, the next step is to intelligently sift “signals” out of the “noise” to identify significant events and patterns related to the existence of potential risks, application performance, and/or availability issues to which a given data management storage solution 130 may be exposed (block 124). This can be done, in accordance with various embodiments, using various machine learning techniques or classifiers. Examples of potential ML classifiers that could be used include, but are not limited to, the following: Decision Tree, Random Forest, Naive Bayes Classifier, K-Nearest Neighbors, Support Vector Machines, Artificial Neural Networks, and the like. A non-limiting example of an ML classification model in the form of a neural network model is shown and described below with reference to
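As a purely illustrative sketch, one of the listed classifiers could be trained as follows; this assumes the scikit-learn and NumPy libraries are available, uses synthetic data with hypothetical feature names, and is not representative of any vendor's actual models.

```python
# Illustrative only: a random forest classifier trained on synthetic "telemetry"
# features to separate records indicative of a risk from benign records.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: [volume_used_pct, iops_utilization, error_event_rate]
X = rng.random((1000, 3)) * [100, 100, 10]
y = ((X[:, 0] > 85) & (X[:, 2] > 5)).astype(int)   # synthetic labeling rule

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```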
Once the significant events and patterns are identified, the next step is to diagnose root causes (e.g., using Lean Six Sigma, statistical analysis tools, hierarchical clustering, and/or data-mining solutions) and report them to technical support staff, IT, and/or DevOps for rapid response and/or development of appropriate remediations that may be deployed to relevant portions of the customer base (block 125). Alternatively, in some cases, the AIOps platform 120 may automatically propose remediations without human intervention.
According to one embodiment, the AIOps platform 120 represents a big data platform that aggregates community wisdom (e.g., community wisdom 111a-b) received from or derived from multiple sources (e.g., customers'/users' interactions with technical support staff (e.g., technical support 115), support case histories, and events associated with operation of data management storage solutions of participating customers having a feedback/reporting feature enabled) (block 126).
Community wisdom may start in the form of support troubleshooting workflows, knowledge-base articles, and pattern matching within a given customer environment and/or across multiple customer environments. At block 127, the community wisdom may then be broken into multiple segments. For example, in some embodiments, the community wisdom may include a trigger event segment (e.g., the “signals” or portions thereof from above), an analysis segment, and a recovery segment. In some embodiments, the trigger events may represent significant events and patterns related to the existence of potential risks, application performance, and/or availability issues to which a given data management storage solution 130 may be exposed. The existence or occurrence of such trigger events may be used as an indicator to start an analysis process. For example, as described further below, one or more key event management system (EMS) events may be used as trigger events to evaluate corresponding sets of one or more rules to confirm or refute the existence of potential risks, application performance, and/or availability issues to which the given data management storage solution 130 may be exposed. The analysis segment may be used to determine whether an actual issue associated with the trigger event has occurred. As described further below, the analysis may include the evaluation by a rules engine of one or more rules, for example, written in Python. The recovery segment of the community wisdom, at least in some embodiments, may include the logic (or remediation) that actually corrects the identified issue. Like the rules, the remediations may also be written in Python. A given remediation may be associated with one or more rules.
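For illustration only, a single unit of community wisdom broken into these three segments might be represented as follows; the event name, rule identifier, remediation identifier, and file paths are hypothetical.

```python
# Hypothetical packaging of one unit of community wisdom into its three segments.
wisdom_entry = {
    "trigger": {"ems_event": "volume.space.nearlyFull"},          # starts the analysis
    "analysis": {"rule_id": "volume-capacity-risk",               # confirms or refutes the issue
                 "rule_module": "rules/volume_capacity_risk.py"},
    "recovery": {"remediation_id": "grow-volume-by-percentage",   # corrects the issue
                 "remediation_module": "remediations/grow_volume.py"},
}
```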
The community wisdom and the telemetry data (e.g., ASUP telemetry data 131) from which it is derived may include, among other data:
Based on the community wisdom, the AIOps platform 120 may apply focused analytics and ML capabilities to, among other things:
In one embodiment, one or more of the remediations of remediation sets 122 may be generated based on support troubleshooting workflows developed by technical support staff to identify and address problems/issues/risks observed in numerous customer cases. For example, the support troubleshooting workflows may be turned into code modules that perform deep analysis and provide automated recovery. As described further below, a non-limiting example of a potential remediation to address a capacity issue (e.g., a risk of imminently filling the storage capacity of a storage container, for example, a LUN, a volume, and/or an aggregate) causes the storage capacity of the storage container at issue to be increased by a predetermined or configurable percentage.
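A minimal sketch of such a capacity remediation is shown below, assuming the requests library, a hypothetical REST endpoint path, and bearer-token authentication; it is not a definitive implementation of any particular storage system API.

```python
import requests  # assumed available; the endpoint and payload below are hypothetical


def grow_volume(base_url: str, volume_uuid: str, current_size_bytes: int,
                growth_pct: int = 10, token: str = "") -> int:
    """Increase a volume's capacity by a configurable percentage (default 10%)."""
    new_size = int(current_size_bytes * (1 + growth_pct / 100))
    response = requests.patch(
        f"{base_url}/api/storage/volumes/{volume_uuid}",   # illustrative REST endpoint
        json={"size": new_size},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return new_size


# Example: a 1 TiB volume at risk of filling would grow to roughly 1.1 TiB.
# grow_volume("https://cluster.example.com", "uuid-1234", 1 << 40, growth_pct=10)
```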
The code modules may analyze, among other issues:
According to one embodiment, during operation of the data management storage solution 130, a single node called the “primary node,” which may be responsible for coordinating cluster-wide activities, may collect and report telemetry data (e.g., ASUP telemetry data 131) to the AIOps platform 120. The telemetry data may be collected by, among other mechanisms, performance-monitoring tools running on the data management storage solution 130, and service ticketing systems, for example, utilized by technical support staff.
When received from the data management storage solution 130, the AIOps platform 120 may store the telemetry data in an ASUP data lake 110 to allow the raw data to be transformed into structured data that is ready for SQL analytics, data science, and/or ML with low latency. For example, the telemetry data may be processed by one or more analytical models to create the community wisdom that may be stored within ASUP data lake 110. The collection and reporting of the telemetry data by a telemetry mechanism (not shown) may be performed periodically and/or responsive to trigger events. The telemetry mechanism may proactively monitor the health of a particular data storage system or cluster with which it is associated and automatically send information regarding configuration, status, performance, and/or system updates relating to the particular data storage system or cluster to the vendor. This information may then be used by technical support personnel and/or the AIOps platform 120 to speed the diagnosis and resolution of issues (e.g., step-by-step or automated remediations). For example, when predetermined or configurable events are observed within an individual node of a given data management storage solution or at the cluster level, when manually triggered by a customer, when manually triggered by the vendor, or on a periodic basis (e.g., daily, weekly, etc.), ASUP telemetry data 131 (e.g., in the form of an ASUP payload), including, among other things, information indicative of the class and type of the data management system(s) at issue, the configuration (e.g., features that are enabled/disabled) of the data management system(s) at issue, and the version of storage operating system software being run by the data management system may be generated and transmitted to the AIOps platform 120.
In addition to automatically reported telemetry data (e.g., ASUP telemetry data 131), data collected by technical support personnel (e.g., technical support 115) in connection with troubleshooting customer issues may be used to derive community wisdom. In one embodiment, customers of a vendor of the data management storage solution 130 may report potential issues they are experiencing with the data management storage solution 130 to technical support personnel via text, chat, email, phone, or other communication channels. Information collected by technical support 115, for example, regarding a given reported issue, including, among other data, the class and type of data management system, the configuration of the data management system, and the version of storage operating system software being run by the data management system may be provided in near real-time to the AIOps platform 120.
Depending upon the particular implementation, updates (e.g., update 119) may be provided to groups of clusters based on their similarity in terms of class and/or type of data storage systems. For example, a given update may include an updated rule set (e.g., including new and/or updated rules, for example, in the form of conditional logic or in the form of an ML model) and/or an updated remediation set (e.g., including new and/or updated remediations) for use by a particular class and/or a particular type of data storage system. Alternatively, an update may be unique to a particular cluster. According to one embodiment, updates may be performed in accordance with a predefined or configurable schedule (e.g., daily, weekly, monthly, etc.) and/or responsive to manual direction from the vendor. Given a typical feature release schedule for software of a data storage system might be on the order of once or twice per calendar year, the ability to deliver such updates out-of-cycle with the release schedule provides enormous benefit. For example, customers obtain the advantages and results of enhanced risk identification and/or remediation capabilities without having to wait for the next feature release. As will be appreciated, for dark sites (e.g., government or military sites having no Internet connectivity) that may employ one or more data storage systems utilizing features associated with various embodiments, updates to rule sets 121 and/or remediation sets 122 may be delivered via “sneaker net” (e.g., on a computer-readable medium) concurrently with or separate from updates or patches to the software of the data storage systems.
In the context of the present example, the ML classification model is shown as a network of nodes (or “neurons”) which are organized in layers (e.g., an input layer 152, one or more hidden layers 154, and an output layer 156). Based on the predictors (or inputs) provided to the input layer 152, forecasts (or outputs) are emitted by the output layer 156. Coefficients (not shown) associated with each of the predictors are generally referred to as weights. The forecasts are obtained by a combination (in this case, a non-linear combination) of the inputs. The weights may be selected using a learning algorithm that minimizes a cost function (e.g., mean absolute error, mean squared error, root mean squared error, etc.). The example ML classification model 150 depicted in
In general, ML classification algorithms may be used to predict a discrete outcome (y) using independent variables (x). ML has a variety of use-cases in different domains. Subscription-based media streaming platforms like Netflix and Spotify, for instance, use ML to recommend content to users based on their respective activity on the platform. In the context of various embodiments described herein, an ML classification model (e.g., ML classification model 150) may be trained remotely by the AIOps platform, for example, based on community wisdom and applied locally by a particular data storage system, for example, by an auto-healing service to predict whether the particular data storage system is exposed to a particular issue or risk based on a state of the particular data storage system (e.g., one or more of events occurring within the particular data storage system, results of periodically scheduled checks, and historical data) as inputs (e.g., one of input1 to inputn) to the input layer 152. As described further below, responsive to identification of the particular issue or risk, the auto-healing service may further identify a corresponding remediation, to be manually approved or automatically applied to the data storage system, that is known to address or mitigate a root cause of the particular issue or risk, for example, based on analysis of community wisdom performed by the AIOps platform.
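To make the layered structure described above concrete, the following NumPy sketch computes a single forward pass through a small network with one hidden layer; the weights, inputs, and feature interpretations are arbitrary placeholder values for illustration, not trained parameters of any actual model.

```python
# Minimal forward pass of a small classification network (one hidden layer).
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical inputs: normalized state of the data storage system
# (e.g., capacity used, error-event rate, latency).
x = np.array([0.95, 0.40, 0.10])

W_hidden = np.array([[0.8, -0.2, 0.1],
                     [0.3,  0.7, -0.5]])   # 2 hidden neurons x 3 inputs
b_hidden = np.array([0.0, 0.1])
W_out = np.array([[1.2, -0.9]])            # 1 output neuron x 2 hidden neurons
b_out = np.array([-0.3])

hidden = sigmoid(W_hidden @ x + b_hidden)           # non-linear combination of the inputs
risk_probability = sigmoid(W_out @ hidden + b_out)  # forecast emitted by the output layer
print(f"predicted risk probability: {risk_probability[0]:.2f}")
```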
While in the context of the present example, only one ML classification model is shown, it is to be appreciated multiple different ML classification models may be employed. According to one embodiment, a different ML classification model may be trained by the AIOps platform for respective target groups of data storage systems of the same or similar class and type of data storage system based on community wisdom derived from information (e.g., telemetry data, interactions with technical support staff, support case histories, and/or the like) collected from the vendor's consumer base that are of the same or similar class and type as the target group. For example, a first ML classification model may be trained by the AIOps platform for virtual storage systems deployed within a particular public cloud (e.g., Amazon Web Services (AWS)), a second ML classification model may be trained by the AIOps platform for virtual storage systems deployed within another public cloud (e.g., Google Cloud Platform (GCP)), and a third ML classification model may be trained by the AIOps platform for virtual storage systems deployed within yet another public cloud (e.g., Microsoft Azure). Similarly, separate ML classification models may be trained by the AIOps platform for more performant virtual storage systems versus less performant virtual storage systems or more performant physical storage systems versus less performant physical storage systems. Those skilled in the art will appreciate there are numerous other potential groupings/classifications/types of data storage systems, for example, based on features that are enabled on the data storage systems, applications running on the data storage systems, the performance service levels for which the data storage systems are configured, the type or nature of the storage media employed by the data storage systems, and the hardware configuration of the data storage systems.
Nodes 202 may service read requests, write requests, or both received from one or more clients (e.g., clients 205). In one or more embodiments, one of nodes 202 may serve as a backup node for the other should the former experience a failover event. Nodes 202 are supported by physical storage 208. In one or more embodiments, at least a portion of physical storage 208 is distributed across nodes 202, which may connect with physical storage 208 via respective controllers (not shown). The controllers may be implemented using hardware, software, firmware, or a combination thereof. In one or more embodiments, the controllers are implemented in an operating system within the nodes 202. The operating system may be, for example, a storage operating system (OS) that is hosted by the distributed storage system. Physical storage 208 may be comprised of any number of physical data storage devices. For example, without limitation, physical storage 208 may include disks or arrays of disks, solid state drives (SSDs), flash memory, one or more other forms of data storage, or a combination thereof associated with respective nodes. For example, a portion of physical storage 208 may be integrated with or coupled to one or more nodes 202.
In some embodiments, nodes 202 connect with or share a common portion of physical storage 208. In other embodiments, nodes 202 do not share storage. For example, one node may read from and write to a first portion of physical storage 208, while another node may read from and write to a second portion of physical storage 208.
Should one of the nodes 202 experience a failover event, a peer high-availability (HA) node of nodes 202 can take over data services (e.g., reads, writes, etc.) for the failed node. In one or more embodiments, this takeover may include taking over a portion of physical storage 208 originally assigned to the failed node or providing data services (e.g., reads, writes) from another portion of physical storage 208, which may include a mirror or copy of the data stored in the portion of physical storage 208 assigned to the failed node. In some cases, this takeover may last only until the failed node returns to being functional, online, or otherwise available.
The data center 330 may represent an enterprise data center (e.g., an on-premises customer data center) that is built, owned, and operated by a company or the data center 330 may be managed by a third party (or a managed service provider) on behalf of the company, which may lease the equipment and infrastructure. Alternatively, the data center 330 may represent a colocation data center in which a company rents space of a facility owned by others and located off the company premises. The data center 330 is shown including a distributed storage system (e.g., cluster 335). Those of ordinary skill in the art will appreciate additional information technology (IT) infrastructure would typically be part of the data center 330; however, discussion of such additional IT infrastructure is unnecessary to the understanding of the various embodiments described herein.
Turning now to the cluster 335 (which may be analogous to data management storage solution 130 and/or cluster 201), it includes multiple nodes 336a-n and data storage nodes 337a-n (which may be analogous to nodes 202 and which may be collectively referred to simply as nodes) and an Application Programming Interface (API) 338. In the context of the present example, the nodes are organized as a cluster and provide a distributed storage architecture to service storage requests issued by one or more clients (e.g., clients 305) of the cluster. The data served by the nodes may be distributed across multiple storage units embodied as persistent storage devices, including but not limited to hard disk drives, solid state drives, flash memory systems, or other storage devices. A non-limiting example of a node is described in further detail below with reference to
The API 338 may provide an interface through which the cluster 335 is configured and/or queried by external actors. Depending upon the particular implementation, the API 338 may represent a REST or RESTful API that uses Hypertext Transfer Protocol (HTTP) methods (e.g., GET, POST, PATCH, DELETE, and OPTIONS) to indicate its actions. Depending upon the particular embodiment, the API 338 may provide access to various telemetry data (e.g., performance, configuration and other system data) relating to the cluster 335 or components thereof. As those skilled in the art will appreciate, various types of telemetry data may be made available via the API 338, including, but not limited to, measures of latency, utilization, and/or performance at various levels (e.g., the cluster level, the node level, or the node component level). The telemetry data available via API 338 may include ASUP telemetry data (e.g., ASUP telemetry data 131) or the ASUP telemetry data may be provided to an AIOps platform (e.g., AIOps platform 120) separately.
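For purposes of illustration only, the following is a minimal sketch of how telemetry data might be retrieved over such a REST API using HTTP GET; the endpoint path, query parameters, and bearer-token authentication shown are illustrative assumptions rather than the API of any particular product.

```python
# Minimal sketch (illustrative assumptions only): query hypothetical
# cluster-level telemetry over a REST API using HTTP GET.
import requests


def fetch_cluster_metrics(base_url: str, token: str) -> dict:
    """Return hypothetical latency/utilization metrics for a cluster."""
    response = requests.get(
        f"{base_url}/api/cluster/metrics",                 # hypothetical endpoint
        params={"fields": "latency,iops,utilization"},     # hypothetical field names
        headers={"Authorization": f"Bearer {token}"},      # assumed auth scheme
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```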
In this example, the virtual storage system 410a makes use of storage (e.g., hyperscale disks 425) provided by the hyperscaler, for example, in the form of solid-state drive (SSD) backed or hard-disk drive (HDD) backed disks. The cloud disks (which may also be referred to herein as cloud volumes, storage devices, or simply volumes or storage) may include persistent storage (e.g., disks) and/or ephemeral storage (e.g., disks), which may be analogous to physical storage 208.
The virtual storage system 410a (which may be analogous to a node of data management storage solution 130, one of nodes 202, and/or one of nodes 336a-n) may present storage over a network to clients 405 (which may be analogous to clients 205 and 305) using various protocols (e.g., small computer system interface (SCSI), Internet small computer system interface (iSCSI), fibre channel (FC), common Internet file system (CIFS), network file system (NFS), hypertext transfer protocol (HTTP), web-based distributed authoring and versioning (WebDAV), or a custom protocol). Clients 405 may request services of the virtual storage system 410 by issuing Input/Output requests 406 (e.g., file system protocol messages (in the form of packets) over the network). A representative client of clients 405 may comprise an application, such as a database application, executing on a computer that “connects” to the virtual storage system 410 over a computer network, such as a point-to-point link, a shared local area network (LAN), a wide area network (WAN), or a virtual private network (VPN) implemented over a public network, such as the Internet.
In the context of the present example, the virtual storage system 410a is shown including a number of layers, including a file system layer 411 and one or more intermediate storage layers (e.g., a RAID layer 413 and a storage layer 415). These layers may represent components of data management software or storage operating system (not shown) of the virtual storage system 410. The file system layer 411 generally defines the basic interfaces and data structures in support of file system operations (e.g., initialization, mounting, unmounting, creating files, creating directories, opening files, writing to files, and reading from files). A non-limiting example of the file system layer 411 is the Write Anywhere File Layout (WAFL) Copy-on-Write file system (which represents a component or layer of ONTAP software available from NetApp, Inc.).
The RAID layer 413 may be responsible for encapsulating data storage virtualization technology for combining multiple hyperscale disks 425 into RAID groups, for example, for purposes of data redundancy, performance improvement, or both. The storage layer 415 may include storage drivers for interacting with the various types of hyperscale disks 425 supported by the hyperscaler 420. Depending upon the particular implementation, the file system layer 411 may persist data to the hyperscale disks 425 using one or both of the RAID layer 413 and the storage layer 415.
The various layers described herein, and the processing described below may be implemented in the form of executable instructions stored on a machine readable medium and executed by one or more processing resources (e.g., one or more of a microcontroller, a microprocessor, central processing unit core(s), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), and the like) and/or in the form of other types of electronic circuitry. For example, the processing may be performed by one or more virtual or physical computer systems of various forms (e.g., servers, blades, network storage systems or appliances, and storage arrays, such as the computer system described with reference to
In some examples, the system manager may represent an HTML5-based graphical management interface that enables an administrative user to use a web browser to manage the distributed storage system and associated storage objects (e.g., disks, volumes, and storage tiers) and perform common management tasks related to storage systems. Using the system manager dashboard, the administrator may be provided with at-a-glance information about, among other things, various types of alerts and notifications, the efficiency and capacity of storage tiers and volumes, the nodes that are available in a cluster, the status of the nodes in a high-availability (HA) pair, the most active applications and objects, and the performance metrics of a cluster or a node.
With the system manager, the administrator may be able to perform many common tasks, such as:
In the context of the present example, the system manager dashboard includes respective sections relating to health, capacity, performance, management actions, and network. As shown in the management actions section, a DNS lookup failure event has occurred as indicated by 510. An administrative user may view details associated with this event by selecting the “Details” button 511.
As explained further below, the DNS lookup failure event may represent a key EMS event that triggered analysis of one or more rules (e.g., of rule sets 121), for example, by an auto-healing service running on the distributed storage system. Selection of the “Details” button 511 may reveal the particular risk or issue underlying the failure that was identified as a result of analysis of the one or more rules triggered by the key EMS event. As noted above with reference to
As explained further below, the CIFS share offline event may represent a key EMS event that triggered analysis of one or more rules (e.g., of rule sets 121), for example, by an auto-healing service running on the distributed storage system. Selection of the “Details” button 711 may reveal the particular risk or issue underlying the event that was identified as a result of analysis of the one or more rules triggered by the key EMS event. As noted above with reference to
While for purposes of explanation, two specific examples of NAS events and corresponding remediations have been described above with reference to
In the context of the present example, the major components that make up the auto-healing service 900 include a rule/remediation coordinator 940, a cluster-wide task table 912, an auto-healing REST API 910, an event management system (EMS) service 920, a publisher/subscriber (pub/sub)/EMS topic 930, a rules table 911, a pub/sub/auto-heal topic 950, a rules evaluator 960, and a task execution engine 970.
The rule/remediation coordinator 940 may be responsible for coordinating the execution of rules and remediations for the cluster (e.g., data management storage solution 130 of
In the context of the present example, the rule/remediation coordinator 940 includes an event digest 942 and a thread pool 943. The rule/remediation coordinator 940 may utilize the event digest 942 to communicate with a pub/sub bus. For example, the event digest 942 may include functions to subscribe to one or more pub/sub/EMS topics (e.g., pub/sub/EMS topic 930) and/or publish to one or more pub/sub/auto-heal topics (e.g., pub/sub/auto-heal topic 950). According to one embodiment, based on a set of rules derived from community wisdom and distributed to the cluster by an AIOps platform (e.g., AIOps platform 120), the rule/remediation coordinator 940 may create a mapping between key EMS events (e.g., a CIFS share offline event, a DNS server lookup failure, etc.) and associated rules, the evaluation of which may be able to identify underlying root causes (e.g., potential risks or issues to which the cluster may be exposed) and may subscribe to the EMS topics corresponding to the key EMS events. Thereafter, responsive to being notified regarding the occurrence of a particular EMS event to which the rule/remediation coordinator 940 has subscribed, the rule/remediation coordinator 940 may cause the rules evaluator 960 to evaluate a set of one or more rules to which the particular EMS event is mapped. For example, an EMS event indicative of a CIFS share offline event may be associated with a set of rules that check for one or more of client authentication issues, directory service issues, and/or client connectivity issues. Similarly, an EMS event indicative of a DNS server lookup failure may be associated with a set of rules that check for one or more of external networking issues and/or client connectivity issues. Other EMS events may be associated with a set of rules that check for one or more of client authentication issues, capacity exhaustion issues, external networking issues, directory services issues, and/or connectivity issues as appropriate.
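The following is a minimal sketch, provided for purposes of illustration only, of how such a coordinator might maintain a mapping between key EMS events and rules and subscribe to the corresponding topics; the event identifiers, rule identifiers, and pub/sub bus interface are assumptions.

```python
# Minimal sketch (illustrative assumptions only): map key EMS events to rules
# and subscribe to the corresponding EMS topics on a pub/sub bus.
from collections import defaultdict

# Hypothetical mapping, e.g., as distributed via a rule-set update.
EVENT_TO_RULES = {
    "cifs.share.offline": ["rule.client.auth", "rule.directory.service", "rule.client.connectivity"],
    "dns.server.lookup.failure": ["rule.external.network", "rule.client.connectivity"],
}


class RuleRemediationCoordinator:
    def __init__(self, bus):
        self.bus = bus                                   # assumed bus with subscribe()/publish()
        self.event_to_rules = defaultdict(list, EVENT_TO_RULES)

    def register(self):
        # Subscribe only to the key EMS events for which rules exist.
        for event_id in self.event_to_rules:
            self.bus.subscribe(f"ems/{event_id}", self.on_ems_event)

    def on_ems_event(self, event):
        # Publish a rule-evaluation request for each rule mapped to the event.
        for rule_id in self.event_to_rules.get(event["id"], []):
            self.bus.publish("auto-heal/evaluate", {"rule_id": rule_id, "event": event})
```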
In the context of the present example, the thread pool 943 may represent a collection of polymorphic threads, which allows any of the individual threads to be shared across functions. Thread pools are a software design pattern for achieving concurrency of execution, for example, by maintaining multiple threads waiting for tasks to be allocated for concurrent execution. Creating and destroying threads and their associated resources can be an expensive process in terms of time. One benefit of making use of thread pools over creating a new thread for each task is that thread creation and destruction overhead is restricted to the initial creation of the pool, which may result in better performance and better system stability. Polymorphic means the ability to take different forms. In one embodiment, the thread pool 943 represents a generic thread pool of a predetermined or configurable size that may be used by any of the rule/remediation coordinator 940, the rules evaluator 960, or the task execution engine 970. For example, on an as-needed basis, as new tasks are allocated to the threads within the thread pool 943, they may inherit different attributes and behaviors, thereby making the threads available for use by any of a rules evaluator (e.g., rules evaluator 960), a task execution engine (e.g., task execution engine 970), or a rule/remediation coordinator (e.g., rule/remediation coordinator 940). In this manner, the availability and usage of threads may be optimized and wasting of resources may be avoided.
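A minimal sketch of the shared thread pool idea follows, assuming a fixed-size pool created once and reused by the coordinator, rules evaluator, and task execution engine; it illustrates the pattern only and is not tied to any particular implementation of thread pool 943.

```python
# Minimal sketch (pattern illustration only): one shared, fixed-size thread
# pool whose threads take on whatever work is submitted to them.
from concurrent.futures import ThreadPoolExecutor


class SharedThreadPool:
    def __init__(self, size: int = 8):
        # Thread creation cost is paid once, when the pool is created.
        self._executor = ThreadPoolExecutor(max_workers=size)

    def submit(self, fn, *args, **kwargs):
        # Any component (coordinator, evaluator, or execution engine) may
        # submit any callable; the same threads serve all of them.
        return self._executor.submit(fn, *args, **kwargs)


pool = SharedThreadPool(size=8)
future = pool.submit(sum, [1, 2, 3])   # example task shared through the pool
assert future.result() == 6
```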
According to one embodiment, the rule/remediation coordinator 940 oversees the detection and scheduling of rules and remediations. The rule/remediation coordinator 940 may use a distributed Saga design pattern (e.g., Saga pattern 941) for managing failures and recovery, where each action has a compensating action for roll-back/roll-forward. For example, the distributed Saga design pattern may be used as a mechanism to manage data consistency across multiple services (e.g., microservices) in distributed transaction scenarios. A saga is a sequence of transactions that updates each service and publishes a message or event to trigger the next transaction step. If a step fails, the saga executes compensating transactions that counteract the preceding transactions.
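For purposes of illustration, the following minimal sketch shows the saga idea in isolation: each step pairs a forward action with a compensating action, and a failure rolls back the completed steps in reverse order. The step structure is an assumption and is not intended to depict any particular implementation of Saga pattern 941.

```python
# Minimal sketch (pattern illustration only): a saga as an ordered list of
# steps, each with a compensating action for roll-back on failure.
class SagaStep:
    def __init__(self, name, action, compensate):
        self.name = name
        self.action = action            # forward transaction
        self.compensate = compensate    # compensating transaction


def run_saga(steps):
    completed = []
    try:
        for step in steps:
            step.action()
            completed.append(step)
    except Exception:
        # Undo the already-completed steps in reverse order, then re-raise.
        for step in reversed(completed):
            step.compensate()
        raise
```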
While in the context of the present example, the auto-healing service 900 is described as running on a single node within the cluster, it is to be appreciated that, if a node running the auto-healing service 900 fails, another node of the cluster may be elected to run the auto-healing service 900.
The cluster-wide task table 912 may be responsible for logging the steps of a given rule execution and/or a given remediation execution. In the case of a failure of the rule/remediation coordinator 940, the cluster-wide task table 912 may be used to restart execution of running rules/remediations from the point at which they were interrupted by the failure.
The auto-healing REST API 910 provides an interface through which requests for remediation execution may be received from an administrative user of the cluster, for example, as a result of interactions with a user interface presented by a system manager dashboard, such as selection of a “Fix It” button (e.g., “Fix It” button 611 or 811).
The EMS service 920 may represent an event system that includes monitoring and create, read, update, and delete (CRUD)-based alerting. The EMS service 920 may collect and log event data from different parts of the storage operating system kernel and provide event forwarding mechanisms to allow the events to be reported as EMS events. For example, the EMS service 920 may be used to create and modify EMS messages (with stateful attributes).
In one embodiment, a pub/sub bus (including, for example, pub/sub/EMS topic 930 and pub/sub/auto-heal topic 950) is provided to facilitate the exchange of messages among components of the auto-healing service 900. In one embodiment, a topic may be specified by the source component when it publishes a message and subscribers may specify the topic(s) (e.g., pub/sub/EMS topic 930 and/or pub/sub/auto-heal topic 950) for which they want to receive publications.
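The following is a minimal, in-process sketch of a topic-based pub/sub bus of the kind described above; it is intended only to illustrate the subscribe/publish/notify pattern, and the topic and rule names used are assumptions.

```python
# Minimal sketch (pattern illustration only): subscribers register callbacks
# for a topic and are notified when a message is published to that topic.
from collections import defaultdict
from typing import Callable


class PubSubBus:
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver the message to every subscriber of the topic; subscribers
        # never need to poll the publisher.
        for callback in self._subscribers[topic]:
            callback(message)


bus = PubSubBus()
bus.subscribe("auto-heal/evaluate", lambda msg: print("evaluate", msg["rule_id"]))
bus.publish("auto-heal/evaluate", {"rule_id": "rule.dns.check"})   # hypothetical topic/rule
```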
The pub/sub/EMS topic 930 may be used to listen for (e.g., register to be notified regarding) key EMS messages (e.g., those to which the rule/remediation coordinator is subscribed) and used to trigger execution of rule(s) and/or remediations by the rules evaluator 960 and the task execution engine 970, respectively.
The rules table 911 may be used to store and retrieve information about the mapping between EMS events and associated rules (e.g., rules that are part of rule sets 121) to be executed as well as information regarding scheduled risk checks. For example, an EMS event indicative of a CIFS share offline event may be associated with a set of rules that check for one or more of client authentication issues, directory service issues, and/or client connectivity issues. Similarly, an EMS event indicative of a DNS server lookup failure may be associated with a set of rules that check for one or more of external networking issues and/or client connectivity issues. Other EMS events may be associated with a set of rules that check for one or more of client authentication issues, capacity exhaustion issues, external networking issues, directory services issues, and/or connectivity issues as appropriate.
The pub/sub/auto-heal topic 950 may be used for communication between the rule/remediation coordinator 940 and the rules evaluator 960 and between the rule/remediation coordinator 940 and the task execution engine 970.
The rules evaluator 960 may be responsible for overseeing the execution of rules and the detection of risks. Depending on the needs of the particular deployment, the auto-healing service 900 may be scaled by running multiple instances of the rules evaluator 960 on other nodes of the cluster. The rules evaluator 960 may build the dependencies between the rules used for triage and the rules to be run for remediation. The rules evaluator 960 may perform triaging using the triage rules and may dispatch the resulting inputs for remediation execution.
In the context of the present example, the rules evaluator 960 is shown including a logic controller 961, utilities 962, a collector module 964, an open rule platform (ORP) 965, an event digest module 963, and a thread pool 966, which may also represent a polymorphic thread pool like thread pool 943. The logic controller 961 may be responsible for the rules evaluator logic. For example, the rules evaluator 960 may be responsible for binding the rules to be run. The logic controller 961 may handle mapping rules to collectors and parsers (not shown) as well as executing the rules using the ORP 965. Additionally, the logic controller 961 may be responsible for causing the required data sections to be collected using the collector module 964. In one embodiment, the logic controller 961 may use the thread pool 966 to execute the rules evaluator logic. The logic controller 961 may also perform error and exception handling. The logic controller 961 may utilize the event digest 963 to communicate with the pub/sub bus. Depending on the form in which the rules are represented, the execution of the rules may involve, for example, inferencing by an ML model or execution of conditional logic (e.g., represented by Python code).
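As a rough illustration of the rule-execution flow just described, the sketch below gathers the data sections a rule declares it needs and then evaluates the rule's conditional logic; the rule format, section names, and condition are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only): collect the sections a rule
# requires, then evaluate the rule's conditional logic against them.
def execute_rule(rule: dict, collector) -> dict:
    # Gather only the data sections this rule declares it needs.
    sections = {name: collector.collect(name) for name in rule["required_sections"]}
    # Evaluate the rule's condition (here plain Python; could be an ML model).
    triggered = rule["condition"](sections)
    return {"rule_id": rule["id"], "triggered": triggered}


# Hypothetical rule: flag a risk when any SVM has no DNS servers configured.
dns_config_rule = {
    "id": "rule.dns.config",
    "required_sections": ["svm_dns_config"],
    "condition": lambda s: any(not cfg["servers"] for cfg in s["svm_dns_config"]),
}
```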
The utilities 962 may represent helper functions needed for the functioning of the rules evaluator 960. In one embodiment, the utilities 962 may be shared across the rules evaluator 960 and the task execution engine 970.
In one embodiment, the collector module 964 represents a wrapper class for running collection needs, including collecting information from various services within the storage cluster. For example, data may be retrieved from an SMF database (e.g., an SQL collector) by using the DOT SQL package to run collection from the SMF database. The collector module 964 may use the thread pool 966 for asynchronous functionality of the collector 964. The collector 964 may be generic, for example, by accepting instructions, executing the instructions, and returning values.
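A minimal sketch of such a generic collector wrapper follows: it accepts a named instruction, dispatches it to a registered collection function, and returns the value, optionally asynchronously via a shared thread pool. The instruction names and handler registration are assumptions rather than the actual collector interface.

```python
# Minimal sketch (illustrative assumptions only): a generic collector that
# accepts instructions, executes them, and returns values.
from concurrent.futures import Future, ThreadPoolExecutor


class Collector:
    def __init__(self, pool: ThreadPoolExecutor):
        self._pool = pool
        self._handlers = {}                      # instruction name -> collection function

    def register(self, name: str, handler) -> None:
        self._handlers[name] = handler

    def collect(self, name: str, **params):
        # Synchronous collection: run the named instruction and return its value.
        return self._handlers[name](**params)

    def collect_async(self, name: str, **params) -> Future:
        # Asynchronous collection using the shared thread pool.
        return self._pool.submit(self._handlers[name], **params)
```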
The ORP 965 may provide the rules that are executed along with the infrastructure to execute the rules. The ORP 965 may be updated with the latest rules on a periodic basis or on demand from the vendor. For example, an update (e.g., update 119) received by the data management storage from an AIOps service (e.g., AIOps 120) may include a new rule set containing updated rules and/or additional rules, in either case, in the form of conditional logic, code, or an ML model, to be used by the auto-healing service 900 to determine the existence of a risk to which the data management storage solution is exposed.
The event digest module 963 may be a generic module used for communication. In one embodiment, the event digest module 963 is used to register and communicate with the pub/sub bus. For example, the event digest module 963 may include functions to subscribe or publish to auto-heal topics via the pub/sub/auto-heal topic 950.
The task execution engine 970 may be responsible for overseeing the execution of remediations and other tasks that may be distributed across the cluster. Depending on the needs of the particular deployment, the auto-healing service 900 may be scaled by running multiple instances of the task execution engine 970 on one or more other nodes of the cluster.
In the context of the present example, the task execution engine 970 is also shown including a logic controller 971, utilities 972, a collector module 974, an open rule platform (ORP) 975, an event digest module 973, and a thread pool 976, which may also represent a polymorphic thread pool like thread pool 943.
The logic controller 971 may handle the remediation logic. For example, the logic controller 971 may be responsible for mapping rules to collectors and parsers. The logic controller 971 may also take care of executing remediation actions (e.g., issuing storage commands, using an ML model to make predictions, and/or executing a remediation script), including getting the required inputs. The logic controller 971 may use the thread pool 976 to execute the remediation logic. Additionally, the logic controller 971 may take care of error and exception handling. The logic controller 971 may make use of the event digest module 973 to communicate back to the pub/sub bus.
As noted above, the utilities 972 may represent helper functions needed for the functioning of the rules evaluator 960 and/or the task execution engine 970. In one embodiment, the utilities 972 may be shared between the rules evaluator 960 and the task execution engine 970.
In one embodiment, the collector module 974 represents a wrapper class for running collection needs, including collecting information from various services within the storage cluster. For example, data may be retrieved from an SMF database (e.g., an SQL collector) by using the DOT SQL package to run collection from the SMF database. The collector module 974 may use the thread pool for asynchronous functionality of the collector 974. The collector 974 may be generic, for example, by accepting instructions, executing the instructions, and returning values.
The ORP 975 may provide the rules that are executed along with the infrastructure to execute the rules. The ORP 975 may be updated with the latest remediations on a periodic basis or on demand from the vendor. For example, an update (e.g., update 119) received by the data management storage solution from an AIOps service (e.g., AIOps 120) may include a new remediation set containing updated remediations and/or additional remediations to be used by the auto-healing service 900 to mitigate or address risks detected by the rules evaluator 960 automatically or responsive to receipt of manual approval by an administrative user.
The event digest module 973 may be a generic module used for communication. In one embodiment, the event digest module 973 is used to register and communicate with the pub/sub bus. For example, the event digest module 973 may include functions to subscribe or publish to auto-heal topics via the pub/sub/auto-heal topic 950.
In the context of the present example, the thread pool 976 generally represents a collection of polymorphic threads, which allows any of the individual threads to be shared across functions.
Returning to the rule/remediation coordinator 940, it may be responsible for overseeing one or more of the following activities:
While in the context of the present example, the various components of the auto-healing service 900 are described as being implemented “on-box” (e.g., local to the data management storage solution), given the use of REST APIs, for example, it is to be appreciated in alternative embodiments one or more or all of the rule/remediation coordinator 940, rules evaluator 960, and task execution engine 970 may be implemented “off-box” (external to the data management storage solution). For example, a cloud service may provide an auto-healing service (e.g., auto-healing service 900) on behalf of an individual cluster (e.g., data management storage solution 130), by subscribing to topics of interest that are managed by a pub/sub bus (e.g., pub/sub bus 931) implemented by the cluster and instead of AIOps updates (e.g., update 119) being delivered to the cluster (e.g., as described with reference to
Thereafter, the pub/sub bus 931 notifies the rules/remediation coordinator 940, the rules evaluator 960, and task execution engine 970 when a message is posted to a topic (e.g., pub/sub EMS topic 930 and/or pub/sub/auto-heal topic 950) to which they have subscribed. For example, following completion of the subscription requests, upon occurrence of a key EMS event to which the rules/remediation coordinator 940 is subscribed, the pub/sub bus 931 is shown notifying the rules/remediation coordinator 940 regarding the occurrence of the subscribed EMS event within the data storage system. Advantageously, in this manner, the need for such internal or external subscribers to poll for data from publishers may be eliminated and timely notifications may automatically be delivered to subscribers.
In the context of the present example, based on the notification regarding the subscribed EMS event received from the pub/sub bus 931, the rules/remediation coordinator 940 identifies a rule to which the EMS event is mapped (e.g., via rules table 911) and posts (or publishes) a rule evaluation request to the pub/sub bus 931 that is to be carried out by the rules evaluator 960. Responsive to receipt of the rule evaluation request from the rules/remediation coordinator 940 and after determining the existence of one or more subscribers (in this case, the rules evaluator 960) to the request, the pub/sub bus 931 issues a notification regarding the rule evaluation request to the rules evaluator 960.
Upon completion of the rule evaluation requested by the rules/remediation coordinator 940, the rules evaluator 960 posts (or publishes) a rule evaluation result (in this case, confirming the existence of a particular issue or risk to which the data storage system is exposed) to the pub/sub bus 931, which causes the pub/sub bus 931 to notify the rules/remediation coordinator 940.
Based on the notification regarding the rule evaluation result received from the rules evaluator 960 via the pub/sub bus 931, the rules/remediation coordinator 940 identifies a remediation (e.g., of the remediation sets 122) corresponding to the particular issue or risk, for example, that has been determined by an AIOps platform (e.g., AIOps platform 120) to address or mitigate the particular issue or risk. The rules/remediation coordinator 940 then posts (or publishes) a remediation execution request (e.g., identifying the remediation ID of the identified remediation) to be carried out by the task execution engine 970.
Responsive to receipt of the remediation execution request from the rules/remediation coordinator 940 and after determining the existence of one or more subscribers (in this case, the task execution engine 970) to the request, the pub/sub bus 931 issues a notification regarding the remediation execution request to the task execution engine 970. Upon receipt of the notification from the pub/sub bus 931, the task execution engine 970 executes the requested remediation and posts (or publishes) the result of the remediation execution (e.g., completion, success, failure, etc.).
While in the context of this simplified example, only a single rule evaluation request and a single remediation execution request are shown, it is to be appreciated during operation of the auto-healing service 900 many rule evaluations and remediation executions may be performed depending on the number of occurrences of subscribed EMS events and/or the manner in which various rules are related or grouped.
In the context of the present example, a set of rules is shown organized hierarchically with a parent rule ID (rule ID 101) at the root and three child rule IDs (rule IDs 10001, 10002, and 10003). In this manner, a complex conditional expression may be broken down into a series of less complex conditional expressions in which those rules having dependencies on other rules need not be evaluated until their respective pre-conditions have been confirmed. Those skilled in the art will appreciate the rules and remediations may be organized in various other ways.
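To make the hierarchy concrete, the following minimal sketch evaluates a parent rule and, only when its pre-condition holds, its child rules; the rule IDs mirror the example above, but the conditions and state keys are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only): hierarchical rule evaluation
# in which child rules run only after the parent rule's condition is satisfied.
RULES = {
    101:   {"condition": lambda s: s["cifs_share_offline"], "children": [10001, 10002, 10003]},
    10001: {"condition": lambda s: not s["client_authenticated"], "children": []},
    10002: {"condition": lambda s: not s["directory_service_up"], "children": []},
    10003: {"condition": lambda s: not s["client_connectivity_ok"], "children": []},
}


def evaluate(rule_id: int, state: dict, matched=None) -> list:
    matched = [] if matched is None else matched
    rule = RULES[rule_id]
    if rule["condition"](state):
        matched.append(rule_id)
        # Child rules are evaluated only once the parent condition is confirmed.
        for child_id in rule["children"]:
            evaluate(child_id, state, matched)
    return matched


state = {"cifs_share_offline": True, "client_authenticated": True,
         "directory_service_up": False, "client_connectivity_ok": True}
print(evaluate(101, state))   # -> [101, 10002]
```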
As those skilled in the art will appreciate, it may be preferable to perform event-based triggering when available as they may provide reduced overhead and complexity; however, some types of checks (e.g., best practices and performance checks) lend themselves well to scheduling. For example, if an administrative user wants to check whether a given cluster is complying with SAN best practices (e.g., as defined by the vendor), the administrator may schedule one or more rules associated with SAN best practices to run periodically (e.g., once a month). Similarly, the administrator may schedule one or more rules associated with security and/or performance checks to be performed on a periodic basis.
In one embodiment, a given rule may contain the ID(s) of the trigger events (e.g., the EMS event(s)) it is looking for. The trigger event ID information can be inferred by scanning all the active rules or by a catalog that is maintained. A coordinator (e.g., rule/remediation coordinator 940) may register with a pub/sub bus (e.g., pub/sub/EMS topic 930) for the event IDs of interest. In this manner, an auto-healing service (e.g., auto-healing service 900) may avoid listening to all events.
At block 1110, the existence of a risk to which the data storage system is exposed is determined. The risk might represent a misconfiguration of the data storage system, an environmental issue (e.g., a DNS change or network reconfiguration) that might impact the data storage system, a security issue relating to the data storage system, a performance issue relating to the data storage system, or a capacity issue relating to the data storage system. The exposure to a particular risk may be determined by evaluating one or more conditions associated with a set of one or more rules (e.g., of rule sets 121) that are indicative of a root cause of the risk. As noted above, the rules and corresponding remediations (e.g., of remediation sets 122) may have been derived based on community wisdom by an AIOps platform (e.g., AIOps platform 120) and delivered to the data storage system to facilitate automated identification of issues or risks to which the data storage system may be exposed as well as mitigation thereof via performance of the corresponding remediations.
The one or more rules may be associated with a trigger event (e.g., the occurrence of a key EMS event or a predetermined or configurable schedule). According to one embodiment, a rules evaluator (e.g., rules evaluator 960) may be directed (e.g., via a pub/sub pattern) to evaluate (execute) a set of one or more rules (e.g., organized hierarchically with a parent rule at the root and zero or more child rules) by a coordinator (e.g., rule/remediation coordinator 940). Non-limiting examples of pub/sub processing, coordinator processing, and rule execution are described further below with reference to
At block 1120, a remediation associated with the risk determined in block 1110 is identified that addresses or mitigates the risk. According to one embodiment, a given rule (e.g., a parent rule) may include information regarding or a reference to a remediation, for example, a remediation ID that may be used to look up the remediation action(s) or remediation script within a remediation table. Assuming the existence of an associated remediation, a task execution engine (e.g., task execution engine 970) may be directed (e.g., via a pub/sub pattern) to carry out (implement) a set of one or more remediation actions or a remediation script by the coordinator.
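For purposes of illustration only, the sketch below resolves a remediation from a rule result by looking up the remediation ID in a remediation table; the IDs, table layout, and action names are assumptions.

```python
# Minimal sketch (illustrative assumptions only): resolve a remediation from a
# rule result via a remediation-ID lookup in a remediation table.
from typing import Optional

REMEDIATION_TABLE = {
    "rem.dns.add_server": {
        "description": "Add a reachable DNS server to the SVM configuration",
        "actions": ["validate_dns_server", "update_svm_dns_config", "verify_lookup"],
    },
}


def lookup_remediation(rule_result: dict) -> Optional[dict]:
    remediation_id = rule_result.get("remediation_id")
    return REMEDIATION_TABLE.get(remediation_id)
```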
At block 1130, the set of one or more remediation actions are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950) the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1120. A non-limiting example of remediation execution is described further below with reference to
Examples of more specific details associated with automated remediation of a risk identified responsive to evaluation of rules associated with a trigger event and responsive to evaluation of rules on a periodic schedule are described further below with reference to
Depending on the particular implementation, the cloud-based service 1140 may be hosted within a private (e.g., a data center of the vendor of the storage clusters 1130a-n) or a public cloud (e.g., AWS, Microsoft Azure, Google Cloud Platform, or the like).
At block 1150, a rule set for a set of best practices for a data storage system is received. The rule set may be received as part of an update (e.g., update 119) distributed to the data storage system directly (as in the example of
At block 1160, after a rule evaluation has been triggered, for example, due to a rule-evaluation trigger event (e.g., occurrence of a key EMS event, a scheduled event associated with a particular rule, or an event representing that an on-demand rule-evaluation has been requested, for example, by an administrative user of the data storage system), it is determined whether a risk exists to which the data storage system is exposed by evaluating one or more rules associated with the rule-evaluation trigger event. In this example, the risk represents a deviation from a best practice by the data storage system. As noted above, the one or more rules (e.g., of rule sets 121) and corresponding remediations (e.g., of remediation sets 122) may have been derived based on community wisdom by an AIOps platform (e.g., AIOps platform 120) and delivered to the data storage system to facilitate automated identification of issues or risks to which the data storage system may be exposed as well as mitigation thereof via performance of the corresponding remediations. As also noted above, depending on the form in which the rules are represented, the execution of the rules may involve, for example, inferencing by an ML model or execution of conditional logic (e.g., represented by Python code).
According to one embodiment, a rules evaluator (e.g., rules evaluator 960) may be directed (e.g., via a pub/sub pattern) to evaluate (execute) a set of one or more rules (e.g., organized hierarchically with a parent rule at the root and zero or more child rules) by a coordinator (e.g., rule/remediation coordinator 940). Non-limiting examples of pub/sub processing, coordinator processing, and rule execution are described further below with reference to
At block 1170, it is determined whether a remediation associated with the risk (i.e., the deviation from the best practice) determined in block 1160 exists that addresses or mitigates the risk. For example, the remediation may be identified based on a rule, of the one or more rules, whose conditions have been satisfied. According to one embodiment, the rule may include information regarding or a reference to a remediation, for example, a remediation ID that may be used to look up the remediation action(s) or remediation script within a remediation table. Assuming the existence of an associated remediation, a task execution engine (e.g., task execution engine 970) may be directed (e.g., via a pub/sub pattern) to carry out (implement) a set of one or more remediation actions or a remediation script by the coordinator.
At block 1180, the set of one or more remediation actions are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950) the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1170. A non-limiting example of remediation execution is described further below with reference to
Examples of more specific details associated with automated remediation of a risk identified responsive to evaluation of rules associated with a trigger event and responsive to evaluation of rules on a periodic schedule are described further below with reference to
For purposes of illustration and without limiting the general applicability of the proposed auto-healing service to deviations from best practices, a brief description of a concrete use case relating to DNS best practices is now provided. In some implementations, a data storage system may be virtualized, for example, to support multitenancy. In such a case, there may be multiple SVMs (e.g., one for each tenant within the customer's organization). In one example, a rule may represent a best practice of each SVM being associated with a DNS meeting certain conditions indicative of having a particular operational status. The rule may be capable of running based on incoming telemetry data from the data storage system, data collected from the data storage system, or data provided through an EMS message. During evaluation of the rule, appropriate data may be gathered and various conditions may be evaluated relating to the current state of the data storage system as compared to the expected or desired state of the data storage system (represented by the best practice). For example, the rule may iterate through or otherwise evaluate all SVMs of the data storage system, each of which has its own network configuration, to check the DNS status (e.g., it is configured appropriately and is reachable) of each SVM. If the DNS status is not satisfactory for a given DNS of an SVM, it may be added to a list for subsequent remediation. Upon completion of the rule evaluation, a list of those DNSs recommended for remediation (e.g., addition of a new name to address a misconfiguration as a result of an SVM host name change) may be presented (e.g., via a system manager dashboard of the data storage system) to an administrative user of the data storage system for approval of the proposed remediation or the remediation may be automatically performed depending on the configuration of the rule at issue.
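The following is a minimal sketch of the SVM DNS check described above; the SVM representation, the port-53 reachability probe, and the notion of a satisfactory DNS status are simplified assumptions for illustration only.

```python
# Minimal sketch (illustrative assumptions only): iterate the SVMs of a data
# storage system and collect those whose DNS status deviates from the best
# practice (no servers configured, or no configured server reachable).
import socket


def check_svm_dns(svms: list) -> list:
    needs_remediation = []
    for svm in svms:                                   # each svm is assumed to be a dict
        servers = svm.get("dns_servers", [])
        reachable = False
        for server in servers:
            try:
                # Crude TCP probe of port 53; a real check would issue an
                # actual lookup against the SVM's configured DNS domain.
                socket.create_connection((server, 53), timeout=2).close()
                reachable = True
                break
            except OSError:
                continue
        if not servers or not reachable:
            needs_remediation.append(svm["name"])
    return needs_remediation
```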
While in the context of the present example, a best practice is called out as a specific type of risk to which a cluster may be exposed, it is to be appreciated that best practices may generally be treated like other risks described herein. Therefore, the other rule evaluation (or ML inferencing) and remediation activities or tasks and infrastructure relating to the various other types of risks to which a data storage system might be exposed are generally applicable to the identification of deviations from best practices and remediation or mitigation thereof.
At decision block 1210, an event indicative of a type of message published to the pub/sub bus is determined. When no message has been published, processing loops back to decision block 1210.
Responsive to a subscription request, processing continues with block 1220 at which the requester is added as a subscriber to a topic specified by the subscription request. For example, a coordinator (e.g., rule/remediation coordinator 940) may subscribe to particular EMS events of which it would like to be notified by making a subscription request to the pub/sub bus for a corresponding EMS topic (e.g., pub/sub/EMS topic 930). Similarly, a rules evaluator (e.g., rules evaluator 960) and a task execution engine (e.g., task execution engine 970) may subscribe to rule execution requests and remediation execution requests, respectively, by making subscription requests to the pub/sub bus for corresponding auto-heal topics (e.g., pub/sub/auto-heal topic 950).
Responsive to a new EMS event, processing continues with block 1230 to notify the coordinator of the new EMS event. Responsive to the new EMS event notification, the coordinator may identify a set of one or more rules (e.g., of rule sets 121) to be evaluated based on a mapping between EMS events and corresponding rules (e.g., rules tables 911) and may cause the rules evaluator to perform the evaluation, for example, by publishing a rule execution request to the pub/sub bus.
Responsive to a rule execution request message published (e.g., to the pub/sub/auto-heal topic 950 of
Responsive to a rule evaluation result message published (e.g., to the pub/sub/auto-heal topic 950 of
Responsive to a remediation execution request message published (e.g., to the pub/sub/auto-heal topic 950 of
Responsive to a remediation complete message published (e.g., to the pub/sub/auto-heal topic 950 of
At block 1305, the coordinator may upon initialization subscribe to desired EMS events. For example, the coordinator may post a subscription request message specifying the pub/sub EMS topic so as to be automatically notified by the pub/sub bus of subsequent messages posted to this topic.
At decision block 1310, it is determined whether a new EMS event has been received. If so, processing continues with decision block 1315; otherwise processing loops back to decision block 1310.
At decision block 1315, one or more rule execution pre-checks may be performed. If all pre-checks pass, processing continues with block 1320; otherwise, processing loops back to decision block 1310 to await receipt of another EMS event. In one embodiment, the one or more rule pre-checks may include performing a check regarding whether a mapping exists for the event ID of the EMS event at issue to a corresponding rule ID of a rule to be executed. If no mapping is found, the pre-checks may be treated as having failed. Alternatively, or additionally the one or more rule pre-checks may include performing a check to determine whether an entry exists in a task table (e.g., the cluster-wide task table 912 of
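A minimal sketch of these pre-checks follows, assuming simple in-memory representations of the event-to-rule mapping and the task table; the field names and status values are assumptions.

```python
# Minimal sketch (illustrative assumptions only): rule-execution pre-checks.
def rule_precheck(event_id: str, event_to_rule: dict, task_table: list) -> bool:
    # Pre-check 1: a mapping from the event ID to a rule ID must exist.
    if event_id not in event_to_rule:
        return False
    # Pre-check 2: skip if an entry for the same event is already in flight.
    for task in task_table:
        if task["event_id"] == event_id and task["status"] not in ("complete", "failed"):
            return False
    return True
```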
At block 1320, the rule(s) to be run are extracted. For example, the coordinator may determine the rule ID to which the event ID of the EMS event at issue maps.
At block 1325, details (e.g., the rule ID and the event ID) may be logged in the task table and a rule execution request message (including the rule ID and optionally the node ID to which the rule execution is being delegated if the rule execution is not to be performed by the primary node) may be posted/published to a pub/sub topic (e.g., a pub/sub “evaluate” topic) to trigger execution of the rules associated with the rule ID by a rules evaluator (e.g., the rules evaluator 960 of
At decision block 1330, it is determined whether a rule evaluation result message (e.g., a reply) has been received (e.g., from the rules evaluator). If so, processing continues with block 1335; otherwise processing loops back to decision block 1330 to await the rule evaluation result. As noted above, in one embodiment, a timeout mechanism may be used when a request is sent to the rule evaluator. If a request times out, the coordinator may perform a roll-back or roll-forward as appropriate.
At block 1335, the appropriate next step is determined based on the rule evaluation result and a checkpoint is created in a cluster-wide log to facilitate failure recovery. With respect to determining the appropriate next step, when a risk has been identified as being associated with the EMS event at issue by the rule evaluator and a remediation has been returned as part of the rule evaluation result (e.g., as part of a stateful EMS event), then the event and the associated corrective action may be brought to the attention of an administrative user of the cluster by creating an EMS alert that is displayed via a user interface of a system manager dashboard (e.g., as described and illustrated with reference to
At decision block 1345, it is determined whether remediation execution is to be performed. In the context of the present example, while no indication is received regarding remediation execution, processing loops back to decision block 1345. Responsive to the administrative user dismissing the alert displayed via the system manager dashboard (e.g., by selecting the “Dismiss” button), resulting in invocation of an auto-healing REST API (e.g., the auto-healing REST API 910 of
At decision block 1350, one or more remediation pre-checks may be performed. If all pre-checks pass, processing continues with block 1355; otherwise, processing loops back to decision block 1310 to await receipt of another EMS event. In one embodiment, the one or more remediation pre-checks may include performing a check regarding whether a mapping exists for the rule ID to a corresponding remediation ID of a remediation to be executed. If no mapping is found, the pre-checks may be treated as having failed. Alternatively, or additionally the one or more remediation pre-checks may include performing a check to determine whether an entry exists in a task table (e.g., the cluster-wide task table 912 of
At block 1355, the remediation(s) to be executed are extracted. For example, the coordinator may determine (e.g., with reference to the rules tables) the remediation ID to which the rule ID of the EMS event at issue maps.
At block 1360, a checkpoint may be created within the cluster-wide log (e.g., including the rule ID, the event ID, and the remediation ID) and execution of the remediation(s) may be requested, for example, by posting a remediation execution request message (including the remediation ID, and optionally the node ID to which the remediation execution is being delegated if the remediation execution is not to be performed by the primary node) to a pub/sub topic (e.g., a pub/sub “remediate” topic) to trigger execution of the remediation actions associated with the remediation ID by a task execution engine (e.g., the task execution engine 970 of
At decision block 1365, it is determined whether a remediation status update has been received (e.g., from the task execution engine). If so, processing continues with block 1370; otherwise, processing loops back to decision block 1365 to await the remediation status update. As noted above, in one embodiment, a timeout mechanism may be used when a request is sent to the task execution engine. If a request times out, the coordinator may perform a roll-back or roll-forward as appropriate.
At block 1370, responsive to the remediation status update, the status of the remediation is updated within the cluster-wide task table and via an EMS service (e.g., the EMS service 920 of
At decision block 1375, it is determined whether a remediation reply has been received from the task execution engine that is indicative of completion of a given remediation execution. If so, processing continues with block 1380; otherwise, processing loops back to decision block 1375 to await the remediation reply. As noted above, in one embodiment, a timeout mechanism may be used when a request is sent to the task execution engine. If a request times out, the coordinator may perform a roll-back or roll-forward as appropriate.
At block 1380, responsive to the remediation reply, the status of the remediation is updated (e.g., to a terminal state) within the cluster-wide task table and via the EMS service.
At decision block 1410, a determination is made regarding whether the rule ID contained within the rule execution request message exists. If so, processing continues with block 1430; otherwise, processing branches to block 1420 in which an error may be published. In one embodiment, the rule evaluator may consult a rules table (e.g., the rules tables 911 of
At block 1430, execution of a sequence of rules is initiated by finding child rules of the rule ID at issue. For example, rule execution logic (e.g., the logic controller 961 of
At decision block 1440, it is determined whether any specified rule conditions are satisfied for a given child rule. If all rule conditions are satisfied, processing continues with block 1460; otherwise, processing branches to block 1450 to skip the current child rule and move on to the next child rule after looping back to decision block 1440.
At block 1460, the associated remediation is identified, for example, with reference to a remediation ID associated with the rule ID at issue as indicated in the rules table.
At decision block 1470, it is determined whether the remediation was found. If so, processing continues with block 1490; otherwise, processing branches to block 1480 in which an error may be published. In one embodiment, the rule evaluator may consult the rules tables to make this determination and/or arrive at this determination as a result of a corresponding folder or file for the remediation at issue being missing or corrupted.
At block 1490, the remediation is prepared for publication, for example, by creating the remediation URL and associated parameters; and then, the “reply” (e.g., to the rule execution request message that initiated this process) may be posted/published, for example, in the form of a stateful EMS event via the EMS service to communicate to the coordinator the results of the rule evaluation.
At decision block 1510, a determination is made regarding whether the remediation ID contained within the remediation execution request message exists. If so, processing continues with decision block 1530; otherwise, processing branches to block 1520 in which an error may be published. In one embodiment, the task execution engine may consult rules tables (e.g., rules tables 911 of
At decision block 1530, a determination may be made regarding whether the issue still exists. If so, then processing continues with block 1550; otherwise, processing branches to block 1540 in which the status of the event/issue may be updated in a cluster-wide task table (e.g., the cluster-wide task table 912 of
At block 1550, the associated remediation is identified. For example, task execution logic (e.g., the logic controller 971 of
At decision block 1560, it is determined whether the remediation was found. If so, processing continues with block 1580; otherwise, processing branches to block 1570 in which an error may be published.
At block 1580, a remediation plan is created, a remediation script is executed, and the status may be updated in the cluster-wide task table as well as via the EMS service at each step of the script.
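For purposes of illustration only, the following minimal sketch runs a remediation plan step by step and reports the status after each step, for example to a task table and an EMS-style event sink; the status callback and step functions are assumptions.

```python
# Minimal sketch (illustrative assumptions only): execute a remediation plan
# step by step, reporting status after each step.
def execute_remediation(remediation_id: str, steps, report_status) -> str:
    for index, step in enumerate(steps, start=1):
        report_status(remediation_id, step=index, status="running")
        try:
            step()                                   # run one remediation action
        except Exception as exc:
            report_status(remediation_id, step=index, status=f"failed: {exc}")
            return "failure"
        report_status(remediation_id, step=index, status="complete")
    report_status(remediation_id, step=None, status="success")
    return "success"
```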
Network environment 1600, which may take the form of a clustered network environment, includes data storage apparatuses 1602a-n that are coupled over a cluster or cluster fabric 1604 that includes one or more communication network(s) and facilitates communication between data storage apparatuses 1602a-n (and one or more modules, components, etc. therein, such as node computing devices 1606a-n, for example), although any number of other elements or components can also be included in network environment 1600 in other examples. This technology provides a number of advantages including methods, non-transitory computer-readable media, and computing devices that implement the techniques described herein.
In this example, node computing devices 1606a-n may be representative of primary or local storage controllers or secondary or remote storage controllers that provide client devices 1608a-n (which may also be referred to as client nodes and which may be analogous to clients 205, 305, and 405) with access to data stored within data storage nodes 1610a-n (which may also be referred to as data storage devices) and cloud storage node(s) 1636 (which may also be referred to as cloud storage device(s) and which may be analogous to hyperscale disks 425). The node computing devices 1606a-n may be implemented as hardware, software (e.g., a storage virtual machine), or a combination thereof.
Data storage apparatuses 1602a-n and/or node computing devices 1606a-n of the examples described and illustrated herein are not limited to any particular geographic areas and can be clustered locally and/or remotely via a cloud network, or not clustered in other examples. Thus, in one example data storage apparatuses 1602a-n and/or node computing devices 1606a-n can be distributed over multiple storage systems located in multiple geographic locations (e.g., located on-premise, located within a cloud computing environment, etc.); while in another example a network can include data storage apparatuses 1602a-n and/or node computing devices 1606a-n residing in the same geographic location (e.g., in a single on-site rack).
In the illustrated example, one or more of client devices 1608a-n, which may be, for example, personal computers (PCs), computing devices used for storage (e.g., storage servers), or other computers or peripheral devices, are coupled to the respective data storage apparatuses 1602a-n by network connections 1612a-n. Network connections 1612a-n may include a local area network (LAN) or wide area network (WAN) (i.e., a cloud network), for example, that utilize TCP/IP and/or one or more Network Attached Storage (NAS) protocols, such as a Common Internet Filesystem (CIFS) protocol or a Network Filesystem (NFS) protocol to exchange data packets, a Storage Area Network (SAN) protocol, such as Small Computer System Interface (SCSI) or Fiber Channel Protocol (FCP), an object protocol, such as simple storage service (S3), and/or non-volatile memory express (NVMe), for example.
Illustratively, client devices 1608a-n may be general-purpose computers running applications and may interact with data storage apparatuses 1602a-n using a client/server model for exchange of information. That is, client devices 1608a-n may request data from data storage apparatuses 1602a-n (e.g., data on one of the data storage nodes 1610a-n managed by a network storage controller configured to process I/O commands issued by client devices 1608a-n), and data storage apparatuses 1602a-n may return results of the request to client devices 1608a-n via the network connections 1612a-n.
The node computing devices 1606a-n of data storage apparatuses 1602a-n can include network or host nodes that are interconnected as a cluster to provide data storage and management services, such as to an enterprise having remote locations, cloud storage (e.g., a storage endpoint may be stored within cloud storage node(s) 1636), etc., for example. Such node computing devices 1606a-n can be attached to the cluster fabric 1604 at a connection point, redistribution point, or communication endpoint, for example. One or more of the node computing devices 1606a-n may be capable of sending, receiving, and/or forwarding information over a network communications channel, and could comprise any type of device that meets any or all of these criteria.
In an example, the node computing devices 1606a-n may be configured according to a disaster recovery configuration whereby a surviving node provides switchover access to the storage devices 1610a-n in the event a disaster occurs at a disaster storage site (e.g., the node computing device 1606a provides client device 1608n with switchover data access to data storage nodes 1610n in the event a disaster occurs at the second storage site). In other examples, the node computing device 1606n can be configured according to an archival configuration and/or the node computing devices 1606a-n can be configured based on another type of replication arrangement (e.g., to facilitate load sharing). Additionally, while two node computing devices are illustrated in
As illustrated in network environment 1600, node computing devices 1606a-n can include various functional components that coordinate to provide a distributed storage architecture. For example, the node computing devices 1606a-n can include network modules 1614a-n and disk modules 1616a-n. Network modules 1614a-n can be configured to allow the node computing devices 1606a-n (e.g., network storage controllers) to connect with client devices 1608a-n over the network connections 1612a-n, for example, allowing client devices 1608a-n to access data stored in network environment 1600.
Further, the network modules 1614a-n can provide connections with one or more other components through the cluster fabric 1604. For example, the network module 1614a of node computing device 1606a can access the data storage node 1610n by sending a request via the cluster fabric 1604 through the disk module 1616n of node computing device 1606n when the node computing device 1606n is available. Alternatively, when the node computing device 1606n fails, the network module 1614a of node computing device 1606a can access the data storage node 1610n directly via the cluster fabric 1604. The cluster fabric 1604 can include one or more local and/or wide area computing networks (e.g., cloud networks) embodied as Infiniband, Fibre Channel (FC), or Ethernet networks, for example, although other types of networks supporting other protocols can also be used.
Disk modules 1616a-n can be configured to connect data storage nodes 1610a-n, such as disks or arrays of disks, SSDs, flash memory, or some other form of data storage, to the node computing devices 1606a-n. Often, disk modules 1616a-n communicate with the data storage nodes 1610a-n according to a SAN protocol, such as SCSI or FCP, for example, although other protocols can also be used. Thus, as seen from an OS on node computing devices 1606a-n, the data storage nodes 1610a-n can appear as locally attached. In this manner, different node computing devices 1606a-n, etc. may access data blocks, files, or objects through the OS, rather than expressly requesting abstract files.
While network environment 1600 illustrates an equal number of network modules 1614a-n and disk modules 1616a-n, other examples may include a differing number of these modules. For example, there may be a plurality of network and disk modules interconnected in a cluster that do not have a one-to-one correspondence between the network and disk modules. That is, different node computing devices can have a different number of network and disk modules, and the same node computing device can have a different number of network modules than disk modules.
Further, one or more of client devices 1608a-n can be networked with the node computing devices 1606a-n in the cluster, over the network connections 1612a-n. As an example, respective client devices 1608a-n that are networked to a cluster may request services (e.g., exchanging of information in the form of data packets) of node computing devices 1606a-n in the cluster, and the node computing devices 1606a-n can return results of the requested services to client devices 1608a-n. In one example, client devices 1608a-n can exchange information with the network modules 1614a-n residing in the node computing devices 1606a-n (e.g., network hosts) in data storage apparatuses 1602a-n.
In one example, data storage apparatuses 1602a-n host aggregates corresponding to physical local and remote data storage devices, such as local flash or disk storage in the data storage nodes 1610a-n, for example. One or more of the data storage nodes 1610a-n can include mass storage devices, such as disks of a disk array. The disks may comprise any type of mass storage devices, including but not limited to magnetic disk drives, flash memory, and any other similar media adapted to store information, including, for example, data and/or parity information.
The aggregates may include volumes 1618a-n in this example, although any number of volumes can be included in the aggregates. The volumes 1618a-n are virtual data stores or storage objects that define an arrangement of storage and one or more filesystems within network environment 1600. Volumes 1618a-n can span a portion of a disk or other storage device, a collection of disks, or portions of disks, for example, and typically define an overall logical arrangement of data storage. In one example, volumes 1618a-n can include stored user data as one or more files, blocks, or objects that may reside in a hierarchical directory structure within the volumes 1618a-n.
Volumes 1618a-n are typically configured in formats that may be associated with particular storage systems, and respective volume formats typically comprise features that provide functionality to the volumes 1618a-n, such as providing the ability for volumes 1618a-n to form clusters, among other functionality. Optionally, one or more of the volumes 1618a-n can be in composite aggregates and can extend between one or more of the data storage nodes 1610a-n and one or more of the cloud storage node(s) 1636 to provide tiered storage, for example, and other arrangements can also be used in other examples.
In one example, to facilitate access to data stored on the disks or other structures of the data storage nodes 1610a-n, a filesystem (e.g., file system layer 411) may be implemented that logically organizes the information as a hierarchical structure of directories and files. In this example, respective files may be implemented as a set of disk blocks of a particular size that are configured to store information, whereas directories may be implemented as specially formatted files in which information about other files and directories are stored.
Data can be stored as files or objects within a physical volume and/or a virtual volume, which can be associated with respective volume identifiers. The physical volumes correspond to at least a portion of physical storage devices, such as the data storage nodes 1610a-n (e.g., a RAID system, such as RAID layer 413) whose address, addressable space, location, etc. does not change. Typically, the location of the physical volumes does not change in that the range of addresses used to access them generally remains constant.
Virtual volumes, in contrast, can be stored over an aggregate of disparate portions of different physical storage devices. Virtual volumes may be a collection of different available portions of different physical storage device locations, such as some available space from disks, for example. It will be appreciated that since the virtual volumes are not “tied” to any one particular storage device, virtual volumes can be said to include a layer of abstraction or virtualization, which allows them to be resized and/or to be flexible in some regards.
Further, virtual volumes can include one or more LUNs, directories, Qtrees, files, and/or other storage objects, for example. Among other things, these features, but more particularly the LUNs, allow the disparate memory locations within which data is stored to be identified, for example, and grouped as a data storage unit. As such, the LUNs may be characterized as constituting a virtual disk or drive upon which data within the virtual volumes is stored within an aggregate. For example, LUNs are often referred to as virtual drives, such that they emulate a hard drive, while they actually comprise data blocks stored in various parts of a volume.
In one example, the data storage nodes 1610a-n can have one or more physical ports, wherein each physical port can be assigned a target address (e.g., SCSI target address). To represent respective volumes, a target address on the data storage nodes 1610a-n can be used to identify one or more of the LUNs. Thus, for example, when one of the node computing devices 1606a-n connects to a volume, a connection between the one of the node computing devices 1606a-n and one or more of the LUNs underlying the volume is created.
Respective target addresses can identify multiple LUNs, such that a target address can represent multiple volumes. The I/O interface, which can be implemented as circuitry and/or software in a storage adapter or as executable code residing in memory and executed by a processor, for example, can connect to volumes by using one or more addresses that identify the one or more LUNs.
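For purposes of illustration only, the following simplified sketch models the relationships described above among aggregates, volumes, LUNs, and SCSI target addresses. The class and field names are hypothetical and are not intended to reflect any particular on-disk layout or product API.

```python
# Conceptual model only; names and structure are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LUN:
    lun_id: str
    size_bytes: int


@dataclass
class Volume:
    name: str
    luns: List[LUN] = field(default_factory=list)  # virtual drives backing the volume


@dataclass
class Aggregate:
    name: str
    volumes: List[Volume] = field(default_factory=list)  # volumes carved from the aggregate


@dataclass
class TargetPort:
    # A single target address may expose LUNs belonging to multiple volumes.
    target_address: str
    lun_map: Dict[str, LUN] = field(default_factory=dict)

    def map_lun(self, lun: LUN) -> None:
        self.lun_map[lun.lun_id] = lun


if __name__ == "__main__":
    vol = Volume(name="vol1", luns=[LUN("lun-0", 1 << 30)])
    aggr = Aggregate(name="aggr1", volumes=[vol])
    port = TargetPort(target_address="iqn.example:target0")
    for lun in vol.luns:
        port.map_lun(lun)
    print(f"{aggr.name}/{vol.name} exposes {len(port.lun_map)} LUN(s) via {port.target_address}")
```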
The present embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. Accordingly, it is understood that any operation of the computing systems of the network environment 1600 and the distributed storage system may be implemented by a computing system using corresponding instructions stored on or in a non-transitory computer-readable medium accessible by a processing system. For the purposes of this description, a non-transitory computer-usable or computer-readable medium can be any apparatus that can store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may include non-volatile memory including magnetic storage, solid-state storage, optical storage, cache memory, and RAM.
At decision block 1710, it is determined whether a trigger event has been received. If so, processing continues with block 1720; otherwise, processing loops back to decision block 1710. According to one embodiment, the auto-healing service, for example, via a rule/remediation coordinator (e.g., rule/remediation coordinator 940) may have previously subscribed to a set of one or more key EMS events (trigger events) with a publisher/subscriber bus (e.g., Pub/Sub/EMS Topic 930) so as to be automatically notified upon the occurrence of any of the set of one or more key EMS events in the context of the distributed storage system.
At block 1720, a set of one or more rules is identified for evaluation. According to one embodiment, the auto-healing service, for example, via the rule/remediation coordinator, may maintain a mapping of the set of one or more key EMS events (trigger events) to respective sets of one or more rules to be evaluated responsive to occurrence of a given key EMS event (trigger event). As noted above, in the context of a capacity example, a non-limiting example of a key EMS event may be an EMS event indicative of a volume being X % (e.g., 80%) full. This EMS event may be mapped to an associated rule that causes a forecast to be performed and evaluated.
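By way of non-limiting illustration, the following sketch shows one way a rule/remediation coordinator might maintain the trigger-event-to-rules mapping described above. The event name, rule body, and coordinator interface are assumptions made solely for this example and do not correspond to an actual EMS or pub/sub API.

```python
# Hypothetical rule/remediation coordinator; event names and rules are
# illustrative assumptions only.
from collections import defaultdict
from typing import Callable, Dict, List

# A rule receives event/system context and returns True when a risk is indicated.
Rule = Callable[[dict], bool]


class RuleRemediationCoordinator:
    def __init__(self) -> None:
        self._event_to_rules: Dict[str, List[Rule]] = defaultdict(list)

    def register_rule(self, event_name: str, rule: Rule) -> None:
        # Maintain the key-EMS-event (trigger event) -> rules mapping.
        self._event_to_rules[event_name].append(rule)

    def subscribed_events(self) -> List[str]:
        # The trigger events the coordinator would subscribe to on the pub/sub bus.
        return list(self._event_to_rules)

    def rules_for_event(self, event_name: str) -> List[Rule]:
        # Invoked when a subscribed EMS event is published; returns the rules
        # that should now be evaluated.
        return list(self._event_to_rules.get(event_name, []))


def volume_nearly_full_rule(context: dict) -> bool:
    # Placeholder rule body; a real rule might trigger the capacity forecast.
    return context.get("used_pct", 0) >= 80


if __name__ == "__main__":
    coordinator = RuleRemediationCoordinator()
    coordinator.register_rule("volume.nearly.full", volume_nearly_full_rule)
    event_context = {"volume": "vol1", "used_pct": 82}
    for rule in coordinator.rules_for_event("volume.nearly.full"):
        print(rule.__name__, "->", rule(event_context))  # -> volume_nearly_full_rule -> True
```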
At block 1730, the set of one or more rules is evaluated with respect to one or more of historical data and a current state of the data storage system. For example, continuing with the capacity example, the associated rule may cause a forecast to be performed based on the current state of the data storage system (e.g., a particular volume is X % full) and historical data (e.g., usage of the particular volume over time) to determine when the particular volume will be at Y % (e.g., 100%) full. A risk may be identified when the forecasted fullness date is within N (e.g., 3) months.
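Continuing with the capacity example, the following purely illustrative sketch shows one simple way such a forecast might be computed, namely a least-squares linear fit over historical used-capacity samples; the forecasting technique actually employed may differ.

```python
# Illustrative capacity forecast: least-squares linear fit of used fraction over
# time; flags a risk if the projected 100%-full date falls within N months.
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def forecast_full_date(samples: List[Tuple[datetime, float]]) -> Optional[datetime]:
    """Fit used fraction (0..1) vs. time; return the estimated date at which the
    trend reaches 1.0, or None if usage is flat or shrinking."""
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() for ts, _ in samples]
    ys = [used for _, used in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
    if slope <= 0:
        return None  # flat or shrinking usage: no projected fullness date
    intercept = mean_y - slope * mean_x
    return t0 + timedelta(seconds=(1.0 - intercept) / slope)


def volume_fullness_risk(samples: List[Tuple[datetime, float]], horizon_months: int = 3) -> bool:
    full_date = forecast_full_date(samples)
    return full_date is not None and full_date <= datetime.now() + timedelta(days=30 * horizon_months)


if __name__ == "__main__":
    now = datetime.now()
    # Monthly samples growing roughly 7% per month; the volume is currently 80% full.
    history = [(now - timedelta(days=30 * k), 0.80 - 0.07 * k) for k in range(6, -1, -1)]
    print("risk within 3 months:", volume_fullness_risk(history))  # -> True
```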
At decision block 1740, it is determined whether a risk has been identified. If so, processing continues with block 1750; otherwise, processing loops back to decision block 1710. For example, continuing with the capacity example, if the forecasted fullness date is within N months from the current date, this condition may be indicative of a volume fullness risk.
At block 1750, the availability of a remediation that addresses or mitigates the identified risk is determined. For example, continuing with the capacity example, the remediation associated with the risk may be one that causes the volume size to be increased by M % (e.g., 20%).
At block 1760, an administrative user of the data storage system may be notified of the risk and the proposed automated remediation. According to one embodiment, the administrative user may be notified via a system manager dashboard (e.g., system manager dashboard 500 of
At decision block 1770, it is determined whether performance of the remediation is authorized. According to one embodiment, authorization of the remediation may be received via interaction of the administrative user with a dialog box (e.g., dialog box 600) that provides the administrative user with the option of dismissing the event (e.g., by selecting the “Dismiss” button 610) or allowing the auto-healing service to perform the proposed remediation (e.g., by selecting the “Fix It” button 611).
At block 1780, one or more remediation actions that implement the remediation are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950), the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1750.
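For purposes of illustration, the following sketch shows one possible form of such a task execution engine, in which an in-process queue stands in for the pub/sub topic and the resize call is a placeholder for the storage system's actual volume-resize operation.

```python
# Illustrative task execution engine; the queue stands in for the auto-heal
# topic and resize_volume() is a placeholder, not a real product API.
import queue
from dataclasses import dataclass


@dataclass
class RemediationRequest:
    volume: str
    current_size_gib: int
    grow_pct: int = 20  # M% from the capacity example


def resize_volume(volume: str, new_size_gib: int) -> None:
    # Placeholder for the storage system's actual volume-resize operation.
    print(f"resizing {volume} to {new_size_gib} GiB")


def task_execution_engine(auto_heal_topic: "queue.Queue[RemediationRequest]") -> None:
    # Drain pending remediation execution requests and apply the remediation.
    while not auto_heal_topic.empty():
        req = auto_heal_topic.get()
        new_size = req.current_size_gib + (req.current_size_gib * req.grow_pct) // 100
        resize_volume(req.volume, new_size)
        auto_heal_topic.task_done()


if __name__ == "__main__":
    topic: "queue.Queue[RemediationRequest]" = queue.Queue()
    topic.put(RemediationRequest(volume="vol1", current_size_gib=500))
    task_execution_engine(topic)  # -> resizing vol1 to 600 GiB
```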
While in the context of the present example, a set of one or more rules is identified for evaluation responsive to an EMS event indicative of a volume being X % full, it is to be appreciated that the rules to be evaluated may be identified based on various other EMS events and/or responsive to a periodic schedule associated with the set of one or more rules.
While in the context of the present example, authorization to perform a remediation is described as being authorized by an administrative user via a dialog box presented to the administrative user via a system manager dashboard, it is to be appreciated that, in other examples, the administrative user may configure certain remediations for automated performance without requiring such authorization. For example, as noted above, preferences relating to the desired type of remediation (e.g., automated vs. user activated) for various types of identified issues arising within the distributed storage system may be configured by the administrative user, learned from historical interactions with the administrative user, and/or based on community wisdom. The administrative user may select automated remediation for issues/risks known to arise as a result of periodic changes to the environment in which the distributed storage system operates and/or to the configuration of the distributed storage system.
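By way of non-limiting illustration, such remediation preferences might be represented as a simple per-issue-type policy, as in the following sketch; the issue-type keys and the default policy shown are assumptions.

```python
# Illustrative representation of remediation preferences; keys and defaults are
# assumptions, not a documented schema.
from enum import Enum


class RemediationMode(Enum):
    USER_ACTIVATED = "user_activated"  # require "Fix It" confirmation
    AUTOMATED = "automated"            # perform remediation without prompting


# Preferences could be set explicitly by the administrative user, learned from
# past interactions, or seeded from community wisdom.
REMEDIATION_PREFERENCES = {
    "volume_fullness": RemediationMode.AUTOMATED,
    "snapshot_policy_deviation": RemediationMode.USER_ACTIVATED,
}


def requires_authorization(issue_type: str) -> bool:
    mode = REMEDIATION_PREFERENCES.get(issue_type, RemediationMode.USER_ACTIVATED)
    return mode is RemediationMode.USER_ACTIVATED


if __name__ == "__main__":
    print(requires_authorization("volume_fullness"))            # False: auto-remediate
    print(requires_authorization("snapshot_policy_deviation"))  # True: prompt the user
```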
At decision block 1810, it is determined whether a set of one or more rules is due for evaluation in accordance with an associated periodic schedule. If so, processing continues with block 1820; otherwise, processing loops back to decision block 1810. According to one embodiment, the auto-healing service, for example, via a scheduler/job manager, may launch a task for a given set of one or more rules that periodically wakes up based on a schedule associated with the given set of one or more rules and performs an evaluation of the given set of one or more rules.
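For purposes of illustration only, the following minimal sketch shows one way a scheduler/job manager might periodically wake up and evaluate the rules associated with a given schedule; the thread-based approach and interval handling shown are assumptions.

```python
# Illustrative thread-based scheduler; interval handling and rule callables are
# assumptions made for this sketch.
import threading
import time
from typing import Callable, List


def schedule_rule_evaluation(rules: List[Callable[[], bool]],
                             interval_seconds: float,
                             stop: threading.Event) -> threading.Thread:
    def worker() -> None:
        # Sleep until the next evaluation is due, then evaluate each rule.
        while not stop.wait(interval_seconds):
            for rule in rules:
                if rule():
                    print("risk identified; looking up associated remediation")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    stop = threading.Event()
    # In practice the interval would be daily/weekly/monthly; 1 second keeps the demo short.
    schedule_rule_evaluation([lambda: True], interval_seconds=1.0, stop=stop)
    time.sleep(2.5)
    stop.set()
```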
At block 1820, the set of one or more rules associated with the periodic schedule is evaluated with respect to one or more of historical data and a current state of the data storage system. For example, as noted above, in the context of a capacity example, a rule may be run on a daily, weekly, or monthly basis to evaluate whether any volume utilized by the distributed storage system, or a particular volume thereof, is X % (e.g., 80%) full. This rule may cause a forecast to be performed and evaluated based on the current state of the data storage system (e.g., the particular volume is X % full) and historical data (e.g., usage of the particular volume over time) to determine when the particular volume will be at Y % (e.g., 100%) full. A risk may be identified when the forecasted fullness date is within N (e.g., 3) months.
At decision block 1830, it is determined whether a risk has been identified. If so, processing continues with block 1840; otherwise, processing loops back to decision block 1810. For example, continuing with the capacity example, if the forecasted fullness date is within N months from the current date, this condition may be indicative of a volume fullness risk.
At block 1840, the availability of a remediation that addresses or mitigates the identified risk is determined. For example, continuing with the capacity example, the remediation associated with the risk may be one that causes the volume size to be increased by M % (e.g., 20%).
At block 1850, an administrative user of the data storage system may be notified of the risk and the proposed automated remediation. According to one embodiment, the administrative user may be notified via a system manager dashboard (e.g., system manager dashboard 500 of
At decision block 1860, it is determined whether performance of the remediation is authorized. According to one embodiment, authorization of the remediation may be received via interaction of the administrative user with a dialog box (e.g., dialog box 600) that provides the administrative user with the option of dismissing the event (e.g., by selecting the “Dismiss” button 610) or allowing the auto-healing service to perform the proposed remediation (e.g., by selecting the “Fix It” button 611).
At block 1870, one or more remediation actions that implement the remediation are executed. For example, responsive to receipt of a remediation execution request via a pub/sub bus (e.g., pub/sub/auto-heal topic 950), the task execution engine may execute the set of one or more remediation actions to implement the remediation identified in block 1840.
While in the context of the present example, a set of one or more rules is identified for evaluation responsive to a particular periodic schedule (e.g., of a day, a week, or a month) associated with the set of one or more rules, it is to be appreciated that the rules to be evaluated may be identified based on other periodic schedules, an EMS event indicative of a volume being X % full, and/or various other EMS events associated with the set of one or more rules.
While in the context of the present example, authorization to perform a remediation is described as being authorized by an administrative user via a dialog box presented to the administrative user via a system manager dashboard, it is to be appreciated that, in other examples, the administrative user may configure certain remediations for automated performance without requiring such authorization. For example, as noted above, preferences relating to the desired type of remediation (e.g., automated vs. user activated) for various types of identified issues arising within the distributed storage system may be configured by the administrative user, learned from historical interactions with the administrative user, and/or based on community wisdom. The administrative user may select automated remediation for issues/risks known to arise as a result of periodic changes to the environment in which the distributed storage system operates and/or to the configuration of the distributed storage system.
While in the context of the examples described with reference to
Various components of the present embodiments described herein may include hardware, software, or a combination thereof. Accordingly, it is to be understood that operation of a distributed storage management system (e.g., data management storage solution 130, cluster 201, cluster 335, or a cluster of one or more of virtual storage systems 410a-c) or one or more components thereof may be implemented using a computing system via corresponding instructions stored on or in a non-transitory computer-readable medium accessible by a processing system. For the purposes of this description, a tangible computer-usable or computer-readable medium can be any apparatus that can store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may include non-volatile memory including magnetic storage, solid-state storage, optical storage, cache memory, and RAM.
The various systems and subsystems (e.g., file system layer 411, RAID layer 413, and storage layer 415), and/or nodes 102 (when represented in virtual form) of the distributed storage system described herein, and the processing described herein may be implemented in the form of executable instructions stored on a machine-readable medium and executed by one or more processing resources (e.g., one or more of, or a combination of one or more of, a microcontroller, a microprocessor, central processing unit core(s), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), and the like) and/or in the form of other types of electronic circuitry. For example, the processing may be performed by one or more virtual or physical computer systems (e.g., servers, network storage systems or appliances, blades, etc.) of various forms, such as the computer system described with reference to
Embodiments of the present disclosure include various steps, which have been described above. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause one or more processing resources (e.g., one or more general-purpose or special-purpose processors) programmed with the instructions to perform the steps. Alternatively, depending upon the particular implementation, various steps may be performed by a combination of hardware, software, firmware and/or by human operators.
Embodiments of the present disclosure may be provided as a computer program product, which may include a non-transitory machine-readable storage medium embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, semiconductor memories (such as ROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), and flash memory), magnetic or optical cards, or other types of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).
Various methods described herein may be practiced by combining one or more non-transitory machine-readable storage media containing the code according to embodiments of the present disclosure with appropriate special purpose or standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present disclosure may involve one or more computers (e.g., physical and/or virtual servers) (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps associated with embodiments of the present disclosure may be accomplished by modules, routines, subroutines, or subparts of a computer program product.
Computer system 1900 also includes a main memory 1906, such as a random-access memory (RAM) or other dynamic storage device, coupled to bus 1902 for storing information and instructions to be executed by processor(s) 1904. Main memory 1906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor(s) 1904. Such instructions, when stored in non-transitory storage media accessible to processor(s) 1904, render computer system 1900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 1900 further includes a read only memory (ROM) 1908 or other static storage device coupled to bus 1902 for storing static information and instructions for processor(s) 1904. A storage device 1910, e.g., a magnetic disk, optical disk or flash disk (made of flash memory chips), is provided and coupled to bus 1902 for storing information and instructions.
Computer system 1900 may be coupled via bus 1902 to a display 1912, e.g., a cathode ray tube (CRT), Liquid Crystal Display (LCD), Organic Light-Emitting Diode Display (OLED), Digital Light Processing Display (DLP) or the like, for displaying information to a computer user. An input device 1914, including alphanumeric and other keys, is coupled to bus 1902 for communicating information and command selections to processor(s) 1904. Another type of user input device is cursor control 1916, such as a mouse, a trackball, a trackpad, or cursor direction keys for communicating direction information and command selections to processor(s) 1904 and for controlling cursor movement on display 1912. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Removable storage media 1940 can be any kind of external storage media, including, but not limited to, hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk-Read Only Memory (DVD-ROM), USB flash drives and the like.
Computer system 1900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware or program logic which in combination with the computer system causes or programs computer system 1900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1900 in response to processor(s) 1904 executing one or more sequences of one or more instructions contained in main memory 1906. Such instructions may be read into main memory 1906 from another storage medium, such as storage device 1910. Execution of the sequences of instructions contained in main memory 1906 causes processor(s) 1904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media or volatile media. Non-volatile media includes, for example, optical, magnetic or flash disks, such as storage device 1910. Volatile media includes dynamic memory, such as main memory 1906. Common forms of storage media include, for example, a flexible disk, a hard disk, a solid-state drive, a magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor(s) 1904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 1900 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 1902. Bus 1902 carries the data to main memory 1906, from which processor(s) 1904 retrieve and execute the instructions. The instructions received by main memory 1906 may optionally be stored on storage device 1910 either before or after execution by processor(s) 1904.
Computer system 1900 also includes a communication interface 1918 coupled to bus 1902. Communication interface 1918 provides a two-way data communication coupling to a network link 1920 that is connected to a local network 1922. For example, communication interface 1918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 1920 typically provides data communication through one or more networks to other data devices. For example, network link 1920 may provide a connection through local network 1922 to a host computer 1924 or to data equipment operated by an Internet Service Provider (ISP) 1926. ISP 1926 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 1928. Local network 1922 and Internet 1928 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1920 and through communication interface 1918, which carry the digital data to and from computer system 1900, are example forms of transmission media.
Computer system 1900 can send messages and receive data, including program code, through the network(s), network link 1920 and communication interface 1918. In the Internet example, a server 1930 might transmit a requested code for an application program through Internet 1928, ISP 1926, local network 1922 and communication interface 1918. The received code may be executed by processor(s) 1904 as it is received, or stored in storage device 1910, or other non-volatile storage for later execution.
All examples and illustrative references are non-limiting and should not be used to limit any claims presented herein to specific implementations and examples described herein and their equivalents. For simplicity, reference numbers may be repeated between various examples. This repetition is for clarity only and does not dictate a relationship between the respective examples. Finally, in view of this disclosure, particular features described in relation to one aspect or example may be applied to other disclosed aspects or examples of the disclosure, even though not specifically shown in the drawings or described in the text.
The foregoing outlines features of several examples so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the examples introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202241043049 | Jul 2022 | IN | national |
This application is a continuation-in-part of U.S. patent application Ser. No. 18/392,807, filed Dec. 21, 2023, which is a continuation-in-part of U.S. patent application Ser. No. 18/301,091, filed on Apr. 14, 2023, which claims the benefit of Indian Provisional Application No. 202241043049, filed on Jul. 27, 2022. All of the aforementioned applications are hereby incorporated by reference in their entirety for all purposes.
Relationship | Number | Date | Country
---|---|---|---|
Parent | 18392807 | Dec 2023 | US
Child | 18646119 | | US
Parent | 18301091 | Apr 2023 | US
Child | 18392807 | | US