An enterprise, such as a business, may implement processes using a cloud computing environment. For example, a system design might couple multiple components together (e.g., a load balancer, database, and application server) for execution in a cloud computing environment to support Human Resource (“HR”) processing, Purchase Order (“PO”) activities, financial monitoring, etc. In today's cloud world, these systems include many different types of components and subcomponents (e.g., hardware, software, and network elements) that may each have different reliability and cost considerations. Moreover, a cloud service provider might have a contractual Service Level Agreement (“SLA”) with customers having certain system reliability goals. For example, a cloud service provider might offer high reliability or availability to certain customers (e.g., a service provider could offer 99.x% reliability to some customers). It is possible to improve component reliability by replicating components in a parallel fashion. Such an approach, however, may increase the costs associated with a system design. Currently, there is no automated model that accurately helps a service provider understand the components and subcomponents that may require replication to derive an optimal architecture design. Instead, the provider often selects the elements to replicate manually, adding further (and perhaps unnecessary) cost.
Currently, SLA promises to customers are “one size fits all.” For example, in some cases a service provider may be forced to offer a static SLA to all customers because it lacks the flexibility to add or remove replications dynamically in view of their reliability impact. Not all customers require very high reliability (e.g., 99.99%), and service providers would have more flexibility if they could offer different packages to customers based on criticality. For example, the provider might want to charge more to a customer who prefers 99.9% reliability as compared to a customer who is comfortable with 99.5% reliability. This type of flexible approach might provide benefits for both the cloud customer (e.g., a desired SLA at a reasonable price) and the service provider (e.g., cost savings), but may not be practical without a way of automatically and accurately evaluating potential system designs.
Systems are desired that facilitate an accurate and efficient algorithmic approach to high availability, cost efficient system design, maintenance, and predictions in a cloud computing environment.
According to some embodiments, methods and systems may include a cloud computing design evaluation platform that receives a master variant for a cloud computing design, including a sequential sequence of a set of components. The evaluation platform may then determine a maximum number of parallel levels for the master variant and automatically create a plurality of potential variants of the master variant by expanding the master variant with parallel components in accordance with the maximum number of parallel levels. The evaluation platform determines reliability information (e.g., based on MTBF) and cost information (e.g., a TCO) for each of the set of components. An overall reliability score and an overall cost score are automatically calculated for each of the automatically created potential variants, and an evaluation result of the calculation is indicated. The final result may represent an optimally designed architecture of components and subcomponents that satisfies the needs of both the service provider and the consumer. Some embodiments may also provide continuous monitoring of design performance and/or predict future design performance based on historical data.
Some embodiments comprise: means for receiving, by a computer processor of a cloud computing design evaluation platform, a master variant for a cloud computing design, including a sequential sequence of a set of components; means for determining a maximum number of parallel levels for the master variant; means for automatically creating a plurality of potential variants of the master variant by expanding the master variant with parallel components in accordance with the maximum number of parallel levels; means for determining reliability information for each of the set of components; means for determining cost information for each of the set of components; means for automatically calculating an overall reliability score and overall cost score for each of the automatically created potential variants; and means for indicating an evaluation result of said calculation.
Some technical advantages of some embodiments disclosed herein are improved systems and methods providing an accurate and efficient algorithmic approach to high availability, cost efficient system design, maintenance, and predictions in a cloud computing environment.
Briefly, some embodiments facilitate an accurate and efficient algorithmic approach to high availability, cost efficient system design, maintenance, and predictions in a cloud computing environment. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the embodiments.
One or more specific embodiments of the present invention will be described below. In an effort to provide a concise description of these embodiments, all features of an actual implementation may not be described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developer's specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
The elements of the system 100 may store data into and/or retrieve data from various data stores (e.g., a component database or a historical reliability data store), which may be locally stored or reside remote from the cloud computing design evaluation platform 150. Although a single cloud computing design evaluation platform 150 is shown in
An operator (e.g., a service provider administrator) may access the system 100 via a remote device (e.g., a Personal Computer (“PC”), tablet, or smartphone) to view data about and/or manage operational data in accordance with any of the embodiments described herein. In some cases, an interactive graphical user interface display may let an operator or administrator define and/or adjust certain parameters (e.g., to define component relationships and enter information about a SLA) and/or provide or receive automatically generated recommendations, results, and/or alerts from the system 100. For example, the cloud computing evaluation platform may output an evaluation result (e.g., recommending a particular component arrangement) via an operator or administrator display.
At S210, a computer processor of a cloud computing design evaluation platform may receive a master variant for a cloud computing design. The master variant may, for example, include a sequential sequence of a set of components. By way of examples only, a component might be associated with a load balancer, a dispatcher, a database, an application server, a file system, a router, memory, Network Address Translation (“NAT”), a messaging queue, etc. At S220, the system may determine a maximum number of parallel levels for the master variant (e.g., each component might be allowed to be replicated in parallel up to three times).
A plurality of potential variants of the master variant may then be automatically created at S230 by expanding the master variant with parallel components in accordance with the maximum number of parallel levels. The potential variants of the master variant might be created, for example, by expanding the master variant with parallel identical components (e.g., two identical database components might be provided in parallel to improve reliability). According to some embodiments, at least one potential variant of the master variant is created by expanding the master variant with a parallel alternate component.
At S240, the system may determine reliability information for each of the set of components. The reliability information might be associated with, for example, a Mean Time Between Failures (“MTBF”) for each component (e.g., in days). Similarly, the system may determine cost information for each of the set of components at S250. The cost information might be associated with, for example, a Total Cost of Ownership (“TCO”) for each component. The TCO might be associated with a monetary value, an amount of computing resources, an amount of memory, Input Output (“IO”) considerations, etc. An overall reliability score and an overall cost score are automatically calculated at S260 for each of the automatically created potential variants. At S270, the system may indicate an evaluation result of this calculation (e.g., to recommend one particular variant as the optimal configuration).
According to some embodiments, the cloud computing design evaluation platform is further to determine a SLA associated with the cloud computing design and the evaluation result comprises a selection of one of the automatically created potential variants based on the SLA, the overall reliability scores, and the overall cost scores. In other cases, a TCO goal might be input and used to generate the variant that meets that goal while providing the highest level of reliability.
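By way of illustration only, the two selection modes described above might be sketched in Python as follows. The sketch assumes that an overall reliability score and an overall cost score have already been calculated for each potential variant. The names `VariantScore`, `cheapest_meeting_sla`, and `most_reliable_within_tco` are hypothetical, and the candidate values are made-up numbers rather than results of any embodiment.

```python
from typing import NamedTuple, Optional, Sequence


class VariantScore(NamedTuple):
    """Hypothetical record holding the overall scores of one potential variant."""
    name: str
    reliability: float  # overall reliability score in the range 0..1
    tco: float          # overall cost score (any consistent unit)


def cheapest_meeting_sla(variants: Sequence[VariantScore],
                         sla: float) -> Optional[VariantScore]:
    """Return the lowest-TCO variant whose reliability meets the SLA, if any."""
    eligible = [v for v in variants if v.reliability >= sla]
    if not eligible:
        return None
    # Ties on cost are broken in favor of the more reliable variant.
    return min(eligible, key=lambda v: (v.tco, -v.reliability))


def most_reliable_within_tco(variants: Sequence[VariantScore],
                             tco_goal: float) -> Optional[VariantScore]:
    """Return the most reliable variant whose TCO does not exceed the goal."""
    eligible = [v for v in variants if v.tco <= tco_goal]
    if not eligible:
        return None
    # Ties on reliability are broken in favor of the cheaper variant.
    return max(eligible, key=lambda v: (v.reliability, -v.tco))


# Illustrative, made-up scores for three potential variants.
candidates = [
    VariantScore("master", 0.990, 100.0),
    VariantScore("database x2", 0.996, 130.0),
    VariantScore("database x2, load balancer x2", 0.9992, 150.0),
]

print(cheapest_meeting_sla(candidates, sla=0.995))
print(most_reliable_within_tco(candidates, tco_goal=140.0))
```

Returning `None` when no variant qualifies is one possible design choice; an embodiment might instead report the closest variant or raise an alert.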
According to some embodiments, the cloud computing design evaluation platform continuously monitors the cloud computing design in real time based on design performance. Moreover, the cloud computing design evaluation platform may, in some embodiments, use a machine learning model to predict future cloud computing design performance based on historical cloud computing design performance. In this case, the cloud computing design evaluation platform could automatically generate a recommended design based on the predicted future cloud computing design performance.
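By way of illustration only, a single monitoring check might be sketched in Python as follows. The sketch assumes that failures are independent, that each design is given as a list of stages in series, and that each stage is a list of the currently measured reliabilities of its parallel components. The function names, the `margin` parameter, and the returned status strings are hypothetical.

```python
from math import prod
from typing import Sequence


def system_reliability(stages: Sequence[Sequence[float]]) -> float:
    """Series-parallel reliability: stages in series, members of a stage in parallel."""
    return prod(1.0 - prod(1.0 - r for r in stage) for stage in stages)


def check_sla(stages: Sequence[Sequence[float]], sla: float,
              margin: float = 0.0) -> str:
    """Classify the current design against a contractual reliability SLA.

    Returns "ALERT" when the reliability is below the SLA, "WARNING" when it
    is within `margin` of the SLA (i.e., about to drop below it), else "OK".
    """
    reliability = system_reliability(stages)
    if reliability < sla:
        return "ALERT"
    if reliability < sla + margin:
        return "WARNING"
    return "OK"


# Two parallel dispatchers followed by one database (made-up measurements).
print(check_sla([[0.9, 0.9], [0.99]], sla=0.98))                # healthy
print(check_sla([[0.9, 0.9], [0.99]], sla=0.98, margin=0.005))  # close to the SLA
print(check_sla([[0.8, 0.9], [0.99]], sla=0.98))                # degraded component
```

In a continuously monitoring embodiment, such a check might be executed each time new component reliability measurements arrive, with an "ALERT" result triggering regeneration of the potential variants.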
In this way, embodiments may help maintain a desirable balance between reliability and TCO for complex cloud computing environment architectures. For example, the algorithms described herein may let a provider design an optimal system architecture by suggesting the parallel replications required for various components (which in turn can help fulfill a contractual reliability SLA while keeping the TCO to a minimum value). Embodiments may also maintain a system architecture by continuously monitoring the component reliabilities (and generating an alert if the system reliability drops or is about to drop below a contractual SLA). In that case, the system may automatically identify and propose a new system architecture that meets the reliability criteria while keeping the TCO down. Some embodiments may also predict the best future system architecture with the help of a time-series-based machine learning model. The model may, for example, analyze historical component reliabilities, observe the trends, and suggest the best design accordingly.
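By way of illustration only, a trend-aware forecast of a single component's reliability might be sketched in Python using Holt's linear (double) exponential smoothing. This particular technique, the function name `holt_forecast`, the smoothing parameters, and the sample history are assumptions made for the example; an embodiment might use any suitable time-series model.

```python
from typing import List, Sequence


def holt_forecast(history: Sequence[float], steps: int,
                  alpha: float = 0.5, beta: float = 0.5) -> List[float]:
    """Forecast future reliability scores with Holt's linear exponential smoothing.

    `history` holds past reliability scores (oldest first) and must contain at
    least two observations. Forecasts are clamped to the valid range 0..1.
    """
    if len(history) < 2:
        raise ValueError("at least two historical observations are required")
    if not (0.0 < alpha <= 1.0 and 0.0 < beta <= 1.0):
        raise ValueError("alpha and beta must be in the interval (0, 1]")

    level = history[0]
    trend = history[1] - history[0]  # simple initial trend estimate
    for observed in history[1:]:
        previous_level = level
        level = alpha * observed + (1.0 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1.0 - beta) * trend

    # The h-step-ahead forecast extends the last level along the last trend.
    return [min(1.0, max(0.0, level + h * trend)) for h in range(1, steps + 1)]


# Made-up history of a component whose reliability degrades steadily.
print(holt_forecast([0.99, 0.98, 0.97, 0.96], steps=2))
```

The forecast component reliabilities might then be supplied to the same variant evaluation that is used for current data in order to suggest a future design.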
Embodiments described herein may calculate the reliability of various system designs. For example,
Large systems may consist of various components connected in series and parallel modes. The resultant reliability of the system may be measured by calculating the relevant series and parallel reliabilities. In series-parallel structure, the system consists of subsystems in series where, for each subsystem, multiple components are used in parallel. In the first design 400 of
A provider may replicate components in parallel to improve reliability, so that a single component failure will not cause the entire system to fail. The reliability of component A in the design 400 is improved in
Since component A consists of parallel connections, the Parallel Reliability (A) is calculated first:
where Ra, Ra1, and Ra2 are the respective reliabilities of components A, A1, and A2.
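By way of illustration only, and assuming that component failures are independent, the reliability of a group of parallel components is commonly computed as one minus the product of the individual failure probabilities, i.e., 1 - (1 - Ra)(1 - Ra1)(1 - Ra2), while the reliability of stages connected in series is the product of the stage reliabilities. The following Python sketch applies these two standard formulas; the function names and numeric values are hypothetical.

```python
from math import prod
from typing import Iterable, Sequence


def parallel_reliability(reliabilities: Iterable[float]) -> float:
    """A parallel group fails only if every member fails (independence assumed)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def series_reliability(reliabilities: Iterable[float]) -> float:
    """A series chain works only if every stage works (independence assumed)."""
    return prod(reliabilities)


def system_reliability(stages: Sequence[Sequence[float]]) -> float:
    """Series-parallel structure: compute each parallel group, then multiply."""
    return series_reliability(parallel_reliability(stage) for stage in stages)


# Component A replicated as A, A1, and A2 (made-up reliability of 0.9 each),
# followed in series by two single components.
r_a = parallel_reliability([0.9, 0.9, 0.9])
print(r_a)
print(system_reliability([[0.9, 0.9, 0.9], [0.95], [0.99]]))
```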
A provider could improve the system 500 reliability by adding parallel components, such as parallel components for the dispatcher 520 and database 530.
Note that in some embodiments, it is not necessary to replicate the exact same component in a parallel connection. For example, instead of using the second database 632 as a parallel connection, it is possible to use a file system as a back-up component as shown in
The reliability of components may be measured separately and provided as an input to the algorithm. Note that, by default, the same component might be considered for parallel replication using the same reliability score. If the back-up component is different or the reliability score is different, the system may need to provide the additional failure information separately. Similarly, the TCO may also be calculated per component and be supplied as an input to the algorithm. Note that, by default, the same component might be considered for parallel replication having the same TCO value. If the back-up component is different or the TCO value is different, the system may need to provide the additional cost information separately.
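By way of illustration only, these per-component inputs might be represented in Python as follows. In this sketch, identical replicas reuse the reliability and TCO of the primary component by default, while a different back-up component carries its own separately supplied values. The class names, the assumption that the TCO of a parallel group is the sum of its members' TCO values, the assumption of independent failures, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    name: str
    reliability: float  # measured separately and supplied as an input
    tco: float          # calculated per component and supplied as an input


@dataclass
class Stage:
    """One position in the master variant, possibly expanded in parallel."""
    primary: Component
    replicas: int = 1                   # total identical instances of the primary
    backup: Optional[Component] = None  # optional alternate parallel component

    def members(self) -> List[Component]:
        items = [self.primary] * self.replicas
        if self.backup is not None:
            items.append(self.backup)
        return items

    def reliability(self) -> float:
        # Parallel group: fails only if every member fails (independence assumed).
        return 1.0 - prod(1.0 - c.reliability for c in self.members())

    def tco(self) -> float:
        # Assumption: the cost of a parallel group is the sum of its members.
        return sum(c.tco for c in self.members())


database = Component("database", reliability=0.95, tco=40.0)
file_system = Component("file system", reliability=0.90, tco=10.0)

same_component = Stage(database, replicas=2)              # default: identical replica
alternate_backup = Stage(database, backup=file_system)    # different back-up component
print(same_component.reliability(), same_component.tco())
print(alternate_backup.reliability(), alternate_backup.tco())
```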
A reliability SLA agreed to with a customer may also be provided as an input to the algorithm. This may represent a minimum reliability that is expected from the whole system. The sequential order of components can also be supplied as a {master variant} input to the system. A maximum number of allowed parallel levels may represent an optional input parameter that indicates the maximum allowed replications (parallel connections) for components. This parameter may reduce false positives while determining the best scenario variant (by preventing endless parallelization of components).
At S810, the system may automatically identify and generate the possible variants for a reliability design. This step may, for example, identify all possible variants of a given {master variant} (the one in which the required components are arranged sequentially). The algorithm starts with the master variant and expands the nodes one-by-one. At each step, one node of a variant is expanded, producing a new variant. This process is repeated until no further expansions are possible (that is, until every component has been expanded to the maximum allowed parallel level).
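By way of illustration only, the expansion described above might be sketched in Python as a breadth-first enumeration. The sketch assumes that every component is replicated with identical copies, that component names are unique, and that a variant can be represented by the number of parallel instances of each component; the function name `generate_variants` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence


def generate_variants(master: Sequence[str], max_levels: int) -> List[Dict[str, int]]:
    """Enumerate all variants reachable from the master variant.

    Each variant maps a component name to its number of parallel instances,
    from 1 (the master variant) up to `max_levels`.
    """
    start = tuple(1 for _ in master)  # master variant: one instance of each
    seen = {start}
    queue = deque([start])
    variants = []
    while queue:
        levels = queue.popleft()
        variants.append(dict(zip(master, levels)))
        # Expand one node at a time; each expansion yields a new variant.
        for i, level in enumerate(levels):
            if level < max_levels:
                expanded = levels[:i] + (level + 1,) + levels[i + 1:]
                if expanded not in seen:  # skip variants already generated
                    seen.add(expanded)
                    queue.append(expanded)
    return variants


variants = generate_variants(["load balancer", "application server", "database"], 3)
print(len(variants))
print(variants[0])
print(variants[-1])
```

With n components and a maximum of L parallel levels, this enumeration produces L to the power of n variants, which is one reason the maximum number of parallel levels may be a useful input.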
Referring again to
The system may then perform continuous monitoring of reliability and TCO for a design. Referring again to
In addition to continuously monitoring performance in substantially real time, some embodiments may use machine learning to evaluate future design performance. For example,
Initially, the historical reliability scores for the components are stored for future use. At S1310, the system reads the data from the storage and passes it to a machine learning model for processing. The system may also receive relevant inputs, such as component TCO, a contractual SLA, a master variant design, a maximum number of parallel levels, etc. At S1320, one or more time-series machine learning models are applied to predict future reliability changes for design components. For example, the system may read the component historical reliability scores 1310 and apply a time-series-based machine learning algorithm (e.g., autoregressive, exponential smoothing, the Prophet library, etc.) to forecast potential reliability changes for components. For example,
Step 1 starts with {Master Variant}, which is inserted into a queue at Step 2. At Step 3, the best variant is defined to be {Master Variant}, and Steps 4 and 5 expand the node in a parallel fashion to reach {Next Variant}. Variants are compared in terms of SLA and cost to select the best variant. Step 6 marks the root as “Expanded” and places it back into the queue. Steps 7 through 9 continue until no further expansions are possible. At Step 10, the best design is selected based on the best variant, reliability is set to the reliability of the best variant, and the TCO is set to the TCO of the best variant.
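By way of illustration only, the steps above might be approximated in Python as follows. The sketch makes several assumptions that are not taken from the steps themselves: failures are independent, replicas are identical copies, the TCO of a variant is the sum of its instances' TCO values, a set of already generated variants is used in place of the “Expanded” marking of Step 6, and a variant is considered better if it meets the SLA at a lower TCO (or, when no variant meets the SLA, if it is more reliable). All names and numeric values are hypothetical.

```python
from collections import deque
from math import prod
from typing import Sequence, Tuple


def score(levels: Tuple[int, ...], reliability: Sequence[float],
          tco: Sequence[float]) -> Tuple[float, float]:
    """Overall reliability and TCO of a variant given per-component replica counts."""
    r = prod(1.0 - (1.0 - reliability[i]) ** n for i, n in enumerate(levels))
    c = sum(tco[i] * n for i, n in enumerate(levels))
    return r, c


def better(r: float, c: float, best_r: float, best_c: float, sla: float) -> bool:
    """Compare a candidate against the current best in terms of SLA and cost."""
    meets, best_meets = r >= sla, best_r >= sla
    if meets != best_meets:
        return meets                        # meeting the SLA always wins
    if meets:
        return (c, -r) < (best_c, -best_r)  # both meet: cheaper wins
    return r > best_r                       # neither meets: more reliable wins


def find_best_variant(reliability: Sequence[float], tco: Sequence[float],
                      sla: float, max_levels: int):
    master = tuple(1 for _ in reliability)            # Step 1
    queue = deque([master])                           # Step 2
    seen = {master}
    best = master                                     # Step 3
    best_r, best_c = score(master, reliability, tco)
    while queue:                                      # Steps 7 through 9
        current = queue.popleft()
        for i, n in enumerate(current):               # Steps 4 and 5
            if n >= max_levels:
                continue
            nxt = current[:i] + (n + 1,) + current[i + 1:]
            if nxt in seen:                           # stands in for Step 6
                continue
            seen.add(nxt)
            queue.append(nxt)
            r, c = score(nxt, reliability, tco)
            if better(r, c, best_r, best_c, sla):
                best, best_r, best_c = nxt, r, c
    return best, best_r, best_c                       # Step 10


# Made-up inputs: load balancer, application server, database.
best, best_r, best_c = find_best_variant(
    reliability=[0.99, 0.95, 0.90], tco=[10, 20, 30], sla=0.99, max_levels=3)
print(best, best_r, best_c)
```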
Some embodiments may provide user interfaces to facilitate execution of an algorithm for a reliable, cost-optimal system design. For example,
Other user interfaces may help develop a system solution by connecting components (in serial parallel fashion). For example,
The user interface may help generate all possible design variants for a set of given inputs.
Note that the embodiments described herein may be implemented using any number of different hardware configurations. For example,
The processor 2110 also communicates with a storage device 2130. The storage device 2130 can be implemented as a single database, or the different components of the storage device 2130 can be distributed using multiple databases (that is, different deployment data storage options are possible). The storage device 2130 may comprise any appropriate data storage device, including combinations of magnetic storage devices (e.g., a hard disk drive), optical storage devices, mobile telephones, and/or semiconductor memory devices. The storage device 2130 stores a program 2112 and/or design evaluation engine 2114 for controlling the processor 2110. The processor 2110 performs instructions of the programs 2112, 2114, and thereby operates in accordance with any of the embodiments described herein. For example, the processor 2110 may receive a master variant for a cloud computing design, including a sequential sequence of a set of components. The processor 2110 may then determine a maximum number of parallel levels for the master variant and automatically create a plurality of potential variants of the master variant by expanding the master variant with parallel components in accordance with the maximum number of parallel levels. The processor 2110 determines reliability information (e.g., MTBF) and cost information (e.g., TCO) for each of the set of components. An overall reliability score and an overall cost score are automatically calculated by the processor 2110 for each of the automatically created potential variants, and an evaluation result of the calculation is indicated.
The programs 2112, 2114 may be stored in a compressed, uncompiled and/or encrypted format. The programs 2112, 2114 may furthermore include other program elements, such as an operating system, clipboard application, a database management system, and/or device drivers used by the processor 2110 to interface with peripheral devices.
As used herein, data may be “received” by or “transmitted” to, for example: (i) the platform 2100 from another device; or (ii) a software application or module within the platform 2100 from another software application, module, or any other source.
In some embodiments (such as the one shown in
Referring to
The design identifier 2202 might be a unique alphanumeric label or link that is associated with a cloud-based system design being evaluated, monitored, predicted, etc. The master variant 2204 may comprise a series of components defining how the design operates. The inputs 2206 may include a maximum number of parallel levels (L), a SLA, reliability information, TCO information, etc. The optimal solution 2208 may comprise a design that has been selected, from a set of potential variants to the master variant 2204, as the best design in view of the inputs 2206.
Thus, embodiments may help identify the components that require high availability (along with the number of instances) in order to maintain a SLA while keeping the TCO as low as possible. Embodiments may provide a scientific model to a service provider to help them understand which of the components require replication (and how many instances). Some embodiments continuously monitor the designs and generate alerts if the total reliability drops below the SLA. If so, the system may regenerate the design variants and suggest a new best variant to meet the SLA. Some embodiments apply a machine learning model to forecast reliability changes and predict the future best scenario based on the predicted changes. Moreover, embodiments may let a service provider offer different packages to different customers based on SLA requirements (making the contractual SLAs dynamic in nature).
The following illustrates various additional embodiments of the invention. These do not constitute a definition of all possible embodiments, and those skilled in the art will understand that the present invention is applicable to many other embodiments. Further, although the following embodiments are briefly described for clarity, those skilled in the art will understand how to make any changes, if necessary, to the above-described apparatus and methods to accommodate these and other embodiments and applications with modifications and alterations limited only by the spirit and scope of the appended claims.
Although specific hardware and data configurations have been described herein, note that any number of other configurations may be provided in accordance with some embodiments of the present invention (e.g., some of the data associated with the databases described herein may be combined or stored in external systems). Moreover, although some embodiments are focused on particular types of enterprise and system components and designs, any of the embodiments described herein could be applied to other types of components and designs. Moreover, the displays shown herein are provided only as examples, and any other type of user interface could be implemented.