Containerized applications are applications that run in isolated runtime environments called containers. Containers encapsulate an application with all its dependencies, including system libraries, binaries, and configuration files. This all-in-one packaging makes a containerized application portable by enabling it to behave consistently across different hosts, allowing developers to write once and run almost anywhere. Containers, however, do not include their own operating systems (OS). Instead, different containerized applications running on a host system share the existing OS provided by that system. Without any need to bundle an extra OS along with the application, containers are extremely lightweight and can launch very quickly. To scale an application, more instances of a container can be added almost instantaneously.
Chaos engineering is a method of testing distributed production software that deliberately introduces failure and faulty scenarios to the production software to verify its resilience in the face of disruptions, random or otherwise. These disruptions can cause applications to respond unpredictably and break under pressure.
In one general aspect, the present invention is directed to computer-implemented systems and methods for chaos testing a target application. The target application can be, for example, a containerized application or an application running on a virtual machine. The chaos testing can test a production or non-production (e.g., offline) version of the target application. Performing the testing on a non-production version protects any online, production version of the target application from being affected by the chaos testing. During the chaos testing of a non-production version of the target application, the non-production version can (i) generate responses to a simulated traffic stream (e.g., HTTP requests) for the target application while simultaneously (ii) being subject to one or more chaos conditions that can be specified by a user, e.g., a person or team running the chaos experiment. These and other benefits realizable from embodiments of the present invention will be apparent from the description that follows.
Various embodiments of the present invention are described herein by way of example in connection with the following figures.
Various embodiments of the present invention are directed to systems and methods for performing chaos testing, such as for a software application, particularly a containerized application or an application running on a virtual machine (VM). At the outset, as background and in connection with
The memory devices 114A-B may be volatile or non-volatile memory devices, such as RAM, ROM, EEPROM, or any other device capable of storing data. The memory devices 114A may be persistent storage devices such as hard drive disks (“HDD”), solid-state drives (“SSD”), and/or persistent memory (e.g., Non-Volatile Dual In-line Memory Module (“NVDIMM”)). I/O device(s) 116 refers to devices capable of providing an interface between one or more processor pins and an external device, the operation of which is based on the processor inputting and/or outputting binary data. CPU(s) 112 may be interconnected using a variety of techniques, ranging from a point-to-point processor interconnect, to a system area network, such as an Ethernet-based network. Local connections within physical hosts 110, including the connections between processor(s) 112 and memory devices 114A-B and between processor(s) 112 and I/O device 116 may be provided by one or more local buses of suitable architecture, for example, peripheral component interconnect (PCI).
The physical host 110 may run one or more isolated guests, for example, a VM 122, which may in turn host additional virtual environments (e.g., VMs and/or containers). In an example, a container (e.g., storage container 160, service containers 150A-B) may be an isolated guest using any form of operating system level virtualization, for example, Red Hat® OpenShift®, Docker® containers, chroot, Linux®-VServer, FreeBSD® Jails, HP-UX® Containers (SRP), VMware ThinApp®, etc. Storage container 160 and/or service containers 150A-B may run directly on a host operating system (e.g., host OS 118) or run within another layer of virtualization, for example, in a virtual machine (e.g., VM 122). In an example, containers that perform a unified function may be grouped together in a container cluster that may be deployed together, e.g., in a Kubernetes® pod. A pod is a group of one or more containers, with shared storage and network resources, and a specification of how to run the containers. A pod's contents can be co-located and co-scheduled, and run in a shared context.
The cluster 100 may run one or more VMs (e.g., VMs 122), by executing a software layer (e.g., hypervisor 120) above the hardware and below the VM 122. The hypervisor 120 may be a component of respective host operating system 118 executed on physical host 110, for example, implemented as a kernel based virtual machine function of host operating system 118. In another example, the hypervisor 120 may be provided by an application running on host operating system 118. The hypervisor 120 may also run directly on physical host 110 without an operating system beneath hypervisor 120. Hypervisor 120 may virtualize the physical layer, including processors, memory, and I/O devices, and present this virtualization to VM 122 as devices, including virtual central processing unit (“VCPU”) 190, virtual memory devices (“VMD”) 192, virtual input/output (“VI/O”) device 194, and/or guest memory 195. In an example, another virtual guest (e.g., a VM or container) may execute directly on host OSs 118 without an intervening layer of virtualization.
The VM 122 may be a virtual machine and may execute a guest operating system 196, which may utilize the underlying VCPU 190A, VMD 192A, and VI/O 194A. Processor virtualization may be implemented by the hypervisor 120 scheduling time slots on physical CPUs 112 such that, from the guest operating system's perspective, those time slots are scheduled on a virtual processor 190. The VM 122 may run any type of dependent, independent, compatible, and/or incompatible applications on the underlying hardware and host operating system 118. The hypervisor 120 may manage memory for the host operating system 118 as well as memory allocated to the VM 122 and guest operating system 196, such as guest memory 195 provided to guest OS 196. In an example, storage container 160 and/or service containers 150A, 150B are similarly implemented.
In addition to distributed storage provided by storage container 160, a storage controller may additionally manage storage in dedicated storage nodes (e.g., NAS, SAN, etc.). In an example, a storage controller may deploy storage in large logical units with preconfigured performance characteristics (e.g., storage nodes 170). In an example, access to a given storage node (e.g., storage node 170) may be controlled on an account and/or tenant level. In an example, a service container (e.g., service containers 150A-B) may require persistent storage for application data, and may request persistent storage with a persistent storage claim to an orchestrator of the cluster 100. In the example, a storage controller may allocate storage to service containers 150A-B through a storage node (e.g., storage nodes 170) in the form of a persistent storage volume. In an example, a persistent storage volume for service containers 150A-B may be allocated a portion of the storage capacity and throughput capacity of a given storage node (e.g., storage nodes 170). In various examples, the storage container 160 and/or service containers 150A-B may deploy compute resources (e.g., storage, cache, etc.) that are part of a compute service that is distributed across multiple clusters (not shown in
A container engine 12 is software that builds, starts, and manages containers on a host. Each container is started from a container image, which is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. The container engine 12 enables the host OS 118 to act as a container host. The container engine 12 accepts user commands to build, start, and manage containers through client tools (including CLI-based or graphical tools), and it provides an API that enables external programs to make similar requests. The container engine 12 can comprise a container runtime, which is responsible for creating the standardized platform on which applications can run, for running containers, and for handling the container's storage needs on the local system.
Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in containers. OpenShift from Red Hat is a Docker-based, layered system that abstracts the creation of Linux-based container images. Cluster management and orchestration of containers on multiple hosts is handled by Kubernetes.
Turning now to the novel chaos testing aspects of the present invention,
The enterprise computer system 20 can include, or be implemented as part of, one or more clusters 100, such as shown in
In this example, a static repository copy of the code 22 for the target application 32 to be chaos-tested may be stored in a source code repository 24, such as a Git-based repository (e.g., Bitbucket). Various embodiments of the present invention rely on Apache JMeter as the load-testing tool for the target application, and JMeter typically requires a JMX script, i.e., a test plan saved in JMeter's XML-based .jmx file format. Accordingly, the repository 24 can store a JMX script 28 for the target application according to various embodiments. JMeter can be launched with jmeter.bat on Windows or with the jmeter script on Unix. The JMX script can be created using, for example, a Postman-to-JMX converter, BlazeMeter, or BadBoy.
The illustrated enterprise computer system 20 also comprises a container platform 30. The container platform 30 can manage containerized applications and, in various embodiments, an OpenShift container platform, from Red Hat Software, can be used. The container platform 30 comprises, according to various embodiments, the non-production copy of the target application 32, the JMX script 34 for the target application (generated from the target application code repository 22), a “Perf Ops” software module 36, a fault injection module 38, and a chaos-testing module 40.
Importantly, the target application 32 can be tested based on, simultaneously, (i) a simulated traffic flow (e.g., transactions per second) for the target application 32 that is generated with the perf ops module 36 using the JMX script 34 and (ii) chaos event settings for chaos events or conditions that are injected from the chaos-testing module 40 into the target application 32. The chaos events or conditions can be user-defined via the fault injection module 38, as described further below.
The JMX script 34 simulates a non-chaotic traffic condition for the target application 32 for the testing, e.g., a steady-state traffic condition. For example, traffic data for the production version of the target application can be captured, such as via a traffic monitoring application or system, so that typical traffic patterns can be learned. The simulated traffic flow for the non-production copy of the target application 32 used for the chaos testing can then replicate, or sample, a known, typical, or even outlier traffic scenario for the production version of the target application. The simulated traffic condition can include or specify, for example, a number of transactions per second for the testing, where the transactions can be, for example, HTTP requests to the target application 32. The simulated traffic might also simulate, for example, a number of users for the target application, over the duration of the chaos testing, that is typical for the production version of the target application. The simulated traffic can be similar to the historical traffic patterns that it simulates, such as within an upper and lower bound (e.g., +/−5%) of the typical peak transactions and users. A user performing the chaos testing may select the simulated traffic condition for the target application 32 for the testing via the perf ops module 36. That is, the perf ops module 36 may provide a user interface (e.g., a browser-based user interface) through which the user can, for example, select a simulated traffic condition from a pre-established menu of possible simulated traffic scenarios, or the user can design or specify, via the user interface of the perf ops module 36, a custom simulated traffic scenario for the testing.
The perf ops module 36 can transmit the parameters for the user selection for the simulated traffic condition to the JMX script 34, and the JMX script then generates the simulated traffic for the target application 32 according to the user's specification for the testing. That way, the response of the non-production target application 32 to the chaos events for the simulated traffic scenario (e.g., number of users interacting with target application 32, number of HTTP requests to the target application 32, etc.) can be monitored, and changes to the production version of the target application 32 to better address such chaos events under similar traffic conditions can be made.
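As a rough illustration of the upper and lower bound on simulated traffic described above, the following Python sketch checks whether a simulated load falls within a tolerance band around a historical peak. The function name `within_band`, the 5% default tolerance, and the sample figures are assumptions made only for illustration; they are not taken from the perf ops module 36.

```python
def within_band(simulated: float, historical_peak: float, tolerance: float = 0.05) -> bool:
    """Return True if `simulated` lies within +/- `tolerance` of `historical_peak`."""
    if historical_peak < 0 or tolerance < 0:
        raise ValueError("historical_peak and tolerance must be non-negative")
    lower = historical_peak * (1.0 - tolerance)
    upper = historical_peak * (1.0 + tolerance)
    return lower <= simulated <= upper


# Hypothetical figures: a production peak of 200 transactions per second.
print(within_band(205.0, 200.0))  # 205 lies between 190 and 210
print(within_band(215.0, 200.0))  # 215 exceeds the upper bound of 210
```

The same check could be applied to a simulated user count by passing the typical peak number of users as `historical_peak`.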
In various embodiments, the chaos-testing module 40 can use LitmusChaos, which is a cloud-native, open source chaos-engineering framework for Kubernetes environments. LitmusChaos can be installed in an OpenShift containerized environment. As such, in various embodiments, the chaos-testing module 40 can receive YAML declarations for the chaos conditions from the fault injection module 38, to be injected into the target application 32. YAML is a human-readable data-serialization language often used for writing configuration files, such as, in this case, configuration files for the chaos-testing module 40. The structure of a YAML file can be, for example, a map or a list, and it can follow a hierarchy depending on the indentation and on how key-value pairs are defined. In that connection, the fault injection module 38 may be a software program that allows a user, e.g., the person or the team of persons conducting the chaos engineering test, to, via a user interface (e.g., a browser-based user interface) provided by the fault injection module 38, select the target application 32 for the chaos testing and to set the parameters for the chaos testing.
In various embodiments, the fault injection module user interface can use a name-space approach. The user interface can have different name spaces, like folders, each with selection options for different types of chaos tests. The options allow, for example, the user to select the target application 32 for the testing and to select chaos parameters for the testing. The chaos parameters can vary by name-space, which can vary by the type of test. Some exemplary parameters that can be specified via the fault injection module 38 for the chaos testing include:
Below is an example of pseudo code for the chaos-testing module 40 for a CPU stress test.
Below is an example of pseudo code for the chaos-testing module 40 for a Pod Kill test.
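As a hedged illustration of what chaos settings for a CPU stress test and a Pod Kill test might look like, the following Python sketch builds two manifests as plain dictionaries. The field names are believed to follow the ChaosEngine resource layout used by LitmusChaos (with the `pod-cpu-hog` and `pod-delete` experiments) and should be verified against the installed LitmusChaos version. The resource names, namespace, label, service account, and parameter values are all hypothetical, and the sketch is not a reproduction of any listing from the chaos-testing module 40.

```python
from typing import Any, Dict


def chaos_engine(name: str, namespace: str, app_label: str,
                 experiment: str, env: Dict[str, str]) -> Dict[str, Any]:
    """Build a ChaosEngine-style manifest as a plain dictionary."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "chaos-sa",  # hypothetical service account
            "experiments": [{
                "name": experiment,
                "spec": {"components": {"env": [
                    {"name": key, "value": str(value)}
                    for key, value in env.items()
                ]}},
            }],
        },
    }


# Hypothetical CPU stress settings: one core stressed for 60 seconds.
cpu_stress = chaos_engine("cpu-stress-demo", "test-ns", "app=target-app",
                          "pod-cpu-hog",
                          {"TOTAL_CHAOS_DURATION": "60", "CPU_CORES": "1"})

# Hypothetical Pod Kill settings: delete a pod every 10 seconds for 30 seconds.
pod_kill = chaos_engine("pod-kill-demo", "test-ns", "app=target-app",
                        "pod-delete",
                        {"TOTAL_CHAOS_DURATION": "30", "CHAOS_INTERVAL": "10",
                         "FORCE": "false"})

print(cpu_stress["spec"]["experiments"][0]["name"])
print(pod_kill["spec"]["experiments"][0]["name"])
```

Keeping the two tests behind one builder function reflects that they differ only in the experiment name and its environment parameters.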
Once the user finalizes the user selections, the fault injection module 38, for example, packages the user selections for the chaos testing into a YAML file for the chaos-testing module 40. The chaos testing module 40 reads the parameters from the received YAML file and initiates the chaos experiment for the target application 32 based on the read, user-specified chaos parameters. In particular, based on the parameters in the YAML file, the chaos-testing module can orchestrate the chaos injection into the target application 32.
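The packaging of user selections into a YAML file can be sketched in Python as follows. This is a minimal, hand-written emitter for nested maps, lists, and scalars, shown only to illustrate how indentation expresses the hierarchy; a practical implementation would more likely rely on an established YAML library. The keys `target`, `experiment`, and `env` and their values are hypothetical, and map keys are assumed to be simple identifiers that need no quoting.

```python
import json
from typing import Any, List


def scalar(value: Any) -> str:
    """Render a single value as a YAML scalar."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, dict):
        return "{}"  # only reached for empty maps
    if isinstance(value, list):
        return "[]"  # only reached for empty lists
    # Strings are emitted as double-quoted scalars.
    return json.dumps(str(value))


def to_yaml(value: Any, indent: int = 0) -> List[str]:
    """Render nested dicts, lists, and scalars as block-style YAML lines."""
    pad = " " * indent
    lines: List[str] = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}{key}:")
                lines.extend(to_yaml(item, indent + 2))
            else:
                lines.append(f"{pad}{key}: {scalar(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)) and item:
                nested = to_yaml(item, indent + 2)
                # The first nested line shares the line with the list dash.
                lines.append(f"{pad}- {nested[0].lstrip()}")
                lines.extend(nested[1:])
            else:
                lines.append(f"{pad}- {scalar(item)}")
    else:
        lines.append(f"{pad}{scalar(value)}")
    return lines


# Hypothetical user selections from a fault injection user interface.
selections = {
    "target": "target-app",
    "experiment": "pod-cpu-hog",
    "env": [{"name": "CPU_CORES", "value": "1"}],
}
print("\n".join(to_yaml(selections)))
```

Quoting every string avoids values such as "1" or "false" being read back as a number or a boolean.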
During the testing, the Perf Ops module 36 can monitor and track the responses from the target application 32 to requests in the simulated traffic flow and display, for the user, codes for the responses. For example, if the target application 32 successfully responded to a request in the simulated traffic, an HTTP 200 OK status code can be assigned to the request. Other status codes, e.g., HTTP status codes, could be assigned as needed based on the target application's response, such as 401 (unauthorized request), 404 (not found), etc. In various embodiments, a Grafana Labs dashboard can be used for the Perf Ops module 36.
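The tracking of response codes described above can be illustrated with a short Python sketch that tallies observed HTTP status codes for display. The function name `summarize` and the sample list of codes are hypothetical; the reason phrases come from Python's standard `http.HTTPStatus` enumeration, and nothing here is taken from the Perf Ops module 36 itself.

```python
from collections import Counter
from http import HTTPStatus
from typing import Dict, Iterable


def summarize(status_codes: Iterable[int]) -> Dict[str, int]:
    """Count responses per status code, labeled with the standard reason phrase."""
    counts = Counter(status_codes)
    summary: Dict[str, int] = {}
    for code in sorted(counts):
        try:
            label = f"{code} {HTTPStatus(code).phrase}"
        except ValueError:
            # Codes outside the standard enumeration are kept but flagged.
            label = f"{code} (unrecognized)"
        summary[label] = counts[code]
    return summary


# Hypothetical codes collected while chaos conditions were being injected.
observed = [200, 200, 401, 200, 404, 503]
for label, count in summarize(observed).items():
    print(f"{label}: {count}")
```

A dashboard could plot these counts over time to show how the share of non-200 responses changes while a chaos condition is active.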
Steps 60 and 62 may be performed in any sequence. When the chaos testing is initiated, the target application 32 is run (or executed) by, for example, the container platform 30, such that, at step 64, the perf ops module 36 can monitor (and display on a dashboard) the performance of the target application 32 under, simultaneously, the simulated traffic conditions and the chaos conditions. As described above, the performance monitoring can include capturing and displaying HTTP status codes generated by the target application 32 in response to the simulated HTTP requests to it during the testing and under the simultaneous burden of the chaos conditions.
In some embodiments, the target application 32 tested in the above-described manner is a containerized application, such as service containers 150A-B in
In other embodiments, the system can be used to chaos test an application running on a virtual machine (VM), such as VM 122 in
The perf ops module 36, the fault injection module 38, and the chaos-testing module 40 can be software modules stored in the memory devices 114A-B and executed by the host CPU 112, using any suitable computer language, such as, for example, SAS, Java, C, C++, or Perl using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands in the computer memory devices 114A-B. To that end, below is pseudo code for the perf ops module 36 to perform the load performance testing with a JMX script.
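As a hedged sketch of how load testing with a JMX script might be launched, the following Python code assembles a non-GUI JMeter command line. The options used (`-n` for non-GUI mode, `-t` for the test plan, `-l` for the results log, and `-J` to define a JMeter property) are standard JMeter command-line options. The file names and the property names `threads` and `duration` are hypothetical, and the sketch is not a reproduction of any listing from the perf ops module 36.

```python
import subprocess
from typing import Dict, List


def jmeter_command(test_plan: str, results_file: str,
                   properties: Dict[str, str]) -> List[str]:
    """Build a non-GUI JMeter command line for the given test plan."""
    command = ["jmeter", "-n", "-t", test_plan, "-l", results_file]
    for name, value in properties.items():
        # Each -J option defines a JMeter property the test plan can read.
        command.append(f"-J{name}={value}")
    return command


def run_load_test(command: List[str]) -> int:
    """Run JMeter and return its exit code (requires jmeter on the PATH)."""
    return subprocess.run(command, check=False).returncode


# Hypothetical user selections for the simulated traffic condition.
cmd = jmeter_command("target_app.jmx", "results.jtl",
                     {"threads": "50", "duration": "300"})
print(" ".join(cmd))
```

For the properties to take effect, the test plan would need to read them, for example with JMeter's `__P` function in the thread group settings.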
As mentioned previously, the inventive chaos testing system could also be used for a production version of the target application, as shown in the exemplary embodiment depicted in
As before with a non-production version of the target application, in the production version testing shown in
In one general aspect, the present invention, therefore, is directed to computer systems and methods for performing a chaos experiment for a target application. The computer system can comprise one or more processors, and computer memory in communication with the one or more processors. The computer memory stores instructions that, when executed by the one or more processors, cause the one or more processors to: (i) generate, for the chaos experiment, a simulated traffic stream for a non-production version of the target application; (ii) provide chaos event settings for one or more chaos conditions to the non-production version of the target application; (iii) execute the non-production version of the target application during the chaos experiment, such that the non-production version of the target application, during the chaos experiment, (a) generates responses to the simulated traffic stream while simultaneously (b) being subject to the one or more chaos conditions of the chaos event settings; and (iv) monitor the responses generated by the non-production version of the target application during the chaos testing.
A computer-implemented method according to embodiments of the present invention can comprise the steps of: (i) generating, for the chaos experiment, with a computer system that comprises one or more processors, a simulated traffic stream for a non-production version of the target application; (ii) providing, by the computer system, chaos event settings for one or more chaos conditions to the non-production version of the target application; (iii) executing, by the computer system, the non-production version of the target application during the chaos experiment, such that the non-production version of the target application, during the chaos experiment, (a) generates responses to the simulated traffic stream while simultaneously (b) being subject to the one or more chaos conditions of the chaos event settings; and (iv) monitoring, by the computer system, the responses generated by the non-production version of the target application during the chaos testing.
According to various implementations, the computer memory further stores instructions that, when executed by the one or more processors, cause the one or more processors to: generate a declarative YAML file defining chaos condition parameters for the one or more chaos conditions for the chaos experiment for the target application, where the chaos condition parameters for the one or more chaos conditions are based on a user input for the chaos experiment; and provide the chaos event settings to the non-production version of the target application based on the chaos condition parameters from the declarative YAML file. Also, the computer memory can further store instructions that, when executed by the one or more processors, cause the one or more processors to generate the simulated traffic stream from a JMX script for the target application. Still further, the simulated traffic stream can comprise HTTP requests and the responses generated by the non-production version of the target application can comprise HTTP status codes. The simulated traffic stream can simulate a historical traffic stream for a production version of the target application.
In various implementations, the chaos condition can comprise one or more of the following: CPU stress for a container for the target application; network loss for the container for the target application;
In various implementations, the target application comprises a containerized application or an application running on a virtual machine.
The examples presented herein are intended to illustrate potential and specific implementations of the present invention. It can be appreciated that the examples are intended primarily for purposes of illustration of the invention for those skilled in the art. No particular aspect or aspects of the examples are necessarily intended to limit the scope of the present invention. Further, it is to be understood that the figures and descriptions of the present invention have been simplified to illustrate elements that are relevant for a clear understanding of the present invention, while eliminating, for purposes of clarity, other elements. While various embodiments have been described herein, it should be apparent that various modifications, alterations, and adaptations to those embodiments may occur to persons skilled in the art with attainment of at least some of the advantages. The disclosed embodiments are therefore intended to include all such modifications, alterations, and adaptations without departing from the scope of the embodiments as set forth herein.