Data centers are repositories for computing facilities used to store and manage large amounts of data. In large data centers, the computing facilities may include computing systems having large and complex hardware and software architectures. These complexities have led to the development of various automated management tools to help system administrators manage the performance of the computing system.
Large scale, distributed computing systems such as those often found in data centers are implemented as loosely coupled components, each serving a distinct functional purpose. Such a design promotes encapsulation and modularity, allowing the components to evolve independently of one another to serve business needs. The services themselves vary in terms of the role they play as well as the frequency with which they are invoked. As an example, infrastructure level components such as authentication, authorization, logging, alerting, etc. are invoked much more frequently than higher level components that manage logical and physical infrastructure.
The present disclosure is best understood from the following detailed description when read with the accompanying Figures. It is emphasized that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
While examples described herein are susceptible to various modifications and alternative forms, the drawings illustrate specific examples herein described in detail by way of example. It should be understood, however, that the description herein of specific examples is not intended to be limiting to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the examples described herein and the appended claims.
Illustrative examples of the subject matter claimed below will now be disclosed. In the interest of clarity, not all features of an actual implementation are described in this specification. It will be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort, even if complex and time-consuming, would be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Large computing systems, such as those under discussion, evolve over time. Older equipment and features may be replaced. New equipment and features may be added. As newer features are added over time, the problem of finding and fixing holistic system level issues that are architecturally inefficient becomes that much harder, snowballing into ever larger technical debt. Another class of issues comes from inefficient database access patterns. As developers continue to rely on generic object-relational mapping (“ORM”) technologies to isolate them from database access details, one often finds that inefficient queries and tables proliferate, which in turn causes a significant resource crunch at scale. These issues are difficult to identify and isolate in small scale localized testing.
Experience suggests that it is often difficult to predict upfront which intercomponent interactions are susceptible to misuse in terms of architectural adherence (i.e., architectural anti-patterns) and contribute to issues at scale that are not readily apparent in localized developer testing. This can be particularly problematic in the development of new equipment and/or features and before their release. Many of these issues manifest as system-wide issues that are only identified in large scale testing which takes substantial effort to set up and may be too late depending on the phase in the release cycle when the issues are identified.
In addition, as teams and features evolve over time in a product that is being developed at a rapid release cadence, it is very difficult to train developers adequately to ensure they follow all recommended architectural principles. This issue leads to a “whack-a-mole” effect in terms of finding and fixing a class of issues in one component in one release only to find the same issue showing up in a different component in a different release (an artifact of “cut-and-paste driven” development). Occasionally, components unintentionally making calls at a very high frequency for essentially static data may cause an internal “denial of service” style of invocation that can bring the entire application stack down resulting in downtime for the customer.
Some conventional practice identifies performance variances for the same set of operations under different conditions and highlights them as potential areas for developers to investigate. Such approaches fall short of identifying potential contributors, such as the inefficiencies that the solution presented herein identifies. Other conventional practice has focused on drawing the entire call graph for a given set of operations but does not go far enough toward highlighting performance bottlenecks. Extensions of this approach inject faults into the call graph paths to introduce chaos into the system to simulate environmental degradation, but they do not highlight areas such as inadequate event storming implementations.
The present disclosure provides a technique by which service side logs such as Apache (HTTP) logs, Request Identifier (“ID”) logs, database logs and system level performance data are used to identify and correlate patterns of inefficiencies that are known to result in increased system load which result in performance and/or availability issues. This data driven approach provides metrics that can be used to highlight issues as well as implement tests that can be used to enforce criteria for code submission, thus facilitating a “shift-left” paradigm towards architectural compliance. The technique helps ensure that architecture compliance will be monitored regularly without any additional overhead and non-compliance will be found early to avoid expensive fixes and trade-offs later on in the release cycle.
More particularly, the technique uses data available in service side logs (such as Apache (HTTP) logs and database logs) to find and highlight inefficient patterns of invocation that are known to cause performance issues. The data driven approach provides easily measurable and understandable metrics that can then be validated once appropriate fixes are made in the components. This approach allows writing gating tests that can be exercised as exit criteria for code check-ins, which in turn will drive up the quality of the product stack and foster a shift-left mindset toward testing architectural compliance.
One example of such a compliance and enforcement issue alluded to above concerns static data such as the management appliance network address or software version. As components are initialized, they are expected to retrieve static data, such as the management appliance IP address and the software version the component is running, and cache that data for future use instead of repeatedly invoking lower level components that provide Representational State Transfer (“REST”) Application Program Interfaces (“APIs”) for such data. Any changes to such data are communicated over the eventing infrastructure, thus removing the need for polling. However, components routinely violate this principle, resulting in unnecessary polling. By matching the Request Identifiers logged in the Apache logs with component identifiers (“IDs”) logged in the product request logs and correlating them with the observed system and process load at the time, one can identify components that are violating the architectural principles around the frequency of REST invocations for such interactions.
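By way of a non-limiting illustration, the correlation just described might be sketched as follows. The log record shapes, Request Identifiers, URIs, and component names below are illustrative assumptions rather than any particular implementation's formats:

```python
from collections import Counter

# Hypothetical, pre-parsed log records; real Apache and Request
# Identifier log formats vary by deployment, so the field names
# here are illustrative only.
apache_log = [
    {"request_id": "e3bed508", "uri": "/rest/appliance/version"},
    {"request_id": "77aa01c2", "uri": "/rest/appliance/version"},
    {"request_id": "9b3f44d0", "uri": "/rest/alerts"},
]
request_id_log = {
    "e3bed508": "server-resource",   # request id -> source component
    "77aa01c2": "server-resource",
    "9b3f44d0": "power-actuator",
}

# URIs whose responses are static: one retrieval per component is expected.
STATIC_URIS = {"/rest/appliance/version"}

def find_static_polling(apache_log, request_id_log, static_uris):
    """Count per-component calls to static URIs; more than one is a violation."""
    calls = Counter(
        (request_id_log[rec["request_id"]], rec["uri"])
        for rec in apache_log
        if rec["uri"] in static_uris and rec["request_id"] in request_id_log
    )
    return {key: n for key, n in calls.items() if n > 1}

violations = find_static_polling(apache_log, request_id_log, STATIC_URIS)
# server-resource fetched the static version URI twice -> flagged
```

A component that retrieves a static URI more than once is surfaced together with its call count, providing the kind of measurable metric discussed above.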
Another set of examples is environment based: detection of architectural anti-patterns such as failure to implement event storming/burst suppression techniques or net filter rules. By correlating similar/same high frequency incoming events from devices (e.g. port flapping, disk drive notifications) with Apache logs that indicate a high rate of interaction with foundational components such as the alerting service, the presently disclosed technique is able to find and flag issues where architectural guidelines around event suppression techniques have not been implemented adequately. In a similar vein, this solution can also detect patterns where an external agent/plugin sends a very high volume of requests in a short period of time and flag architectural compliance issues around lack of Internet Protocol (“IP”) address based net filter rules to prevent external denial of service style of attacks.
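The burst suppression check described above can be sketched, for instance, as a sliding-window event counter. The window length and threshold below are illustrative assumptions; actual limits would come from the architectural guidelines for the alerting service:

```python
from collections import deque

def detect_event_storm(timestamps, window_s=60, threshold=100):
    """Flag points where event volume suggests missing burst suppression.

    timestamps: sorted arrival times (seconds) for one event type,
    e.g. repeated port-flap notifications from the same device.
    Returns the times at which the rolling count exceeded the threshold.
    """
    window = deque()
    storms = []
    for t in timestamps:
        window.append(t)
        # Evict events older than the rolling window.
        while window and window[0] <= t - window_s:
            window.popleft()
        if len(window) > threshold:
            storms.append(t)  # suppression should have engaged by now
    return storms

# A port flapping every 0.1 s produces 600 events/minute; every event
# past the 100-event budget is flagged as part of a storm.
flapping = [i * 0.1 for i in range(600)]
storm_points = detect_event_storm(flapping, window_s=60, threshold=100)
```

The same counter, keyed by source IP address rather than event type, could flag the external high-volume request pattern for which net filter rules are recommended.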
Turning now to the drawings,
The computing system 100 is a private network housed entirely within the data center. The ports (not shown) of the physical devices are therefore locked down to help prevent unauthorized intrusion of external origination. However, other examples may include portions of, or communicate over, a public network. One public network that might be included in whole or in part is the Internet and, in particular, the World Wide Web portion of the Internet. In such examples, the ports would need to be unlocked to permit interfacing with the public network.
In this particular example, the computing system 100 is a network operating on a client-server model of communication. The computing system 100 employs the HyperText Transfer Protocol (“HTTP”) and the Transmission Control Protocol/Internet Protocol (“TCP/IP”) to define the architecture thereof. The computing system 100 also employs a distributed web application architecture.
In a distributed web application architecture, components are frequently implemented as distinct services, each serving a separate function. This architectural separation ensures encapsulation and loose coupling between services. In this model, interactions between software components are via 1) Representational State Transfer (“REST”) Application Program Interface (“API”) calls or 2) messages on a message bus (not shown). Note, however, that other examples are not limited to the use of REST and may employ some other distributed model using protocols similar to REST over HTTP for inter-component interactions.
The services themselves vary in terms of functionality provided and how often they are invoked. As an example, infrastructure foundation services that provide authorization, authentication, logging, etc. are invoked more often than some others that interact with devices.
The computing system 100 includes a number of interacting entities such as a plurality of users 105, a plurality of servers 110 hosting the application software components, an administrative console 115 interfacing with an internal network 120, and a suite of management resources 122. Those in the art having the benefit of this disclosure will appreciate that the computing system 100 may include any number of physical devices other than those shown. The users 105, servers 110, and administrative console 115 may interface other physical devices not shown such as, for instance, routers, switches, etc. of various kinds. These and other physical devices not shown are omitted for the sake of clarity and so as not to obscure that which is claimed below.
Each component of the computing system 100 (e.g., the servers 110, administrative console 115, physical devices 120-127, etc.) is assigned a Uniform Resource Identifier (“URI”) that uniquely identifies that component within the computing system 100. The URI may be a Uniform Resource Locator (“URL”) or a Uniform Resource Name (“URN”). In the illustrated example, the URIs are URNs, but URLs may be used in other examples.
The computing system 100 may include an architecture that is more complicated than is presented in
The computing system 100 may also be virtualized in some examples. In these examples, physical resources of the physical devices may be allocated to software-implemented “virtual machines” that may be organized into software-implemented “virtual networks”. A virtual machine may include physical resources of multiple physical devices and so may be conceptualized as residing on multiple physical devices. A single physical device may therefore host portions of multiple virtual machines. Similarly, a number of virtual networks may be hosted on the single computing system 100. However, other examples of the computing system 100 may not be virtualized. Thus, details associated with such virtualization are omitted for the sake of clarity and so as not to obscure that which is claimed below.
The users 105 may be, for instance, people operating on the physical devices 125-127. The physical devices 125-127 in various examples may be infrastructure components such as servers, storage, networking components or desktop computers, laptop computers, or mobile devices such as smart phones or tablets, none of which are otherwise shown. The users 105 may also, or alternatively, be software components 130-132 residing on the physical devices 125-127. More typically, the users 105 will be software components 130-132 residing on the physical devices 125-127. The software components 130-132 may be, for instance, without limitation, web browsers or services or third party products.
The servers 110 provide various resources for the computing system 100. Among the servers 110 are a web server 140, a web server 141, and a database server 142. Note that not all resources of the computing system 100 will be servers or even physical devices. They may also include, for example, software components residing on physical devices. For instance, a service 150 resides on the web server 140 and a database 155 resides on the database server 142, and these also may be resources. A variety of logs containing information about the performance of the computing system 100 reside on the web server 141. These logs include, but are not limited to, server logs 160, database logs 161, Request Identifier logs 162, and system data logs 163.
The server logs 160, database logs 161, Request Identifier logs 162, and system data logs 163 are also known as “service side logs” 158. These logs are representative of the many kinds of service side logs that may be found in various computing system implementations. Their listing is therefore illustrative rather than comprehensive or exhaustive. Furthermore, there are many implementations of server logs 160, database logs 161, Request Identifier logs 162, and system data logs 163 that may be used depending upon the implementation of the computing system 100.
Each of the service side logs 158 includes request patterns 164 for the various resources of the computing system 100 of the type corresponding to the type of the log. For instance, the database logs 161 contain data reflecting requests for access to the contents of the individual databases, such as the database 155. These requests can be characterized into patterns in which the database 155 is used. Similarly, the Request Identifier logs 162 contain data reflecting requests for the service 150. These requests also can be characterized into patterns in which the service 150 is used. Thus, each of the service side logs 158 contains data constituting request patterns 164 for the resources of the computing system 100.
More particularly, the Request Identifier logs 162 include uniquely generated identifiers (e.g., e3bed508-3fd4-4862-17fc-8a35a2683bce) of requesters, time stamps (e.g., 2018-12-14 16:21 UTC) for when the requests were made, and source and target component names (e.g., server-resource, power-actuator). Request Identifier logs 162 as described here permit identification of who made the request, who received the request, and when it was made. Note that they do not identify the target component's REST Uniform Resource Identifier (“URI”); those are in the server logs 160. The system data logs 163 are logs that capture system and process level resource utilization data, e.g., how much central processing unit (“CPU”), memory, and disk input/output (“I/O”) is being used. The database logs 161 capture the following: the component running the database query (i.e., requesting access to the database), the time stamp when the query was run, the database query, and the duration of the query. The server logs 160 contain the time stamp, the target REST URI, status code (success/failure), response time, and size of response for a request. From this information, a compliance analyzer 165 can determine actual usage patterns indicating the performance of the computing system 100 from an architectural standpoint.
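As a hypothetical sketch of how these logs complement one another, the Request Identifier logs and server logs can be joined on the request identifier so that each correlated record carries both the caller identity and the target REST URI. The field names and values below are illustrative, not any particular log format:

```python
# Illustrative record shapes for the two log types described above; the
# exact fields will differ between implementations.
request_id_log = [
    # Who called whom, and when (no target URI here).
    {"id": "e3bed508-3fd4", "ts": "2018-12-14T16:21:05Z",
     "source": "server-resource", "target": "power-actuator"},
]
server_log = [
    # Target URI, status, and cost of each request (no caller identity here).
    {"id": "e3bed508-3fd4", "ts": "2018-12-14T16:21:05Z",
     "uri": "/rest/power/state", "status": 200,
     "response_ms": 42, "response_bytes": 512},
]

def join_logs(request_id_log, server_log):
    """Correlate the two logs on request id so each joined record carries
    both the caller identity and the target REST URI."""
    by_id = {rec["id"]: rec for rec in server_log}
    return [
        {**req, **by_id[req["id"]]}
        for req in request_id_log
        if req["id"] in by_id
    ]

joined = join_logs(request_id_log, server_log)
```

The joined records are the raw material from which the compliance analyzer 165 can derive per-component, per-URI usage patterns.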
The management resources 122, hosted on a server 124, may be used to manage the resources of the computing system 100. One common example of resource management, for instance, is what is known as “load balancing”. Load balancing attempts to balance the amount of processing across multiple processing resources so that no one processing resource is overburdened. Another form of management controls resources for service, repair, and/or replacement. So, for example, a particular database server 142 might need to be upgraded. The management resources 122 can be used to do this in a controlled fashion. There are many other types of resource management that will become apparent to those in the art having the benefit of this disclosure.
Some of the management resources 122 may be automated in the sense that they take action when triggered by predetermined conditions without the need for human intervention. Some may be manual in the sense that they take action only when directed by a human user. Still others may be automated with manual overrides. The management resources 122 will typically include a plurality of tools combining all of these approaches.
Each of the software components in the computing system 100 will include an Application Program Interface (“API”) that is not separately shown. The software components not only include the software components 130-132, but also the software running the servers 110, the service side logs 158, the compliance analyzer 165, a database of reports 170, a dashboard 175, and a plurality of expected resource usage models 180. In this context, an API is a set of protocols, routines, functions and/or commands that facilitate interaction between distinct systems. APIs act on behalf of the software component with which they are associated and, so, in discussion are subsumed in the concept of the software component. When a source component requests resource data from a target component that will service the request, the source and target components communicate through their respective APIs.
In this particular example, the administrative console 115 hosts a compliance analyzer 165, a plurality of reports 170, a dashboard 175, and expected resource usage models 180. Selected portions of the hardware and software architecture of the administrative console 115 are shown in
The processing resource 205 may be a processor, a processing chipset, or a group of processors depending upon the implementation of the administrative console 115. The memory 210 may include some combination of read-only memory (“ROM”) and random-access memory (“RAM”) implemented using, for instance, magnetic or optical memory resources such as magnetic disks and optical disks. Portions of the memory 210 may be removable. The communication system 220 may be any suitable implementation known to the art. In this example, the administrative console 115 is a stand-alone computing apparatus. Accordingly, the processing resource 205, the memory 210 and user interface 215 are all local to the administrative console 115. The communication system 220 is therefore a bus system and may be implemented using any suitable bus protocol.
The memory 210 is encoded with an operating system 225, user interface software 230, the reports 170, and the compliance analyzer 165. The user interface software (“UIS”) 230, in conjunction with a display 235, implements the user interface 215. The user interface 215 includes the dashboard 175 displayed on a display 235. The user interface 215 may also include other peripheral I/O devices such as a keypad or keyboard 245 and a mouse 250.
Also residing in the memory 210 of the administrative console 115 is a plurality of expected resource usage models 180. The design of the computing system 100 includes analysis of the expected usage of the resources. These resources may include, for example, the servers 110 and their residents such as the service 150, the database 155, and the service side logs 158. The expected usage ordinarily results from expected requests for use from other parts of the computing system 100, for instance, the software components 130-132. The expected usage can be formulated into expected resource usage models that can then later be used to examine actual resource usage in a manner described below.
Expected resource usage models 180 may more particularly be generated based on the design, as opposed to the operation, of the computing system 100 or, alternatively, from selected periods of operation. For REST APIs, the expected resource usage models 180 may be derived by correlating data in the server logs 160 and request identifier logs 162 over a period of time, for instance, an hour or a day. So, if there are 10,000 requests for a target URI that is essentially static (i.e., only one request would be expected), there is an issue. Expected usage models are maintained in the compliance analyzer 165 and contained in a table that looks like representative Table 1 below. Table 1 includes both database and REST API expected usage models for illustrative purposes. The first row contains a target URI whose response is invariant, i.e., never expected to change once set. The second and third rows are examples of data that change at the expected frequency.
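One possible encoding of such expected usage models, with hypothetical URIs, periods, and limits standing in for the entries of representative Table 1, is sketched below:

```python
# A sketch of how expected usage models might be encoded; the URIs,
# periods, and limits are hypothetical placeholders, not Table 1 itself.
EXPECTED_USAGE_MODELS = [
    # Invariant value: after the initial read, further requests are waste.
    {"target": "/rest/appliance/ip",     "period_s": None, "max_requests": 1},
    # Values that change at a known cadence may be re-read at that cadence.
    {"target": "/rest/appliance/health", "period_s": 300,  "max_requests": 1},
    {"target": "db:SELECT-alerts",       "period_s": 60,   "max_requests": 10},
]

def allowed_requests(model, observed_window_s):
    """Maximum compliant request count over an observation window."""
    if model["period_s"] is None:          # invariant once set
        return model["max_requests"]
    periods = max(1, observed_window_s // model["period_s"])
    return periods * model["max_requests"]

# Over an hour, the invariant IP URI allows 1 request, while the
# 5-minute health URI allows 12; 10,000 observed requests against
# either budget would be flagged as an issue.
```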
Referring collectively to
Furthermore, the compliance analyzer 165 need not be hosted on the same computing apparatus as is the dashboard 175. The compliance analyzer 165 may be hosted on some other resource of the computing system 100, such as the web server 140. In some examples, the compliance analyzer 165 itself may be distributed across multiple computing resources of the computing system 100. Still further, the compliance analyzer 165 is, in this example, a web service, but may be implemented as other types of software components, such as an application or a daemon, in other examples. In this example, the compliance analyzer 165 is furthermore implemented as a standalone web service that may be integrated into an automated testing infrastructure. Further, as noted above, the functionality may be distributed across the computing system 100 instead of being collected in a single piece of software.
Still referring to both
Referring collectively to
The compliance analyzer 165 may access the service side logs 158 directly. Alternatively, the content of the service side logs 158 may be dumped and their content accessed through the dumps (not shown). This approach may be useful in examples where the compliance analyzer 165 is housed in another location separate and apart from the rest of the computing system 100. In such examples, the dumps may be electronically transmitted to the compliance analyzer 165 over a public network.
The method 300 continues by evaluating (at 320) the performance for architectural compliance based on the accessed data by comparing request patterns against expected resource usage patterns in the architectural design to identify departures from the expected resource usage patterns. One form of architectural compliance is that the actual resource usage match the expected resource usage. The expected resource usage is captured in the expected resource usage models 180. The request patterns are reflected in the data captured by the service side logs 158, indexed by resource. The compliance analyzer 165 can therefore access the service side logs 158 to retrieve the request patterns 164 for usage of a particular resource and compare the request patterns 164 with the expected resource usage models 180. The comparison will reveal departures in the request patterns 164 from the expected resource usage models 180 that may then be identified as such. The evaluation (at 320) is a comparison between actual and expected usage and can include either ranges of values or fixed values, depending on the target component's REST URI or on whether the resource consumption is database query usage, as is described below.
The method 300 continues by publishing (at 330) the evaluation results with respect to the identified departures, the evaluation results including details of components, target resource uniform resource identifiers, frequency of usage, and infrastructure resource consumption. In the example illustrated in
Some examples will then take the evaluation results and identify the root causes of the identified departures. Once the root causes are identified in these examples, the identified departures may be mitigated in a manner improving architectural compliance. Table 2 shows some sample data gathered from scale tests using the method herein, including the identified root causes, in areas where no current approach to testing architectural compliance exists.
The compliance analyzer 165 calls on the server logs 160, Request Identifier logs 162, system data logs 163, and database logs 161 to evaluate the performance for architectural compliance based on the accessed data by comparing request patterns against the expected resource usage models in the architectural design and to identify departures from the expected resource usage models 180. The compliance analyzer 165, in this example, performs a method 440. Those in the art having the benefit of this disclosure will appreciate that, in practice, the evaluation will typically be more extensive and more involved. Consequently, the method 440 is instructive rather than comprehensive so as to further an understanding of the claimed subject matter.
The method 440 examines the activity between the source component 410 and the target component 420 as reflected in the server logs 160, Request Identifier logs 162, system data logs 163, and database logs 161. The compliance analyzer 165 evaluates whether, for instance, polling for static data (at 450), the polling rate (at 460), or volume of long running database queries (at 470) depart from the expected resource usage models 180. If any of these evaluations (at 450, 460, 470) detects a departure, the departure is published by communication through the dashboard 175 and inclusion in a report 170.
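A minimal sketch of the three evaluations of the method 440, with illustrative thresholds and record shapes that are assumptions rather than part of the described implementation, might look as follows:

```python
def check_static_polling(call_count):
    """Polling for static data (at 450): more than one fetch is a departure."""
    return call_count > 1

def check_polling_rate(polls_per_hour, changes_per_hour):
    """Polling rate (at 460): polling far faster than the data changes.
    The 10x factor is an illustrative threshold."""
    return changes_per_hour > 0 and polls_per_hour > 10 * changes_per_hour

def check_long_queries(query_durations_ms, limit_ms=1000, max_long=5):
    """Long-running queries (at 470): too many slow database queries."""
    return sum(1 for d in query_durations_ms if d > limit_ms) > max_long

def evaluate(activity):
    """Run all three checks and name the departures found."""
    departures = []
    if check_static_polling(activity["static_calls"]):
        departures.append("polling-static-data")
    if check_polling_rate(activity["polls_per_hour"],
                          activity["changes_per_hour"]):
        departures.append("excessive-polling-rate")
    if check_long_queries(activity["query_ms"]):
        departures.append("long-running-queries")
    return departures

# Hypothetical activity summary for one source/target pair.
activity = {"static_calls": 120, "polls_per_hour": 720,
            "changes_per_hour": 1, "query_ms": [40, 60, 35]}
found = evaluate(activity)
```

Each departure named by such an evaluation would then be published through the dashboard 175 and included in a report 170.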
Referring collectively to
The nature of the mitigation action will depend on the root cause. Frequently, mitigation will involve the addition or reallocation of infrastructure resources to alleviate the demands on existing resources that constitute departures from the expected resource usage models. However, the technique herein is not so limited, and other mitigation actions may be employed where appropriate to the identified root cause. For instance, various load balancing techniques can be used to manage processing resources, or resources may be replaced or upgraded.
The source component 910 may be any component of the computing system 900 requesting infrastructure resource consumption and the target component 920 may be any infrastructure resource that may be consumed responsive to a request. For example, in
The service side logs 158′ include Apache logs 940. The presence of the Apache logs 940 in this discussion implies that the target component 920 is a web server running Apache web server software. Apache web server software is open-source web server software available from the Apache Software Foundation and is commonly used in the art for hosting web pages and sites. Apache web server software can support multiple programming languages, server-side scripting, authentication, and virtualization.
The Apache logs 940 record events that occur on Apache web servers, including the target component 920 in this particular example. One example of an Apache log is an access log 942 that includes information about requests coming into, for instance, the target component 920. This information may include, for instance, what pages are requested, the success status of requests, and how long it took the target component 920 to respond to (or “service”) the request. Another example of an Apache log is an error log 944. The error log 944 may contain information about errors that the web server (e.g., the target component 920) encountered when processing requests, such as when files are missing or are otherwise inaccessible. Some aspects of the Apache logs 940 may be modified for any particular implementation. For instance, the Apache web server software permits modification of what events may be logged.
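For illustration, a line in the widely used Apache “combined” access log format can be parsed to recover the fields this technique relies on. The sample line below is fabricated, and the regular expression covers only the fields of interest:

```python
import re

# A fabricated line in Apache "combined" access log format.
LINE = ('10.0.0.5 - - [14/Dec/2018:16:21:05 +0000] '
        '"GET /rest/appliance/version HTTP/1.1" 200 512 '
        '"-" "server-resource/1.0"')

# Matches host, timestamp, method, URI, status, and response size.
ACCESS_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Extract the request fields needed for compliance analysis."""
    m = ACCESS_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    return rec

rec = parse_access_line(LINE)
```

Parsed records of this kind supply the target URI, status, and response size that the correlation steps described above operate on.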
The Apache logs 940 are disposed on a web server, although not necessarily the web server(s) for which they are logging. Their location within the computing system 900 will depend on a number of well known factors. One such factor is the type of operating system used by the computing system 900. The precise location of the Apache logs 940 within the computing system 900 is not material to the practice of that which is claimed below.
The Request Identifier logs 162′ include information associated with requests made in the computing system 900. This includes the identification of both the source and the target of requests. They therefore include information about both the source component 910 and the target component 920 when the source component 910 requests a service of the target component 920.
Still referring to
The evaluation may be a comparison of either ranges or fixed values. For REST invocations, for instance, a range is applicable when the target component 920 has an expected rate of change. If the invocation (polling) rate is far higher than the rate of change, then that becomes an architectural anti-pattern, or departure from the expected resource usage models. An example would be if the target REST URI's value is only expected to change once every 6 hours but the caller is polling every 5 seconds. For fixed values (e.g., the management station IP address), the target REST URI does not change once set, but if the source component 910 invokes that URI every 10 minutes, that is also an anti-pattern, or departure from the expected resource usage models.
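The range and fixed-value comparisons just described might be sketched as follows; the slack factor is an illustrative tuning assumption, not part of the described technique:

```python
def rest_polling_departure(poll_interval_s, change_interval_s=None,
                           slack=2.0):
    """Return True when a REST polling pattern is an anti-pattern.

    change_interval_s=None models a fixed value (set once, never
    changing): any repeated polling at all is a departure. Otherwise a
    range applies: polling `slack` times faster than the value changes
    is flagged.
    """
    if change_interval_s is None:
        # Fixed value: any recurring poll interval is a departure.
        return poll_interval_s is not None
    return poll_interval_s * slack < change_interval_s

# The example from the text: a value changing every 6 hours polled
# every 5 seconds is flagged; polling every 4 hours would not be.
flagged = rest_polling_departure(5, change_interval_s=6 * 3600)
ok = rest_polling_departure(4 * 3600, change_interval_s=6 * 3600)
```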
In
Turning now to
The source component 1010 may be any component of the computing system 1000 requesting access to the database 1020. For example, in
The compliance analyzer 165″ is then again used to find and highlight architectural compliance issues. The compliance analyzer 165″ takes input from the database logs 161″ to validate whether the request patterns found in the database logs 161″ meet the expected resource usage models 180″. If any patterns are suspicious or correlate with observed high system load, details of the components, target database queries, and frequency are logged in an architecture compliance dashboard from which reports can be generated for trend analysis and defect logging. The expected resource usage models 180″, combined with the URI metadata 1024 and the Request Identifier logs 162″, are analyzed by the compliance analyzer 165″ to find and flag issues that fall outside of the expected usage patterns. Reports 170″ can be generated to provide historical trends over time.
For database queries, the evaluation is a comparison that typically operates on a range of values. For instance, it may depend on what operations are underway at the time that may query the database 1020. The evaluation measures the proportionality of the database access requests relative to those operations. For instance, the evaluation may measure whether the database query volumes are commensurate with the user-level operation underway.
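A hypothetical proportionality check of the kind described above might be sketched as follows, with an assumed per-operation query budget standing in for the expected resource usage models:

```python
def query_proportionality(db_queries, user_operations, max_ratio=50):
    """Flag database query volume out of proportion to the user-level
    operations that could legitimately drive it.

    The max_ratio of 50 queries per operation is a hypothetical budget;
    real budgets would come from the expected resource usage models.
    """
    if user_operations == 0:
        # Queries with no driving operation are suspect by definition.
        return db_queries > 0
    return db_queries / user_operations > max_ratio

# 40,000 queries for 10 user operations (4,000 per operation) is
# flagged; 300 queries for the same 10 operations is not.
suspicious = query_proportionality(40_000, 10)
normal = query_proportionality(300, 10)
```

In practice the budget would vary by operation type, since different user-level operations legitimately generate very different query volumes.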
Thus, in the examples illustrated herein, components are implemented as distinct RESTful services, each serving a separate function. In this model, components interact with each other using REST over HTTP(S), which promotes loose coupling and modularity. There are no contractual obligations or enforcements around the frequency, mode, or nature of interactions between components.
However, as the components scale up in features and numbers and evolve over time, it is difficult to predict upfront which types of inter-component interactions are inefficient and cause performance issues that inhibit scale and the future evolvability of the application. The solution presented herein provides a set of tools and techniques that can be used to test and validate the architecture for architectural compliance issues (“anti-patterns”) as new features are added over time and to ensure there are no regressions in existing components as they evolve. The approach disclosed herein and claimed below is extensible and allows for the addition of new usage patterns as they are found.
The detailed description provided above is set in the context of the development phase for a new product or feature in a data center computing system. However, the technique disclosed herein is not limited to use in the development phase. The technique may be modified or adapted for use in, for example, a day-to-day operational environment. The manner in which such modifications or adaptations may be made will become apparent to those skilled in the art having the benefit of this disclosure.
This concludes the detailed description. The particular examples disclosed above are illustrative only, as examples described herein may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular examples disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the appended claims. Accordingly, the protection sought herein is as set forth in the claims below.