Application monitoring systems can operate to monitor applications that provide a service over the Internet. Typically, the administrator of the application being monitored provides specific information about the application to administrators of the monitoring system. The specific information indicates exactly what portions of the application to monitor. The specific information is static, in that it cannot be changed, and the monitoring system has no intelligence as to why it is monitoring a specific portion of a service. What is needed is an improved system for monitoring applications.
The present technology, roughly described, monitors an application and automatically models, correlates, and presents insights. The monitoring is performed without requiring administrators to manually identify what portions of the application should be monitored. The modeling and correlating are performed using a knowledge graph and automated modeling system that identifies system entities, builds the knowledge graph, and reports the most crucial insights, determined automatically, using a dashboard that automatically reports on the most relevant system data and status.
The present system is flexible in that it can be deployed in several different environments having different operating parameters and nomenclature. A system graph is created from the nodes and metrics of each environment application that make up a client system. The system graph, and the properties of entities within the graph, can be displayed through an interface to a user. Assertion rules are generated, both by default and after monitoring an application, and used to determine the status and health of a system. If assertion rules experience a failure, data regarding the failure is automatically reported. The system architecture may be reported through a dashboard that automatically provides insights regarding the system components and areas of concern.
In some instances, a method may automatically generate and apply assertions. The method can begin with receiving a first set of time series metrics with labels from one or more agents monitoring a client system in one or more computing environments. The method can continue with automatically applying a set of rules to the time series metrics. Next, a knowledge graph generated from the time series metrics can be automatically updated. One or more assertions can then be automatically generated based on the time series metrics and the result of applying the rules, wherein the results of applying the rules are used to update the knowledge graph. The assertions can automatically be reported through a user interface.
In some instances, a system for automatically generating and applying assertions can include a memory and a processor. One or more modules stored in the memory can be executed by the processor to receive a first set of time series metrics with labels from one or more agents monitoring a client system in one or more computing environments, automatically apply a set of rules to the time series metrics, automatically update a knowledge graph generated from the time series metrics, automatically generate one or more assertions based on the time series metrics and the result of applying the rules, the results of applying the rules used to update the knowledge graph, and automatically report the assertions through a user interface.
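For purposes of illustration only, a minimal sketch of this receive-apply-update-report flow is set forth below in Python. The names and structures (Metric, Assertion, apply_rules, update_knowledge_graph) are hypothetical and do not correspond to actual components of the described system.

# Illustrative sketch only; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    labels: dict
    value: float

@dataclass
class Assertion:
    entity: str
    rule: str
    failed: bool

def apply_rules(metrics, rules):
    # Apply each rule to each metric and record the result.
    results = []
    for m in metrics:
        for rule in rules:
            results.append(Assertion(entity=m.labels.get("pod", "unknown"),
                                     rule=rule["name"],
                                     failed=rule["check"](m)))
    return results

def update_knowledge_graph(graph, results):
    # Attach rule results to the entities they concern.
    for r in results:
        graph.setdefault(r.entity, []).append(r)
    return graph

rules = [{"name": "high_latency",
          "check": lambda m: m.name == "latency_ms" and m.value > 500}]
metrics = [Metric("latency_ms", {"pod": "auth-0"}, 750.0)]
graph = update_knowledge_graph({}, apply_rules(metrics, rules))
for entity, assertions in graph.items():
    for a in assertions:
        print(entity, a.rule, "FAILED" if a.failed else "ok")  # reported via a UI in practice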
The present system monitors an application and automatically models, correlates, and presents insights. The monitoring is performed without requiring administrators to manually identify what portions of the application should be monitored. The modeling and correlating are performed using a knowledge graph and automated modeling system that identifies system entities, builds the knowledge graph, and reports the most crucial insights, determined automatically, using a dashboard that automatically reports on the most relevant system data and status.
The present system is flexible in that it can be deployed in several different environments having different operating parameters and nomenclature. A system graph is created from the nodes and metrics of each environment application that make up a client system. The system graph, and the properties of entities within the graph, can be displayed through an interface to a user. Assertion rules are generated, both by default and after monitoring an application, and used to determine the status and health of a system. If assertion rules experience a failure, data regarding the failure is automatically reported. The system architecture may be reported through a dashboard that automatically provides insights regarding the system components and areas of concern.
Client application 118 may be implemented as one or more applications on one or more machines that implement a system to be monitored. The system may exist in one or more environments, for example environments 110, 120, and/or 130.
Agent 116 may be installed in one or more client applications within environment 110 to automatically monitor the client application, detect metrics and events associated with client application 118, and communicate with the system application 152 executing remotely on server 150. Agent 116 may detect new data (i.e., knowledge) about client application 118, aggregate the data, and store and transmit the aggregated data to server 150. Agent 116 may automatically perform the detection, aggregation, storage, and transmission based on one or more files, such as a rule configuration file. Agent 116 may be installed with an initial rule configuration file and may subsequently receive updated rule configuration files as the system automatically learns about the application being monitored.
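For purposes of illustration, a minimal agent sketch is set forth below in Python, assuming a JSON rule configuration file with a "targets" list and a hypothetical server endpoint; none of these names reflect the actual agent implementation.

# Hypothetical agent loop: collect, cache, and transmit per the rule config.
import json, time, urllib.request

def load_rule_config(path):
    # The rule configuration file tells the agent what to collect and when.
    with open(path) as f:
        return json.load(f)

def collect(targets):
    # Stand-in for scraping metrics and events from the monitored application.
    return [{"name": t, "value": 1.0, "ts": time.time()} for t in targets]

def run_agent(config_path, server_url, interval_s=60):
    config = load_rule_config(config_path)
    cache = []
    while True:
        cache.extend(collect(config.get("targets", [])))
        body = json.dumps(cache).encode()
        req = urllib.request.Request(server_url, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # transmit aggregated data to the server
        cache.clear()
        time.sleep(interval_s)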
Environment 120 may include a third-party cloud platform service 122 and a system monitoring and alert service 124, as well as client application 128. Agent 126 may execute on client application 128. The system monitoring and alert service 124, client application 128, and agent 126 may be similar to those of environment 110. In particular, agent 126 may monitor the third-party cloud platform service, application 128, and system monitoring and alert service, and report to application 152 on system server 150. The third-party cloud platform service may provide environment 120, including one or more servers, memory, nodes, and other aspects of a “cloud.”
Environment 130 may include client application 138 and agent 136, similar to environments 110 and 120. In particular, agent 136 may monitor the cloud components and client application 138, and report to application 152 on server 150. Environment 130 may also include a push gateway 132 and BB exporter 134 that communicate with agent 136. The push gateway and BB exporter may be used to process batch jobs or other specified functionality.
Network 140 may include one or more private networks, public networks, local area networks, wide-area networks, an intranet, the Internet, wireless networks, wired networks, cellular networks, plain old telephone service networks, and other networks suitable for communicating data. Network 140 may provide an infrastructure that allows agents 116, 126, and 136 to communicate with application 152.
Server 150 may include one or more servers that communicate with agents 116, 126, and 136 over network 140. Application 152 can be stored on and executed by a single server 150 or distributed over one or more servers. In some instances, application 152 may execute on one or more servers 150 in an environment provided by a cloud computing provider. Application 152 may receive data from agents 116, 126, and 136, process the data, and model, correlate, and present insights for the data based at least in part on assertion rules and a knowledge graph. Application 152 is described in more detail with respect to
Timeseries database 210 may reside within application 200 or be implemented as a separate application. In some instances, timeseries database 210 may be implemented on a machine other than server 150. Timeseries database 210 may receive timeseries metric data from agents 116-136 and store the timeseries data. Timeseries database 210 may also perform searches or queries against the data, insert new data, and retrieve data as requested by other modules or other components.
Rules manager 215 may update a rules configuration file that is maintained on server 150 and transmitted to one or more of agents 116, 126, and 136. The rules manager may maintain an up-to-date rules configuration file for a particular type of environment, provide the updated rules configuration file with agent modules being installed in a particular environment, and update rule configuration files for a particular agent based on the data and metrics that the agent is providing to application 152. In some instances, rules manager 215 may periodically query timeseries database 210 for new data or knowledge received by agent 116 as part of monitoring a particular client application. When rules manager 215 detects new data, the rule configuration file is updated to reflect the new data.
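A minimal sketch of this configuration-update behavior appears below, in Python; the refresh_rule_config function and the config structure are hypothetical.

# Hypothetical rules-manager step: add newly observed metrics to the config.
def refresh_rule_config(tsdb_metric_names, rule_config):
    # tsdb_metric_names: metric names currently present in the timeseries DB.
    known = set(rule_config.get("targets", []))
    new_names = [n for n in tsdb_metric_names if n not in known]
    if new_names:
        rule_config.setdefault("targets", []).extend(new_names)
        rule_config["version"] = rule_config.get("version", 0) + 1
    return rule_config, new_names

config = {"targets": ["latency_ms"], "version": 1}
config, added = refresh_rule_config(["latency_ms", "error_total"], config)
print(added, config["version"])  # ['error_total'] 2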
Alert manager 220 manages alerts for application 152. In some instances, if an assertion rule failure occurs, alert manager 220 may generate failure information for the particular node or entity associated with the failure. The failure may be indicated in the call graph, as well as in a dashboard provided by UI manager 250. In some instances, the alert manager generates a failure that is depicted as a red or yellow ring, based on the severity of the failure, around the node or entity for which the failure is detected. Alert manager 220 can also create alerts for display on a dashboard provided by UI manager 250 and for communication to an administrator.
Assertion detection engine 225 can define assertion rules and evaluate the rules against timeseries data within the database 210. The assertion detection engine 225 applies rules to metric data for a particular system and identifies portions of the data that fail the rules. The failures are then recorded in the graph as attachments to entities. The assertion rule definitions may include saturation of a resource, anomalies, changes to code (amendments), failures and faults, and KPIs such as error ratio or error budget.
Assertion rules can be generated in several ways. In some instances, rules are generated automatically based on metrics. For instance, the assertion engine 225 may determine a particular rate of requests over a time period, and generate rules based on a baseline observed during that time period. For example, the assertion engine may observe that three errors occur in two minutes, and use that as a baseline. As time goes on while monitoring the system, the baselines may be updated over larger periods of time, and additional baselines may be determined (e.g., short-term and long-term baselines). Some of the rules determined automatically include connections over time, output bytes, input bytes, latency totals, and error totals.
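A minimal sketch of such baseline-derived rule generation appears below, in Python; the BaselineRule class is hypothetical and stands in for whatever representation the assertion engine actually uses.

# Hypothetical baseline rule: learn an observed error count per window,
# then assert when later windows meaningfully exceed that baseline.
class BaselineRule:
    def __init__(self, window_s=120, tolerance=2.0):
        self.window_s = window_s      # e.g. a two-minute observation window
        self.tolerance = tolerance    # how far above baseline counts as a failure
        self.baseline = None

    def learn(self, error_timestamps):
        # The count of errors seen in the observation window becomes the baseline.
        self.baseline = len(error_timestamps)

    def check(self, error_timestamps):
        if self.baseline is None:
            return False  # no baseline yet; nothing to assert
        return len(error_timestamps) > self.baseline * self.tolerance

rule = BaselineRule()
rule.learn([10, 55, 100])  # three errors observed in the window
print(rule.check([5, 20, 40, 60, 75, 90, 110]))  # 7 > 3 * 2.0 -> True

Longer-term baselines could be maintained the same way over larger windows, as described above.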
Some assertions may be determined automatically based on assertion rules with failures. For example, if assertion detection engine 225 determines that a particular pod in a Kubernetes system executes a function with a very long response time that amounts to an anomaly, an assertion rule may be automatically generated for the particular pod in the particular system and for the particular metric. The assertion rule may be automatically generated by the rules manager, for example in response to receiving an alert regarding the pod response time anomaly from the alert manager.
When rules are triggered, a call is placed to the assertion engine by the rules manager. The assertion engine can then process the rules, identify the assertion rules that experience any failures, and update the entity/knowledge graph accordingly to reflect the failures. The knowledge graph can be updated, for example, to indicate that one or more components of a node have experienced a particular failure during a particular period of time for a particular client system.
Model builder 230 may build and maintain a model, in the form of a knowledge graph, of the system being monitored by one or more agents. The model built by model builder 230 may indicate system nodes, pods, services, relationships between nodes, node and pod properties, system properties, and other data. Model builder 230 may continually update the model based on data received from timeseries database 210, including the status of each component with respect to application of one or more assertion rules for each component. For example, model builder 230 can scan, periodically or based on some other event, time-series metrics and their labels to discover new entities and relationships, and update existing entities along with their properties and status. A searchable knowledge index may be generated from the knowledge graph generated by the model builder, enabling queries on the scanned data and generation and viewing of snapshots of the entities, relationships, and their status in the present and in arbitrary time windows at different points in time. In some embodiments, schema .yml files can be used to describe entities and relationships for the model builder.
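A minimal sketch of label-driven entity discovery appears below, in Python; the label names, the RUNS_ON relationship type, and the discover_entities function are hypothetical.

# Hypothetical model-builder scan: derive entities and relationships
# from time-series metric labels.
def discover_entities(samples):
    # samples: iterable of (metric_name, labels) pairs.
    entities, relationships = {}, []
    for _, labels in samples:
        for ent_type in ("pod", "node", "namespace", "service"):
            if ent_type in labels:
                entities[(ent_type, labels[ent_type])] = {"type": ent_type}
        if "pod" in labels and "node" in labels:
            relationships.append((labels["pod"], "RUNS_ON", labels["node"]))
    return entities, relationships

samples = [("cpu_usage", {"pod": "auth-0", "node": "n1", "namespace": "prod"})]
ents, rels = discover_entities(samples)
print(sorted(ents), rels)  # discovered entities and a pod->node relationship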
An example of model schema snippets, for purposes of illustration, is below:
type: CALLS
Knowledge graph 225 (cloud knowledge graph) may be built based on the model generated by model builder 230. In particular, the cloud knowledge graph can specify node types, relationships, and properties for nodes in a system being monitored by agents 116-136. The cloud knowledge graph is constructed automatically based on data written to the timeseries database and the model built by model builder 230.
A knowledge index 240 may be generated as a searchable index of the cloud knowledge graph. The knowledge index is automatically built from the graph, and creates new expressions dynamically from templates in response to a new domain or error detection. Searchable entities within the knowledge index include pods, services, nodes, service instances, Kafka topics, Kubernetes entities, Kubernetes services, namespaces, node groups, and other aspects of a system being monitored and the associated knowledge graph. The cloud knowledge index includes relationships and nodes associated with search terms. When a search is requested by a user of the system, the cloud knowledge index is used to determine the entities for which data should be provided in response to the search.
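A minimal sketch of such an index appears below, in Python; build_index and search are hypothetical names, and the index here is a simple inverted map from tokens to graph entities.

# Hypothetical knowledge index: token -> set of (entity type, entity name).
def build_index(entities):
    index = {}
    for (ent_type, name), props in entities.items():
        for token in {ent_type, name, *map(str, props.values())}:
            index.setdefault(token.lower(), set()).add((ent_type, name))
    return index

def search(index, term):
    return index.get(term.lower(), set())

entities = {("pod", "auth-0"): {"namespace": "prod"},
            ("service", "auth"): {"namespace": "prod"}}
index = build_index(entities)
print(search(index, "prod"))  # both entities share the 'prod' namespace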
Knowledge bot 235 may detect new data in timeseries database 210. The new knowledge, such as new metrics, event data, or other timeseries data, may be provided to rules manager 215, model builder 230, and other modules. In some instances, the knowledge bot scrapes cloud providers for the most up-to-date data for static components, and connects that data to scraped data in order to build insights from the connected data. In some instances, knowledge bot 235 may be implemented within timeseries database 210. In some instances, knowledge bot 235 may be implemented as its own module or as part of another module.
GUI manager 250 may manage a graphical user interface provided to a user. The GUI may reflect the cloud knowledge graph; provide assertions and current status, timelines, and lists of nodes within a system; and may include system nodes, node relationships, node properties, and other data, as well as one or more dashboards for data requested by a user. Examples of interfaces provided by GUI manager 250 are discussed with respect to
Metric, label, and event data can be captured, aggregated, and transmitted to a remote application timeseries database at step 315. The metric, label, and event data can be retrieved periodically at a client machine based on the rule configuration file. Retrieving metric, label, and event data may include an agent accessing rules and retrieving the data from a client application or environment based on the received rules. In some instances, the agent may automatically transform or rewrite the existing metric and label data into a specified nomenclature, which allows the metrics to be aggregated and reported more easily. The data may be aggregated and cached locally by the agent until it is transmitted to application 152 to be stored in a timeseries database. The caching, and the time at which the data is transmitted, are set forth in the rule configuration file.
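A minimal sketch of the rewrite step appears below, in Python; the alias table is hypothetical and simply maps environment-specific label names onto a common nomenclature.

# Hypothetical label normalization so metrics from different
# environments aggregate under the same names.
LABEL_ALIASES = {"k8s_pod_name": "pod", "instance_id": "node", "ns": "namespace"}

def normalize_labels(labels):
    return {LABEL_ALIASES.get(k, k): v for k, v in labels.items()}

raw = {"k8s_pod_name": "auth-0", "ns": "prod", "region": "us-east-1"}
print(normalize_labels(raw))
# {'pod': 'auth-0', 'namespace': 'prod', 'region': 'us-east-1'}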
The timeseries database receives and stores the timeseries metric data sent by the remote agent at step 320. Labels are retrieved from the timeseries metric data at the application by the server at step 325, and the label data is stored at step 330. Unknown metric data may be mapped to known labels at step 335, and new entities may be identified at step 340. A knowledge graph is dynamically and automatically created and updated at step 345. A search index based on the knowledge graph is then automatically built and updated at step 350.
More details for installing an agent, collecting data, transmitting data by an agent to a remote application, and building a knowledge graph are discussed with respect to U.S. patent application Ser. No. ______, titled “XX,” filed on Apr. _, 2021, the disclosure of which is incorporated herein by reference.
Assertion rules that have failed based on the timeseries metric data are identified at step 415. For example, if a particular memory allocation has been saturated, this would result in a failure of the corresponding assertion rule. This resource saturation failure would be identified at step 415.
A rules manager calls an alert manager with assertion rule failure information at step 420. For each rule failure, alert data is created by the alert manager at step 420. The alert may include an update or additional data to include in a knowledge graph, graphics to include in a dashboard, a notification to transmit to an administrator, or some other implementation of an alert. The alert manager generates alerts for a knowledge graph and places calls to an assertion manager at step 425. The assertion manager attaches a structure regarding the failure to the detected alert and updates the knowledge graph at step 430. Next, insights are automatically generated based on particular events at step 435. The insights may include failures and important status information for portions of the system that fail one or more assertion model rules for saturation, an anomaly, amendments, failures and faults, error ratio, error budget, and other KPI elements.
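A minimal sketch of attaching a failure structure to a graph entity appears below, in Python; on_rule_failure and the alert dictionary shape are hypothetical.

# Hypothetical failure handling: create alert data for a rule failure
# and attach it to the corresponding knowledge-graph entity.
import time

def on_rule_failure(graph, entity, rule_name, severity):
    alert = {"entity": entity, "rule": rule_name,
             "severity": severity, "ts": time.time()}
    node = graph.setdefault(entity, {"failures": []})
    node["failures"].append(alert)  # failure structure attached to the entity
    return alert

graph = {}
on_rule_failure(graph, "pod/auth-0", "memory_saturation", "critical")
print(graph["pod/auth-0"]["failures"][0]["rule"])  # memory_saturation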
An amend assertion rule is applied to metric timeseries data at step 520. An amend assertion rule can be applied to amendments or changes to code, such as updated code, replacement code, or other changes to code. A failure and fault assertion rule may be applied to metric timeseries data at step 525. The failures and faults may relate to failures and faults that are triggered during code execution.
Error ratio and error budget assertion rules may be applied to metric timeseries data at step 530. Error ratio and error budget are examples of key performance indicators (“KPIs”) that may be tracked for a particular system. Assertion rules may be generated for other KPIs as well.
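A minimal sketch of an error-ratio assertion appears below, in Python; the function name and the 0.1% budget are hypothetical.

# Hypothetical KPI rule: fail when the error ratio exceeds the error budget.
def error_ratio_rule(errors, requests, budget=0.001):
    ratio = errors / requests if requests else 0.0
    return {"ratio": ratio, "budget": budget, "failed": ratio > budget}

print(error_ratio_rule(errors=12, requests=10_000))
# {'ratio': 0.0012, 'budget': 0.001, 'failed': True}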
A selection of an entity displayed in a graph may be received at step 625. Additional detail may then be provided for the selected entity at step 630. Additional detail may include other nodes, pods, or other components which comprise the selected entity or have relationships with the selected entity. In some instances, additional detail may also include properties or other data associated with a selected entity.
A query may be received for a specific portion of a graph at step 635. In some instances, an administrator may only wish to view a particular node, a particular type of node, or some other subset of the set of nodes within a system. In response to receiving the query, the system may provide the queried graph portion, as well as additional details such as properties, in a dashboard at step 640.
The list 715 includes information for multiple entities, including an indication that each entity is a service, the service name, and a graphical icon indicating the status.
Each icon representing an entity or service provides an inner icon surrounded by status indicators. The inner icon may be implemented as a circle or some other icon or graphical component. The status indicators may be provided as one or more rings, wherein each ring represents one or more entities or subcomponents and their status. When a subcomponent is associated with one or more failures, the status indicator for that subcomponent may visually indicate the failure, for example by having a color of red. When a subcomponent is associated with a near failure, the status indicator for that subcomponent may be yellow. When a subcomponent is operating as expected with no failures, the status indicator for that subcomponent may be gray, green, or some other color. In some instances, icons for a particular entity having multiple subcomponents may have multiple rings.
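A minimal sketch of this status-to-color mapping appears below, in Python; the function and its inputs are hypothetical.

# Hypothetical ring coloring based on subcomponent assertion status.
def ring_color(failures, near_failures):
    if failures:
        return "red"     # one or more assertion failures
    if near_failures:
        return "yellow"  # close to violating an assertion rule
    return "green"       # operating as expected (could also be gray)

print(ring_color(failures=0, near_failures=2))  # yellow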
Within graph portion 730, nodes 735, 740, and 745 are all represented amongst other nodes. Each node includes a center icon and one or more status indicator rings. Each node also includes at least one relationship connector 750 between that node and other nodes. For example, node 740 includes at least one yellow status indicator ring and node 745 includes at least one red status indicator ring.
The metric window for the selected entity includes parameter data that is selected by the user. The parameter data 840 indicates user selections for workload “auth”, job “auth”, request type “all”, and error type “all.” The metrics provided for the selected entity may be displayed based on several types of parameters, such as those shown in parameter bar 840, as well as filters. Different parameters and filters may be used to modify the display of metrics for the selected entity.
The selected entity, as illustrated by entity name 820, includes displayed metrics of requests per minute window 825, average latency window 830, errors per minute window 835, CPU percentage window 845, memory percentage window 850, network received window 855, and network transmitted window 860. For each displayed metric 825-860, the status of the metric with respect to an assertion rule is indicated by the color of the data within the particular window. For example, errors per minute, CPU percentage, and memory percentage are green, indicating the values of those metrics are good. The colors for the requests per minute metric and average latency metric are yellow, indicating they are close to violating an assertion rule. The network received metric and network transmitted metric are both colored red, indicating the timeseries data for these metrics violates the corresponding assertion rules.
The components shown in
Mass storage device 1530, which may be implemented with a magnetic disk drive, an optical disk drive, a flash drive, or other device, is a non-volatile storage device for storing data and instructions for use by processor unit 1510. Mass storage device 1530 can store the system software for implementing embodiments of the present invention for purposes of loading that software into main memory 1520.
Portable storage device 1540 operates in conjunction with a portable non-volatile storage medium, such as a floppy disk, compact disk, digital video disc, USB drive, memory card or stick, or other portable or removable memory, to input and output data and code to and from the computer system 1500 of
Input devices 1560 provide a portion of a user interface. Input devices 1560 may include an alpha-numeric keypad, such as a keyboard, for inputting alpha-numeric and other information, a pointing device such as a mouse, a trackball, stylus, cursor direction keys, microphone, touch-screen, accelerometer, and other input devices. Additionally, the system 1500 as shown in
Display system 1570 may include a liquid crystal display (LCD) or other suitable display device. Display system 1570 receives textual and graphical information and processes the information for output to the display device. Display system 1570 may also receive input as a touch-screen.
Peripherals 1580 may include any type of computer support device to add additional functionality to the computer system. For example, peripheral device(s) 1580 may include a modem or a router, printer, and other device.
The system 1500 may also include, in some implementations, antennas, radio transmitters, and radio receivers 1590. The antennas and radios may be implemented in devices such as smart phones, tablets, and other devices that may communicate wirelessly. The one or more antennas may operate at one or more radio frequencies suitable to send and receive data over cellular networks, Wi-Fi networks, commercial device networks such as Bluetooth devices, and other radio frequency networks. The devices may include one or more radio transmitters and receivers for processing signals sent and received using the antennas.
The components contained in the computer system 1500 of
The foregoing detailed description of the technology herein has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the technology to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. The described embodiments were chosen to best explain the principles of the technology and its practical application to thereby enable others skilled in the art to best utilize the technology in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the technology be defined by the claims appended hereto.
The present application is a continuation-in-part of patent application Ser. No. 17/339,985, filed on Jun. 5, 2021, titled “AUTOMATICALLY GENERATING AN APPLICATION KNOWLEDGE GRAPH,” which claims the priority benefit of U.S. provisional patent application 63/144,982, filed on Feb. 3, 2021, titled “AUTOMATICALLY GENERATING AN APPLICATION KNOWLEDGE GRAPH,” the disclosures of which are incorporated herein by reference.
Number | Date | Country
--- | --- | ---
63144982 | Feb. 3, 2021 | US

Number | Date | Country
--- | --- | ---
Parent 17339985 | Jun. 5, 2021 | US
Child 17339988 | | US