1. Technical Field
The disclosed embodiments relate generally to Service-Oriented Architecture system management and, more specifically, to a Service Performance Manager software platform with conditional Service Level Agreement and issue mitigation and autoprotection features.
2. Background
Service oriented architecture (SOA) is rapidly being adopted and deployed by organizations of all sizes across all industries. With the focus and attention squarely on implementing SOAs, organizations have generally paid little attention to monitoring and managing their SOAs to ensure that service levels are maintained and efficiencies increased.
The disclosed Service Performance Manager (SPM) is an enterprise software platform that monitors and proactively manages the health and performance of both individual and grouped services based on Service Level Agreements (SLAs). The SPM provides enhanced visibility of running services, allows for automatic deployment of extra service instances in order to meet load spikes, and helps ensure that SLAs are not violated during unexpected spikes. The SPM also allows for rules to monitor service performance, service availability, and service usage. The SPM provides IT and operations managers with better visibility and control over their IT and business services. The SPM predicts and solves potential customer-related issues before customers are aware of them, enabling an organization to meet quality-of-service objectives. Unlike other software platforms, the disclosed SPM automatically optimizes resources, services and SLAs with finer granularity and precision, while remaining steadfastly vendor neutral, allowing the SPM to manage many different applications and Service-Oriented Architecture (SOA) platforms substantially simultaneously. The disclosed SPM allows a user to monitor and manage the performance of individual or grouped services, and provides visibility in service monitoring from both a technical and a business perspective.
Embodiments are illustrated by way of example in the accompanying figures, in which like reference numbers indicate similar parts, and in which:
Service Performance Management is the ability to monitor and measure the observable behavior of individual or grouped services, and to implement changes (reactively or proactively) to their behavior based on a defined set of rules. Observable behavior may include system performance, availability, usage, faults, and payload.
The disclosed Service Performance Management system is a software platform that maintains and automatically manages the health and performance of the observable behavior of individual or grouped services, while additionally managing business payload. In an embodiment, the SPM maintains and manages the health and performance of the observable behavior of IT services. In another embodiment, the SPM maintains and manages the health and performance of the observable behavior of business services. The SPM may be used to design, plan and monitor services based on business needs. The SPM may also be used to balance service levels against costs. In addition, the SPM may be used to achieve and enforce measurable levels of service and reduce the likelihood of unpredictable demands. The SPM may dramatically improve relationships between service providers and customers. Disclosed embodiments of the SPM include properties that feature obligation-bound service level agreements (SLAs), and patterns for recognizing component misbehavior.
The SPM may use policy management techniques to distribute listeners and associated policies and also to gather performance information. With the combination of complex event processing, rules, policies, and Java Management Extensions (JMX) control interfaces, the SPM allows a user to create substantially any reaction scenario to service level exceptions or anomalies.
The SPM allows users to monitor deployed service artifacts through use of a distributed monitoring and instrumentation framework. In an embodiment, the user may monitor deployed service artifacts through use of a dashboard to track metrics from a service perspective, independent of the deployment infrastructure.
In an embodiment, the SPM may be added to an existing SOA infrastructure. The SPM may be added to a variety of technologies and architectures.
The SPM may provide autonomic capability to an SOA fabric, including using SLAs in conjunction with monitoring, providing proactive and reactive alerting on threshold violations or impending violations, and providing assurance (both self-healing and self-optimizing) where possible in both usage and performance.
In an embodiment, the SPM provides for wizard based creation of SLAs and rules.
The SPM not only provides users with substantially instant visibility into their running services, but also allows them to set up automatic deployment of extra service instances in order to meet load spikes. This may ensure that service level agreements are not violated during the unexpected peaks, and may allow users to set up rules to monitor service metrics including, but not limited to, system performance, availability, and usage. If an incident or violation occurs, it may be handled through an alert on the user interface or dashboard or through email. In an embodiment, a business process management (BPM) or customer relationship management (CRM) workflow may be initiated.
The SPM not only helps monitor services, but may also assist in managing those services. The SPM allows the user to monitor the key performance indicators in a business process, analyze the performance, check the behavioral pattern, and take corrective actions in proactive and predictive ways to manage and run the business successfully. Based on past performance, the user may predict future performance, identify bottlenecks, and take corrective actions for better performance. In certain scenarios, the user may be proactive and set up rules to trigger actions if certain conditions are met or if certain rules are violated, thus providing a level of assurance to the user.
Rule libraries may be created using the SPM, in which simple or complex rules may be defined on some service metrics. These rules may internally trigger one or more types of actions if the conditions defined in the rules are met. The action library may store exemplary actions such as sending an alert, invoking a script or a service, or logging an event. Some rules may be run on recurring schedules such as, for example, every day at 2 PM, on all weekdays, or during peak hours. Standard schedules may be defined in the schedule library, which may be used to trigger actions at a specified time based on a corresponding rule.
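By way of illustration only, the relationship between a rule, a condition on a service metric, and entries in an action library may be sketched in Python as follows. The names Rule, ACTION_LIBRARY, and evaluate are hypothetical and are not taken from the disclosed SPM; schedules are omitted for brevity, and the rule is assumed to fire when a metric exceeds a threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical action library: maps an action name to a callable.
# Each callable here only formats a string; a real action might send
# an email, invoke a script, or write to a log.
ACTION_LIBRARY: Dict[str, Callable[[str], str]] = {
    "send_alert": lambda msg: f"ALERT: {msg}",
    "log_event": lambda msg: f"LOG: {msg}",
}


@dataclass
class Rule:
    name: str
    metric: str                 # e.g. "latency_ms" (assumed metric name)
    threshold: float            # assumed semantics: fire when value > threshold
    actions: List[str] = field(default_factory=list)


def evaluate(rule: Rule, metrics: Dict[str, float]) -> List[str]:
    """Return the output of every action triggered by the rule, if any."""
    value = metrics.get(rule.metric)
    if value is None or value <= rule.threshold:
        return []
    message = f"{rule.name}: {rule.metric}={value} exceeds {rule.threshold}"
    return [ACTION_LIBRARY[action](message) for action in rule.actions]
```

In this sketch a rule refers to actions by name, so that the same library entry may be shared by many rules.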
In an embodiment, the SPM provides low cost of administration through centralized management and self-managing protocols, ensuring better compliance and SOA governance. In another embodiment, more efficient operations management and quality control are achieved. The SPM may allow for easier measurement and determination of SLAs. In an embodiment, the addition of SPM for end to end enterprise infrastructure monitoring and managing provides the ability to predict and respond to a myriad of business services and events.
In a typical business scenario, there are service providers and service consumers. Irrespective of the user's role as a service provider or service consumer, the SPM may be used for monitoring and managing the business services.
In an embodiment, if the guaranteed availability of an external service, such as the credit check service presented in
In an embodiment, a service provider may wish to ensure a guaranteed response to all requests within a time specified in an SLA. For example, if a consumer overloads the system by sending an abnormally large number of requests or requests with faulty payloads, a service provider may choose to take corrective actions to keep the system load under control. Such corrective actions may include blocking further requests so that the entire system does not become impaired, or alerting other parties. To repair the faulty or overloaded services, the system administrator may choose to assign additional grid computing resources, reallocate existing resources, or select which requests to process.
In the discovering services step 310, the SPM may check for all the services running in a single or multiple environments. These services may be individual or grouped services such as service assemblies or service units. The SPM may also check for service dependencies, composite services, and service references. The SPM may also check for SLAs defined on each service and party and thresholds defined for each service.
Once the services and the consumer and provider parties for those services are identified, the next step is to measure observables 320, or measure the metrics values. Some of the measurable metrics may include service metrics, infrastructure metrics, and business metrics from payload. Service metrics may include throughput, latency, request size, faults, and availability signals; infrastructure metrics may include capacity, memory, and information about the central processing unit (CPU); and business metrics from payload may include user identity or role, source, and transaction value. In an embodiment, business metrics may be extracted directly from the content or envelope of a request. For example, the user identity, the origin of the request, or a transaction amount in Dollars or Euros may be used to associate a value by which to prioritize a request. Metrics about the physical deployment architecture may also be gathered through JMX instruments.
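As a non-limiting sketch, extraction of business metrics from a request and their use to prioritize requests might look as follows in Python. The request layout (a dictionary with "source" and "payload" keys), the role weights, and the function names are assumptions made for illustration and are not specified by this disclosure.

```python
from typing import Any, Dict, Tuple

# Assumed weights for customer roles; higher means more important.
ROLE_WEIGHT = {"platinum": 3, "gold": 2, "silver": 1}


def extract_business_metrics(request: Dict[str, Any]) -> Dict[str, Any]:
    """Pull user role, source, and transaction value out of a request."""
    payload = request.get("payload", {})
    return {
        "role": payload.get("role", "unknown"),
        "source": request.get("source", "unknown"),
        "amount": float(payload.get("amount", 0.0)),
    }


def priority(request: Dict[str, Any]) -> Tuple[int, float]:
    """Sort key: role weight first, then transaction amount."""
    metrics = extract_business_metrics(request)
    return (ROLE_WEIGHT.get(metrics["role"], 0), metrics["amount"])


# Example: order pending requests from highest to lowest priority.
pending = [
    {"source": "web", "payload": {"role": "silver", "amount": 900}},
    {"source": "api", "payload": {"role": "gold", "amount": 50}},
    {"source": "api", "payload": {"role": "gold", "amount": 200}},
]
ordered = sorted(pending, key=priority, reverse=True)
```

Ranking by role before amount is one possible policy; an embodiment could equally weight the transaction value more heavily.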
After the metrics and their values are gathered over a time period, the data may be analyzed and the future data requirement may be predicted in the analyze and predict behavior step 330. The data may be analyzed by computation and aggregation. Certain behavioral patterns may be identified which may help to predict the future data requirement. A statistical and time-based analysis may be performed in which the average, minimum, and maximum values are calculated in addition to the values for the moving time frame window, and the values for the last hour, day, week, or month. An infrastructure aggregate calculation may be performed in which the metrics value by node and metric value by container are calculated. A functional aggregate calculation may be performed in which the metrics value by service assembly and metrics value by service unit are calculated. A business aggregate calculation may be performed in which the metrics value by client and metrics value by amount are calculated. Finally, a customer-based aggregate analysis may be performed in which metric values by customer role (e.g. gold, silver, and platinum) are derived and aggregated.
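For illustration, the statistical analysis over a moving time frame window and the aggregate calculations by node, service unit, or customer role may be sketched in Python as below. The class and function names and the sample layout are hypothetical; the window is assumed to be defined by a fixed number of most recent samples rather than by wall-clock time.

```python
from collections import defaultdict, deque
from statistics import mean
from typing import Any, Dict, Iterable


class MovingWindow:
    """Keeps the most recent `size` samples of a single metric."""

    def __init__(self, size: int) -> None:
        self.values: deque = deque(maxlen=size)

    def add(self, value: float) -> None:
        self.values.append(value)  # oldest sample is dropped when full

    def summary(self) -> Dict[str, float]:
        # Requires at least one sample; mean() raises on an empty window.
        return {
            "avg": mean(self.values),
            "min": min(self.values),
            "max": max(self.values),
        }


def aggregate_by(samples: Iterable[Dict[str, Any]], key: str) -> Dict[Any, float]:
    """Average the "value" field of samples grouped by `key`
    (for example "node", "service_unit", or "role")."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["value"])
    return {group: mean(values) for group, values in groups.items()}
```

The same aggregate_by helper covers the infrastructure, functional, business, and customer-based aggregates by changing only the grouping key.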
The next step is the analyzing and predicting behavior step 330. Any of these metrics may be displayed in a web-based dashboard which may contain some pre-defined views. In an embodiment, these metrics may provide real-time values by fetching data every minute and updating the values of the metrics. Various views may be configured to monitor performance at various levels such as environment, machine, node, service assembly, and service units. The dashboard may be personalized as necessary for a particular business's need to get real-time updates including, but not limited to, service availability, service usage, service faults, and business payload.
To monitor services using the disclosed SPM, rule packages and rules may be defined, and target objects may be selected to apply the rules. In an embodiment, these objects are called referenced target objects. Additionally, conditions may be set on the default metrics available for the selected target objects, schedules may be created to run the rule at the scheduled time, and actions may be defined and associated with rules for managing the service performance. The actions may be a default action or a custom action. When a rule is enabled, the system may start monitoring all the referenced target objects for the specified set of conditions defined in the rule. When the metrics value reaches the threshold condition, the rule is triggered, which in turn initiates an action to manage the performance within the specified limits. In an embodiment, based on the SLA between service consumers and providers, a set of rules may be defined. These rules are able to be monitored as well as customized, which helps both the consumer and provider to track the service execution and adhere to service level business agreements.
Threshold conditions may be defined on metric values and rules may be set based on the metrics. When these threshold levels are reached or conditions defined in a rule are met, one or more alerts or actions may be triggered 350. In an embodiment, if there are any violations in the SLA, alerts are sent. Alerts may be displayed in the dashboard as visual indicators. At times, these alerts may internally trigger certain actions including, but not limited to, running a script, logging an event, or sending a mail notification.
In an embodiment, in addition to alerts, certain corrective actions may also be set to execute in a rule. When the conditions in a rule are met, these corrective actions may automatically be executed, which may help business continuity. Some of the corrective actions may include automatic resource allocation, starting a node, or incident management.
In an embodiment, a high-level overview of the major steps involved in implementing an SPM in a business includes identifying technical requirements, configuring the system and monitoring the performance, and managing the system.
The architecture of the disclosed SPM may contain groups including, but not limited to, a user interface plugged into an administrator of a service oriented architecture service platform, back-end web services integrated into a service oriented architecture service platform, and system services such as rules service and action service deployed into a service oriented architecture service platform foundation. In an embodiment, the disclosed SPM may be integrated into a TIBCO ActiveMatrix® service platform.
The SPM system services typically run on an isolated SPM system environment 750 on one or multiple specially provisioned nodes and hardware. In an embodiment, all services specific to the SPM are hosted on a node named “spmnode” in a separate “spmenv” environment. In an embodiment, the “spmenv” environment is kept separate, and not used for any business services. The SPM system services may include, among other services, a rules service, an action manager service, a standard action service, and an alert server. A rules service may collect and aggregate basic and custom metrics, may translate and deploy SPM rules, and may send rule triggers or clear messages to an action manager service. An action manager service may handle rule actions, for example sending an alert, invoking a service, or writing a log entry, on either rule triggers or clear messages, and on an assurance 790 such as blocking further requests or provisioning new computing resources. The action manager service may generate messages using templates for alerts. A standard action service may deploy services on additional existing nodes, deploy services on additional nodes by provisioning a new node, invoke scripts on a machine, generate Simple Network Management Protocol (SNMP) asynchronous notification messages or “traps,” and support engine control methods of integration software for service oriented architecture service platforms. Actions are distributed back to the nodes for execution through a Management Bus 740. An alert server allows a user to specify email format (e.g., text or HTML) and email delivery method (e.g., digest mode).
In an embodiment, integration software for SOA service platforms includes TIBCO BusinessWorks™.
A user interface (UI) of the SPM is plugged into an administrator of an SOA service platform. The user interface includes a perspective to build and configure rules as well as a perspective for viewing and managing dashboards including, but not limited to, a monitoring dashboard 710 and an SLA dashboard 720. Additionally, the UI may support monitoring custom metrics, including defining a custom metric to monitor and manage performance of any service. Real time updates of the performance measurements and alerts are distributed to the dashboard through a real time messaging bus, or dashboard bus, 730.
A command line interface (CLI) (not shown) supports substantially all actions performed from the UI. The CLI may also support defining alerting templates and using them for email notifications. The web services to support the SPM UI and CLI may be plugged into a service oriented architecture service platform server via the standard HTTP protocol as well as a real time asynchronous communication bus 730. These web services fetch the data and then display the data to the user.
In an embodiment, a machine agent runs on all management daemons where remote script execution and enhanced machine metrics extraction are desired.
In an embodiment, by building and configuring various rules, a user may monitor and manage the system performance using the SPM. A rule defines conditions for monitoring target objects. A rule may also specify an action to be taken on the selected target objects when the specified condition is met.
In an embodiment, rules are the basic building blocks of the SPM. There are two types of rules, simple rules and complex rules.
Providing basic rule information 910 may include providing information such as name and description. In an embodiment, providing basic rule information 910 may also include specifying the schedule for running the rule from a pre-defined schedule in a schedule library. In another embodiment, providing basic rule information 910 may also include setting priority for rules.
Choosing a target object 920 may include choosing either a single target object 922 or a group of target objects 924. A group of target objects 924 may be formed of objects that are of the same type or share a criterion. Target objects 922, 924 may be machines, nodes, service assemblies, service instances, or operations. In an embodiment, the target objects are selected from the infrastructure or deployment views of the TIBCO ActiveMatrix® environment or domain. In an embodiment, the TIBCO BusinessWorks™ Service Probe is installed and BusinessWorks™ services and processes may be selected as target objects.
Depending on the target object selected, the relevant metrics are made available for creating a condition 930. A condition may be simple 932 or complex 934. In an embodiment, a complex condition 934 may combine up to three conditions using logical AND operators. Conditions may be validated at run-time and, when the specified criteria are fulfilled, an action may be triggered.
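The distinction between a simple condition and a complex condition formed from up to three ANDed conditions may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Condition type, the operator table, and the metric names are hypothetical, and a metric that is not reported is assumed to leave the condition unmet.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Comparison operators a condition may use (an assumed set).
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}
MAX_CONDITIONS = 3  # the embodiment above allows up to three conditions


@dataclass(frozen=True)
class Condition:
    metric: str
    op: str
    threshold: float

    def holds(self, metrics: Dict[str, float]) -> bool:
        """A simple condition: one comparison on one metric."""
        if self.metric not in metrics:
            return False  # missing metric: treat the condition as unmet
        return OPS[self.op](metrics[self.metric], self.threshold)


def complex_condition(conditions: List[Condition], metrics: Dict[str, float]) -> bool:
    """Logical AND of one to three simple conditions."""
    if not 1 <= len(conditions) <= MAX_CONDITIONS:
        raise ValueError("a complex condition combines 1 to 3 conditions")
    return all(c.holds(metrics) for c in conditions)
```

For example, a rule could require both CPU usage above 90 percent and latency of at least 500 ms before any action is triggered.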
Setting actions 940 includes setting the actions to be taken when any condition defined in a rule is satisfied. Single 942 or multiple 944 actions may be taken for any given condition. An action may be set to, for example, send alerts, invoke a script, or log events.
A rule may be a standalone rule, or it may be part of an objective which belongs to a rule package.
An objective is a collection of rules intended to achieve a definite goal. The objective can impose common metadata, schedules, and actions on the rules contained within it. In an embodiment, a set of objectives packaged to achieve business goals is called a rule package. Rule packages may be organized by the business roles, which are based on the level of service the rule package represents.
A referenced target object is a target object that is referenced by one or more rules. The conditions defined in the rule are validated against the selected target objects. If a condition is violated, the rule is triggered to send an alert. If the rule is associated with an action, the action takes corrective measures and tries to bring the performance within the specified condition.
A schedule defines a recurring time period during which a rule, objective, or rule package is run. In an embodiment, the schedule set for a rule applies only if the rule is a stand-alone rule that does not belong to a rule package or objective. In an embodiment, if the rule is in an objective, the objective schedule takes precedence; in other words, by default, when a rule is added to an objective, the rule's schedule is not copied. A rule package contains the default schedule for all objectives in the rule package, which is used when an objective has no schedule of its own; however, an objective is not required to have a schedule.
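The precedence among rule, objective, and rule package schedules described above may be summarized by the following Python sketch. The function name and its parameters are hypothetical, and schedules are represented simply by their names; None stands for "no schedule defined."

```python
from typing import Optional


def effective_schedule(
    rule_schedule: Optional[str],
    in_objective: bool = False,
    objective_schedule: Optional[str] = None,
    package_schedule: Optional[str] = None,
) -> Optional[str]:
    """Resolve which schedule governs a rule.

    - A stand-alone rule uses its own schedule.
    - A rule inside an objective ignores its own schedule; the objective's
      schedule applies, falling back to the rule package default when the
      objective has no schedule of its own.
    """
    if not in_objective:
        return rule_schedule
    if objective_schedule is not None:
        return objective_schedule
    return package_schedule
```

A result of None in this sketch means that no schedule constrains the rule; how an embodiment treats that case is not specified here.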
A schedule may contain “include” and “exclude” time periods that control when associated rules should or should not be run. For example, a schedule called “Peak Hours” could include the hours from 9 PM to midnight daily for all months of the year, but exclude the hours from 3 AM to 6 AM for January. In an embodiment, multiple include and exclude time periods for a single schedule are defined.
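One possible way to evaluate include and exclude time periods is sketched below in Python. The Period and Schedule types are hypothetical, periods are assumed to be expressed as whole-hour ranges optionally restricted to certain months, and an exclude period is assumed to override any include period that overlaps it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Period:
    start_hour: int                          # inclusive, 0-23
    end_hour: int                            # exclusive, 24 means midnight
    months: Optional[FrozenSet[int]] = None  # None means every month

    def contains(self, when: datetime) -> bool:
        in_month = self.months is None or when.month in self.months
        return in_month and self.start_hour <= when.hour < self.end_hour


@dataclass
class Schedule:
    name: str
    include: List[Period] = field(default_factory=list)
    exclude: List[Period] = field(default_factory=list)

    def is_active(self, when: datetime) -> bool:
        """Active if some include period matches and no exclude period does."""
        included = any(p.contains(when) for p in self.include)
        excluded = any(p.contains(when) for p in self.exclude)
        return included and not excluded


# The "Peak Hours" example from the text: 9 PM to midnight in all months,
# excluding 3 AM to 6 AM in January.
peak_hours = Schedule(
    "Peak Hours",
    include=[Period(21, 24)],
    exclude=[Period(3, 6, frozenset({1}))],
)
```

Multiple include and exclude periods may be listed in a single schedule simply by adding more Period entries to either list.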
In an embodiment, the SPM supports global schedules, owned by super users, and schedules owned by individuals.
A super user is a user with the privilege of creating and managing global schedules, including the out-of-the-box schedules. Global schedules are available to all users. A super user may also delete and edit schedules created by individual users, and duplicate a user-owned schedule and save it as a global schedule.
An individual user can see and duplicate all schedules in the library, edit the schedules owned by the individual user, and see and use global schedules or their own schedules in the schedule drop-down list in the rule builder and in the dialogs for creating rule packages. An individual user may replace an owned schedule with another owned schedule or with a global schedule. A schedule can be replaced either universally (replace the old schedule with the new schedule everywhere it is used) or individually (navigate through all the locations it is used, and replace it with another schedule). An individual user can also delete owned schedules that are not used anywhere.
In an embodiment, the SPM includes an action library. The action library contains a list of web services. These services may automatically perform service management tasks and save administrator time. The scope of what a service can do depends on how the web service is written. A service is configured to apply to a specific endpoint or a target service for a specific target object type.
When creating a rule using a rule builder, a user may choose to invoke a script when the rule is triggered and conditions in the rule are met. In an embodiment, a super user may create a script that is designed to add a new node if demand on a single node exceeds a maximum amount. The script may also contain an undo method to remove the extra node when demand drops again. The undo method corresponds to a cancel condition state defined in the rule. An individual user may choose to use this script when creating a rule.
In an embodiment, the SPM provides some global services owned by super users. In this embodiment, the SPM supports only services owned by super users. Individual users may only see a list of services in the rule builder and choose which services apply to the rule. The service's name and owner may be displayed in the rule builder. A super user may add services that are globally available to all users. A super user may also delete services and replace a service with another service or with no service at all. If an in-use service is replaced, a notification may be automatically sent to rule owners. A super user may also prevent or allow services from displaying in a choose service panel in the rule builder.
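A script with a run method and a matching undo method, driven by the trigger and cancel states of a rule, might be sketched in Python as follows. The class ScaleOutScript, the function on_rule_event, and the node naming scheme are hypothetical; an actual script would provision and release real nodes rather than edit an in-memory list.

```python
from typing import List, Optional


class ScaleOutScript:
    """Hypothetical script: add a node when triggered, remove it on undo."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        self._added: List[str] = []  # nodes this script is responsible for

    def run(self) -> str:
        new_node = f"node-{len(self.nodes) + 1}"
        self.nodes.append(new_node)
        self._added.append(new_node)
        return new_node

    def undo(self) -> Optional[str]:
        if not self._added:
            return None  # nothing to undo
        node = self._added.pop()
        self.nodes.remove(node)
        return node


def on_rule_event(script: ScaleOutScript, demand: float,
                  max_demand: float, scaled_out: bool) -> bool:
    """Run the script on trigger, undo it on the cancel condition.

    Returns the new scaled-out state so the caller can carry it forward.
    """
    if demand > max_demand and not scaled_out:
        script.run()       # rule triggered: add a node
        return True
    if demand <= max_demand and scaled_out:
        script.undo()      # cancel condition: remove the extra node
        return False
    return scaled_out      # no state change
```

Tracking the scaled-out state keeps the script from adding a node on every evaluation while the demand remains high.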
SLAs specify a service level that a service provider will guarantee. For example, an SLA may guarantee a maximum response time. In some cases, an SLA may only be fulfilled if the service consumer adheres to specific conditions or obligations. For example, a loan processing service may be able to guarantee a 5-second response time, but only if the loan request rate does not exceed one per second.
This invention extends SLA specifications to include the notion of an obligation on the part of the service consumer. Thus an SLA is only required to be met if the service consumers meet the specified obligation or constraint. A consumer obligation is a measurable characteristic that cannot be controlled by the service provider, but can be monitored and acted upon if breached.
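The loan processing example above, in which a 5-second response time is guaranteed only while the request rate does not exceed one per second, may be sketched in Python as follows. The type names and the three-valued status are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SlaStatus(Enum):
    MET = "met"
    VIOLATED = "violated"
    NOT_BOUND = "not_bound"  # consumer breached its obligation


@dataclass(frozen=True)
class ObligationBoundSla:
    max_response_s: float    # provider guarantee
    max_request_rate: float  # consumer obligation, requests per second

    def evaluate(self, response_s: float, request_rate: float) -> SlaStatus:
        # The guarantee is only binding if the consumer met its obligation.
        if request_rate > self.max_request_rate:
            return SlaStatus.NOT_BOUND
        if response_s <= self.max_response_s:
            return SlaStatus.MET
        return SlaStatus.VIOLATED


# Loan processing example: 5-second response, at most one request per second.
loan_sla = ObligationBoundSla(max_response_s=5.0, max_request_rate=1.0)
```

Reporting NOT_BOUND separately from VIOLATED lets a slow response caused by an overactive consumer be distinguished from a genuine provider-side violation.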
The source of a service provider's conditions may be internal (e.g., a limitation of the provider's physical capacity), or it may be a byproduct of the service provider's secondary role as a service consumer. In this latter case, a service provider that requires the use of another service provider to complete its task may propagate the secondary provider's obligations back to the initial service consumer. Consumer obligations may include, for example, request rate, request size, request form compliance, request content compliance (erroneous payload generating a large amount of faults), and response profile (valid payload generating abnormal backend load). Service provider obligations may include, for example, response time, throughput, error rate, and availability over a period of time.
Obligations differ fundamentally from ordinary SLA characteristics (such as guaranteed response time or availability) as they are not generally controlled by the service provider. Obligations can be used effectively in a number of scenarios. These scenarios include providing advanced warning that a service consumer is misbehaving; making a decision not to mitigate and provide additional provider computing resources if consumer obligation is not met; providing insight to SLA violations, and indicating remedial steps that identify and isolate the violation's source; and mitigating monetary impact when an SLA is violated due to unfulfilled consumer obligations.
The SPM provides methods to mitigate the effect of misbehavior by any component of the system. Any component in the system, whether it is a consumer or provider of a service, may misbehave due to a hardware or software failure. The SPM can detect such situations by combining a number of factors originated from the consumer, provider, and infrastructure. Identifiable situations are consumer-bound, provider-bound, or infrastructure-bound.
Consumer bound situations include abnormal request size or throughput, erroneous payload generating a large amount of faults, and erroneous payload generating abnormal backend load. Provider bound situations include overloaded backend CPU, provider software failure, and deadlocks. Infrastructure bound situations include machine failure and network failure.
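The grouping of situations listed above may be represented, purely as an illustrative sketch in Python, by a lookup table from an observed symptom to its source category. The symptom identifiers are hypothetical labels for the situations named in the preceding paragraph.

```python
from typing import Iterable, List

# Hypothetical symptom identifiers mapped to the source categories above.
SITUATION_SOURCE = {
    "abnormal_request_size": "consumer",
    "abnormal_throughput": "consumer",
    "payload_causing_faults": "consumer",
    "payload_causing_backend_load": "consumer",
    "backend_cpu_overload": "provider",
    "provider_software_failure": "provider",
    "deadlock": "provider",
    "machine_failure": "infrastructure",
    "network_failure": "infrastructure",
}


def classify(symptoms: Iterable[str]) -> List[str]:
    """Return the sorted set of source categories for the given symptoms.

    Unknown symptoms are ignored in this sketch.
    """
    found = {SITUATION_SOURCE[s] for s in symptoms if s in SITUATION_SOURCE}
    return sorted(found)
```

When several categories are returned, an embodiment would still need to combine further factors to decide which source to act upon.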
The SPM assesses the source of the problem by collecting metrics, detecting threshold violations, and identifying the source of the issue (machine, client, user app, service, etc.). The SPM has the ability to mitigate the effect of a misbehaving application through isolating the source of the issue through a blocking or throttling policy or removing the source of the issue if authorization permits.
The service-oriented architecture in
Once possible violations have been identified at Block 1320 and/or Block 1330, the service-oriented architecture includes an “Evaluate Mitigation Step” at Block 1340. Depending upon the violations identified, any of several steps can be taken as illustrated in Blocks 1350, 1360, and 1370. As illustrated in these respective action blocks, the violations can be addressed by: adding more resources, assigning different resources, or otherwise re-provisioning resources (as indicated in Block 1350); alerting the consumer to the problem, such that the consumer could resubmit the job, reconfigure the job, assign the job to another provider, or take some other action (as indicated in Block 1360); or throttling or shutting down a consumer (or specifically an agent process/daemon) operating on a consumer computer (Block 1370).
Modules of the SPM system operate on the processor in the SPM computer 1410. Rules and measurements may be stored on databases 1420, 1430 and may be accessed by the SPM computer 1410 and implemented or used by the modules of the SPM system. The SPM computer 1410 sends and receives information through a network 1450 to and from one or more SOA application computers 1460. SPM probes 1465 are located on the SOA application computers 1460 and can monitor data pertinent to SOA platforms. In an embodiment, the probes are directly embedded inside the container infrastructure. Information gathered by the probes 1465 may be distributed to the SPM computer 1410 running the SPM system services, through a network 1450. Results and metrics may be sent through a network 1440 to a display computer 1470. In an embodiment, a display computer 1470 may execute a dashboard, which may include displaying results and metrics on a dashboard console 1480. In an embodiment, the SPM computer 1410 may update the display computer and the dashboard console through the network 1440.
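A simplified decision procedure for the mitigation evaluation may be sketched in Python as follows. The mapping from a violation's source to a particular action block is an assumption made for illustration only; this disclosure lists the available actions but does not prescribe which one is chosen for a given violation. The enumeration values merely echo the block reference numbers.

```python
from enum import Enum


class Mitigation(Enum):
    REPROVISION = 1350        # add, reassign, or re-provision resources
    ALERT_CONSUMER = 1360     # tell the consumer so it can react
    THROTTLE_CONSUMER = 1370  # throttle or shut down the consumer agent


def evaluate_mitigation(source: str, obligation_met: bool) -> Mitigation:
    """Pick one mitigation for a violation (assumed policy).

    source: "consumer", "provider", or "infrastructure".
    obligation_met: whether the consumer kept its SLA obligations.
    """
    if source == "consumer" and not obligation_met:
        # A consumer that breaches its obligation is isolated rather than
        # given additional provider resources.
        return Mitigation.THROTTLE_CONSUMER
    if source == "consumer":
        return Mitigation.ALERT_CONSUMER
    # Provider-bound and infrastructure-bound issues: re-provision.
    return Mitigation.REPROVISION
```

An embodiment could also return several mitigations at once, for example alerting the consumer while re-provisioning resources.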
The SPM computer 1410 receives measurements 1490 through the network 1450 from system probes 1465 and sends assurances 1495 through the network to the SOA application computers 1460.
While various embodiments in accordance with the principles disclosed herein have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the invention(s) should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with any claims and their equivalents issuing from this disclosure. Furthermore, the above advantages and features are provided in described embodiments, but shall not limit the application of such issued claims to processes and structures accomplishing any or all of the above advantages.
Additionally, the section headings herein are provided for consistency with the suggestions under 37 CFR 1.77 or otherwise to provide organizational cues. These headings shall not limit or characterize the invention(s) set out in any claims that may issue from this disclosure. Specifically and by way of example, although the headings refer to a “Field of the Invention,” the claims should not be limited by the language chosen under this heading to describe the so-called field. Further, a description of a technology in the “Background of the Invention” is not to be construed as an admission that certain technology is prior art to any invention(s) in this disclosure. Neither is the “Brief Summary of the Invention” to be considered as a characterization of the invention(s) set forth in issued claims. Furthermore, any reference in this disclosure to “invention” in the singular should not be used to argue that there is only a single point of novelty in this disclosure. Multiple inventions may be set forth according to the limitations of the multiple claims issuing from this disclosure, and such claims accordingly define the invention(s), and their equivalents, that are protected thereby. In all instances, the scope of such claims shall be considered on their own merits in light of this disclosure, but should not be constrained by the headings set forth herein.
This patent application relates and claims priority to provisional patent application 61/048,932, entitled “Service Performance Manager with Obligation-Bound Service Level Agreements (SLA) and Patterns for Mitigation and Autoprotection,” filed Apr. 29, 2008, which is herein incorporated by reference for all purposes.
Number | Date | Country
---|---|---
61048932 | Apr 2008 | US