The present invention relates to computer servers, and particularly to detecting anomalous activity in computer servers by dynamically identifying potential data streams that can be embedded, at data generation itself, with specific data for facilitating anomaly detection and/or accurate decision making.
Operating systems (e.g., z/OS) provide controls to share finite hardware resources amongst client services. A workload consists of one or more jobs performing computing for similar client services. When multiple workloads are executing in parallel on the same operating system, a component (e.g., Workload Manager (WLM) on z/OS) provides controls to define attributes for each workload, such as an importance level and a goal (e.g., response time). At regular intervals (e.g., every 10 s), this component assesses the results of each workload and may change the scheduler priority attribute of each workload so that the most important workloads achieve their goals. Work represents the aggregate computing performed across all workloads.
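As a simplified illustration of this goal-driven control loop (the attribute names and the single boost rule are assumptions for exposition, not the actual WLM algorithm), the interval assessment might be sketched as:

```python
def assess_interval(workloads):
    """Once per interval (e.g., every 10 s), walk workloads from most to
    least important and boost the scheduler priority of any workload that
    is missing its response-time goal. A hypothetical simplification."""
    for w in sorted(workloads, key=lambda w: w["importance"]):  # 1 = most important
        if w["response_time"] > w["goal"]:
            w["priority"] += 1  # favor this workload in the next interval

workloads = [
    {"name": "online", "importance": 1, "goal": 0.2, "response_time": 0.5, "priority": 5},
    {"name": "batch",  "importance": 2, "goal": 9.0, "response_time": 4.0, "priority": 3},
]
assess_interval(workloads)
```

Here the "online" workload misses its goal and is boosted, while "batch" meets its goal and is left unchanged.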
For images serving multiple (e.g., double digits) workloads, transient performance problem diagnosis requires identifying problematic workload(s), defining the root cause, and recommending corrective action. A performance analyst uses visual analytics to graphically visualize activity in the form of metrics (e.g., central processing unit (CPU) execution time, CPU efficiency, CPU delay, serialization contention, etc.) against time for all work to define normal and anomalous activity. Detailed visual analytics against each workload can be overwhelming to an analyst and require significant computing resources.
A computer-implemented method for detecting anomalies in computing systems includes measuring activity metrics associated with accessing resources of the system by several users. Further, condensed diagnostic data is generated by grouping the users into buckets based on bucket and user attributes, and aggregating the activity metrics across all users in each bucket. Bucket contents are recorded during the system's use, during which analytic embedded data is generated for anomaly detection. The generating includes capturing, for each bucket, the activity metrics of an exceptional user without aggregation at a next time interval.
A system includes a memory and one or more processing units that perform a method for detecting anomalies in computing systems, the method including measuring activity metrics associated with accessing resources of the system by several users. Further, condensed diagnostic data is generated by grouping the users into buckets based on bucket and user attributes, and aggregating the activity metrics across all users in each bucket. Bucket contents are recorded during the system's use, during which analytic embedded data is generated for anomaly detection. The generating includes capturing, for each bucket, the activity metrics of an exceptional user without aggregation at a next time interval.
A computer program product includes a memory device with computer-executable instructions therein, the instructions when executed by a processing unit performing a method for detecting anomalies in computing systems, the method including measuring activity metrics associated with accessing resources of the system by several users. Further, condensed diagnostic data is generated by grouping the users into buckets based on bucket and user attributes, and aggregating the activity metrics across all users in each bucket. Bucket contents are recorded during the system's use, during which analytic embedded data is generated for anomaly detection. The generating includes capturing, for each bucket, the activity metrics of an exceptional user without aggregation at a next time interval.
The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings.
The diagrams depicted herein are illustrative. There can be many variations to the diagrams or the operations described therein without departing from the spirit of the invention. For instance, the actions can be performed in a differing order, or actions can be added, deleted, or modified. Also, the term “coupled” and variations thereof describe having a communications path between two elements and do not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.
In the accompanying figures and following detailed description of the disclosed embodiments, the various elements illustrated in the figures are provided with two- or three-digit reference numbers. With minor exceptions, the leftmost digit(s) of each reference number corresponds to the figure in which its element is first illustrated.
Embodiments of the invention described herein address technical challenges in computing technology, particularly in fields where multiple data sources (producers) generate respective streams of data, and the data has to be analyzed to identify anomalies. Alternatively, or in addition, embodiments of the present invention also facilitate accurate decision making. At times, decision making requires correlated and concentrated high-value data to deal with the extreme number of variables (hundreds, thousands, etc.) that are to be considered. Bulk streams of low-value detail data increase the computing costs required to generate, transmit, and analyze such data. One or more embodiments of the present invention use embedded analytics to include data-concentrating sensors in the mainline path operations to share data analysis responsibilities between the data producers and the analytics engine. The additional costs incurred by not producing and consuming ideal purpose-built data significantly impede the ability to cost-effectively perform timely and accurate decision making.
For example, computer servers, such as large-scale server environments, face such technical challenges because of the complexity of software interactions, when processing arbitrarily large numbers of resources, when analyzing orders of magnitude of generated “log data,” and several other scenarios. In any such scenario, patterns of operation (legitimately) change over time and can be periodic or chaotic. Detecting anomalies in the data generated can require “domain knowledge” to “understand” the true anomaly. Further, detecting the anomalies can be operationally expensive (processor-intensive, memory-intensive, impacts to service level agreements (SLA)), particularly to monitor “everything,” i.e., each data element that is generated and consumed. Presently, existing solutions use improved algorithms (e.g., sampling, filtering, etc.) and artificial intelligence, for example, to reduce the data being analyzed or to reduce the patterns to be analyzed. However, such technical solutions are limited by the quality of data.
It should be noted that while embodiments of the present invention are described using the context of computer servers and operations associated with such computer servers, other embodiments of the present invention can be applied in other technical fields with the growing volume of raw data. For example, the proliferation of machine data is being accelerated by the expanding use of internet-of-things (IoT), with some reports indicating that there will be more than 41 billion connected IoT devices, generating an estimated 79.4 zettabytes (ZB) of data in the year 2025. Such IoT devices can be found not only in household consumer settings but also in industrial settings, such as factories, warehouses, supply-chain routes, etc. Further, advances and proliferation of communication networks have increased the use of streaming media, vehicle-to-vehicle communications, e-commerce, and several other use cases where large amounts of electronic/digital data are being generated and consumed. It is understood that the above are illustrative uses and that embodiments of the present invention are not limited to only such uses but rather can be applicable in several other scenarios.
With the increasing trend in the use of digital data, cyberattacks have also increased in frequency. Accordingly, security analytics is critical for the success of uses of digital data, such as those mentioned herein. As organizations become more data-driven, they have scaled their analytics capabilities using automation. Artificial intelligence is being used to automate processes from recommendations and bidding to pattern detection and anomaly detection. Generally, the presently available techniques for anomaly detection rely on analyzing unidimensional time series. Such techniques are limited because the data that is generated, especially with the proliferation of computer servers and communication devices, is multi-dimensional. For instance, in microservices-based architectures (which routinely comprise thousands of microservices), analyzing data of individual microservices would, most likely, mask key insights.
Technical challenges described herein are addressed by one or more embodiments of the present invention by facilitating embedding analytics into the data generation over time-series intervals. Accordingly, data generation is improved to facilitate the detection of anomalies, which can be at the data generation and/or the data consumption. In some embodiments of the present invention, the improved data generation builds upon operating-system-level awareness, exploiting standardized data collection points, grouping arbitrarily long lists of resources (e.g., files) into resource groups, capturing exploitation patterns of consumers acting on any group over a finite time interval, determining resource groups being "offended" by consumers, and identifying the offending consumer(s) for those resource group(s) over the next time interval to capture the specific set of resources.
Further, embodiments of the present invention facilitate using anomaly detection to determine relationships that, in turn, facilitate reducing time to detection and remediation. Additionally, embodiments of the present invention group anomalies associated with separate producers/consumers so that a single alert/notification can be provided collectively for that entire group instead of multiple alerts—for example, one for each anomaly in a group.
Embodiments of the present invention can accordingly reduce the need to monitor everything, for example, each and every metric generated, each and every data stream, etc. Rather, embodiments of the present invention facilitate automatically determining what exceptional activity to focus on based on aggregated, summarized data. Further, embodiments of the present invention can detect the anomalies dynamically, without any up-front policy definitions. Embodiments of the present invention facilitate using minimal processor and memory resources during anomaly detection. Further, embodiments of the present invention ensure that resources are used for monitoring exceptional behavior and trends, capturing and comparing them over time, rather than monitoring "uninteresting" behavior, which is ignored (automatically filtered).
Additionally, embodiments of the present invention facilitate correlating individual consumers directly to individual resources being acted upon. Accordingly, embodiments of the present invention are based on using exceptionalism-enriched data streams.
Various embodiments of the invention are described herein with reference to the related drawings. Alternative embodiments of the invention can be devised without departing from the scope of this invention. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.
The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having," "contains," or "containing," or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.
Additionally, the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc. The term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc. The term “connection” may include both an indirect “connection” and a direct “connection.”
The terms “about,” “substantially,” “approximately,” and variations thereof are intended to include the degree of error associated with the measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ±8% or 5%, or 2% of a given value.
For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.
Assessing the performance of workloads on computing systems can be an important part of the testing and day-to-day operation of workloads. Traditionally such assessments are accomplished through workload performance instrumentation that includes a workload and component summary collected at a long interval (e.g., 15 minutes). Performance analysts begin by assessing the overall workload and component summary. When the overall results are unexpected, a performance problem occurs during most of the long interval (e.g., the problem occurred for 10 out of 15 minutes), and the analyst knows which components require further investigation. When the overall results look good, there can be transient performance problems occurring for a small part of the interval (e.g., 3 minutes) that go unnoticed because they are lost in averages across the interval. For example, 90% CPU utilization for a long interval (e.g., 15 minutes) can be achieved through the workload consistently running at 90% CPU utilization or the workload having periods at 70% CPU utilization and other periods at 100% CPU utilization. Using existing techniques, performance analysts cannot see the difference. Gathering the workload and component summary has high compute costs at the interval end, so collecting the data at a shorter interval (e.g., 1 minute) can incur unacceptable compute costs and, in some situations, distort the underlying performance.
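The masking effect of long-interval averages can be shown numerically; the utilization figures below simply mirror the 90%/70%/100% example in the text:

```python
# Two 15-minute CPU utilization profiles, one sample per minute.
steady = [90] * 15              # consistently 90% utilization
bursty = [70] * 5 + [100] * 10  # periods at 70% and a transient saturation at 100%

avg = lambda samples: sum(samples) / len(samples)

# Both profiles report the same 90% average over the long interval, so the
# saturation period in the bursty profile is invisible in the summary --
# precisely the masking problem described above.
```

Only by sampling at a shorter interval would the two profiles become distinguishable, which is what motivates low-cost high-frequency data generation.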
A computer server (“server”) makes finite hardware resources available to multiple applications. The server consists of many stack layers (e.g., middleware, operating system, hypervisor, and hardware). Every stack layer contains components (single to double digits) that manage resources (single digits to thousands) that are virtualized to the applications and consequently to the users of those applications. The workload consists of stack layers, components, and user requests. The workload context consists of component activities and resources and their interdependencies and interactions. As the arrival pattern changes, the workload context changes.
A workload performance problem typically describes a problem symptom like a slow/erratic response time or high resource contention. The overall workload and component summary are investigated, and the problem is sent to the component that is most likely the problem source for further diagnosis. A component expert generally begins with first failure data capture that includes a multi-minute (e.g., 15 minutes) component summary of activity (e.g., requests, response times) to identify normal and anomalous results. If no anomalous results are found, the component is not obviously involved, and the problem is sent to a different component expert. When an individual component discovers anomalous results or all components have no anomalous results, in summary, component details (e.g., all component activity records) must be investigated. Each component has its own controls to capture component details due to the high CPU overheads associated with collecting component details. Collecting component details requires recreating the problem. If the component details across all suspected components do not contain information about the anomalous results, new traces and diagnostics must be pursued. With the necessary component details, an expert will be able to define the problem or route the problem to another expert to investigate further. Recreating the problem to collect new data, transform data, analyze data, engage new experts, collect additional data, and correlate data across components increases the time required to define the workload context and ultimately define the underlying problem.
With existing technologies, an advanced performance analyst can apply machine learning to build a model using detailed training data. Machine learning training requires significant compute and memory resources to transform data, identify and consider important data, and ignore the noise. With a model in place, test data can be scored to detect and correlate anomalies. An advanced performance analyst then defines a problem that fits the anomalies from machine learning. A problem definition enables a performance analyst to take action against a workload component or resource to address the problem.
With existing technologies, workload components cannot produce high-frequency summary data for an acceptable CPU cost with current support and procedures. Using existing techniques, workload components can collect summary data for long intervals (e.g., 15 minutes) at an acceptable CPU cost. Summary data cannot be collected at a short interval (e.g., less than 1 minute) because of the unacceptable increase in CPU cost, and doing so can distort the problem. With existing techniques, workload component details can be collected for specific problems but incur unacceptable CPU costs when regularly collected.
The present invention provides an orthogonal approach to generating synchronized, standardized, and summarized data for immediate analysis. This smarter data can be collected at a human-consumable high-frequency (e.g., greater than one second) for an undetectable CPU cost. A lightweight analytics engine can transform this smarter data into component activity and resource micro-trends and correlate micro-trends to reveal workload component activity and resource interdependencies and interactions with cause and victim peers. The whole process, from the smarter data generation to the analysis, focuses on summarizing data and thereby reducing noise, which enables an analyst to quickly transform data into insights.
Embodiments of the present invention facilitate diagnosing workload performance problems by collecting activity (e.g., CPU execution time) at a human-consumable high-frequency (e.g., greater than one second), establishing the activity normal baseline (e.g., mean), identifying baseline deviations (e.g., deviating 10% above or below the baseline), and temporally correlating baseline deviations. A micro-trend is a short-duration (e.g., one or more high-frequency intervals) deviation from the baseline. Further, every micro-trend contains a peak for every baseline deviation period above the baseline or a valley for every baseline period below the baseline. Micro-trend peak and valley correlations are used to identify the cause and victim peers amongst component activities and resources across the stack.
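The baseline-deviation logic described above can be sketched minimally as follows, assuming the mean as the baseline and the ±10% threshold mentioned in the text; the function name and return shape are illustrative:

```python
def micro_trends(samples, threshold=0.10):
    """Return (baseline, trends), where trends is a list of
    (start, end, kind) spans of consecutive high-frequency intervals
    deviating more than `threshold` above ("peak") or below ("valley")
    the mean baseline."""
    baseline = sum(samples) / len(samples)
    hi, lo = baseline * (1 + threshold), baseline * (1 - threshold)
    trends = []
    for i, v in enumerate(samples):
        kind = "peak" if v > hi else "valley" if v < lo else None
        if kind and trends and trends[-1][2] == kind and trends[-1][1] == i - 1:
            trends[-1] = (trends[-1][0], i, kind)  # extend the current run
        elif kind:
            trends.append((i, i, kind))            # start a new micro-trend
    return baseline, trends

baseline, trends = micro_trends([10, 10, 10, 14, 14, 10, 6, 10])
```

For this input the baseline is 10.5, the two 14s form one peak micro-trend, and the 6 forms a one-interval valley micro-trend.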
One or more embodiments of the present invention address technical challenges and facilitate an analyst to quickly investigate component data to identify normal and anomalous activity and determine the workload context. Accordingly, one or more embodiments of the present invention facilitate decreasing the time required to determine the involved components, their interdependencies, their interactions, and how they are being affected by the underlying performance problem. One or more embodiments of the present invention are rooted in computing technology, particularly diagnosing workload performance problems in computer servers. Further, one or more embodiments of the present invention improve existing solutions to the technical challenge in computing technology by significantly reducing the time required to identify normal and anomalous activity and determine the workload context.
Embodiments of the present invention facilitate diagnosing workload performance problems by using time-synchronized cross-stack micro-trend data generation.
Performance problems do not occur in a vacuum. Their ripple effects permeate through the workload. One or more embodiments of the present invention use such component ripple effects to detect clues to define the underlying problem. Component ripple effects can have short or long durations with impacts ranging from none, to subtle, to significant. Detecting such component ripples requires high-frequency, synchronized, standardized, and summarized data generation. Accordingly, micro-trends make subtle component ripple effects for transient durations detectable and hence can be used for diagnosing previously undetectable workload performance problems.
One or more embodiments of the present invention facilitate generating micro-trends with a substantial reduction in CPU costs. Using one or more embodiments of the present invention, because of low overhead, a server can aggregate always-on cross-stack high-frequency activity metrics that capture the arrival pattern effects on the workload context. An analytics engine transforms activity metrics into micro-trends. Correlating micro-trends cast a wide net to catch ripple effects across the entire workload and ensure performance first failure data capture is available whenever a performance problem is reported.
Processors 104 may include one or more processors, including processors with multiple cores, multiple nodes, and/or processors that implement multi-threading. In some embodiments, processors 104 may include simultaneous multi-threaded processor cores. Processors 104 may maintain performance metrics 120 that may include various types of data that indicate or can be used to indicate various performance aspects of processors 104. Performance metrics 120 may include counters for various events that take place on the processors or on individual processor cores on a processor. For example, a processor may have architected registers that maintain counts of instructions, floating-point operations, integer operations, on-processor cache hits, misses, pipeline stalls, bus delays, etc. Additionally, time may be a performance metric. Registers or other data locations or functions that maintain a time value may be used as a performance metric 120 in some embodiments.
Memory 110 may be system memory (e.g., one or more of cache, SRAM, DRAM, zero capacitor RAM, Twin Transistor RAM, eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS, PRAM, etc.). A memory controller for memory 110 may maintain performance metrics 126 that may include various types of data that indicate or can be used to derive indicators of memory performance. For example, memory performance metrics 126 may include a counter for the number of memory accesses, type of accesses (e.g., read or write access), cache hits, cache misses, etc.
Power subsystem 106 provides and regulates power to the various components of computer system 102. Power subsystem 106 may maintain performance metrics 122 that comprise voltage levels for various rails of one or more power supplies in power subsystem 106.
Storage subsystem 108, when present, provides persistent storage for computer system 102. Such storage can include hard disks, optical storage devices, magnetic storage devices, solid-state drives, or any suitable combination of the foregoing. Storage subsystem 108 may maintain performance metrics 124 that may include counts of read or write accesses, or timing information related to reads, writes, and seeks.
Communication subsystem 112, when present, provides network communication functions for computer system 102. Communication subsystem 112 may maintain performance metrics 128 that may include counts of packets transmitted and received and other data regarding network communications. In some embodiments, communication subsystem 112 may include a network interface (e.g., an ATM interface, an Ethernet interface, a Frame Relay interface, SONET interface, wireless interface, etc.)
The computer system 102 contains operating systems (142) that can be configured to process workloads 140. A workload 140 is a set of tasks interacting to complete requests from users 150. An operating system 142 maintains performance metrics 114 for each user about its communication activity (e.g., data size) and resource use (e.g., time using the network adapter to send/receive packets from communication subsystem 112). In some embodiments, a performance manager 116 facilitates tracking performance metrics (e.g., read and write accesses from memory performance metrics 126) and updating workload and user metrics. Different workloads may have different characteristics. For example, OLTP (On-Line Transaction Processing) workloads typically involve many data entry or retrieval requests that involve many short database interactions. Data mining workloads, on the other hand, have few interactions with users but more complicated and lengthy database interactions. Different types of workloads 140 may have different impacts on the activities and resources of computer system 102.
In one or more embodiments of the present invention, a lightweight method includes an instruction sequence, used during the mainline operation of the workload, to aggregate the metrics described above. In one or more examples, the lightweight method is always running during mainline processing to aggregate metrics about computer system 102 resource use and workload 140 activity.
The performance manager 116 calculates metric deltas from the components of the computer system 102, including the workload 140, at periodic synchronized intervals. The periodic synchronized interval is at a human-consumable high-frequency that is greater than one second. The metrics for each component are generated in a continuous and always-on manner as described herein. In one or more embodiments of the present invention, an administrator can switch off the data generation via the performance manager 116. Data generation is based on a synchronized interval across the whole computer system 102. If different component metrics use different intervals, correlations become much less viable. Consequently, the metric deltas are computed at the synchronized human-consumable high-frequency interval (e.g., greater than one second) across all components.
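The delta computation over synchronized intervals can be sketched as below; the counter names and snapshot values are hypothetical, standing in for cumulative counters gathered from the components at the same interval boundary:

```python
previous = {}

def interval_deltas(current):
    """Given the latest cumulative counter snapshot taken at a synchronized
    interval boundary, return the per-interval deltas and remember the
    snapshot for the next interval."""
    global previous
    deltas = {name: current[name] - previous.get(name, 0) for name in current}
    previous = dict(current)
    return deltas

# Two snapshots taken at consecutive synchronized intervals:
d1 = interval_deltas({"cpu.dispatches": 100, "net.packets": 40})
d2 = interval_deltas({"cpu.dispatches": 180, "net.packets": 90})
```

Because every component's counters are snapshotted at the same boundary, the deltas from different components describe the same slice of time and can be correlated directly.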
The metric library 130 represents the collection of metrics 120, 122, 124, 126, 128 that the performance manager 116 produced across all aspects of the computer system 102. The metric library 130 may be part of computer system 102, or it may be maintained on a separate system that is available to the computer system 102.
In some embodiments, the metrics aggregated and captured are customized for a particular hardware implementation and/or for the particular type of workload 140. For example, for a particular workload 140, the metrics that are aggregated and captured only include hardware metrics 120 for the family of processors and memory subsystems in use.
The performance manager 116 further transforms the captured metrics into concise summaries using multiple levels of aggregation. Every aggregation level removes one or more details and further refines the data. The last aggregation level yields the context-rich and concise data required for micro-trends that can be used by an expert to define previously unseen workload performance problems.
Next, every component continuously aggregates activity metrics (e.g., number of requests, response times) on a per activity basis for each user 150 (202). This is the first level of aggregation. For example, the performance manager 116 aggregates CPU activity metrics (e.g., CPU requests [dispatches], CPU delay time, and CPU use time) from hardware metrics (120) and, in some embodiments of the present invention, the operating system metrics (142). The operating system 142 or performance manager 116 aggregates the results locally for every user 150.
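This first aggregation level, accumulating activity metrics per user on every event in the mainline path, might look like the following sketch; the event fields and metric names are hypothetical placeholders:

```python
from collections import defaultdict

# Per-user running totals; updated inline on the mainline path so no
# separate monitoring pass over detail records is needed.
per_user = defaultdict(lambda: {"requests": 0, "cpu_time": 0.0, "cpu_delay": 0.0})

def record_dispatch(user, cpu_time, cpu_delay):
    """Aggregate one CPU dispatch locally for `user` (a lightweight
    instruction sequence, per the text)."""
    m = per_user[user]
    m["requests"] += 1
    m["cpu_time"] += cpu_time
    m["cpu_delay"] += cpu_delay

record_dispatch("u1", 0.004, 0.001)
record_dispatch("u1", 0.002, 0.000)
```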
Consider a computer system that has 30 or more users 150. The CPU activity metrics for every user 150 can overwhelm a human expert that is investigating such activity. Moreover, in typical scenarios, the number of users is even larger (in hundreds, if not thousands). A context-rich and concise activity summary for buckets of users 150 with similar activity can facilitate the human expert to analyze the data and diagnose the workload problem more efficiently.
Then, for every human-consumable high-frequency interval (e.g., greater than one second), the performance manager (116) places each user, based on its attributes, into the single bucket with matching attributes (205) and, for each user, increments the count of users and aggregates the user's activity metrics into the bucket the user belongs to (206). During this second level of aggregation, the most significant user name and its activity metrics are included in the bucket (208). In this embodiment, there is a single most significant user name and activity metrics, but other embodiments may include multiple significant users (e.g., low single digits). In this embodiment, the performance manager 116 performs the actions required for blocks 206 and 208. As shown in block 210, the performance manager 116 then records bucket contents for analysis. In some embodiments, the performance manager 116 may output visual analytics to an operatively connected output device.
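Blocks 205-210 can be sketched as below; the attribute key, the metric, and the significance criterion (largest CPU time) are illustrative assumptions, not part of the claimed implementation:

```python
from collections import defaultdict

def aggregate_interval(users):
    """One high-frequency interval of the second aggregation level:
    place each user into the single bucket matching its attributes (205),
    aggregate its metrics into that bucket (206), and keep the most
    significant user's name and metric unaggregated (208)."""
    buckets = defaultdict(lambda: {"users": 0, "cpu_time": 0.0,
                                   "top_name": None, "top_cpu_time": 0.0})
    for u in users:
        b = buckets[u["attrs"]]  # single bucket with matching attributes
        b["users"] += 1
        b["cpu_time"] += u["cpu_time"]
        if u["cpu_time"] > b["top_cpu_time"]:
            b["top_name"], b["top_cpu_time"] = u["name"], u["cpu_time"]
    return dict(buckets)  # recorded for analysis (210)

buckets = aggregate_interval([
    {"name": "u1", "attrs": "batch",  "cpu_time": 5.0},
    {"name": "u2", "attrs": "batch",  "cpu_time": 1.0},
    {"name": "u3", "attrs": "online", "cpu_time": 2.0},
])
```

In this toy interval, the "batch" bucket records two users, 6.0 units of aggregate CPU time, and "u1" as its most significant user.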
Grouping users into a small number of buckets and aggregating user activity into buckets enables performance analysts to quickly detect user activity changes across all users 150 in the bucket. Furthermore, with the most significant user and its corresponding activity in each bucket, performance analysts can quantify how much of the bucket activity changes were attributable to the most significant user 150. Performance analysts can use a most significant user to determine whether one or multiple users are driving the majority of the bucket activity changes. When multiple users are driving bucket activity changes, performance analysts know other users are causing smaller impacts.
Further, in the same interval, low-level activities are generalized into higher-level user constructs. For example, low-level activity metrics that are associated with a specific user (150) are generalized by aggregating them into a bucket (206). When there is no specific user (e.g., for operating system overhead), the activity metrics are associated with the operating system (142), which may be treated as a special user (150) in its own bucket or like a regular user (150) and aggregated into an existing bucket (206). In either case, the data aggregation is performed continuously.
Over multiple intervals, bucket activity metrics exhibit normal and anomalous periods. Bucket activity metrics enable establishing normal baseline periods and treating baseline deviation periods as anomalous for a group of similar users. A micro-trend is a short-duration (e.g., one or more high-frequency intervals) deviation period from the baseline. Every micro-trend above the baseline has a peak (e.g., a maximum value) and every micro-trend below the baseline has a valley (e.g., a minimum value). When activity metric peaks or valleys occur in buckets across multiple components and users, those activities are correlated between cause and victim peers. Micro-trend correlations can reveal cross-stack interdependencies and relationships between buckets, most significant users, and activities because the same synchronized cross-stack interval is used to accumulate activity metrics across all components in the stack.
For any component across the hardware or software stack, micro-trend data generation delivers a cross-stack summary of vital statistics that identify the affected buckets, users, and activities of an ailing workload.
According to embodiments of the present invention, a performance analyst can much more quickly identify which workload component(s) and which user(s) are cause and victim peers in a transient performance problem.
One or more embodiments of the present invention measure per-user activity metrics for one or more activities independently from other activities. Aggregating user activity metrics into buckets improves the efficiency with which an analyst can diagnose workload performance problems. For example, if a component provides multiple services, the above technique can be applied to track only the relevant metrics for a particular service (e.g., the number of times the service, such as allocate memory, was called) for each user. As a second example, consider CPU use. The computer system 102 can have many CPUs, but which CPU a user operation actually ran on does not matter; what matters is the CPU time used. The above techniques therefore facilitate tracking the amount of CPU time used by each user. It is understood that CPU time (or processor usage, or processor time) is just one metric to which micro-trends can be applied. In a similar manner, and in conjunction, in one or more embodiments of the present invention, micro-trends can be applied to other metrics such as the number of requests, response time, and accesses for a particular computing resource.
Further, the method includes determining a normal baseline for each bucket metric at block 408. For example, 15 consecutive minutes comprising high-frequency intervals are analyzed to determine a normal baseline for each bucket metric (e.g., mean). Because the buckets are user attribute (e.g., priority and size) based, the bucket baseline represents the baseline of all users in each bucket. Then for every bucket metric, an analyst can identify baseline deviation periods (e.g., one or more consecutive intervals deviating by at least a standardized threshold such as 10% above or below the normal baseline) called micro-trends as shown in block 410. In a bucket, a single user or multiple users behaving differently than the others can cause a micro-trend for the bucket. For every micro-trend (baseline deviation period), the analyst locates a single point peak or valley in block 412 and correlates peaks and valleys across micro-trends in block 414. Peak and valley micro-trend correlation locates other micro-trends experiencing peaks and valleys at the same time. For each micro-trend peak and valley, an analyst can identify workload interdependencies and interactions with cause and victim users being impacted at the time of the problem in block 416.
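The baseline deviation and peak/valley logic of blocks 410-412 can be sketched as follows. This is an illustrative sketch under stated assumptions: the metric is represented as one value per high-frequency interval, the baseline is supplied by the caller (e.g., the mean from block 408), and the function name and 10% default threshold mirror the example above rather than any required implementation.

```python
def find_micro_trends(series, baseline, threshold=0.10):
    """Locate baseline deviation periods (micro-trends) in a bucket metric.

    `series` holds one metric value per high-frequency interval. A
    micro-trend is a run of consecutive intervals deviating at least
    `threshold` (e.g., 10%) above or below `baseline` (block 410).
    Each micro-trend is returned with its single-point peak or valley:
    the interval deviating most from the baseline (block 412).
    """
    trends, run = [], []
    for i, value in enumerate(series):
        if abs(value - baseline) >= threshold * baseline:
            run.append(i)                 # interval is part of a deviation period
        elif run:
            trends.append(run)            # deviation period ended
            run = []
    if run:
        trends.append(run)
    # Pair each micro-trend with the index of its peak or valley.
    return [(t, max(t, key=lambda i: abs(series[i] - baseline))) for t in trends]
```

Running this per bucket metric, then matching the returned peak/valley indices across buckets, corresponds to the temporal correlation of block 414.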
With micro-trends, a performance analyst can identify a set of users, workloads, and activities across the stack that are impacted during baseline deviation periods. With the impacted set of users, workloads, and activities, a performance analyst can focus on a deeper analysis of the impacted areas and ignore the unimpacted areas. Micro-trends improve the productivity of performance analysts greatly.
In one or more examples, the performance manager 116 may act based on micro-trends, such as allocating computer resources from the system 102 in a different manner to avoid anomalies for a single user or a bucket of users. For example, subsequent similar workload requests from that user may receive additional computer resources, such as memory, processor time, and the like. For example, consider that the performance manager 116 generates data at a 5-second periodic interval. In some embodiments of the invention, the performance manager 116 detects each local bucket's exceptional user for a 5-second interval, and then acts upon that user to ‘spotlight’ that user for a subsequent 5-second interval.
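One possible sketch of this spotlight scheme follows. It assumes a simple per-interval `{user: metric}` representation; the function name and data shapes are illustrative, not part of the claimed method.

```python
def spotlight_schedule(intervals):
    """For each 5-second interval, detect the exceptional (worst) user
    and 'spotlight' that user for the subsequent interval.

    `intervals` is a list of {user: metric} dicts, one per 5-second
    period. Returns, per interval, the set of users spotlighted during
    it, i.e., the worst offenders detected in the previous interval.
    """
    spotlighted = [set()]                      # nothing is spotlighted in the first interval
    for metrics in intervals[:-1]:
        worst = max(metrics, key=metrics.get)  # exceptional user this interval
        spotlighted.append({worst})            # tracked in detail next interval
    return spotlighted
```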
The performance manager 116 may use a micro-trend feedback loop to assess the action taken. Micro-trends are detected at a higher level by performing an analysis against multiple 5-second periods. The micro-trend feedback loop occurs at the ‘broad view’ higher-level analysis (i.e., a longer interval), after a 5-second point has been determined to be an anomaly. At this point, additional ‘spotlight’ actions may be taken beyond those described herein. As noted elsewhere herein, there can be at least two forms of exceptionalism: 1) the worst offender for a bucket within a 5-second point; and 2) the anomalous highest-peak 5-second point (micro-trend) across multiple 5-second points.
In other examples, when resource use for a single user or a bucket of users has micro-trends deviating from the baseline, the performance manager 116 can request the system 102 to allocate the resources in a different manner, particularly for users 150 identified to cause the anomaly in performance.
Accordingly, human-consumable high-frequency (e.g., greater than one second) generation of micro-trend data that includes context-rich and concise activity metrics (e.g., requests, response times) exhibits patterns over multiple intervals. These patterns can in turn be used to identify workload performance problem(s) and particularly, as described above, the specific user attributes, specific workloads, or specific activities and resources impacting and/or contributing to a performance problem. Micro-trends are baseline deviation periods. For each micro-trend, activity metric peaks and valleys focus performance analysts on which components, activities, and resources are significant factors in the ailing workload.
In accordance with
Furthermore, exceptional consumer activity entries are condensed and summarized into buckets as totals, averages, and worst-offending consumer(s) with corresponding activity metrics. These design points reduce noise and ensure concise and context-rich data, which lowers the CPU, memory, and storage costs.
Further, in conjunction, the method includes identifying and correlating micro-trends to map a consumed resource to consumer(s) at block 512 using techniques described herein (
Accordingly, one or more embodiments of the present invention are rooted in computing technology, particularly defining a workload performance problem in a computing system where a consumed resource to consumer combination is a significant contributor to the problem. One or more embodiments of the present invention further improve existing solutions in this regard by improving performance and by reducing CPU cost (CPU usage) and the amount of data instrumented, stored, and analyzed. In turn, the workload performance problem can be diagnosed faster compared to existing solutions.
One or more embodiments of the present invention provide such advantages through micro-trend correlation that maps consumed resource peaks to worst offending consumer activity peaks to reveal which resources are being heavily used and which consumers are driving the usage. The worst offending consumer can be a bucket (e.g., a collection of consumers) or the single worst offending consumer in the bucket. Now, a performance analyst has first failure data capture that can detect transient differences in consumed resource use and worst offending consumers between baseline and baseline deviation periods. In this manner, the performance analyst receives the right data to discover consumed resources to consumer relationships and at significantly lower costs to CPU, memory, and disk.
With every component in the system 102 recording the results as noted above, any component across the hardware or software stack can generate context-rich and concise data and use micro-trends to facilitate finding the consumed resource to consumer relationships across the stack.
Accordingly, one or more embodiments of the present invention facilitate time-synchronized, high-frequency, cross-stack data generation required to create micro-trends. Micro-trends facilitate an analyst to quickly investigate component data to identify normal and anomalous activity and determine the workload context, in turn significantly decreasing the time required to define a performance problem.
Smarter data generation facilitates detecting ripple effects in component performance by facilitating the determination of the component baseline and uncovering baseline deviations called micro-trends. Micro-trends reveal never-before-seen component ripple effects. Micro-trends emerge from generating context-rich, low-overhead, and concise component activity records on a human-consumable, high-frequency, synchronized interval (e.g., greater than one second). Smarter data generation yields key component vital signs that enable establishing the component's normal baseline and identifying baseline deviation periods called micro-trends (e.g., one or more sequential high-frequency intervals deviating 10% above or below the baseline). Every micro-trend contains a peak or valley representing the interval deviating most from the baseline. Micro-trend peak and valley correlations reveal cause-and-effect ripples across components and resources. Micro-trends make subtle component ripple effects of transient duration (e.g., seconds) detectable.
Further, low overheads in accumulating and collecting the metrics used for micro-trend data generation facilitate generating synchronized always-on cross-stack micro-trends that capture the arrival pattern effects on the workload context. Always-on micro-trends cast a wide net to catch ripple effects across the entire workload. They ensure performance first failure data capture is available whenever a performance problem is detected.
Micro-trends lower the expertise needed to detect and diagnose performance impacts. With micro-trends, performance teams can detect cause-and-effect relationships between workload components. Micro-trends improve triage and define areas of focus by exonerating unaffected components and resources, implicating the affected components and resources, and engaging the right experts.
Further, system availability improves with micro-trends. Micro-trends provide insights into problem areas before the problem causes outages. Experts can recommend configuration and/or tuning changes so that the system operation can be stabilized and the workload performance problem mitigated. An analyst can use micro-trends to assess whether an implemented configuration and/or tuning change had the intended effect without unintended consequences.
Further, micro-trends further improve solution quality because they provide a continuous feedback loop. For example, development teams can use micro-trends to make better design decisions and receive timely feedback by measuring the impacts within and across components. Development teams can foster performance improving conditions and avoid performance degrading conditions. Further yet, test teams can use micro-trends to validate that an intended scenario was driven and measure the desired results were achieved. Micro-trends also improve automation. As described herein, systems can automatically tune or configure a computer server, or an operating system, based on micro-trends. Further yet, in one or more examples, the system or an analyst can use micro-trends to assess whether a configuration change was a step in the right direction to commit or a step in the wrong direction to undo.
Further, one or more embodiments of the present invention facilitate generating smarter data input to reduce the cost and improve the speed of machine learning. Machine learning builds a model that represents input training data. Building a model requires cleansing and evaluating the training data to consider the relevant data and ignore the noise. Then, the resulting model scores input test data that has a mixture of normal and anomalous data. Comparing the model results with the expected test data results produces a model accuracy percent. With micro-trend data generation changes, higher frequency machine-consumable, fine-grained micro-trends can reduce machine learning training and scoring costs while maintaining model accuracy. One or more embodiments of the present invention, accordingly, provide a practical application for generating micro-trend diagnostic data that can be used to build a machine learning model which can score traditional mainline data or other micro-trend diagnostic data.
The machine-consumable micro-trend data generation for machine learning builds on top of human-consumable micro-trend data generation. Both generate synchronized, structured, context-rich data at an acceptable CPU cost. Human-consumable micro-trend data generation has to avoid overwhelming or tiring the analyst, but that is not a concern for machine-consumable micro-trend data generation. As a result, machine-consumable micro-trend data generation collects additional buckets via new/additional bucket attributes (e.g., new z/OS job sizes of extra-large and extra-small) that distribute the workload across more buckets and yield fewer users/consumers in each bucket. Furthermore, with machine-consumable micro-trend data generation, each bucket includes its non-exceptional users/consumers in the summary activity and captures its exceptional activity, such as the top n most significant users/consumers. Also, machine-consumable micro-trend data generation occurs more frequently than human-consumable micro-trend data generation. Machine learning requires higher-frequency and fine-grained micro-trend data generation to build a representative model while maintaining model accuracy.
The cost-effectiveness and speed of machine learning training improve with machine-consumable micro-trend data generation. Machine-consumable micro-trend data generation produces synchronized, structured, context-rich data that contains both summary and exceptional activity. Machine-consumable micro-trend data generation reduces and refines the data to keep important summaries and exceptional content and removes noise. This content enables machine learning training to choose from only the most valuable data. Machine learning training using machine-consumable micro-trend data input has significantly fewer data to evaluate, which results in fewer model iterations to differentiate important data from noise. As a result, machine-consumable micro-trends deliver lower data generation and model training costs while maintaining model accuracy.
Machine learning scoring also benefits from machine-consumable micro-trend data generation. Machine-consumable micro-trend data generation enables a new form of scoring that can be done regularly during the higher frequency machine-consumable interval. Micro-trend summary context enables scoring to better assess whether test data is normal or anomalous based on the summary and exceptional activity. Furthermore, all machine learning scoring benefits from micro-trend data generation correlations between workload component interactions and consumer to consumed resource cause and victim peers.
Smarter data generation can significantly improve machine learning training. By reconfiguring human-consumable micro-trend data generation into machine-consumable micro-trend data generation, machine learning training can improve model building cost and speed while maintaining model accuracy. Generating machine-consumable micro-trends requires a large number of fine-grained buckets, the top n most significant users/consumers, and more frequent data generation (e.g., less than one second).
According to one or more embodiments, a computer-implemented method for diagnosing workload performance problems in computer servers includes measuring activity metrics and aggregating lower-level activity metrics into higher-level user constructs for each user. The method further includes generating condensed diagnostic data for identifying workload performance problems on a synchronized, regular interval. Generating diagnostic data includes grouping users into buckets based on the bucket and user attributes, aggregating user activity metrics across all users in each bucket, including one or more most significant user(s) and corresponding user activity metrics for each activity in each bucket, and recording bucket contents. The method includes generating high-level, condensed diagnostic data at a human-consumable analysis interval and analyzing recorded bucket contents to facilitate determining a baseline and baseline deviation periods, identifying a peak or valley for every baseline deviation, and correlating peaks and valleys temporally to identify cause and victim interdependencies and relationships between buckets, most significant users, and activities. This method also includes generating high-level, condensed diagnostic data at the machine-consumable interval to train a machine learning model with lower data generation and model training costs while maintaining model accuracy. The resulting model can be used to score new condensed diagnostic data or traditional mainline data. In one or more examples, the method further includes analyzing bucket contents at an analysis interval to identify buckets and users synchronously deviating from normal.
According to one or more embodiments, a computer program product includes a memory device with computer-executable instructions therein, the instructions, when executed by a processing unit, perform a method of diagnosing workload performance problems in computer servers. The method includes measuring activity metrics and aggregating lower-level activity metrics into higher-level user constructs for each user. The method further includes generating condensed diagnostic data for identifying workload performance problems on a synchronized, regular interval. Generating diagnostic data includes grouping users into buckets based on the bucket and user attributes, aggregating user activity metrics across all users in each bucket, including one or more most significant user(s) and corresponding user activity metrics for each activity in each bucket, and recording bucket contents. The method includes generating high-level, condensed diagnostic data at a human-consumable analysis interval. Further, the method includes analyzing recorded bucket contents to facilitate determining baseline and baseline deviation periods, identifying a peak or valley for every baseline deviation, and correlating the peaks and valleys temporally to identify cause and victim interdependencies and relationships between buckets, most significant users, and activities. This method also includes generating high-level, condensed diagnostic data at the machine-consumable interval to train a machine learning model with lower data generation and model training costs while maintaining model accuracy. The resulting model can be used to score new condensed diagnostic data or traditional mainline data. In one or more examples, the method further includes analyzing bucket contents at an analysis interval to identify buckets and users synchronously deviating from normal.
According to one or more embodiments, a system includes a memory and a processor coupled to the memory; the processor performs a method of diagnosing workload performance problems in the system. The method includes measuring activity metrics and aggregating lower-level activity metrics into higher-level user constructs for each user. The method further includes generating condensed diagnostic data for identifying workload performance problems on a synchronized, regular interval. Generating diagnostic data includes grouping users into buckets based on the bucket and user attributes, aggregating user activity metrics across all users in each bucket, including one or more most significant user(s) and corresponding user activity metrics for each activity in each bucket, and recording bucket contents. The method includes generating high-level, condensed diagnostic data at a human-consumable analysis interval and analyzing recorded bucket contents to determine a baseline and baseline deviation periods, identify a peak or valley for every baseline deviation and correlate peaks and valleys temporally to identify cause and victim interdependencies and relationships between buckets, most significant users, and activities. This method also includes generating high-level, condensed diagnostic data at the machine-consumable interval to train a machine learning model with lower data generation and model training costs while maintaining model accuracy. The resulting model can be used to score new condensed diagnostic data or traditional mainline data. In one or more examples, the method further includes analyzing bucket contents at an analysis interval to identify buckets and users synchronously deviating from normal.
In one or more embodiments, diagnostic data can be generated in a human-consumable form for human analysis or in a machine-consumable form for machine analysis through machine learning.
Embodiments of the present invention use the aggregated data to further improve anomaly detection. For example, the data collection/aggregation can implicitly group the data by data set name, source, consumer, etc. Alternatively, or in addition, the data may be aggregated using data set activity and access methods (extending the “cube” of priority/size/cp-type).
Embodiments of the present invention can further perform anomaly detection based on data that has been aggregated over multiple activities per group. For example, data set “access patterns” are aggregated by group (e.g., #bytes read, #bytes written, capturing the jobs with the most activity, etc.). Such aggregation enables each group to represent a logical “view” of data set activity. Embodiments of the present invention facilitate reducing the volumes of instrumentation data by aggregating at the group level.
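Such group-level aggregation of access patterns can be sketched as follows. This is a minimal illustration: the tuple layout, the `group_of` mapping, and the field names are assumptions made for the sketch, not the claimed record format.

```python
def aggregate_access_patterns(accesses, group_of):
    """Aggregate data set access metrics at the resource group level.

    `accesses` is a list of (job, data_set, bytes_read, bytes_written)
    tuples; `group_of` maps a data set name to its resource group.
    Each group records total bytes read/written and the job with the
    most activity, so a group represents a logical 'view' of activity.
    """
    groups = {}
    for job, data_set, r, w in accesses:
        g = groups.setdefault(group_of(data_set),
                              {"bytes_read": 0, "bytes_written": 0,
                               "per_job": {}, "top_job": None})
        g["bytes_read"] += r                                # group totals
        g["bytes_written"] += w
        g["per_job"][job] = g["per_job"].get(job, 0) + r + w
        g["top_job"] = max(g["per_job"], key=g["per_job"].get)  # most active job
    return groups
```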
Further yet, in one or more embodiments of the present invention, analytics are embedded in the data generation itself. Exceptional activity in a group, for each activity, is then hyper-correlated to the data sets being accessed by one or more consumers. Embodiments of the present invention can provide inline data correlation. Additionally, embedding the analytics during data generation itself optimizes the anomaly detection process by eliminating the need to track each resource (every address space, every data set operation, every consumer, etc.). It should be noted that, without such embedded analytics, the number of factors that the anomaly detection must monitor is the product of the counts of each resource (e.g., X producers*Y consumers). With embodiments of the present invention, only the exceptional job details need to be tracked.
In some embodiments of the present invention, a historical reference database (metric library 130) is used to capture enriched behavioral signatures. The signatures can be analyzed using heuristic, algorithmic, and statistical modeling to determine anomalous activity.
Anomaly detection in data access patterns includes identifying a resource 750, such as a digital asset (e.g., file, folder, data in a database, financial information, login credential, electronic medical record, images/video, etc.), and an offender (person/people and/or machine(s) being used) accessing the resource in an atypical/unusual manner. For optimal anomaly detection, both the resource and the offender have to be identified. Existing techniques for anomaly detection are based on analyzing all permutations of users 745 and resources 750 (X users*Y resources). Hence, the existing techniques do not scale (i.e., are not cost effective) to environments such as mainframes (e.g., Z/OS® based systems), where the number of users and the number of resources 750 are both high (in the millions). Technical solutions described herein address such technical challenges by embedding analytics into the data generation.
For example, the method 700 facilitates summarizing data into substantive ‘micro-averages’ for collections of resources, at a frequent standardized periodicity. While calculating these activity averages (e.g., read/write operations) at every period (e.g., 5 seconds, 10 seconds, etc.), the groups are enriched by identifying the single worst offender for each key activity, within each collection, and in every period. Further, the embedded analytics data generation is then extended such that an offender identified during a period T0 is spotlighted in the next period (e.g., period T1), to generate data only for the identified offender (e.g., track the files accessed by the identified offender). Accordingly, offending users 745 and the resources 750 (e.g., digital assets) that are offended are identified every two periods in time (T0 and T1 in the above example). The above technique can be referred to as ‘local’ subsequent data generation actions based upon an earlier period's data.
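The two-period T0/T1 scheme can be sketched as follows. The data shapes are assumptions for illustration: `period0` maps each collection to its per-user activity counts, and `period1` is the stream of (user, file) accesses observed during the next period.

```python
def two_period_capture(period0, period1):
    """Sketch of the T0/T1 embedded-analytics scheme.

    In period T0, each collection's single worst offender is
    identified. In period T1, detailed per-file data is generated
    only for those spotlighted offenders.
    """
    # T0: the single worst offender per collection.
    offenders = {max(users, key=users.get) for users in period0.values()}
    # T1: track file accesses only for spotlighted offenders.
    tracked = {}
    for user, filename in period1:
        if user in offenders:
            tracked.setdefault(user, set()).add(filename)
    return offenders, tracked
```

Together, the two periods tie an offending user to the specific resources being offended without tracking every user against every resource.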
Such high-frequency summary data is generated for several consecutive periods (e.g., T0-Tn, where n is an integer). In some aspects, the high-frequency summary is captured continuously, and a predetermined number of the most recent time periods are subsequently consumed/analyzed by a near-time Inspector, and the most recent offender values are compared to historical norms to determine if they are anomalous over time. Once an activity is identified as anomalous, this ‘broad view’ anomaly indicator is fed back into subsequent data generation, to enable even more data to be generated for this anomalous, exceptional offender (the particular user only). Accordingly, the analytics subsequently necessary for the Inspector to accurately identify anomalous behavior are significantly reduced because of the data embedded for particular users at generation itself.
The method 700 is now described with file intrusion detection in a mainframe environment (e.g., Z/OS®) as an example, although the hyper-correlate-based data embedding and generating techniques described herein can be applied to a wide range of computer-based and other applications.
At block 702, the files are grouped into resource groups (e.g., 64 resource groups). For example, a resource group can be based on a disk on which the files are physically stored. Alternatively, or in addition, the resource group can be based on file permissions (e.g., read-access, write-access, etc.). Any other parameters associated with the files can be used for grouping the files into resource groups. In some cases, the number of resource groups is predetermined. In other applications, other resources 750 (instead of files) are grouped. For example, in an autonomous vehicle environment, vehicles, and/or sensors from which metrics are being captured are grouped.
At block 704, the access to the files is monitored, and metrics associated with the access are captured. When a consumer (in this case, an address space) accesses a resource (in this case, acts on a file), metrics for the resource are generated and stored. In some embodiments of the present invention, the metrics are stored by hashing on the resource name (i.e., data set name/filename). It is understood that other types of classification can be used instead of hashing in other embodiments of the present invention. In the case of other applications, the metrics associated with access of the other resources 750 are captured.
Consider that at time T, Fred 752A accesses a data set, say data set1, from the data set 750. Also, at time T, Joe 752B accesses a data set (same or different from Fred 752A). Each access is hashed into a respective group corresponding to the data set that is accessed, i.e., metrics associated with the access are routed to be recorded into the respective group.
The hash value is split into a resource group index and a semi-unique identifier. The semi-unique identifier's precision is based on the number of bytes allocated for it. For instance, a 1-byte identifier taken as part of the hash value allows a 1-1 correspondence to a 256-bit bitmap (as a 1-byte field can capture 256 states), and a high-quality hash will uniformly distribute across those 256 bits. This bitmap can then be used to estimate how many data sets are accessed within a bucket by a job or a user within a predetermined interval, as the value of the identifier can be mapped to a bit that is logically ORed with the rest of the bits in the bitmap. At the end of the predetermined interval, the bitmap can be queried to see the distribution of data sets accessed within the interval.
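The hash split and bitmap update can be sketched as follows. The choice of SHA-256 and of which digest bytes carry the group index and identifier are assumptions for illustration; the 64-group count follows the example in block 702.

```python
import hashlib

NUM_GROUPS = 64  # e.g., 64 resource groups, per block 702

def record_access(bitmaps, data_set_name):
    """Split a hash of the data set name into a resource group index
    and a 1-byte semi-unique identifier, then OR the identifier's bit
    into the group's 256-bit bitmap.
    """
    digest = hashlib.sha256(data_set_name.encode()).digest()
    group = digest[0] % NUM_GROUPS   # resource group index
    ident = digest[1]                # 1-byte semi-unique identifier (0-255)
    bitmaps[group] = bitmaps.get(group, 0) | (1 << ident)
    return group

def distinct_estimate(bitmaps, group):
    """Estimate distinct data sets accessed in a group this interval:
    the population count of the group's bitmap."""
    return bin(bitmaps.get(group, 0)).count("1")
```

Because the same name always hashes to the same bit, repeated accesses to one data set do not inflate the estimate; distinct names (absent collisions) set distinct bits.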
At block 706, the group index is used to record “activities” performed on the resources (data sets 750) by the consumers (users 752) at the group level. For example, in the example scenario of
At block 708, a relative distribution is computed for each group to represent how that resource group is being used by the consumer (e.g., consumer writing to a particular file in the group or to all files). The relative distribution can be stored as a bitmap in one or more embodiments of the present invention.
The operations depicted in blocks 702 to 708 are continuously performed at a predetermined frequency, for example, at every 10-second interval, every 20-second interval, etc.
At block 710, the metrics captured during each data collection interval are used to update a view 760. The aggregation can be performed at a different frequency than the time interval at which the grouping is performed (702 to 708). For example, an interval may be 5 seconds, and the aggregation of the data captured during each interval may be performed every 20 seconds. It is understood that the interval and aggregation period can be of durations different from the above examples.
The view 760 aggregates several metrics across the various users 752 and the groups 754. For example, the view 760 can aggregate metrics such as total bytes read, total bytes written, maximum reads (in a group), maximum writes (in a group), and exceptional candidates in the file intrusion scenario. Techniques described herein can be used to compute such aggregation and identify exceptional (i.e., offending) candidates. (see
At block 712, specific data sources are identified to capture and embed analytics. For example, for those users 752 that are identified as potentially offending one or more resource groups, the system can note exceptionalism and track individual data set access in the group on the next time interval (for as long as exceptionalism is noted). In the ongoing example scenario, where the time interval was T to T+5, the next time interval was T+5 to T+10. Tracking individual exceptionalism includes embedding analytics at the source of the data in one or more embodiments of the present invention. Alternatively, or in addition, for a resource 750 (e.g., digital asset) that is identified as being potentially offended, the system can note exceptionalism and track data associated with that resource 750 only. For example, access information of a file, sensor information of a vehicle, or of a component, etc.
For example, in the example of
In this manner, only a specific set of metrics is accumulated and analyzed for detecting anomalies, rather than monitoring each activity being performed. Additionally, certain extra metrics can be captured only for the identified potential anomaly. It should be noted that the specific set of metrics captured for the identified exceptionalism (offender/offended) (at block 712) is in addition to the continuous capture, grouping, and aggregation of metrics performed (blocks 702 to 708).
The specific metrics captured for detecting anomalies (block 712) are analyzed to determine if an anomalous behavior exists at block 714. Several techniques can be used to detect an anomaly in the captured data.
In some embodiments of the present invention, an entropy calculation can be used to detect the anomaly in the data. In an entropy-based approach, “chunks,” “windows,” or portions of input data are analyzed.
In the entropy-based approach, the number of unique historical distributions required, P_hist, is empirically small, e.g., ten over 100,000 samples. Performing anomaly detection in this manner has an overhead that is linear in the number of variables. The entropy-based anomaly detection can be implemented using machine learning, with unsupervised learning performed online and continuously. Such an approach can be used for a stream of values for a single variable. Embodiments of the present invention can improve the entropy-based approach in several ways. For example, using the methods herein, anomalies can be detected across multiple variables, still in linear time. The anomalies can be detected both in individual variables and in correlations between variables. Embodiments of the present invention further facilitate a dynamic range for each variable. Further, a unified, seamless approach to missing variables can be used. In some embodiments of the present invention, spectral frequency analysis can also be performed based on entropy-based anomaly detection.
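One way the window-entropy idea can be realized is sketched below. The window size, tolerance, and history cap are illustrative parameters, not values from the specification; a window whose Shannon entropy matches no previously recorded reference entropy is flagged, and the small reference set plays the role of P_hist.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the value distribution in one window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def detect_entropy_anomalies(stream, window_size=8, tolerance=0.5, max_hist=10):
    """Flag window start offsets whose entropy matches no reference entropy.
    The reference set (analogous to P_hist) stays small in practice."""
    hist = []
    anomalies = []
    for start in range(0, len(stream) - window_size + 1, window_size):
        h = window_entropy(stream[start:start + window_size])
        if any(abs(h - ref) <= tolerance for ref in hist):
            continue
        if hist:  # the first window only seeds the history
            anomalies.append(start)
        if len(hist) < max_hist:
            hist.append(h)
    return anomalies
```

Because each new window is compared against only the small reference set, the per-window cost is effectively constant, which is what keeps the overall overhead linear in the amount of data (and, applied per variable, linear in the number of variables).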
While entropy-based anomaly detection is described, it is understood that in one or more embodiments of the present invention, any other anomaly detection algorithm can be used to analyze the data that is specifically generated for the identified data sources. For example, Mahalanobis distances after dimension reduction, autoencoders, spectral density calculation, determining a deviation from the mean, or any other such anomaly detection algorithm can be used; these are not described in detail here.
In one or more embodiments of the present invention, at block 716, if the specific aggregated data is no longer indicative of the symptoms for categorizing the user 752 and/or the resource 750 as an outlier (offending), the specific metrics for the identified potential offending user (e.g., Joe 752B) are no longer monitored. Alternatively, or in addition, in one or more embodiments of the present invention, at block 718, if the specific aggregated data continues to indicate that the user 752 and/or the resource 750 is an outlier (offending) for more than a predetermined duration (e.g., a number of time intervals), the user 752B is identified as an offender. In this case, notifications are sent to specific personnel to identify a potential breach and/or anomalous activity so that an alert response can be performed. In some cases, the operating system may be shut down and/or reconfigured to a more protected mode to prevent further anomalous activity. Alternatively, or in addition, the offending user (e.g., Joe 752B) may be prohibited from accessing any content by the operating system in one or more embodiments of the present invention.
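The clear-or-escalate logic of blocks 716 and 718 can be summarized in a small state tracker (an illustrative sketch with hypothetical names; the three-interval confirmation duration is an assumed example value):

```python
class OffenderMonitor:
    """Drop the spotlight when symptoms clear (block 716); escalate to a
    confirmed offender after `max_intervals` consecutive flags (block 718)."""

    def __init__(self, max_intervals=3):
        self.max_intervals = max_intervals
        self.consecutive = {}

    def observe(self, user, still_anomalous):
        if not still_anomalous:
            self.consecutive.pop(user, None)   # stop specific monitoring
            return "cleared"
        self.consecutive[user] = self.consecutive.get(user, 0) + 1
        if self.consecutive[user] >= self.max_intervals:
            return "offender"                  # trigger alert / restrict access
        return "watching"
```

Requiring several consecutive flagged intervals before confirming an offender reduces false alarms from a single transient spike while keeping detection latency bounded by the interval length.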
While data set access is used as an example to describe anomaly detection, it is understood that one or more embodiments of the present invention can be used for other types of anomaly detection. For example, embodiments of the present invention can facilitate building/generating behavioral signatures from data over a long duration (e.g., days). Such behavioral signatures can be dynamically detected based on the reoccurrence of signatures. Embodiments of the present invention can also be used to build a historical reference database for proactive analysis. Further, one or more embodiments of the present invention can facilitate creating new algorithms to recognize exceptional intrusion detection activity in near-real-time. In some embodiments of the present invention, continuous learning can be performed from the client environment over time.
In one or more embodiments, detecting anomalies in computing systems includes measuring activity metrics associated with access of the resources (e.g., digital assets), the resources being accessed by several users. The lower-level activity metrics are aggregated into higher-level user constructs for each resource. Further, condensed diagnostic data is generated. The condensed diagnostic data can be generated on a synchronized, regular interval. Alternatively, the condensed diagnostic data is generated using a dynamic time interval. Generating the condensed diagnostic data includes grouping the resources into buckets based on bucket and resource attributes. The activity metrics are aggregated across all resources in each bucket, and from the aggregated data, one or more most significant (i.e., exceptional) resources and corresponding activity metrics are identified for each activity in each bucket.
In addition to generating condensed diagnostic data, the buckets are used to identify exceptional users for each bucket. 'Spotlights' are put on the exceptional users for these buckets by acting upon them to capture their non-aggregated resource usage at the next time interval. The continuous aggregation of activity metrics for the next time interval defines the exceptional user spotlights for the subsequent time interval, and so on. For example, in the context of file access, a particular file experiencing an anomalous access can be 'spotlighted.' Alternatively, a user that accesses files anomalously can be 'spotlighted.' Once identified in this manner, the extra activity metrics for the spotlighted item can be captured, which can include file pointers identifying the location within the file that was accessed, the time of day at which the file was accessed, the IP address or other identification-related information associated with the file access, etc.
Embodiments of the present invention facilitate identifying file access patterns as exception data intrusion candidates based on the file access significantly deviating from typical file accesses, as described herein. The detection is performed on-platform (i.e., the data does not leave the operating system). In one or more embodiments of the present invention, when an offender is detected (718), an alert is generated. For example, when the number of times a particular user is detected exhibiting a certain anomalous behavior (e.g., anomalous file access) exceeds a threshold (predetermined or learned), an alert can be generated. In some embodiments of the present invention, the offending user is prevented from further access to the system until s/he is reauthorized for such access.
One or more embodiments of the present invention facilitate using a computing device to detect anomalous activity in computer server environments having multiple customers. For example, a computer-implemented method can include determining by the computing device one or more highest accessed resources accessed by each customer in the computer server environment. Further, the method includes tracking by the computing device one or more most frequent activities in the highest accessed resources of the computer server environment. Further, the method includes determining by the computing device one or more offending customers of the plurality of customers based upon the highest accessed resources and the one or more most frequent activities in the highest accessed resources. Further, the method includes detecting all anomalous activity associated with one or more offending customers accessing any resource associated with the computer server environment over a period of time.
One or more embodiments of the present invention improve anomaly detection by curating the quality of the input data that is used to detect anomalies. Further, in addition to filtering the data that is being generated to improve the input data that is analyzed for detecting anomalies, embodiments of the present invention facilitate embellishing the data being generated for specific data sources that are identified as potential offenders. Such embellishing can include embedding certain analytics in the data generated and captured for the potential offenders.
It should be noted that the technical solutions described herein are not limited to detecting anomalies in computer servers, mainframes, and anomalous file access patterns. The technical solutions herein are applicable in any technical area in which data streams are generated and analyzed. For example, autonomous vehicle decision making uses concentrated high-value data to cost-effectively analyze a large number of variables (hundreds or thousands) to determine an action to be taken by the autonomous vehicle, for example, whether the vehicle can change a lane, turn left, accelerate, etc.
The one or more data streams can be received by the processors 802 from other vehicles and from computer servers (e.g., a weather service, traffic service, navigation service, etc.). All of these data stream sources can be the data generators 820. The data streams can be received via the communication module 804. In some examples, the data streams can also include measurements captured by the sensors 808. The sensors 808 can include lidars, radars, pressure sensors, etc. The data streams can include information that has to be analyzed by the processors 802 to determine one or more actions to be performed by the vehicle 800. The processors 802 send one or more commands to the actuators 808 to perform the actions based on the decision making process.
During such decision making, the processors 802 may have to analyze a large number of data streams, some (or most) of which may include low-value detail data, resulting in increased processor costs to both generate and analyze that data. Another critical cost is the bandwidth necessary for the vehicles 800 to communicate both among each other and with any centralized agent, such as a central controller server (not shown). Using embodiments of the technical solutions described herein, hyper-correlate techniques are used to generate both "micro-averages" and detailed data only for locally detected "exceptional" conditions by the processors 802 of the vehicle 800. Accordingly, significantly reduced compute resources can be used for the decision making analytics. The data concentration also significantly reduces the communication costs/delays associated with sharing the data. Accordingly, embodiments of the present invention facilitate building accurate sensor technology into the mainline path operations being performed, to effectively share the data analysis responsibilities between the data producers and the analytics engines (e.g., machine learning models). Based on the analysis, in one or more examples, the central agent, or a peer (e.g., another vehicle), can request that additional information be generated as part of feedback provided to the data generator. The costs saved by not producing and consuming low-value data using the technical solutions herein facilitate cost-effective, timely decision making, which is critical to delivering autonomous vehicle decision making and other such critical applications.
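The "micro-average plus exceptional detail" concentration can be illustrated with a small sketch (the range-based trigger and field names are hypothetical choices for illustration): each interval ships a compact summary, and the raw samples are attached only when the interval itself looks exceptional.

```python
def concentrate(samples, detail_threshold):
    """Emit a compact 'micro-average' per interval; attach raw detail only
    when the interval looks exceptional, to cut compute and bandwidth."""
    avg = sum(samples) / len(samples)
    payload = {"avg": round(avg, 3), "n": len(samples)}
    if max(samples) - min(samples) > detail_threshold:
        payload["detail"] = list(samples)   # exceptional: include raw data
    return payload
```

In the common (non-exceptional) case, the payload is a fixed-size summary regardless of how many raw samples were taken, which is where the bandwidth savings for vehicle-to-vehicle and vehicle-to-server communication come from.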
Several other IoT based and cloud-computing based applications can use the technical solutions described herein, for example, factory automation, warehouse automation, air traffic control, etc.
Embodiments of the present invention provide improvements to computing technology, particularly in the areas of anomalous behavior detection. Further, one or more embodiments of the present invention provide a practical application for detecting anomalies and identifying an offender.
One or more embodiments of the present invention provide such improvements and practical applications by facilitating the dynamic generation of exceptionalism-enriched data streams for instrumentation and forensic analysis. Instead of capturing per-event operating system call information and trying to filter the captured information to reduce noise and meet NRT (Near Real-Time) detection requirements, one or more embodiments of the present invention build intelligence into the data generation phase within the operating system itself. Such improved data generation facilitates collecting only the relevant key activities and user/resource/process information that relate to exceptional resource usage within a compute environment. Accordingly, one or more embodiments of the present invention facilitate intelligent data generation in the operating system. The dynamic data (instrumentation) generation automatically learns what resources to track in a system based on exceptional access to those resources. Accordingly, embodiments of the present invention can be used not only for host data intrusion detection systems but also for optimization of information-technology (IT) systems (e.g., lock contention or high file I/O causing low performance, etc.).
One or more embodiments of the present invention use a mapping (e.g., hashing) scheme coupled with outlier analysis. Embodiments of the present invention enable precise problem determination (e.g., intrusion, low performance, correlation across distributed environments) and perform such determination with low overhead.
Turning now to
As shown in
The computer system 900 comprises an input/output (I/O) adapter 906 and a communications adapter 907 coupled to the system bus 902. The I/O adapter 906 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 908 and/or any other similar component. The I/O adapter 906 and the hard disk 908 are collectively referred to herein as a mass storage 910.
Software 911 for execution on the computer system 900 may be stored in the mass storage 910. The mass storage 910 is an example of a tangible storage medium readable by the processors 901, where the software 911 is stored as instructions for execution by the processors 901 to cause the computer system 900 to operate, such as is described hereinbelow with respect to the various Figures. Examples of the computer program product and the execution of such instructions are discussed herein in more detail. The communications adapter 907 interconnects the system bus 902 with a network 912, which may be an outside network, enabling the computer system 900 to communicate with other such systems. In one embodiment, a portion of the system memory 903 and the mass storage 910 collectively store an operating system, which may be any appropriate operating system, such as the z/OS or AIX operating system from IBM Corporation, to coordinate the functions of the various components shown in
Additional input/output devices are shown as connected to the system bus 902 via a display adapter 915 and an interface adapter 916. In one embodiment, the adapters 906, 907, 915, and 916 may be connected to one or more I/O buses that are connected to the system bus 902 via an intermediate bus bridge (not shown). A display 919 (e.g., a screen or a display monitor) is connected to the system bus 902 by the display adapter 915, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller. A keyboard 921, a mouse 922, a speaker 923, etc. can be interconnected to the system bus 902 via the interface adapter 916, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit. Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in
In some embodiments, the communications adapter 907 can transmit data using any suitable interface or protocol, such as the Internet Small Computer System Interface (iSCSI), among others. The network 912 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others. An external computing device may connect to the computer system 900 through the network 912. In some examples, the external computing device may be an external webserver or a cloud computing node.
It is to be understood that the block diagram of
It is to be understood that although this disclosure includes a detailed description on cloud computing, implementation of the teachings recited herein are not limited to a cloud computing environment. Rather, embodiments of the present invention are capable of being implemented in conjunction with any other type of computing environment now known or later developed.
Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
Characteristics are as follows:
On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.
Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Service Models are as follows:
Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
Deployment Models are as follows:
Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.
Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.
Referring now to
Referring now to
Hardware and software layer 60 includes hardware and software components. Examples of hardware components include: mainframes 61; RISC (Reduced Instruction Set Computer) architecture based servers 62; servers 63; blade servers 64; storage devices 65; and networks and networking components 66. In some embodiments, software components include network application server software 67 and database software 68.
Virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 71; virtual storage 72; virtual networks 73, including virtual private networks; virtual applications and operating systems 74; and virtual clients 75.
In one example, management layer 80 may provide the functions described below. Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides access to the cloud computing environment for consumers and system administrators. Service level management 84 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
Workloads layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics processing 94; transaction processing 95; and data generation 96.
The present invention can be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product can include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium can be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network can comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
Computer-readable program instructions for carrying out operations of the present invention can be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer-readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions can also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks can occur out of the order noted in the Figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.