With the increasing transition of software applications from on-premises to cloud based solutions, telemetry data is collected more than ever and Application Performance Management (APM) is becoming an increasingly important part of software applications' success. With the increase of APM, there is also a growing need for log analysis, especially the analysis of failures. The ability to efficiently mine failure logs may speed up and improve the analysis process, leading to improvement in the overall software quality and reduced costs.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments disclosed herein are related to systems, methods, and computer-readable media for determining patterns of related attributes in failure data that are indicative of an underlying cause of a computing operation failure. In one embodiment, a system includes a processor and a system memory. The system instantiates in the system memory an aggregation module that groups accessed data, which may be failure data, into subsets. The data is associated with attributes, which may be categorical attributes, that describe information related to the accessed data. The subsets include data having matching combinations of the attributes. The system also instantiates in the system memory an expand module that iteratively removes, for the subsets, attributes of the combination of attributes associated with each subset to increase the amount of data included in the subsets. The system also instantiates in the system memory a score module that scores each subset, after iteratively removing attributes, to determine patterns related to the combination of attributes.
In another embodiment, recorded data is received. The data is associated with attributes that describe information corresponding to the data. The data and the associated attributes are organized into a table having rows corresponding to the received data and columns corresponding to the one or more attributes. The table is reorganized into subsets of data based on a count representing an amount of the data having matching combinations of the attributes. For each of the subsets, attributes of the combination of attributes associated with each subset are iteratively removed to increase the count representing the amount of data included in each subset. After iteratively removing the attributes, each subset is scored to determine one or more patterns related to the combination of attributes.
In an additional embodiment, accessed data is grouped into one or more subsets. The data is associated with one or more attributes that describe information related to the data. The one or more subsets have matching combinations of the attributes. For each of the subsets, attributes of the combination of attributes associated with each subset are iteratively removed to thereby increase the amount of data included in the subset. After iteratively removing the attributes, each subset is scored to determine one or more patterns related to the combination of attributes.
Additional features and advantages will be set forth in the description, which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. Features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of various embodiments will be rendered by reference to the appended drawings. Understanding that these drawings depict only sample embodiments and are not therefore to be considered to be limiting of the scope of the invention, the embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
With the increasing transition of software applications from on-premises to cloud based solutions, telemetry data is collected more than ever and Application Performance Management (APM) is becoming an increasingly important part of software applications' success. With the increase of APM, there is also a growing need for log analysis, especially the analysis of failures in computing operations. The ability to efficiently mine failure logs may speed up and improve the analysis process, leading to improvement in the overall software quality and reduced costs.
Accordingly, failure analysis has become a common APM task performed in order to improve application stability and quality. Failures in computing operations may include, but are not limited to, exceptions thrown during code execution, application crashes, failed server requests, or similar events. Failure events often contain various attributes indicating various properties such as geographical data, application version, error codes, operating systems, device types, and the like.
In many embodiments, there are two common types of failure sets. The first is a two-class but highly imbalanced set: the full set of failures is a small subset of a larger set containing both success and failure records. For example, for HTTP requests the full set of records may contain mostly successful requests (where the HTTP response code is 200 or any other value <400), and only a small set of failures (where the HTTP response code is 500 or any other value >=400). The second is a pure one-class set, which contains failure records only and no non-failure records.
For the two-class problem, conventional solutions typically use supervised learning methods, i.e., building a classifier to identify the failures out of the non-failures. General classification algorithms (e.g., decision trees) work well on a balanced set, but perform poorly (if at all) on imbalanced ones. Also, these methods in general cannot operate on sets that are too small relative to the number of attributes due to over-fitting problems, which might be the common case for failure sets.
For the one-class problem, conventional solutions use clustering (unsupervised learning) methods. In general, clustering methods suffer from the following problems: (1) prior requirements such as definition of a distance function between any two records, which is hard to define and mostly irrelevant for categorical attributes, (2) clustering methods (excluding fuzzy ones) partition the set, i.e., any record belongs to a single cluster, which is problematic in the context of failure analysis where a specific failure can belong to a few different clusters or to no cluster at all, and (3) the representation of the found clusters is not intuitive or simple to filter.
Aspects of the disclosed embodiments relate to the creation and use of computing systems that find and implement high quality patterns to further investigate the full set of failures. The patterns (also referred to as segments or clusters) may be subsets of the full set of failures sharing many common categorical attributes. That is, the patterns are related to combinations of categorical attributes that are common across the subsets of the failures.
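The two-class split described above can be illustrated with a minimal Python sketch. The list of status codes is hypothetical sample data, and the helper name `is_failure` is an assumption introduced only for illustration; the only rule taken from the text is that a response code of 400 or above counts as a failure.

```python
from collections import Counter


def is_failure(status_code: int) -> bool:
    """Label a request as a failure when its HTTP response code is >= 400."""
    return status_code >= 400


# Hypothetical response codes for ten requests (illustrative data only).
status_codes = [200, 200, 204, 301, 200, 500, 200, 404, 200, 200]

# Count how many records fall into each of the two classes.
counts = Counter("failure" if is_failure(c) else "success" for c in status_codes)

# Fraction of records that are failures; a small value indicates an imbalanced set.
failure_ratio = counts["failure"] / len(status_codes)
```

In this sample only two of the ten records are failures, which is the kind of imbalance the paragraph above refers to.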
The disclosed embodiments provide a balance between an informative (i.e., contains many attributes) but not representative (i.e., too small) subset versus a representative, but not informative subset (i.e., too generic, containing a single attribute). The disclosed embodiments find the patterns and then rank them in order to expose to a user a small list of the top ranked patterns that may then be used for further exploration of the cause of the failure and/or that may hint about the root-cause of the failures.
There are various technical effects and benefits that can be achieved by implementing aspects of the disclosed embodiments. By way of example, the use of the patterns in the disclosed embodiments significantly reduces the amount of failure data that needs to be explored to determine the cause of the failure, thus reducing the computer resources needed for APM processes. In addition, the technical effects related to the disclosed embodiments can also include improved user convenience and efficiency gains through a reduction in the time it takes for the user to discover the cause of the failures.
Some introductory discussion of a computing system will be described with respect to
Computing systems are now increasingly taking a wide variety of forms. Computing systems may, for example, be handheld devices, appliances, laptop computers, desktop computers, mainframes, distributed computing systems, datacenters, or even devices that have not conventionally been considered a computing system, such as wearables (e.g., glasses). In this description and in the claims, the term “computing system” is defined broadly as including any device or system (or combination thereof) that includes at least one physical and tangible processor, and a physical and tangible memory capable of having thereon computer-executable instructions that may be executed by a processor to thereby provision the computing system for a special purpose. The memory may take any form and may depend on the nature and form of the computing system. A computing system may be distributed over a network environment and may include multiple constituent computing systems.
As illustrated in
As used herein, the term “executable module” or “executable component” is the name for a structure that is well understood to one of ordinary skill in the art in the field of computing as being a structure that can be software, hardware, or a combination thereof. For instance, when implemented in software, one of ordinary skill in the art would understand that the structure of an executable component may include software objects, routines, methods that may be executed on the computing system, whether such an executable component exists in the heap of a computing system, or whether the executable component exists on computer-readable storage media.
In such a case, one of ordinary skill in the art will recognize that the structure of the executable component exists on a computer-readable medium such that, when interpreted by one or more processors of a computing system (e.g., by a processor thread), the computing system is caused to perform a function. Such structure may be computer-readable directly by the processors (as is the case if the executable component were binary). Alternatively, the structure may be structured to be interpretable and/or compiled (whether in a single stage or in multiple stages) so as to generate such binary that is directly interpretable by the processors. Such an understanding of example structures of an executable component is well within the understanding of one of ordinary skill in the art of computing when using the term “executable component”.
The term “executable component” is also well understood by one of ordinary skill as including structures that are implemented exclusively or near-exclusively in hardware, such as within a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or any other specialized circuit. Accordingly, the term “executable component” is a term for a structure that is well understood by those of ordinary skill in the art of computing, whether implemented in software, hardware, or a combination. In this description, the terms “component”, “service”, “engine”, “module”, “controller”, “validator”, “runner”, “deployer” or the like, may also be used. As used in this description and in the case, these terms (regardless of whether the term is modified with one or more modifiers) are also intended to be synonymous with the term “executable component” or be specific types of such an “executable component”, and thus also have a structure that is well understood by those of ordinary skill in the art of computing.
In the description that follows, embodiments are described with reference to acts that are performed by one or more computing systems. If such acts are implemented in software, one or more processors of the associated computing system direct the operation of the computing system in response to having executed computer-executable instructions. For example, such computer-executable instructions may be embodied on one or more computer-readable media that form a computer program product. An example of such an operation involves the manipulation of data. The computer-executable instructions (and the manipulated data) may be stored in the memory 104 of the computing system 100.
The computer-executable instructions may be used to implement and/or instantiate all of the disclosed functionality. The computer-executable instructions may also be used to implement and/or instantiate all of the interfaces disclosed herein, including the analysis view windows and graphics.
Computing system 100 may also contain communication channels 108 that allow the computing system 100 to communicate with other message processors over, for example, network 110.
Embodiments described herein may comprise or utilize special-purpose or general-purpose computer system components that include computer hardware, such as, for example, one or more processors and system memory. The system memory may be included within the overall memory 104. The system memory may also be referred to as “main memory,” and includes memory locations that are addressable by the at least one processing unit 102 over a memory bus in which case the address location is asserted on the memory bus itself. System memory has been traditionally volatile, but the principles described herein also apply in circumstances in which the system memory is partially, or even fully, non-volatile.
Embodiments within the scope of this disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system. Computer-readable media that store computer-executable instructions and/or data structures are computer storage media. Computer-readable media that carry computer-executable instructions and/or data structures are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: computer storage media and transmission media.
Computer storage media are physical hardware storage devices that store computer-executable instructions and/or data structures. Physical hardware storage devices include computer hardware, such as RAM, ROM, EEPROM, solid state drives (“SSDs”), flash memory, phase-change memory (“PCM”), optical disk storage, magnetic disk storage or other magnetic storage devices, or any other hardware storage device(s) which can be used to store program code in the form of computer-executable instructions or data structures, which can be accessed and executed by a general-purpose or special-purpose computer system to implement the disclosed functionality of the invention.
Transmission media can include a network and/or data links which can be used to carry program code in the form of computer-executable instructions or data structures, and which can be accessed by a general-purpose or special-purpose computer system. A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer system, the computer system may view the connection as transmission media. Combinations of the above should also be included within the scope of computer-readable media.
Program code in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to computer storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer storage media at a computer system. Thus, it should be understood that computer storage media can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed at one or more processors, cause a general-purpose computer system, special-purpose computer system, or special-purpose processing device to perform a certain function or group of functions. Computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
Those skilled in the art will appreciate that the principles described herein may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. As such, in a distributed system environment, a computer system may include a plurality of constituent computer systems. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Those skilled in the art will also appreciate that the invention may be practiced in a cloud computing environment. Cloud computing environments may be distributed, although this is not required. When distributed, cloud computing environments may be distributed internationally within an organization and/or have components possessed across multiple organizations. In this description and the following claims, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services). The definition of “cloud computing” is not limited to any of the other numerous advantages that can be obtained from such a model when properly deployed.
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include: Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
When the referenced acts of the disclosed methods are implemented in software, the one or more processors 102 of the computing system 100 perform the acts and direct the operation of the computing system 100 in response to having executed the stored computer-executable instructions defined by the software. Various input and output devices such as display 112 can be used by the computing system to receive user input and to display output in accordance with the computer-executable instructions.
While not all computing systems require a user interface, in some embodiments, the computing system 100 may include a user interface for use in interfacing with a user. The user interface may include output mechanisms as well as input mechanisms. The principles described herein are not limited to the precise output mechanisms or input mechanisms as such will depend on the nature of the device. However, output mechanisms might include, for instance, speakers, displays 112, tactile output, holograms, virtual reality, and so forth. Examples of input mechanisms might include, for instance, microphones, touchscreens, holograms, cameras, keyboards, mouse or other pointer input, sensors of any type, and so forth. In accordance with the principles described herein, alerts (whether visual, audible and/or tactile) may be presented via the output mechanism.
Attention is now given to
As illustrated in
Accordingly, the embodiments and the claims disclosed herein are not limited by the type of the data 215 and their associated attributes 215a, 215b, and 215c. In one specific embodiment, the data 215 may be failure data. Since this embodiment will be described herein in most detail, the data 215 will also hereinafter be called “failure data 215” for ease of explanation. However, it will be appreciated that the description of data mining using patterns for the failure data 215 may also apply to any other type of data 215. Accordingly, one of skill in the art will appreciate after reading this specification that data mining using patterns as described herein may be performed on any type of data 215.
The embodiment in which the failure data 215 is subjected to data mining using the patterns to help analyze failures will now be explained in more detail. The failure data 215 may correspond to a table or the like that includes multiple failure records and their associated attributes 215a, 215b, and any number of additional attributes as shown by ellipses 215c. The failure data may include exceptions thrown during code execution, application crashes, failed server requests, or similar events. The failure data may also include data latencies. It will be appreciated that the embodiments disclosed herein are not limited by the type of the failure data 215.
The attributes may include information about the failure data such as geographical data, application version data, error codes, operating system and version data, device type, or the like. It will be appreciated that there may be any number of different types of attributes and that the embodiments disclosed herein are not limited by the types of attributes that are associated with the failure data 215. In some embodiments, the attributes 215a, 215b, and 215c may be categorical attributes or other types of attributes such as numerical or discrete attributes. In some embodiments, an attribute may be considered categorical if a small subset of its values covers a large portion of the failure data 215. The failure data 215 may be received from any reasonable source as circumstances warrant, for example from a database that is designed to store the failure data 215.
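The notion that an attribute is categorical when a small subset of its values covers a large portion of the data can be sketched as follows. This is only an illustrative Python sketch: the function name `is_categorical` and the thresholds `top_k` and `min_coverage` are assumptions, since the text does not state how small the subset or how large the covered portion must be, and all sample values other than "Redmond" are hypothetical.

```python
from collections import Counter


def is_categorical(values, top_k=5, min_coverage=0.8):
    """Return True when the top_k most frequent values cover at least
    min_coverage of all values. Both thresholds are illustrative assumptions."""
    if not values:
        return False
    counts = Counter(values)
    covered = sum(n for _, n in counts.most_common(top_k))
    return covered / len(values) >= min_coverage


# Hypothetical column of city values: a few values dominate the column.
cities = ["Redmond"] * 6 + ["Seattle"] * 3 + ["Boston"]

# Hypothetical column of anonymous identifiers: every value is unique.
anon_ids = [f"id{i}" for i in range(10)]

city_is_categorical = is_categorical(cities, top_k=2)     # 9 of 10 rows covered
anon_is_categorical = is_categorical(anon_ids, top_k=2)   # 2 of 10 rows covered
```

Under these assumed thresholds, a column such as City would be kept as a categorical attribute, while a column of unique identifiers would not.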
Turning now to
As also shown in
Accordingly, the failure record 310a is associated with Redmond (attribute 320 City), Windows 7 (attribute 330 OS Version), Explorer (attribute 340 Browser), 2015-05-02 11:22:01 (attribute 350 Time Stamp), and fjhda67akj (attribute 360 Anon ID). The other failure data records 310 are associated with attributes in a similar manner.
In
Returning to
The preprocessing module may also include a data aggregation module 226. In operation, the data aggregation module 226 aggregates or groups the failure data 215 into one or more subsets of the failure data 227a, 227b, or any number of additional subsets as illustrated by the ellipses 227c (hereinafter referred to as subsets 227) based on matching or shared combinations of the categorical attributes 215a, 215b, and 215c. That is, the data aggregation module 226 aggregates the failure data by grouping all the failure data records that are related by having matching combinations of specific attributes into the subsets 227. For example, in one embodiment, the data aggregation module may aggregate or group all of the failure data records 310 that include the same combination of attributes into the same subset, such as subset 227a. A count of the number or amount of the failure data records 310 in each subset 227 may also be provided as illustrated at 370.
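The grouping of records that share the same combination of attribute values, together with a count per group, can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the data aggregation module 226: the field names, the `aggregate` helper, and every sample value other than Redmond, Windows 7, and Explorer are assumptions made for the example.

```python
from collections import Counter

# Hypothetical names for the categorical columns used for grouping.
ATTRIBUTES = ("city", "os_version", "browser")


def aggregate(records, attributes=ATTRIBUTES):
    """Map each distinct combination of attribute values to the number of
    records that share that combination."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


# Illustrative failure records; only the first combination appears in the text.
records = [
    {"city": "Redmond", "os_version": "Windows 7", "browser": "Explorer"},
    {"city": "Redmond", "os_version": "Windows 7", "browser": "Explorer"},
    {"city": "Redmond", "os_version": "Windows 7", "browser": "Chrome"},
    {"city": "Seattle", "os_version": "Windows 8", "browser": "Explorer"},
    {"city": "Redmond", "os_version": "Windows 7", "browser": "Explorer"},
]

# Five input rows collapse to three distinct rows, each with a count.
subsets = aggregate(records)
```

Later steps can then iterate over the distinct combinations instead of over every input row.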
In other words, since the failure data records 310 are often characterized by highly dense regions in the data space, meaning that many rows are exact duplicates over the set of relevant columns, it may be useful to compute the aggregate table of duplicate row counts. Once this is computed, the complexity of the failure data 215 is reduced from linear in the total number of rows, to linear in the number of distinct rows, which can often be several orders of magnitude smaller. Thus, when a row of the failure data table matches the pattern under consideration, a “count” column that may be incremented may be added to the data table (see
Returning to
Returning to
Shown below is an example of pseudocode that may be implemented by the seed expand module 240 to expand the seeds in the manner described.
Although not shown, the seed expand module 240 may also perform the iterative step of removing the attribute 320 City and the iterative step of removing the attribute OS Version 330. Further, the seed expand module 240 may perform iterative steps where all but one attribute is removed to increase the count or even where all the attributes are removed. In this way, different patterns from the seed containing all attributes, then all attributes minus 1, all attributes minus 2 . . . up to a pattern containing a single attribute may be generated from each seed.
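As a rough illustration of the expansion from a seed with all attributes down to patterns with a single attribute, the following Python sketch enumerates every way of dropping attributes from a seed and recomputes the count for each resulting pattern. It is an assumption-laden sketch and not the pseudocode of the seed expand module 240: the wildcard representation, the helper names, the exhaustive enumeration order, and the sample counts are all introduced here for illustration.

```python
from itertools import combinations

WILDCARD = None  # marks an attribute that has been removed from a pattern


def pattern_count(pattern, aggregated):
    """Sum the counts of all aggregated rows that match the pattern;
    a removed attribute (WILDCARD) matches any value."""
    return sum(
        n
        for row, n in aggregated.items()
        if all(p is WILDCARD or p == v for p, v in zip(pattern, row))
    )


def expand_seed(seed, aggregated):
    """Generate patterns from the full seed down to single-attribute
    patterns, mapping each pattern to its (possibly larger) count."""
    n = len(seed)
    results = {}
    for keep in range(n, 0, -1):  # n attributes, then n - 1, ..., then 1
        for kept in combinations(range(n), keep):
            pattern = tuple(seed[i] if i in kept else WILDCARD for i in range(n))
            results[pattern] = pattern_count(pattern, aggregated)
    return results


# Hypothetical aggregated table: (City, OS Version, Browser) -> count.
aggregated = {
    ("Redmond", "Windows 7", "Explorer"): 3,
    ("Redmond", "Windows 7", "Chrome"): 2,
    ("Seattle", "Windows 8", "Explorer"): 1,
}

seed = ("Redmond", "Windows 7", "Explorer")
expanded = expand_seed(seed, aggregated)
```

In this sample, removing the Browser attribute from the seed raises the count from 3 to 5, because the pattern then also covers the rows that differ only in browser.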
Returning to
In one embodiment, the score module 250 determines the score 255 for a given pattern in view of the total number of attributes. For example, if the total number of attributes is 10 and a given pattern of combinations of attributes for an expanded subset, such as a subset 385 of
In some embodiments, the score module 250 multiplies the informative score by the size score and uses the product as the score 255. This score reflects the trade-off between the informative score and the size score. In other words, it shows how informative a pattern is given its coverage of the failure data. It will be appreciated that the score module 250 may use other scoring procedures and so the embodiments disclosed herein are not limited by how the score is determined.
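The product of an informative score and a size score can be sketched in Python as shown below. The text does not give exact formulas for the two component scores, so this sketch assumes that the informative score is the fraction of the total attributes retained by the pattern and that the size score is the fraction of the failure records covered by the pattern. The function names and the sample numbers (other than the total of 10 attributes) are likewise assumptions.

```python
def informative_score(num_pattern_attributes, total_attributes):
    """Assumed definition: share of all attributes that the pattern retains."""
    return num_pattern_attributes / total_attributes


def size_score(pattern_count, total_records):
    """Assumed definition: share of all failure records that the pattern covers."""
    return pattern_count / total_records


def pattern_score(num_pattern_attributes, total_attributes, pattern_count, total_records):
    """Multiply the two component scores, reflecting the trade-off between
    how informative a pattern is and how much of the data it covers."""
    return informative_score(num_pattern_attributes, total_attributes) * size_score(
        pattern_count, total_records
    )


# Hypothetical example: 10 attributes in total, 200 failure records in total.
specific = pattern_score(8, 10, 10, 200)   # many attributes, few records
generic = pattern_score(1, 10, 180, 200)   # one attribute, most records
balanced = pattern_score(4, 10, 100, 200)  # moderate on both dimensions

# Rank the patterns from highest to lowest score.
ranking = sorted(
    [("specific", specific), ("generic", generic), ("balanced", balanced)],
    key=lambda item: item[1],
    reverse=True,
)
```

With these assumed definitions, the pattern that is moderate on both dimensions ranks above both the very specific and the very generic pattern, which matches the trade-off described above.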
Shown below is an example of pseudocode that may be implemented by the score module 250 to determine the score 255 in the manner described.
In some embodiments, the computing system 200 may include a post-processing module 260. In operation, the post-processing module 260 receives the scored results 216 from the score module 250 and then may be configured to filter out patterns covering highly overlapped subsets of the results. In some embodiments, the filtering may be done either by a symmetrical similarity measure (e.g., Jaccard Index) or by asymmetrical subset filtering, such that no pattern is a pure subset of another. In other embodiments, other types of filtering may be used. Accordingly, the embodiments disclosed herein are not limited by the type of filtering that may be performed.
The computing system 200 may further include an output module 270. In operation, the output module may receive the results 216 from the post-processing module 260 or, in those embodiments that do not include a post-processing module 260, from the score module 250. The output module may provide the results 216 to an end user, who may then use the results to further investigate the highly scored patterns as these patterns are more likely to provide information about the root cause of the failure.
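Filtering by a symmetrical similarity measure can be illustrated with the Jaccard index, which is the size of the intersection of two sets divided by the size of their union. The Python sketch below is one possible reading and not the implementation of the post-processing module 260: the greedy keep-the-higher-score strategy, the `max_similarity` threshold, the pattern names, and the row identifiers are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard index of two collections of record identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_overlapping(scored_patterns, max_similarity=0.5):
    """Visit patterns from highest to lowest score and keep a pattern only
    if its covered records are not too similar to any pattern already kept.
    The threshold value is an illustrative assumption."""
    kept = []
    for name, score, rows in sorted(scored_patterns, key=lambda p: p[1], reverse=True):
        if all(jaccard(rows, kept_rows) <= max_similarity for _, _, kept_rows in kept):
            kept.append((name, score, rows))
    return kept


# Hypothetical scored patterns: (name, score, identifiers of covered records).
scored_patterns = [
    ("A", 0.5, {1, 2, 3, 4}),
    ("B", 0.4, {1, 2, 3}),  # overlaps heavily with A (Jaccard index 0.75)
    ("C", 0.3, {7, 8}),     # disjoint from A
]

filtered = filter_overlapping(scored_patterns)
kept_names = [name for name, _, _ in filtered]
```

Here pattern B is dropped because it covers almost the same records as the higher-scored pattern A, while pattern C is retained.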
The following discussion now refers to a number of methods and method acts that may be performed. Although the method acts may be discussed in a certain order or illustrated in a flow chart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed.
The method 400 includes grouping accessed data into one or more subsets (act 410). The data may be associated with one or more attributes that describe information corresponding to the data records. The one or more subsets may include data that have matching combinations of the one or more attributes.
For example, as previously discussed, in one non-limiting embodiment the data aggregation module 226 may group or organize failure data into the subsets 227 based on the combinations of the attributes 215a, 215b, and 215c shared by failure data 215. As previously discussed,
The method 400 includes, for each of the one or more subsets, iteratively removing one or more of the attributes of the combination of attributes associated with the subset to thereby increase the amount of data included in each subset (act 420). For example, as previously described, in the non-limiting embodiment the seed expand module 240 may iteratively remove one or more of the attributes 215a, 215b, and 215c associated with one of the subsets 227 to increase the amount of failure data 215 included in the subsets 227. As previously discussed,
The method 400 includes, after iteratively removing the attributes, scoring each subset to determine one or more patterns related to the combination of attributes (act 450). For example, as previously discussed, in the non-limiting embodiment the score module 250 scores the subsets 227 to determine the pattern of the combination of attributes 215a, 215b, and 215c. In the non-limiting embodiment, this pattern is the pattern that is likely to be a cause of one or more failures of the computing operation. As previously discussed,
The method 500 includes receiving data that is associated with one or more attributes that describe information corresponding to the data (act 510). For example, as previously discussed, in one non-limiting embodiment the input module 210 may receive the failure data 215, which may correspond to computing operations failures such as exceptions thrown during code execution, application crashes, failed server requests, or similar events. The failure data may also include data latencies. The failure data 215 may include multiple failure data records 310 and their corresponding attributes 215a, 215b, and 215c or 320-340. The attributes may include geographical data, application version data, error codes, operating system and version data, device type, or the like.
The method 500 includes organizing the data and the associated one or more attributes into a table (act 520). The table may have rows corresponding to the data and columns corresponding to the one or more attributes. For example, in the non-limiting embodiment the failure data 215 may be organized by the data aggregation module 226 into the failure record table 300 that has rows corresponding to the failure data records 310 and columns corresponding to attributes 320, 330, and 340 as shown in
The method 500 includes reorganizing the table into one or more subsets of data based on a count representing an amount of the data having matching combinations of the one or more attributes (act 530). For example, in the non-limiting embodiment the data aggregation module 226 may reorganize the failure record table 300 into the subsets 380 as shown in
The method 500 includes for each of the one or more subsets, iteratively removing one or more of the attributes of the combination of attributes associated with each subset to thereby increase the count representing the amount of the data included in each subset (act 540). For example, as previously described, in the non-limiting embodiment the seed expand module 240 may iteratively remove one or more of the attributes 320, 330, and 340 associated with one of the subsets 380 to increase the count 371a-375a representing the amount of failure data records 310 included in the subsets 385 after the iterative process. For instance,
The method 500 includes after iteratively removing the attributes, scoring each subset to determine one or more patterns related to the combination of attributes (act 550). For example, as previously discussed, in the non-limiting embodiment the score module 250 scores the subsets 380 with a score 391-397 that are used to determine the pattern of the combination of attributes 320, 330, and 340. In the non-limiting embodiment this is the pattern that is likely to be a cause of the one or more failures of the computing operation.
For the processes and methods disclosed herein, the operations performed in the processes and methods may be implemented in differing order. Furthermore, the outlined operations are only provided as examples, and some of the operations may be optional, combined into fewer steps and operations, supplemented with further operations, or expanded into additional operations without detracting from the essence of the disclosed embodiments.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
This application is a non-provisional of and claims priority from and the benefit of U.S. Provisional Patent Application Ser. No. 62/294,596 filed on Feb. 12, 2016 and entitled “DATA MINING USING DISCRETE ATTRIBUTES,” which application is hereby expressly incorporated herein in its entirety.
Number | Date | Country
---|---|---
62294596 | Feb 2016 | US