In a computing environment, it is not uncommon for a program to have an error, which may result in a “crash” or “hang.” Such programs may include word processing programs, office management programs or almost any type of program. Following the crash or hang, a dialog box may invite the user to send an “error report” to a software corporation.
Error reports, which may be considered “telemetry data,” include information from the memory of the computer, prior to the crash. Such information is useful to software developers trying to determine a cause of the failure. In some cases, tens of millions of error reports may arrive daily.
Due to the volume of error reports which may be received by a software company, it may be difficult to process the incoming information, and particularly, to derive useful insight as to the cause of error reports. This difficulty is magnified because error reports are not grouped logically and in a manner which suggests a cause of the underlying error.
Thus, advancements in error report processing would be welcome, particularly advancements able to more efficiently process very large numbers of error reports. Additionally, advancements in error report processing that are able to better analyze error reports, and particularly to indicate software problems in less common error reports, would result in more rapid error detection.
Techniques for error report processing are described herein. In one example, large numbers of error reports, organized according to “buckets,” are received due to program crashes. The error reports can be re-bucketed into meta-buckets, which can be based on a similarity of call stacks associated with each error report. The meta-buckets can be used to provide output to programmers analyzing the software errors.
In a further example, error reports received by a developer due to program crashes may be organized into a plurality of “buckets” based in part on a name and a version of the application associated with a crash. The error reports may also include a call stack of the computer on which the crash occurred. The call stacks of the error reports may be used to “re-bucket” the error reports into meta-buckets. Organization of error reports in meta-buckets may provide a deeper insight to programmers working to resolve software errors. The re-bucketing may cluster together error reports based in part on a similarity of their call stacks. The similarity of two call stacks may be based on a number of factors, including a model described herein. Further, call stack similarity may be based in part on functions or subroutines on two call stacks, a distance of those functions or subroutines from the crash point, and an offset distance between the common functions or subroutines.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The term “techniques,” for instance, may refer to device(s), system(s), method(s) and/or computer-readable instructions as permitted by the context above and throughout the document.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components. Moreover, the figures are intended to illustrate general concepts, and not to indicate required and/or necessary elements.
Techniques for error report processing are described herein. In one example, large numbers of error reports, organized according to “buckets,” are received due to program crashes. The error reports can be re-bucketed into meta-buckets, which can be based on a similarity of call stacks associated with each error report. The meta-buckets can be used to provide output to programmers analyzing the software errors.
In a further example, error reports received by a developer due to program crashes may be organized into a plurality of “buckets” based in part on a name and a version of the application associated with a crash. The error reports may also include a call stack of the computer on which the crash occurred. The call stacks of the error reports may be used to “re-bucket” the error reports into meta-buckets. Organization of error reports in meta-buckets may provide a deeper insight to programmers working to resolve software errors. The re-bucketing may cluster together error reports based in part on a similarity of their call stacks. The similarity of two call stacks may be based on a number of factors, including a model described herein. Call stack similarity may be based in part on functions or subroutines common to the two call stacks, a distance of those functions or subroutines from a crash point, an offset distance between the common functions or subroutines and/or other factors.
The techniques discussed herein improve error report processing by increasing a likelihood that related errors are clustered together in meta-buckets. Additionally, the techniques discussed herein provide a stack similarity model that effectively clusters error reports by providing a measure of call stack similarity. The techniques discussed herein also provide for model training, wherein parameters used by the model are adjusted to allow better measurement of a similarity between call stacks associated with two error reports.
The discussion herein includes several sections. Each section is intended to be non-limiting. More particularly, this entire description is intended to illustrate components which may be utilized in error report processing, but not components which are necessarily required. The discussion begins with a section entitled “Example Techniques in Error Reporting and Call Stack Similarity,” which describes error reports and techniques for measuring similarity between call stacks associated with different error reports. Next, a section entitled “Example Techniques for Objective Call Stack Similarity Measurement” illustrates and describes techniques that can be used to objectively measure similarity between call stacks. Next, a section entitled “Example Error Report Processing System” illustrates and describes techniques that can be used to process error reports, measure similarity between call stacks and to provide output to software developers and programmers. A fourth section, entitled “Example Flow Diagrams” illustrates and describes techniques that may be used in error reporting and in measurement of call stack similarity. Finally, the discussion ends with a brief conclusion.
This brief introduction, including section titles and corresponding summaries, is provided for the reader's convenience and is not intended to limit the scope of the claims or any section of this disclosure.
Example Techniques in Error Reporting and Call Stack Similarity
A graph 210 illustrates an example distribution 212 of meta-buckets (e.g., meta-bucket 214). Each meta-bucket may be formed by clustering two or more error reports obtained from buckets 206 of the graph 202. The clustering may be based on a similarity of the error reports. In particular, the similarity may be based on a similarity of a first call stack associated with a first error report to a second call stack associated with a second error report. Thus, if the call stacks are sufficiently similar, then the associated error reports may be combined into a same meta-bucket. Thus, each meta-bucket may include error reports that are based on similarity of call stacks in the error reports, and generally, based on similarity of an underlying error. In contrast, the buckets (e.g., bucket 206) may include error reports related to a same application and/or version thereof, but which may concern different errors.
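The contrast between buckets and meta-buckets can be illustrated with a minimal sketch of the initial bucketing step, assuming each error report is a dictionary with hypothetical `id`, `app` and `version` fields (these field names are illustrative and are not taken from the description). Reports that share an application name and version land in the same bucket, even when their underlying errors differ.

```python
from collections import defaultdict


def bucket_reports(reports):
    """Group report ids by (application name, version).

    The report layout (dict with "id", "app", "version") is an assumption
    made for this sketch only.
    """
    buckets = defaultdict(list)
    for report in reports:
        buckets[(report["app"], report["version"])].append(report["id"])
    return dict(buckets)


example_reports = [
    {"id": "r1", "app": "editor", "version": "2.0"},
    {"id": "r2", "app": "editor", "version": "2.0"},
    {"id": "r3", "app": "editor", "version": "2.1"},
]
example_buckets = bucket_reports(example_reports)
```

Re-bucketing would then regroup the reports by call stack similarity rather than by these two fields.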
Examination of the call stacks 302, 304 indicates that some of the procedures are the same and some are different. For example, the first three procedures are the same in both call stacks 302, 304. However, the fourth procedure in call stack 304 is in the fifth position in call stack 302. The last procedure in call stack 302 is not seen in call stack 304. Accordingly, there are both similarities and differences in the two call stacks. The below discussion will be directed in part to a determination of whether the error reports associated with two such call stacks, having both similarities and differences, should be combined into a same meta-bucket.
Alternatively, the alignment offset can be determined by calculation. For each of two matched functions, a distance is measured, within its own call stack, to the procedure that was operating when the crash occurred (e.g., the crash at procedures 406, 408). Immune functions located between a matched function and that procedure may be excluded when measuring the distance. The offset is the difference obtained by subtracting one of these distances from the other.
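The calculation can be sketched as follows. This is an illustration under stated assumptions, not the described implementation: each call stack is a list of function names with the crashing procedure at index 0, the distance is counted in frames, and the helper names `distance_to_crash` and `alignment_offset`, as well as the sample function names, are hypothetical.

```python
def distance_to_crash(stack, index, immune=frozenset()):
    """Distance, in frames, from stack[index] to the crash frame stack[0].

    Immune functions lying strictly between the two frames are not counted.
    The index-0-is-crash convention is an assumption of this sketch.
    """
    skipped = sum(1 for frame in stack[1:index] if frame in immune)
    return index - skipped


def alignment_offset(stack_a, stack_b, function, immune=frozenset()):
    """Absolute difference of the two distances to the crash frame."""
    dist_a = distance_to_crash(stack_a, stack_a.index(function), immune)
    dist_b = distance_to_crash(stack_b, stack_b.index(function), immune)
    return abs(dist_a - dist_b)


# Hypothetical stacks; "log_helper" plays the role of an immune function.
stack_a = ["crash_fn", "log_helper", "parse", "load", "main"]
stack_b = ["crash_fn", "parse", "validate", "load", "main"]
immune = frozenset({"log_helper"})

offset_parse = alignment_offset(stack_a, stack_b, "parse", immune)
offset_load = alignment_offset(stack_a, stack_b, "load", immune)
```

In this example, excluding the immune function brings `parse` to the same distance in both stacks, while `load` remains one frame apart because `validate` appears in only one stack.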
The details of offset calculation may be understood with reference to
The call stack normalizer 810 is a length of the call stack 402. Such a factor can be used as one objective indicator of call stack similarity. A longer call stack normalizer tends to indicate that similarities in the call stacks are less likely to be coincidental. That is, a longer string of similarities is a stronger indication of similarity than a shorter string. In the example shown, the length of call stack 402 is 54 function calls.
Example Techniques for Objective Call Stack Similarity Measurement
Two arbitrary call stacks (e.g., call stacks 402, 404 of
According to Equation (1), the similarity of any two error reports may be evaluated. Such an evaluation can be used, for example, to determine the appropriateness of including two error reports within a same meta-bucket or cluster. In particular, the evaluation may be made by examination of a similarity of the call stacks of the two error reports. Further, the similarity between call stacks C1 and C2 may be denoted by sim(C1, C2). If sim(C1, C2) is less than a value ε, i.e., a threshold value, then the call stacks are sufficiently similar to warrant combination of the error reports into a meta-bucket or cluster (such as is indicated by
Referring to Equation (1), c is a coefficient for distance to crash point, o is a coefficient for alignment offset, S1 and S2 are any sub-sequences of C1 and C2, l is a minimum of a length of S1 and S2, and L1 and L2 are lengths of S1 and S2, respectively. The range for c and o may be bounded by (0, 2), i.e., real numbers between zero and two. The terms c and o may initially be set by estimate, within this range. The range for ε may be bounded by (0, 1), and it may be similarly set by estimate to a real number within this range.
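Equation (1) itself is not reproduced in this text, so the following is only a sketch of one functional form consistent with the terms defined above. It assumes that each pair of matched functions contributes a weight that decays exponentially with its distance to the crash point (scaled by c) and with its alignment offset (scaled by o), that the best-scoring common sub-sequence is chosen, and that the result is normalized by the best score attainable for the shorter of the two stacks. The exponential weights, the normalizer and the name `stack_similarity` are all assumptions.

```python
import math


def stack_similarity(stack_a, stack_b, c=1.0, o=1.0):
    """Assumed similarity in [0, 1]; index 0 of each stack is the crash frame.

    Dynamic programming over common sub-sequences: best[i][j] is the best
    total weight using the first i frames of stack_a and j of stack_b.
    """
    n, m = len(stack_a), len(stack_b)
    if n == 0 or m == 0:
        return 0.0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gain = 0.0
            if stack_a[i - 1] == stack_b[j - 1]:
                d_a, d_b = i - 1, j - 1  # distances to the crash frame
                gain = math.exp(-c * min(d_a, d_b)) * math.exp(-o * abs(d_a - d_b))
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + gain)
    # Assumed normalizer: score of a perfect match of the shorter stack.
    normalizer = sum(math.exp(-c * k) for k in range(min(n, m)))
    return best[n][m] / normalizer


base = ["f", "g", "h"]
near_crash_match = stack_similarity(base, ["f", "x", "y"])
far_from_crash_match = stack_similarity(base, ["x", "y", "h"])
```

In this sketch a larger value means greater similarity, and a match near the crash point counts for more than a match far from it. Applying a threshold test of the "less than ε" kind would require a distance-like quantity, for example one minus this value; that conversion is likewise an assumption.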
Equation (1) can be trained to provide more accurate clustering and/or meta-bucket creation by refinement of values for c, o and ε. In one example, once the software error has been discovered (such as by software developers' work) some error reports and their associated call stacks will have been correctly grouped by Equation (1). Such correct grouping means that the grouped error reports were in fact related to the same error. Other error reports may have been incorrectly grouped. Such incorrect grouping means that the grouped error reports were not related to the same error. Once the software error is known, correct error report clustering will also be known. Using this information, the values for c, o and ε assigned to Equation (1) can be adjusted to result in greater accuracy in clustering error reports. The revised values for c, o and ε can then be used to cluster error reports going forward. The values for c, o and ε can be periodically revised, as desired.
Initially, Equation (1) is used with starting values of c, o and ε, selected from within the approved ranges noted above. Error reports are clustered, such as into meta-buckets or in a hierarchical manner, when call stacks associated with the error reports are sufficiently similar, i.e., when sim(C1, C2) of Equation (1) is less than ε. After the error is resolved by software developers, it can be seen which error reports were correctly clustered as resulting from a same error. Equation (1) may then be trained to result in greater performance. In particular, the terms c, o and ε may be adjusted so that Equation (1) better recognizes call stack similarity and better combines error reports. That is, when the nature of the software error is known, the correct grouping of error reports is known. Accordingly, the terms c, o and ε may be adjusted so that Equation (1) indicates more call stack similarity in instances where the error reports should be combined and less call stack similarity when the error reports should not be combined. Such values of c, o and ε can be used going forward.
The learning process by which terms c, o and ε are adjusted to promote better operation of Equation (1) may involve use of Equation (2).
Referring to Equation (2), the terms p and r may be considered to be a “precision” term and a “recall” term, respectively. The precision and recall terms may be obtained by examination of error reports that were grouped, after the error was discovered. The precision term is set as a quotient p=(number of pairs of call stacks calculated by Equation (1) to be similar that are also found to be actually similar after the error was discovered)/(number of pairs of call stacks calculated by Equation (1) to be similar). The recall term is set to be the quotient r=(number of pairs of call stacks calculated by Equation (1) to be similar that are also found to be actually similar after the error was discovered)/(number of pairs of call stacks found to be actually similar after the error was discovered).
Using the values for p and r found after the software error is resolved, the value of F1 in Equation (2) may be found. By adjusting the values for c, o and ε, Equation (1) will perform differently, resulting in new values for p, r, and F1. By selecting values for c, o and ε which maximize or increase F1, Equation (1) will more correctly group error reports by call stack similarity. Accordingly, the revised values for c, o and ε can be used going forward.
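The evaluation can be sketched as below. Equation (2) is not reproduced in this text; the sketch assumes it is the standard F1 measure, F1 = 2pr/(p + r), computed over pairs of error reports. The function names and the representation of a clustering as a list of sets of report identifiers are hypothetical.

```python
from itertools import combinations


def pairs_in_same_cluster(clusters):
    """Return the set of unordered report-id pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def precision_recall_f1(predicted_clusters, true_clusters):
    """Pairwise precision, recall and (assumed) F1 = 2pr / (p + r)."""
    predicted = pairs_in_same_cluster(predicted_clusters)
    actual = pairs_in_same_cluster(true_clusters)
    correct = len(predicted & actual)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(actual) if actual else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Clustering produced by the model versus the grouping known after the fix.
predicted_clusters = [{1, 2, 3}, {4}]
true_clusters = [{1, 2}, {3, 4}]
p, r, f1 = precision_recall_f1(predicted_clusters, true_clusters)
```

Here three pairs are predicted to be similar, two pairs are actually similar, and one pair is both, giving p = 1/3, r = 1/2 and F1 = 0.4.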
Any of several approaches may be employed to obtain values of c, o and ε that maximize or increase the function F1. A first approach to obtain values of c, o and ε uses brute force. In this approach, values of c, o and ε are sampled at intervals within their eligible ranges. Using a selected granularity (e.g., 0.1), all possible combinations of values of c, o and ε may be applied to Equation (1). The best combination of values of c, o and ε may be selected by comparing the F1 value obtained from each combination of the sampled values of c, o and ε. For example, the value of c may be sampled in the interval [0, 2] with step 0.1, i.e., the value of c is enumerated as 0, 0.1, 0.2, . . . , 2. Values of o and ε may be similarly obtained, and c, o and ε combined in Equation (1) in all or many combinations. A second approach to optimize Equation (2) involves use of a gradient descent algorithm. For both approaches, the best and/or acceptable matches between the two call stacks may be obtained first by using a standard string searching and/or matching algorithm like the Knuth-Morris-Pratt algorithm.
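The brute-force approach can be sketched as a grid search. The sketch assumes a caller-supplied function `evaluate(c, o, eps)` that clusters the training reports with those values and returns the resulting F1; that function, and the names `grid` and `brute_force_search`, are hypothetical. The gradient descent alternative is not shown.

```python
from itertools import product


def grid(start, stop, step):
    """Evenly spaced samples from start to stop inclusive.

    Values are derived from an integer counter and rounded so that
    floating-point drift does not accumulate across steps.
    """
    count = int(round((stop - start) / step))
    return [round(start + k * step, 10) for k in range(count + 1)]


def brute_force_search(evaluate, step=0.1):
    """Try every sampled (c, o, eps) and keep the combination with highest F1."""
    best_params, best_f1 = None, float("-inf")
    for c, o, eps in product(grid(0.0, 2.0, step),
                             grid(0.0, 2.0, step),
                             grid(0.0, 1.0, step)):
        f1 = evaluate(c, o, eps)
        if f1 > best_f1:
            best_params, best_f1 = (c, o, eps), f1
    return best_params, best_f1


def toy_objective(c, o, eps):
    """Stand-in for a real F1 evaluation; peaks at (0.5, 1.2, 0.3)."""
    return -((c - 0.5) ** 2 + (o - 1.2) ** 2 + (eps - 0.3) ** 2)


found_params, found_score = brute_force_search(toy_objective)
```

With a step of 0.1 the search evaluates 21 × 21 × 11 combinations. A finer step improves resolution at a cubic increase in cost, which is one reason a gradient-based method may be preferred when evaluation is expensive.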
Example Error Report Processing System
An error report data procedure 1008 receives data, typically in the form of error reports. The error report data procedure 1008 is representative of data procedures generally, which receive data from systems (e.g., someone's computer anywhere on the Internet) after a software error or system crash. Such error reports may arrive from systems widely dispersed over the Internet or from local systems on an intranet. The error report data procedure 1008 may negotiate with a system having an error, to result in information transfer from the system. In some examples, such information may arrive at the error report data procedure 1008 in large quantities.
A call stack clustering procedure 1010 is configured for clustering error reports based on call stack similarity. Such clusters of error reports help to concentrate information related to a single software error (such as into one meta-bucket) and help to remove information related to unrelated software errors (which may be organized on other, appropriate meta-buckets). Error reports are received by the call stack clustering procedure 1010 from the error report data procedure 1008. Call stacks of each error report are examined, and those having sufficient similarity are grouped or clustered together, such as in a meta-bucket or other data structure. Accordingly, a plurality of meta-buckets may be created, thereby grouping related error reports.
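A minimal sketch of such grouping follows. It uses a simple greedy pass in which each report joins the first meta-bucket whose representative call stack is similar enough, which is a simplification chosen for brevity and not necessarily the clustering performed by procedure 1010. The Jaccard measure is a stand-in for the call stack similarity model, and all names and the 0.5 threshold are hypothetical.

```python
def jaccard(stack_a, stack_b):
    """Stand-in similarity: shared functions divided by all functions."""
    a, b = set(stack_a), set(stack_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def cluster_reports(reports, similarity=jaccard, min_similarity=0.5):
    """Greedily assign (report_id, call_stack) pairs to meta-buckets."""
    meta_buckets = []
    for report_id, stack in reports:
        for bucket in meta_buckets:
            if similarity(stack, bucket["representative"]) >= min_similarity:
                bucket["reports"].append(report_id)
                break
        else:
            # No existing meta-bucket is similar enough; start a new one.
            meta_buckets.append({"representative": stack,
                                 "reports": [report_id]})
    return meta_buckets


reports = [
    ("r1", ["crash", "parse", "main"]),
    ("r2", ["crash", "parse", "load", "main"]),
    ("r3", ["render", "draw", "main"]),
]
meta_buckets = cluster_reports(reports)
grouped_ids = [bucket["reports"] for bucket in meta_buckets]
```

The result of a greedy pass depends on the order in which reports arrive; a hierarchical method avoids that dependence at a higher computational cost.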
The call stack clustering procedure 1010 may utilize the methods discussed with respect to
An online engine 1016 retrieves and manages error report data, such as meta-buckets created by the call stack clustering procedure 1010. This managed data is transmitted to the web service 1018, which provides a user interface of the backend system 1006, and provides web clients 1004 with data.
The directed graph 1100 includes a number of nodes or representations 1102 of functions, procedures and/or subroutines that were present on a call stack of one or more error reports. A plurality of weighted vectors 1104 describe a plurality of different paths through the plurality of functions 1102 to a crash point 1106. Because the representations of the plurality of functions are organized in rows, a distance of each function from the crash point is shown. The distance to the crash point for any function can be measured in terms of a number of functions called after the function and before the crash point. The directed graph 1100 may be helpful to programmers looking for a reason for a software error or crash. In the example of
Example Flow Diagrams
Each process described herein is illustrated as a collection of blocks or operations in a logical flow graph, which represent a sequence of operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the operations represent computer-executable instructions stored on one or more computer-readable storage media 1202 that, when executed by one or more processors 1204, perform the recited operations. Such storage media 1202, processors 1204 and computer-readable instructions can be located within an error report processing system (e.g., system 1000 of
At operation 1208, the error reports are re-bucketed. The re-bucketing may be based on call stack similarity, indicating that a similarity of call stacks of two error reports is great enough, or that a difference between them is small enough, to warrant clustering or re-bucketing of the error reports. Referring to the example of
At operation 1210, error reports may optionally be combined in a hierarchical manner, based on similarities of the error reports. In the example of
At operation 1212, a directed graph is created, including representations of functions and weighted vectors leading to a crash point (a function within which the crash occurred). In the example of
At operation 1214, a report is generated, indicating aspects of error report processing. In one example, the report includes the directed graph obtained at operation 1212. Alternatively, the report information may be presented in another format.
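The directed graph of operation 1212 can be sketched as a table of weighted edges. The sketch assumes that each call stack lists the crash point first, that an edge runs from a calling function to the function it called, and that an edge's weight is the number of error reports whose call stack contains that call. The weighting scheme and the name `build_crash_graph` are assumptions.

```python
from collections import Counter


def build_crash_graph(call_stacks):
    """Count (caller, callee) edges across call stacks.

    Each stack is ordered crash frame first, so stack[i + 1] called stack[i].
    """
    edges = Counter()
    for stack in call_stacks:
        for callee, caller in zip(stack, stack[1:]):
            edges[(caller, callee)] += 1
    return edges


call_stacks = [
    ["crash", "parse", "main"],
    ["crash", "parse", "load"],
    ["crash", "render", "main"],
]
graph = build_crash_graph(call_stacks)
heaviest_edge, heaviest_weight = graph.most_common(1)[0]
```

Heavily weighted edges close to the crash point indicate the paths most reports have in common, which is where a programmer might look first.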
At operation 1304, functions which are not immune functions are matched. For example, the matching may involve identifying a same function in each of the two call stacks, wherein each call stack is associated with a different error report. A further example of matching non-immune functions is seen in
At operation 1306, a distance is measured from matched functions to a crash point. An example of this operation is seen at
At operation 1310, a length normalizer is measured. In the example of
At operation 1312, a number of values may be weighted, to determine similarity of the call stacks. In one example the following factors can be weighted: the distance of matched functions to a crash site; the offset distance between the various matched functions; and the length normalizer.
At operation 1314, values are used (e.g., the weighted values of operation 1312) to determine if error reports (e.g., two error reports) should be clustered, based on a similarity of their respective call stacks. In an alternative example wherein Equation (1) is utilized, the similarity may be compared to a threshold value, such as ε.
At operation 1404, call stack similarity based in part on distances of common functions from a crash point is weighted. As seen in
At operation 1406, call stack similarity based in part on an offset distance, measured between common functions, is weighted. As seen in
At operation 1504, the software error which caused the error reports is resolved, such as by efforts of engineers and programmers. Because the cause of the error reports is known, the clustering performed at operation 1502 can be evaluated for correctness. This allows values for precision and recall to be established.
At operation 1506, Equation (2) can be maximized or increased by adjustment of the values for c, o and ε, and reevaluation of Equation (1). In some instances, the initial values for c, o and ε will be adequate. Typically, new values for c, o and ε will be obtained.
At operation 1508, the new values for c, o and ε may be substituted into Equation (1), thereby improving its performance in future use.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.
Number | Date | Country | |
---|---|---|---|
20120137182 A1 | May 2012 | US |