Software development includes a process of debugging, in which errors in source code are identified and removed. Modern software systems have increasingly large and complicated source code, which results in an increasing number of bugs that are to be identified and removed. To facilitate debugging, a crash reporting system is deployed to automatically gather crash reports, which are generated in response to crashes of the software, from testing, delivery, and end users (customers). In general, a software crash can be described as a condition in which the software stops functioning properly.
To help developers reduce debugging efforts, it is important to automatically organize duplicate crash reports into groups, each group (cluster) representing multiple, duplicate crash reports. Typically, duplicate crash report detection includes extracting a stack trace from each crash report, determining similarities between pairs of stack traces, and grouping pairs of stack traces into the same group if the similarity exceeds a similarity threshold. However, computing similarity between stack traces is a difficult task. For example, in typical cases, duplicate crash reports can include stack traces that have only some overlap in functions. In more difficult cases, duplicate crash reports can include stack traces that have little overlap in functions. That is, it can occur that crash reports that are relatively dissimilar, as a whole, are actually duplicates.
In some traditional approaches, duplicate detection can utilize the positions of frames, trace alignment, and/or edit distance to compute the similarity between stack traces. However, this limits throughput and increases consumption of technical resources (e.g., processors, memory) for computation, particularly for large-scale crash bucketing tasks. In order to improve throughput, an example strategy can include speeding up the similarity measurement of stack traces by, for example, aligning the stack traces. However, stack trace alignment itself consumes time and resources, which limits throughput.
Implementations of the present disclosure are directed to a duplicate crash report detection system to identify duplicate crash reports from stack traces. More particularly, implementations of the present disclosure are directed to a duplicate crash report detection system that includes a deep learning (DL) pipeline to identify duplicate crash reports based on feature vectors determined from stack traces.
In some implementations, actions include receiving a set of crash reports, each crash report provided as a computer-readable file, determining a set of trace vectors by processing a set of stack traces through a first DL model, each trace vector in the set of trace vectors being a multi-dimensional vector representation of a stack trace of a respective crash report provided from the set of stack traces, generating a set of feature vectors by processing the set of trace vectors through a second DL model, each feature vector being a multi-dimensional vector representation of a stack trace of a respective crash report, and clustering each crash report in the set of crash reports into a group of a set of groups based on comparing feature vectors of respective crash reports, each group representative of a root cause resulting in respective crashes of the software system represented in one or more crash reports. Other implementations of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
These and other implementations can each optionally include one or more of the following features: actions further include pre-processing each crash report in the set of crash reports to extract a respective stack trace that is included in the set of stack traces; processing the set of stack traces through the first DL model includes, for each stack trace and for each frame in a set of frames of the stack trace, segmenting a frame into a set of sub-frames, determining a sub-frame representation for each sub-frame in the set of sub-frames, and combining the sub-frame representations to provide a trace vector for the frame; generating the set of feature vectors by processing the set of trace vectors through the second DL model includes providing each trace vector as input to the second DL model and receiving a respective feature vector as output of the second DL model; the second DL model is trained using a circle loss to minimize distances between anchor samples and positive samples and maximize distances between the anchor samples and negative samples, and a softmax loss based on predictions of a large-margin softmax layer; the second DL model includes bidirectional long short-term memory (Bi-LSTM) layers, two fully-connected layers (linear), and a rectified linear unit (ReLU) layer; and at least one group is used to debug the software system with respect to a respective root cause.
The present disclosure also provides a computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
The present disclosure further provides a system for implementing the methods provided herein. The system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
It is appreciated that methods in accordance with the present disclosure can include any combination of the aspects and features described herein. That is, methods in accordance with the present disclosure are not limited to the combinations of aspects and features specifically described herein, but also include any combination of the aspects and features provided.
The details of one or more implementations of the present disclosure are set forth in the accompanying drawings and the description below. Other features and advantages of the present disclosure will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
Implementations of the present disclosure are directed to a duplicate crash report detection system to identify duplicate crash reports from stack traces. More particularly, implementations of the present disclosure are directed to a duplicate crash report detection system that includes a deep learning (DL) pipeline to identify duplicate crash reports based on feature vectors determined from stack traces. Implementations can include actions of receiving a set of crash reports, each crash report provided as a computer-readable file, determining a set of trace vectors by processing a set of stack traces through a first DL model, each trace vector in the set of trace vectors being a multi-dimensional vector representation of a stack trace of a respective crash report provided from the set of stack traces, generating a set of feature vectors by processing the set of trace vectors through a second DL model, each feature vector being a multi-dimensional vector representation of a stack trace of a respective crash report, and clustering each crash report in the set of crash reports into a group of a set of groups based on comparing feature vectors of respective crash reports, each group representative of a root cause resulting in respective crashes of the software system represented in one or more crash reports.
As used herein, duplicate crash reports indicate crash reports that are generated in response to the same root cause. That is, crash reports that are duplicates do not need to be the same crash report in and of itself. For example, a first device can encounter a first crash that results from a root cause and, in response, a first crash report is generated. A second device can encounter a second crash that results from the root cause and, in response, a second crash report is generated. The first crash report and the second crash report, while being different crash reports, can be identified as duplicates, because they were generated in response to the root cause (i.e., the same root cause). As another example, the first device can encounter a third crash that results from the root cause and, in response, a third crash report is generated. The first crash report and the third crash report, while being different crash reports, can be identified as duplicates, because they were generated in response to the root cause (i.e., the same root cause).
To provide further context for implementations of the present disclosure, and as introduced above, software development includes a process of debugging, in which errors in source code are identified and removed. Modern software systems have increasingly large and complicated source code, which results in an increasing number of bugs that are to be identified and removed. To facilitate debugging, a crash reporting system is deployed to automatically gather crash reports, which are generated in response to crashes of the software, from testing, delivery, and end users (customers). In general, a software crash can be described as a condition in which the software stops functioning properly.
Typically, crash reports contain information on the environment (e.g., device, operating system), system status, stack trace, and execution. In some examples, a stack trace can be described as a control flow of the software. In a crash report, a stack trace represents the control flow leading up to a crash. A control flow can include an order of functions that were executed leading up to a crash. A stack trace alone can provide sufficient information for developers to know where to look for the root cause of a crash. However, manually finding the root cause of a crash can be difficult. For example, root cause analysis can require deep knowledge and understanding of the source code. Moreover, as the number of crashes increases, manual analysis of the crashes becomes impractical.
To help developers reduce debugging efforts, it is important to automatically organize duplicate crash reports into groups, each group (cluster) representing multiple, duplicate crash reports. A duplicate crash report can be generally described as a crash report that results from the same root cause as another crash report (i.e., multiple crash reports are generated for the same root cause). This task is referred to as duplicate crash report detection, crash report bucketing, or crash report deduplication. By grouping crash reports based on duplicates, developers can more quickly and efficiently address and resolve issues in software.
Typically, duplicate crash report detection includes extracting a stack trace from each crash report, determining similarities between pairs of stack traces, and grouping pairs of stack traces into the same group if the similarity exceeds a similarity threshold. In some examples, a group can include multiple crash reports (e.g., duplicate crash reports). In some examples, a group can include a single crash report (e.g., the crash report is not associated with a duplicate). However, computing similarity between stack traces is a difficult task. For example, in typical cases, duplicate crash reports can include stack traces that have only some overlap in functions. In more difficult cases, duplicate crash reports can include stack traces that have little overlap in functions. That is, it can occur that crash reports that are relatively dissimilar, as a whole, are actually duplicates in that they result from the same root cause.
In some traditional approaches, duplicate detection can utilize the positions of frames, trace alignment, and/or edit distance to compute the similarity between stack traces. For example, some approaches compute the edit distance of every frame pair as a frame-pair similarity and aggregate the frame-pair similarities based on weights (e.g., determined by term frequency-inverse document frequency (TF-IDF)). However, this limits throughput and increases consumption of technical resources (e.g., processors, memory) for computation, particularly for large-scale crash bucketing tasks. In order to improve throughput, an example strategy can include speeding up the similarity measurement of stack traces by, for example, aligning the stack traces. However, stack trace alignment itself consumes time and resources, which limits throughput.
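By way of non-limiting illustration, a minimal sketch of such a traditional frame-pair approach is provided below in Python, using a plain Levenshtein edit distance and an unweighted average in place of TF-IDF weighting; the function names and the aggregation scheme are illustrative assumptions:

def frame_similarity(f1: str, f2: str) -> float:
    """Normalized edit-distance similarity between two frames."""
    m, n = len(f1), len(f2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if f1[i - 1] == f2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[m][n] / max(m, n, 1)

def trace_similarity(trace1: list[str], trace2: list[str]) -> float:
    """Aggregate frame-pair similarities (unweighted mean over all pairs)."""
    pairs = [(a, b) for a in trace1 for b in trace2]
    return sum(frame_similarity(a, b) for a, b in pairs) / len(pairs)

Because every frame pair is compared character by character, the cost grows quadratically with both trace length and frame length, which illustrates why such similarity measurements limit throughput for large-scale bucketing.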
In view of the above context, implementations of the present disclosure provide a duplicate crash report detection system that includes a DL pipeline to identify duplicate crash reports based on feature vectors determined from stack traces. In some implementations, the duplicate crash report detection system of the present disclosure includes mapping stack traces of crash reports to respective feature vectors and grouping the crash reports into buckets (also referred to herein as groups and/or clusters) based on the feature vectors. In some examples, the duplicate crash report detection system includes a frame tokenization DL model, referred to herein as frame2vec, that extracts frame representations in stack traces based on frame segmentation. In some examples, the duplicate crash report detection system includes a deep metric model (a DL model) that maps sequential stack trace representations into feature vectors that can be used to determine similarity between pairs of crash reports. In some examples, the duplicate crash report detection system includes a clustering algorithm that is used to group crash reports (based on respective feature vectors) into buckets. As described in further detail herein, the duplicate crash report detection system of the present disclosure provides time- and resource-efficient detection of duplicate crash reports even in instances of large-scale crash report bucketing.
In some examples, the client device 102 can communicate with the server system 104 over the network 106. In some examples, the client device 102 includes any appropriate type of computing device such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a personal digital assistant (PDA), a cellular telephone, a network appliance, a camera, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or an appropriate combination of any two or more of these devices or other data processing devices. In some implementations, the network 106 can include a large computer network, such as a local area network (LAN), a wide area network (WAN), the Internet, a cellular network, a telephone network (e.g., PSTN) or an appropriate combination thereof connecting any number of communication devices, mobile computing devices, fixed computing devices and server systems.
In some implementations, the server system 104 includes at least one server and at least one data store. In the example of
In some implementations, and as noted above, the server system 104 can host a duplicate crash report detection system of the present disclosure. In some examples, the server system 104 receives a set of crash reports that is processed using the duplicate crash report detection system to time- and resource-efficiently identify any duplicate crash reports in the set of crash reports. In some examples, crash reports can be received from one or more sources. Example sources can include, without limitation, users of software, for which the crash reports are generated, and a developer of the software.
Referring again to
In accordance with implementations of the present disclosure, the trace vector module 204 processes each stack trace in the set of stack traces to determine a sequential trace representation provided as a trace vector for a respective stack trace. In some examples, the trace vector module 204 executes a DL model (e.g., frame2vec, referenced herein) to extract a frame representation for each frame based on aggregating sub-frame representations, and provides a respective trace vector by aggregating the frame representations.
In further detail, in providing a trace vector, each frame is divided into sub-frames and a sub-frame representation (a multi-dimensional vector) is provided for each sub-frame. The sub-frame representations are combined to provide a trace vector as a frame representation for the respective frame. This tokenization technique enables a reduction in the number of tokens that are stored in a token dictionary, thereby reducing memory consumption. Further, this tokenization technique enhances the quality of the trace vectors, because the DL model (frame2vec) preserves the semantic similarity of frames.
By way of non-limiting example, a first frame com.company.Class1.method1 and a second frame com.company.Class1.method2 represent two different functions in a stack trace. In this example, sub-frames of the first frame include com, company, Class1, and method1, and sub-frames of the second frame include com, company, Class1, and method2. In this example, the first frame and the second frame have the same prefix (i.e., com.company.Class1). In view of this, their respective frame representations should be similar (but not the same).
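For purposes of illustration only, such frame segmentation can be sketched as follows in Python; the delimiter set and the function name are assumptions, as implementations of the present disclosure are not limited to a particular segmentation scheme:

import re

def segment_frame(frame: str) -> list[str]:
    """Split a stack-trace frame into sub-frame tokens at '.' (and '$') separators."""
    return [token for token in re.split(r"[.$]", frame) if token]

# 'com.company.Class1.method1' -> ['com', 'company', 'Class1', 'method1']
assert segment_frame("com.company.Class1.method1") == ["com", "company", "Class1", "method1"]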
In some examples, the following formulation can be used to denote the DL model (frame2vec):
v_i = Frame2Vec(s_i)

where v_i is the i-th frame representation. As indicated above, the sequential stack trace frames ST_i = {s_1, s_2, . . . , s_n}_i can be denoted by the trace vectors V_i = {v_1, . . . , v_n}_i. In some examples, to ensure that the trace vectors are suitable for stack traces, skip-gram negative sampling is used to optimize the DL model (frame2vec).
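By way of non-limiting example, skip-gram negative sampling over sub-frame tokens can be realized with an off-the-shelf library such as gensim; the parameter values below are illustrative assumptions rather than a prescribed configuration:

import numpy as np
from gensim.models import Word2Vec

# Each stack trace contributes one "sentence" whose words are sub-frame tokens.
tokenized_traces = [
    ["com", "company", "Class1", "method1"],
    ["com", "company", "Class1", "method2"],
    # ... one token list per stack trace
]

frame2vec = Word2Vec(
    sentences=tokenized_traces,
    vector_size=128,  # dimensionality of the sub-frame representations
    window=5,
    sg=1,             # skip-gram architecture
    negative=5,       # negative sampling
    min_count=1,
)

# A frame representation v_i can then be formed by combining the sub-frame
# vectors, e.g., by averaging them (the combination scheme is an assumption).
v_i = np.mean([frame2vec.wv[token] for token in ["com", "company", "Class1", "method1"]], axis=0)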
Referring again to
F_bilstm = biLSTM(V_i)
In some examples, two fully-connected layers and ReLU activation (of the ReLU layer 404) are stacked on the Bi-LSTM layer 402 for mapping trace vectors (V_1, . . . , V_m) into feature vectors (F_1, . . . , F_m). This can be formulated as:
F_i = Linear(ReLU(Linear(F_bilstm(V_i))))
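For purposes of illustration, a minimal PyTorch sketch of such a model is provided below; the dimensions are illustrative, and taking the last Bi-LSTM output as the trace summary is an assumption, as a specific pooling is not prescribed:

import torch
import torch.nn as nn

class DeepMetricModel(nn.Module):
    """Bi-LSTM encoder followed by Linear -> ReLU -> Linear."""

    def __init__(self, frame_dim: int = 128, hidden_dim: int = 256, feat_dim: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(frame_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.linear1 = nn.Linear(2 * hidden_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, feat_dim)

    def forward(self, trace_vectors: torch.Tensor) -> torch.Tensor:
        # trace_vectors: (batch, n_frames, frame_dim)
        out, _ = self.bilstm(trace_vectors)  # (batch, n_frames, 2 * hidden_dim)
        f_bilstm = out[:, -1, :]             # last step as the trace summary (assumption)
        return self.linear2(self.relu(self.linear1(f_bilstm)))  # F_i

model = DeepMetricModel()
features = model(torch.randn(4, 30, 128))  # four traces of 30 frames each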
In accordance with implementations of the present disclosure, the deep metric model 400 is trained based on predicted GIDs for the anchor, positive, and negative samples, respectively (e.g., GID_a,pred, GID_p,pred, GID_n,pred). More particularly, during training, a total loss (ℒ) is determined based on a circle loss (ℒ_circle) and a softmax loss (ℒ_l-softmax), and parameters of the deep metric model 400 are adjusted between training iterations (e.g., using back-propagation). In some examples, iterations of training are executed until the total loss is minimized. The following example relationship can be provided:

ℒ = ℒ_circle + ℒ_l-softmax
In further detail, the circle loss is determined based on an anchor feature vector (F_a), a positive feature vector (F_p), and a negative feature vector (F_n) output by the deep metric model 400 for the respective trace vectors during training. In some examples, the circle loss is determined based on a first distance between the anchor feature vector and the positive feature vector, and a second distance between the anchor feature vector and the negative feature vector. In some examples, the first distance and the second distance can each be determined as a cosine distance. During training, the first distance is minimized, while the second distance is maximized. The circle loss can be represented as:
ℒ_circle(α, cos(F_a, F_p/n))

where cos indicates the cosine distance, F_a/p/n denotes the feature vectors of the anchor/positive/negative samples, and α is a margin that is enforced between positive and negative pairs.
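For illustration, a pair-based circle loss following the commonly published formulation can be sketched as follows in PyTorch; the margin m and scale factor gamma are illustrative, and the exact parameterization used in implementations of the present disclosure may differ:

import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor, m: float = 0.25, gamma: float = 64.0) -> torch.Tensor:
    """Pair-based circle loss over similarities of positive (sp) and negative (sn) pairs."""
    ap = torch.clamp_min(1.0 + m - sp.detach(), 0.0)  # adaptive weight for positive pairs
    an = torch.clamp_min(sn.detach() + m, 0.0)        # adaptive weight for negative pairs
    delta_p, delta_n = 1.0 - m, m                     # decision margins
    logit_p = -gamma * ap * (sp - delta_p)
    logit_n = gamma * an * (sn - delta_n)
    return F.softplus(torch.logsumexp(logit_n, dim=0) + torch.logsumexp(logit_p, dim=0))

# Example with anchor/positive/negative feature vectors from the deep metric model:
F_a, F_p, F_n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
loss = circle_loss(F.cosine_similarity(F_a, F_p), F.cosine_similarity(F_a, F_n))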
In some examples, the softmax loss can be represented as:
ℒ_l-softmax(Ĝ_a/p/n, G_a/p/n)
where Ĝ is the predicted GID and G indicates the actual GID.
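Combining the two terms, a sketch of the total training loss is provided below, reusing the circle_loss sketch above; an ordinary cross-entropy over GID logits stands in for the large-margin softmax layer, and the head dimensions are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

num_gids = 50                        # hypothetical number of groups (GIDs) in the training data
gid_head = nn.Linear(128, num_gids)  # classification layer producing predicted GIDs

def total_loss(F_a, F_p, F_n, gid_a, gid_p, gid_n):
    l_circle = circle_loss(F.cosine_similarity(F_a, F_p), F.cosine_similarity(F_a, F_n))
    logits = gid_head(torch.cat([F_a, F_p, F_n]))  # Ĝ: predicted GIDs
    targets = torch.cat([gid_a, gid_p, gid_n])     # G: actual GIDs
    l_softmax = F.cross_entropy(logits, targets)
    return l_circle + l_softmax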
During inference, the (trained) deep metric model is used to provide a feature vector for each crash report in the set of crash reports. For example, each trace vector in the set of trace vectors V_1, . . . , V_m corresponds to a respective crash report in the set of crash reports. Each trace vector is provided as input to the deep metric model (e.g., executed by the feature vector module 206 of
In accordance with implementations of the present disclosure, dis-/similarity between feature vectors can represent dis-/similarity between the underlying crash reports. That is, the feature vectors of duplicate crash reports are similar to one another and the feature vectors of non-duplicate crash reports are dissimilar to one another. Referring again to
A set of crash reports is received (502). For example, and as described herein, the duplicate crash report detection system 200 of
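A set of stack traces is determined (504). For example, and as described herein, each crash report in the set of crash reports is pre-processed to extract a respective stack trace that is included in the set of stack traces.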
A set of trace vectors is generated (506). For example, and as described herein, each stack trace in the set of stack traces is processed through a DL model, frame2vec, by the trace vector module 204. In some examples, each trace vector in the set of trace vectors is a multi-dimensional vector representation of a stack trace of a respective crash report provided from the set of stack traces. In some examples, processing the set of stack traces through the DL model includes, for each stack trace and for each frame in a set of frames of the stack trace, segmenting a frame into a set of sub-frames, determining a sub-frame representation for each sub-frame in the set of sub-frames, and combining the sub-frame representations to provide a trace vector for the frame.
A set of feature vectors is generated (508). For example, and as described herein, each trace vector in the set of trace vectors is processed through a second DL model, the deep metric model, by the feature vector module 206. In some examples, the second DL model is trained using a circle loss to minimize distances between anchor samples and positive samples and maximize distances between the anchor samples and negative samples, and a softmax loss based on predictions of a large-margin softmax layer. In some examples, the second DL model includes bidirectional long short-term memory (Bi-LSTM) layers, two fully-connected layers (linear), and a ReLU layer.
Crash reports of the set of crash reports are clustered into a set of groups (510). For example, and as described herein, the clustering module 208 groups feature vectors, and thus their respective crash reports, into buckets, each bucket representing one or more crashes resulting from a (same) root cause. In some implementations, for each unique pair of feature vectors in the set of feature vectors, the clustering algorithm determines a similarity score. In some examples, the similarity score is calculated as the cosine similarity between feature vectors in a pair of feature vectors. In some examples, if the similarity score exceeds a threshold similarity score, the crash reports corresponding to the feature vectors are included in the same bucket. In some examples, if the similarity score does not exceed the threshold similarity score, the crash reports corresponding to the feature vectors are not included in the same bucket.
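As a non-limiting sketch, such threshold-based bucketing can be realized with off-the-shelf hierarchical clustering (e.g., SciPy); the average-linkage method and the example threshold are illustrative assumptions:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def bucket_crash_reports(feature_vectors: np.ndarray, sim_threshold: float = 0.9) -> np.ndarray:
    """Assign a bucket label to each feature vector (one per crash report).

    Average-linkage hierarchical clustering over cosine distance; two reports
    share a bucket when their cosine distance is below 1 - sim_threshold.
    """
    Z = linkage(feature_vectors, method="average", metric="cosine")
    return fcluster(Z, t=1.0 - sim_threshold, criterion="distance")

labels = bucket_crash_reports(np.random.rand(100, 128))  # 100 reports -> bucket labels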
In accordance with implementations of the present disclosure, at least one group is used to debug the software system with respect to a respective root cause. For example, a developer can use one or more crash reports of a group to determine the root cause of the crashes represented by the group and can modify source code of the software system to resolve the root cause. In some examples, the modification of the source code can be pushed to one or more devices in an update to the software system.
As described herein, implementations of the present disclosure achieve one or more technical advantages and provide technical improvements over existing technological systems for detecting duplicate crash reports. Advantages and improvements achieved by implementations of the present disclosure are highlighted in a validation experiment that compares the duplicate crash report detection system of the present disclosure to other systems.
In further detail, a set of traditional duplicate crash report detection systems was identified for comparison to that of the present disclosure, and the same hierarchical clustering algorithm was used to group crashes into buckets based on the respective similarity detection approach. For the validation experiment, a crash report data set was defined that included approximately 11,000 crash reports generated from internal testing of a software system. The crash report data set was split based on time to define a training set, a validation set, and a testing set. In the validation experiment, a split ratio of 20:1:5 was used. Table 1 shows the breakdown of splitting the crash report data set into the training set, the validation set, and the test set.
To evaluate clustering performance, metrics of Purity, InversePurity, and F-measure were used, which are commonly used for evaluating clustering performance. Before computing these metrics, the Precision and Recall for each bucket were determined based on the following example relationships:

Precision(G_i, C_j) = |G_i ∩ C_j| / |C_j|

Recall(G_i, C_j) = |G_i ∩ C_j| / |G_i|

where G_i indicates the i-th actual group and C_j indicates the j-th cluster. The Purity, InversePurity, and F-measure were determined based on the following example relationships:

Purity = Σ_j (|C_j| / N) max_i Precision(G_i, C_j)

InversePurity = Σ_i (|G_i| / N) max_j Recall(G_i, C_j)

F-measure = Σ_i (|G_i| / N) max_j F(G_i, C_j), where F(G_i, C_j) = (2 · Precision(G_i, C_j) · Recall(G_i, C_j)) / (Precision(G_i, C_j) + Recall(G_i, C_j))

where N is the total number of testing samples.
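By way of non-limiting example, these metrics can be computed from group and cluster labels as follows in Python; the label-array representation is an illustrative assumption:

import numpy as np

def clustering_metrics(actual, predicted):
    """Compute Purity, InversePurity, and F-measure from label arrays.

    actual[k] is the true group (GID) of report k; predicted[k] is its cluster.
    """
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    N = len(actual)
    groups = [np.flatnonzero(actual == g) for g in np.unique(actual)]
    clusters = [np.flatnonzero(predicted == c) for c in np.unique(predicted)]

    def precision(G, C):
        return len(np.intersect1d(G, C)) / len(C)

    def recall(G, C):
        return len(np.intersect1d(G, C)) / len(G)

    def f1(G, C):
        p, r = precision(G, C), recall(G, C)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    purity = sum(len(C) / N * max(precision(G, C) for G in groups) for C in clusters)
    inverse_purity = sum(len(G) / N * max(recall(G, C) for C in clusters) for G in groups)
    f_measure = sum(len(G) / N * max(f1(G, C) for C in clusters) for G in groups)
    return purity, inverse_purity, f_measure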
For training a first sub-set of the existing technological systems, the training set was used to compute the required statistics (e.g., TF-IDF weights), and the hyperopt library was used to find the optimal parameters (which include the threshold for clustering) based on the validation set. For a second sub-set of the existing technological systems, an Adam optimizer was used with a learning rate of 1e-4, and other parameters were set to their default configurations.
For training the duplicate crash report detection system of the present disclosure, the frame2vec model was trained first, and the deep metric model was then optimized using the (trained) frame2vec model. The detailed configurations of the frame2vec and deep metric models are summarized in Table 2, where lr denotes the learning rate.
Each of the duplicate crash report detection systems was executed using a workstation with Intel® Xeon® Platinum 8260 CPU @ 2.40 GHz, 156-GB RAM, and the operating system was SUSE Linux Enterprise Server 15 SP1. In addition, to avoid the influence of noise (e.g., background applications), every compared system was executed on a rebooted machine.
To assess precision, experiments were executed five times for each system and the average Purity, InversePurity, and F-measure were determined as the final results. Experiment results are summarized in Table 3.
From observing the Purity values based on the crash report data set, four systems achieved 94-95%, including the duplicate crash report detection system of the present disclosure, the peak value being 94.34%. From observing the InversePurity scores, the duplicate crash report detection system of the present disclosure had the best performance, with the other systems being in the range of 82-84%. The duplicate crash report detection system of the present disclosure achieves a high InversePurity value because similar frames have similar frame representations. For the F-measure score, the duplicate crash report detection system of the present disclosure had the best performance. The experimental results demonstrate that the duplicate crash report detection system of the present disclosure can achieve better precision performance than the others, balancing Purity and InversePurity.
With regard to speed, as is known, a high-throughput crash bucketing system is important for large-scale crash bucketing. Intuitively, feature-based similarity measurement could speed up the crash bucketing task. To validate this, the average clustering time of all systems was determined for the five experiments. The performance results are summarized in Table 4.
The results of Table 4 show that the duplicate crash report detection system of the present disclosure is the second fastest, grouping 2,910 crash reports in 3.2 minutes. While one system is faster, at 0.35 minutes, that system is one of the worst performing systems with respect to precision (see Table 3). Consequently, the experimental results illustrate that the duplicate crash report detection system of the present disclosure can support large-scale duplication identification with improved precision over traditional systems.
Referring now to
The memory 620 stores information within the system 600. In some implementations, the memory 620 is a computer-readable medium. In some implementations, the memory 620 is a volatile memory unit. In some implementations, the memory 620 is a non-volatile memory unit. The storage device 630 is capable of providing mass storage for the system 600. In some implementations, the storage device 630 is a computer-readable medium. In some implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device. The input/output device 640 provides input/output operations for the system 600. In some implementations, the input/output device 640 includes a keyboard and/or pointing device. In some implementations, the input/output device 640 includes a display unit for displaying graphical user interfaces.
The features described can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The apparatus can be implemented in a computer program product tangibly embodied in an information carrier (e.g., in a machine-readable storage device, for execution by a programmable processor), and method steps can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output. The described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer can include a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer can also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
To provide for interaction with a user, the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, for example, a LAN, a WAN, and the computers and networks forming the Internet.
The computer system can include clients and servers. A client and server are generally remote from each other and typically interact through a network, such as the described one. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
A number of implementations of the present disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other implementations are within the scope of the following claims.