Testing and debugging database system applications is often challenging and time consuming. A database tester (or DB tester for short) has to detect a problem, determine why it happened, set up an environment to reproduce it, and then create a fix to resolve the problem. In many cases, problems appear in very complex scenarios, and thus the reproduction of a problem may be complex and difficult to understand. This makes the task of finding the root cause of the problem very difficult. As a consequence, a very time-consuming task for DB testers is finding a min-repro, a minimum configuration that reproduces a software problem. Finding a min-repro involves weeding out irrelevant inputs and finding a simpler, or the simplest, way to reproduce a problem. Currently, a great deal of searching for a min-repro is carried out manually, which is both slow and error-prone.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The min-repro finding technique described herein is designed to ease and speed-up the task of finding a min-repro, a minimum configuration that reproduces a problem in a database-related product. Specifically, in one embodiment, the technique simplifies the repro (the original configuration that caused the problem) using transformations in order to find one or more min-repros. One embodiment provides a high-level script language to automate some sub-tasks and to guide the search for simpler configurations that reproduce the problem. Yet another embodiment of the min-repro finding technique provides record-and-replay functionality, and provides an intuitive representation of results and the search space. These tools can save hours of time for both customers and testers to isolate the problem and can result in faster fixes and large cost savings to organizations.
In the following description of embodiments of the disclosure, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the disclosure.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the min-repro finding technique, reference is made to the accompanying drawings, which form a part thereof and which show by way of illustration examples by which the min-repro finding technique described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
The following sections provide an introduction to database debugging, an illustration of the min-repro environment, usage scenarios for employing the min-repro finding technique, a description of various features of the technique, and an exemplary process for employing the technique. An exemplary architecture and an exemplary User Interface are also provided.
Database software is complex along many dimensions, as it is comprised of a large number of features and execution components. An implicit assumption is that underlying database management system (DBMS) services are well tested, reliable and correct.
To ensure bug-free data management services, testing and debugging are two processes that are used hand in hand together. Testing can demonstrate the presence of a “bug,” and debugging is used to identify what caused it and how to fix it. Too often, the starting point for the debugging process is a very large setup configuration with many irrelevant inputs and variables. This is a consequence of either automatic randomized test generators, or real-world application scenarios. Of course, the shorter and more concise is the setup to reproduce a problem, the more likely it is that a tester will understand the root cause of the problem and effectively fix it. Conceptually, testers try to obtain a min-repro, i.e., a simpler or the “simplest possible” version of the input variables that still reproduce the original problem. Further removing or simplifying any input in a min-repro would make the problem not reproduce any longer.
Currently, there seems to be a missing link between testing (where a problem is found) and debugging (where the bugs are fixed). The present min-repro finding technique fills this gap in the database context with a system designed to weed out irrelevant inputs and simplify relevant inputs in a repro (a configuration that reproduces a problem) to ultimately find a min-repro (a simpler, or the simplest, configuration that reproduces the problem).
A min-repro environment can be described as follows.
Min Repro Given a configuration C composed of a set of inputs {i1, i2 . . . in} (e.g., queries, indexes, etc.), a set of database execution components ε and a problem specification P, find the minimum set of inputs or input configuration C′ that reproduces the problem P and removing or simplifying any input in the configuration C′ cannot reproduce the problem P any longer.
In one embodiment, the technique focuses on two database-specific input types, namely Data Manipulation Language (DML) statements (e.g., SQL queries) and physical structures (e.g., indexes). For example, a generic structure of a SQL query can be as follows:
For simplicity of presentation, an index physical structure is considered in the following discussion, however other physical structures can be handled similarly by the min-repro finding technique. Indexes consist of a sequence of key columns optionally followed by a sequence of suffix (include) columns and can be described using SQL as follows:
The present min-repro finding technique can be used for various purposes, such as DBMS testing and debugging, benchmarking and privacy-preserving technical assistance.
In any testing or debugging domain, when it comes to problem repeatability it is desirable to reduce a problem to the smallest and least complicated number of steps that can produce the bad result. Once the problem is reproducible and the fix is created, the min-repro configuration may become a part of an automated test suite for future testing and verification that the problem is not recurring.
Min-repros can also be used for software benchmarking. For example, min-repros can also be used to isolate the root cause of performance difference between successive releases of a database engine, or even to crisply contrast the performance/capabilities of different engines.
Often, corporations or enterprises encounter issues in their environments and need assistance from their database vendor. This naturally raises a number of legal and technical issues that must be addressed to preserve private and business-sensitive information through the control of the information flow amongst different entities. One embodiment of the min-repro finding technique can serve as a technical solution for preserving privacy in DBMS technical assistance. In order to not reveal business-sensitive information, an enterprise can create a smaller and simpler, and information-preserving configuration for the vendor to reproduce the same problem.
Below several definitions used in the rest of the specification are introduced. These include a definition of a test function that determines whether a problem occurs or not, the definition of a problem-reproducing configuration and the definition of a minimum problem reproducing configuration.
Definition 1. (Test Function) The test function: F(C, ε, P)→{T,F} determines for an input configuration C and environment ε whether problem P occurs or not.
Definition 2. (Min Reproducing Configuration) A problem-reproducing configuration C is minimal if F(C, ε, P) holds and ∀, C′ ⊂C, F(C′, ε, P) does not hold.
The definition above can be similarly applied to simplification of the internal structure/content of the inputs.
1.5.1 Feature Set
The min-repro finding technique provides many useful tools for testing and debugging database problems. These will be discussed in the paragraphs below.
1.5.1.1 Specification of an Initial Repro and a Problem:
The initial (large) repro is the initial configuration that creates the problem. In one embodiment of the min-repro finding technique, the problem is specified using a user-defined test function (UDTF). The UDTF allows users to specify the repro information and has the following three main parts: (1) a set of inputs (e.g., a complex query workload and a set of indexes), (2) a set of execution components (e.g., successive releases of a database engine), and (3) a set of rules describing the problem (e.g., the new engine performs worse by more than 10% compared to the old one). For example, in one embodiment, users can specify a UDTF using XML language or using a declarative language like SQL as illustrated below.
1.5.1.2 Sessions.
In one embodiment of the technique, each UDTF can be executed in a separate session with a unique identity or identifier. The session information can be saved and associated with the UDTF. The identifier, then, can be used to reload the session as and when needed. The sessions make the comparison of different runs (for the same UDTF) possible.
1.5.1.3 Modifications.
In order to find the minimum configuration that reproduces a problem, modifications to the initial configuration (repro) can be carried out via transformations. In one embodiment of the technique, there are two types of transformations, namely inter-transformations that are applicable to a set of inputs and intra-transformations that applicable to the “internal” content of an input. Inter-transformations are applied to whole inputs, e.g., removing a query from the input workload. Intra-transformations are more fine=grained and input-specific, e.g., query or index intra-transformations. A more detailed explanation of inter-transformations and intra-transformations follows.
1.5.1.3.1 Inter-Transformations.
Inter-transformations are used to find a simpler configuration that will reproduce the problem sought to be reproduced. The inter-transformations supported by one embodiment of the technique are illustrated in Table 1. They include removing inputs, making inputs immutable and partitioning inputs. Details of the inter-transformations follow.
(a) Removal: Any input i in a configuration C can be removed to obtain a new configuration C′=C−{i}.
(b) Immutability: Inputs i in a configuration C can be made immutable (to transformations). This may be useful, when no more simplification of certain inputs is desired.
(c) Partitioning: Inputs i in a configuration C can be partitioned into a set of input groups to obtain a set C* of new configurations C*={{C1},{C2} . . . } where ∀ CiεC*, Ci⊂C and ∪Ci=C. C* is the set of partitions of Ci which consists of C1, C2, etc. The restriction U Ci=C says that the union of all Ci is the same as the original configuration C.
1.5.1.3.2 Intra-Transformations:
Intra-transformations depend on the input type, e.g., query intra-transformations and index intra-transformations. The intra-transformations supported by one embodiment of the technique are illustrated in Table 2. They include query intra-transformations that are macros (e.g., SELECT simplification, FROM simplification, WHERE removal, WHERE simplification, GROUP BY simplification, GROUP BY removal, ORDER BY simplification, ORDER BY removal. Sub-query simplification and Sub-query removal) and custom transformations that are based on a SQL parse tree.
More specifically, in addition to transformations defined as macros, users can perform arbitrary intra-transformations on queries using a SQL parse tree. In one embodiment the technique employs a general SQL parser to parse an SQL statement into the parse tree. Then a visual representation of the parse tree of the current SQL query is exposed to the user, which contains detailed information about the SQL statement such as its type (SELECT, INSERT, UPDATE, DELETE or CREATE, etc.), which tables and fields are used in the statement, and different parts of the SQL statement are also available such as a WHERE clause, GROUP BY clause, HAVING clause, and so on. The user can select any node in the hierarchical parse tree and select a transformation (e.g., edit, remove, simplify) to be applied to the node and its children.
1.5.3 Search Strategy
Once the configurations are defined, a search can be made for one or more min-repros that create the problem that is sought to be reproduced. In one embodiment, the main steps of a min-repro search are as follows:
Users can specify the search strategy, by manipulating the following logical steps: (1) how to simplify (e.g., how to partition input set into subsets and how to simplify each input), (2) what to test (e.g., which “simpler” subset to test), (3) what to keep (if multiple simpler configurations reproduce the problem, which configuration should the search continue with), (4) when and where to backtrack (if the problem can no longer be reproduced after a simplification, which earlier state to backtrack to).
The strategies for simplifying, partitioning, testing and handling of multiple min-repros are described in Tables 3-6. Table 3 illustrates strategies for simplifying a repro (a configuration that reproduces a problem). These involve partitioning the initial configuration first and then testing different subsets of the initial configuration; or simplifying individual inputs first and then partitioning the inputs.
Table 4 illustrates strategies for partitioning a repro configuration that reproduces a problem. These include partitioning into n subsets, partitioning randomly, partitioning by input similarity, and partitioning by a rank function.
(a) Partition by n. In one embodiment of the technique, when partitioning by n, the technique breaks the current input configuration into n subsets. If there are different input types, each type is partitioned into n subsets.
(b) Partition random by n. In one embodiment of the technique, when partitioning randomly by n, the technique partitions current input configuration into n random subsets. One alternative is random partitioning, in which groups of inputs are formed by randomly selecting which input goes into which partition. The advantage in random partitioning is that it is generally less work to construct test partitions.
(c) Partition by similarity. Isolating problem-reproducing code changes can greatly profit from syntactic knowledge. All changes belonging to one class or one method can be combined, thereby reducing the amount of unresolved tests that occur during the minimization process. This is where a “similarity” function (per input type) becomes useful. This partitioning approach is similar in spirit to Equivalence Partitioning, Category Partition, and Domain Testing which are based on the model that the input space of the test object may be divided into subsets based on the assumption that all points in the same subset result in a similar behavior from the test object. This is called partition testing. Typically, in partition testing, the tester identifies test suites by selecting one or a few cases from each subset. The goal is to minimize the number of tests to run, yet to have a sufficient coverage. In one embodiment of the min-repro finding technique partitioning by similarity is used.
(d) Partition by rank. In one embodiment of the technique, inputs are characterized with respect to a certain rank function (e.g., input size), and then subsets are formed based on the rank of the inputs (e.g., all subsets must have a size ≦θ, where θ is a size threshold).
Testing strategies determine which subset(s) should be tested. This is where domain-specific combination strategies can become useful. Search can benefit from choosing “interesting” (for the current problem specification) inputs combinations. For example, one embodiment of the technique employs a “choose random” and “choose custom” testing strategy. While the latter chooses subsets based on a custom heuristic, the former randomly selects subsets of inputs for testing.
Table 6 depicts strategies to deal with multiple independent subsets that each reproduce the problem. First-repro stops testing subsets after finding one that fails, where smallest-repro tests all subsets before deciding which one, among those that failed, to choose
1.5.6 Test Script Language
In one embodiment, the technique allows a user to create scripts to expedite finding the min-repros (the minimum configurations that reproduce the problem sought to be duplicated). A high-level script language allows users to create custom scripts that are re-usable. In one embodiment, the technique script language, called TLDB (short for Test Language for Databases), uses XML as its primary syntax and is similar in spirit to the XML Expression Language (XEXPR) language. TLDB has several extensions (functions and keywords) specific to the problem domain. Using TLDB, test scripts can be created, and similar to transformations, can be applied to either a set of inputs or a particular input. Scripts encapsulate a general logic that can be then employed in the search for a min-repro in different scenarios. Existing algorithms (e.g., delta debugging) can be implemented in TLDB.
1.5.7 Simplify by Example Patterns
An intuitive way of debugging is when a user has tried a number of steps over time for similar problems and they have reproduced the wanted results. In one embodiment, the technique features “simplify-by-example,” which records user actions, generalizes them into a pattern, which is then available for replay, in either a manual min repro search or as a part of a script.
1.5.8 Visualization of Search Space and Test Results
Simply knowing which repro is reproducing a problem is one thing, but presenting it in an intuitive and understandable manner (especially in complex scenarios) is another. A singularly bad feature of many current debugging systems is the lack of attention that has been paid to the aspects of the debugging interface. In one embodiment, the min-repro finding technique employs a User Interface (UI) that provides a simple visualization of the search space and search results that can help DB testers in understanding what might have caused a given problem. An exemplary UI 500 employed in one embodiment of the technique is shown in
As shown in
A problem search module 608 employs a configuration simplifier (block 610) that simplifies the original configuration into simplified configurations (block 618), such as, for example by using the transformations and simplification methods previously discussed. A user can provide input to the configuration simplifier 610 in order to facilitate creating simplified configurations 618 to be used in finding one or more min-repros. A user can also use a UI (block 612) to input a script (block 614) specifying what actions to take when determining whether a simplified input configuration (block 618) is a min-repro or not. Alternately, the UI (block 612) can be used to record and playback a set of user actions (block 616).
The problem search module 608, using a simplified configuration (block 618) and the database management system (block 606) tests to see if the simplified input configuration reproduces the problem (block 626). If so, the simplified configuration is stored (block 620). Otherwise the simplified configuration is discarded. The most simple configuration (block 622) or one of the stored simplified configurations (block 620), can then be used to recreate the problem (block 624), such as, for example, in order to determine the cause of the problem and fix it.
An exemplary process 700 employing the min-repro finding technique is shown in
Various alternate embodiments of the technique are possible. The following paragraphs describe alternate embodiments of the min-repro finding technique.
1.8.1 Machine Learning-based Enhancements.
In one embodiment, the technique “learns” from earlier test results to guide its future strategy for min repro finding. For example, the technique can learn which transformation activities are most useful to a given situation and provide guidance to the user. As an example, if in previous debugging sessions it was found that simplifying the WHERE clause of the queries was useful, while removing indexes was not, this knowledge can be presented to the user to help make better decisions.
1.8.2 Extending Min Repro Support to Data.
Besides considering only queries and indexes as input, one embodiment of the technique considers data (stored in a database). It considers data as another input type and devises a set of “Data-based” simplification transformations (e.g., row-pruning, column pruning) on data tables that can find a min repro given data input—i.e., to find a minimum data (set) that can reproduce a certain problem.
1.8.3 Correlation of Transformations.
One embodiment of the technique considers the correlation between the inputs when choosing simplifying transformations, e.g., when simplifying an index, it considers which queries it will affect and the simplification transformations that have been performed on those inputs. One embodiment of the min-repro finding technique ranks different transformations based on their impact on the rest of the input configuration.
The min-repro finding technique is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which the min-repro finding technique can be implemented. The technique is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Device 800 also can contain communications connection(s) 812 that allow the device to communicate with other devices and networks. Communications connection(s) 812 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Device 800 may have various input device(s) 814 such as a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 816 include devices such as a display, speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
The min-repro finding technique may be described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The min-repro finding technique may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.
Number | Name | Date | Kind |
---|---|---|---|
5561763 | Eto et al. | Oct 1996 | A |
6138112 | Slutz | Oct 2000 | A |
6324683 | Fuh et al. | Nov 2001 | B1 |
20030200537 | Barsness et al. | Oct 2003 | A1 |
Entry |
---|
Andreas Zeller and Ralf Hildebrandt, “Simplifying and Isolating Failure-Inducing Input” IEEE Transactions on Software Engineering, vol. 28 No. 2, Feb. 2002 (http://www.cs.washington.edu/homes/mernst/teaching/6.883/readings/zeller-tse2002.pdf). |
Thomas J. Ostrand and March J. Balcer, “The Category-Partition Method for Specifying and Generating Functional Tests” Communications of the ACM 1988 (http://staff.unak.is/andy/MScTestingMaintenance/Lectures/CategoryPartitionMethod1988.pdf). |
Thomas J. Ostrand and March J. Balcer, “The Category-Partition Method for Specifying and Generating Functional Tests” Communications of the ACM 1988 (http://staff.unak.is/andy/MScTestingMaintenance/Lectures/CategoryPartitionMethod1988.pdf; hereinafter “Ostrand”). |
Thomas J. Ostand and March J. Balcer, “The Category-Partition Method for Specifying and Generating Functional Tests” Communications of the ACM 1988 (http://staff.unak.is/andy/MScTestingMaintenance/Lectures/CategoryPartitionMethod1988.pdf. |
Andreas Zeller and Ralf Hildebrandt, “Simplifying and Isolating Failure-Inducing Input” IEEE Transactions on Software Engineering, vol. 28 No. 2, Feb. 2002 (http://www.cs.washington.edu/homes/mernst/teaching/6.883/readings/zeller-tse2002.pdf; hereinafter “Zeller”). |
Zeller Andreas, “Automated Debugging: Are We Close?”, retrieved at<<http://ieeexplore.ieee.org/ielx5/2/20799/00963440.pdf?arnumber=963440>>, pp. 6. |
Whalley David B. , “Automatic Isolation of Compiler Errors”, retrieved at<<http://www.cs.fsu.edu/˜whalley/papers/acmtoplas94.ps>>, pp. 1-14. |
Zeller et al. , “Finding Failure Causes through Automated Testing”, retrieved at<<http://www.infosun.fim.uni-passau.de/st/papers/aadebug2000/aashort.pdf>>, pp. 6. |
Burger Martin , “Locating Failure-Inducing Code Changes in an Industrial Environment”, retrieved at<<http://www.st.cs.uni-saarland.de/publications/files/burger-thesis-2005.pdf>>, Dec. 14, 2005, pp. 118. |
Su et al. , “HDD: Hierarchical Delta Debugging”, retrieved at<<http://www.cs.ucdavis.edu/˜su/publications/icse06-hdd.pdf>>,May 20-28, 2006, Shanghai, China, pp. 10. |
Lu et al. , “Triage: Diagnosing Production Run Failures at the Users Site”, retrieved at<<http://opera.cs.uiuc.edu/paper/sosp63-tucek.pdf>>, Oct. 14-17, 2007, Stevenson, Washington, USA, pp. 14. |
Gupta et al. , “Dynamic Slicing Long Running Programs through Execution Fast Forwarding”, retrieved at<<http://www.cs.purdue.edu/homes/xyzhang/Comp/fse06.pdf>>, Nov. 5-11, 2006, Portland, Oregon, USA, pp. 11. |
Zeller Andreas, “Isolating Cause Effect Chains from Computer Programs”, retrieved at<<http://www.st.cs.uni-saarland.de/papers/fse2002/p201-zeller.pdf>>, Nov. 18-22, 2002, Charleston, SC, USA, pp. 10. |
Kollar et al. , “Database Tuning Advisor for Microsoft SQL Server 2005”, retrieved at<<http://www.vldb.org/conf/2004/IND4P3.PDF>>, Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004, pp. 1110-1121. |
Bruno et al. , “Constrained Physical Design Tuning”, retrieved at<<http://research.microsoft.com/pubs/74116/paper75-revised.pdf>>, Aug. 24-30, 2008, Auckland, New Zealand, pp. 12. |
Bruno et al. , “Configuration-Parametric Query Optimization for Physical Design Tuning”, retrieved at<<http://www.cs.purdue.edu/homes/mehme/papers/sigmod08cpqo.pdf>>, Jun. 9-12, 2008, Vancouver, BC, Canada , pp. 12. |
Nicol Gavin Thomas, “XEXPR—A Scripting Language for XML”, retrieved at<<http://www.w3.org/TR/xexpr/>>, Nov. 21, 2000, pp. 1-28. |
Balcer,et al. , “The Category-Partition Method for Specifying and Generating Fuctional Tests”, retrieved at<<http://portal.acm.org/ft—gateway.cfm?id=62964&type=pdf&coll=GUIDE&dl=GUIDE&CFID=21726511&CFTOKEN=51588138>>, Jun. 1988, pp. 676-686. |
Zeller Andreas, “Simplifying and Isolating Failure-Inducing Input”, retrieved at<<http://www.cs.purdue.edu/homes/jv/510s05/papers/puff.pdf, IEEE Transactions on Software Engineering, vol. 28, No. 2, Feb. 2002, pp. 1-17. |
Number | Date | Country | |
---|---|---|---|
20100241766 A1 | Sep 2010 | US |