Claims
- 1. A method for generating a statistical summary of a database, the database having a plurality of base relations, the method comprising steps of:
forming a sample-tuple set for at least one selected base relation of the plurality of base relations of the database, each sample-tuple set containing at least one sample tuple from a corresponding base relation; and forming a join synopsis set for each selected base relation, each join synopsis set containing a join synopsis for each sample tuple in a sample-tuple set, a join synopsis of a sample tuple being based on a join of the sample tuple and at least one descendent relation of the sample tuple, all join synopsis sets forming a statistical summary of the database.
- 2. The method according to claim 1, wherein the join synopsis of the sample tuple is based on a foreign key join of the sample tuple and at least one descendent relation of the sample tuple.
- 3. The method according to claim 1, wherein the join synopsis of the sample tuple is based on a join of the sample tuple and all descendent relations of the sample tuple.
- 4. The method according to claim 1, further comprising a step of storing all join synopsis sets.
- 5. The method according to claim 4, further comprising a step of allocating storage space among the respective join synopsis sets based on assumed characteristics of a query workload.
- 6. The method according to claim 5, wherein the step of allocating storage space among the respective join synopsis sets includes a step of dividing an allotted storage space equally between join synopsis sets.
- 7. The method according to claim 5, wherein the step of allocating storage space among the respective join synopsis sets includes a step of dividing an allocated storage space between the join synopsis sets in proportion to a cube root of a join synopsis tuple size of each respective join synopsis set.
- 8. The method according to claim 5, wherein the step of allocating storage space among the respective join synopsis sets includes a step of dividing an allocated storage space between the join synopsis sets in proportion to a join synopsis tuple size of each respective join synopsis set.
- 9. The method according to claim 4, further comprising steps of:
eliminating redundant columns of each join synopsis; and storing only selected columns of each join synopsis.
- 10. The method according to claim 1, further comprising steps of:
renormalizing tuples of each join synopsis into constituent relations of the join synopses; and removing duplicative tuples.
- 11. The method according to claim 1, wherein the steps of claim 1 are performed when the database is initialized.
- 12. The method according to claim 1, wherein a number of tuples allocated to each join synopsis is the same.
- 13. The method according to claim 1, further comprising steps of:
determining a fraction of queries in a query set for which each relation of the plurality of base relations of the database is one of a source relation in a join and a sole relation in a query without joins; and dividing an allocated storage space among join synopsis sets for minimizing an average relative error over the queries based on a high-level characterization of the query set.
- 14. The method according to claim 1, further comprising a step of generating an approximate answer to a query based on the join synopses.
- 15. The method according to claim 14, wherein the query is an aggregate query.
- 16. The method according to claim 14, further comprising steps of:
receiving a user query for an approximate answer; and reformulating the user query to be the query based on the join synopses.
- 17. The method according to claim 14, further comprising a step of generating a confidence bound for the approximate answer.
- 18. The method according to claim 17, wherein the confidence bound is based on one of a Hoeffding bound, a Chebychev (conservative) bound, a Chebychev (estimated σ) bound and a Central Limit Theorem bound.
- 19. The method according to claim 17, wherein the step of generating a confidence bound includes steps of:
partitioning the join synopses into a predetermined number of subsets; and generating an estimator for each subset.
- 20. The method according to claim 19, wherein the predetermined number of subsets is based on a desired confidence level.
- 21. The method according to claim 19, further comprising steps of:
determining an average of the estimators for the subsets; and generating the confidence bound based on the average of the estimators.
- 22. The method according to claim 19, further comprising steps of:
determining a median of the estimators for the subsets; and generating the confidence bound based on the median of the estimators.
- 23. The method according to claim 19, wherein the subsets are each a same size.
- 24. The method according to claim 19, further comprising a step of reporting a result based on the estimator for each subset.
- 25. The method according to claim 24, wherein the result is a summary of the estimator for each subset.
- 26. The method according to claim 1, further comprising steps of:
adding a new tuple to the sample-tuple set for a selected base relation with a probability p when the new tuple is inserted into the selected base relation, probability p being related to a ratio of a number of tuples in the sample-tuple set to a number of tuples in the selected base relation; and forming a join synopsis corresponding to the new tuple when the new tuple is added to the sample-tuple set for the selected base relation, the join synopsis for the new tuple being based on a join of the new tuple and at least one descendent relation of the new tuple.
- 27. The method according to claim 26, further comprising steps of:
uniformly selecting at random a tuple of the sample-tuple set for the selected base relation when a target size for the sample-tuple set is exceeded; and removing the selected tuple from the sample-tuple set.
- 28. The method according to claim 1, further comprising steps of:
removing a selected tuple from the sample-tuple set for a selected base relation when the selected tuple is removed from the selected base relation and is contained in the sample-tuple set for the selected base relation; and removing the join synopsis corresponding to the removed tuple.
- 29. The method according to claim 28, further comprising steps of:
repopulating the sample-tuple set from which the selected tuple was removed by rescanning the selected base relation; and forming a join synopsis for each tuple selected by rescanning the selected base relation.
- 30. A method for generating an approximate answer to a query in a database environment, the method comprising steps of:
receiving a query relating to a database; and generating an approximate answer to the query, the approximate answer being based on at least one join synopsis formed from the database.
- 31. The method according to claim 30, wherein the received query is a user query,
the method further comprising a step of reformulating the user query to be a query based on the join synopses.
- 32. The method according to claim 30, further comprising a step of generating a confidence bound for the approximate answer.
- 33. A program storage device, comprising:
a storage area; and information stored in the storage area, the information being readable by a machine, and tangibly embodying a program of instructions executable by the machine for performing method steps for generating a statistical summary of a database, the database having a plurality of base relations, the method comprising steps of:
forming a sample-tuple set for at least one selected base relation of the plurality of base relations of the database, each sample-tuple set containing at least one sample tuple from a corresponding base relation; and forming a join synopsis set for each selected base relation, each join synopsis set containing a join synopsis for each sample tuple in a sample-tuple set, a join synopsis of a sample tuple being based on a join of the sample tuple and at least one descendent relation of the sample tuple, all join synopsis sets forming a statistical summary of the database.
- 34. The program storage device according to claim 33, wherein the join synopsis of the sample tuple is based on a foreign key join of the sample tuple and at least one descendent relation of the sample tuple.
- 35. The program storage device according to claim 33, wherein the join synopsis of the sample tuple is based on a join of the sample tuple and all descendent relations of the sample tuple.
- 36. The program storage device according to claim 33, further comprising a step of storing all join synopsis sets.
- 37. The program storage device according to claim 36, further comprising a step of allocating storage space among the respective join synopsis sets based on assumed characteristics of a query workload.
- 38. The program storage device according to claim 36, further comprising steps of:
eliminating redundant columns of each join synopsis; and storing only selected columns of each join synopsis.
- 39. The program storage device according to claim 33, further comprising steps of:
renormalizing tuples of each join synopsis into constituent relations of the join synopses; and removing duplicative tuples.
- 40. The program storage device according to claim 33, wherein the steps of claim 33 are performed when the database is initialized.
- 41. The program storage device according to claim 33, further comprising steps of:
determining a fraction of queries in a query set for which each relation of the plurality of base relations of the database is one of a source relation in a join and a sole relation in a query without joins; and dividing an allotted storage space among join synopsis sets for minimizing an average relative error over the queries based on a high-level characterization of the query set.
- 42. The program storage device according to claim 33, further comprising a step of generating an approximate answer to a query based on the join synopses.
- 43. The program storage device according to claim 42, wherein the query is an aggregate query.
- 44. The program storage device according to claim 42, further comprising steps of:
receiving a user query for an approximate answer; and reformulating the user query to be the query based on the join synopses.
- 45. The program storage device according to claim 42, further comprising a step of generating a confidence bound for the approximate answer.
- 46. The program storage device according to claim 45, wherein the step of generating a confidence bound includes steps of:
partitioning the join synopses into a predetermined number of subsets; and generating an estimator for each subset.
- 47. The program storage device according to claim 46, further comprising a step of reporting a result based on the estimator for each subset.
- 48. The program storage device according to claim 33, further comprising steps of:
adding a new tuple to the sample-tuple set for a selected base relation with a probability p when the new tuple is inserted into the selected base relation, probability p being related to a ratio of a number of tuples in the sample-tuple set to a number of tuples in the selected base relation; and forming a join synopsis corresponding to the new tuple when the new tuple is added to the sample-tuple set for the selected base relation, the join synopsis for the new tuple being based on a join of the new tuple and at least one descendent relation of the new tuple.
- 49. The program storage device according to claim 48, further comprising steps of:
uniformly selecting at random a tuple of the sample-tuple set for the selected base relation when a target size for the sample-tuple set is exceeded; and removing the selected tuple from the sample-tuple set.
- 50. The program storage device according to claim 33, further comprising steps of:
removing a selected tuple from the sample-tuple set for a selected base relation when the selected tuple is removed from the selected base relation and is contained in the sample-tuple set for the selected base relation; and removing the join synopsis corresponding to the removed tuple.
- 51. A program storage device, comprising:
a storage area; and information stored in the storage area, the information being readable by a machine, and tangibly embodying a program of instructions executable by the machine for performing method steps for generating an approximate answer to a query in a database environment, the method comprising steps of:
receiving a query relating to a database; and generating an approximate answer to the query, the approximate answer being based on at least one join synopsis formed from the database.
- 52. The program storage device according to claim 51, wherein the received query is a user query,
the method further comprising a step of reformulating the user query to be a query based on the join synopses.
- 53. The program storage device according to claim 51, further comprising a step of generating a confidence bound for the approximate answer.
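The following Python sketch is illustrative only and forms no part of the claims. It shows one possible reading of claims 1 to 3 (forming a join synopsis for each sample tuple by following foreign keys through all descendent relations), claims 26 and 27 (maintaining the sample on insertion), and claims 19, 21 and 22 (partitioning the join synopses into equal-size subsets and generating an estimator for each). The toy schema, the function names, the in-memory dictionary layout, and the exact form chosen for the probability p are all assumptions made for the example.

```python
import random
import statistics

# Hypothetical toy schema (not from the source): lineitem -> orders -> customer.
# Each entry maps a relation to its (foreign-key column, referenced relation) pairs.
FOREIGN_KEYS = {
    "lineitem": [("order_id", "orders")],
    "orders": [("cust_id", "customer")],
}

def make_toy_db():
    """Build a small in-memory database: relation name -> {primary key: row}."""
    customer = {1: {"id": 1, "region": "EU"}, 2: {"id": 2, "region": "US"}}
    orders = {o: {"id": o, "cust_id": 1 + o % 2} for o in range(1, 5)}
    lineitem = {i: {"id": i, "order_id": 1 + i % 4, "price": 10 * i}
                for i in range(1, 13)}
    return {"customer": customer, "orders": orders, "lineitem": lineitem}

def join_with_descendants(row, relation, db, fks=FOREIGN_KEYS):
    """Join one tuple with every relation reachable through its foreign keys."""
    wide = {f"{relation}.{col}": val for col, val in row.items()}
    for fk_col, target in fks.get(relation, []):
        wide.update(join_with_descendants(db[target][row[fk_col]], target, db, fks))
    return wide

def build_join_synopsis_set(relation, sample_size, db, rng, fks=FOREIGN_KEYS):
    """Draw a uniform sample of `relation`; form one join synopsis per sample tuple."""
    rows = list(db[relation].values())
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return [join_with_descendants(r, relation, db, fks) for r in sample]

def on_insert(synopsis, new_row, relation, db, target_size, rng, fks=FOREIGN_KEYS):
    """Update the synopsis set after `new_row` has been inserted into `relation`."""
    # Assumed form of p: target sample size over current relation size.
    p = min(1.0, target_size / len(db[relation]))
    if rng.random() < p:
        synopsis.append(join_with_descendants(new_row, relation, db, fks))
        if len(synopsis) > target_size:
            # Evict a uniformly chosen member once the target size is exceeded.
            synopsis.pop(rng.randrange(len(synopsis)))

def subset_estimates(synopsis, relation_size, value_col, predicate, k):
    """Split the synopsis into k equal-size subsets; return one scaled SUM per subset."""
    size = len(synopsis) // k
    if size == 0:
        raise ValueError("need at least k synopsis tuples")
    estimates = []
    for i in range(k):
        chunk = synopsis[i * size:(i + 1) * size]
        matched = sum(t[value_col] for t in chunk if predicate(t))
        estimates.append(matched * relation_size / size)
    return estimates

if __name__ == "__main__":
    db = make_toy_db()
    synopsis = build_join_synopsis_set("lineitem", 6, db, random.Random(0))
    ests = subset_estimates(synopsis, len(db["lineitem"]), "lineitem.price",
                            lambda t: t["customer.region"] == "EU", k=3)
    # The average or the median of the subset estimators is the reported answer.
    print(statistics.mean(ests), statistics.median(ests))
```

In this sketch an aggregate query that would otherwise join three relations is evaluated against the single wide synopsis with no join at query time. The spread of the per-subset estimates gives a rough indication of how far the reported average or median may lie from the exact answer.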
CROSS-REFERENCE TO RELATED APPLICATION
[0001] The present application is a continuation-in-part application of application Ser. No. 09/081,660, entitled System and Techniques For Fast Approximate Query Answering, filed May 20, 1998, and invented by S. Acharya et al., and claims priority to provisional patent application 60/125,244, filed Mar. 19, 1999, both of which are incorporated by reference herein.
Provisional Applications (1)
| Number | Date | Country |
|---|---|---|
| 60125224 | Mar 1999 | US |
Divisions (1)
| | Number | Date | Country |
|---|---|---|---|
| Parent | 09480261 | Jan 2000 | US |
| Child | 10216295 | Aug 2002 | US |
Continuation in Parts (1)
| | Number | Date | Country |
|---|---|---|---|
| Parent | 09081660 | May 1998 | US |
| Child | 09480261 | Jan 2000 | US |