This invention relates generally to information storage and retrieval. More particularly, this invention relates to a technique for random database sampling with repeatable results.
A database is an organized collection of digital data. A database management system enforces data quality in a database, which is measured in terms of accuracy, availability, usability and resilience.
A data warehouse is a database used for reporting and analysis. The data stored in the data warehouse is uploaded from an operational database. A data warehouse maintains data history and may integrate data from multiple source systems. Consequently, a data warehouse may accumulate very large volumes of data. This makes it difficult to identify characteristics of the data, since characterizing the data requires processing the entire volume. To address this issue, one may randomly sample data. Random sampling provides a characterization of data in a large data store, but it does not provide repeatable results since each sampling iteration accesses different data.
In view of the foregoing, it would be desirable to provide a technique for random database sampling with repeatable results.
A method of sampling data in a database includes designating permanent read locations in a database. The database is populated with randomly loaded data. The permanent read locations in the database are accessed to form sampled repeatable results attributable to the permanent read locations and the randomly loaded data.
A non-transitory computer readable storage medium includes executable instructions to sort data segments by identifiers to form pairs of identifiers and corresponding data segments. An attribute is ascribed to each data segment to form pairs of identifiers and corresponding attributes. The pairs of identifiers and corresponding attributes are randomly loaded into a database.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The advantages of the invention are more fully appreciated with reference to the system 300 of
System 300 is a communication network that includes devices to capture and record header, flow, and content information of network data packets. In particular, the system 300 includes a set of servers 302 that communicate with a set of users or clients 304 through a network switch 306. A capture appliance 308 is connected to network switch 306. The capture appliance 308 captures and stores all traffic through network switch 306. The capture appliance 308 loads all of the data into a bulk repository 310. The capture appliance 308 operates with indexing databases 312 to form identifiers for data segments. This results in pairs of identifiers and corresponding data segments. One or more attributes are ascribed to each data segment. This results in pairs of identifiers and corresponding attributes. In one embodiment, the sampling module loads pairs of identifiers and corresponding attributes into the indexing databases 312 in a random manner.
Since system 300 stores all network traffic, the resultant bulk repository 310 and indexing databases 312 are very large. The sampling techniques of the invention are therefore well suited to such a context. However, other contexts, such as data warehouses, may also successfully deploy the disclosed techniques.
Returning to
The use of attributes operates to condense the amount of information that is stored in a database. The bulk repository 310 stores the entire data packet. The attributes characterize the stored data. The attributes facilitate the search for information in the bulk repository. That is, a search for a selected attribute results in a match between the selected attribute and corresponding attributes in the database, thereby forming matched corresponding attributes. The matched corresponding attributes have corresponding identifiers.
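By way of illustration only, the relationship between identifiers, attributes, and the bulk repository may be sketched as follows. This is a minimal sketch, not the disclosed implementation: the in-memory structures, the names bulk_repository, index_rows, and search_by_attribute, and the attribute values "http" and "dns" are all assumptions introduced for the example.

```python
# Hypothetical in-memory stand-ins for the bulk repository and the index.
# All names and values here are illustrative assumptions.
bulk_repository = {
    1: b"full packet bytes for segment 1",
    2: b"full packet bytes for segment 2",
    3: b"full packet bytes for segment 3",
}

# Index rows are (identifier, attribute) pairs, which are much smaller
# than the packets they describe.
index_rows = [(1, "http"), (2, "dns"), (3, "http")]


def search_by_attribute(rows, selected_attribute):
    """Return the identifiers whose attribute matches the selected attribute."""
    return [identifier for identifier, attribute in rows
            if attribute == selected_attribute]


# The matched attributes yield identifiers, which in turn locate the
# full data segments in the bulk repository.
matched_ids = search_by_attribute(index_rows, "http")
matched_segments = [bulk_repository[i] for i in matched_ids]
```

In this sketch the search touches only the compact index; the bulk repository is consulted only for the identifiers that matched.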
The next operation of
In the prior art, a row number corresponds with an identifier number since the database is loaded sequentially. However, in accordance with the invention, the identifiers and their corresponding attributes are loaded randomly. In this example, the sequence of values in the identifier column 602 is 8, N, 2, 6, 3, 1, etc.
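The contrast between sequential and random loading may be sketched as follows. This is a hedged illustration under stated assumptions: the function load_randomly, the use of Python's random.Random for shuffling, and the placeholder attribute strings are not taken from the disclosure.

```python
import random


def load_randomly(pairs, rng=None):
    """Return table rows holding the given pairs in a random order.

    `rng` is an optional random.Random instance; passing a seeded one
    is only a convenience for reproducing this demonstration.
    """
    rng = rng or random.Random()
    rows = list(pairs)   # copy so the caller's sequence is left untouched
    rng.shuffle(rows)    # row position no longer tracks the identifier
    return rows


# Pairs of identifiers and corresponding attributes, initially in
# identifier order (as a sequential load would store them).
pairs = [(i, f"attr-{i}") for i in range(1, 9)]

# After random loading, the identifier column is in no particular order.
table = load_randomly(pairs, random.Random(0))
identifier_column = [identifier for identifier, _ in table]
```

The shuffle happens once, at load time. After that the table is fixed, which is what allows a fixed set of rows to serve as a random sample.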
Permanent read locations may now be designated. For example, permanent read locations may be a first sub-set of the rows in the database. Consider a simple example of a table with 100 rows. If one desired a 10% sample, then the first 10 rows would be designated as the permanent read locations. If one desired a 50% sample, then the first 50 rows would be designated as the permanent read locations. Different random sample sets may also be obtained by varying the starting location. Again, consider the simple example of a table with 100 rows, now split into 10 different non-overlapping data sample sets (for example, rows 1 through 10, rows 11 through 20, and so on), each representing a 10% sample of all data. Each sample set is repeatable and distinct from the other sets. Thus, different result sets can be obtained in a repeatable fashion by varying either or both of the starting segment and starting offset within the starting segment. The foregoing examples specify rows in a database, but any location in the database may be used in accordance with embodiments of the invention.
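The 100-row example may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names build_table and read_sample, the integer percent parameter, the zero-based sample_set index, and the seeded shuffle used to populate the table are all introduced here for illustration.

```python
import random


def build_table(n_rows, seed=0):
    """Populate a table with (identifier, attribute) pairs in random order."""
    rows = [(i, f"attr-{i}") for i in range(1, n_rows + 1)]
    random.Random(seed).shuffle(rows)
    return rows


def read_sample(table, percent, sample_set=0):
    """Read a fixed block of rows that serves as a repeatable random sample.

    `percent` sets the block size; `sample_set` selects which
    non-overlapping block to read (0 means the first rows of the table).
    """
    size = len(table) * percent // 100
    start = sample_set * size
    if size == 0 or start + size > len(table):
        raise ValueError("sample block lies outside the table")
    return table[start:start + size]


table = build_table(100)

first_10_percent = read_sample(table, 10)        # rows 1 through 10
same_again = read_sample(table, 10)              # identical on every read
second_set = read_sample(table, 10, sample_set=1)  # rows 11 through 20
half = read_sample(table, 50)                    # rows 1 through 50
```

Because the rows were placed randomly when the table was loaded, each fixed block is a random sample of the data, and because the block boundaries never change, reading the same block again returns the same rows.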
The remaining operations of
An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using JAVA®, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications; they thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.