The present invention relates to random sampling within distributed processing systems with very large data sets, and more particularly, to an improved system and method for reservoir sampling of distributed data, including distributed data streams.
Random sampling has been widely used in database applications. A random sample can be used, for instance, to perform sophisticated analytics on a small portion of data that would otherwise be prohibitively expensive to apply to terabytes or petabytes of data. In this era of Big Data, data is virtually unlimited and must often be processed as unbounded streams. Data has also become more and more distributed, as evidenced by recent processing models such as MapReduce.
A random sample is a subset of data that is statistically representative of an entire data set. When the data is centralized and its size is known prior to sampling, it is fairly straightforward to obtain a random sample. However, many applications deal with data that is both distributed and never-ending. One example is distributed data stream applications, such as sensor networks. Random sampling for this kind of application is more difficult for two main reasons. First, the size of the data is unknown; hence, it is not possible to predetermine the sampling probability before sampling starts. Second, the data is distributed by nature, and accordingly, it is not feasible to redistribute or duplicate the data to a central processing unit for sampling. Together, these two challenges raise the question of how to obtain a random sample of distributed data efficiently while guaranteeing sample uniformity. Described below is a novel technique that addresses this problem. The devised technique is applicable to traditional distributed database systems, distributed data streams, and modern processing models such as MapReduce. This solution is easily implemented within a Teradata Unified Data Architecture™ (UDA), illustrated in the accompanying figures.
The system may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views.
The data sampling techniques described herein can be used to sample table data and data streams within a Teradata Unified Data Architecture™ (UDA) system 100, illustrated in the accompanying figures.
The Teradata Database System 110 is a massively parallel processing (MPP) relational database management system including one or more processing nodes that manage the storage and retrieval of data in data storage facilities. Each of the processing nodes may host one or more physical or virtual processing modules, referred to as access module processors (AMPs). Each of the processing nodes manages a portion of a database that is stored in a corresponding data storage facility. Each data storage facility includes one or more disk drives or other storage media. The system stores data in one or more tables in the data storage facilities, wherein table rows may be stored across multiple data storage facilities to ensure that the system workload is distributed evenly across the processing nodes 115. Additional description of a Teradata Database System is provided in U.S. patent application Ser. No. 14/983,804, titled “METHOD AND SYSTEM FOR PREVENTING REUSE OF CYLINDER ID INDEXES IN A COMPUTER SYSTEM WITH MISSING STORAGE DRIVES” by Gary Lee Boggs, filed on Dec. 30, 2015, which is incorporated by reference herein.
The Teradata Aster Database 120 is also based upon a massively parallel processing (MPP) architecture, in which tasks are run simultaneously across multiple nodes for more efficient processing. The Teradata Aster Database includes multiple analytic engines, such as SQL, MapReduce, and Graph, designed to provide optimal processing of analytic tasks across massive volumes of structured, unstructured, and multi-structured data, referred to as Big Data, which is not easily processed using traditional database and software techniques. Additional description of a Teradata Aster Database System is provided in U.S. patent application Ser. No. 15/045,022, titled “COLLABORATIVE PLANNING FOR ACCELERATING ANALYTIC QUERIES” by Derrick Poo-Ray Kondo et al., filed on Feb. 16, 2016, which is incorporated by reference herein.
The Teradata UDA system illustrated in the accompanying figures combines the data engines described above within a single architecture.
The Teradata UDA System 100 may incorporate or involve other data engines including cloud and hybrid-cloud systems.
Data sources 140 shown in the accompanying figures provide the table data and data streams to be sampled.
A very well-known technique for sampling over data streams is reservoir sampling. A reservoir sample always holds a uniform random sample of data collected thus far. This technique has been used in many database applications, such as approximate query processing, query optimization, and spatial data management.
Additional description of reservoir sampling is provided in the paper titled “Random sampling with a reservoir” by Jeffrey S. Vitter, published in ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985, Pages 35-57.
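For reference, the following is a minimal Python sketch of the conventional single-stream reservoir sampling algorithm (Vitter's Algorithm R) described above; the function name and parameter names are illustrative and not taken from the source.

import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of size k over a stream of unknown length.
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Replace a randomly chosen reservoir slot with probability k/n, which
            # keeps every element seen so far equally likely to be in the sample.
            j = random.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

At any point during processing, the reservoir holds a uniform random sample of all elements observed so far, without requiring the total stream length to be known in advance.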
Described herein is a novel reservoir-based sampling technique that leverages the conventional reservoir sampling algorithm for distributed data. A typical application of the devised technique is distributed data stream applications. In these applications, multiple data streams are generated, for instance, by sensors deployed across distributed locations. The processing unit of each sensor node needs to sample from its data stream individually, and a final sample needs to be generated that represents all data streams.
A primary concern with generating a final sample from multiple data stream samples is maintaining the uniformity of the final sample while each data stream is sampled independently. To illustrate this problem, assume a random sample R of size |R| is to be drawn from two data streams S1 and S2, where |S1| and |S2| denote the number of data elements generated so far by S1 and S2, respectively. The straightforward approach for generating a sample from the two data streams is to redistribute one data stream to the other and take a random sample of size |R| from the combined data set of size |S1|+|S2|. Note that in this case, there are C(|S1|+|S2|, |R|) different possible samples of size |R| that can be selected from the |S1|+|S2| elements, where C(n, k) denotes the binomial coefficient, that is, the number of distinct k-element subsets of an n-element set. Without redistribution, each of the streams S1 and S2 must be sampled individually. Assume that two random samples R1 and R2 are drawn independently from S1 and S2, such that the size of each sample is proportional to the number of elements seen from its stream thus far, and that the two samples are then combined to produce R. That is, |R1|=|R|(|S1|/(|S1|+|S2|)) and |R2|=|R|(|S2|/(|S1|+|S2|)), such that |R1|+|R2|=|R|. In this case, the number of different samples that can eventually be obtained is C(|S1|, |R1|)·C(|S2|, |R2|). It is clear that this number is less than C(|S1|+|S2|, |R|), which indicates that there are some possible random samples that cannot be generated by this method. To ensure uniformity, a sampling technique has to be able to generate as many possible sample combinations as the straightforward approach would generate.
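As a quick illustration of the counting argument above, the following Python snippet compares the two counts using stream sizes chosen purely for this example (they are not values from the source):

from math import comb

# Two streams of 6 elements each, desired sample size 4.
# Proportional allocation fixes |R1| = |R2| = 2.
pooled = comb(6 + 6, 4)                 # C(12, 4) = 495 possible samples when the streams are pooled
fixed_split = comb(6, 2) * comb(6, 2)   # C(6, 2) * C(6, 2) = 225 samples reachable with a fixed split
print(pooled, fixed_split)              # 495 225 -> the fixed split cannot produce every possible sample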
The proposed novel multi-level reservoir sampling technique, illustrated in the accompanying figures, operates at two levels. At the first level, each data stream is sampled individually using conventional reservoir sampling, producing an intermediate, or Level 1, reservoir for that stream. At the second level, a final sample of size |R| is assembled from the Level 1 reservoirs. For two streams S1 and S2, the number of elements i selected from the Level 1 reservoir of S1, with the remaining |R|-i elements selected from the Level 1 reservoir of S2, is chosen according to the probability function P(i)=C(|S1|, i)·C(|S2|, |R|-i)/C(|S1|+|S2|, |R|). Since i can be anywhere from 0 to |R|, the number of possible random sample combinations that can be generated using the proposed technique is the sum, over i from 0 to |R|, of C(|S1|, i)·C(|S2|, |R|-i), which equals C(|S1|+|S2|, |R|) by Vandermonde's identity, exactly the number of combinations the straightforward approach would generate.
This, therefore, verifies that the proposed multi-level sampling technique guarantees the uniformity of the final sample.
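The following is a minimal Python sketch of the second-level merge step described above, assuming each stream maintains a Level 1 reservoir of size |R| via the conventional algorithm shown earlier and tracks the number of elements it has seen; the function names are illustrative and not taken from the source.

import random
from math import comb

def choose_split(n1, n2, r):
    # Draw i, the number of final-sample elements taken from the first stream,
    # with probability C(n1, i) * C(n2, r - i) / C(n1 + n2, r).
    weights = [comb(n1, i) * comb(n2, r - i) for i in range(r + 1)]
    return random.choices(range(r + 1), weights=weights, k=1)[0]

def merge_level1_reservoirs(res1, n1, res2, n2, r):
    # res1 and res2 are the Level 1 reservoirs; n1 and n2 are the counts of
    # elements seen so far from each stream. Returns a final sample of size r.
    i = choose_split(n1, n2, r)
    return random.sample(res1, i) + random.sample(res2, r - i)

For example, with n1 = 10, n2 = 5, and r = 4, the call merge_level1_reservoirs(res1, 10, res2, 5, 4) most often selects three elements from the first reservoir and one from the second, matching the allocation discussed in the example below.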
A key property of the multi-level sampling technique is that it achieves 100% uniformity in the final sample while taking into consideration the proportion of data from which the sample is drawn. Consider the following example with two streams S1 and S2, where the number of data elements seen from S1 is 10 and from S2 is 5, and a random sample of size 4 is desired. It is expected that more data elements will be selected from S1 than from S2, as S1 has more elements. The improved multi-level reservoir sampling technique achieves this result when it decides how many elements to select from each intermediate, or Level 1, reservoir using the probability function discussed above. Table 1 shows the probability of selecting a certain number of elements from stream S1:

TABLE 1
Elements selected from S1    Probability
0                            5/1365 ≈ 0.004
1                            100/1365 ≈ 0.073
2                            450/1365 ≈ 0.330
3                            600/1365 ≈ 0.440
4                            210/1365 ≈ 0.154
Note two points about the data in Table 1. First, the highest probability is to select 3 elements from S1 and the remainder (which in this case is 4−3=1) from S2. This means that the algorithm favors S1 over S2 because S1 has more elements. Second, the sum of all the probabilities equals 1. This demonstrates that uniformity is achieved by the sampling algorithm, because it indicates that the devised algorithm can produce the same number of different random samples of size 4 as could be obtained by combining S1 and S2 together before sampling.
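The values in Table 1 can be recomputed directly from the probability function given above; the following Python snippet is a small verification of the worked example:

from math import comb

# Weights for |S1| = 10, |S2| = 5, and a desired sample of size 4.
weights = [comb(10, i) * comb(5, 4 - i) for i in range(5)]
total = comb(15, 4)
print(weights)                                  # [5, 100, 450, 600, 210]
print(total)                                    # 1365
print(sum(weights) == total)                    # True -> the probabilities sum to 1
print([round(w / total, 3) for w in weights])   # [0.004, 0.073, 0.33, 0.44, 0.154]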
The sampling technique illustrated in the accompanying figures may thus be applied to sample both table data and distributed data streams within the system described above.
The multi-level sampling technique described above and illustrated in the figures addresses an important problem in an efficient manner. As aforementioned, random sampling is an indispensable functionality for any data management system. With data continuously evolving and naturally being distributed, this improved sampling technique becomes even more important. It is theoretically proven and practically implementable. It can be implemented for traditional distributed database systems, distributed data streams, and modern processing models (e.g., MapReduce). It is easily implemented in commercial and open-source database and big data systems, such as the Teradata Unified Data Architecture™ (UDA), illustrated in the accompanying figures.
The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed.
Additional alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teaching. Accordingly, this invention is intended to embrace all alternatives, modifications, equivalents, and variations that fall within the spirit and broad scope of the attached claims.