METHOD AND SYSTEM FOR IMPLEMENTING ANALYTIC FUNCTION BASED ON MAPREDUCE

Description

FIELD OF THE TECHNOLOGY

The present disclosure relates to the field of data warehouses, and in particular, to a method and system for implementing an analytic function based on MapReduce.

BACKGROUND OF THE DISCLOSURE

A data warehouse is a warehouse in which data is organized, stored, and managed according to a data structure. With popularization of computers, the data warehouse has been widely applied in work and life. Currently, with rapid development of Internet and information technologies, the data warehouse not only can store and manage data, but also has a strong data analysis capability. Common databases such as ORACLE and PostgreSQL all provide multiple analytic functions to analyze data according to user needs and provide analytic results to users. The analytic function is used to calculate an aggregate value based on a data group. Differing from the aggregate function, the analytic function returns multiple rows of data after processing the data group, while the aggregate function returns one row of data after processing the data group.

MapReduce is a programming model and is used to perform parallel computing on large-scale data sets. Currently, a distributed data warehouse (such as a Hive data warehouse) based on a MapReduce framework cannot use the analytic function to perform data processing, which brings much inconvenience in a process of using the database.

SUMMARY

Embodiments of the present application provide a method and system for implementing an analytic function based on MapReduce, which can solve a problem that for a distributed database based on a MapReduce framework, the analytic function cannot be used to perform data processing.

In order to achieve the foregoing objective, the following technical solutions are used in the embodiments of the present application.

According to a first aspect, an embodiment of the present application provides a method for implementing an analytic function based on MapReduce, including: a table scan operator acquiring a data row from a file block, and sending the data row to a reduce sink operator; upon receipt of the data row, the reduce sink operator determining a reduce key, a partition key, and a sort key of the analytic function, and sending the data row to an analysis operator by means of a MapReduce framework, the analysis operator belonging to a Reduce end of the MapReduce framework; and upon receipt of the data row, the analysis operator analyzing the data row to obtain an analytic result, and forwarding the data row and the analytic result to a subsequent operator.

According to a second aspect, an embodiment of the present application further provides a computing system for implementing an analytic function based on MapReduce, the computing system including one or more processors and memory for storing a plurality of program modules to be executed by the one or more processors and the plurality of program modules further including: a table scan operator module, a reduce sink operator module, and an analysis operator module, the table scan operator module being configured to acquire a data row from a file block, and send the data row to the reduce sink operator module; the reduce sink operator module being configured to receive the data row, determine a reduce key, a partition key, and a sort key of the analytic function, and send the data row to the analysis operator module by means of a MapReduce framework, the analysis operator module belonging to a Reduce end of the MapReduce framework; and the analysis operator module being configured to receive the data row, analyze the data row to obtain an analytic result, and forward the data row and the analytic result to a subsequent operator module.

According to a third aspect, an embodiment of the present application further provides a non-transitory computer readable medium in conjunction with a computing system having one or more processors, the computer readable medium storing a plurality of program modules to be executed by the one or more processors for implementing an analytic function based on MapReduce, the plurality of program modules further comprising: a table scan operator module, a reduce sink operator module, an analysis operator module, and a subsequent operator module, the table scan operator module being configured to acquire a data row from a file block, and send the data row to the reduce sink operator module; the reduce sink operator module being configured to receive the data row, determine a reduce key, a partition key, and a sort key of the analytic function, and send the data row to the analysis operator module by means of a MapReduce framework, the analysis operator module belonging to a Reduce end of the MapReduce framework; and the analysis operator module being configured to receive the data row, analyze the data row to obtain an analytic result, and forward the data row and the analytic result to the subsequent operator module.

The method and system for implementing an analytic function based on MapReduce provided in the embodiments of the present application can be applied in a distributed database based on a MapReduce framework (such as a Tencent distributed data warehouse and a Hive database) to implement data analysis and add a function of the distributed database based on the MapReduce framework, so that a user can perform data analysis in the distributed database based on the MapReduce framework.

BRIEF DESCRIPTION OF THE DRAWINGS

To describe the technical solutions of the embodiments of the present application or the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.

FIG. 1 is a schematic flowchart of a method for implementing an analytic function based on MapReduce according to Embodiment 1 of the present application;

FIG. 2 is a schematic flowchart of a method for implementing an analytic function based on MapReduce according to Embodiment 2 of the present application;

FIG. 3 is a schematic structural diagram of an analysis operator buffer according to Embodiment 2 of the present application;

FIG. 4 is a schematic structural diagram of an analyzer buffer according to Embodiment 2 of the present application;

FIG. 5A to FIG. 5D and FIG. 6A to FIG. 6D separately are schematic diagrams of a window mode according to Embodiment 2 of the present application;

FIG. 7 is a schematic structural diagram of a system for implementing an analytic function based on MapReduce according to Embodiment 3 of the present application; and

FIG. 8 is a schematic structural diagram of an analysis operator module 53 shown in FIG. 7.

DESCRIPTION OF EMBODIMENTS

The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are merely some of the embodiments of the present application rather than all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative efforts shall fall within the protection scope of the present disclosure.

Embodiment 1

This embodiment of the present application provides a method for implementing an analytic function based on MapReduce. The method is applicable to data analysis in a distributed data warehouse based on a MapReduce framework. As shown in FIG. 1, the method includes the following steps.

Step 101: A table scan operator acquires a data row from a file block, and sends the data row to a reduce sink operator.

Step 102: The reduce sink operator receives the data row, determines a reduce key, a partition key, and a sort key of the analytic function, and sends the data row to an analysis operator by means of a MapReduce framework, where the analysis operator belongs to a Reduce end of the MapReduce framework.

Step 103: The analysis operator receives the data row, analyzes the data row to obtain an analytic result, and forwards the data row and the analytic result to a subsequent operator.

The subsequent operator may be determined according to operations needed by specific situations, for example, may be an aggregate operator, a filter operator, or a file operator, but is not limited thereto.

The method for implementing an analytic function based on MapReduce provided in this embodiment of the present application can be applied in a distributed database based on a MapReduce framework (such as a Tencent distributed data warehouse and a Hive data warehouse) to perform data analysis by using an analytic function. The method thereby adds a function to the distributed database based on the MapReduce framework, so that a user can use the analytic function in the distributed database to perform data analysis.

Embodiment 2

Step 201: A table scan operator acquires a data row from a file block, and sends the data row to a reduce sink operator.

It should be noted that, in the method provided in this embodiment, multiple different analytic functions may be preset to analyze data. Exemplary analytic functions, for example, may include LAG, LEAD, RANK, DENSE_RANK, ROW_NUMBER, SUM, COUNT, AVG, MAX, MIN, or RATIO_TO_REPORT. Optionally, in the method provided in this embodiment, a new analytic function may be added according to user needs.

Step 202: The reduce sink operator receives the data row, determines a reduce key, a partition key, and a sort key of the analytic function, and sends the data row to an analysis operator by means of a MapReduce framework, where the analysis operator belongs to a Reduce end of the MapReduce framework.

For example, the reduce sink operator may determine the reduce key, the partition key, and the sort key of the analytic function by using the following method. The method may specifically include:

(1) when the analytic function comprises a partition by clause and/or an order by clause, using a column in the partition by clause and/or a column in the order by clause of the analytic function as the reduce key; when the analytic function does not comprise an order by clause but comprises a distinct keyword, using a distinct column as the reduce key; and when the analytic function does not comprise a partition by clause, an order by clause, or a distinct keyword, designating any constant as the reduce key;

(2) when the analytic function comprises the partition by clause, using the column in the partition by clause of the analytic function as the partition key, or using a constant that is the same as the reduce key as the partition key when the analytic function does not comprise the partition by clause; and

(3) when the analytic function comprises the order by clause, using the column in the order by clause as the sort key.
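For ease of understanding, rules (1) to (3) may be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the names AnalyticCall, determine_keys, and CONSTANT_KEY are hypothetical, the partition by/order by branch of rule (1) is assumed to take precedence over the distinct branch, and the constant of rule (2) is assumed to be the same placeholder constant as in rule (1).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical placeholder for "any constant" in rules (1) and (2).
CONSTANT_KEY: Tuple[str, ...] = ("__const__",)


@dataclass
class AnalyticCall:
    """Hypothetical description of one analytic function invocation."""
    partition_by: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    distinct_col: Optional[str] = None


def determine_keys(call: AnalyticCall):
    """Return (reduce_key, partition_key, sort_key) as tuples of column names."""
    # Rule (1): reduce key.
    if call.partition_by or call.order_by:
        reduce_key = tuple(call.partition_by) + tuple(call.order_by)
    elif call.distinct_col is not None:
        reduce_key = (call.distinct_col,)
    else:
        reduce_key = CONSTANT_KEY

    # Rule (2): partition key (assumed to fall back to the same constant).
    partition_key = tuple(call.partition_by) if call.partition_by else CONSTANT_KEY

    # Rule (3): sort key; empty when there is no order by clause.
    sort_key = tuple(call.order_by)
    return reduce_key, partition_key, sort_key
```

For example, an invocation with partition by dept and order by salary would yield the reduce key (dept, salary), the partition key (dept), and the sort key (salary).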

Step 203: The analysis operator receives the data row, and stores the data row into an analysis operator buffer, so that all analyzers can use the data row.

In order to implement data sharing, an analysis operator buffer AnalysisBuffer may be provided in an analysis operator module formed by the analysis operator. The buffer has the following features: a. allowing data of a designated length to be stored in a memory; b. spilling half of the content of the memory buffer to a hard disk when the length exceeds a limit value; c. allowing a user to access an element in the buffer according to an index; and d. allowing a user to delete, starting from the beginning of the buffer, elements that have been forwarded.

Specifically, as shown in FIG. 3, the analysis operator buffer may include the memory buffer and a magnetic disk buffer (which may be located in a magnetic disk shown in FIG. 4). In the analysis operator buffer, a received new data row may be preferentially put into the memory buffer; and if the memory buffer is full, an old data row in the memory buffer may be stored into the magnetic disk buffer, so as to release storage space of the memory buffer, and then the received new data row may be put into the memory buffer.
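Features a to d of the analysis operator buffer may be sketched as follows. The sketch is illustrative only and makes several assumptions that are not stated in this embodiment: rows are serialized with pickle into a temporary file that stands in for the magnetic disk buffer, indexes are relative to the rows still retained, and the method name drop_front is hypothetical.

```python
import pickle
import tempfile


class AnalysisBuffer:
    """Row buffer that keeps recent rows in memory and spills old rows to disk."""

    def __init__(self, limit):
        self.limit = limit                    # feature a: rows allowed in memory
        self.memory = []                      # newest rows
        self.spilled = []                     # file offsets of spilled rows, oldest first
        self.disk = tempfile.TemporaryFile()  # stands in for the magnetic disk buffer

    def append(self, row):
        if len(self.memory) >= self.limit:
            self._spill()
        self.memory.append(row)

    def _spill(self):
        # Feature b: move the older half of the memory buffer to disk.
        half = max(1, len(self.memory) // 2)
        self.disk.seek(0, 2)                  # always write at the end of the file
        for row in self.memory[:half]:
            self.spilled.append(self.disk.tell())
            pickle.dump(row, self.disk)
        del self.memory[:half]

    def __len__(self):
        return len(self.spilled) + len(self.memory)

    def __getitem__(self, index):
        # Feature c: access by index, oldest retained row first.
        if index < 0 or index >= len(self):
            raise IndexError(index)
        if index < len(self.spilled):
            self.disk.seek(self.spilled[index])
            return pickle.load(self.disk)
        return self.memory[index - len(self.spilled)]

    def drop_front(self, n):
        # Feature d: discard the n oldest rows once they have been forwarded.
        n = min(n, len(self))
        from_disk = min(n, len(self.spilled))
        del self.spilled[:from_disk]
        del self.memory[:n - from_disk]
```

In this sketch, space occupied on disk by dropped rows is not reclaimed; a practical implementation would likely manage the disk buffer in blocks.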

Step 204: The analysis operator parses out a partition by field and an order by field of the data row, and determines whether the data row belongs to a current partition, where the current partition is the partition to which the previous data row received by the analysis operator belongs; and if the data row belongs to the current partition, executes step 205; or if the data row does not belong to the current partition, executes step 206.

Step 205: The analysis operator invokes an analyzer corresponding to the analytic function to analyze the data row to obtain an analytic result, and stores the analytic result into an analyzer buffer.

It should be noted that each analytic function may correspond to one analyzer, and each analyzer may correspond to one analyzer buffer, which is used to store an analytic result and an intermediate result that are related to each data row, or a total aggregate result. As shown in FIG. 4, the analyzer buffer may include the memory buffer and the magnetic disk buffer (which may be located in the magnetic disk shown in FIG. 4), and the memory buffer may include an output buffer and an input buffer.

The analyzer buffer is used to buffer and update the analytic result. Specifically, when the analyzer buffer buffers the analytic result, the analytic result may be stored into the output buffer; and if the output buffer is full, content in the output buffer may be stored into the magnetic disk buffer, so as to release storage space of the output buffer. When the analyzer buffer updates the analytic result, if a to-be-updated row is stored in the output buffer, the analytic result may be directly updated according to the to-be-updated row and received new data in the output buffer; if the to-be-updated row is stored in the input buffer, the analytic result may be directly updated according to the to-be-updated row and received new data in the input buffer; and if the to-be-updated row is stored in the magnetic disk (that is, the magnetic disk buffer), content in the input buffer may be stored into the magnetic disk, and a buffer block in which the to-be-updated row in the magnetic disk is located is read into the input buffer, so as to update the analytic result according to the to-be-updated row and the received new data in the input buffer.

Step 206: The analysis operator ends analysis on the current partition, aggregates all data rows of the current partition stored in the analysis operator buffer and all analytic results of the current partition stored in the analyzer buffer into a new data row, and forwards the new data row to a subsequent operator.

It should be noted that if the analytic function does not need accumulation, after the analyzer corresponding to the analytic function is invoked to analyze the data row to obtain the analytic result, the data row and the analytic result may be directly aggregated, and forwarded to the subsequent operator, and the data row and the analytic result do not need to be buffered.
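The control flow of steps 203 to 206 may be sketched as follows, assuming that rows arrive already grouped by partition (as arranged by the MapReduce framework through the partition key and sort key). All names in the sketch (run_analysis_operator, RowCountAnalyzer, the result column) are hypothetical, the buffers are plain in-memory lists, and the analyzer is a simple accumulating one whose total is attached to every row of the partition.

```python
class RowCountAnalyzer:
    """Hypothetical accumulating analyzer: counts the rows of a partition."""

    def __init__(self):
        self.count = 0

    def analyze(self, row):
        self.count += 1

    def finish(self):
        return self.count


def _flush(buffered, analyzer, forward):
    # Step 206: merge each buffered row with the analytic result and forward it.
    total = analyzer.finish()
    for row in buffered:
        forward({**row, "result": total})
    buffered.clear()


def run_analysis_operator(rows, partition_cols, analyzer_factory, forward):
    current_key = None
    analyzer = None
    buffered = []  # stands in for the analysis operator buffer (step 203)

    for row in rows:
        key = tuple(row[c] for c in partition_cols)  # step 204: partition check
        if analyzer is not None and key != current_key:
            _flush(buffered, analyzer, forward)       # previous partition ended
            analyzer = None
        if analyzer is None:
            analyzer = analyzer_factory()
            current_key = key
        buffered.append(row)
        analyzer.analyze(row)                         # step 205

    if analyzer is not None:
        _flush(buffered, analyzer, forward)           # last partition
```

For an analytic function that does not need accumulation, the buffering in this sketch would be unnecessary, and each row could be forwarded as soon as it is analyzed.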

For ease of understanding, this embodiment briefly describes 11 common exemplary algorithms of the analytic function. Details are as follows.

Algorithm 1: a brief description of a LAG algorithm:

It is assumed that an invoked analytic function is lag(col, offset) over( . . . ).

There is only one row number counter p (an initial value is −1) in an analyzer buffer of LAG. When a new row is analyzed, p is increased by 1. If p>=offset, the result column of the row to which p points is set to the content of the col column of row p−offset, which indicates that row p−offset and the rows preceding it may be forwarded; otherwise, the result of the current row is set to null, and no row can be forwarded.
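The LAG algorithm may be sketched as follows for the rows of one partition. The function name lag and the returned forwardable index are hypothetical; the list values is assumed to hold the col column of the partition in sorted order.

```python
def lag(values, offset):
    """Return (results, forwardable), where forwardable is the largest row
    index that may be forwarded (-1 when no row may be forwarded yet)."""
    p = -1                # row number counter, initial value -1
    results = []
    forwardable = -1
    for _ in values:
        p += 1
        if p >= offset:
            results.append(values[p - offset])  # content of row p-offset
            forwardable = p - offset
        else:
            results.append(None)                # not enough preceding rows
    return results, forwardable
```

With values 10, 20, 30, 40 and an offset of 2, the first two rows receive null and the remaining rows receive 10 and 20.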

Algorithm 2: a brief description of a LEAD algorithm:

It is assumed that an invoked analytic function is lead(col, offset) over( . . . ).

There are two pointers in an analyzer buffer of LEAD. A pointer p1 points to a minimum row that has not been processed, and a pointer p2 points to a current row. When a new row is analyzed, the pointer p2 is increased by 1. In this case, if p2−p1>=offset, a result of the row to which p1 points is set to content at a col column of the row to which p2 points, and p1 increases by one (p1++), and rows having row numbers less than or equal to p1 may all be forwarded.
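The two-pointer LEAD algorithm may be sketched as follows for one partition. The initial values p1 = 0 and p2 = −1 are assumptions, since this embodiment does not state them; it is likewise assumed that rows still unprocessed when the partition ends keep a null result.

```python
def lead(values, offset):
    """Return the LEAD result for each row of one sorted partition."""
    results = [None] * len(values)
    p1 = 0     # minimum row that has not been processed (assumed start)
    p2 = -1    # current row (assumed start)
    for _ in values:
        p2 += 1
        if p2 - p1 >= offset:
            results[p1] = values[p2]  # row p1 takes the col value of row p2
            p1 += 1
    return results
```

With values 10, 20, 30, 40 and an offset of 1, the results are 20, 30, 40, and null.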

Algorithm 3: a brief description of a RANK algorithm:

An analyzer buffer of RANK holds a current sequence number, rank; a value, value, corresponding to the current sequence number; and a count, number, of rows having the current sequence number. When a new row is analyzed, if a value of the new row is equal to the value, a rank column of the row is set to the rank, and number++ in the analyzer buffer; otherwise, the rank column is set to rank+number, and at the same time, the rank in the analyzer buffer is set to the rank+number; the value is set to a designated value of the new row; and the number is set to 1. All rows that are currently processed can be forwarded.
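The RANK algorithm may be sketched as follows. The initial state (rank 0, number 1, and a sentinel value that equals no row value) is an assumption chosen so that the first row receives rank 1; this embodiment does not specify it.

```python
def rank(sorted_values):
    """Return the RANK of each row of one partition sorted by the order by column."""
    rank_no = 0        # current sequence number (assumed initial value)
    value = object()   # sentinel: compares unequal to every row value
    number = 1         # rows having the current sequence number (assumed)
    results = []
    for v in sorted_values:
        if v == value:
            number += 1
        else:
            rank_no += number   # skip as many ranks as there were ties
            value = v
            number = 1
        results.append(rank_no)
    return results
```

With sorted values 5, 5, 7, 9, 9, 9, 10, the ranks are 1, 1, 3, 4, 4, 4, 7.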

Algorithm 4: a brief description of a DENSE_RANK algorithm:

An analyzer buffer of DENSE_RANK holds a current sequence number, rank; a value, value, corresponding to the current sequence number; and a count, number, of rows having the current sequence number. When a new row is analyzed, if a value of the new row is equal to the value, a rank column of the row is set to the rank, and number++ in the analyzer buffer; otherwise, the rank column is set to rank+1, and at the same time, the rank in the analyzer buffer is set to the rank+1; the value is set to a designated value of the new row; and the number is set to 1. All rows that are currently processed can be forwarded.
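The DENSE_RANK algorithm may be sketched as follows. As an assumption not stated in this embodiment, the state starts at rank 0 with a sentinel value, so that the first row receives rank 1.

```python
def dense_rank(sorted_values):
    """Return the DENSE_RANK of each row of one sorted partition."""
    rank_no = 0        # current sequence number (assumed initial value)
    value = object()   # sentinel: compares unequal to every row value
    number = 0         # rows having the current sequence number
    results = []
    for v in sorted_values:
        if v == value:
            number += 1
        else:
            rank_no += 1    # ties do not create gaps in the sequence
            value = v
            number = 1
        results.append(rank_no)
    return results
```

With sorted values 5, 5, 7, 9, 9, 9, 10, the dense ranks are 1, 1, 2, 3, 3, 3, 4.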

Algorithm 5: a brief description of a ROW_NUMBER algorithm:

There is only one rownumber value (an initial value is −1) in an analyzer buffer of ROW_NUMBER. When a new row is analyzed, a rownumber column of the new row is set to rownumber+1, and at the same time, the rownumber in the analyzer buffer is set to the rownumber+1. All rows that are currently processed can be forwarded.

Algorithm 6: a brief description of a SUM algorithm:

In an analyzer buffer of SUM, a variable, that is, a current sum, is stored. When a new row is analyzed, a value of the sum plus a value (which needs to be non-null) of a designated expression of the new row is stored into sum.

Forwarding cannot be performed before whole partition analysis is completed. After the partition analysis is completed, a value of the sum is used as a calculation result of each row.

Algorithm 7: a brief description of a COUNT algorithm:

There is only one count counter in an analyzer buffer of COUNT. Each time a new row is analyzed, if a value of a to-be-analyzed column is non-null, the counter is increased by 1.

Forwarding cannot be performed before whole partition analysis is completed. After the partition analysis is completed, a value of the count is used as a calculation result of each row.

Algorithm 8: a brief description of an AVG algorithm.

There are two counter values in an analyzer buffer of AVG. One is sum (an initial value is 0), and the other is count (an initial value is 0). When a new row is analyzed, if the expression of the new row is a non-null value, count++, and sum is set to sum plus the expression value of the new row.

No row can be forwarded before whole partition analysis is completed. After the partition analysis is completed, if count!=0, a value of sum/count is used as a calculation result of each row; otherwise, null is used as an analytic result of each row.
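The accumulating pattern shared by SUM, COUNT, and AVG may be sketched with AVG as follows. The function name is hypothetical, and the list values is assumed to hold the expression value of each row of one partition, with None standing for null.

```python
def avg_over_partition(values):
    """Return the AVG result attached to each row of one partition."""
    total = 0   # sum, initial value 0
    count = 0   # count, initial value 0
    for v in values:
        if v is not None:   # null expressions are skipped
            count += 1
            total += v
    # Results become available only after the whole partition is analyzed.
    result = total / count if count != 0 else None
    return [result] * len(values)
```

With values 1, null, 3, every row of the partition receives 2.0; a partition containing only nulls receives null on every row.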

Algorithm 9: a brief description of a MAX algorithm.

There is only one max value in an analyzer buffer of MAX. When a new row is analyzed, an expression (non-null) of the new row is compared with max. If the expression is greater than max, max is updated. When partition analysis is completed, designated columns of all rows are set to max.

Forwarding cannot be performed before whole partition analysis is completed.

Algorithm 10: a brief description of a MIN algorithm.

There is only one min value in an analyzer buffer of MIN. When a new row is analyzed, an expression (non-null) of the new row is compared with min. If the expression is less than min, min is updated. When partition analysis is completed, designated columns of all rows are set to min.

Forwarding cannot be performed before whole partition analysis is completed.
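The MAX and MIN algorithms above may be sketched as follows. The initial state of max and min is not specified in this embodiment; the sketch assumes that it is null (None) and that the first non-null expression replaces it. The function names are hypothetical.

```python
def max_over_partition(values):
    """Return the MAX result attached to each row of one partition."""
    current = None  # assumed initial state: no value seen yet
    for v in values:
        if v is not None and (current is None or v > current):
            current = v
    return [current] * len(values)


def min_over_partition(values):
    """Return the MIN result attached to each row of one partition."""
    current = None  # assumed initial state: no value seen yet
    for v in values:
        if v is not None and (current is None or v < current):
            current = v
    return [current] * len(values)
```

With values 3, null, 7, 1, every row receives 7 for MAX and 1 for MIN.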

Algorithm 11: a brief description of a RATIO_TO_REPORT algorithm.

There is only one sum value in an analyzer buffer of RATIO_TO_REPORT. When a new row is analyzed, sum is set to sum plus an expression (non-null) of the new row. When partition analysis is completed, the designated column of each row is set to the expression value of that row divided by sum. If sum is 0, the values of the columns are all set to null.

Forwarding cannot be performed before whole partition analysis is completed.
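The RATIO_TO_REPORT algorithm may be sketched as follows. The function name is hypothetical, and it is an assumption of the sketch that a row whose expression is null receives a null result.

```python
def ratio_to_report(values):
    """Return each row's share of the partition total."""
    total = 0
    for v in values:
        if v is not None:   # null expressions do not contribute to sum
            total += v
    if total == 0:
        return [None] * len(values)   # avoid division by zero
    return [None if v is None else v / total for v in values]
```

With values 1 and 3, the results are 0.25 and 0.75; when sum is 0, every row receives null.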

It should be noted that, in the analytic function, an aggregate value is calculated for each row of data based on a group of records (such as multiple data rows), to obtain an analytic result, where the group of records on which the calculation is based is referred to as a "window". Each row of records has one window, which designates the record set on which the analytic function performs aggregate computation. For a case in which there is a window clause, this embodiment provides the following 8 modes (that is, window modes, specifically, modes of setting a window location) for reference:

Mode 1 is shown in FIG. 5A: