Loading new data into existing tables is a process commonly performed for most databases. New data is typically loaded into existing tables to ensure that the data contained in the tables is up to date. The new data often includes both data that is new and data that is an update to existing data. Existing methods for loading the data typically employ two separate operations, one operation to load updated data and another operation to load new data.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to an example thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Disclosed herein are methods for merging data from a source table into a target table, in which both the source table and the target table contain tuples of data. Also disclosed herein are an apparatus for implementing the methods and a non-transitory computer readable medium on which is stored machine readable instructions that implement the methods. Methods for merging data include a merge operation, which may be implemented or invoked in a Vertica™ column-stored database or other types of databases. The merge operation of the present disclosure may perform both an update to existing data and an insertion of new data in the target table during a single merge operation.
According to an example, a single merge operation of the present disclosure may be implemented as a delete operation and an insert operation executed in parallel to update existing data and insert new data in the target table. As such, compared with other merge operations, which require a right outer join operation to insert updated and new tuples into the target table, the merge operation of the present disclosure provides a more efficient manner of merging data into the target table. For example, the insert operation of the merge operation of the present disclosure does not require a join operation or input from a join operation, and thus the insert operation may be implemented in parallel with the delete operation to reduce execution time for the merge operation. According to an example, the insert operation may insert both matched and unmatched tuples contained in the source table into the target table as new tuples. Therefore, according to examples of the present disclosure, the merge operation may only require one insert operation, as opposed to separate insert operations for updated tuples and new tuples as required in other merge operations.
According to an example of the present disclosure, the delete operation may delete a tuple in the target table that matches a tuple in the source table based on a predetermined condition. The matched tuple may, for instance, be an existing tuple in the target table to be updated. The existing tuple in the target table may be identified as a matched tuple by executing a late materialized join operation between the source table and the target table. By implementing a late materialization strategy and delaying the materialization of the matched tuple, the position of the matched tuple in the target table may be preserved for later processing by the delete operation. The position of the matched tuple for example is the location of the matched tuple in a merged target table. The location may be the position of the tuple's row in the merged target table.
For a late materialized join operation between the source and the target table, the position of the rows from the target table are preserved. By doing this, the target table does not have to be rescanned to find the position of the rows that are to be marked as deleted in a delete vector. An early materialized join operation reads all the columns of a table before a join operation, and thus does not need to preserve the position of the rows because all the data from the column has been read. In contrast, in a late materialized join operation reads only the join columns before the join operation and fetches the rest of the non-join columns only after passing the join operation. That is, in a late materialized join operation, non-join columns are fetched using position of the rows that passed join operation.
Examples of the present disclosure implement a late materialized join operation, whereby the position of the matched tuple may be preserved for addition to a delete vector without an additional scan of the target table. In one example, the merge operation of the present disclosure may only scan the target table once during the initial join operation, which improves the performance and speed of the merge operation. In contrast, other update operations typically employ an early materialization strategy, which does not preserve the position of the matched tuples. Thus, other update operations (which include delete operations) require an additional scan (e.g., join operation) of the target table to fetch the position of matched tuples before deleting and updating the matched tuples. This additional scan is computationally expensive in terms of the performance and speed for other merge operations.
A merge operation of the present disclosure may utilize a semi inner join operation to identify tuples in the target table that match tuples in the source table based on a predetermined condition. Compared with other merge operations, which typically utilize a right outer join operation, the merge operation of the present disclosure provides a less restrictive and less computationally expensive semi inner join operation for identifying the matched tuples. Moreover, by utilizing a semi inner join operation, the merge may ignore unmatched tuples. Accordingly, in contrast to other merge operations, the merge operation of the present disclosure may not differentiate between matched and unmatched tuples when inserting the tuples into the target table, which results in a relatively faster merge operation as discussed in greater detail herein.
With reference to
The machine 100 is depicted as including a processor 102, a data store 104, an input/output interface 106, and a merge operation manager 110. The machine 100 comprises a computer. In one example, the computer is a server but other types of computers may be used. Also, the components of the machine 100 are shown on a single computer as an example and in other examples the components may exist on multiple computers. The machine 100 may store a target table and/or may manage the storage of data in a target table stored in a separate machine, for instance, through a network device 108, which may comprise, for instance, a router, a switch, a hub, etc.
The merge operation manager 110 is depicted as including a delete operation module 112 and an insert operation module 114. The processor 102, which may comprise a microprocessor, a micro-controller, an application specific integrated circuit (ASIC), or the like, is to perform various processing functions in the machine 100. The processing functions may include the functions of the modules 112 and 114 of the merge operation manager 110 as discussed in greater detail herein below.
In one example, the merge operation manager 110 comprises machine readable instructions stored on a non-transitory computer readable medium 113 and executed by the processor 102. Examples of the non-transitory computer readable medium include dynamic random access memory (DRAM), electrically erasable programmable read-only memory (EEPROM), magnetoresistive random access memory (MRAM), Memristor, flash memory, hard drive, and the like. In another example, the merge operation manager 110 comprises a hardware device, such as a circuit or multiple circuits arranged on a board. In this example, the modules 112 and 114 comprise circuit components or individual circuits, such as an embedded system or ASIC.
The input/output interface 106 comprises a hardware and/or a software interface. In any regard, the input/output interface 106 may be connected to a network, such as the Internet, an intranet, etc., through the network device 108, over which the merge operation manager 110 may receive and communicate information, for instance, the data contained in the source table 110 and data contained in other tables. The processor 102 may store information received through the input/output interface 106 in the data store 104 and may use the information in implementing the modules 112 and 114. The data store 104 may include volatile and/or non-volatile data storage.
The delete operation module 112 and the insert operation module 114 perform delete and insert operations for merge operations described herein. Examples of the merge operations are discussed in greater detail with respect to
With reference first to
As shown in
Although the source table 210 and the target table 220 have been depicted as having the same number of columns, it should be understood that various aspects of the present disclosure may be implemented in source tables and target tables having different numbers of columns and data types. In one regard, therefore, various aspects of the present disclosure may be implemented with source and target tables that are not the same schema.
A query to merge the source table 210 and the target table 220 into the merged target table 230, as shown in
MERGE INTO Source using Target on Source.USERID=Target.USERID;
when matched then update
set USERID=Source.USERID, CITY=Source.CITY, STATE=Source.STATE, ZIPCODE=Source.ZIPCODE;
when not matched then
insert (USERID, CITY, STATE, ZIPCODE) values (Source.USERID, Source.CITY, Source.STATE, Source.ZIPCODE).
As discussed in greater detail herein, the update clause and the insert clause of the query above may be performed by an example merge operation including a delete operation in parallel with an insert operation. For example, the query above may be converted into an example merge operation with following SQL commands:
DELETE from Target where exists (select 1 from Source where Target.USERID=Source.USERID);
INSERT into Target as select USERID, CITY, STATE, ZIPCODE from Source.
As shown in
Because tuples 212 and 222 match, the tuple 222 in the target table 220 may be deleted by adding its position identification (PosID) to a delete vector 240. For example, the PosID may be a row ID or row number. The delete vector 240, for instance, may keep track of the address of the deleted tuple 222 in the target table 220 as the PosID. Moreover, according to an example of the merge operation, both the matched tuple 212 of the source table and the unmatched tuples 214, 216 may be inserted into the merged target table 230 in parallel, e.g., simultaneously, with the deletion of tuple 231 in the merged target table 230. Also, the delete vector 240 identifies the deleted tuples but the deleted tuples may not actually be deleted from the tables. However, if a tuple is in the delete vector 240, the tuple may not be used for database operations. For instance, a deleted tuple identified in the delete vector 240 is not retrieved for a read operation but still may be stored in the database table.
As discussed above, the update clause and the insert clause of the original merge query may be converted into the example merge operation to perform a delete operation in parallel with an insert operation. According to the converted commands of the example merge operation, the insert operation does not require a join operation or input from a join operation, and thus may be implemented in parallel with the delete operation to reduce execution time for the merge operation. Moreover, the matched tuple 222 in the target table 220 may be identified as a matched tuple using a late materialized join operation between the source table 210 and the target table 220.
By implementing a late materialization strategy and delaying the materialization of the matched tuple 222, the position of the matched tuple 222 in the target table 220 may be preserved for addition to the delete vector 240 without an additional scan of the target table. For example, the merge operation may only scan the target table once during an initial join operation, which improves the performance and speed of the merge operation.
Further, the merge operation may utilize a semi inner join operation to identify the matched tuple 222 in the target table 220 based on a predetermined condition. As such, the merge operation may ignore the unmatched tuples 224, 226 in the target table 220 and does not differentiate between matched and unmatched tuples when inserting the tuples 112, 114, 116 contained in the source table 210 into the merged target table 230, which results in a relatively faster merge operation. The result of the merge operation is represented by tuples 231-236 in the merged target table 230 shown in
With reference to
In
According to an example, as shown in block 315, the delete operation of block 310 may first identify a tuple in the target table as a matched tuple based on a predetermined condition by utilizing a semi inner join operation between the source table and the target table. According to an example, the late materialized, semi inner join operation may not construct tuples until after the join operation, thus allowing the delete operation, for instance, to operate directly on a column of positions of the scanned tuples as long as possible. That is, a position of the matched tuple in the target table may be preserved and received from the late materialized join operation. Accordingly, the preserved position of the matched tuple in the target table may be added to a delete vector without an additional scan of the target table according to an example.
At block 320, an insert operation may be performed to insert each tuple contained in the source table into the target table as a new tuple, for instance, by the insert operation module 114. According to an example, the insert operation may be performed in parallel with the delete operation depicted in block 310. That is, the insert operation may not require a join operation or input from a join operation in order to insert each tuple contained in the source table into the target table. Accordingly, the insert operation does not differentiate between matched tuples and unmatched tuples, and thus, the inserting of each tuple from the source table into the target table may be implemented without identifying matched tuples. According to an example, the unmatched tuples from the source table may be new tuples. The insert operation of an example, therefore, may insert both matched tuples and unmatched tuples contained in the source table into the target table as new tuples.
As discussed above with respect to the diagram 200 depicted in
Moreover, as discussed above with respect to the diagram 200 depicted in
Turning now to
In
At block 420, the source table and the target table are semi inner joined based on a predetermined condition to identify matched tuples. The outcome of the identification at block 420 is a determination as to which tuples in the target table require an update. That is, according to an example, the matched tuples may be marked as removed in the target table to prevent access to the matched tuple in future queries, as further described below.
As discussed above with respect to the diagram 200 depicted in
At block 430, the position of the matched tuple in the target table may be preserved by delaying materialization of the matched tuples. For example, the positions of the matched tuple in the target table may not be discarded prior to being added to a delete vector, as further described below. In other words, the position of the matched tuple in the target table may be sustained after a join operation, in contrast to an early materialization strategy of other merge operations, which typically discards the position of the matched tuple after a join operation. The semi inner join operation of the identification at block 420 may, for instance, be performed with a late materialization strategy. According to an example, the late materialization strategy may not materialize tuples until after the semi inner join operation, thus allowing the delete operation 401 to operate directly on a column of positions of the scanned tuples as long as possible. Therefore, according to an example, the positions of the matched tuples in the target table may be preserved using the late materialized semi inner join operation.
At block 440, the preserved position of the matched tuple in the target table may be added to a delete vector. By implementing a late materialization strategy and delaying the materialization of the matched tuples, as shown in blocks 420 and 430 of the delete operation 401, the position of matched tuples in the target table are preserved. In contrast to other merge operations that typically use an early materialization strategy, the method 400 of the present disclosure does not require an additional scan (e.g., join) of the target table to fetch the positions of the matched tuples prior to deletion.
According to an example, when adding the positions of the matched tuples to the delete vector, the positions (i.e., address or PosID) of the matched tuples may be marked as removed to prevent access to the matched tuples in future queries. In other words, as described with respect to the method 400, when a delete operation 401 is performed, data is neither physically updated nor removed from the target table. Instead, the existing tuples may be marked as removed in the delete vector and the updated tuples are not updated but newly inserted into the target table.
Turning to the diagram 200 depicted in
Referring back to
With reference to the diagram 200 depicted in
According to an example, the tuples from the source table may be merged into target projections. Also, the source table may also be a projection. A projection is a set of columns that are either from a table or a join of different tables. In an analytic database, column data is usually stored redundantly in various projections with different column sort orders or data segmentation. This storage mechanism ensures that queries still work when one or more nodes are down, and improves the performance of many different queries. Hence, in the methods 300 and 400, the data in the source table may be merged into a plurality of target projections.
According to an example, the merge operation of the present disclosure may be implemented if the join column is the primary key column of the target table, the insert operation of the update clause is the same as the insert clause of an original MERGE query. The merge operation of the present disclosure may effectively rewrite a merge statement into a delete statement and an insert statement to provide more efficient manner of merging data into a target table
What has been described and illustrated herein are examples of the disclosure along with some variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the scope of the disclosure, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2013/072650 | 12/2/2013 | WO | 00 |