Various embodiments relate generally to the field of data transfer, and in an embodiment, but not by way of limitation, to a system and method involving a control process to direct and oversee the transfer of data from a source system to a target system.
Over the past several decades, there has been an explosion of information available on virtually any subject. A major reason for this explosion has been the advent and subsequent ubiquity of computers and networks. This information explosion has particularly affected large and moderately sized business entities. Quite often, for reasons such as backups or disaster recovery, data stored within a business organization has to be transferred from one system (the source) to another system (the target). In the information processing profession, moving data from one database management system to another is referred to as extraction, transformation, and loading (ETL). Such ETL operations, however, can occupy a great deal of resources on both the source and target systems (such as processor time and bandwidth), as well as network resources, all of which could be better used for other information processing needs.
There are some drawbacks however to systems like the one illustrated in
Various embodiments of the invention relate to a system that transfers data from a source system to a target system. In an embodiment, one or more control processes create one or more modules on a source system that will transfer data and one or more modules on a target system that will receive data. The control process communicates with the source modules and target modules. After the control process has successfully brought up the source and target processes, the control process informs the source and target modules to communicate with each other, and to begin the transfer of data. In embodiments in which the control process instantiates multiple source modules and target modules, a massively parallel processing system is created between the source system and the target system.
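The bring-up sequence described above can be sketched as follows. This is a minimal illustration only, assuming threads and an in-memory queue stand in for the source modules, target modules, and the connection between them; the names `source_module`, `target_module`, and `run_transfer` are hypothetical and do not come from the specification.

```python
import queue
import threading


def source_module(ready, go, channel, records):
    ready.set()            # report to the control process that this module is up
    go.wait()              # wait until the control process says to begin
    for record in records:
        channel.put(record)
    channel.put(None)      # sentinel: no more data for this pair


def target_module(ready, go, channel, received):
    ready.set()
    go.wait()
    while True:
        record = channel.get()
        if record is None:
            break
        received.append(record)


def run_transfer(partitions):
    """Control process: start one source/target pair per partition, then release them."""
    go = threading.Event()
    workers, ready_flags, results = [], [], []
    for records in partitions:
        channel = queue.Queue()    # stands in for the link between one pair
        received = []
        results.append(received)
        for func, extra in ((source_module, records), (target_module, received)):
            ready = threading.Event()
            ready_flags.append(ready)
            worker = threading.Thread(target=func, args=(ready, go, channel, extra))
            worker.start()
            workers.append(worker)
    for ready in ready_flags:
        ready.wait()               # every module has been brought up successfully
    go.set()                       # tell all pairs to begin transferring
    for worker in workers:
        worker.join()
    return results
```

Calling `run_transfer` with several partitions starts several source/target pairs that run concurrently, which loosely mirrors the parallelism obtained by instantiating multiple source and target modules.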
FIGS. 2a, 2b and 2c illustrate an example embodiment of a system architecture that may be used in connection with transferring data from a source system to a target system.
One or more embodiments of a system and method for transferring data from a source system to a target system are described. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention may be practiced without these specific details.
FIG. 2a illustrates an embodiment of an architecture that may be employed to transfer data from a source system to a target system. The architecture 200 includes a master control server 205 and a control server 210. The control server 210 includes at least a global control process 213 and a parent control process 215. The global control process 213 communicates with one or more parent control processes 215 through communication paths 217. Each parent control process 215 is associated with a command socket 220, through which it communicates with a source system 240 and a target system 260. In an embodiment, the communication to the source system 240 is through the command socket 220 and a source data management utility (DMU) 247, and the communication to the target system 260 is through the command socket 220 and a target data management utility (DMU) 267. The source system 240 may include one or more source server nodes 245, and the target system 260 may include one or more target nodes 265. These multiple source nodes and target nodes may operate on multiple images of the operating system, and as such, provide a massively parallel processing system to transfer data from the source system 240 to the target system 260. Each source node 245 includes one or more of the source data management utility programs 247. Each target node 265 includes one or more of the target data management utility programs 267. A data transfer socket 243 facilitates communication between a node 245 on the source system 240 and a node 265 on the target system 260. In an embodiment, a node is a single operating system with dedicated processor and storage resources.
FIG. 2b illustrates another embodiment of an architecture that may be used to transfer data from a source system to a target system. The system 200 of
FIG. 2c illustrates another embodiment of a data transfer system 200. In
In an embodiment, the master control server 205 reads the global processing file 270. The global processing file 270 includes parameters that define the environment in which a system such as the system 200 in
The global processing file 270 further may include a parameter indicating the number of parent control processes 215 to instantiate. After the parent control processes 215 are instantiated, the control server 210 reads the global processing file 270 to determine the number and identity of source DMUs 247 and target DMUs 267, and the location of the source nodes 245 and target nodes 265 on which these DMU modules execute. In an embodiment, this is implemented in a script that contains a logon id and a password for each particular source node 245 and target node 265, which allows the control server 210 to access the particular source and target systems. Such a script may also include the locations of the files that will be transferred to the target node 265 by the particular source node 245 that is being logged onto. In an embodiment, this transfer may involve the complete transfer of a file or a number of files. In another embodiment, this transfer may only involve the records in a file that have been changed since the last transfer of data from the source node 245 to the target node 265. The logic to control such a determination may be in the source DMU 247, the source utility program 241, or other processes, files, or scripts within the system.
When instantiating the source and target nodes 245, 265, the control server 210 further uses the global processing file 270 to determine the sockets through which the instances of the source DMU 247 will communicate with the paired instances of the target DMU 267. The control server 210 may further determine a checkpoint and a checksum from the global processing file 270.
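As an illustration of the parameters discussed above, the following sketch parses a hypothetical global processing file. The specification does not define a file format; the JSON layout and the key names (`parent_control_processes`, `dmu_pairs`, `source_node`, `target_node`, `port`) are assumptions made purely for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DmuPair:
    source_node: str   # location of the node running a source DMU
    target_node: str   # location of the node running the paired target DMU
    port: int          # socket over which the pair exchanges data


@dataclass(frozen=True)
class GlobalConfig:
    parent_control_processes: int
    pairs: tuple


def parse_global_processing_file(text):
    """Parse the assumed JSON layout into typed configuration objects."""
    data = json.loads(text)
    pairs = tuple(
        DmuPair(p["source_node"], p["target_node"], int(p["port"]))
        for p in data["dmu_pairs"]
    )
    return GlobalConfig(int(data["parent_control_processes"]), pairs)


# Hypothetical file contents; host names and ports are placeholders.
EXAMPLE = """
{
  "parent_control_processes": 2,
  "dmu_pairs": [
    {"source_node": "src-node-1", "target_node": "tgt-node-1", "port": 5001},
    {"source_node": "src-node-2", "target_node": "tgt-node-2", "port": 5002}
  ]
}
"""

config = parse_global_processing_file(EXAMPLE)
```

Credentials are deliberately left out of the sketch; in the embodiment described, logon details are held in a separate script.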
The global processing file 270 may further include a parameter to indicate the bandwidth that the system 200 is permitted to consume during a transfer. For example, if the total bandwidth available between the source system and target system is 100 Mbps, a parameter in the global processing file 270 may indicate that only 25 Mbps of bandwidth is to be consumed by the transfer process, leaving the remainder of the bandwidth for other processing and communication needs. Consequently, in an embodiment, a source node 245 and target node 265 will monitor themselves to ensure that they stay within the confines of their allotted bandwidth. If they are approaching or exceeding their allotted bandwidth, the DMU processes 247, 267 may pause themselves for a period of time, thereby freeing up network resources for other processes.
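One way a DMU process might pause itself to stay within an allotted bandwidth is sketched below. This is an assumption-laden illustration, not the method of the specification: the `BandwidthLimiter` class is hypothetical, the limit is expressed in bytes per second, and the pause is computed from the average rate since the transfer began.

```python
import time


class BandwidthLimiter:
    """Pauses the caller whenever the average send rate exceeds the limit."""

    def __init__(self, limit_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit_bytes_per_sec
        self.clock = clock      # injectable so the behavior can be tested
        self.sleep = sleep
        self.start = clock()
        self.sent = 0

    def record(self, nbytes):
        """Account for nbytes just sent; return the number of seconds paused."""
        self.sent += nbytes
        elapsed = self.clock() - self.start
        required = self.sent / self.limit   # minimum time these bytes should take
        pause = max(0.0, required - elapsed)
        if pause > 0:
            self.sleep(pause)
        return pause


# Assuming the 25 figure in the example means megabits per second:
# 25,000,000 bits/s divided by 8 is 3,125,000 bytes/s.
LIMIT_BYTES_PER_SEC = 25_000_000 // 8
```

A sender would call `record(len(chunk))` after each chunk it writes to the data transfer socket, so that the pause is taken before the next chunk is sent.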
The use of a global processing file 270 to persist and allocate resources in the system 200 is a form of statically persisting and allocating such resources. Any other method that allows an operator to statically persist and allocate resources would also work in lieu of the global processing file 270. In another embodiment, the system 200 dynamically persists and allocates resources. As an example embodiment of such a dynamic system, a service process is invoked which waits for instructions. The instructions may consist of the specification of the available resources, and the service will then determine how to use these resources based on load and availability. The service process can then determine at runtime the number of source and control processes that should be instantiated.
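The dynamic case can be illustrated with a simple sizing rule. The heuristic below is entirely hypothetical (the specification does not say how the service weighs load and availability); it merely shows a service deciding at runtime how many source/target process pairs to start from the resources it is told about.

```python
def plan_process_pairs(source_cpus, target_cpus, partitions, load_fraction):
    """Return how many source/target pairs to start.

    Assumed heuristic: use the free share of the smaller system's processors,
    never more pairs than there are data partitions, and always at least one.
    """
    free_share = 1.0 - load_fraction
    usable = int(min(source_cpus, target_cpus) * free_share)
    return max(1, min(usable, partitions))
```

For instance, with 8 source processors, 4 target processors, 10 partitions, and a current load of 50 percent, this rule would start 2 pairs.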
The instantiation and control of multiple source nodes 245 and multiple target nodes 265, each node with its own dedicated operating system and memory resources, in connection with the partitioning of data across these multiple instances of the server nodes and target nodes, and the compression of that data, provides a massively parallel processing environment that transfers the data from a source system 240 to a target system 260 without overburdening network resources. The exact architecture of a system 200 as illustrated in
In the meta data phase 310, a collection of work 311 is data that is to be analyzed to determine the portion thereof that is to be transferred from the source system 240 to the target system 260. For example, a particular data segment such as a data table may be small enough that the whole table can be transferred without an excessive strain on the system. In other cases, a subset 312 of this collection of work 311 may be created to decrease the amount of data that has to be transferred. In an embodiment, such a subset 312 may consist of only the records that have changed since the last transfer of data from the source system 240 to the target system 260. In an embodiment, the records to be transferred may be extracted with an SQL WHERE clause (block 313). A script is built at 314 that gathers environmental variables and sets up instances to execute.
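The extraction of changed records with an SQL WHERE clause might look like the sketch below. The `orders` table, its `updated_at` column, and the use of SQLite are assumptions for illustration; the specification does not name a schema or database product.

```python
import sqlite3


def changed_since(conn, last_transfer):
    """Select only the rows modified after the previous transfer."""
    cursor = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_transfer,),
    )
    return cursor.fetchall()


# Hypothetical source table with ISO-8601 dates, which sort correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-02-15"), (3, 30.0, "2024-03-01")],
)

subset = changed_since(conn, "2024-02-01")
```

Here `subset` holds only the two rows updated after the assumed last transfer date, so the first row is never shipped.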
In the preparation phase 320, runtime environments are built at 321 from the scripts generated at 314. Directories are built at 322 based on the subset 312. The directories are later used to locate the data that is to be transferred. The preparation phase is further used to build additional scripts at 323 based in part on the results of SQL WHERE operations. These scripts, when interpreted at 324, take part in four aspects of the data transfer process. First, scripts are constructed that generate the subset(s) of information tables. Second, scripts are generated that instruct the source DMU 247 and the target DMU 267 what to do. For example, a particular script may contain commands to instruct the source DMU 247 which tables or portions thereof to move from the source system 240 to the target system 260. In an embodiment, this involves invoking the utility programs 249, 269. Third, scripts are generated that validate all the data on the target system 260 that has been transferred there from the source system 240. Fourth, scripts are generated that are interpreted on the target system 260 and execute the apply function 350 on the target system. In particular, these scripts receive the data on the target system 260 and write the data to a clean database. Another script will then validate the transferred data, and if the data validates, the data will be written to the permanent database on the target system 260. The scripts also determine on the target system whether the data is a complete table or just a portion of a larger table. If the data is a complete table, the target DMU 267 can overwrite the pertinent database on the target system. If it is only a portion of the database, the invocation of the scripts by the target DMU 267 will only change the records that have been transferred to the target system.
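The distinction between applying a complete table and applying only a portion of one can be sketched as follows. The `customers` table and the `apply_dataset` function are hypothetical, and SQLite's `INSERT OR REPLACE` is used here as one convenient way to change only the transferred records.

```python
import sqlite3


def apply_dataset(conn, rows, complete_table):
    """Overwrite the table for a complete transfer; otherwise touch only the given rows."""
    with conn:  # commit on success, roll back on error
        if complete_table:
            conn.execute("DELETE FROM customers")
        conn.executemany(
            "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

# Partial transfer: only record 2 changes, record 1 is left alone.
apply_dataset(conn, [(2, "Robert")], complete_table=False)
partial_result = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()

# Complete transfer: the existing contents are replaced wholesale.
apply_dataset(conn, [(3, "Cy")], complete_table=True)
full_result = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```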
In the subset phase 330, scripts are generated at 331 to build subset tables that reflect the subsets 312 generated in the meta data phase 310. These scripts are executed at 332 resulting in tables being built at 333 that represent the subsets 312.
In the build and ship phase 340, the data to be transferred from the source system 240 to the target system 260 is acquired at 341 from the tables that were built at 333 using the scripts that were generated in the preparation phase 320. This data forms a dataset, which is read from the tables at 341. The dataset may then be compressed and/or encrypted at 342. Various lossless compression algorithms may be used, such as gzip or zlib. After compression and encryption, the dataset is transferred at 343 to the target system 260, and a logging message is sent from the source system 240 to the control server 210 at 345.
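A dataset might be packaged for shipment roughly as shown below. This sketch assumes JSON serialization, zlib compression, and a SHA-256 checksum, none of which is mandated by the specification; encryption is omitted, and the function names are hypothetical.

```python
import hashlib
import json
import zlib


def build_dataset(rows):
    """Serialize rows, record a checksum of the raw bytes, and compress losslessly."""
    raw = json.dumps(rows).encode("utf-8")
    return {
        "checksum": hashlib.sha256(raw).hexdigest(),
        "row_count": len(rows),
        "payload": zlib.compress(raw),
    }


def open_dataset(dataset):
    """Decompress on the receiving side and reject data that fails verification."""
    raw = zlib.decompress(dataset["payload"])
    if hashlib.sha256(raw).hexdigest() != dataset["checksum"]:
        raise ValueError("checksum mismatch")
    rows = json.loads(raw)
    if len(rows) != dataset["row_count"]:
        raise ValueError("row count mismatch")
    return rows
```

Because the compression is lossless, `open_dataset(build_dataset(rows))` returns the original rows, while a corrupted or mismatched payload raises an error rather than being applied.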
In the apply phase 350 and load phase 360, the dataset first lands on the target system 260 at 351. Scripts generated in operation 320 are used at 352 to validate the data, and at 353 to determine if the transfer was successful or whether there was a problem in the transfer. The data is then applied at 354 to a clean database, validated, and loaded at 360, 361 to the permanent database 262.
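The land, validate, and load sequence can be illustrated with a staging table standing in for the clean database. The table names `staging` and `permanent`, the row-count check, and the use of SQLite are assumptions for this sketch; the specification leaves the validation logic to generated scripts.

```python
import sqlite3


def land_and_load(conn, rows, expected_count):
    """Write rows to staging, validate them, and only then load the permanent table."""
    with conn:
        conn.execute("DELETE FROM staging")   # start from a clean staging area
        conn.executemany("INSERT INTO staging (id, name) VALUES (?, ?)", rows)
    (count,) = conn.execute("SELECT COUNT(*) FROM staging").fetchone()
    if count != expected_count:
        return False                          # transfer problem: leave permanent data alone
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO permanent (id, name) "
            "SELECT id, name FROM staging"
        )
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE permanent (id INTEGER PRIMARY KEY, name TEXT)")

ok = land_and_load(conn, [(1, "Ada"), (2, "Bob")], expected_count=2)
rejected = land_and_load(conn, [(3, "Cy")], expected_count=2)
loaded = conn.execute("SELECT id, name FROM permanent ORDER BY id").fetchall()
```

In this sketch the second call fails validation, so the permanent table still holds only the rows from the first, successful transfer.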
The computer system 700 includes a processor 702, a main memory 704 and a static memory 706, which communicate with each other via a bus 708. The computer system 700 may further include a video display unit 710 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 700 also includes an alpha-numeric input device 712 (e.g., a keyboard), a cursor control device 714 (e.g., a mouse), a disk drive unit 716, a signal generation device 720 (e.g., a speaker) and a network interface device 722.
The disk drive unit 716 includes a computer-readable medium 724 on which is stored a set of instructions (i.e., software) 726 embodying any one, or all, of the methodologies described above. The software 726 is also shown to reside, completely or at least partially, within the main memory 704 and/or within the processor 702. The software 726 may further be transmitted or received via the network interface device 722. For the purposes of this specification, the term “computer-readable medium” shall be taken to include any medium that is capable of storing or encoding a sequence of instructions for execution by the computer and that causes the computer to perform any one of the methodologies of the present invention. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic disks, and carrier wave signals.
Thus, a method and apparatus for transferring data from a source system to a target system have been described. Although the present invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.