This invention relates to data processing tools for assisting users in decision making processes. Data Warehousing is a tool used for providing users with information needed in various decision making processes, for example, in various types of businesses and other types of organizational environments. In order to provide current information, the data warehouse must be continuously updated and consolidated to properly reflect a current state of a business or organizational entity. Data from multiple sources is extracted, transformed and loaded into the data warehouse at various times in a process. Typically, the transformation steps are modeled as data flows, which can be implemented as SQL scripts, shell scripts, application programs, and other types of command execution. Frequently, in an enterprise, for a complete end-to-end solution, such data flow executions are also interspersed with other existing processes, for example, legacy or third party processes.
Businesses and organizations often rely on a service-oriented approach to support their processes. A programming language that is used in this service-oriented approach is BPEL4WS (Business Process Execution Language for Web Services). This standard language is supported by a variety of runtime engines, and enables the orchestration of web services. Common data warehousing scenarios, such as the above-mentioned extract, transform and load (ETL) scenario, are not well supported in BPEL-based processes, as the current BPEL language does not provide for a native definition of data integration processing elements.
A BPEL process is a flow of activities, where BPEL branching semantics indicate sequenced or parallel execution of the activities, including handling errors in activity executions. A BPEL process can have variables, which are used as either input or output parameters in the context of an activity's invocation. Several BPEL flow level and activity level attributes can exist. For example, groups of activities can be part of the same transaction. Other attributes include settings, such as whether transactional processing is managed by a BPEL engine, or whether individual activities manage their own transactions.
In contrast to BPEL processes, whose focus is the coordination of activities among service providers, data warehouse processes usually need to handle significant amounts of data. A data warehouse process includes a set of steps to extract, transform and load data, which are repeatedly performed depending on predefined control conditions. Data movement and transformation steps use database functionality and script files, as well as custom programs, to process the data. A data warehouse process can be scheduled (or triggered), and there is usually only a single instance of a data warehouse process running at any specific time. Error recovery in transformational processing is critical and is done using the transactional processing capability of SQL engines, or by running specific cleanup flows to “undo” partial processing.
Having a service-oriented approach, based on standards such as BPEL, is essential for enterprise component and process integration, especially when multiple parties are involved. With a service-oriented approach the distinctions between control processes (that is, BPEL processes) and data (that is, warehouse transformation) processes become less clear, especially from an end user's or administrator's perspective. Importantly, all the constituent activities, including the data transformation activities, must participate in a unified transactional context as well, to enable proper error recovery and restarts of failed processes, and so on. However, BPEL provides little support for data transformation activities. Hence, it would be advantageous to have a seamless integration of process steps and data integration steps to enable a combination of activities as flexibly as needed, while avoiding having to rely on multiple technology solutions for solving similar problems.
One solution for integrating a data flow into a BPEL-based process is to expose a data flow job through a web service. This approach, however, requires a separate web service interface for each data flow job, leading to a proliferation of web services in complex ETL scenarios and thus to extensive administration requirements. Furthermore, from an invocation point of view, web services are remote, which adds overhead for invocations of data transformations that have the potential of being handled locally.
Another possible solution is based on extensions to the BPEL-standard to support various high-level programming languages. For example, BPEL4Java systems enable BPEL process developers to embed Java™ code into BPEL, such as Java™ conditional expressions for evaluating branches or for invoking Java™ methods. However, these generic programming related solutions do not pertain to typical data transformation scenarios, especially since ETL jobs are usually executed on runtime (remote) systems.
Often, data transformations are performed using SQL. Thus, yet another possible solution for including data integration processing into a process is to make use of a native SQL-extension approach in the BPEL language. This solution has the advantage of integrating the data directly into the process, but also implies that already existing data transformation flows would need to be re-implemented as SQL embedded inside BPEL. There is also a dependency on a particular runtime environment as the BPEL engine must support this non-standard extension, and the data processing activities are limited to SQL. In typical situations, external programs and other non-SQL mechanisms are extensively used to complete even the most trivial data warehouse processing. Also, the “dialect” of SQL used varies between different vendors, which makes it difficult to provide a standardized and useful extension to BPEL.
In general, in one aspect, the invention provides methods and apparatus, including computer program products, implementing and using techniques for integrating control activities and data transformation activities in a process flow. A data transformation activity is invoked through local invocation. The data transformation activity is part of a process flow defined in a standard business process execution language format and is invoked from within the process flow.
Advantageous implementations can include one or more of the following features. The local invocation can be done by calling a service for local invocation in a data transformation processing subsystem. One or more data transformation activities can be defined as part of the process flow. One or more process-scoped variables can be generated to exchange information between a data transformation system and a process control engine that executes the process flow. A template partner link definition can be provided as part of the process flow to invoke a remote data transformation system. The template partner link definition can be used as a placeholder partner link when the remote data transformation system has not been identified at design time of the process flow. The template partner link definition can be used as a map to predefined partner links when the remote data transformation system has been identified at design time of the process flow.
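By way of illustration only, the two uses of the template partner link definition described above can be sketched as follows. The class names, the `bind` and `resolve` methods, and the example endpoints are hypothetical and are not part of any BPEL engine; the sketch merely assumes that a template either maps to predefined partner links known at design time or acts as a placeholder that is mapped to an actual system later.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PartnerLink:
    """A concrete endpoint of a remote data transformation system (illustrative)."""
    name: str
    endpoint: str


class TemplatePartnerLink:
    """Hypothetical template: a placeholder, or a map to predefined partner links."""

    def __init__(self, predefined: Optional[Dict[str, PartnerLink]] = None):
        # Predefined links exist only when the remote system is known at design time.
        self._predefined = dict(predefined or {})
        self._bound: Optional[PartnerLink] = None

    def bind(self, link: PartnerLink) -> None:
        """Map the placeholder to an actual system after design time."""
        self._bound = link

    def resolve(self, name: Optional[str] = None) -> PartnerLink:
        """Return a predefined link by name, else the currently bound placeholder."""
        if name is not None and name in self._predefined:
            return self._predefined[name]
        if self._bound is not None:
            return self._bound
        raise LookupError("placeholder partner link has not been mapped yet")


# Usage: the same process definition is pointed first at a test system,
# then at a production system, without changing the flow itself.
template = TemplatePartnerLink()
template.bind(PartnerLink("etl", "http://test.example.invalid/etl"))
template.bind(PartnerLink("etl", "http://prod.example.invalid/etl"))
```

Rebinding the placeholder is what would allow a user to switch between systems; a real engine would perform this mapping at deployment time rather than in application code.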
An abstraction layer can be provided as part of the process flow. The abstraction layer can include a control flow level and a dataflow level. The dataflow level can contain specifics about data transformations. The control flow level can contain specifics about dependencies and an execution sequence of the dataflow. The control flow level can include one or more transactional and error recovery options. The process flow can be deployed into a system comprising a process control engine for executing the process flow and one or more data transformation systems for performing the data transformations. The process control engine can be a business process execution language engine. The data transformation activities can include one or more of: an extract, transform and load activity and a data movement that is a precursor to an extract, transform and load activity. A data transformation activity can be invoked through remote invocation, when the data transformation activity is to be performed on a remote system. The invocation can be done by calling a service for remote invocation in a data transformation processing subsystem. Each dataflow activity and each control activity can have a unique identifier.
In general, in another aspect, the invention provides a system for executing a process flow including one or more control activities and one or more data transformation activities. The system includes a process control engine, a data transformation subsystem and a control data repository. The process control engine executes activities included in the process flow. The data transformation subsystem stores domain specific definitions of data transformation processes of data in one or more databases. The control data repository stores domain specific activity information relating to the process flow.
Advantageous implementations can include one or more of the following features. The control data repository can store one or more of: runtime data for the process flow, auditing information for the process flow, and statistics generated from the execution of the activities in the process flow. The process control engine can include a business process execution language interface and a web services description language interface for communicating with other system components. The data transformation subsystem can be invoked as a local service or a remote service. The system can generate one or more process-scoped variables operable to exchange data between the process control engine and the data transformation subsystem. The process control engine can be a business process execution language engine and the process flow is described in a business process execution language. The process flow can include a template partner link definition to be used as a placeholder partner link when the remote data transformation system has not been identified at design time of the process flow, and to be used as a map to predefined partner links when the remote data transformation system has been identified at design time of the process flow.
The invention can be implemented to include one or more of the following advantages. The various embodiments of the invention enable a simplified integration of ETL or data integration activities in a process flow. The various embodiments of the invention provide a user with an option of doing either remote partner links or local method calls, with the complexity hidden from the user, as opposed to the default BPEL way of using partner links only.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features and advantages of the invention will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
Overview
The various embodiments of the invention provide simplified integration of control and data activities in a process flow, by extending the BPEL language to represent a combination of processing nodes and data transformation activities. A simplified definition of data transformation activities is provided as part of a BPEL-based process. BPEL-process scoped variables are automatically generated in order to exchange status information between an ETL system and a BPEL choreography engine. A template partner link definition is provided for remote ETL web service invocations. The template partner link definition is used as a placeholder partner link (for example, to a specific hostname or IP address) when the service distribution is not known at design time. The user can later map this placeholder to the actual system that will be used. This feature also allows the user to easily switch between different systems, for example, from a test system to a live or production system. Alternatively, the template partner link definition can be used to map predefined partner links when the service distribution is known at design time. Finally, an option for local (that is, same system) execution is provided in order to avoid web services administration overhead. For example, one option for local execution is to use a direct Java method call instead of using partner links, which avoids the communication overhead that is necessary for partner links (and web services in general). The interaction between these different components will now be described in greater detail by way of example in the context of an exemplary BPEL for ETL system (BPEL4ETL), which is schematically shown in
The BPEL4ETL System
The BPEL4ETL system (100) is an extended BPEL framework that enables users to intermix typical data warehousing processing, such as ETL jobs, as needed with other more process-control oriented activities in a BPEL process flow. The BPEL4ETL system (100) includes a BPEL engine (102), an extension ETL processing subsystem (104), a control data repository (106), a logical data source (108), and a logical data target (110). The BPEL engine (102) processes a BPEL4ETL file, which contains ETL activity definitions for a particular process, and as a result of the processing triggers the execution of these activities. The BPEL engine (102) has a WSDL interface and a BPEL interface through which the BPEL engine (102) can communicate with other remote components in the BPEL4ETL system (100).
It should be noted that although only one logical data source (108) and one logical data target (110) are shown in
The extension ETL processing subsystem (104) can be invoked either locally as a local ETL service (112) or remotely as a web ETL service (114) and contains a domain-specific definition of common data warehousing processes involving the logical data source (108) and the logical data target (110). In some embodiments this domain-specific definition is provided as a two-level abstraction layer on top of the standard BPEL definition. The first level of the abstraction layer is the data flow level, which specifies the details of the data transformation or ETL. This is typically modeled using very specific techniques, which are known to those of ordinary skill in the art. The second level of the abstraction layer is the control flow level, which specifies the dependencies and the execution sequence of the data flows and activity nodes. The control flow level also includes transactional and other error recovery options applicable to such ETL activities, which are well-known to those of ordinary skill in the art. The extension ETL processing subsystem (104) communicates with remote components in the BPEL4ETL system (100) through a WSDL interface.
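The two-level abstraction layer described above can be pictured with a minimal sketch. All names here (`DataFlow`, `ControlFlow`, `depends_on`, `on_error`) are assumptions introduced for illustration; the sketch only shows the separation between a data flow level that holds transformation specifics and a control flow level that holds dependencies, execution sequence, and an error recovery option.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set


@dataclass
class DataFlow:
    """Data flow level: the specifics of one data transformation."""
    name: str
    steps: List[str]  # for example SQL statements or script names (illustrative)


@dataclass
class ControlFlow:
    """Control flow level: dependencies and execution sequence of data flows."""
    flows: Dict[str, DataFlow] = field(default_factory=dict)
    depends_on: Dict[str, Set[str]] = field(default_factory=dict)
    on_error: str = "rollback"  # hypothetical error recovery option

    def add(self, flow: DataFlow, after: Iterable[str] = ()) -> None:
        """Register a data flow and the flows that must complete before it."""
        self.flows[flow.name] = flow
        self.depends_on[flow.name] = set(after)

    def execution_order(self) -> List[str]:
        # TopologicalSorter expects a mapping of node -> predecessors.
        return list(TopologicalSorter(self.depends_on).static_order())


control = ControlFlow()
control.add(DataFlow("extract", ["read source tables"]))
control.add(DataFlow("transform", ["cleanse rows"]), after=["extract"])
control.add(DataFlow("load", ["write target tables"]), after=["transform"])
print(control.execution_order())
```

Keeping the dependency graph apart from the transformation steps means the sequence can be changed without touching the data flow definitions, which is the point of separating the two levels.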
The control data repository (106) stores ETL domain specific activity information, such as ETL process information (118), and in some embodiments runtime ETL data (116), such as auditing information and statistics from activity executions, so that this information can be obtained and evaluated at a later time, if need be. As will be described in further detail below, in some embodiments of the BPEL4ETL system (100), each instance of an extended BPEL process is associated with a unique identifier. Furthermore, each activity referenced in a BPEL process is also associated with an identifier that is unique to the activity in the specific BPEL process instance. It should, however, be realized that in other embodiments it is possible to use completely unique identifiers, both for the process instances and data activity instances.
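One possible identifier scheme consistent with the above description is sketched below. The use of a random UUID for the process instance and a per-instance counter for activities is an assumption made for illustration, as are the names `ProcessInstance`, `new_activity_id`, and `qualified_id`.

```python
import itertools
import uuid


class ProcessInstance:
    """Hypothetical extended BPEL process instance with scoped activity identifiers."""

    def __init__(self) -> None:
        # Identifier that is unique to this process instance.
        self.instance_id = uuid.uuid4().hex
        self._counter = itertools.count(1)

    def new_activity_id(self) -> str:
        """Return an identifier that is unique only within this process instance."""
        return f"act-{next(self._counter)}"

    def qualified_id(self, activity_id: str) -> str:
        """Combine both identifiers to obtain a completely unique identifier."""
        return f"{self.instance_id}:{activity_id}"
```

Two instances may hand out the same activity identifier; combining it with the instance identifier corresponds to the alternative embodiments that use completely unique identifiers.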
The BPEL4ETL Process Definition
As was described above, a BPEL flow instance can have many attribute values as well as variable definitions in addition to the references to the different activities, which together describe the execution semantics. For the BPEL4ETL system (100), additional information is required, and hence a specific data structure, herein referred to as a “BPEL4ETL process definition” is provided in accordance with various embodiments of the invention. The BPEL4ETL process definition includes:
This BPEL4ETL process definition is included in the interface definitions of the different components of the BPEL4ETL system (100). In some embodiments the BPEL4ETL process definition is included in the WSDL interface definition, the WSDL client interface, and the BPEL file. The WSDL interface definition contains the entry and exit points and the message definitions for the BPEL process. The WSDL client interface is used to identify the ETL services provided on remote ETL systems that are used as partners. The BPEL file contains the flow and activity definitions and uses the message and partner descriptions exposed through the WSDL files. Having described the BPEL4ETL system (100) and the BPEL4ETL process definition, it will now be explained how a user can add ETL activities to BPEL flows.
Adding ETL Activities to BPEL Flows
As is well known to those of ordinary skill in the art, there exists a variety of graphical user interface (GUI) editors for BPEL, such as the WebSphere Integration Developer provided by IBM Inc. of Armonk, N.Y., the BPEL Process Manager provided by Oracle Inc. of Redwood Shores, Calif., and various types of open source BPEL editors. In various embodiments, these GUI editors can be extended to include functionality for adding ETL activities to BPEL flows, so that a user can select specific ETL activity types and create instances of them to be added to a currently selected BPEL instance. Alternatively, completely new GUI editors can be created that include editing functionality for BPEL4ETL processes. The specific editing mechanisms that are used by such editors are well known to those of ordinary skill in the art and are beyond the scope of this application.
If it is determined in step 406 that the activity is an ETL activity, then the process stores the ETL specific attributes, including the auto-generated activity identifier from step 404, as extension metadata to the BPEL information, and creates a state variable for the activity (step 408). Next, the process examines whether the ETL activity is a local activity (step 410). If the ETL activity is a local activity, the process adds a BPEL-J Java™ activity, specifying the ETL extension system's entry-point Java™ method, in order to indicate a local invocation (step 412). If the ETL activity is not a local activity, that is, the ETL activity is a remote activity, an assign activity and an invoke activity are placed in the BPEL sequence, which represent the remote activity (step 414).
After completing step 412 or step 414, respectively, the process checks whether the entire BPEL process is complete (step 416). If the BPEL process is not complete, the process returns to step 402, where the next BPEL flow activity is created, and then continues as described above. If it is determined in step 416 that the BPEL process is complete, a BPEL4ETL file is generated and saved for later execution (step 418), which completes the process (400).
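The loop of steps 402 through 418 can be summarized in a short sketch. The dictionary layout, the `ActivitySpec` type, and the entry-point name `EtlExtension.run` are hypothetical; the handling of non-ETL activities is likewise an assumption, since only the ETL branch is detailed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActivitySpec:
    """Illustrative description of one activity chosen by the user."""
    name: str
    is_etl: bool = False
    is_local: bool = True
    attributes: Dict[str, str] = field(default_factory=dict)


def build_bpel4etl(specs: List[ActivitySpec]) -> dict:
    """Assemble an in-memory stand-in for a BPEL4ETL file."""
    sequence: List[dict] = []
    extension_metadata: Dict[str, dict] = {}
    state_variables: List[str] = []

    for index, spec in enumerate(specs, start=1):      # step 402: next activity
        activity_id = f"activity-{index}"              # step 404: auto-generated id
        if not spec.is_etl:                            # step 406: ETL activity?
            # Assumption: ordinary BPEL activities are added unchanged.
            sequence.append({"kind": "bpel", "id": activity_id, "name": spec.name})
            continue
        # Step 408: store ETL attributes as extension metadata, create state variable.
        extension_metadata[activity_id] = dict(spec.attributes, name=spec.name)
        state_variables.append(f"{activity_id}_state")
        if spec.is_local:                              # step 410: local activity?
            # Step 412: a Java activity naming the (hypothetical) entry-point method.
            sequence.append({"kind": "java", "id": activity_id,
                             "entry_point": "EtlExtension.run"})
        else:
            # Step 414: an assign activity followed by an invoke activity.
            sequence.append({"kind": "assign", "id": activity_id})
            sequence.append({"kind": "invoke", "id": activity_id})

    # Step 418: the completed definition, ready to be saved for later execution.
    return {"sequence": sequence,
            "extension_metadata": extension_metadata,
            "variables": state_variables}
```

A remote ETL activity therefore contributes two entries to the sequence, while a local one contributes a single entry.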
It should be noted that in some embodiments, users can add extension metadata to activities in the BPEL process, either in the same BPEL file or in an external associated file. In other embodiments, extensions that do not conform to BPEL specifications can be stored as XML or Java™ comments (in the case of BPEL-J systems) in the BPEL file. For BPEL4ETL processes, these comments are pre-processed during deployment by the ETL specific extension pre-processor, which extracts the comments and makes them available in the ETL control data repository (106). This makes it possible to exchange the content of an activity without the need to redeploy the BPEL process itself.
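A pre-processor of the kind described above might, as one possibility, walk the BPEL file and collect specially marked XML comments. In the following sketch the `etl:` marker and the `key=value` comment layout are assumptions chosen purely for illustration, and a plain dictionary stands in for the control data repository (106).

```python
from typing import Dict
from xml.dom import minidom

PREFIX = "etl:"  # hypothetical marker identifying ETL extension comments


def extract_etl_comments(bpel_xml: str) -> Dict[str, Dict[str, str]]:
    """Collect ETL extension comments, keyed by activity identifier."""
    repository: Dict[str, Dict[str, str]] = {}

    def walk(node) -> None:
        if node.nodeType == node.COMMENT_NODE:
            text = node.data.strip()
            if text.startswith(PREFIX):
                parts = text[len(PREFIX):].split()
                if parts:
                    activity_id, pairs = parts[0], parts[1:]
                    repository[activity_id] = dict(
                        pair.split("=", 1) for pair in pairs if "=" in pair
                    )
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(bpel_xml))
    return repository


SAMPLE = """<process name="demo">
  <sequence>
    <!-- etl:activity-2 source=SALES target=DW -->
    <invoke name="load"/>
    <!-- an ordinary comment that is ignored -->
  </sequence>
</process>"""

print(extract_etl_comments(SAMPLE))
```

Because the extracted settings live outside the deployed process, they can be replaced in the repository while the BPEL process itself stays deployed.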
Executing a BPEL Process Including ETL Activities
If the activity is an ETL activity, then the BPEL engine (102) triggers the execution of the ETL activity through invocation of either the local ETL service (112) or the remote web ETL service (114) (step 508). The invoked ETL service accesses ETL specific control data from the control data repository (106) (step 510) and performs the requested data transformation (step 512) based on the control data. Next, the process returns the control to the BPEL engine (102) (step 514). The BPEL engine (102) checks whether all the activities in the BPEL4ETL file have been processed (step 516). If there are still more activities to be processed, the process continues to step 504 where the next activity is executed, as described above. If it is determined in step 516 that there are no more activities to be processed, the process (500) ends.
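The execution loop can be sketched, under stated assumptions, as a dispatcher that hands each ETL activity to a local or a remote service. The function names, the activity dictionaries, and the string results are hypothetical; the remote service is simulated by an ordinary function rather than an actual web service call, and non-ETL activities are simply recorded.

```python
from typing import Callable, Dict, List

Repository = Dict[str, Dict[str, str]]
EtlService = Callable[[str, Repository], str]


def local_etl_service(activity_id: str, repository: Repository) -> str:
    settings = repository[activity_id]                      # step 510: control data
    return f"local {settings['source']}->{settings['target']}"   # step 512


def remote_etl_service(activity_id: str, repository: Repository) -> str:
    settings = repository[activity_id]                      # step 510: control data
    return f"remote {settings['source']}->{settings['target']}"  # step 512


def run_process(activities: List[dict], repository: Repository,
                local: EtlService = local_etl_service,
                remote: EtlService = remote_etl_service) -> List[str]:
    """Execute activities in order and return a log of what was run."""
    log: List[str] = []
    for activity in activities:                             # step 504: next activity
        if activity.get("etl"):
            # Step 508: invoke either the local or the remote ETL service.
            service = local if activity.get("local") else remote
            log.append(service(activity["id"], repository))
            # Step 514: control returns to the engine when the call returns.
        else:
            log.append(f"bpel {activity['id']}")
    return log                                              # step 516: all processed
```

In this sketch the choice between local and remote invocation is a single attribute of the activity, so the surrounding flow is unaffected by where the transformation actually runs.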
Example BPEL4ETL
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, and so on.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
A number of implementations of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.
Number | Date | Country | |
---|---|---|---|
20080115135 A1 | May 2008 | US |