A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention generally relates to data warehouses and business intelligence, and particularly to supporting extract, transform, and load metadata in a business intelligence server.
In the context of computer software, and particularly computer databases, the term “data warehouse” is generally used to refer to a unified data repository for all customer-centric data. A data warehouse environment tends to be quite large. The data stored in the data warehouse can be cleaned, transformed, and catalogued. Such data can be used by business professionals for performing business related operations, such as data mining, online analytical processing, and decision support. Typically, a data warehouse can be associated with extract, transform, and load (ETL) processes and business intelligence tools. Extract, transform, and load (ETL) is a process of extracting data from source systems and bringing it into a data warehouse. Generally, the ETL process includes extracting data from outside sources, transforming the data to fit operational needs, and loading the data into an end target database or data warehouse. A data warehouse environment tends to be very large. As such, designing and maintaining the ETL process is often considered one of the more difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process. Some data warehouse builders provide ETL capabilities and take advantage of inherent database abilities. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database. This is the general area that embodiments of the invention are intended to address.
In accordance with an embodiment, a business intelligence (BI) server maintains a plurality of metadata objects to support the extract, transform and load (ETL) processes. These metadata objects includes a transparent view object, which takes a joined set of source tables and represents a data shape of the joined set of source tables using a transformation, and a ETL mapping association object that maps the transformation contained in the transparent view object to a target table. The BI server can then orchestrate the movement of data from source systems into the target data warehouses in a source and target system agnostic way.
The present invention is illustrated, by way of example and not by way of limitation, in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” or “some” embodiment(s) in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
As described herein, a data warehouse can be used to store critical business information. Business intelligence (BI) applications running on top of the data warehouse can provide powerful tools to the users for managing and operating their business. These BI tools can not only help the users run their day-to-day business, but also help the users make critical tactical, or even long term strategic, business decisions.
There can be different types of BI applications used in the enterprise environment, such as sales, marketing, supply chain, financial, and human resource applications. An application framework, such as ADF, can be used to implement the different types of BI applications. Each BI application can store and use one or more application data objects in its own application data store, outside of the data warehouse.
A BI server can reside between the BI applications and the data warehouse. The BI server allows the BI applications to use high-level analytical queries to scan and analyze large volumes of data in the data warehouse using complex formulas, in order to provide efficient and easy access to information required for business decision making. The BI applications can rely on the BI server to fulfill its analytic requirement.
A data warehouse can be sourced from multiple data source systems associated with the BI applications. As such, a BI server can associate an entity in the target data warehouse with data objects from multiple data sources, by extracting data from the various data sources into a single staging area, where the data conformance is performed before the conformed data can be loaded into the target data warehouse.
Furthermore, when BI applications make changes, or extensions, on the application data objects in application data store. The BI server can propagate the changes and the extensions on the application objects in the application framework to the underlying data warehouse that stores the data in the application objects.
The BI server uses extract, transform, and load (ETL) processes to extract data from the outside data sources, transform the source data to fit operational needs, and load the data into the target data warehouse. ETL metadata can be used to define and manage the ETL processes associated with the data warehouse. Such metadata is essential to the data warehouse and the BI systems on top of the data warehouse. An administration tool on the BI server allows a user to interact with the BI server, and manage the extension process of the underlying data warehouse through metadata.
In accordance with an embodiment, ETL processes can be based on different types of conceptually independent metadata: such as data transformation logic metadata, data flow metadata, and task execution metadata.
The data transformation logic metadata can specify the data transformations, such as joins between participating entities, expressions etc., to logically construct an entity such as a target table on a target system based on one or more entities from the participating source systems.
The data flow metadata can specify metadata properties and functionalities to allow data to flow through the defined transformation steps. The data flow metadata captures a specific set of properties that are related to ETL runs. One exemplary data flow metadata can specify whether a ETL run is incremental or full; another exemplary data flow metadata can specify whether a table should be joined or not.
The task execution metadata can specify actual execution of the ETL scripts to move data from the various sources to a target. The task execution metadata can analyze the task dependency and generate plans for parallelization.
Other types of ETL metadata can include execution management metadata, project metadata, and scheduling metadata. The execution management metadata comprises different types of metadata related to ETL execution workflow management. The project metadata allows the user to group together and execute a collection of data flows. The scheduling metadata evaluates the supported features and implementation.
In accordance with an embodiment, the project metadata includes set definition that is driven by facts selected by the users for extract. Using the set definition, the system can analyze dependencies and pull in related artifacts that need to participate in the ETL processes, such as base facts, dimensions. The system can exclude and/or include additional target artifacts, and allows a user to persist and maintain a customized set.
As shown in
In accordance with an embodiment, a BI server allows the administrator to capture each of the transformation steps shown in
In accordance with an embodiment, a business intelligence (BI) server can use ETL data transformation logic metadata to orchestrate the movement of data from source systems into the target data warehouses in a source and target system agnostic way. The ETL data transformation logic metadata can be structured and declarative metadata to facilitate easy maintenance and improve understandability.
In accordance with an embodiment, the TV objects can be completely database agnostic, and extremely flexible. In an embodiment, a TV object can represent a data shape of multiple different source tables that generates a SQL construct, such as a select physical SQL statement. In another embodiment, the TV objects, which are not execution data structures, can store declarative rules that describing how ETL data transformations happen.
In accordance with an embodiment, the TV objects can be defined in the context of a physical source. Users are able to specify operations such as: joins, expression based derived columns, and filters. In one example, the TV object can be implemented in a similar manner to Logical Table Sources (LTS), which allows an administrator to create a logical table by transforming one or more physical tables from one or more sources.
In accordance with an embodiment, the BI server allows users to span a TV object across multiple databases and tables, so that users can progressively build the data shape by nesting objects within each other. A nested TV object can be joined with other physical layer objects, such as source tables.
In the example as shown in
As shown in
In accordance with an embodiment, a code generator can read the TV objects and the EMA object to generate one or more ETL scripts. In another embodiment, TV objects can participate in complex ETL data transformation process, such as three-way merge project and project extract, since users can select a TV object and easily visualize all source and target links associated with the TV object, from an immediate link to the complete graph.
In accordance with an embodiment, ETL data flows can be broken down into several steps. There can be an extract step, which handles source to staging transformations. There can also be a load step, which handles staging to target transformations, and another step for post load data transformations.
In accordance with an embodiment, ETL data flow metadata can be independent of the actual transformation steps. The data flow metadata can capture the operational steps that need to be implemented before or after a transformation step.
The data flow metadata can specify physical structure maintenance. In an embodiment, ETL data flow can split the load operations into update operation for existing rows and insert operation for new rows. The administrators can run a sequence of operations improves performance, since bulk inserts on an indexed structure can be very slow. The sequence of operations can include: 1) ‘Update’ load, 2) drop indices, 3) run the insert statement either via SQL or via a fast load mechanism, and 4) recreate indices.
The data flow metadata can distinguish an incremental load from a full load. In an embodiment, in order to facilitate an incremental load, the source system can have a ‘Last Updated Date’ column. A filter can be added to the query to ensure that only rows updated after a certain point are considered for extraction. The metadata for incremental load enablement, for example the ‘Last Updated Date’ column, can be captured as a metadata property within the transparent view metadata structure. The preference to run either an incremental load or a full load can be defined as a part of the data flow metadata.
The data flow metadata can specify additional data flow properties, such as currencies that a deployment wants to report on. Transactional systems can have two currencies: a local currency and a global currency. The local currency records the transaction in the actual currency that it was exercised under. The global currency is a single currency (For example USD, or EUROs) in which all transaction amounts are recorded. The data flow can convert the global currency to the desired target currency, in order to fulfill the reporting currency conversion requirements. So that, the problem of converting many local currencies to many target currencies is reduced to a simpler problem of converting one global currency to many target currencies. In this example, the currency table registration, which joins between target fact tables and the currency conversion table, can be captured in transparent view metadata. The choice of the actual reporting currencies can be captured and handled within the data flow.
The data flow metadata can also specify partitioned workflows. Some data warehouse implements capabilities for parallelizing the full loads of large fact tables by partitioning the load into multiple parallel loads. In order to achieve such parallelism, users can make multiple copies of the ETL maps, one map for each partition.
In accordance with an embodiment, there can be a clean separation between the data transform logic metadata and ETL data flow metadata. Users can either invoke the same data transform logic via multiple work flows, or invoke the data transform logic via a parameterized ETL workflow, which can be executed in parallel for each set of parameters.
In accordance with an embodiment, there can be different approaches to support the ETL execution, such as an ETL code generation approach and a BI Server ETL execution approach. Using the ETL code generation approach, a BI Server can generate the ETL scripts for a desired third party, such as vendors of choice. At runtime, the ETL tool can carry out the ETL execution, with BI Server acted as a data source. The ETL code generation approach allows ETL vendors to implement various optimization techniques that the BI Server may not support, for example, non-SQL fast load and parallel loads. Additionally, the ETL vendors can allow fine grained options, in terms of performance and functionality.
Using the BI Server ETL execution approach, a BI server works as the ETL execution engine. The BI Server can be responsible for interacting with the source and target directly, executing the various transform steps (backed with internal execution capabilities), and load data in the target and build/maintain the related physical artifacts, such as indices etc. The BI Server ETL execution approach eliminates the need to install, deploy and maintain another product and a metadata repository. Every time a user by-passes the BI Server, the user risks to increase the total cost of ownership, since these by-passes needs to be manually patched and upgraded.
In accordance with an embodiment, these two approaches can be used together for expediency and risk mitigation reasons. For example, a BI Server can support code generation for ETL and perform minimum required execution capabilities. The BI server can also provide the extensions required by the content developers to express transforms. The BI Server allows users to select certain target objects and have the ETL scripts generated for these target objects. Users can then have these scripts executed via the ETL vendor's execution engine. The BI System can provide extensibility updates support to these scripts, and the execution management support for these scripts. In an embodiment, the BI System allows the users to edit these scripts manually via the ETL designer's user interface.
In accordance with an embodiment, a BI server can use an externalized user interface (UI) to support a variable number of ETL mapping types. Using the externalized UI, the ETL mapping types can be extended without changing the underlying UI implementation software source code. In an embodiment, each ETL mapping type can be defined via XML declarations. Additionally, the BI server can support a set of data manipulation language (DML) options, with each ETL mapping type exposing a subset of the DML options.
The following Listing 1 is an exemplary XML file that defines an EMA UI.
The following Listing 2 is an exemplary schema associated wit the XML file that defines the EMA UI.
In accordance with an embodiment, the layout algorithm can first read the dependency graph. The layout algorithm can throw an error for circular dependencies. In an embodiment, the row and column positions can be readjusted based on the options visible in the externalized ETL mapping UI. When the value of an option changes, the algorithm can go through the dependencies again and redo the layout or enable the controls as required.
Based on the same underlying implementation software source code, the BI server can generate different UIs to support different ETL mapping types.
In accordance with an embodiment, all related ETL objects can be modeled and queried together. A dialog can manage TV objects and EMA objects by filtering objects by target tables, TV objects, and dependencies. Additionally, since each EMA object corresponds to one target table, a physical layer UI can show the TV objects and EMA objects in a tree representation. In one example, the EMA object and the corresponding TV objects can be shortcuts to the actual objects, since they can be replicated across different target tables.
The present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.
In some embodiments, the present invention includes a computer program product which is a storage medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the present invention. The storage medium can include, but is not limited to, any type of disk including floppy disks, optical discs, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. The code examples given are presented for purposes of illustration. It will be evident that the techniques described herein may be applied using other code languages, and with different code.
The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.
This application is a continuation of U.S. patent application Ser. No. 13/100,255, filed May 3, 2011 entitled “SYSTEM AND METHOD FOR ENABLING EXTRACT TRANSFORM AND LOAD PROCESSES IN A BUSINESS INTELLIGENCE SERVER”, and which application claims the benefit of priority to U.S. Provisional Patent Application No. 61/349,710, entitled “SYSTEM AND METHOD FOR ENABLING EXTRACT TRANSFORM AND LOAD (ETL) PROCESSES IN A BUSINESS INTELLIGENCE (BI) SERVER” filed May 28, 2010, which applications are hereby incorporated by reference in its entirety. This application is related to the following applications which are incorporated herein by reference: U.S. Patent Application Ser. No. 12/711,269, entitled “GENERATION OF STAR SCHEMAS FROM SNOWFLAKE SCHEMAS CONTAINING A LARGE NUMBER OF DIMENSIONS” by Samir Satpathy, filed on Feb. 24, 2010; U.S. patent application Ser. No. 13/100,245, entitled “SYSTEM AND METHOD FOR PROVIDING DATA FLEXIBILITY IN A BUSINESS INTELLIGENCE SERVER USING AN ADMINISTRATION TOOL” by Raghuram Venkatasubramanian et al., filed on May 3, 2011; U.S. patent application Ser. No. 13/100,248, entitled “SYSTEM AND METHOD FOR SPECIFYING METADATA EXTENSION INPUT FOR EXTENDING A DATA WAREHOUSE” by Raghuram Venkatasubramanian et al., filed on May 3, 2011; and U.S. patent application Ser. No. 13/100,249 entitled “SYSTEM AND METHOD FOR SUPPORTING DATA WAREHOUSE METADATA EXTENSION USING AN EXTENDER” by Raghuram Venkatasubramanian et al., filed May 3, 2011.
Number | Date | Country | |
---|---|---|---|
61349710 | May 2010 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 13100255 | May 2011 | US |
Child | 17030149 | US |