The present invention generally relates to multidimensional analysis programs for relational databases. More specifically, this invention pertains to a system and method for automatically creating OLAP (multidimensional) metadata objects from a relational database. This metadata can be used by OLAP products, such as relational OLAP (ROLAP), hybrid OLAP (HOLAP), and multidimensional OLAP (MOLAP), or by the relational database itself.
Relational database management systems (RDBMSs) have been in existence for many years. Historically, these database systems have had limited metadata. Though there was some metadata describing the tables, views and columns, little information existed about the relationships between tables and columns. Much of the semantic information for the database existed either in the user's concept of the database or possibly in a warehouse product.
In recent years, on-line analytical processing (OLAP) has become popular. OLAP systems provide more extensive metadata to describe multidimensional abstractions such as dimensions and hierarchies. Some commercial software products have been written to support OLAP.
The implementation of OLAP requires the introduction of additional metadata objects. Databases have been extended to support OLAP by introducing new objects on top of the relational database. Typically, a mechanism is provided to database administrators for defining OLAP objects. Software products use these objects to provide functional or performance improvements. However, the task of defining OLAP objects can be very time consuming. Many of these databases are very large, with many objects, fields, relationships, etc. Furthermore, many older databases have no documentation on the database design and organization. Consequently, database administrators must first analyze their database and then document its characteristics prior to defining the metadata. This time-consuming process requires skilled labor, and can be very expensive. In addition, without access to the original database programmer or documentation, the process of defining OLAP objects can be prone to error.
Users of relational databases having no concept of metadata may wish to take advantage of the higher, multidimensional capability of programs such as OLAP. However, OLAP objects must first be created for those relational databases. What is therefore needed is a system and associated method for quickly, efficiently, and automatically creating the OLAP objects for a relational database that does not have OLAP objects. The need for such a system and method has heretofore remained unsatisfied.
The present invention satisfies this need, and presents a system, a computer program product, and an associated method (collectively referred to herein as “the system” or “the present system”) for automatically building an OLAP model in a relational database. The present system automatically creates OLAP (multidimensional) metadata objects from a relational database. This metadata can be used by OLAP products, such as relational OLAP (ROLAP), hybrid OLAP (HOLAP), and multidimensional OLAP (MOLAP), or by the relational database itself (collectively referred to herein as “OLAP”).
The present system eases the transition from a non-OLAP relational database to one with OLAP objects. It is one feature of the present system to efficiently, quickly, and automatically create the OLAP structure for existing large databases.
The present system automatically generates OLAP objects from SQL statements without involving the database administrator, thus eliminating the need for database administrators to perform this analysis and design. The present system builds OLAP objects by analyzing stored SQL statements. Although SQL statements do not explicitly contain OLAP objects, it is possible to analyze the syntax of SQL statements, individually and in aggregate, to determine the OLAP objects.
The present system deconstructs or parses the SQL statement into tables and aggregate metrics for measures and joins. The system recognizes that relational database systems contain tables that function as facts and dimensions. Over many SQL statements, fact tables will have a large measure metric while dimension tables will have a low measure metric.
Tables are linked based on large join metrics, while relatively smaller join metrics are ignored. The present system builds the OLAP cube model from the fact tables, dimension tables, and joins. Within a dimension table there may exist one or more hierarchies; the analysis of the SQL statements allows the present system to map the hierarchies within the dimension table. In addition, the analysis of SQL statements provides the present system information about attribute relationships and cubes.
Three exemplary schemas exist for fact and dimension tables. The first schema is the star schema with all the dimension tables joined directly to the fact table. All the hierarchy information is contained within the dimension tables. The second schema is the snowflake schema with some dimensions connected directly to the fact table. Other dimensions are connected to the dimensions that connect to the fact table in a long dependency chain. Consequently, the hierarchy is spread over the dependency chain as opposed to being contained in one dimension table. The third schema is the known configuration of a dimension inside a fact table. The present system creates OLAP objects for these and/or other schemas. The result could be one or a combination of those three schemas. The dimension tables and fact tables can also be a combination of tables.
The metadata objects of the present system describe relational information as intelligent OLAP structures, but these metadata objects are different from traditional OLAP objects. The metadata objects of the present system store metadata, the information about the data in the base tables of the relational database. These metadata objects describe where pertinent data is located and further describe relationships within the base data. For example, a fact object is a specific metadata object that stores information about related measures, attributes and joins, but does not include the data specifically from the base fact table.
Each metadata object completes a piece of the larger picture that helps explain the meaning of the relational data. Some objects act as a base to directly access relational data by aggregating data or directly corresponding to particular columns in relational tables. Other objects describe relationships between the base metadata objects and link these base objects together. Ultimately, the present system groups all of the objects together by their relationships to each other into a multidimensional metadata object called a cube model.
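The relationship between these metadata objects can be pictured with a small data-structure sketch. The class and field names below are illustrative assumptions and are not the catalog format of the present system; the sketch only shows that a fact object holds descriptions of measures, attributes, and joins rather than the base table data itself.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    table: str
    column: str


@dataclass
class Measure:
    name: str
    aggregation: str  # aggregation function, e.g. "sum"
    table: str
    column: str


@dataclass
class Join:
    left: Attribute
    right: Attribute
    operator: str = "="
    join_type: str = "inner"
    cardinality: str = "N:1"


@dataclass
class Facts:
    name: str
    measures: list = field(default_factory=list)
    attributes: list = field(default_factory=list)
    joins: list = field(default_factory=list)


@dataclass
class Dimension:
    name: str
    attributes: list = field(default_factory=list)
    joins: list = field(default_factory=list)


@dataclass
class CubeModel:
    name: str
    facts: Facts
    dimensions: list = field(default_factory=list)
    joins: list = field(default_factory=list)  # facts-to-dimension joins


# Hypothetical example: one measure, one dimension, one join.
fact_key = Attribute("StoreID", "sales", "store_id")
dim_key = Attribute("StoreID", "store", "store_id")
sales = Facts("sales", measures=[Measure("revenue", "sum", "sales", "dollars_sold")],
              attributes=[fact_key])
store = Dimension("store", attributes=[dim_key])
cube_model = CubeModel("sales_model", sales, [store], [Join(fact_key, dim_key)])
```

The objects carry only names, locations, and relationships, so the cube model stays small regardless of the size of the base tables.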
A cube model comprises a set of tables, within a database, that represent facts and dimensions.
It is one objective of the present system to map SQL queries to OLAP object metadata.
Another objective of the system is to provide bridges between various OLAP vendor products and relational databases to allow the exchange of metadata. Providing bridges that allow importing of metadata into the present system helps customers quickly adapt their databases to OLAP capability. Many relational database users have not used OLAP products yet and therefore have no OLAP objects. However, these customers have SQL statements they have been executing for years. By analyzing these SQL statements and generating corresponding OLAP objects, these users can more quickly utilize the advantages of an OLAP-based system. Furthermore, the present system provides a mechanism for populating the metadata catalogs for the present system with minimal human interaction.
The algorithms of the present system could apply to any SQL statement and any OLAP object. The SQL statement could be for anything that can be accessed via SQL. OLAP objects could be produced for any OLAP product. The particular implementation of the algorithms in the present system focuses on mapping SQL from relational databases into OLAP objects, aiming to produce usable OLAP objects with no (or minimal) interaction required by the database administrator. The mapping produces functional cube models and cubes so that the metadata is immediately usable. The database administrator can later modify the OLAP objects as needed.
The various features of the present invention and the manner of attaining them will be described in greater detail with reference to the following description, claims, and drawings, wherein reference numerals are reused, where appropriate, to indicate a correspondence between the referenced items, and wherein:
The following definitions and explanations provide background information pertaining to the technical field of the present invention, and are intended to facilitate the understanding of the present invention without limiting its scope:
API: Application Program Interface, a language and message format used by an application program to communicate with the operating system or some other control program such as a database management system (DBMS) or communications protocol.
Attribute: Represents the basic abstraction performed on the database table columns. Attribute instances are considered members in a multidimensional environment.
Attribute Relationship: describes relationships of attributes in general using a left and right attribute, a type, a cardinality, and whether the attribute relationship determines a functional dependency. The type describes the role of the right attributes with respect to the left attribute. Attributes that are directly related to the hierarchy attributes can be queried as part of the hierarchy, allowing each level of the hierarchy to define attributes that complement the information of a given level.
Cardinality: Typically refers to a count. A column cardinality refers to the number of distinct values in the column. The cardinality of a table would be its row count. For attribute relationship, the cardinality is expressed as the relationship of the counts of the attributes. A “1:N” implies that for each 1 instance on one side there are N instances on the other side.
Cube: A very precise definition of an OLAP cube that can be delivered using a single SQL statement. The cube defines a cube fact, a list of cube dimensions, and a cube view name that represents the cube in the database.
Cube Dimension: Used to scope a dimension for use in a cube. The cube dimension object references the dimension from which it is derived. It also references the cube hierarchy applied on the cube dimension. The joins and attribute relationships that apply to the cube dimension are kept in the dimension definition.
Cube Hierarchy: The purpose of a cube hierarchy is to scope a hierarchy to be used in a cube. The cube hierarchy object references the hierarchy from which it is derived, and a subset of the attributes from such hierarchy. A cube hierarchy object also references the attribute relationships that apply on the cube.
Cube Facts: Select an ordered list of measures from a specific fact object. The purpose of a cube fact is to give the flexibility to a cube to scope a cube model's fact. The structural information (joins and attributes) is kept in the fact object.
Cube Model: Groups facts and dimensions that are interesting for one or more applications. A cube model allows a designer to expose only relevant facts and dimensions to developers of an application. Cube models are intended for use by tools and applications that can handle multiple views of a specific dimension.
Dimension: Defines a set of related attributes and possible joins among the attributes. Dimensions capture all attribute relationships that apply on attributes grouped in the dimension and also reference all hierarchies that can be used to drive navigation and calculation of the dimension.
Facts: Stores a set of measures, attributes, and joins and groups related measures that are interesting to a given application. Facts are an abstraction of a fact table; however multiple database tables can be joined to map all measures in a fact object.
Hierarchy: Defines navigational and computational means of traversing a given dimension by defining relationships among a set of two or more attributes. Any number of hierarchies can be defined for a dimension. The relationship among the attributes is determined by the hierarchy type.
Join: Represents a relational join that specifies the join type and cardinality expected. A join also specifies a list of left and right attributes and an operator to be performed.
Measure: Makes explicit the existence of a measurement entity. For each measure, an aggregation function is defined for calculations in the context of a cube model, or cube.
Schema: A database design, comprised of tables with columns, indexes, constraints, and relationships to other tables. The column specification includes a data type and related parameters such as the precision of a decimal or floating-point number.
Snowflake Schema: A variation of a star schema in which a dimension maps to multiple tables. Some of the dimension tables within the schema join other dimension tables rather than the central fact table, creating a long dependency. The remaining dimension tables join directly to the central fact table.
SQL: Structured Query Language, a standardized query language for requesting information from a database.
Star Schema: A schema in which all the dimension tables within the schema join directly to the central fact table.
XML: eXtensible Markup Language. A standard format used to describe semi-structured documents and data. During a document authoring stage, XML “tags” are embedded within the informational content of the document. When the XML document is subsequently transmitted between computer systems, the tags are used to parse and interpret the document by the receiving system.
By analyzing the SQL statements 12 and information from a database catalog 14, system 10 produces information about facts, dimensions, hierarchies and other OLAP objects. A database, such as database 16, contains tables such as tables 18, 20, 22, and 24. An important aspect of the definition of the OLAP objects for database 16 is to identify the facts and dimensions within tables 18, 20, 22, and 24.
Existing relational database systems have introduced the concept of sampling data. A query is executed, but rather than fetching all the data, only a representative subset is returned. This facilitates the analysis of large amounts of data with acceptable performance. In one embodiment, the present system utilizes the sampling data.
A star schema comprises a fact table surrounded by dimension tables, where each dimension is a single table. A snowflake schema is a variation of a star schema in which a dimension maps to multiple tables; the tables are normalized. The fact table contains rows that refer to the dimension tables. Typical queries join the fact table to some of the dimension tables and aggregate data.
For example, from a fact containing sales data, a query might obtain the total revenue by month and product. The time dimension might contain year, quarter, month, and day data. The product dimension could contain detailed product information. Some dimensions reside in the fact table. A census fact table that contains a row per person might have a column for gender that could be considered a dimension.
Relational database 16 comprises tables such as tables 18, 20, 22, 24 that are configured, for example purpose only, in a snowflake schema. The OLAP cube model objects 26 can be arranged in many ways, but are often built to represent a relational star schema or snowflake schema. A cube model based on a simple star schema is built around a central fact object 28 that describes aggregated relational data from a fact table. Measure objects 30, 32 describe data calculations from columns 34, 36 in a relational table and are contained in fact object 28. Columns of data from relational tables are represented by attribute metadata objects that can be put in a dimension object such as dimension object 38, 40.
Dimension objects 38, 40 are connected to the fact object 28 of the cube model 42 similar to the manner in which dimension tables are connected to the fact table in a star schema. Attributes 44, 46, 48 are created for the relevant dimension and fact table columns such as columns 50, 52, 54 in relational tables 20, 22, 24. Join objects 56, 58 join each key dimension to the central fact object 28 on the corresponding dimensional key attributes. Join object 60 joins two dimensions of a snowflake together.
Consequently, the sales table 205 has five dimensions: promotion 220, employee 225, product 230, time 235, and store 240. Each dimension has a key attribute such as PromotionID, EmployeeID, ProductID, TimeID, and StoreID. Product 230 maps to two additional tables, category 245 and brand 250. The sales table 205 includes the five dimensional key attributes; the dimensions are joined with the fact table sales 205 based on either the PromotionID, EmployeeID, ProductID, TimeID, or StoreID attributes. Sales 205 is configured in a snowflake schema.
Shipments 210 contains detailed information about shipments made from the warehouse to the retail stores. Shipments 210 also has five dimensions, three of which are shared with the sales table 205: product 230, time 235, store 240, warehouse 255, and truck 260. The unique dimensions, warehouse 255 and truck 260, allow shipments 210 to specify which truck shipped the products and the product's originating warehouse. Warehouse 255 has the key attribute WarehouseID while truck 260 has the key attribute TruckID. Shipments 210 includes dimensional key attributes for its dimensions; the dimensions are joined with the fact object table shipments 210 based on either the ProductID, TimeID, StoreID, WarehouseID, or TruckID attributes. Product 230 is also configured in a snowflake schema; all other dimensions in this example are configured in star schemas.
A method of operation 300 of system 10 is illustrated by the high-level process flow chart of
By evaluating these usage metrics, system 10 selects the best candidates for OLAP objects at block 310. System 10 uses, for example, twelve different OLAP objects to map the relational database structure to OLAP: fact tables, dimension tables, cube models, hierarchies, joins, attributes, attribute relationships, measures, cubes, cube facts, cube dimensions, and cube hierarchies. Various criteria are used by system 10 to rate the candidate OLAP objects.
System 10 then generates an XML file that defines the OLAP objects as metadata at block 315 in a format that conforms to the API definition of the database management system. In one embodiment, an OLAP metadata API 17 has a GUI sub-component that generates the XML metadata file. This layer could be used if the bridge is written in Java® to reduce the coding effort.
The object of system 10 is to generate information about facts, dimensions, hierarchies and other OLAP objects from a set of SQL statements that query a database. There might be information available such as referential constraints that illuminate the relationship between the tables in the database. However, many database designers avoid defining referential constraints because they increase the cost of updating tables. Consequently, database tables can be viewed as a set of disjointed tables, as illustrated in
Block 305 of method 300 (
Analyzing a subset of SQL statements can still produce good recommendations of OLAP objects. Furthermore, less time is required to analyze a smaller set of SQL statements. In one embodiment, customers can filter which SQL statements are analyzed by specifying various criteria, including but not limited to query attributes such as creator, owner, creation date, modification date, last used date, query schema usage frequency, table schemas, as well as selecting a random subset of queries, etc. This also allows customers to focus on those queries they feel best represent their business requirements.
Each syntactical clause isolated at block 520 begins with a keyword such as “select”, “from”, “where”, “group”, “having”, “order”, etc. For example, an SQL statement might be:
The clauses for this SQL statement begin with “select”, “from”, “where”, “group by”, and “order by”. The aggregation, sum(dollars_sold) in this SQL statement is a measure. A measure is one or more columns to which an aggregator such as sum, average, maximum, minimum, etc. is applied. The quality of the measure is ranked by the frequency that it is used. Measures referenced frequently are more likely to indicate a fact table. Measures referenced infrequently are more likely to be “measures” within dimensions.
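The clause isolation and measure detection described above can be sketched as follows. This is a minimal illustration that assumes a flat statement with no subqueries or quoted keywords; the statement text, the table names, and the helper names `split_clauses` and `find_measures` are hypothetical and are not taken from the present system.

```python
import re

# Clause keywords assumed for this sketch; a production parser would also
# handle subqueries, string literals, and explicit join syntax.
CLAUSE_RE = re.compile(
    r"\b(select|from|where|group\s+by|having|order\s+by)\b", re.IGNORECASE
)
# Aggregators such as sum, average, maximum, and minimum (count added here).
MEASURE_RE = re.compile(
    r"\b(sum|avg|max|min|count)\s*\(\s*([^()]+?)\s*\)", re.IGNORECASE
)


def split_clauses(sql):
    """Map each clause keyword to the text that follows it."""
    parts = CLAUSE_RE.split(sql)  # [prefix, kw1, body1, kw2, body2, ...]
    clauses = {}
    for keyword, body in zip(parts[1::2], parts[2::2]):
        clauses[" ".join(keyword.lower().split())] = body.strip()
    return clauses


def find_measures(select_body):
    """Return (aggregator, column expression) pairs found in a select list."""
    return [(fn.lower(), col.strip()) for fn, col in MEASURE_RE.findall(select_body)]


# Hypothetical statement used only for illustration.
sql = (
    "select store_name, sum(dollars_sold) from sales, store "
    "where sales.store_id = store.store_id "
    "group by store_name order by store_name"
)
clauses = split_clauses(sql)
measures = find_measures(clauses["select"])
```

Each detected pair would then contribute to the measure metric of the table that owns the aggregated column.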
In one embodiment, the simplest metrics that are gathered for the OLAP objects are on a per query basis. Each time an object such as a measure, attribute, table or join appears in a query, the corresponding metric is incremented. To improve the selection of metadata, weightings can be applied. There may be statistics available about query usage. This information could come from database traces or from some product that executes the queries. The statistics can be used to adjust the metrics to better reflect the relative importance of the metadata. For example, if a measure is identified that appears in a query that has a usage count of 1000, then the measure metric is incremented by 1000 instead of 1.
Queries can reference tables and views. There are different ways to process views. One way to think of views is as a saved query. But views are more than that. Views represent the user's view of the structure of the database. System 10 will maintain data structures that track not only what tables were referenced by a query but whether these were actual tables or views.
There are several alternative embodiments for handling views. In one alternative embodiment, the queries are processed as they were written. The tables and views are treated in a similar way. The result is that the generated OLAP metadata will refer to both tables and views. The advantage of this approach is that it maintains the user abstractions.
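The weighted increment can be illustrated with a short sketch. The function name `accumulate` and the measure names are assumptions made for illustration; the usage count of 1000 mirrors the example in the preceding paragraph.

```python
from collections import Counter


def accumulate(metrics, objects, usage_count=1):
    """Add usage_count to the metric of every object seen in one query."""
    for obj in objects:
        metrics[obj] += usage_count


measure_metrics = Counter()
# A query with no usage statistics counts once.
accumulate(measure_metrics, ["sum(dollars_sold)"])
# A query known to have been executed 1000 times counts 1000 times.
accumulate(measure_metrics, ["sum(dollars_sold)", "sum(units)"], usage_count=1000)
```

The same accumulation could be applied to attribute, table, and join metrics by keeping one counter per object kind.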
In another embodiment, the SQL queries are rewritten to refer only to tables (block 512,
In yet another embodiment, the SQL queries are rewritten to refer only to tables (block 512,
System 10 then analyzes the predicates to determine if they are being used as joins or to subset the data. Those predicates used to join tables should map to join objects. These joins can be categorized as:
facts to fact joins;
facts to dimension joins; or
dimension to dimension joins.
Joins are specified explicitly from the join syntax. Joins can also be specified implicitly from “where” clauses. When using “where” clauses to define the joins, system 10 performs additional analysis to determine which portion of the “where” clause is performing the join rather than performing row selection.
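One simple way to separate join predicates from row-selection predicates is sketched below. The rule used here (an equality between two qualified columns of different tables is a join, and everything else is a filter) is an assumption chosen for illustration, and the predicate text is hypothetical. Splitting on "and" would mishandle constructs such as "between ... and ...", which a real parser would need to treat separately.

```python
import re

IDENT = r"[A-Za-z_]\w*"
# table.column = table.column
JOIN_RE = re.compile(rf"^\s*({IDENT})\.({IDENT})\s*=\s*({IDENT})\.({IDENT})\s*$")


def classify_predicates(where_body):
    """Split a conjunctive where clause into join and filter predicates."""
    joins, filters = [], []
    for pred in re.split(r"\s+and\s+", where_body, flags=re.IGNORECASE):
        m = JOIN_RE.match(pred)
        if m and m.group(1) != m.group(3):
            # Equality between columns of two different tables: a join.
            joins.append(((m.group(1), m.group(2)), (m.group(3), m.group(4))))
        else:
            # Comparison with a literal or within one table: row selection.
            filters.append(pred.strip())
    return joins, filters


joins, filters = classify_predicates(
    "sales.store_id = store.store_id and store.state = 'CA' "
    "and sales.time_id = time.time_id"
)
```

The join pairs would feed the join metrics, while the filter predicates remain available as hints about hierarchies.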
Predicates used to subset data provide hints about hierarchies. Statements written to aggregate data for a subset of data typically do not explicitly group the data. For example, a SQL statement that creates a report for the second quarter of 2002 will probably not include a “group by” request for year and quarter even though that is the hierarchy.
A hierarchy defines navigational and computational means of traversing a given dimension. To accomplish this, a hierarchy defines relationships among a set of one or more attributes. Any number of hierarchies can be defined for a dimension. The hierarchy object also references a set of attribute relationships that are used to link hierarchy attributes. The relationship among the attributes is described by the hierarchy type.
System 10 supports, for example, the following hierarchy types: balanced, ragged, and network hierarchy types.
A balanced hierarchy is a simple hierarchy fully populated with members at all level attributes. In a balanced hierarchy, levels have meaning to the designer.
A ragged hierarchy has varying depth at which levels have semantic meaning. The ragged hierarchy can be used, for example, to represent a geographic hierarchy in which the meaning of each level such as city or country is used consistently, but the depth of the hierarchy varies.
A network hierarchy does not specify the order of levels; however, the levels do have semantic meaning. For example, product attributes such as color and package type can form a network hierarchy in which aggregation order is arbitrary and there is no inherent parent-child relationship.
A hierarchy also specifies the deployment mechanism for the hierarchy. System 10 supports a standard deployment. In the standard deployment, each instance of an attribute corresponds to a member of that level. The standard deployment can be used with all hierarchy types.
The “group” clause provides information about hierarchies and is the primary mechanism for determining hierarchies. Examples of SQL statements using the group clause are:
select . . . group by country, region, state.
select . . . group by country, region, state, productLevel, productName.
select . . . group by year(date), month(date).
select . . . group by year, month.
The group clause provides information about hierarchies. The order of the grouped values determines the order in the hierarchy. There may be more than one hierarchy specified by the grouping. For example, the first SQL statement above has a three-level hierarchy: country, region, and state. The second SQL statement has the same three-level hierarchy plus a different two-level hierarchy: productLevel and productName.
System 10 determines the hierarchies through an analysis of the group clause; the group clause will specify one or more hierarchies implicitly. Further analysis on the origination of each column must be performed to determine which adjacent group values represent actual hierarchies. The hierarchy comprises adjacent columns. For the example above, [country, region, state] is a possible hierarchy but not [country, state]. An expression such as year(date) or month(date) in which a column has multiple functions applied to it implies that this is a hierarchy. If adjacent attributes in a group are from the same dimension and the dimension is for a different table than the facts, then these attributes comprise a hierarchy. Network hierarchies can be specified by determining if the same set of columns from the same table appear contiguously but in different orders.
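The adjacency rule can be sketched for the case in which all level attributes of a dimension reside in one table. The function name and the mapping of columns to tables are illustrative assumptions; in practice the origin of each column would come from the database catalog. The snowflake case, where levels are spread over joined tables, is not covered by this sketch.

```python
from itertools import groupby


def candidate_hierarchies(group_columns, column_table, fact_tables):
    """Return runs of adjacent group-by columns that share a dimension table."""
    hierarchies = []
    # groupby only merges consecutive items, which matches the adjacency rule.
    for table, run in groupby(group_columns, key=lambda c: column_table[c]):
        columns = list(run)
        if table not in fact_tables and len(columns) >= 2:
            hierarchies.append((table, columns))
    return hierarchies


# Hypothetical column origins for the second group-by example above.
column_table = {
    "country": "location", "region": "location", "state": "location",
    "productLevel": "product", "productName": "product",
}
result = candidate_hierarchies(
    ["country", "region", "state", "productLevel", "productName"],
    column_table,
    fact_tables={"sales"},
)
```

Columns that reside in the fact table itself produce no candidate, which corresponds to the case in which the hierarchy cannot be determined from a single statement.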
In the following SQL statement there exist three possibilities for the location of country, region, and state:
select . . . group by country, region, state.
First, each column is located in a separate table and these three tables are joined to each other and to a fact table. This case is a snowflake schema and system 10 considers the columns a hierarchy. Second, each column is located in the same table and this table is joined to the fact table. This case is a star schema and system 10 considers the columns a hierarchy. Third, each column is located in the same table as the facts. This case is a “dimension in facts” situation. In this case, system 10 is unable to determine the hierarchy based on this statement alone.
If the SQL statement does not include a “group by” clause, system 10 can gain additional insight about the hierarchies from the “order by” clause. An example of an “order by” clause is
select . . . order by year, month.
Suppose the SQL statement specifies predicates to obtain the data for a specific quarter and year but does no grouping. Instead, the SQL statement requests an “order by” year and quarter. System 10 then assumes that the hierarchy is year and quarter.
System 10 classifies each table as a fact or a dimension. In some cases, a table may be both a fact and a dimension. If the table has a large number of measures or frequently referenced measures associated with it then system 10 considers it a fact. A fact object is one or more fact tables that can be directly joined (i.e., adjacent nodes) and have a strong join affinity. If the table has level attributes from hierarchies or has few measures, then it is a dimension. Dimension objects are a single dimension table or multiple dimension tables that have a strong join affinity. If an aggregation is used such as “sum(sales)”, system 10 maps the clause to a measure. Otherwise, system 10 maps the clause to an attribute.
Analyzing an SQL statement results in information such as tables referenced, which tables were joined, which measures (aggregations) were used, which groupings were performed and how the data was ordered. All of these data provide metrics accumulated for later analysis. After analyzing the SQL statement, system 10 updates the measure metrics at block 525.
At block 530, system 10 updates the attribute metrics. The attributes are values that are not aggregations. An attribute can involve multiple columns in addition to other attributes.
The level attribute is used to define hierarchies. Common level attributes are, for example, Year, Quarter, State, and City. Description attributes are used to associate additional descriptive information to a given attribute. A common scenario for Product, for example, is to have a main attribute with a product code and a description attribute with a textual description. The dimensional attribute represents different characteristics and qualities of a given attribute. Common dimensional attributes are Population, Size, etc. Dimensional key attributes represent the primary key in a dimension table or a foreign key used in a fact table to represent a dimension table. In addition, a key attribute is used for keys other than dimensional keys. Key attributes may be used in a snowflake dimension.
System 10 updates the table metrics at block 535, the join metrics at block 540, and the hierarchy metrics at block 545. System 10 then looks for more SQL statements at decision block 550. If additional SQL statements are available for analysis, system 10 returns to block 515 and repeats blocks 510 through 550. Otherwise, system 10 proceeds to block 310 of
The most important metrics accumulated by system 10 are the join metrics and measure metrics.
Numerous SQL expressions have common parts. For example:
Sum(A) and sum(B) are common to all of the foregoing clauses. An alternative embodiment would be to break the expression in the SQL select clause into component pieces, in order to produce a more concise set of measures.
If all the unique expressions are mapped to measures, there may be an overwhelming number of measures that differ in insignificant ways. These would clutter the metadata and potentially impact optimization based on the metadata. If only the common subexpressions are selectively mapped to the metadata, then important metadata is lost such as complex formulas. One implementation is to map very commonly used expressions directly to measures, but to also look for subexpressions that may be referenced frequently as well.
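The combined approach can be sketched as follows. The expressions, the threshold, and the function names are hypothetical and serve only to illustrate counting whole expressions alongside their aggregate subexpressions; the sketch handles only simple aggregates over a single column.

```python
import re
from collections import Counter

AGG_RE = re.compile(r"\b(?:sum|avg|max|min|count)\s*\(\s*\w+\s*\)", re.IGNORECASE)


def normalize(expr):
    """Remove whitespace and case differences so equal expressions compare equal."""
    return re.sub(r"\s+", "", expr).lower()


def measure_candidates(expressions, min_count):
    """Keep whole expressions and subexpressions used at least min_count times."""
    whole, parts = Counter(), Counter()
    for expr in expressions:
        norm = normalize(expr)
        whole[norm] += 1
        for sub in set(AGG_RE.findall(norm)):
            if sub != norm:
                parts[sub] += 1
    keep = {e for e, n in whole.items() if n >= min_count}
    keep |= {e for e, n in parts.items() if n >= min_count}
    return keep


# Hypothetical select-list expressions gathered from several statements.
candidates = measure_candidates(
    ["sum(A) + sum(B)", "sum(A) - sum(B)", "sum(A) / sum(B)", "sum(A) + sum(B)"],
    min_count=2,
)
```

In this illustration the frequently used formula survives as a measure together with its two common parts, while the formulas seen only once are dropped.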
Block 310 of method 300 of
The fact rating is obtained by adding the measure ratings of all measures contained in the table. One embodiment of the present invention allows a client to control the threshold. For example, the client can set an absolute threshold for marking a table as a fact, or a relative threshold such as selecting the 5% of tables with the highest ratings. Other metrics can alternatively be used to select a fact. For example, a high row count is generally a property of fact tables rather than dimension tables. The core metrics should be viewed as a starting point but not a comprehensive list of heuristics. It should be understood by a person of ordinary skill in the field that heuristics comprise, but are not limited to, a set of rules used for evaluation purposes.
Fact tables contain many measures, corresponding to the large amount of factual data within the fact table. Dimension tables can also contain factual data. However, the measure metric resulting from that factual data is much lower than that of the fact tables. If the measure metric of the table exceeds the fact table threshold, system 10 designates the table as a fact table at block 715. A fact object may reference multiple tables.
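The rating and the two kinds of threshold can be sketched as follows. The function names, table names, and rating values are assumptions made for illustration and do not represent the specific heuristics of the present system.

```python
import math


def fact_ratings(tables, measure_ratings, measure_table):
    """Sum the ratings of the measures that belong to each table."""
    ratings = {table: 0 for table in tables}
    for measure, rating in measure_ratings.items():
        ratings[measure_table[measure]] += rating
    return ratings


def select_facts(ratings, absolute=None, top_fraction=None):
    """Pick fact candidates by an absolute or a relative threshold."""
    if absolute is not None:
        return {t for t, r in ratings.items() if r >= absolute}
    if top_fraction is None:
        raise ValueError("specify absolute or top_fraction")
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    count = max(1, math.ceil(len(ranked) * top_fraction))
    return set(ranked[:count])


# Hypothetical measure ratings accumulated from SQL statement analysis.
ratings = fact_ratings(
    ["sales", "store", "time"],
    {"sum(dollars_sold)": 1001, "sum(units)": 1000, "max(population)": 3},
    {"sum(dollars_sold)": "sales", "sum(units)": "sales", "max(population)": "store"},
)
by_absolute = select_facts(ratings, absolute=100)
by_relative = select_facts(ratings, top_fraction=0.05)
```

With only three tables the 5% fraction rounds up to a single table, so both thresholds select the same candidate in this illustration.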
The following are exemplary rules used by system 10 when specifying fact tables:
If the measure metrics of the table do not meet the threshold for a fact table, system 10 designates the table as a dimension table at block 720. A dimension object may reference multiple tables.
The following are exemplary rules used by system 10 in designating a table as a dimension table:
A single table can be part of both fact and dimension objects. Also, one table may contain multiple dimensions. If it is unclear whether adjacent tables are fact tables or dimension tables, system 10 uses row count and other criteria such as column datatypes to define the table type at decision block 710. In general, fact tables have much larger row counts than dimension tables. The dimension tables are tables adjacent to fact tables with high join counts. Not all tables joined to the fact table are considered dimension tables; some tables joined infrequently will be ignored by system 10. At decision block 725, system 10 determines if all tables have been designated as either fact or dimension. If additional tables must be checked, system 10 proceeds to block 730, selects the next table, and repeats blocks 710 through 725.
Once the dimension tables have been identified, system 10 defines the cube model for each fact table at block 735. A cube model is a fact table plus any adjacent joinable dimension tables. If necessary, adjacent fact tables can be placed into the same “facts” object to make the dimension table and fact table configuration fit within the cube model constraints. A “facts” object can contain multiple “fact tables”, just as a “dimension” can contain multiple “dimension tables”.
The following are exemplary object rules followed by system 10 in creating cube models:
At this stage, the key metadata and core objects have been selected.
During the selection process, system 10 retains the key objects and eliminates peripheral objects at block 737. All objects identified by SQL statement analysis could be mapped to OLAP objects. However, this would likely result in a metadata catalog largely filled with information that is not very important. By filtering on usage, system 10 maps only the key objects to OLAP objects. In one embodiment, customers are allowed to control this filtering process by setting usage thresholds. As indicated earlier, this filtering could be implemented in several ways, including filtering on an absolute rating or on a relative rating, such as selecting the top-rated 20% of metadata. Now that the key objects have been selected, the remaining objects can be defined based on this core set.
System 10 then proceeds to select joins at block 740. The following are exemplary rules used by system 10 to select joins:
System 10 selects measures at block 745. The following are exemplary rules used by system 10 to select measures:
At block 750, system 10 selects attributes. The attributes are values that are not aggregations. The following are exemplary rules used by system 10 to select attributes:
System 10 also selects attribute relationships at block 750. The following are exemplary rules used by system 10 to select attribute relationships:
The group clause will contain both level attributes and attribute relationships. While system 10 may not be able to distinguish between the two based on the analysis of a single statement, the attribute relationships become obvious when many SQL statements are analyzed. System 10 then selects hierarchies at block 755. The following are exemplary rules used by system 10 in selecting hierarchies:
At block 760, system 10 selects the cube. Cubes are commonly referenced and strongly correlated subsets of cube models. A key feature of the cubes is that each dimension has a single hierarchy, unlike a cube model that supports multiple hierarchies. If the cube model has no dimensions with multiple hierarchies, system 10 would simply include every dimension with its complete hierarchy in the cube.
In more complex cases, system 10 analyzes the dimensions referenced by the SQL statements. For example, a cube model may have 20 dimensions; of 1000 SQL statements, 400 referenced some combination of dimensions D1, D2, D3, D4, D5, D6, and D7. None of these 400 statements referenced any dimension other than these seven. System 10 would then define a cube containing these seven dimensions based on the affinity within the SQL statements for these dimensions.
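One simplified way to surface such affinity is sketched below, for illustration only. It groups dimensions that are referenced together in the same statements (a union-find grouping, chosen here as an assumption rather than taken from the described method) and proposes a cube for each group that accounts for a sufficient share of the statements. The name `propose_cubes`, the `min_share` parameter, and the sample workload are hypothetical.

```python
from collections import Counter


def propose_cubes(statement_dimensions, min_share=0.25):
    """Group co-referenced dimensions and return (dimensions, statement count).

    statement_dimensions is a list of sets, one set of dimension names per
    analyzed SQL statement. min_share is a hypothetical usage threshold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Link every dimension in a statement to the first one in that statement.
    for dims in statement_dimensions:
        ordered = sorted(dims)
        for d in ordered:
            find(d)
        for other in ordered[1:]:
            root_a, root_b = find(ordered[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for d in list(parent):
        groups.setdefault(find(d), set()).add(d)

    # Count the statements that fall inside each group of dimensions.
    counts = Counter()
    for dims in statement_dimensions:
        if dims:
            counts[find(next(iter(dims)))] += 1

    total = len(statement_dimensions)
    return [(sorted(groups[root]), n)
            for root, n in counts.most_common()
            if n / total >= min_share]


# Hypothetical workload: 400 of 1000 statements touch only D1 through D7.
workload = ([{"D1", "D2"}] * 200 + [{"D2", "D3", "D7"}] * 100
            + [{"D1", "D4", "D5", "D6"}] * 100
            + [{"D8", "D9"}] * 100 + [{"D10"}] * 500)
cubes = propose_cubes(workload)
```

This grouping merges two sets of dimensions as soon as a single statement references both, so a practical implementation would likely weight links by usage before merging.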
The following are exemplary rules used by system 10 when selecting the cube:
System 10 also selects cube facts using the following exemplary rules:
System 10 then selects cube dimensions using the following exemplary rules:
Finally, system 10 selects cube hierarchies using the following exemplary rules:
A dimension table is associated with a fact table if the join between the dimension table and the fact table is a strong join. Tables 610, 905, 910, 915 and 920 have strong joins to fact table 405. In addition, table 925 has a strong join to table 910 and table 930 has a strong join to table 925. Table 935 has a weak join to table 405, so is not considered a dimension table for table 405. System 10 then creates a cube model 940 for the table 405. System 10 creates cube models 945, 950, and 955 by following a similar process.
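The association described above can be sketched, as an illustration under stated assumptions, as a traversal that starts at a fact table and follows only strong joins. The text does not define a numeric test for a strong join, so the `strong_threshold` parameter, the join counts, and the function name `cube_model_tables` are hypothetical. The table names reuse the reference numerals above purely as labels.

```python
from collections import deque


def cube_model_tables(fact_table, join_counts, strong_threshold, other_facts=()):
    """Return the fact table plus every table reachable through strong joins.

    join_counts maps a (table, table) pair to how often that join was seen.
    Tables listed in other_facts are not pulled into this cube model.
    """
    # Keep only joins whose usage count meets the (assumed) threshold.
    strong = {}
    for (left, right), count in join_counts.items():
        if count >= strong_threshold:
            strong.setdefault(left, set()).add(right)
            strong.setdefault(right, set()).add(left)

    # Breadth-first walk so snowflaked tables (joined to a dimension table
    # rather than directly to the fact table) are also collected.
    members = {fact_table}
    queue = deque([fact_table])
    while queue:
        table = queue.popleft()
        for neighbor in strong.get(table, ()):
            if neighbor not in members and neighbor not in other_facts:
                members.add(neighbor)
                queue.append(neighbor)
    return members


# Hypothetical join counts for the tables discussed above.
joins = {("405", "610"): 50, ("405", "905"): 40, ("405", "910"): 45,
         ("405", "915"): 30, ("405", "920"): 35, ("910", "925"): 25,
         ("925", "930"): 20, ("405", "935"): 2}
model = cube_model_tables("405", joins, strong_threshold=10)
```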
The key objects are those within the cube models 940, 945, 950 and 955. All other objects are dropped by system 10, as seen in
The interpretation of any one SQL statement by system 10 may be incorrect. An SQL statement against a dimension may look the same as a statement against a fact. However, when system 10 aggregates the analysis of many SQL statements, the facts, dimensions, and other OLAP objects become more obvious. System 10 also uses sampling to analyze the correlation of columns, cardinalities to analyze columns, and query history (dynamic statistics) in addition to a static view of existing queries to further refine the OLAP structure for the relational database.
The primary input to the mapping process is SQL queries, but other information is also available to assist in the analysis. For example, there may be information that indicates how frequently queries are run. This allows the focus to be on queries that are actually run rather than those that merely exist. The database also has statistics about tables and columns that can help in making decisions. For example, table row counts can be used to help spot fact tables, and column cardinalities can help choose attribute relationships. Sampling (i.e., the process of reading a subset of table data) can be implemented as a means of quickly spotting relationships between columns, which in turn can help spot attribute relationships.
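As one hedged illustration of how sampled rows and column cardinalities might be used, the sketch below tests whether each value of one column maps to a single value of another column within the sample. Treating such a mapping as a hint of an attribute relationship is an assumption made here. The helper names `cardinality` and `determines`, and the sample rows, are hypothetical.

```python
def cardinality(rows, column):
    """Number of distinct values of a column within the sampled rows."""
    return len({row[column] for row in rows})


def determines(rows, source, target):
    """True if every source value maps to exactly one target value.

    A True result on a sample is only suggestive; unsampled rows may
    still contradict it.
    """
    seen = {}
    for row in rows:
        # setdefault stores the first target seen for this source value.
        if seen.setdefault(row[source], row[target]) != row[target]:
            return False
    return True


# Hypothetical sample read from a market table.
sample = [
    {"city": "San Jose", "state": "CA"},
    {"city": "Fresno", "state": "CA"},
    {"city": "Austin", "state": "TX"},
]
city_implies_state = determines(sample, "city", "state")
state_implies_city = determines(sample, "state", "city")
```

In this sample, city has the higher cardinality and determines state, which is consistent with state being a candidate parent of city.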
The following series of SQL statements illustrate the method used by system 10 to analyze and interpret SQL statements for a relational database with unknown structure. The first SQL statement analyzed is:
The select clause in the preceding SQL statement has a single expression that is an aggregation and a reference to a salesfact table. System 10 interprets this aggregation, sum(sales), as a measure. Since the salesfact table has a measure, system 10 increments the fact metric for that table by one. The results of the analysis of this SQL statement are illustrated in
System 10 analyzes a second SQL statement:
This SQL statement references two tables, the product table and the salesfact table. The select clause in this SQL statement has two values, the skuname and the aggregate sum(sales). System 10 considers the aggregation as a measure, while the non-aggregation, skuname, is not a measure. Skuname is an attribute that participates in a hierarchy that is specified by the “group by” clause. Since this SQL statement references two tables, it is important for system 10 to determine the source table for each select expression. Since skuname (the level attribute) comes from the product table, system 10 interprets the product table as a dimension. System 10 interprets salesfact as a fact table since sum(sales) appears to originate from the salesfact table. Consequently, system 10 increments the fact metric for the salesfact table by one. The where clause contains a fact-dimension join between the product table and the salesfact table; system 10 increments the join metric between the product table and the salesfact table by one. The results of analysis of the second SQL statement are shown in
System 10 analyzes a third SQL statement:
Although this SQL statement is more complex, system 10 still interprets it incrementally. As before, the sum(sales) aggregation is a measure; consequently, system 10 increments the fact metric for the aggregation's source table, the salesfact table, by one. System 10 identifies a hierarchy on the product name, skuname, from the “group by” clause. The product name, skuname, is a level attribute. Four joins appear in this SQL statement: salesfact is joined to product; salesfact is joined to market; market is joined to region; and salesfact is joined to time. System 10 increments the join metric for each of these joins by one.
The results of analysis of the third SQL statement are shown in
Although the “group by” clause references only one column, the “where” clause has predicates on three columns: state, year, and quarter. System 10 recognizes this as a clue that these three columns are levels within hierarchies. This SQL statement provides less information here than if the three columns were explicitly listed in the “group by” clause. For example, system 10 can determine no order for the three columns. However, an aggregation is in effect performed for these values by the SQL statement.
The SQL analysis phase will maintain a significant number of metrics and data structures. The present graph diagrams represent a small percentage of the information and metrics maintained, and represent exemplary core metrics. It should be understood that other metrics can also exist. As an example, hierarchies need to be tracked. This means that for each SQL statement there exist one or more possible hierarchies. All of the possible hierarchies need to be tracked, with usage counts maintained, so that the best candidates can be selected during the metadata selection phase. Join information needs to include which columns of the tables were joined. Eventually, when OLAP joins are defined, attributes need to be defined for these columns.
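A minimal sketch of such bookkeeping follows, assuming each SQL statement has already been parsed into its measures, joins, and group-by columns. The class name `Metrics`, its fields, the `record` method, and the join column name `productid` are hypothetical. Incrementing the fact metric once per table per statement is an assumption that follows the worked examples above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Metrics:
    fact: Counter = field(default_factory=Counter)         # table -> statements with a measure
    measures: Counter = field(default_factory=Counter)     # (table, expression) -> uses
    joins: Counter = field(default_factory=Counter)        # sorted (table, column) pair -> uses
    hierarchies: Counter = field(default_factory=Counter)  # group-by column tuple -> uses

    def record(self, measures, joins, group_by):
        """Accumulate the metrics contributed by one parsed SQL statement."""
        for table, expression in measures:
            self.measures[(table, expression)] += 1
        # Assumption: one fact increment per table per statement.
        for table in {table for table, _ in measures}:
            self.fact[table] += 1
        for left, right in joins:
            # Sort the two sides so A-B and B-A share one join metric,
            # while keeping the joined columns as part of the key.
            self.joins[tuple(sorted([left, right]))] += 1
        if group_by:
            self.hierarchies[tuple(group_by)] += 1


metrics = Metrics()
# First statement: a single aggregation on salesfact.
metrics.record([("salesfact", "sum(sales)")], [], [])
# Second statement: the same measure, a join to product, grouped by skuname.
metrics.record(
    [("salesfact", "sum(sales)")],
    [(("salesfact", "productid"), ("product", "productid"))],
    ["skuname"],
)
```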
In addition to keeping counts of how often each join is done, system 10 also tracks which tables were joined in tandem. As an example, suppose there are 3 queries:
The fourth SQL statement for this relational database might be:
The select clause in this SQL statement has three measures (all from salesfact) and three non-measures. The “group by” lists three columns which should be considered a possible hierarchy: region, state, and skuname. There are several possible hierarchies here ranging from one three-level hierarchy to three one-level hierarchies:
(region, state, skuname)
(region, state), (skuname)
(region), (state, skuname)
(region), (state), (skuname)
Since the market and region tables are joined in a snowflake configuration a reasonable interpretation would be that the hierarchies are (region, state) and (skuname).
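The candidate hierarchies listed above are the ways of cutting the ordered “group by” columns into contiguous runs. As an illustrative sketch, with the function name `candidate_hierarchies` being hypothetical, they can be enumerated as follows. Choosing among the candidates, for instance by using the snowflake join information, is left outside the sketch.

```python
def candidate_hierarchies(columns):
    """Return every split of the ordered columns into contiguous hierarchies."""
    columns = list(columns)
    if not columns:
        return []
    results = []
    gaps = len(columns) - 1
    # Each bit of mask decides whether to cut at one gap between columns.
    for mask in range(2 ** gaps):
        split, current = [], [columns[0]]
        for i in range(1, len(columns)):
            if mask & (1 << (i - 1)):
                split.append(tuple(current))
                current = [columns[i]]
            else:
                current.append(columns[i])
        split.append(tuple(current))
        results.append(split)
    return results


candidates = candidate_hierarchies(["region", "state", "skuname"])
```

For the three columns above, the sketch yields the same four candidates as the list, including the (region, state) and (skuname) interpretation.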
The results of analysis of the fourth SQL statement are shown in
A second relational database example is as follows. The first SQL statement for this exemplary database is:
This SQL statement references one table, the run table. The select clause has two aggregations (count and sum) and one non-aggregation (year(date)). System 10 considers the run table as a fact since both aggregations came from this table. The “group by” clause implies there is a one-level hierarchy of year. The year function is applied to date, and date originates from the run table. Consequently, system 10 also considers the run table as a dimension, so the run table has both fact and dimensional information. Tables that have both fact and dimensional information are often used in relational databases. For example, a fact table for a census would probably contain a number of columns that contain dimensional data. To classify the gender of people listed in the table, there might be a single char(1) column with M or F. This approach would be easier for the database designer than creating a completely new table just for gender. Even though a single table serves as both a fact and a dimension, OLAP object metadata requires a fact-dimension join to be defined. System 10 defines this join using the primary key of the table.
A second SQL statement for the second relational database example might be:
Within this SQL statement, system 10 identifies two non-measures (year and month) and 14 measures (5 aggregations on distance and 3 aggregations on each of duration, pace, and speed). The “group by” clause implies either a two-level hierarchy (year, month) or two one-level hierarchies (year) and (month). Since these expressions are based on the same underlying column, it is reasonable to assume the hierarchy is (year, month).
It is to be understood that the specific embodiments of the invention that have been described are merely illustrative of certain applications of the principle of the present invention. Numerous modifications may be made to the method for automatically building an OLAP model from a relational database described herein without departing from the spirit and scope of the present invention. Moreover, while the present invention is described for illustration purposes only in relation to relational databases, it should be clear that the invention is applicable as well to any database or collection of data lacking a metadata description.
Number | Date | Country | |
---|---|---|---|
20040122646 A1 | Jun 2004 | US |