1. Field of the Invention
The present invention relates generally to systems and methods for analyzing data sources to generate metadata (i.e., data about the underlying data). More particularly, the present invention relates to systems and methods for generating metadata from disparate data sources by inferring data relationships and rules based on the data in the disparate data sources and/or query and procedure code.
2. Description of the Related Art
Databases are well known and are used to store and maintain information for almost every conceivable application. The use of databases has become increasingly prevalent over the last few decades; and, today, the vast majority of business records are maintained in databases.
For large businesses, database information can include thousands of different kinds of information and often will range from millions to billions of records. Such businesses seek to leverage their data to make quick, informed business decisions. Business Intelligence (BI) tools help companies make better business decisions by allowing access to relevant data, providing ways to perform in-depth analysis of the data, and presenting the results of the analysis in a manner that facilitates understanding and dissemination of the information contained in the data. The primary users of BI tools are business analysts, managers and decision makers.
A Database Management System (DBMS) is a computer software application that creates and manages databases according to the specific requirements of an individual database model. Exemplary well-known database models include the hierarchical, network, relational, object-oriented and object-relational database models. The relational model is currently the predominant database model and the corresponding database management system is known as a Relational Database Management System (RDBMS).
One of the most common operations performed on relational databases is called a “query,” which is a program construct that accesses the database to obtain a subset of the data of interest to the user. In relational databases, the data is often split across multiple tables. The declarative computer language used to query, access and manipulate data stored in relational databases is known as the Structured Query Language (SQL). However, most business analysts, managers and corporate decision makers are not knowledgeable of the SQL language and are also not familiar with the data model. “Data model” refers to the logical organization of application data in the database. As a result, the primary users of such BI tools are often unable to access the data directly. In addition, information technology staffs are often unable to successfully use SQL to retrieve the desired data because the user must have an intimate knowledge of the data model used to create the database in order to include the correct commands in a query to retrieve the required data from the tables.
Such problems are also commonly encountered with the use of commercial off-the-shelf products (COTS), as well as BI software applications that have been custom developed for an organization. In the instance of custom developed software, once the original designers have departed from the organization, it is frequently the case that no one remains with intimate knowledge of the data model used to create the custom software and often either associated documentation is missing or there is minimal informative documentation available for the customized software.
BI tool manufacturers face the challenge of enabling business users to query the data by hiding the complexities of SQL and allowing users to access information without an intimate knowledge of the underlying data model. Many BI tools do not directly query the relational database. Instead, the tools extract, transform and load (ETL) the relational data or a subset thereof into a multidimensional data warehouse or data mart. However, the ETL operation also requires an intimate knowledge of the data model.
BI tools that query the relational data directly usually rely upon a metadata layer, that is, data about the underlying data. The metadata layer contains information about the data model. The metadata layer approach is popular because it helps empower the end user to access data without having to deal with the technical intricacies. However, the construction of the metadata layer requires in depth knowledge of the data model.
RDBMS store some information about the database's data model in catalog tables. The catalog tables contain information about the tables, columns, rows and any expressly defined constraints and relationships. The catalog tables may be queried to reverse engineer some of the data model. This approach works well when the catalog tables contain complete information about the data model.
Catalog tables only contain information that is explicitly defined, such as tables, rows and columns. However, constraints need not be explicitly defined as part of the data model. There are many reasons why even constraints supported by the RDBMS may not be explicitly defined as part of the data model, including the following:
As applications get larger and more complex and interface with a large number of other applications, the explicit data model defined in catalog tables becomes less sufficient to populate the metadata layer for a BI system. This is a commonly encountered problem during BI implementations. A large amount of time and effort is spent in obtaining a global view of the data model that is not necessarily available through the relational catalogs. The global view of the data model may be referred to as the enterprise data model. The enterprise data model can span across multiple data sources in an organization, including for example, multiple database instances with different underlying database models and/or vendor technologies, flat files, and other data stores. Any system used by an organization to manage one or more of their data sources is referred to herein as an “enterprise management system.”
Current BI tools usually only query the relational catalogs to populate the metadata layer with the explicit data model. Some may also use explicit name matching between columns to infer basic implicit referential integrity constraints. However, such techniques are not sufficient and the capture of the enterprise data model is a largely manual process and involves the BI implementers obtaining information from various sources within the organization. A lack of knowledge about the implicit data model within an organization often hinders this effort. Consequently, populating the metadata layer of BI tools becomes a time consuming, expensive and onerous process.
Thus, there exists a need for new and improved systems and methods for generating metadata relating to one or more data sources, particularly when such data sources are part of an organization's enterprise management system.
It is therefore an object of the present invention to overcome disadvantages of the prior art by providing systems and methods that infer implicit elements of a data model within a data source, as well as infer elements of the data model that span across multiple data sources.
In accordance with a first aspect of the present invention, a method is provided for generating metadata relating to at least one data source having one or more tables and columns. The method includes a step of inferring constraints existing on at least one of the tables and columns in the data source based on data in the tables and columns. The method also includes a step of generating metadata that includes information on the inferred constraints.
In one embodiment, the method includes steps of:
A. identifying a set of data sources to be analyzed and connection information corresponding to each identified data source, and storing the identification of each data source and the corresponding connection information in a metadata repository;
B. for each data source identified in step A, determining one or more tables of interest from the each data source along with the column definitions for each table and storing the identified tables of interest along with the column definitions in the metadata repository;
C. for each table and column obtained in step B, determining a list of explicitly defined constraints for each table, and storing the list of explicitly defined constraints in the metadata repository;
D. converting column names obtained in step B to a user-friendly form by applying a function to each column name to generate aliases, and storing the aliases in the metadata repository;
E. determining indices on each of the tables obtained in step B;
F. identifying view definitions including corresponding query statements for each of the data sources;
G. determining procedural code including corresponding query statements for each of the data sources;
H. obtaining a list of query statements that have been executed against each of the data sources;
I. normalizing each query statement identified in steps F through H to extract table and column information, storing the table and column information in the metadata repository;
J. identifying potential keys for each table identified in step C based on the table and column information of at least one of steps E through I and storing the potential keys in the metadata repository; and
K. identifying sets of columns that are not known to be potential keys that have similar names to potential keys identified in step J and storing the sets of columns as additional potential keys in the metadata repository.
According to another embodiment, the method includes steps of receiving table and column names; identifying one or more alternate names for each received table and column name; identifying potential relationships between the tables based on a similarity of alternate names; and outputting information defining the potential relationships.
In yet another embodiment, the method includes a step of identifying tables from the one or more disparate data sources to be analyzed. Next, aliases are identified for each of the columns in each of the tables. Next, a potential table relationship is identified when a column in a first table has a name or alias that is substantially the same as an alias of a column in a second table. Potential table relationships are stored in a metadata repository. Further, a step can be executed for identifying a number of rows in each table in the identified table relationship. A step of receiving a number of expected rows in each table associated with each identified table relationship can be performed, and a probability can be calculated for each table associated with the identified table relationship based on the number of rows and the number of expected rows.
In accordance with a second aspect of the present invention, a system for generating metadata for one or more data sources includes one or more computer program units configured to access one or more data sources and to generate metadata by inferring constraints based on data in tables and columns of at least one data source.
In one embodiment, the system for generating metadata relating to one or more disparate data sources includes means for identifying constraints on tables and columns in the one or more disparate data sources based on at least one of data in the tables and columns, and at least one query statement for accessing data in the tables and columns; and means for storing data relating to the inferred constraints in an accessible format. According to an embodiment of the present invention, a system for generating a metadata repository includes means for accessing one or more disparate data sources and means for analyzing the disparate data sources accessed to identify at least data objects, data constraints and data relationships for data in the disparate data sources. The system further includes means for generating a metadata repository comprising data about the data objects, data constraints and data relationships.
In accordance with a third aspect of the present invention, a computer-readable medium is provided which stores computer-executable instructions for generating metadata relating to at least one database having one or more tables, each table having one or more columns, by performing operations including: inferring constraints existing on at least one of the tables and columns in the data source based on data in the tables and columns; and generating metadata comprising information on the inferred constraints.
Further applications and advantages of various aspects and embodiments of the present invention are discussed below with reference to the drawing figures.
While the present invention may be embodied in many different forms, a number of illustrative embodiments are described herein with the understanding that the present disclosure is to be considered as providing examples of the principles of the invention and such examples are not intended to limit the invention to the embodiments described and/or illustrated herein.
Referring to
According to embodiments of the invention, some or all of the following steps can be automated programmatically and some or all of the results of each step can be stored in a metadata repository. One skilled in the art will recognize that the execution of some steps results in the generation of metadata while the execution of other steps results in data used in later steps. Therefore, data generated or obtained in each step is preferably stored in a metadata repository, such as an RDBMS, for use throughout the method. The metadata repository is not limited to databases, however, and can comprise flat files, XML documents or other data storage tools. Also, as used herein, the term “data source” is meant to include databases, such as relational and object oriented databases, data objects, tables, files, documents, flat files, data stores, spread sheets and any other electronic storage structures.
Referring to
As mentioned above, the first step (S10-1) includes identifying data sources to be analyzed. Connection information necessary to uniquely identify and connect to the data source is obtained or captured (S10-2) and used throughout the process for performing each step in the process. The identification of each data source along with the connection information may be obtained through user input, input files or any other means that may be used to provide input to a process.
The connection information may include a username and password, a URL or other network address, a Java Database Connectivity (JDBC) or other connection string, for example, or any other information necessary for accessing each data source for analysis. The information obtained is stored in the metadata repository (S10-3).
In step 20, table and column information is identified for each data source. Such information can be extracted from each data source, for example, from catalog tables and headers. Exemplary sub-flows for this step are shown in
First, a set of tables of interest can be identified (S20-1). Preferably, it is presumed that all tables in each data source are to be analyzed; however, with extremely large databases (e.g., a database having about 10,000 or more tables), a user may be interested in only a subset of the database. Thus, the identification of each table of interest may be made manually (for example, via a client interface or flat input file), or filters or the like may be utilized to limit which tables are analyzed by the present method. For example, all tables having a table name that includes the word “TEST” may be ignored. In another example, all tables having “ACCT” or “ACCOUNT” in the name are selected.
For each table to be analyzed, the column information is identified. For example, column name and definitions may be obtained (S20-2). This list may be obtained by directly querying the catalog tables in each database instance or by means of an interface that exposes such definitions.
Table name and column information may be obtained by querying the relational catalogs (i.e., catalog tables) or by alternate means such as an interface that exposes such definitions. The alternate means includes, but is not limited to, reading from a flat file or spreadsheets or XML or HTML files that contain this information. The information obtained in step 20 may be stored in the metadata repository (S20-3).
In step 30, for all tables and columns obtained in the previous step, a list of explicitly defined constraints is obtained. A constraint limits the set of values that constitute a valid relational state. As used herein, the term “explicitly defined constraints” includes constraints that are determined from the specific data model. Explicitly defined constraints are essentially validations of constraints on the data stored in the database. Thus, such constraints are facts that are always true about the data in a database (e.g., NOT NULL, check and foreign key constraints). Exemplary sub-flows for this step are shown in
Constraints may be obtained by directly querying catalog tables (for example, ALL_CONSTRAINTS, ALL_KEY_CONSTRAINTS) or by means of an interface that exposes the definitions (S30-1).
Explicitly defined constraints may include, but are not limited to, NOT NULL, column-level and row-level check constraints, unique key constraints, primary key constraints, and referential integrity constraints.
The NOT NULL constraint on a column disallows the NULL value from being stored in that column. Check constraints may be at a column level or at a row level. A column level check constraint specifies a condition that the value in the column must satisfy to be valid and disallows data to be committed to the database unless the constraint is met. For example, a check constraint might require that the value of a column be positive. A row level constraint specifies a condition that a row of data must satisfy to be valid. As an example, the value of a Start Date might be required to precede chronologically the value of an End Date in the same row.
A unique key constraint specifies that the values of a given set of one or more columns in a table must be unique within that table. The set of columns on which a unique key constraint is defined is referred to as a unique key.
A primary key constraint is a combination of the unique key constraint and NOT NULL constraints on the given set of one or more columns. Only a single primary key constraint may be defined on any given table. The columns on which a primary key constraint is defined are referred to as the primary key. Referring to
A referential integrity constraint limits the values in a given set of one or more columns (called the foreign key) in a table (called the child table) to the set of values stored in a corresponding unique key or primary key in the same or another table (called the parent table). For example, where Order is the child table and Customer is the parent table, Customer ID in the Order table can be the foreign key corresponding to Customer ID in the Customer table, which is the primary key of the Customer table.
The constraints obtained in (S30-1) may be expressed as a SQL statement. This SQL statement can be stored along with the constraint name and type in the metadata repository (S30-2). As described in further detail below, rather than storing the SQL statement, or in addition to storing the SQL statement, useful information can be extracted from the SQL statement and stored. This process is called “normalizing” the SQL and is explained in further detail below with references to steps 60 and 70.
In step 40, alternate user-friendly terms, also referred to herein as “aliases” or “pseudo-names,” are identified and/or generated for the column names of each of the columns identified in step 20. These aliases serve at least two purposes: first, aliases assist users with understanding the data; and second, aliases are used in the method and system of the present invention to match column names and identify “similarly” named columns during the process of inferring table relationships.
In practice, columns are typically named to conform to the syntax requirements for names imposed by an RDBMS; and, accordingly, acronyms, abbreviations or other names are often used that are not easily understandable by the user. Thus, there exists a need to provide alternate user-friendly versions of column names and table names in the form of pseudo-names or aliases that can be understood by those not having intricate knowledge of a data model.
There are issues associated with generating alternate names from the column names. For example, column names may include underscores, rather than spaces, to separate words. Alternatively, uppercase characters may be used as word separators. When uppercase letters are used to begin a new word, the RDBMS typically does not retain the exact case of each letter in the names in its catalog (i.e., all letters are stored in uppercase). Hence it might not be possible to determine word beginnings in an RDBMS. As mentioned above, names may also contain acronyms to shorten the names.
Exemplary sub-flows for step 40 are shown in
According to an embodiment of the present invention, a function F is used to convert a column name into one or more alternate user-friendly names, i.e. pseudo-names or aliases, according to the following steps:
Various ontologies for creating pseudo-names are well known in the art and readily available. One example is a rule that transforms all names ending in “id” by replacing that suffix with “identity.”
The name ontologies can be used to represent transformation rules specific to a domain. The system allows a user to associate a set of ontologies with each data source. Thus column names in that data source are transformed using the domain-specific name ontologies that the user selects.
The pseudo-names for each column name may be stored in the metadata repository (S40-6). A record may be explicitly created with the column name itself as the pseudo-name, or the record can simply be omitted but implied.
In step 50, all indices on all the tables obtained in step 20 are obtained. Exemplary sub-flows for this step are shown in
Indices may be obtained by directly querying the catalog tables in the appropriate database instances or by means of an interface that exposes the index definitions (S50-1). All the subsets of the set of columns in each index definition are marked as “potential” keys for the table on which the index is defined (S50-2). Thus, for example, if a composite index exists on columns a, b and c on table T, columns a, b and c are individually marked as potential keys. The sets {a, b}, {a, c}, {b, c} and {a, b, c} are also marked as potential keys. These potential keys are used to infer implied single-column or multi-column relationships between tables as described in subsequent steps. The information may be stored in the metadata repository (S50-3).
In step 60, all view definitions for all the database instances are obtained. Exemplary sub-flows for this step are shown in
As used herein, the term “views” means named queries. Views are defined by SQL queries in most RDBMS. View definitions can be extracted from the catalog tables in an RDBMS (S60-1). Wherever a table name may be used in a query, a view may be used instead. Thus, a query may be reused without having to repeat it in its entirety. The view query statements are analyzed to aid in the inference (S60-2). For the purpose of such analysis, the query is first parsed and normalized. The normalization process is carried out as follows:
If the query references any views, the views are expanded. This process is repeated until no views are referenced in the query.
SQL allows joins to be specified in two ways: either by specifying the join condition as part of the WHERE clause or by specifying the join predicate as part of the FROM clause. The following two queries are semantically the same:
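The two listings below are a non-limiting sketch of these forms, assuming hypothetical tables a and b joined on columns f1 and f2 (the names are illustrative only and are not drawn from any particular data model):

    -- First form: the join condition appears in the WHERE clause
    SELECT a.f1, b.f3
    FROM a, b
    WHERE a.f1 = b.f2;

    -- Second form: the join predicate appears in the FROM clause
    SELECT a.f1, b.f3
    FROM a INNER JOIN b ON a.f1 = b.f2;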
The second form is preferable since it clearly separates out the join predicates. Not all RDBMS support the second form. To aid further analysis, any query in the first form is converted to a query in the second form.
A query may contain sub-queries. The sub-queries may in turn contain other sub queries and so on. The sub queries may be correlated (i.e., refer to a value from the containing queries). For example, the following query contains the correlated sub-query (SELECT * FROM b WHERE b.f2=a.f1)
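One possible containing query consistent with the quoted fragment is sketched below; the outer query and its column names are assumed solely for illustration:

    SELECT a.f1
    FROM a
    WHERE EXISTS (SELECT * FROM b WHERE b.f2 = a.f1);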
The following query contains no correlated sub-queries but is semantically the same. It contains the non-correlated sub-query (SELECT b.f2 FROM b):
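A corresponding non-correlated form, again using the same assumed tables and columns, might read:

    SELECT a.f1
    FROM a
    WHERE a.f1 IN (SELECT b.f2 FROM b);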
Sub-queries provide useful information about potential join paths between tables. To aid further analysis, sub-queries are transformed into the second form of joins described in the previous step. Thus, both the above queries would be re-expressed as the query:
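Continuing the illustrative example, one possible normalized form that moves the sub-query into the FROM clause is the following sketch:

    -- Grouping the sub-query preserves the semantics of IN/EXISTS by avoiding duplicate matches
    SELECT a.f1
    FROM a INNER JOIN (SELECT b.f2 FROM b GROUP BY b.f2) q ON q.f2 = a.f1;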
SELECT DISTINCT <column list> queries are converted into SELECT <column list> GROUP BY <column list> queries.
Queries involving the UNION set operation are split into individual queries. This applies even to inner queries. Thus, if a query contains three inner queries that are each unions between two tables, a total of eight queries are obtained by treating each UNION query as an individual query.
Queries involving the EXCEPT set operation are converted into a LEFT OUTER JOIN that only returns rows for which the right-hand table does not have a corresponding row, as sketched below.
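The following is an illustrative rewrite under assumed tables a and b, ignoring duplicate handling and NULL semantics for simplicity:

    -- Query using the EXCEPT set operation
    SELECT a.f1 FROM a
    EXCEPT
    SELECT b.f2 FROM b;

    -- Normalized form: a LEFT OUTER JOIN keeping only rows with no match on the right
    SELECT a.f1
    FROM a LEFT OUTER JOIN b ON a.f1 = b.f2
    WHERE b.f2 IS NULL;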
After the above transformations, the normalized query would contain other queries only in the FROM clause. Such queries can in turn contain other queries in their FROM clause. The list of tables and queries in the FROM clause are referred to as correlations. Correlations in a normalized query may refer to tables or other queries (S60-3).
Any queries that are part of a FROM clause in a normalized query are added to the set of queries that are used for further analysis. The information may be stored in the metadata repository (S60-4).
The normalized query itself is also added to the set of queries used for further analysis.
The view definition itself is included within the enterprise data model much like the table definitions along with the view column definitions. This includes computed columns. The column names are again converted into user-friendly names as described in step 40 and may be stored in the metadata repository (S60-5).
In step 70, available procedural code is obtained for each database source. Exemplary sub-flows for this step are shown in
Exemplary procedural code includes code in the form of triggers, procedures, functions and packages, as well as code embedded or contained in any triggers, procedures functions and packages. Procedural code may be extracted from the data source (S70-1).
SELECT queries are normalized (S70-2) as follows:
The query is parsed. If the SELECT query contains a clause to fetch data into host variables, that clause is culled out from the SELECT query. The list of column expressions in the SELECT clause is matched against the list of host variables. If a column expression is not in the form of a pseudo-name or alias, then the name of the host variable that it is fetched into is used as an alias for that column expression.
Thus, a SELECT query that fetches data into host variables is rewritten so that each host-variable name becomes an alias for its corresponding column expression, as sketched below.
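A minimal sketch follows, assuming a hypothetical embedded-SQL query with host variables :emp_id and :total_pay; the table, column and variable names are illustrative only:

    -- Original query fetching into host variables
    SELECT t.f1, t.f2 + t.f3 INTO :emp_id, :total_pay FROM t;

    -- After culling the INTO clause, the host-variable name is used as an alias
    -- for the computed column expression
    SELECT t.f1, t.f2 + t.f3 AS total_pay FROM t;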
The SELECT query is then normalized as described in step 60.
UPDATE queries are analyzed by converting them into SELECT queries that reference the same tables, columns and predicates, for example as sketched below.
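The conversion below is a non-limiting sketch using assumed tables and columns:

    -- UPDATE query
    UPDATE a SET f3 = 0 WHERE f1 IN (SELECT b.f2 FROM b);

    -- Converted SELECT query preserving the table, column and predicate information
    SELECT a.f3, a.f1 FROM a WHERE a.f1 IN (SELECT b.f2 FROM b);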
DELETE queries are converted to SELECT queries in a manner analogous to the manner in which UPDATE queries are converted to SELECT queries.
Converted queries are normalized in a like fashion as SELECT queries. The normalized SQL may be stored in the metadata repository (S70-3).
In step 80, a list of all available queries that are run against each data source identified in step 10 is obtained. Exemplary sub-flows for this step are shown in
Query code may be in the form of SQL code or the like and may be obtained from query logs (i.e., records of executed queries maintained by the DBMS), from client-side SQL logs (i.e., logs generated by ODBC/JDBC connections), from text files that contain a list of all queries that are typically used by the MIS staff, from user input, or from a combination of any of these or similar sources (S80-1). All queries obtained through such means are analyzed and normalized in the same manner as described above with respect to step 70 (S80-2). The normalized SQL may be stored in the metadata repository (S80-3).
In step 90, name propagation is performed for all queries or sub-queries obtained during the normalization process. The purpose of the name propagation is to generate alternate names (i.e., pseudo-names or aliases) for columns in tables or views. In all cases, the names are generated as already described with respect to step 40. Exemplary sub-flows for this step are shown in
The name propagation for a single query proceeds as follows:
For simple column references, if the correlation is a query, the column alias is propagated inwards. An example wherein an alias is propagated inwards is as follows: Given a query of the form:
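The sketch below illustrates the idea with an assumed table t and an assumed alias net_amount; only the names are hypothetical, and the transformation follows the rule described above:

    -- Outer query aliasing a column selected from an inner query
    SELECT q.f4 AS net_amount
    FROM (SELECT f5 - f6 AS f4 FROM t) q;

    -- After propagating the alias inwards
    SELECT q.net_amount
    FROM (SELECT f5 - f6 AS net_amount FROM t) q;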
This transformed inner query will in turn result in the expression f5-f6 being defined as a computed column on table t. The information may be stored in the repository (S90-3).
In step 100, the results from the normalization process described in step 60 are analyzed to identify potential keys in the tables. In the method of the present invention, the normalization process is used to identify potential keys. Exemplary sub-flows for this step are shown in
In S100-1, all correlations in the FROM clause of the normalized query are inspected. If a correlation maps to a table, the set of all columns of that table that are specified as part of the predicates in the FROM or WHERE clauses or as part of the GROUP BY clause are obtained. All subsets of this set are marked as potential keys for that table. For all the correlations that are queries, the columns from the correlation referenced in the outer query can be transposed inwards into the inner query in the form of “dummy” predicates. Then the same process of marking potential keys is applied to the inner query as described in this step.
To illustrate the above, for example, if the normalized query to be analyzed is:
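For instance, the normalized query might take the following illustrative form, where the column names are assumptions chosen to match the discussion that follows:

    SELECT b.f1, c.f6
    FROM b INNER JOIN c ON b.f2 = c.f3 AND b.f4 = c.f5;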
This query in turn contains two correlations, b and c, both of which are tables. Since columns f2 and f4 from table b are used in predicates in the query, all subsets of {f2, f4} are marked as potential keys in table b. Similarly all subsets of {f3, f5} are marked as potential keys in table c. The identified potential keys may be stored in the metadata repository (S100-2).
In step 110, the set of potential keys is expanded by adding sets of columns that are not known to be potential keys but have similar names to known potential keys. Exemplary sub-flows for this step are shown in
For each potential key, the column name is compared to column names and aliases in the repository (S110-1). Each match is identified as a new potential key (S110-2) and may be stored in the repository (S110-3).
For example, suppose S1 is a known potential key containing columns c1 and c2, and S2 is a set containing columns c3 and c4, where S1 and S2 have no elements in common. If the name of c1 is similar to that of c3 and the name of c2 is similar to that of c4, then S2 is also added as a potential key.
In step 120, the cardinality of each table identified in step 20 is obtained. “Cardinality” as used herein with respect to a table means the number of rows in a table. Similarly, the phrase “cardinality of a column” means the number of non-null values in a given column. The implied domain set of a column is the set of all distinct non-null values for the column. The cardinality of the implied domain set of a column is the number of distinct non-null values for the column. Exemplary sub-flows for this step are shown in
A query statement can be generated to calculate the cardinality (S120-1) and range of values (S120-2) of columns of interest in these tables. The results of these queries form the basis for inferring the implicit NOT NULL, single column unique key or primary key constraints. For example, if table t has columns a, b and c, the following exemplary SQL statement may be generated:
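A representative sketch of such a generated statement is shown below; range expressions such as MIN and MAX, discussed further below, may also be appended for appropriate columns:

    SELECT COUNT(*),
           COUNT(a), COUNT(DISTINCT a),
           COUNT(b), COUNT(DISTINCT b),
           COUNT(c), COUNT(DISTINCT c)
    FROM t;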
If the result of the COUNT(&lt;column&gt;) expression is the same as the result of the COUNT(*) expression, the implication may be made that, in the current relational state, that column does not store any NULL values. If the result for a COUNT(&lt;column&gt;) is the same as the COUNT(DISTINCT &lt;column&gt;) expression for the same column, the implication can be made that, in the current relational state, all the non-NULL values for that column are distinct or unique.
To break down the query further, the COUNT(*) provides the cardinality of the table; COUNT(<column>) provides the cardinality of the particular column in the table t; and COUNT(DISTINCT <column>) provides the cardinality of the domain set implied by the current relational state for that column in the table t. If the cardinality of the column is the same as the cardinality of the table, then it is determined that the column does not contain any NULL values. Similarly, if the cardinality of the column is the same as the cardinality of the implied domain set for that column, then it is determined that the column contains only unique values.
The COUNT(&lt;column&gt;) and COUNT(DISTINCT &lt;column&gt;) expressions are indicative of the column-specific expressions that could be part of the generated query and may not be necessary if one were to take into account existing knowledge about the domain. For example, if the column is explicitly defined as NOT NULL, the COUNT(&lt;column&gt;) is not necessary as it will always be the same as the COUNT(*). Similarly, if the column is explicitly defined as a unique key in the DBMS, then the COUNT(DISTINCT &lt;column&gt;) is not necessary: if the column is also explicitly defined as NOT NULL, COUNT(DISTINCT &lt;column&gt;) is the same as COUNT(*) and COUNT(&lt;column&gt;); if the column does allow NULLs, COUNT(DISTINCT &lt;column&gt;) is the same as COUNT(&lt;column&gt;). Additionally, if the DBMS does not exclude NULLs from the uniqueness check, then COUNT(&lt;column&gt;)=COUNT(*) or COUNT(&lt;column&gt;)=COUNT(*)−1. Similarly, certain other expressions such as MIN(&lt;column&gt;) and MAX(&lt;column&gt;) may also be useful in addition to the COUNT(&lt;column&gt;) and COUNT(DISTINCT &lt;column&gt;) expressions. The MIN and MAX values may be used to determine the range (difference between the minimum and the maximum) for the implied domain set of a column.
The generation of the SQL can proceed as follows: The COUNT(*) is added to the select clause. A function f is then applied to each column in that table. This function returns all the expressions on the given column that should be part of the select clause. These returned expressions are duly added to the SQL. This function may omit the COUNT(<column>) if the column has a NOT NULL constraint. It may also omit the COUNT(DISTINCT <column>) if the column is defined as a unique key. A default implementation may add the expressions MIN(<column>), MAX(<column>), MIN(LENGTH(<column>)) and MAX(LENGTH(<column>)) for VARCHAR columns. It may add the MIN(<column>) and MAX(<column>) expressions for integer or date time columns. It may also omit all expressions in cases where it is believed that the column may not be part of a unique or primary key. Examples of such columns may be columns that store decimal amounts or floating point data or columns that store creation/modification timestamps, version stamps or users. The determination of whether columns store creation/modification timestamps, version stamps or users can be made based on predefined or configurable ontologies. If there is an indication that the column name obtained from the known ontologies represents such timestamps, version stamps or users, then the column is treated as such.
The cardinality and ranges may be stored in the metadata repository (S120-3).
In step 130, each query generated in step 120 is executed and the results may be stored. Each table is queried only once without any joins. The queries generated in step 120 could be formulated as multiple queries per table instead of one query per table. For example, one query could be generated for expressions of each column. However this would require multiple scans of each table and would thus increase the time complexity.
In step 140, for each column, constraints are inferred. Exemplary sub-flows for this step are shown in
The cardinality and ranges are selected from the repository (S140-1). The cardinalities obtained by executing the queries in step 130 are used to infer NOT NULL and unique key constraints (S140-2). The logic for this inference is described in step 120. The inferred constraints may be stored in the metadata repository (S140-3).
In step 150, implicit single column table relationships are inferred. Exemplary sub-flows for this step are shown in
For each column in the repository (S150-1), queries are executed to determine table relationships (S150-3), and the table relationships may be stored in the metadata repository (S150-4).
For example, given table p, table c, column u in table p and column f in table c, table c is a child of table p with foreign key f referencing column u if and only if the following template SQL returns 0:
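One way such a template might be expressed is sketched below; the exact formulation may vary by SQL dialect:

    SELECT COUNT(*)
    FROM c
    WHERE c.f IS NOT NULL
      AND NOT EXISTS (SELECT * FROM p WHERE p.u = c.f);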
The above represents a rigorous test for determining relationships. Often, however, data may not satisfy such a rigorous test as described above, but may still be related. For example, applications performing inserts/updates on a child table may be storing values such as 0 or an empty string instead of NULLs in the columns. Such values are often referred to as sentinel values in the art. Further, data may be initially inserted in the referencing columns such that it matches the data in the referenced columns. However, the data in the referenced column may be subsequently deleted or updated without updating of the referencing columns. This could be a systematic behavior that applications maintaining the data sources exhibit or it could be outlier cases where certain steps may have been erroneously skipped.
For the purpose of identifying join paths, even relationships that do not satisfy the rigorous criteria stated above are important. Thus, instead of obtaining an “either-or” result using the above criteria, a probability is computed for the relationship based on the distribution of values that match or do not match. Specifically, the following query is executed to test the relationship:
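A non-limiting sketch of such a query, producing the statistics referenced below under assumed alias names, is:

    -- Assumes u is a (potential) unique key in p, so the join does not multiply child rows
    SELECT COUNT(c.f)          AS total_child_rows,      -- child rows with a non-NULL foreign key value
           COUNT(p.u)          AS matched_child_rows,    -- child rows that found a matching parent value
           COUNT(DISTINCT c.f) AS distinct_child_values,
           COUNT(DISTINCT p.u) AS distinct_parent_values -- distinct child values that matched the parent
    FROM c LEFT OUTER JOIN p ON p.u = c.f;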
If the matched_child_rows returned by this query equals the total_child_rows, then the rigorous test would also be satisfied and the highest probability (1) is assigned to this inference. Otherwise, this inference is assigned a probability computed as MAX(matched_child_rows/total_child_rows, distinct_parent_values/distinct_child_values). If this probability is below a certain threshold, the inference is automatically rejected. MAX is merely indicative of a function that might be used to compute the probability based on the two ratios: the first ratio (r1) is the fraction of child rows whose values matched the parent, and the second ratio (r2) is the fraction of the domain set of the child that matched the domain set of the parent. The general idea of any function defined on these two ratios is to reward a high value in either ratio. An alternative to MAX is SQRT(SQR(r1)+SQR(r2))/SQRT(2).
If the distinct_parent_values is less than distinct_child_values by only 1 or some other very small number, presuming distinct_child_values is not itself a very small number, a test may be performed to see whether the unmatched values are all sentinel values. If so, then the probability assigned to the inference may be “boosted” by applying a certain multiple greater than 1. The rationale for the boost is that sentinel values are almost like NULLs and thus should not be considered when testing matches. An allowance for a single possible sentinel value could be made by adding the following column to the query:
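For example, under the assumption of a single configured sentinel value, a hypothetical column such as the following could surface the lone unmatched child value so it can be compared against the configured sentinel values (the alias is illustrative):

    MAX(CASE WHEN p.u IS NULL THEN c.f END) AS single_unmatched_value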
Note that the values of total_child_rows and distinct_child_values can be obtained from the metadata repository itself as a result of the processing in step 130. The above query is an example intended to demonstrate the values required for performing the inference and the assignment of a probability to it. The actual query may be optimized in various ways to reduce computing costs.
If the processing described above indicates that p is a parent of c, the relationship type is determined to be one-to-one if both u and f are unique keys. It is one-to-many if u is a unique key but f is not a unique key. The relationship type is determined to be many-to-many if neither f nor u is a unique key. Note that p being parent of c with foreign key f in c referencing column u in p does not preclude c from being a parent of p with foreign key u in p referencing column f in c. Neither are p and c precluded from being the same table. However f and u may not be the same column in the same table as such a relationship would always exist and is hence trivial.
Executing the above query for each possible pair of columns can yield all single column relationships implied by the current relational state; however, executing such a query for each and every pair of columns can often be infeasible. For example, executing the query for a model with only 1000 columns would involve executing nearly a million queries with two tables scanned for each query. In order to make the inference feasible, the following heuristics and restrictions may be applied to determine whether to test column f in table c as a potential foreign key that references column u in table p (S150-2). Preferably, every one of the following conditions must be satisfied for the test to be performed:
The relationship information may be stored in the metadata repository (S150-4).
In step 160, multi-column or composite unique keys are inferred. Exemplary sub-flows for this step are shown in
For each set of multiple columns in a table in the repository (S160-1), queries are executed to infer key constraints (S160-3), and the inferred constraints may be stored in a metadata repository (S160-4).
For example, given a table T, a set of columns C such that each element of C is a column in T, the columns in set C represent a unique key on table T if and only if the following template query returns no rows:
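A sketch of such a template, shown here for an assumed two-column set C = {c1, c2} on table T, is:

    SELECT c1, c2
    FROM T
    GROUP BY c1, c2
    HAVING COUNT(*) > 1;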
Executing the query for each possible set C of the columns in table T should yield all of the implied multi-column unique keys for table T. However, even for a small table of, for example, twenty columns, there are 2^20 (more than a million) such possible combinations. Thus, the simple approach of executing this query for every possible combination of columns is clearly infeasible. Instead, the following heuristics and restrictions are applied (S160-2) to keep the number of such possible sets C to test to a manageable level:
For each unique key test made for the set of columns C, if the test succeeds, the set of columns C is marked as an implied unique key for the pertinent table.
In step 170, multi-column table relationships are inferred. This analysis is similar to that of step 150 and uses the multi-column unique keys inferred in step 160. Exemplary sub-flows for this step are shown in
For each multi-column unique key in the repository (S170-1), queries are executed to determine multi-column table relationships (S170-3), and the table relationships are stored in the metadata repository (S170-4).
For example, given table p, table c, an ordered list of columns U={u1, u2, . . . , un} in table p and an ordered list of columns F={f1, f2, . . . , fn} in table c, where the number of columns in U is the same as the number of columns in F, table c is determined to be a child of table p with the columns in F referencing the columns in U if and only if the following template SQL returns 0:
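A sketch of such a template for an assumed two-column key (n = 2) is given below, together with a LEFT OUTER JOIN variant, analogous to the single-column case, that returns the statistics referenced in the next paragraph. Both listings are illustrative only and assume character-typed columns (a CAST would be applied otherwise):

    -- Rigorous test: returns 0 if every fully populated child tuple has a matching parent tuple
    SELECT COUNT(*)
    FROM c
    WHERE c.f1 IS NOT NULL AND c.f2 IS NOT NULL
      AND NOT EXISTS (SELECT * FROM p WHERE p.u1 = c.f1 AND p.u2 = c.f2);

    -- Statistics variant returning the same columns as the single-column relationship query
    SELECT COUNT(c.f1)                          AS total_child_rows,
           COUNT(p.u1)                          AS matched_child_rows,
           COUNT(DISTINCT c.f1 || '|' || c.f2)  AS distinct_child_values,
           COUNT(DISTINCT p.u1 || '|' || p.u2)  AS distinct_parent_values
    FROM c LEFT OUTER JOIN p ON p.u1 = c.f1 AND p.u2 = c.f2
    WHERE c.f1 IS NOT NULL AND c.f2 IS NOT NULL;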
The columns returned by the above query are the same as the ones returned for the analogous query for inferring single column relationships. This allows the probability for multi column table relationships to be computed in exactly the same manner as described for the single column table relationships. For testing whether all the columns in unmatched child rows are sentinel values, the following expression may be added to the select column list of the outermost SELECT in the above query:
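One hypothetical form of such an expression, assuming character-typed referencing columns and a single configured sentinel value (here an empty string), is:

    COUNT(DISTINCT CASE
            WHEN p.u1 IS NULL                      -- no matching parent tuple
             AND c.f1 IN ('') AND c.f2 IN ('')     -- every referencing column holds a sentinel value
            THEN c.f1 || '|' || c.f2
          END) AS unmatched_sentinel_tuples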
If unmatched_sentinel_tuples=distinct_child_values−distinct_parent_values, all unmatched rows have only sentinel values in all referencing columns and hence the probability for the inference may be boosted by a multiplier analogous to the manner in which this is done for single column relationships.
The problem of computational feasibility is even more of a consideration for multi-column relationships than for single column relationships since the number of all multi-column combinations is much larger than the set of all single columns. Thus, before querying the data, the following checks are made:
In step 180, a conceptual directed graph of the data model is formed using explicit and implied inferences. Exemplary sub-flows for this step are shown in
Table and column data is selected from the repository (S180-1). The table and column data is used to identify hierarchical relationships between the tables (S180-2). For example, parent-child relationships can be used to identify the hierarchical relationships. An entity relationship diagram (ERD) or the like can be constructed from the relationships (S180-3). For example, in
In step 190, final probabilities are computed for all implied inferences. It is noted that all inferences in the prior steps are based on the current relational state. If the relational state changes, some of the inferences may no longer be valid. Unlike explicitly defined elements of the data model, there is no absolute certainty that assertions inferred from the current relational state will apply to all future relational states. If the current relational state was reasonably mature, a high degree of probability could be assumed for such inferences. However, if the current relational state was empty, all tables would be empty and no meaningful inference would be possible.
To compute the final probability for implied inferences, a table is assigned a probability indicating to what extent the current data in the table is representative of the expected data in the table. For example, if a table is expected to contain at most a million rows and the table contains a million or more rows, the probability assigned to such a table may be relatively high. At the same time, if it contains only 1000 rows, then the probability that the data in the table is representative of a future relational state is relatively low.
Exemplary sub-flows for this step are shown in
First, for each inferred constraint in the metadata repository, the associated tables are identified (S190-1). As used herein the phrase “inferred constraint” means any constraint inferred from the method of the present invention of examining the actual data in the data source. To compute probabilities for each inference, either a probability or the number of expected rows in a set or subset of tables in step 20 is obtained (S190-2). This may be in the form of user input and/or some input file. If the information provided is the expected number of rows in the table, the probability for that table could be computed as MIN (<cardinality of the table obtained in step 130 (S190-3)>/<expected rows>, 1.0) (S190-4). To estimate a similar probability for tables that do not have a probability specified explicitly, the following steps are performed on the graph of the data model described in step 180:
After the above steps, each node that is part of an implied inference has a certain probability associated with it (S190-8). At this point, the probability of each implied inference is computed. For inferred NOT NULL and unique key constraints, the probability is computed as the probability associated with the single table involving the constraint. For inferred foreign key constraints, the probability is computed as the product of the probability associated with the inference itself and the probability associated with the child table.
In step 200, the implied inferences may be accepted or discarded based on certain criteria, which may be preset or may include user input. Exemplary sub-flows for this step are shown in
First, the calculated probability can be selected from the repository (S200-1). Next, the probability can be compared to a threshold (S200-2). As an example, thresholds h and l may be configured, where 1 >= h >= l >= 0. All inferences with probability >= h are automatically accepted (S200-3). All inferences with probability < l are automatically discarded (S200-5). All inferences with probability < h but >= l are presented to the user in ascending order of probability (S200-6). The user may then accept as valid or discard as invalid some or all of these inferences. By accepting, it is meant that the inferences are confirmed to be valid and accurate, and the metadata repository can be made to reflect when an inference is accepted as valid or rejected as invalid. For example, data records can be marked and/or rejected inferences can be deleted.
One skilled in the art will understand that the above steps need not be executed in exactly the same order. One skilled in the art will further understand that certain steps may be run repeatedly and other steps may be skipped altogether. Further, some steps may be replaced with equivalent or similar processes.
The enterprise data model and metadata information derived from the above steps may be persisted to some form of computer storage (e.g., the metadata repository). This allows it to be referenced or updated on an ongoing basis. New data sources may be added to the enterprise data model after the initial model has been constructed, which will result in new elements being added to the model. Data sources already included in the enterprise data model may be reanalyzed. In such a case, certain existing elements of the model may change in response to analysis of new elements.
The “metadata repository” itself can be the output. In other words, each step of the analysis generates data into the repository, which is preferably a database. The repository could also be in the form of a relational database, object-oriented database or text/XML files, and could be stored in any format.
At any point in the process, the repository can be exported or otherwise output to a form suitable for use with BI systems. The repository itself can be tailored to comply with the input requirements of a BI system or an API can be used to convert data from the repository into the proper form for input into a BI system.
Exemplary repository output is illustrated in S120-3 with reference to step 120. In simple terms, step 120 determines the count of distinct or NOT NULL values in columns along with the number of rows in the table itself. For example, assume that the data source has a single table called Employee with 4 columns called Employee_id, Employee_name, Department_id and Salary. The metadata could be stored in a relational database in the following tables:
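The repository schema below is a hypothetical sketch; the table and column names (META_TABLES, META_COLUMNS and their attributes) are assumptions used only to illustrate the kind of information captured:

    CREATE TABLE META_TABLES (
        table_name     VARCHAR(128),   -- analyzed table
        row_count      INTEGER         -- COUNT(*) obtained in steps 120-130
    );

    CREATE TABLE META_COLUMNS (
        table_name     VARCHAR(128),
        column_name    VARCHAR(128),
        non_null_count INTEGER,        -- COUNT(<column>)
        distinct_count INTEGER         -- COUNT(DISTINCT <column>)
    );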
In an example where the Employee table has 1000 rows, all Employee_ids and Employee_names are unique, there are 10 distinct departments, only 900 employees have been assigned to a department, there are 50 distinct salaries, and every employee receives a salary, the analysis would populate the above tables as follows:
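Using the hypothetical repository tables sketched above, the populated content would be:

    INSERT INTO META_TABLES  VALUES ('Employee', 1000);

    INSERT INTO META_COLUMNS VALUES ('Employee', 'Employee_id',   1000, 1000);
    INSERT INTO META_COLUMNS VALUES ('Employee', 'Employee_name', 1000, 1000);
    INSERT INTO META_COLUMNS VALUES ('Employee', 'Department_id',  900,   10);
    INSERT INTO META_COLUMNS VALUES ('Employee', 'Salary',        1000,   50);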
Alternatively, the same metadata could be stored in an XML document rather than in relational tables.
According to this example, the same data (output from the analysis) can be stored in 2 different formats. However, one skilled in the art will understand that the data could be stored in many other formats and also could be easily converted from one format to another.
According to embodiments of the present invention, some or all of the above-described method steps of the invention may be performed automatically or with the assistance of operator input. Further, the invention may comprise means for performing one or more of the above-described functions or steps, including, but not limited to, appropriate computer hardware and software configured to access disparate data sources, perform the analysis and generate metadata as described in the above-processes with reference to
One skilled in the art will readily understand that numerous commercially available hardware processors and operating systems may be used to implement the method of the present invention. Program units (including, e.g., a set of APIs or similar interfaces) may be provided to expose elements of the enterprise data model or perform certain operations described above programmatically. Further, an operator control board or graphical user interface (GUI) can be provided for user input and control, or computer programs that implement any of the above-steps could be executed from a command line of a computer system connected with the enterprise management system being analyzed by the present invention. One skilled in the art will understand that a combination of the above could be used to implement the present invention.
System 100 is shown in communication with an enterprise management system 108 including one or more servers 108A and one or more data sources 108B. System 100 can connect to, and communicate with, the enterprise management system 108 over an electronic data network 106 as shown, or the system can be directly connected to the enterprise management system. Examples of suitable electronic data networks include, but are not limited to, local area networks (LAN), wide area networks (WAN), and distributed networks such as the Internet. It will also be appreciated that the system can be connected to multiple data sources via a combination of direct and networked connections. The system 100 is configured to execute a process for generating metadata, such as the one shown and described above with reference to
As already described above, information useful for generating robust metadata, such as SQL code, reports, spread sheets, and other informative documents may be found on other computer systems or in other network locations, such as on network servers 112 and user desktops 114.
According to a preferred embodiment, system 100 is connected to one or more computer systems running BI tools. The BI tools may operate on any one of the enterprise management system 108, the server 112, or the user computer 114, or some combination of the foregoing.
According to embodiments of the present invention, the system can be configured to accept user input as follows:
Data sources to analyze: the system may be configured to allow a user to specify any number of data sources to analyze. The following information is preferred for each data source:
Tables to analyze in each data source: By default, all tables in each data source should be analyzed. However, the user may specify a filter and/or select through a GUI only a subset of tables to be included in the analysis. Each table in a given data source may be identified by the 3-ary (ternary) tuple with the attributes (catalog_name, schema_name, table_name). The user may define a filter in the form of SQL to filter the set of tables; thus, a user may specify a filter over these attributes, for example as sketched below.
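The following predicate is an illustrative assumption (the schema name 'SALES' is hypothetical):

    table_name NOT LIKE '%TEST%' AND schema_name = 'SALES'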
In addition to such a filter, the user may also explicitly select or deselect individual tables through a GUI.
Name ontologies to use for each data source: The system uses name ontologies or rules to generate alternate column names. These may be specified for each data source.
Sentinel Values: User may specify the sentinel values for various data types. These values are used while inferring non-rigorous table relationships.
Options for inferring table relationships: User may specify various options to be used while inferring table relationships. Different options may be defined based on the number of columns in the key. Various sub-options allowed are:
Queries from external data sources: The system can be configured to obtain all procedural code from the relational catalogs. Additionally, the system allows the user to provide an additional set of queries that can be used to aid in the analysis. This is done by providing a list of files containing such external queries for each data source.
The system should be configured to accept data regarding expected rows in a table or probability of the data in a table being representative of a mature relational state: These are used to assign probabilities (level of confidence) to constraints inferred by the system.
Probability thresholds for inferred constraints: two probabilities ‘l’ and ‘h’ can preferably be specified for each data source. ‘l’ specifies a probability threshold such that if the computed probability for an inferred constraint is below ‘l’, the inference is automatically rejected. ‘h’ specifies a probability threshold such that if the computed probability for an inferred constraint is the same as or above ‘h’, the inference is automatically accepted.
User acceptance of inferred constraints: For inferred constraints whose computed probability is >=‘l’ and <‘h’, the system can be configured to prompt the user to accept/reject the inference.
System defaults: The system can be configured to allow the user to specify default probability thresholds and name ontologies to be associated with data sources. The data-source-level settings can then either override the default probability thresholds or extend the name ontologies.
The present invention can be arranged in many known architectures. Accordingly, the system of the present invention can be implemented in a centralized or distributed architecture. Programs implementing steps of the process can be executed from a command line prompt, such as on client interface 102, using inputs via XML files or the like, for example an XML input file that conforms to a corresponding DTD.
Accordingly, the program can be invoked from the command line with such an XML input file supplied as an argument.
The output of the system (e.g., inferred constraints, etc.) can be in XML format as well. In a preferred embodiment, the user may modify the output file and provide the modified output file as input to re-perform the analysis.
A simple, exemplary user interface is shown in
Other methods to provide user input and control are contemplated as well. For example, the input could be provided via simple text files, such as windows.ini or the like, or could simply be passed as command line arguments. Similarly, a GUI may be a web-based or Windows-based client, or use other known interface technology.
Thus it would be possible to perform a reanalysis periodically by scheduling invocation of such an API to reanalyze all existing database interfaces.
As a result of the above configurations, universally applicable computer software can be provided that creates metadata from any group of enterprise databases by interrogating and reverse engineering the databases. The present invention infers specific information and rules about the retrieved data that are then used to streamline and accelerate the output of the enterprise software. Optionally, the invention uses predefined, as well as user-definable specifications, to aid in the inference. The present invention also provides a method of creating such inference that may be applied to any relational or relational-like database systems that expose their metadata through some means and provides a SQL-like language to query the data itself.
The software of the present invention creates rules that allow quick implementations and iterations that produce output that becomes business knowledge for critical decision-making by the user of the software product. The present invention is effective with any COTS (commercial off-the-shelf) enterprise software used by corporations today. Whether the application is CRM (customer relation management), accounting, specialized inventory or any other data intensive analysis, the present invention enables correct results quicker, better and cheaper than competitive software.
Thus, according to the above-described embodiments, systems and methods are provided for analyzing data sources to automatically infer constraints and relationships to generate metadata that is particularly useful with business intelligence (BI) systems.
One non-limiting advantage of the present invention is that the systems and methods make possible the reverse engineering of the most complicated enterprise management systems, which may include disparate data sources, and the production of a comprehensive and robust metadata repository for use with BI platforms. Until the present invention, no other system existed that performed as in-depth an analysis, nor did one exist that could analyze disparate data sources like the present invention.
One feature of the present invention includes a system for “automatic” inference of NOT NULL, unique and referential integrity constraints. Such inferences serve primarily the following two purposes:
According to embodiments of the present invention, a number of techniques can be employed to derive information from data sources, including, but not limited to, data catalog queries, interrogation of data, program analysis, and analysis of other queries.
For example, queries can be run against the source database catalog for constraints. Data can be interrogated to reverse engineer various kinds of constraints. The primary challenge with this technique is the computational infeasibility of trying out all permutations and combinations. This challenge is addressed by the algorithm of the present invention, which reduces the search space by looking at a host of other information such as column naming conventions, data distribution, and other attributes. Analysis of existing procedural code, SQL and other common programming languages can be performed to aid in the inference of data relationships. Analysis of queries actually executed against the database over a period of time can also provide additional information, such as potential keys, that can aid in this inference.
As will be understood by one skilled in the art, the degree of sophistication of the reverse engineering systems and methods determines the level of manual effort required to populate the metadata repository. That is, although it is preferred that the present invention perform all steps automatically, it is realized that extremely large and/or complex systems often present unique obstacles that require some level of manual analysis or customization. If, however, all data sources were accessible from a single point, theoretically, the present invention could be implemented in a plug-and-play embodiment.
Thus, a number of preferred embodiments have been fully described above with reference to the drawing figures. Although the invention has been described based upon these preferred embodiments, it would be apparent to those of skill in the art that certain modifications, variations, and alternative constructions could be made to the described embodiments within the spirit and scope of the invention.