The present disclosure relates to the automatic discovery of relationships between database tables.
Organizations employ databases storing information, such as information about business processes. It is increasingly common for knowledge about the structure and content of such databases to be lacking or not available at all. For example, legacy databases may be undocumented and staff familiar with their design and implementation may have moved-on. This can make the interpretation and exploitation of information stored in such databases very challenging.
Current approaches to addressing the determination of relationships between databases rely on close inspection of data stored in database tables and explicit relationships between tables. Such approaches often require intensive human analysis. To the extent that such processes may be automated, the need to characterize data types and contents within database tables involves processing large volumes of stored data to define bounds and formats of fields.
Accordingly, it is beneficial to provide improvements to the identification of relationships between data stored in databases.
According to a first aspect of the present disclosure, there is provided a computer implemented method of identifying one or more relationships between columns in a database, the method comprising: determining a subset of a set of all columns of all tables in the database, the columns in the subset satisfying predetermined primary key characteristics, each predetermined characteristic being defined by a rule for identifying a column as a potential primary key; identifying one or more relationships between columns in the subset and columns in the set of all columns based on each of: primary key and foreign key relationships between columns; and relationships between columns indicated in one or more scripts, each script including instructions for accessing the database.
In some embodiments, the rule of a primary key characteristic includes one or more of: a requirement that a column is explicitly identified as a primary key; a requirement that data stored in a column is alphanumeric; a requirement that each data item stored in the column is absent null data; a requirement that each data item stored in the column is unique within the column; and a requirement that each data item stored in the column is comprised of a predetermined subset of characters.
In some embodiments, a primary key and foreign key relationship is identified for a first and second columns by one or more of: an explicit identification that the first column is a primary key and the second column is a corresponding foreign key; and a determination that the second column includes data items storing only values that appear in the first column.
In some embodiments, a script indicates a relationship between a first and second columns by the script including a statement that links the two columns such as a database join statement.
In some embodiments, the columns in the subset are modelled in a data structure including an element for each column, and wherein identifying one or more relationships between columns in the subset and columns in the set of all columns further comprises: identifying a column in the subset as related to a column in the set of all columns; modelling the related column in the set of all columns as an element in the data structure; and modelling the relationship between the column in the subset and the column in the set of all columns in the data structure.
In some embodiments, the modelled relationship in the data structure is adapted to indicate a confidence of the relationship, the confidence being based on how the relationship was identified.
According to a second aspect of the present disclosure, there is provided a computer system including a processor and memory storing computer program code for performing the method set out above.
According to a third aspect of the present disclosure, there is provided a computer program element comprising computer program code to, when loaded into a computer system and executed thereon, cause the computer to perform the method as described above.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
Embodiments of the present disclosure provide for the identification of relationships between columns 202 in a database. Notably, the columns 202 in
A relationship identifier 206 component is provided as a hardware, software, firmware or combination component configured to identify one or more relationships between columns 202 in the database 200. The relationship identifier is operable to determine a subset of all columns 202 satisfying predetermined primary key characteristics 204. The primary key characteristics 204 are characteristics of a column within a database table indicative of the column being a primary key for the database table. As will be apparent to those skilled in the art, a primary key in a database table can be a column that is used to uniquely identify a record in a relational database table.
In embodiments of the present disclosure, each primary key characteristic 204 includes a rule for identifying a column as a primary key. For example, the rule of a primary key characteristic 204 can include a requirement that a column is explicitly identified as a primary key, such as by being indicated as such in a definition of the column for a database table. Additionally or alternatively, the rule of a primary key characteristic 204 can include a requirement that data stored in a column is alphanumeric. This reflects a common requirement that primary key data is alphanumeric in various databases. Additionally or alternatively, the rule of a primary key characteristic 204 can include a requirement that each data item stored in the column is absent NULL data. That is, that data stored in the column is NOT NULL, or that the column is indicated as a NOT NULL column. Such an indication precludes the storage of NULL in the column within records stored in a database table as a primary key cannot be a NULL value. Additionally or alternatively, the rule of a primary key characteristic 204 can include a requirement that each data item stored in the column is unique within the column. Additionally or alternatively, the rule of a primary key characteristic 204 can include a requirement that each data item stored in the column is comprised of a predetermined subset of characters, such as WORD characters that can be characterized as all alphanumeric characters and an underscore character. For example, the predetermined subset of characters can exclude whitespaces and special characters, for example, as is typical among primary key columns of databases.
Thus, the relationship identifier 206 identifies a subset of columns 202 satisfying the primary key characteristics 204. Subsequently, the relationship identifier 206 is operable to identify relationships between the columns in the identified subset of columns and the set of all columns 202. The relationships can be identified based on primary key and foreign key relationships between the columns Additionally, the relationships can be identified based on indications of relationship in scripts including instructions for accessing the database 200.
A primary key and foreign key relationship can be identified for a first and second columns by, for example, an explicit identification that the first column is a primary key and the second column is a corresponding foreign key. Additionally or alternatively, a primary key and foreign key relationship can be identified by a determination that the second column includes data items storing only values that appear in the first column.
Scripts including instructions for accessing the database 200 can include, for example, inter alia, existing scripts extracted from the database 200 or software interfacing with or using the database 200. For example, such scripts can include existing database logic such as procedures, functions, views or user queries. From such scripts, columns that co-occur in linking tables together can be identified such as script instructions involving table JOIN or the like. Such scripts provide indications of relationships between columns
In one embodiment, the relationship identifier 206 models the columns in the identified subset in a data structure, such as a graph data structure in which elements such as nodes in the data structure correspond to columns. Using such a data structure, the relationship identifier 206 can indicate or identify relationships between columns in the subset and columns in the set of all columns 200 by modelling identified relationships as relationships in the data structure. In particular, related columns can be indicated in the data structure and relationships can be indicated by, for example, edges in a graph data structure. Furthermore, indications of relationship in the data structure can be further supplemented by an indication of a degree of confidence of the relationship, such as by evaluating a degree of confidence depending on how the relationship was identified. For example, an explicitly indicated primary and foreign key relationship can provide a high degree of confidence; whereas an inferred relationship based on stored data or script information can provide a relatively lesser degree of confidence.
Thus, in use, the relationship identifier 206 is operable to identify relationships between pairs of columns in the database 200. Such relationships are especially valuable where the database 200 is otherwise undocumented or poorly understood since it serves to indicate commonality of entities represented by records in the database 200 by commonality of related columns For example, records stored in a first database table having a column determined to be related to a second database table can be associated with the records in the second database table based on the column relationship. Accordingly, new representations of data in the database 200 can be generated, new queries can be formulated and software can exploit data stored in the database 200 more efficiently.
An exemplary embodiment of the present disclosure is described below. In this embodiment, we first consider and extract information about primary keys that has already been explicitly defined for each table in the database model. In an ideal relational database, many or even all tables should contain a primary key, but in real-world databases this it is not always the case. Provided that this information already exists in the database for at least some tables, this can be accessed with pre-defined statements that retrieve the information of a database object that describes all constraint definitions on tables which are accessible to users. As shown in Algorithm 1, given a set of tables, a subset of tables with known primary keys are initially retrieved. We use a function to retrieve all tables with their respective primary keys if they exist.
In the next phase, we automatically infer, without any prior knowledge, potential primary key columns by using other information. Our inference is based on four tests, which are explained below:
The first test requires that a primary key must be of an alphanumeric datatype. We obtain a list of all tables in the database with their respective columns and datatypes, and then select tables with associated alphanumeric columns. In a typical Oracle database, this would include columns with datatypes such as char, number, nchar, varchar, nvarchar, varchar2 and nvarchar2.
One of the major constraints that defines a primary key is a NOT NULL constraint which means that a primary key must not be empty (null). This information might not be explicitly defined for a table but we can infer by checking for non-null columns of each table in the database. For example, a statement is embedded in the algorithm to retrieve a null value from each column of a table. For each column, the algorithm checks whether the query returns an empty result or not. An empty result indicates that a column contains no empty/null values.
For a column to be regarded as a potential primary key candidate, it must meet the uniqueness test—primary key column must only contain unique, i.e. not duplicate, values. Unique constraint definition enforces the uniqueness of a column by ensuring that no duplicate values exist in the column If this constraint has not already been defined explicitly in the database, a uniqueness test can be performed on each column in a table. If a column possesses the uniqueness property, then the total number of rows in the table must be equal to the number of distinct values in the column. In our algorithm, we establish both numbers and if they match, a column passes the uniqueness test.
This test requires that values in a primary key column contain only “word” characters. For example, word characters include any letter, any digit and the underscore character. Some characters can be specifically excluded, such as whitespaces and special characters. Our algorithm retrieves all values for a column of a table and then use a check to establish whether each value consists of only word characters.
Table 1 shows a typical sample table as may be found in a business database. It holds information about different departments and has four columns: Department_ID, Department_Name, Manager_ID and Location_ID. In this table, only the first column Department_satisfies all four tests, and thus is identified as a potential primary key column.
Primary—Foreign Keys Relationships
To find all column pairs where one column is a primary key/key candidate from X and the other column is a corresponding foreign key column from another table in C, we distinguish two cases:
I. Either a column is explicitly marked as a foreign key in the database itself,
II. Or we need to establish a 100 percent one-to-one match between the two columns, meaning the second (foreign key) column can only contain values that appear in the first (primary key candidate) column.
Algorithm 1 outlines our approach. We establish first the set A of explicitly specified primary keys, then calculate set B of primary key candidates by checking all potential columns against the four individual tests, construct the union X=A∪B, and finally calculate all pairs of a) primary key/primary key candidate from set X and b) foreign key/foreign key candidate from set C.
The check w(ci, cj) returns 1 if two columns ci and cj are in a (potential) primary/foreign key relationship, and 0 otherwise:
As Algorithm 1 does an exhaustive search across the considered tables in a database, it can be computationally expensive. It is therefore possible to limit the set of tables to be considered for the detection of potential primary key candidates and/or potential foreign key candidates to a smaller subset of tables in T, in particular if an implementation is used in an interactive way of exploration.
The usage-based approach makes use of existing scripts extracted from a database. This can include existing database logic such as procedures, functions, views or user queries. From these scripts, we extract all pairs of columns that co-occur in linking tables together. In short, we exploit the fact that other, previous users have already created and used logic that links certain table columns together, and automatically infer that a meaningful link between these columns is likely to exist.
Let S be the set of existing scripts for a database: S={s1, . . . , sq}.
Each script si references a number of tables {ti1, . . . , tin} in its logic.
If script si contains a link statement, e.g. a join statement, between tables tix and tiy, and more specifically links columns cix and ciy belonging to tables tix and tiy, we infer a link between those two columns.
If a link exists, we add the two columns as nodes to a graph g2 and connect them in the graph.
The steps involved in the usage-based approach are provided in algorithm 2. We automatically identify from each script, all pairs of columns (cix, ciy) used by its logic to link two tables referenced in script si.
Combining the Two Approaches
By combining the two approaches, i.e. by overlaying the two graphs g1 and g2 extracted in Algorithms 1 and 2, we get a more comprehensive picture of the data structure and data links than we would if we used only one of the algorithms in isolation. This can be extended to include further (heuristic) approaches to establish relationships between table columns.
Furthermore, we can establish two additional parameters for each edge in the resulting overall graph, i.e. for each candidate relationship between two table columns:
Both confidence and frequency values of relationship between table columns can be used to sort and prioritize the different relationship candidates, and thus help to look at the most certain and/or most frequently used relationship first, thereby providing an ordered series of relationships.
Insofar as embodiments of the disclosure described are implementable, at least in part, using a software-controlled programmable processing device, such as a microprocessor, digital signal processor or other processing device, data processing apparatus or system, it will be appreciated that a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present disclosure. The computer program may be embodied as source code or undergo compilation for implementation on a processing device, apparatus or system or may be embodied as object code, for example.
Suitably, the computer program is stored on a carrier medium in machine or device readable form, for example in solid-state memory, magnetic memory such as disk or tape, optically or magneto-optically readable memory such as compact disk or digital versatile disk, etc., and the processing device utilizes the program or a part thereof to configure it for operation. The computer program may be supplied from a remote source embodied in a communications medium such as an electronic signal, radio frequency carrier wave or optical carrier wave. Such carrier media are also envisaged as aspects of the present disclosure.
It will be understood by those skilled in the art that, although the present disclosure has been described in relation to the above described example embodiments, the disclosure is not limited thereto and that there are many possible variations and modifications which fall within the scope of the disclosure.
The scope of the present disclosure includes any novel features or combination of features disclosed herein. The applicant hereby gives notice that new claims may be formulated to such features or combination of features during prosecution of this application or of any such further applications derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.
Number | Date | Country | Kind |
---|---|---|---|
20160295.0 | Mar 2020 | EP | regional |
The present application is a National Phase entry of PCT Application No. PCT/EP2021/054385, filed Feb. 23, 2021, which claims priority from EP Patent Application No. 20160295.0, filed Mar. 1, 2020, each of which is hereby fully incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/054385 | 2/23/2021 | WO |