The present invention generally relates to database management systems. In particular, the present system relates to defining and unifying objects in different data sources to share data between data sources or merge data sources into a target data structure.
Databases are commonly used in businesses and organizations to manage information on employees, clients, products, etc. These databases are often custom databases generated by the business or organization or purchased from a database vendor or designer. Information management techniques and goals are continually evolving, requiring integration of databases into a common database or a sharing of data between databases. For example, a business with an extensive customer database may acquire another company. The business wishes to merge or integrate the customer databases or otherwise share information that is common in purpose. To merge or integrate source databases into a target database, the source databases are typically manually analyzed on a field-by-field or table-by-table basis to identify common structures in which data can be integrated or shared.
Information integration requires identification of objects (i.e., data structures) that are common in purpose to the data sources or databases being integrated. For example, company A with database A has merged with company B with database B. Both database A and database B are designed to track orders. Company A defines a customer object within database A as comprising the name of the customer, the location of the customer, and the revenue of the customer. Company B defines a customer object within database B as comprising the name of the customer, the location of the customer, and the number of employees associated with the customer. The name and location of the customer are common attributes of the customer object and can be shared between company A and company B, provided a method for sharing can be achieved.
These common objects, referenced herein as universal data objects, facilitate effective querying and use of integrated data by presenting a common data interface to sources. Universal data objects further facilitate an understanding by application developers and database administrators of the content of data sources and how to navigate between objects and attributes within the data sources. Universal data objects can be used as the target of schema mapping; different sources can be mapped to the same set of universal data objects, making the sources appear uniform.
A conventional approach to defining universal data objects requires manual examination of objects residing in different sources (Application Specific Business Objects, or ASBOs). The manually identified objects (sometimes referred to as Generic Business Objects, or GBOs) are then typically unified according to some unwritten set of heuristics and “rules of thumb”. This approach is highly subjective and error-prone because of human involvement. Furthermore, this approach is not scalable to large numbers of sources and objects.
Thus, there is a need for a method that replaces the manual process of defining and unifying objects in databases with an automated one, making universal data object discovery more objective, more scalable, and less error-prone than conventional approaches. What is therefore needed is a system, a service, a computer program product, and an associated method for automatically discovering universal data objects. The need for such a solution has heretofore remained unsatisfied.
The present invention satisfies this need, and presents a system, a service, a computer program product, and an associated method (collectively referenced herein as “the system” or “the present system”) for automatically discovering universal data objects (also referred to as Universal Business Objects, or UBOs) in a set of data sources. The purpose of a universal data object is the exchange of such objects at a desired level of granularity. The present system automatically identifies candidate universal data objects, ranks the candidate universal data objects according to predetermined criteria, and merges source schemas into one or more unified universal data objects within the set of data sources.
The present system comprises a schema processing module, a clustering module, and a merging module. From data inputs and a set of control parameters, the schema processing module computes a degree of sharing score for composite structures in the source schemas. The data inputs comprise source schemas expressed as leaf-level data elements and tree-like composite structures, one or more similarity values of elementary and composite data structures across and within data sources, and one or more foreign key relationships across and within data sources.
The schema processing module ranks structures with respect to an associated degree of sharing score and identifies as candidate universal data objects those structures whose degree of sharing score exceeds a predetermined threshold. Control parameters place further restrictions on candidate universal data objects. The control parameters comprise a minimum and maximum size of the universal data object in terms of bytes, a minimum and maximum difference in cardinality (number of instances) between a parent and a child in the candidate universal data object, and a minimum degree of sharing of the candidate universal data objects.
The merging module calculates a similarity between candidate universal data objects and merges candidate universal data objects that are similar. Merging by the merging module comprises taking an intersection of the schemas of the candidate universal data object or taking a union of the schemas of the candidate universal data object. The merged universal data objects are the output of the present system.
The present system may be embodied in a utility program such as a universal data object discovery utility program. The present system also provides means for the user to identify a universal data object by specifying a set of data sources comprising schema similarity values, specifying a set of control parameters, specifying any required additional metadata, and then invoking the universal data object discovery utility to search and identify such universal data objects. The set of control parameters comprises a minimum and maximum size of the universal data object, a minimum and maximum difference in relative cardinality (number of instances) between a parent and a child in a candidate universal data object, and a minimum value for a degree of sharing score of a candidate universal data object.
The various features of the present invention and the manner of attaining them will be described in greater detail with reference to the following description, claims, and drawings, wherein reference numerals are reused, where appropriate, to indicate a correspondence between the referenced items, and wherein:
The following definitions and explanations provide background information pertaining to the technical field of the present invention, and are intended to facilitate the understanding of the present invention without limiting its scope:
Attribute: an element of an object. Attributes can be simple, comprising only one attribute, or complex, comprising additional attributes in a structure. Attributes can also be repeating, occurring more than once.
Cardinality: A number of instances of a value or item occurring in a data structure element such as an object or an attribute.
Foreign key: a key that uniquely relates one object with another object.
Object: a data structure element in a schema or an object graph.
Universal Data Object: An object with elements and function in common across different data sources.
The data source 1, 20, comprises a data structure that comprises schemas. For the data source 1, 20, similarities between the schemas in the data structure of the data source 1, 20, have been determined. Furthermore, cardinalities (instances) of objects and attributes within the data source 1, 20, have been determined and foreign keys have been identified.
The data source 2, 25, comprises a data structure that comprises schemas. For the data source 2, 25, similarities between the schemas in the data structure of the data source 2, 25, have been determined. Furthermore, cardinalities (instances) of objects and attributes within the data source 2, 25, have been determined and foreign keys have been identified.
The schema processing module 205 constructs a single object graph that represents some or all of the source schemas (step 310). The schema processing module 205 adds to the object graph pairwise similarity scores and functional dependency information received as input. The schema processing module 205 computes a degree of sharing score for objects in the object graph (step 400, further described in
In one embodiment, the merging module 220 applies an intersection semantic to selected universal data objects that are to be merged. The intersection semantic merges those attributes that are common to all the similar selected universal data objects. Attributes found in selected universal data objects that are not in common are pruned. In another embodiment, the merging module 220 applies a union semantic to selected universal data objects that are to be merged. The union semantic merges those attributes that are found in any of the universal data objects.
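The two merge semantics can be illustrated with a minimal sketch. It assumes that each universal data object has been reduced to a set of attribute names and that similar attributes have already been mapped to a common name. The function name `merge_candidates` and the attribute labels are hypothetical and are used for illustration only.

```python
def merge_candidates(schemas, semantic="intersection"):
    """Merge attribute sets of similar universal data objects.

    `schemas` is a non-empty list of iterables of attribute names
    (an assumed representation, not one prescribed by the description).
    """
    attribute_sets = [set(schema) for schema in schemas]
    if semantic == "intersection":
        # Keep only attributes common to all objects; the rest are pruned.
        return set.intersection(*attribute_sets)
    if semantic == "union":
        # Keep every attribute found in any of the objects.
        return set.union(*attribute_sets)
    raise ValueError(f"unknown merge semantic: {semantic!r}")


# Customer objects of company A and company B from the background example.
customer_a = {"name", "location", "revenue"}
customer_b = {"name", "location", "employees"}

merged_intersection = merge_candidates([customer_a, customer_b], "intersection")
merged_union = merge_candidates([customer_a, customer_b], "union")
```

Under the intersection semantic only the name and location attributes survive; under the union semantic all four attributes are retained.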
The schema processing module 205 computes a structural sharing score for one or more objects in the object graph (step 405). For the selected object, the schema processing module 205 considers the parent structures or chain of ancestors associated with the selected object. Each link in the object graph of an object to a parent or superclass contributes to the structural sharing score of the selected object; i.e., the more parents or superclasses an object O has, the higher the score. For example, a link from object O to its immediate parent(s) has a structural sharing value of 1.0. Links to the parents of the parents of object O have a structural sharing value of 0.5. Each level of ancestry has a structural sharing value that is one-half of the structural sharing value of an immediately lower level. For instance, if object O is 3 levels down from a root in a tree structure, object O has a structural sharing score of 1+0.5+0.25=1.75. The position-dependent structural sharing score is calculated by summing, over the ancestors of the object, a value that is halved for each additional link of distance between the object and the ancestor, according to the following equation:
$$\text{Score} = \sum_{\text{ancestors}} \left(\tfrac{1}{2}\right)^{n-1},$$
where n is the distance from the object to the ancestor measured as the number of links.
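The structural sharing computation can be sketched as follows. The sketch assumes the object graph is acyclic and is given as a mapping from each object to its parents; the names `ancestor_distances`, `structural_sharing_score`, and the example graph are hypothetical. Counting a shared ancestor once per path is likewise an assumption, since the description does not address that case.

```python
def ancestor_distances(obj, parents):
    """Yield the link distance n to each ancestor of obj, once per path.

    `parents` maps an object name to a list of its parent names
    (an assumed representation; the graph is assumed to be acyclic).
    """
    frontier = [(parent, 1) for parent in parents.get(obj, [])]
    while frontier:
        node, distance = frontier.pop()
        yield distance
        frontier.extend((parent, distance + 1) for parent in parents.get(node, []))


def structural_sharing_score(obj, parents):
    """Sum (1/2)**(n - 1) over the distances n to every ancestor."""
    return sum(0.5 ** (n - 1) for n in ancestor_distances(obj, parents))


# Object O is three levels below the root: O -> B -> A -> Root.
parents = {"O": ["B"], "B": ["A"], "A": ["Root"]}
score = structural_sharing_score("O", parents)  # 1 + 0.5 + 0.25
```

For this chain the result matches the 1.75 of the worked example, and a root object with no parents receives a score of zero.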
The schema processing module 205 selects an initial object in the object graph (step 410). The schema processing module 205 selects a similar object with a similarity to the selected object that is above a predetermined threshold (step 415). The schema processing module 205 computes a value relationship for the selected object and the selected similar object (step 420) by multiplying the similarity of the selected similar object by the structural sharing value of the selected similar object. Computation of the value relationship considers the similarity of object O to other objects and uses the structural sharing value of those other objects to increase the value relationship score of object O. For instance, if object O is similar to object X (with a similarity value 0.8) and object X has a structural sharing value of 1.5, then the computed value relationship between object O and object X is 0.8*1.5.
The schema processing module 205 determines whether additional similar objects remain for processing for the selected object (decision step 425). If yes, the schema processing module 205 selects a next similar object, that is, a next object that has a similarity to the selected object that is above a predetermined threshold (step 430). The schema processing module 205 computes the value relationship for this next similar object and the selected object as before (step 420). The schema processing module 205 repeats step 420 through step 430 until no additional objects remain with similarity to the selected object above a predetermined threshold.
The schema processing module 205 computes a value relationship score for the selected object by summing the computed value relationships determined in step 420 through step 430 (step 435). The schema processing module 205 performs step 415 through step 430 for simple attributes and complex attributes.
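Steps 415 through 435 can be sketched as a single loop. The sketch assumes the pairwise similarities of the selected object and the structural sharing values of the other objects are available as dictionaries; the function name, the threshold of 0.5, and the entry for object Y are hypothetical.

```python
def value_relationship_score(similarities, structural_values, threshold):
    """Sum similarity * structural sharing value over sufficiently similar objects.

    `similarities` maps each other object to its similarity with the
    selected object; `structural_values` maps each object to its
    structural sharing value (both are assumed representations).
    """
    total = 0.0
    for other, similarity in similarities.items():
        if similarity > threshold:  # only objects above the threshold count
            total += similarity * structural_values[other]
    return total


# Object O is similar to X (0.8); X has a structural sharing value of 1.5.
# Object Y is a hypothetical object that falls below the threshold.
similarities_of_o = {"X": 0.8, "Y": 0.3}
structural_values = {"X": 1.5, "Y": 1.0}
value_score = value_relationship_score(similarities_of_o, structural_values, threshold=0.5)
```

Only object X contributes here, so the score equals the single value relationship 0.8*1.5 from the example above.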
The schema processing module 205 determines whether an instance of the selected object is referenced by another object (decision step 440). If yes, a foreign key relationship in another object points to the selected object. A foreign key relationship indicates that a specific instance of object O (i.e., a key field of object O) is referenced by another object X (i.e., a foreign key field of object X).
The schema processing module 205 selects an initial foreign key referencing the selected object (step 445). The schema processing module 205 computes a foreign key relationship value for the selected foreign key and the selected object (step 450) by multiplying a foreign key strength for the selected foreign key by the structural sharing score of the primary key in the selected object to which the foreign key is pointing. If, for example, the foreign key relationship has foreign key strength of 0.9 and object X has a structural sharing score of 1.75, the computed foreign key relationship value is 0.9*1.75.
The schema processing module 205 determines whether additional foreign keys that reference an instance of the selected object remain for processing (decision step 455). If yes, the schema processing module 205 selects a next foreign key (step 460). The schema processing module 205 computes the foreign key relationship value for this next foreign key and the selected object as before (step 450). The schema processing module 205 repeats step 450 through step 460 until no additional foreign keys remain that reference an instance of the selected object.
The schema processing module 205 computes a foreign key relationship score for the selected object by summing the computed foreign key relationship values determined in step 450 through step 460 (step 465).
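The foreign key loop of steps 445 through 465 can be sketched in the same way. The sketch assumes each foreign key referencing the selected object is described by a pair of its foreign key strength and the structural sharing score of the primary key it points to; the function name and this pair representation are hypothetical.

```python
def foreign_key_relationship_score(foreign_keys):
    """Sum strength * structural sharing score over all referencing foreign keys.

    `foreign_keys` is an iterable of (strength, key_structural_score)
    pairs (an assumed representation).
    """
    return sum(strength * key_score for strength, key_score in foreign_keys)


# One foreign key with strength 0.9 pointing at a key whose structural
# sharing score is 1.75, as in the example above.
fk_score = foreign_key_relationship_score([(0.9, 1.75)])
```

An object that is not referenced by any foreign key would pass an empty list and receive a score of zero, which corresponds to the case in which no foreign key relationship score is computed.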
The schema processing module 205 computes a degree of sharing score for the selected object by summing the foreign key relationship score (if any), the value relationship score, and the structural sharing score (step 470). If no instances of the selected object are referenced in decision step 440, no foreign key relations exist for the selected object and no foreign key relationship score is computed.
The schema processing module 205 determines whether additional objects remain for processing (decision step 475). If yes, the schema processing module 205 selects a next object (step 480) and repeats step 415 through step 480 until no additional objects remain for processing. The schema processing module 205 outputs degree of sharing scores for objects in the object graph (step 485).
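The combination performed in step 470 and the iteration over all objects can be sketched as follows. The component scores are assumed to have been computed already; the object names and numeric values are illustrative only and do not correspond to a single object in the examples above.

```python
def degree_of_sharing(structural, value_rel, fk_rel=None):
    """Sum the three component scores; fk_rel is None when the object
    is not referenced by any foreign key."""
    return structural + value_rel + (fk_rel if fk_rel is not None else 0.0)


# Hypothetical component scores for two objects in the object graph.
component_scores = {
    "O": {"structural": 1.75, "value_rel": 1.2, "fk_rel": 1.575},
    "P": {"structural": 1.0, "value_rel": 0.0, "fk_rel": None},
}

# One degree of sharing score per object (the output of step 485).
degree_scores = {
    name: degree_of_sharing(**parts) for name, parts in component_scores.items()
}
```

Object P has no referencing foreign key, so its degree of sharing score is the sum of its structural sharing score and value relationship score alone.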
The control parameters comprise a range in desirable size of a candidate universal data object; the range in desirable size comprises a minimum size and a maximum size. For example, a candidate universal data object can be an “address” of a person comprising 200 bytes; 200 bytes is a reasonable size for a universal data object. An example of an object that is not a reasonable selection for a universal data object is a CAD design comprising 1 GB. Another example of an object that is not a reasonable selection for a universal data object is a “name” of a person comprising 20 bytes; 20 bytes is generally too small for a universal data object. However, the “name” of a person may be an attribute of a universal data object.
The control parameters further comprise a range in relative cardinality (number of instances) of a candidate universal data object with respect to the parent of the candidate universal data object; the range in cardinality comprises a minimum and a maximum difference in relative cardinality between a candidate universal data object and the parent of the candidate universal data object.
The control parameters comprise a minimum degree of sharing score for the candidate universal data object. The degree of sharing score for candidate universal data objects is above a predetermined threshold that is the minimum degree of sharing score. Candidate universal data objects are objects that are common in the source schemas. The degree of sharing score indicates how common an object is in the source schema; objects that are desirable as candidate universal data objects have a desirable degree of sharing score. The selection module 210 selects as candidate universal data objects those objects that pass the filters of the control parameters (step 515).
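The filtering of step 515 can be sketched as a set of range checks. The class and field names, the numeric limits, and the statistics assigned to each object are hypothetical; only the object sizes of 200 bytes, 20 bytes, and 1 GB come from the examples above. The sketch also assumes that the difference in relative cardinality is supplied as a single number per object, since the description does not fix how it is measured.

```python
from dataclasses import dataclass


@dataclass
class ControlParameters:
    min_size: int                 # bytes
    max_size: int                 # bytes
    min_cardinality_diff: float   # relative to the parent
    max_cardinality_diff: float
    min_degree_of_sharing: float


@dataclass
class ObjectStats:
    name: str
    size: int                     # bytes
    cardinality_diff: float
    degree_of_sharing: float


def select_candidates(objects, params):
    """Return the objects that pass every control parameter filter."""
    return [
        obj for obj in objects
        if params.min_size <= obj.size <= params.max_size
        and params.min_cardinality_diff <= obj.cardinality_diff <= params.max_cardinality_diff
        and obj.degree_of_sharing > params.min_degree_of_sharing
    ]


params = ControlParameters(50, 10_000, 0.0, 10.0, 2.0)  # illustrative limits
objects = [
    ObjectStats("address", 200, 1.0, 3.5),
    ObjectStats("name", 20, 1.0, 3.5),               # too small
    ObjectStats("cad_design", 1_000_000_000, 1.0, 3.5),  # too large
    ObjectStats("rare_object", 200, 1.0, 0.5),       # degree of sharing too low
]
selected = [obj.name for obj in select_candidates(objects, params)]
```

With these illustrative limits only the address object is selected as a candidate universal data object.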
Otherwise, if the result of decision step 615 is no, the clustering module 215 determines whether the relationship between the parent and the candidate universal data object is 1:1 (decision step 625). If the relationship between the parent and the candidate universal data object is 1:1, the clustering module 215 inserts a foreign key into the parent (step 630) and links the inserted foreign key to a primary key in the candidate universal data object. Otherwise (if the relationship between the parent and the candidate universal data object is neither N:M nor 1:1), the relationship between the parent and the candidate universal data object is 1:N, and the clustering module 215 inserts a foreign key in the candidate universal data object (step 635) and links the inserted foreign key to a primary key in the parent.
After creating a separate relationship object (step 620), inserting a foreign key in the parent (step 630), or inserting a foreign key in the candidate universal data object (step 635), the clustering module 215 determines if additional candidate universal data objects remain for processing (decision step 640). If yes, the clustering module 215 selects a next candidate universal data object (step 645) and repeats step 610 through step 645 until no additional candidate universal data objects remain for processing.
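The three outcomes of splitting a candidate universal data object from its parent can be sketched as a small decision function. The sketch assumes that decision step 615 tests for an N:M relationship, which is inferred from the surrounding steps rather than stated in this passage. The function name, the string labels, and the tuple layout are hypothetical.

```python
def split_action(relationship, parent, candidate):
    """Decide how a candidate is detached from its parent.

    Returns (kind, holder, target): for a foreign key, `holder` is the
    object that receives the key and `target` is the object whose
    primary key it links to.
    """
    if relationship == "N:M":
        # Step 620: a separate relationship object connects the two.
        return ("relationship_object", parent, candidate)
    if relationship == "1:1":
        # Step 630: foreign key in the parent, linked to the candidate.
        return ("foreign_key", parent, candidate)
    if relationship == "1:N":
        # Step 635: foreign key in the candidate, linked to the parent.
        return ("foreign_key", candidate, parent)
    raise ValueError(f"unknown relationship type: {relationship!r}")


actions = [
    split_action(kind, "Parent", "Child") for kind in ("N:M", "1:1", "1:N")
]
```

Each candidate universal data object would be passed through such a decision in turn until none remain for processing.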
A source 1 (Src1 706) comprises an identifier (Name 708), a customer object (Cust 710), and an order object (Order 712). Cust 710 comprises an identifier (ID 714), a phone object (Phone 716), a name object (Name 718), and an address object (Addr 720). Phone 716 comprises an area code attribute (Area 722) and a phone number attribute (Nbr 724). Name 718 comprises a first name attribute (First 726) and a last name attribute (Last 728). Addr 720 comprises a street attribute (Street 730), a city attribute (City 732), and a state attribute (State 734). Order 712 comprises an identifier (ID 736), a date attribute (Date 738), a customer attribute (Cust 740), and a line item object (Line 742). Line 742 comprises an identifier (PrID 744), a quantity attribute (Qty 746), and a price attribute (Price 748).
A source 2 (Src2 750) comprises an identifier (Name 752), an employee object (Emp 754), and a department object (Dept 756). Emp 754 comprises an identifier (Num 758), a name object (N 760), and a home address object (Home 762). N 760 comprises a first name attribute (F 764) and a last name attribute (L 766). Home 762 comprises a street attribute (S 768), a city attribute (C 770), and a state attribute (ST 772). Dept 756 comprises an identifier (Num 774), a manager attribute (Mgr 776), an employee attribute (Emps 778), and a location object (LOC 780). LOC 780 comprises a street attribute (STR 782), a city attribute (CIT 784), a state attribute (STA 786), and a building attribute (BLD 788).
One-to-many relationships (1:N) or many-to-many relationships (N:M) between parent and child are indicated in the object graph 702 and the object graph 704 by a double arrow, represented by double arrow 790.
The schema processing module 205 quantifies the relationship values between parent and child, as shown in
The schema processing module 205 identifies similarities between attributes and objects that exceed a predetermined threshold as shown in
The schema processing module 205 identifies foreign keys in object graph 702 and object graph 704 and calculates foreign key scores, as illustrated in
The schema processing module 205 uses the foreign key scores (
The clustering module 215 splits candidate universal data objects from parent objects and inserts foreign keys as indicated in
The clustering module 215 separated Name 718 from Cust 710, inserted a foreign key (FK3 1215), and replaced the link to Cust 710 with a link from FK3 1215 to the identifier for Cust 710, ID 714. The clustering module 215 separated Addr 720 from Cust 710, inserted a foreign key (FK4 1220), and replaced the link to Cust 710 with a link from FK4 1220 to the identifier for Cust 710, ID 714. The clustering module 215 separated Line 742 from Order 712, inserted a foreign key (FK5 1225), and replaced the link to Order 712 with a link from FK5 1225 to the identifier for Order 712, ID 736.
The clustering module 215 separated Emp 754 from Src2 750, inserted a foreign key (FK6 1230), and replaced the link to Src2 750 with a link from FK6 1230 to the identifier for Src2 750, Name 752. The clustering module 215 separated Dept 756 from Src2 750, inserted a foreign key (FK7 1235), and replaced the link to Src2 750 with a link from FK7 1235 to the identifier for Src2 750, Name 752.
The clustering module 215 separated N 760 from Emp 754, inserted a foreign key (FK8 1240), and replaced the link to Emp 754 with a link from FK8 1240 to the identifier for Emp 754, Num 758. The clustering module 215 separated Home 762 from Emp 754, inserted a foreign key (FK9 1245), and replaced the link to Emp 754 with a link from FK9 1245 to the identifier for Emp 754, Num 758. The clustering module 215 separated LOC 780 from Dept 756, inserted a foreign key (FK10 1250), and replaced the link to Dept 756 with a link from FK10 1250 to the identifier for Dept 756, Num 774.
System 10 selects universal data objects as indicated in
System 10 merges the selected universal data objects as indicated in
Pseudocode for system 10 can be summarized as:
It is to be understood that the specific embodiments of the invention that have been described are merely illustrative of certain applications of the principle of the present invention. Numerous modifications may be made to the system, service, and method for automatically discovering universal data objects described herein without departing from the spirit and scope of the present invention. Moreover, while the present invention is described for illustration purposes only in relation to databases, it should be clear that the invention is applicable as well to, for example, any data source that can be represented as an object graph.