Method and system for federated querying of data sources

Information

  • Patent Application
  • 20050234889
  • Publication Number
    20050234889
  • Date Filed
    December 29, 2004
    20 years ago
  • Date Published
    October 20, 2005
    19 years ago
Abstract
A system for managing and querying a plurality of data sources, including an interface to a plurality of inter-related data sources of diverse types, each of the plurality of data sources having a schema that describes its data structure, a schema generator communicating with the interface, for generating a federated schema that describes the structures of the plurality of data sources and their inter-relationships, and for modifying the federated schema over time as the plurality of data sources undergo changes, and a query generator communicating with the schema generator, for generating a query for the federated schema. A method and a computer-readable storage medium are also described.
Description
FIELD OF THE INVENTION

The present invention relates to management and querying of enterprise data.


BACKGROUND OF THE INVENTION

Information integration applications take data that is stored in two or more data sources and build from them one large database, possibly a virtual database, containing information from all of the sources, so that the data can be queried as a unit. Thus, for example, enterprise accounting data may be stored within a relational database, and enterprise inventory may be stored within XML documents. Information integration enables an enterprise to access its various data sources from within a single data store application.


Information integration is discussed in Chap. 20 of Garcia-Molina, H., Ullman, J. D. and Widom, J., “Database Systems: The Complete Book”, Prentice-Hall, New Jersey, 2002. As pointed out in Sec. 20.1 of this reference, there are three basic modes of information integration: (i) federated databases, (ii) data warehousing, and (iii) mediation.


In a federated database architecture, data sources are independent, but one source can call on others to supply information.


In a data warehousing architecture, data from several sources is extracted and combined into a global schema. The data is then stored at the warehouse, which appears to the user like an ordinary database. Once data is in the warehouse, queries are issued by a user exactly as they would be issued to any database. However, user updates to the warehouse are generally forbidden, sine they are not reflected in the underlying sources, and thus can make the warehouse inconsistent with the sources.


A data warehouse is updated periodically, by reconstructing it from current data in the data sources. Typically, a data warehouse is updated once a night, when the system can be shut down, so that queries are not issued while the warehouse is being constructed. Alternatively, the data warehouse may be incrementally updated based on changes that have been made to the data sources since the last time the warehouse was modified.


Conventional data warehouses are generally expensive and inflexible. In addition, such data warehouses generally do not provide real-time operation.


In a mediation architecture, a software component, referred to as a mediator, supports a virtual database, which a user may query as if it were physically constructed. The mediator stores no data of its own. Rather, it translates a query into one or more queries to its sources, synthesizes the answer to the query from the responses of the sources, and returns an answer to the user. A mediator supports a virtual view, or collection of views, that integrates several sources.


An example of a mediation system is the Enterprise Information Integrator (EII) of IBM Corporation, which generates a virtual warehouse. EII supports integrated querying across multiple data sources, including IBM DB2 relational databases, Microsoft SQL relational databases, and XML document databases.


All three of the approaches to information integration described above use transformers, referred to as wrappers or extractors, to transform data when it is extracted from a data source. Wrappers are used to pass ad-hoc queries to data sources, receive responses from the source, and pass information to an information integrator.


A drawback with conventional information integration is the lack of uniformity in semantics, and the lack of traceability back to individual data sources. Each database accessed by a warehouse generally has its own semantics, including inter alia names for tables and their fields, names for XML complex types and their elements, and data formats. It may happen that the same name is used in different contexts within different databases, or multiple names are used for the same construct, perhaps formatted differently for different names. Further complications that can arise with non-uniform semantics include inter alia, different inter-relationships between data constructs, different business rules relating the same data constructs, redundancies and inconsistencies.


It is thus desirable to be able to introduce a common semantic foundation for all of the data sources accessed within a data warehouse, and to provide a translation layer which enables a user to access data using queries expressed in common and meaningful semantics, and buffers the user from the individual semantics for the individual data sources.


SUMMARY OF THE DESCRIPTION

The present invention concerns use of common semantics for querying data within diverse types of data sources. The present invention provides a generic translation layer that buffers a user from the particulars of individual semantics for each data source, and enables the user to express queries in terms of common and meaningful semantics.


There is thus provided in accordance with a preferred embodiment of the present invention a system for managing and querying a plurality of data sources, including an interface to a plurality of inter-related data sources of diverse types, each of the plurality of data sources having a schema that describes its data structure, a schema generator communicating with the interface, for generating a federated schema that describes the structures of the plurality of data sources and their inter-relationships, and for modifying the federated schema over time as the plurality of data sources undergo changes, and a query generator communicating with the schema generator, for generating a query for the federated schema.


There is further provided in accordance with a preferred embodiment of the present invention a method for managing and querying a plurality of data sources, including generating a federated schema that describes the structures of a plurality of inter-related data sources and their inter-relationships, and for modifying the federated schema over time as the plurality of data sources undergo changes, wherein each of the plurality of data sources has a schema that describes its data structure, and generating a query for the federated schema.


There is yet further provided in accordance with a preferred embodiment of the present invention a computer-readable storage medium storing program code for causing a computer to perform the steps of generating a federated schema that describes the structures of a plurality of inter-related data sources and their inter-relationships, and for modifying the federated schema over time as the plurality of data sources undergo changes, wherein each of the plurality of data sources has a schema that describes its data structure, and generating a query for the federated schema.




BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be more fully understood and appreciated from the following detailed description, taken in conjunction with the drawings in which:



FIG. 1 is a simplified diagram of three schemas mapped into a central ontology, and a federated schema that combines the three schemas, in accordance with a preferred embodiment of the present invention;



FIG. 2 is a simplified block diagram of a system that generates the federated schema of FIG. 1, in accordance with a preferred embodiment of the present invention;



FIG. 3 is a simplified block diagram of an enterprise information system that uses the federated schema of FIG. 1 to query across multiple databases using common semantics, in accordance with a preferred embodiment of the present invention;



FIG. 4 is a detailed block diagram of the enterprise information system of FIG. 3, in accordance with a preferred embodiment of the present invention; and



FIGS. 5A, 5B and 5C are illustrations of a user interface for semantic discovery and query, in accordance with a preferred embodiment of the present invention.




DETAILED DESCRIPTION

Enterprise data is typically distributed over multiple databases, referred to as data assets. Data assets can be of various types, including inter alia relational database tables, XML databases, entity-relationship (ER) databases and Cobol databases. Each data asset stores data according to a specific data structure format, referred to as a schema or as asset metadata. Asset metadata serves as descriptors, explaining how to interpret data stored within the asset. Without knowledge of asset metadata, data stored within an asset is generally unintelligible. Thus, for example, relational database tables store data according to a relational database schema, XML documents store data according to an XML schema, ER databases store data according to an ER logical model, and Cobol databases store data according to a Cobol Copybook. Each such schema is asset metadata, which explains how to interpret data stored within the asset.


Asset metadata is typically comprised of basic data structures, or constructs. For example, relational database schema are comprised of fields and tables, XML schema are comprised of simple types and complex types, ER logical models are comprised of entities and relationships, and Cobol Copybooks are comprised of elementary items and group items.


In a preferred embodiment of the present invention, semantics are provided to enterprise data through (i) a common ontology model, referred to also as an information model; and (ii) mappings of enterprise asset metadata into the ontology model. An ontology model is comprised of classes and properties, and is particularly useful for representing data in a semantically meaningful way. Mappings are associations of constructs of a first asset metadata with constructs of a second asset metadata in a consistent way. Mappings of asset metadata into the ontology model serve as dictionaries through which constructs of the asset metadata can be semantically understood.


Reference is now made to FIG. 1, which is a simplified diagram of three schemas, for three enterprise data assets, which are mapped into a central ontology, and a federated schema that combines the three schemas, in accordance with a preferred embodiment of the present invention. Shown in FIG. 1 are three conventional data schemas, an IBM DB2 schema 110 for a relational database, a Microsoft SQL schema 120 for a relational database, and an XML schema 130 for one or more XML documents. Each of schemas 110, 120 and 130 is individually mapped into a common ontology model 140.


Ontology model 140 preferably encapsulates substantially all of the constructs from schemas 110, 120 and 130, including inter alia database tables and their fields and their interrelationships through foreign keys, and XML complex types and their elements and the type inter-relationships, as well as business rules that relate table fields to one another and XML elements to one another. Ontology model 140 serves not only to model enterprise data assets, but also provides common semantics for referencing the data assets.


The individual mappings of schemas 110, 120 and 130 into ontology model 140 can be inverted, and used to map ontology model 140 into a single federated schema 150. Federated schema 150 effectively combines the three individual schemas 110, 120 and 130. Moreover, federated schema inherits the common semantics of ontology model 140.


Reference is now made to FIG. 2, which is a simplified block diagram of a system that generates the federated schema of FIG. 1, in accordance with a preferred embodiment of the present invention. Shown in FIG. 2 are the three schemas 110, 120 and 130 from FIG. 1, the ontology model 140 of FIG. 1 and the federated schema 150 from FIG. 1. Also shown in FIG. 2 is a schema-to-ontology generator 210, which generates ontology model 140 from the three schemas 110, 120 and 130, using mappings of the three schemas into the ontology model. Preferably, schema-to-ontology generator 210 acts as a reverse-engineer, and creates ontology classes and properties that correspond to the constructs of schemas 110, 120 and 130 and their inter-relationships. Preferably, schema-to-ontology generator 210 also generates business rules that inter-relate properties of ontology classes, corresponding to the business rules that apply to schemas 110, 120 and 130.


It may be appreciated by those skilled in the art that ontology model 140 may be alternatively generated without the use of schema-to-ontology generator 210. Thus, ontology model 140 may alternatively be an industry standard “off-the-shelf” model, or a custom ontology model generated with the use of an ontology modeling tool.


After ontology model 140 is generated, an ontology-to-schema generator 220 creates federated schema 150. Federated schema may be a relational database schema, or an XML schema or another schema. Preferably, ontology-to-schema generator 220 creates relational database tables and fields, or XML complex types and elements, which correspond respectively to the classes of ontology model 140 and their properties.


Reference is now made to FIG. 3, which is a simplified block diagram of an enterprise information system that uses the federated schema of FIG. 1 to query across multiple databases using common semantics, in accordance with a preferred embodiment of the present invention. Shown in FIG. 3 are three enterprise data assets; namely, a DB2 relational database 310, an MS SQL relational database 320, and a repository of XML documents 330. Each data asset includes a corresponding wrapper 340, which serves to identify the data asset. Preferably, wrappers include meta-data that identifies the server or servers that house the data asset, access control information including users and passwords, and the constructs that comprise the data assets.


Also shown in FIG. 3 is a federated database 350, such as IBM's DB2 Information Integrator, which supports access to the three databases 310, 320 and 330. Using federated database 350, the databases 310, 320 and 330 can be queried. However, in order to use federated database 350, a user must be intimately familiar with the semantics of the individual databases 310, 320 and 330.


By using federated schema 150 in accordance with the present invention, a generic translation layer can be generated, which buffers the user from the specific semantics of the individual databases 310, 320 and 330, and enables him to express database searches using the semantics of ontology model 140. Specifically, as shown in FIG. 3, the user can query the databases using a query engine 360, which conforms to ontology model 140. Query engine 360 generates and processes queries expressed generically in terms of ontology model 140 and, using the translation layer provided by federated schema 150, activates federated database 350 to query across the three databases 310, 320 and 330.


Reference is now made to FIG. 4, which is a detailed block diagram of the enterprise information system of FIG. 3, in accordance with a preferred embodiment of the present invention.


Shown in FIG. 4 is an information integrator 410. Information integrator 410 retrieves data from a plurality of data sources 420, using wrappers 430 to identify the locations and structures of data sources 420. Generally, each data source 420 conforms to a different naming system, making it difficult to work with more than a single data source at a time.


Also shown in FIG. 4 is a design environment 450, referred to as the Unicorn Workbench™. Design environment 450 provides a user workflow for generating an ontology model 460 and for mapping one or more data schemas 470 into ontology model 460 using semantic mappings 480. Data schemas 470 provide the necessary meta-data, or IT catalogue, for using the various data sources 420. As shown in FIG. 4, ontology model 460 preferably uses meaningful semantics, referred to as a common business language, thus making it easy and efficient to work with the many data sources 420 simultaneously.


After data schemas 470 have been mapped to ontology model 460 by Unicorn Workbench 450, the ontology model and semantic mappings are stored on a server 490, which provides a web portal into the enterprise data. A user of the portal formulates queries and generates views using the common business language, which are automatically converted to appropriate information integrator 450 names and executed within information integrator 450, and the results are converted back into the common business language—all transparent to the user. Preferably, view definitions are formulated in terms of a standard query language, such as SQL or XQuery.


The user also uses the common business language to run business applications including inter alia visualization and analysis tools. It may thus be appreciated that ontology model 460 and semantic mappings 480 obviate the need for the user to learn the intricacies of the data naming conventions for the individual data sources 420 and the inter-dependencies among their data.


Reference is now made to FIGS. 5A, 5B and 5C, which are illustrations of a user interface for semantic discovery and query, in accordance with a preferred embodiment of the present invention. As shown in FIG. 5A, a search, referred to as “Semantic Discovery,” for an ontology property named “sumsUnderManagement” of a class named “Employee,” denoted Employee.sumsUnderManagement, produces four relational database tables for checking accounts and savings accounts, each having a column named “balance.”


As shown in FIG. 5B, detailed information about the database tables is provided when a user clicks on the corresponding entry in FIG. 5A. Thus, according to a business rule named “calculateSumunderManagement,” which relates properties of class Employee, the property sumsUnderManagement is determined by summing the properties branch.savingAccounts.balance and branch.checkingAccounts.balance. Moreover, the table “CHECKINGACCT” belongs to the DB2 relational database CHECKING, and the column “balance” of table CHECKINGACCT corresponds to the ontology property “balance” of ontology class “CheckingAccount.”


Shown in FIG. 5C are query results for individual employees and the sums they manage. It may thus be appreciated that the present invention enables a user to (i) express a query that spans multiple databases residing on multiple servers, in terms of a coherent ontology property such as Employee.sumsUnderManagement; and (ii) discover the sources of the data required to determine the results of the query.


In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made to the specific exemplary embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims
  • 1. A system for managing and querying a plurality of data sources, comprising: an interface to a plurality of inter-related data sources of diverse types, each of the plurality of data sources having a schema that describes its data structure; a schema generator communicating with said interface, for generating a federated schema that describes the structures of the plurality of data sources and their inter-relationships, and for modifying the federated schema over time as the plurality of data sources undergo changes; and a query generator communicating with said schema generator, for generating a query for the federated schema.
  • 2. The system of claim 1 further comprising a query engine communicating with said interface, for executing a query to the federated schema and for returning query results.
  • 3. The system of claim 2 wherein said query engine is a software application.
  • 4. The system of claim 3 wherein said query engine is a component of a business intelligence tool.
  • 5. The system of claim 3 wherein said query engine is a component of a business process management environment.
  • 6. The system of claim 1 further comprising a semantic modeler communicating with said interface and with said schema generator, for mapping the plurality of data sources into a common ontology model using semantic mappings, and wherein said query generator generates a query formulated in terms of nomenclature of the common ontology model.
  • 7. The system of claim 6 further comprising a view generator communicating with said semantic modeler, for generating a database view of the plurality of data sources, using the federated schema, wherein the database view is formulated in terms of nomenclature of the common ontology model, and wherein said query generator generates a query for the database view.
  • 8. The system of claim 7 wherein the database view is formulated using SQL statements.
  • 9. The system of claim 7 wherein the database view is formulated using XQuery statements.
  • 10. The system of claim 7 wherein said view generator updates the database view when the common ontology model is changed.
  • 11. The system of claim 6 further comprising a user manager communicating with said semantic modeler for managing user access privileges to the plurality of data sources, wherein user access privileges are expressed in terms of nomenclature of the common ontology model.
  • 12. The system of claim 6 further comprising a data thesaurus communicating with said semantic modeler, for looking up the meaning of a data construct within the plurality of data sources, relative to the common ontology model.
  • 13. The system of claim 12 wherein said data thesaurus traces the origin of a data construct within the plurality of data sources.
  • 14. The system of claim 12 wherein said data thesaurus traces query results back to data sources.
  • 15. The system of claim 6 wherein said query generator comprises a web-based user interface for creating and editing queries while browsing the common ontology model.
  • 16. The system of claim 6 wherein the common ontology model includes classes, properties of classes, inheritances among classes, and business rules relating properties of classes, and wherein said query generator formulates a semantic query from an expression language that uses the classes, properties, inheritances and business rules.
  • 17. The system of claim 16 further comprising a view generator communicating with said semantic modeler, for generating a database view of the plurality of data sources, using the federated schema, wherein the database view is formulated in terms an expression language that uses the classes, properties, inheritances and business rules, and wherein said query generator generates a query for the database view.
  • 18. The system of claim 17 wherein said view generator automatically embeds business rules within view definitions.
  • 19. The system of claim 17 wherein said view generator updates the database view when a relevant business rule is changed.
  • 20. The system of claim 1 wherein the plurality of data sources of diverse types include at least one Microsoft SQL database and one XML document data source.
  • 21. The system of claim 1 wherein the plurality of data sources of diverse types include at least one Microsoft SQL database and one IBM DB2 database.
  • 22. The system of claim 1 wherein the plurality of data sources of diverse types include at least one XML document data source and one IBM DB2 database.
  • 23. The system of claim 1 wherein said schema generator modifies the federated schema so as to make it appear that one or more of the plurality of data sources is removed.
  • 24. The system of claim 1 wherein said schema generator modifies the federated schema when a new data source is added to the plurality of data sources.
  • 25. The system of claim 24 wherein said schema generator automatically detects and scans new data sources.
  • 26. The system of claim 1 wherein said schema generator modifies the federated schema when one of the plurality of data sources is modified.
  • 27. A method for managing and querying a plurality of data sources, comprising: generating a federated schema that describes the structures of a plurality of inter-related data sources and their inter-relationships, and for modifying the federated schema over time as the plurality of data sources undergo changes, wherein each of the plurality of data sources has a schema that describes its data structure; and generating a query for the federated schema.
  • 28. The method of claim 27 further comprising: executing a query to the federated schema; and returning query results.
  • 29. The method of claim 28 wherein said executing is performed by a software application.
  • 30. The method of claim 29 wherein said executing is performed by a business intelligence tool.
  • 31. The method of claim 29 wherein said executing is performed within a business process management environment.
  • 32. The method of claim 27 further comprising mapping the plurality of data sources into a common ontology model using semantic mappings, and wherein said generating a query formulates a query in terms of nomenclature of the common ontology model.
  • 33. The method of claim 32 further comprising generating a database view of the plurality of data sources, using the federated schema, wherein the database view is formulated in terms of nomenclature of the common ontology model, and wherein said generating a query generates a query for the database view.
  • 34. The method of claim 33 wherein the database view is formulated using SQL statements.
  • 35. The method of claim 33 wherein the database view is formulated using XQuery statements.
  • 36. The method of claim 33 wherein said generating a database view updates the database view when the common ontology model is changed.
  • 37. The method of claim 32 further comprising managing user access privileges to the plurality of data sources, wherein user access privileges are expressed in terms of nomenclature of the common ontology model.
  • 38. The method of claim 32 further comprising looking up the meaning of a data construct within the plurality of data sources, relative to the common ontology model.
  • 39. The method of claim 38 wherein said looking up traces the origin of a data construct within the plurality of data sources.
  • 40. The method of claim 38 wherein said looking up traces query results back to data sources.
  • 41. The method of claim 32 wherein said generating a query employs a web-based user interface for creating and editing queries while browsing the common ontology model.
  • 42. The method of claim 32 wherein the common ontology model includes classes, properties of classes, inheritances among classes, and business rules relating properties of classes, and wherein said generating a query formulates a query from an expression language that uses the classes, properties, inheritances and business rules.
  • 43. The method of claim 42 further comprising generating a database view of the plurality of data sources, using the federated schema, wherein the database view is formulated in terms an expression language that uses the classes, properties, inheritances and business rules, and wherein said generating a query generates a query for the database view.
  • 44. The method of claim 43 wherein said generating a database view automatically embeds business rules within view definitions.
  • 45. The method of claim 43 wherein said generating a database view updates the database view when a relevant business rule is changed.
  • 46. The method of claim 27 wherein the plurality of data sources of diverse types include at least one Microsoft SQL database and one XML document data source.
  • 47. The method of claim 27 wherein the plurality of data sources of diverse types include at least one Microsoft SQL database and one IBM DB2 database.
  • 48. The method of claim 27 wherein the plurality of data sources of diverse types include at least one XML document data source and one IBM DB2 database.
  • 49. The method of claim 27 wherein said generating a federated schema comprises modifying the federated schema so as to make it appear that one or more of the plurality of data sources is removed.
  • 50. The method of claim 27 wherein said generating a federated schema comprises modifying the federated schema when a new data source is added to the plurality of data sources.
  • 51. The method of claim 50 further comprising automatically detecting and scanning new data sources.
  • 52. The method of claim 27 wherein said generating a federated schema comprises modifying the federated schema when one of the plurality of data sources is modified.
  • 53. A computer-readable storage medium storing program code for causing a computer to perform the steps of: generating a federated schema that describes the structures of a plurality of inter-related data sources and their inter-relationships, and for F modifying the federated schema over time as the plurality of data sources undergo changes, wherein each of the plurality of data sources has a schema that describes its data structure; and generating a query for the federated schema.
CROSS REFERENCES TO RELATED APPLICATIONS

This application is a continuation-in-part of assignee's pending application U.S. Ser. No. 10/637,339, filed on Aug. 8, 2003, entitled “Method and System for Mapping Enterprise Data Assets to a Semantic Information Model,” which is a continuation-in-part of assignee's pending application U.S. Ser. No. 10/340,068, filed on Jan. 9, 2003, entitled “Brokering Semantics between Web Services,” which is a continuation-in-part of assignee's pending application U.S. Ser. No. 10/302,370, filed on Nov. 22, 2002, entitled “Enterprise Information Unification,” which is a continuation-in-part of assignee's pending application U.S. Ser. No. 10/159,516, filed on May 31, 2002, entitled “Data Query and Location through a Central Ontology Model,” which is a continuation-in-part of assignee's pending application U.S. Ser. No. 10/104,785, filed on Mar. 22, 2002, entitled “Run-Time Architecture for Enterprise Integration with Transformation Generation,” which is a continuation-in-part of assignee's pending application U.S. Ser. No. 10/053,045, filed on Jan. 15, 2002, entitled “Method and System for Deriving a Transformation by Referring Schema to an Central Model,” which is a continuation-in-part of assignee's pending application U.S. Ser. No. 09/866,101, filed on May 25, 2001, entitled “Method and System for Collaborative Ontology Modeling.”

Continuation in Parts (7)
Number Date Country
Parent 10637339 Aug 2003 US
Child 11026358 Dec 2004 US
Parent 10340068 Jan 2003 US
Child 10637339 Aug 2003 US
Parent 10302370 Nov 2002 US
Child 10340068 Jan 2003 US
Parent 10159516 May 2002 US
Child 10302370 Nov 2002 US
Parent 10104785 Mar 2002 US
Child 10159516 May 2002 US
Parent 10053045 Jan 2002 US
Child 10104785 Mar 2002 US
Parent 09866101 May 2001 US
Child 10053045 Jan 2002 US