The embodiments of the invention generally relate to schema matching, and, more particularly, to a method of matching schemas that maps schema elements of a target system and a source system using multiple levels of ontologies.
Schema matching is a basic problem in many database application domains and has practical applications like legacy system migration, information integration, e-commerce, data warehousing, and semantic query processing. One fundamental operation in schema matching is to take two schemas as input and produce a mapping between elements of the two schemas that correspond semantically to each other.
Independent software vendors such as International Business Machines (IBM), Armonk, N.Y., USA have come up with tools like Rational Data Architect (RDA) that provide automated support for schema matching. However, these tools offer algorithms that are very generic in nature. Therefore, in many current implementations (e.g., data migration in billing consolidation for telecommunications companies) schema matching is typically performed manually, perhaps supported by a graphical user interface. Manually specifying schema matches requires complete knowledge of the data and is a tedious, time-consuming, and error-prone process that is, therefore, expensive. With more and more legacy systems to migrate, an increasing number of web data sources, and e-businesses to integrate, schema matching is a growing problem.
Many researchers have studied the problem of schema matching and suggested techniques for matching schemas automatically. These can be broadly classified as schema information based matching and data instance based matching. Schema information based matchers consider only schema information, not instance data. The schema information includes the usual properties of the schema elements, such as name, description, data-type, relationship types (part-of, is-a, etc.), constraints, and schema structure. Data instance based matchers, on the other hand, use data instances to gain insight into the contents and meaning of schema elements. This is especially useful when schema information is limited, as is often the case for semi-structured data.
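The two families of matchers can be contrasted with a minimal sketch. It assumes a string-similarity score on element names for the schema information based case and a Jaccard overlap of sampled values for the data instance based case; the function names `name_similarity` and `instance_overlap` are hypothetical and are not taken from any tool named in this description.

```python
from difflib import SequenceMatcher


def name_similarity(source_name, target_name):
    """Schema information based score: similarity of element names in [0, 1]."""
    return SequenceMatcher(None, source_name.lower(), target_name.lower()).ratio()


def instance_overlap(source_values, target_values):
    """Data instance based score: Jaccard overlap of observed values in [0, 1]."""
    src, tgt = set(source_values), set(target_values)
    if not src and not tgt:
        return 0.0
    return len(src & tgt) / len(src | tgt)
```

A practical matcher would typically combine such scores with other evidence; the sketch only illustrates that the first function never looks at data and the second never looks at names.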
Conventional schema matching algorithms that are based on general (not domain specific) schema matching techniques are too generic and do not take advantage of domain specific information. As a result, they generate many incorrect mappings.
The term domain can refer to an industry, an application, a geography, etc. For example, industry verticals like Banking and Insurance can be considered domains. Similarly, applications for Billing, Customer Relationship Management, and Accounting can also be referred to as domains. Further, geographies corresponding to specific regions, countries, or continents can be classified as domains. Domain knowledge can be captured in various forms, such as an ontology, a thesaurus, and a set of rules. An ontology is used to store domain specific concepts, like Customer and Bill, and the relationships among them. A thesaurus, on the other hand, is used to store synonyms and abbreviations used in a particular domain. For example, customer can be treated as the same as party in the Telecom domain. A rule is another way of capturing domain knowledge and can be specified for an industry, for an application, or for a geography. Industry specific rules are applicable to the whole industry, e.g., Telecom, and are agnostic to the application. For example, a mobile SIM card number is a 20 digit integer in Telecom. Similarly, application specific rules correspond to a particular IT application, like Billing or CRM. For instance, the bill generation period can only be fortnightly, monthly, or quarterly for a billing application. Geography specific rules are for a particular geography. For example, the postal code in India is a 6 digit integer.
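The thesaurus and the three kinds of rules described above can be sketched as simple lookup structures. This is only an illustration under stated assumptions: the names `THESAURUS`, `RULES`, `canonical`, and `satisfies` are hypothetical, and the rule contents are taken from the examples in the preceding paragraph.

```python
import re

# Hypothetical thesaurus: domain synonyms and abbreviations mapped to one concept.
THESAURUS = {"party": "customer", "cust": "customer"}

# Hypothetical rules keyed by (scope, domain, element); each is a predicate on a value.
RULES = {
    ("industry", "telecom", "sim_number"):
        lambda v: re.fullmatch(r"\d{20}", v) is not None,
    ("application", "billing", "bill_period"):
        lambda v: v in {"fortnightly", "monthly", "quarterly"},
    ("geography", "india", "postal_code"):
        lambda v: re.fullmatch(r"\d{6}", v) is not None,
}


def canonical(term):
    """Resolve a term to its canonical concept name using the thesaurus."""
    key = term.lower()
    return THESAURUS.get(key, key)


def satisfies(rule_key, value):
    """Check a string value against one industry, application, or geography rule."""
    return RULES[rule_key](value)
```

For instance, `canonical("Party")` resolves to `"customer"`, and `satisfies(("geography", "india", "postal_code"), "560001")` holds because the value is a 6 digit integer.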
There have been attempts to improve schema matching by using domain knowledge. This has included use of a corpus of known schemas and mappings as well as utilization of domain integrity constraints. A formal ontology of domain has also been used for semantic mapping connecting the schema describing the data to the ontology. However, ontology has been used only for the concepts in the domain. No attempts have been made to use the process ontology or the data-type ontology, either stand-alone or in a structured combination. In essence, there is no logical organization and use of understanding of the domain in terms of functionalities available (for example, a telecom billing domain has functionalities like PayBill, AddCustomer, RedeemPoints, etc.), classification of entities into concepts, etc.
This disclosure presents a method that uses multiple levels of ontology in a logical structured manner to improve schema matching. This method builds on existing schema matching algorithms and techniques of semantic mapping using domain knowledge.
In one specific embodiment herein, the method of matching schemas maps functions of a target system to a process ontology and maps functions of a source system to the process ontology to produce a first mapping of target functions and source functions to the process ontology. The mapping of the functions partitions the target system and the source system into corresponding subsets of functions. The method identifies parameters upon which the target functions operate and identifies parameters upon which the source functions operate. Then, the method maps the target function parameters to a concept ontology and maps the source function parameters to the concept ontology to produce a second mapping of the target function parameters (parameters are also referred to as schema elements) and the source function parameters to the concept ontology. The concept ontology is domain specific in that it represents industry, application, or geography knowledge. This schema element mapping is then enhanced by mapping the target function parameters to a data-type ontology and mapping the source function parameters to the data-type ontology. This produces an enhanced second mapping of the target function parameters and the source function parameters to the concept ontology. This enhanced second mapping can be the resultant schema matching output.
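The three-level flow described above can be illustrated with a minimal sketch. It assumes that each ontology is represented as a plain dictionary from a function or parameter name to an ontology node; the function names, the sample data, and the rule that a pair must agree on both concept and data-type are illustrative assumptions, not the claimed implementation.

```python
from collections import defaultdict


def partition_by_process(functions, process_onto):
    """Group function names by the process ontology node they map to."""
    groups = defaultdict(list)
    for fn in functions:
        node = process_onto.get(fn)
        if node is not None:
            groups[node].append(fn)
    return groups


def match_schemas(source, target, process_onto, concept_onto, dtype_onto):
    """Pair source and target parameters within corresponding function subsets."""
    src_groups = partition_by_process(source, process_onto)
    tgt_groups = partition_by_process(target, process_onto)
    pairs = set()
    for node in src_groups.keys() & tgt_groups.keys():
        src_params = [p for f in src_groups[node] for p in source[f]]
        tgt_params = [p for f in tgt_groups[node] for p in target[f]]
        for s in src_params:
            for t in tgt_params:
                concept = concept_onto.get(s)
                dtype = dtype_onto.get(s)
                if (concept is not None and concept == concept_onto.get(t)
                        and dtype is not None and dtype == dtype_onto.get(t)):
                    pairs.add((s, t))
    return sorted(pairs)


# Hypothetical systems: function name -> parameters it operates on.
source = {"pay_invoice": ["cust_no", "amt"], "new_cust": ["cust_no", "pin"]}
target = {"PayBill": ["party_id", "amount"],
          "AddCustomer": ["party_id", "postal_code"]}
process_onto = {"pay_invoice": "PayBill", "PayBill": "PayBill",
                "new_cust": "AddCustomer", "AddCustomer": "AddCustomer"}
concept_onto = {"cust_no": "Customer.Id", "party_id": "Customer.Id",
                "amt": "Bill.Amount", "amount": "Bill.Amount",
                "pin": "Address.PostalCode", "postal_code": "Address.PostalCode"}
dtype_onto = {"cust_no": "Identifier", "party_id": "Identifier",
              "amt": "Money", "amount": "Money",
              "pin": "PostalCode", "postal_code": "PostalCode"}

mapping = match_schemas(source, target, process_onto, concept_onto, dtype_onto)
```

With this sample data, `mapping` contains the pairs `("amt", "amount")`, `("cust_no", "party_id")`, and `("pin", "postal_code")`. Parameters are compared only inside function subsets that share a process ontology node, which is the partitioning effect the paragraph describes.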
These and other aspects of the embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating embodiments of the invention and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments of the invention without departing from the spirit thereof, and the embodiments of the invention include all such modifications.
The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
The embodiments of the invention and the various features and advantageous details thereof are explained completely with reference to the accompanying drawings. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.
The embodiments herein address the deficiencies of existing schema matching (domain specific and/or domain independent) techniques by following a logical approach to classification of domain knowledge. More specifically, the embodiments herein provide a top-down method to perform schema mapping using three levels of ontology. The methods herein provide a technique to determine corresponding subsets of the tables relevant for data mapping based on mapped functions between the source and target system, using a process ontology. These tables and their attributes are mapped based on a concept ontology which can have mapping rules associated with each concept. These rules can be industry, application or geography specific. Finally, the mappings thus generated are refined based on a data-type ontology. The data-type ontology captures the various data-types occurring in a domain and can also help in mapping the concepts in the domain to the expected data-types.
The techniques described herein can be used in conjunction with other known techniques, and these methods do not require a single person to understand both the source and target systems. The embodiments herein leverage the domain knowledge distributed between two sets of people: one for the source system and another for the target system.
As shown in flowchart form in
Industry, application, and geography rules 108 are obtained from the domain knowledge 102 and stored with the concept ontology 110. More specifically, in item 110, the method maps the target function parameters to a concept ontology and maps the source function parameters to the concept ontology to produce a mapping of the target function parameters and the source function parameters to the concept ontology. Additionally, tools like RDA or domain specific matchers can be used to generate another set of mappings. Optionally, the mapping of parameters can be enhanced by further creating subsets of parameters on the source system and subsets of parameters on the target system and mapping only between the related subsets on the source and target sides.
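The optional step of mapping only between related parameter subsets can be sketched as follows. The sketch assumes that concept names are dotted paths such as `Address.PostalCode` and that two parameters belong to related subsets when their concepts share the same top-level concept; both the naming scheme and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def group_by_parent(params, concept_of):
    """Create parameter subsets keyed by the top-level concept of each parameter."""
    groups = defaultdict(list)
    for p in params:
        concept = concept_of.get(p)
        if concept:
            groups[concept.split(".")[0]].append(p)
    return groups


def candidate_pairs(src_params, tgt_params, concept_of):
    """Pair parameters only between related subsets on the two sides."""
    src = group_by_parent(src_params, concept_of)
    tgt = group_by_parent(tgt_params, concept_of)
    pairs = []
    for parent in sorted(src.keys() & tgt.keys()):
        pairs.extend(product(src[parent], tgt[parent]))
    return pairs
```

With three source parameters and three target parameters, two of which on each side fall under an Address concept and one under a Customer concept, this yields five candidate pairs instead of the nine produced by comparing every source parameter with every target parameter.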
The data-type ontology is used to generate another set of function parameter mappings. The mappings thus generated are then used either to filter the previously generated function parameter mappings, by selecting only the mappings that appear in both sets, or to augment the previously generated mappings with the additional mappings found. In item 112, the function parameter mapping is enhanced by mapping the target function parameters to a data-type ontology and mapping the source function parameters to the data-type ontology. This produces an enhanced second mapping of the target function parameters and the source function parameters to the concept ontology. This enhanced second mapping can be the resultant schema matching output.
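The two uses of the data-type based mappings, filtering and augmenting, correspond to set intersection and set union when each mapping is treated as a set of (source parameter, target parameter) pairs. The following is a minimal sketch under that assumption; the function name `refine` and its `mode` argument are hypothetical.

```python
def refine(previous, dtype_based, mode="filter"):
    """Combine earlier mappings with mappings from the data-type ontology.

    mode="filter" keeps only pairs found in both sets;
    mode="augment" adds the pairs found only by the data-type ontology.
    """
    prev, dt = set(previous), set(dtype_based)
    if mode == "filter":
        return prev & dt
    if mode == "augment":
        return prev | dt
    raise ValueError(f"unknown mode: {mode!r}")
```

Filtering favors precision, since a pair must be supported by both sources of evidence, whereas augmenting favors recall.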
The process ontology aspect of embodiments herein is shown in greater detail in
The mapping process is shown as item 206 in
The embodiments herein also identify parameters, and other data elements, for identified functions on source and target systems, as shown in
The concept ontology aspect of embodiments herein is shown in greater detail in
In
The pre-processor 602 has access to the source schema and target schema (XML/RDB) and partitions the source and target schemas into smaller matching subsets/segments. Domain specific mappers generate parameter mappings in the processor 604. These mappers are built for different concepts; for example, mappers can be implemented for domain concepts including Address, Contact, Category, Id, and Date. More such mappers can be seamlessly plugged into the embodiments. Ontology based mappers 605 use the concept ontology 612 to map the function parameters in the segments obtained from the pre-processor 602. Similarly, existing algorithms provided by tools such as RDA can also be used to generate mappings. The post-processor 606 uses domain rules 614, including industry, application, and geography rules, to provide additional mappings. The mapping results produced so far by the various matching algorithms and techniques are filtered, ranked, and merged into the final schema map. In this particular embodiment, the filtering is performed by the ontology based filter 607, which uses the data-type ontology 608. The RDA 616 is utilized by the user to select, reject, or edit these mappings. The data stage connector 618 takes the final schema map and generates data stage jobs (migration job skeleton) that can be run by the data stage 620.
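The pluggable mappers and the merging and ranking of their results can be sketched as follows. The sketch assumes that each mapper is a callable returning a set of candidate pairs and that results are ranked by the number of mappers proposing a pair; this vote-counting scheme and the names `name_mapper`, `make_ontology_mapper`, and `merge_and_rank` are illustrative assumptions, since the description does not specify how ranking is computed.

```python
from collections import Counter


def name_mapper(src, tgt):
    """Hypothetical generic mapper: pairs parameters whose names match ignoring case."""
    return {(s, t) for s in src for t in tgt if s.lower() == t.lower()}


def make_ontology_mapper(concept_of):
    """Build a mapper that pairs parameters mapped to the same concept."""
    def mapper(src, tgt):
        return {(s, t) for s in src for t in tgt
                if concept_of.get(s) is not None
                and concept_of.get(s) == concept_of.get(t)}
    return mapper


def merge_and_rank(src, tgt, mappers):
    """Merge results of all plugged-in mappers and rank pairs by vote count."""
    votes = Counter()
    for mapper in mappers:
        votes.update(mapper(src, tgt))
    # Most-voted pairs first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because every mapper shares one call signature, a further concept specific mapper (for Address or Date, for example) can be added to the `mappers` list without changing the merge step.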
The embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In one embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the embodiments of the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments of the invention have been described in terms of embodiments, those skilled in the art will recognize that the embodiments of the invention can be practiced with modification within the spirit and scope of the appended claims.