1. Field of the Invention
The present invention relates to a distributed database system in which pieces of data are distributed in a plurality of databases.
2. Description of the Related Art
In recent years, distributed database systems in which pieces of data are distributed in a plurality of databases have been employed to distribute the load and reduce the risk of data loss. Specifically, if the pieces of data are distributed in various databases, the load caused by concentration of queries can be distributed. Moreover, if any failure occurs, only some of the databases will fail, so that data in other databases is safe.
Even though the data is distributed, the distributed database system offers a function by which, when the data needs to be referenced, the databases can be used as if they were a single database. As a method to realize such a function, for example, Japanese Patent Application Laid-open No. 2005-208757 discloses a technique by which the data distributed in a plurality of Relational Databases (RDBs) is integrated into an integrated data view in a tagged document format, and a query based on an integrated reference to the RDBs is made possible through execution of a query made to the integrated data view.
However, there is a wide variety of available databases, and there are some databases that are different from RDBs, which have conventionally been used. For example, there is an Extensible Markup Language Database (XML-DB) in which data is stored in an Extensible Markup Language (XML) format. Accordingly, a distributed database system may be configured so as to include a database, like an XML-DB, that is different from RDBs.
In such an XML-DB, because the schema is indefinite or semi-fixed, the schema of the integrated data view defined based on the schema is also indefinite. On the other hand, the schemas in RDBs are strictly definite. For this reason, even if the conventional technique disclosed in, for example, Japanese Patent Application Laid-open No. 2005-208757 is used, a problem remains in that it is impossible to perform query processing using the integrated data view on a group of databases including both an XML-DB and an RDB, because the schema of the integrated data view may be indefinite.
As explained above, because there are a wide variety of databases and because the types of databases in which data is distributed are different from one another, the problem arises where it is impossible to perform a query processing using an integrated data view.
Further, the schema of the data stored in an XML-DB does not necessarily coincide with the schema of the integrated data view that the user wishes to use. There is a possibility that, if XML document data obtained from an XML-DB is applied to an integrated data view as it is, it is not possible to provide a user with an integrated data view that the user wishes to use.
It is an object of the present invention to at least partially solve the problems in the conventional technology.
According to an aspect of the present invention, a computer-readable recording medium that stores therein a computer program that causes a computer to reference pieces of data that are distributed in a plurality of different types of databases including a database that returns a query result as data that is uniquely identified in a hierarchical structure, by outputting, in an integrated view, a query result obtained as a result of queries that are made, in query formats, to the databases causes the computer to execute storing a view generation rule for generating the integrated view that is defined by a correspondence relationship between elements in the data that is uniquely identified in the hierarchical structure and elements in the databases and a correspondence relationship among the elements in the databases; and structuring, based on the view generation rule, the query result obtained as the result of the queries that are made, in the query formats, to the databases, in response to a query that is made, in a query format, to the integrated view.
According to another aspect of the present invention, a computer-readable recording medium that stores therein a computer program that causes a computer to reference pieces of data that are distributed in a plurality of different types of databases including a tagged document database that returns a query result as a tagged document of which a structure is predetermined, by outputting, in an integrated view, a query result obtained as a result of queries that are made, in query formats, to the databases causes the computer to execute storing a view generation rule for generating the integrated view that is defined by a correspondence relationship between elements in the tagged document and elements in the databases and a correspondence relationship among the elements in the databases; and structuring, based on the view generation rule, the query result obtained as the result of the queries that are made, in the query formats, to the databases, in response to a query that is made, in a query format, to the integrated view.
According to still another aspect of the present invention, a database integration reference method of referencing pieces of data that are distributed in a plurality of different types of databases including a database that returns a query result as data that is uniquely identified in a hierarchical structure, by outputting, in an integrated view, a query result obtained as a result of queries that are made, in query formats, to the databases, includes storing a view generation rule for generating the integrated view that is defined by a correspondence relationship between elements in the data that is uniquely identified in the hierarchical structure and elements in the databases and a correspondence relationship among the elements in the databases; and structuring, based on the view generation rule, the query result obtained as the result of the queries that are made, in the query formats, to the databases, in response to a query that is made, in a query format, to the integrated view.
According to still another aspect of the present invention, a database integration reference apparatus that makes it possible to reference pieces of data that are distributed in a plurality of different types of databases including a tagged document database that returns a query result as a tagged document of which a structure is predetermined, by outputting, in an integrated view, a query result obtained as a result of queries that are made, in query formats, to the databases, includes a storage unit that stores therein a view generation rule for generating the integrated view that is defined by a correspondence relationship between elements in the tagged document and elements in the databases and a correspondence relationship among the elements in the databases; and a processing unit that structures, based on the view generation rule present in the storage unit, the query result obtained as the result of the queries that are made, in the query formats, to the databases, in response to a query that is made, in a query format, to the integrated view.
The above and other objects, features, advantages and technical and industrial significance of this invention will be better understood by reading the following detailed description of presently preferred embodiments of the invention, when considered in connection with the accompanying drawings.
Exemplary embodiments of the present invention will be explained in detail below with reference to the accompanying drawings. In the exemplary embodiments described below, the present invention is applied to a database integration reference program, a database integration reference method, and a database integration reference apparatus that integrate an Extensible Markup Language Database (XML-DB) with a Relational Database (RDB) in such a manner that it is possible to reference these databases, where a tagged document is used as an XML document. In the following description, a database and databases may be referred to as a DB and DBs.
Firstly, the overview and the characteristics of a database integration reference system according to a first embodiment of the invention will be explained with reference to
As shown in
In this system, as shown in
To be more specific, the database reference apparatus structures an integration query engine for providing data from the integrated databases in an XML model and handles the data distributed in the databases as an XML file. Thus, the database reference apparatus realizes a data view integration on the apparatus side.
With the database integration reference apparatus according to the first embodiment having the configuration described above, it is possible to achieve, for example, real-time data access, a remarkable reduction in man-hours for the development of upper-level applications, a database integration having a high level of flexibility and extensibility, and a step-by-step metadata structuring, which are described below.
According to the first embodiment, the distributed data is not physically gathered in one place as in a data warehouse (DWH); instead, the data remains distributed in the existing databases. When a query is made, only the necessary data is obtained, and as a result, an integrated data view is generated. With this arrangement, it is possible to achieve real-time data access.
In addition, according to the first embodiment, the distributed data is integrated into a file in an XML format. A query is made to the XML file, using XQuery, and it is possible to take out the query result also in an XML format. In other words, it is possible to provide a data view that is integrated in an XML file, to the upper-level application side. Thus, there is no need to put a function for data view integration into the upper-level application side. Accordingly, it is possible to remarkably reduce the man-hours for development of the upper-level applications.
Also, according to the first embodiment, the data in the databases including the XML-DB and the RDBs is eventually integrated into the data view in the XML file after a model conversion. Because such an XML file format has a high level of flexibility and extensibility, it is possible to use the integrated XML file in a flexible manner. To be more specific, because the data view according to the first embodiment is integrated using an XML, it is possible to, for example, easily structure not only a search system but also various application systems that are compatible with the XML, on the system according to the first embodiment. Thus, it is possible to integrate the databases with a high level of flexibility and extensibility.
Further, according to the first embodiment, the metadata for integration is used to define, with flexibility, what data view is structured from the pieces of distributed data. During this operation, it is possible to make the definition only with the information that is necessary for the queries. With this arrangement, there is no need to define all the pieces of information at the beginning. Thus, it is possible to structure the metadata for integration in a step-by-step manner.
Next, the overall configuration of the database integration reference system according to the first embodiment will be explained.
The databases in this system are such databases that are integrated according to the first embodiment. According to the first embodiment, the received-order DB 11 is an XML-DB, whereas the item DB 12 and the stock DB 13 are RDBs. In the description of the first embodiment, as shown in
In this example, the received-order DB 11 is a database that stores therein the information related to the orders received by a corporation. As shown in
The item DB 12 is a database that stores therein the information related to items that are handled by the corporation. As shown in
The stock DB 13 is a database that stores therein the information related to the stock of the handled items. As shown in
In the order form described above, the types of items are expressed only with the item codes; however, when people look at order forms, it is easier to understand when the names of the items are displayed. Thus, when the user wishes to convert the item codes in the order forms into the names of the items, using the handled item table 12a stored in the item DB 12, it is advantageous to use the database integration reference system according to the first embodiment.
Also, when the user processes an order while looking at the order form, if the user wishes to check the stock by having the stock quantity displayed at the same time, it is advantageous to use the database integration reference system according to the first embodiment. (In this situation, the order form is obtained from the received-order DB 11. Because the stock quantity of each item is stored in the stock DB 13, it is necessary to make queries to both databases.)
As explained so far, when the user wishes to reference data that is related to one order and is distributed in the three databases, as one piece of collective data, it is advantageous to use the database integration reference system according to the first embodiment.
Returning to the description of
As shown in
As shown in
Returning to the description of
The database integration reference apparatus 20 is configured so as to include, as shown in
In the metadata for integration 21a, the information that is necessary for the integration of the databases is defined. To be more specific, as shown in
To describe it more in detail, the virtual XML schema information defines, as shown in
The virtual XML schema information is explained more specifically, with reference to
A1: Complex Element
A Complex Element is an intermediate node that has one or more other nodes as its subordinates. When the corresponding database is an RDB, a set that is made up of a Complex Element and one or more Simple Elements being its subordinates corresponds to one record in a database. When the corresponding database is an XML-DB, a Complex Element is an intermediate node that has one or more other nodes as its subordinates, and the Complex Element itself has no value. A Complex Element has attributes as listed below. Any of the three types of nodes, namely, a Complex Element, a Simple Element, and a Tag Element may appear as a subordinate of a Complex Element.
A Simple Element is a terminal node that has a value as its subordinate. When the corresponding database is an RDB, a Simple Element corresponds to one column in a record and holds only its value. When the corresponding database is an XML-DB, a Simple Element corresponds to a terminal node having a value. A Simple Element has attributes as listed below. Because a Simple Element is a terminal node, no other node can be a subordinate of a Simple Element.
Name: the tag name of the node in the integrated data view
Visible or Invisible: whether it should be displayed in the integrated data view
A Tag Element is a dummy node used for inserting a tag and does not have a corresponding database element. A Tag Element has an attribute such as “Name: the tag name of the node in the integrated data view”. Any of the three types of nodes, namely, a Complex Element, a Simple Element, and a Tag Element may appear as a subordinate of a Tag Element.
A unique ID is given to each Complex Element and each Simple Element so that the correspondence relationship between the node and the corresponding database element can be understood. The unique IDs are called a Complex Element-ID and a Simple Element-ID, respectively. When the corresponding database is an RDB, a set made up of a Complex Element and one or more Simple Elements corresponds to one record in the RDB. A tree structure is constructed by connecting such sets to one another. When the sets are connected, it is necessary to have an entry that makes an association (i.e. matching of the values) between the sets.
Regardless of this arrangement, it is possible to insert a Tag Element at a place where a dummy tag needs to be added. When the corresponding database is an XML-DB, it is necessary to structure a virtual XML schema in compliance with the schema of the XML data stored in the XML-DB. When a tag that does not exist in the schema of the original XML data needs to be added, a Tag Element is used. When a tag that exists in the schema of the original XML data needs to be deleted, the attribute of the tag for “Visible or Invisible” is set to “False”.
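The three node types described above can be pictured as a small tree data structure. The following is a minimal sketch only, not the format actually used in the metadata for integration: the class names, the field names, the use of `None` for an unbounded number of appearances, and the sample IDs (`C1`, `S1`, and so on) are all assumptions made for illustration. The tag names are taken from the order form example in this description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class SimpleElement:
    """Terminal node holding a value; it cannot have subordinate nodes."""
    element_id: str       # Simple Element-ID
    name: str             # tag name in the integrated data view
    visible: bool = True  # the "Visible or Invisible" attribute


@dataclass
class TagElement:
    """Dummy node that only inserts a tag; no database element behind it."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class ComplexElement:
    """Intermediate node with subordinates and no value of its own."""
    element_id: str                        # Complex Element-ID
    name: str
    max_appearances: Optional[int] = None  # None = unbounded (assumption)
    children: List["Node"] = field(default_factory=list)


Node = Union[ComplexElement, SimpleElement, TagElement]


def visible_tags(node: Node) -> List[str]:
    """List tag names depth-first, skipping Simple Elements set to invisible."""
    if isinstance(node, SimpleElement):
        return [node.name] if node.visible else []
    tags = [node.name]
    for child in node.children:
        tags.extend(visible_tags(child))
    return tags


# Hypothetical virtual schema for the order form; item_code is hidden.
order_schema = ComplexElement("C1", "order", children=[
    SimpleElement("S1", "id"),
    SimpleElement("S2", "purchaser"),
    ComplexElement("C2", "item", children=[
        SimpleElement("S3", "item_code", visible=False),
        SimpleElement("S4", "number"),
    ]),
    SimpleElement("S5", "date"),
])
```

Setting `visible=False` on a Simple Element corresponds to deleting a tag from the view while keeping it available for associations.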
As the database information, as shown in
When the corresponding database is an RDB, it is described to which table in which RDB, each of the Complex Elements corresponds. It is also described to which column in the table, each of the Simple Elements being subordinate to the Complex Element corresponds.
When the corresponding database is an XML-DB, it is described which sub-tree, including which Complex Elements, corresponds to which XML-DB data. Further, when the tag name in the data view is different from the tag name in the XML-DB, the correspondence between these tag names is also described. (If there is no description about tag name correspondence for some Complex Elements and Simple Elements, it is assumed that the tag name in the data view is the same as the tag name in the XML-DB.) When the processing target is only a repetitive structure that is a part of a large piece of XML data stored in an XML-DB, the path from the root to the repetitive structure is written here.
As the information for associating elements, as shown in
The information for associating elements describes information for connecting the “sets made up of Complex Elements and Simple Elements” that correspond to RDBs to one another and connecting a “set made up of a Complex Element and Simple Elements” to an XML sub-tree that corresponds to an XML-DB. To be more specific, it describes which pair of Simple Elements is used to perform the matching of the values. In the first embodiment, the association is made through only one type, which is “a complete match of the values”.
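An association by "a complete match of the values" behaves like an equality join between two sets of records. The sketch below illustrates the idea under stated assumptions: the function name `associate`, the dictionary-based records, the key names `code` and `name`, and the item names are hypothetical, and only the item codes come from the order form example in this description.

```python
from collections import defaultdict


def associate(upper_rows, lower_rows, upper_key, lower_key):
    """Pair each upper record with the lower records whose key value matches exactly."""
    by_value = defaultdict(list)
    for row in lower_rows:
        by_value[row[lower_key]].append(row)
    # .get() avoids creating empty entries for values with no match.
    return [(row, by_value.get(row[upper_key], [])) for row in upper_rows]


# Items listed on an order form (upper level).
order_items = [
    {"item_code": "0345", "number": 2},
    {"item_code": "0872", "number": 5},
]
# Hypothetical rows from a handled-item table (lower level).
handled_items = [
    {"code": "0345", "name": "widget"},
    {"code": "0872", "name": "gadget"},
    {"code": "0999", "name": "spare"},
]

pairs = associate(order_items, handled_items, "item_code", "code")
```

Because the match is exact, the value `"0345"` would not be associated with `"345"`; any normalization of values is outside this sketch.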
As for the “sets made up of Complex Elements and Simple Elements” that correspond to RDBs, any one of the Simple Elements in the sets can be used for making associations. On the other hand, as for the XML sub-tree that corresponds to an XML-DB, the Simple Elements that can be used for making associations are restricted so that one-to-one correspondence relationship can be ensured. When another database is connected to the lower level, for a Complex Element that is used as a connection point in the virtual XML schema information (i.e. a node that corresponds to the connected database appears as a subordinate of the Complex Element), only the Simple Elements that are the child nodes of the Complex Element can be used for making the associations. When another database is connected to the upper level, only the Simple Elements that are the child nodes of the Complex Element on the uppermost level of the XML sub-tree can be used for making the associations.
When the Simple Elements that can be used for making the associations are restricted, it is inconvenient because the virtual XML views that can be generated are also restricted. Thus, the restriction is mitigated using the maximum number of appearances set for the Complex Element. For example, when the maximum number of appearances for the Complex Element being the connection point is 1, it is possible to enlarge the range of associations to the Simple Elements that are the child nodes of a Complex Element that is positioned adjacent on the upper level in the XML sub-tree. Recursively, as long as the maximum number of appearances for a Complex Element is 1, it is possible to enlarge the range of associations to the Simple Elements that are the child nodes of a Complex Element that is positioned in the next upper level. Conversely, for a Complex Element being the connection point, if the maximum number of appearances for the Complex Element being its subordinate is 1, it is possible to enlarge the range of associations to the Simple Elements that are the child nodes of the Complex Element. It is also possible to enlarge the range of associations recursively for the Complex Elements in the further lower levels.
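The enlargement rule can be read as a walk upward and downward from the connection point that continues only while the maximum number of appearances is 1. The following sketch is one possible reading of that rule for the case in which another database is connected to the lower level; the class `Complex`, the helper `link`, and the element names `detail` and `color` are hypothetical, and `None` is assumed to mean an unbounded number of appearances.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Complex:
    name: str
    max_appearances: Optional[int]  # None = unbounded (assumption)
    simple_children: List[str] = field(default_factory=list)
    complex_children: List["Complex"] = field(default_factory=list)
    parent: Optional["Complex"] = field(default=None, repr=False, compare=False)


def link(parent: Complex, child: Complex) -> None:
    """Attach child below parent in the XML sub-tree."""
    parent.complex_children.append(child)
    child.parent = parent


def usable_simple_elements(connection_point: Complex) -> List[str]:
    """Simple Elements usable for associations at the given connection point."""
    usable = list(connection_point.simple_children)
    # Upward: climb while the current Complex Element appears at most once.
    node = connection_point
    while node.max_appearances == 1 and node.parent is not None:
        node = node.parent
        usable.extend(node.simple_children)
    # Downward: descend into subordinates that appear at most once.
    stack = list(connection_point.complex_children)
    while stack:
        child = stack.pop()
        if child.max_appearances == 1:
            usable.extend(child.simple_children)
            stack.extend(child.complex_children)
    return usable


order = Complex("order", None, ["id", "purchaser"])
item = Complex("item", None, ["item_code", "number"])
detail = Complex("detail", 1, ["color"])
link(order, item)
link(item, detail)
```

With `item` as the connection point the range grows downward to `color` but not upward to `id`, because `item` itself may appear more than once within an order.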
The metadata for integration shown separately in
The method (or the rule) for mapping the data in the databases onto an XML tree structure can be described as follows: (1) It appears, to a user, as if a piece of data that is obtained by combining pieces of data from different databases was contained in one XML repeatedly as many times as the number of pieces of data. (2) The pieces of data from the databases to be integrated are mapped onto the XML elements in units of tables. (3) The XML elements that correspond to the tables can be arranged in a hierarchical manner. (4) Of the XML elements that correspond to the tables, the elements that are positioned adjacent to each other, above and below, in the hierarchical structure require that pieces of data that are in the respective corresponding tables should be associated with each other. In other words, one column in each of the tables should have the same value. (5) It is acceptable for a table that corresponds to one XML element to specify a plurality of different tables that are included in different databases. (6) The tag name of an XML that corresponds to a column of a database may be a different name from the column name.
Returning to the description of
Of these elements, the query parser unit 22a is a processing unit that, after analyzing and checking the syntax of the XQuery query received from the user terminal 10, converts the contents of the query into an internal format. When the query has a syntax violation, the query parser unit 22a returns an error message indicating the syntax violation to the user terminal 10.
The query processing engine unit 22b is a processing unit that actually processes the XQuery query converted by the query parser unit 22a, obtains data by making necessary queries to the databases accordingly, generates a query result in an XML, and returns the generated query result to the user terminal 10. In other words, the query processing engine unit 22b plans what queries need to be made to the databases in what order so as to obtain the data (i.e. generates a structured query language (SQL) to make queries to the databases) and executes the plan (i.e. sends the generated SQL to the databases and obtains the results). The query processing engine unit 22b then constructs XML document data to be eventually returned to the user terminal 10, using the data obtained from the databases as the query results. The specific contents of the processing performed by the query processing engine unit 22b will be explained more in detail later, with reference to
The access processing unit 22c is a processing unit that actually accesses the databases after the query processing engine unit 22b has made query requests to the databases. The access processing unit 22c performs the processing of transmitting, to the corresponding databases, queries that correspond to the databases and that have been generated from the XQuery query converted by the query parser unit 22a.
Next, the query processing procedure performed by the database integration reference apparatus 20 will be explained with reference to FIGS. 10 to 18.
As shown in
Subsequently, the database integration reference apparatus 20 reads the metadata for integration that is related to the query from the storage unit 21 and finds out the structure of the XML being the query target and in which databases the data that corresponds to the elements is stored (step S1303).
To be more specific, as shown in
As a method to optimize the order in which queries are made, the database integration reference apparatus 20 then divides the elements in the XML structure obtained at step S1303 depending on in which database the data is stored, examines the conditional statement specified by the user in the XQuery query, and determines a database in which it is most likely to be able to narrow down the data (step S1304).
To be more specific, as shown in
Subsequently, the database integration reference apparatus 20 generates a query for querying about the data that matches the condition to the first database determined at step S1304 (step S1305). The query generated at this step is generated in a format that corresponds to the type of database being the query target. To be more specific, when the database being the query target is an XML-DB, the query is written in an XPath (or an XPath-compatible query language). When the database being the query target is an RDB, the query is written in an SQL. Next, the generated query is sent to the corresponding database so as to obtain a query result (step S1306). It should be noted, however, that the value obtained from the database at this point in time is only the column associated with an element in the upper level.
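The choice of query format by database type can be sketched as a small dispatch function. This is an illustration only: the function name `build_subquery`, the type labels `"xml-db"` and `"rdb"`, and the table and column names are assumptions, and a single equality condition stands in for whatever conditions the XQuery query actually carries.

```python
def build_subquery(db_type, target, column, value):
    """Return (query text, parameters) in the format the target database expects."""
    if db_type == "xml-db":
        # XPath with the condition as a predicate; quoting here is naive
        # and assumes the value contains no apostrophe.
        return f"/{target}[{column}='{value}']", ()
    if db_type == "rdb":
        # SQL with a placeholder so the value is passed separately.
        return f"SELECT * FROM {target} WHERE {column} = ?", (value,)
    raise ValueError(f"unsupported database type: {db_type}")


xpath_query = build_subquery("xml-db", "order", "purchaser", "AsianTraders")
sql_query = build_subquery("rdb", "stock", "item_code", "0345")
```

Returning the parameters separately from the SQL text keeps user-supplied values out of the statement itself, which is a design choice of this sketch rather than something stated in the description.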
To be more specific, as shown in
When a sub-query text for an XML-DB is generated using an XPath (or an XPath-compatible query language), firstly, of condition expressions provided in the XQuery executed on the integrated data view, condition expressions that apply conditions on the nodes within the range of the XML sub-tree to which the XML-DB being the target corresponds are selected. Secondly, the XPath is generated according to the paths in the XML sub-tree, based on the selected condition expressions. This operation is only to convert the XQuery into the XPath, except that substitutions of paths occur due to the change of the position of the root.
When there are a plurality of condition expressions in the XQuery, and the variable used in the paths in the condition expressions is bound to a node outside the range of the XML sub-tree being the target, there are some cases where it is not possible to put the condition expressions together using one XPath. In such a case, the XPath is constructed using only some of the condition expressions with which it is likely to be able to narrow down the data, without using some other condition expressions.
Subsequently, the database integration reference apparatus 20 generates a query for sequentially finding out the upper-level elements in the XML tree structure, using the result of the previous queries to the databases (step S1307). The method of selecting the query type is the same as the one used at step S1305. The generated query is sent to the corresponding database, and a query result is obtained (step S1308). The processing at steps S1307 and S1308 is repeatedly performed until the element in the uppermost level in the XML tree structure is obtained, by sequentially obtaining the values of pieces of data that correspond to the elements in an upper level each time, starting from the element at which the query to the databases has begun (step S1309).
In this processing, the association with the previous query result is used as the condition to narrow down the data, and also if there are other conditions specified by the user in the XQuery query, those conditions are also added to the conditions used to narrow down the data. The values obtained from the databases are only the columns that are associated with the elements in the upper levels, but when the processing has reached the uppermost level element, all the columns that correspond to the uppermost level element are obtained.
To be more specific, as shown in
The generated query is sent to the received-order DB 11 (XML-DB) so that a query result that reads “<order><id>121</id><purchaser>AsianTraders</purchaser><item><item_code>0345</item_code><number>2</number></item><item><item_code>0872</item_code><number>5</number></item><date>2005-07-25</date></order>” is obtained from the order form XML, as the data that matches the conditions. In the example shown in the drawing, because the processing has reached the uppermost level element, all the columns that correspond to the uppermost level element are obtained.
Subsequently, when the element in the uppermost level in the XML is obtained (step S1309: Yes), the database integration reference apparatus 20 performs the processing of generating a query for sequentially obtaining all the elements in the lower levels below the uppermost level, sending the generated query to the corresponding database, and obtaining a query result (steps S1310 through S1311) until all the elements below the uppermost level in the XML tree structure are obtained so as to sequentially obtain the values of the pieces of data that correspond to the lower-level elements (step S1312). The method of selecting the query type at step S1310 is the same as the ones used at steps S1305 and S1307. When this processing is performed, the association with the query result of an upper element is specified as a condition with which the data is narrowed down. All the columns that correspond to the elements are obtained as the values obtained from the databases.
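For an RDB, narrowing down the lower-level data by its association with an upper-level result can be expressed as an `IN` condition over the values already obtained. The sketch below uses Python's standard `sqlite3` module with an in-memory database purely for illustration; the table `stock`, its columns, the quantities, and the function name `fetch_associated_rows` are hypothetical.

```python
import sqlite3


def fetch_associated_rows(conn, table, column, values):
    """Fetch all columns of the rows whose association column matches a given value."""
    if not values:
        return []
    placeholders = ", ".join("?" for _ in values)
    sql = (f"SELECT * FROM {table} WHERE {column} IN ({placeholders}) "
           f"ORDER BY {column}")
    return conn.execute(sql, list(values)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (item_code TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO stock VALUES (?, ?)",
    [("0345", 10), ("0872", 0), ("0999", 7)],
)

# Item codes taken from the upper-level query result (the order form).
rows = fetch_associated_rows(conn, "stock", "item_code", ["0345", "0872"])
```

Only the rows for the two ordered items are transferred; the row for `0999` stays in the database.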
To be more specific, as shown in
Further as shown in
Then, when the data values of all the elements are obtained through the processing described above (step S1312: Yes), the database integration reference apparatus 20 constructs a query result XML from the obtained data values, while going through the XML tree structure from the top, as shown in
As a result of the series of processing described above, the data in the XML format is returned, as a query result, to the user terminal 10 that has originated the XQuery query. At steps S1307 through S1312, the processing goes up to the uppermost level element first, and then a query is made to the lower-level element again. Because two queries are made to the same database, it might seem wasteful. It is, however, necessary to perform this procedure because there is a possibility that a part of the XML document data may be missing otherwise. To be more specific, for example, in
The XML data that is returned as the result of the sub-query to the XML-DB is analyzed, using the XML parser included in the query processing engine unit 22b. The reason why the analysis is made is because, unless the value of the node used in the process of making associations is extracted, it is not possible to make a query to the next database. The analysis is made also for the purpose of preventing illegitimate data from mixing in, by checking if the result matches the schema of the XML defined in the metadata for integrating the databases. The XML data of which the analysis is finished is stored in the memory in an intermediary data format (a format that is compliant with a document object model (DOM)).
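Extracting the association values from a sub-query result can be sketched with the DOM implementation in the Python standard library. The XML text is the order form result from this description; the function name `association_values` and the choice of `item_code` as the node used for the association are assumptions for illustration.

```python
from xml.dom.minidom import parseString

result_xml = (
    "<order><id>121</id><purchaser>AsianTraders</purchaser>"
    "<item><item_code>0345</item_code><number>2</number></item>"
    "<item><item_code>0872</item_code><number>5</number></item>"
    "<date>2005-07-25</date></order>"
)


def association_values(xml_text, tag_name):
    """Parse the sub-query result into a DOM and collect the text of each tag_name node."""
    document = parseString(xml_text)  # raises ExpatError on malformed XML
    return [node.firstChild.data for node in document.getElementsByTagName(tag_name)]


item_codes = association_values(result_xml, "item_code")
```

The resulting list of item codes is what the next query to the item DB or the stock DB would use as its narrowing condition. Checking the result against the virtual XML schema is not covered by this sketch.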
There are two possible methods to perform the processing when, in the virtual XML schema information in the metadata for integrating databases, the Simple Elements that appear directly below a single Complex Element appear in a different order in the returned XML data. One of the possible methods is to consider the XML data to be illegitimate XML data having a schema violation and treat it as an error (i.e. the data is discarded or an error message is returned and the processing is ended). The other possible method is to rearrange the order according to the virtual XML schema information. According to the first embodiment, the latter method is used. With this arrangement, according to the first embodiment, it is possible to change, with flexibility, the order in which tags appear in a virtual data view.
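The second method, rearranging the returned elements into the order given by the virtual XML schema information, can be sketched as a stable sort of the child elements. The function name `reorder_children` and the representation of the schema order as a plain list of tag names are assumptions; tags not listed in the schema are placed last here, which is a choice of this sketch only.

```python
import xml.etree.ElementTree as ET


def reorder_children(element, schema_order):
    """Rearrange the direct children of element to follow schema_order."""
    rank = {name: position for position, name in enumerate(schema_order)}
    # sorted() is stable, so repeated tags keep their original relative order.
    element[:] = sorted(element, key=lambda child: rank.get(child.tag, len(rank)))
    return element


returned = ET.fromstring(
    "<order><date>2005-07-25</date><id>121</id>"
    "<purchaser>AsianTraders</purchaser></order>"
)
reorder_children(returned, ["id", "purchaser", "item", "date"])
reordered_text = ET.tostring(returned, encoding="unicode")
```

After the call, `reordered_text` lists `id`, `purchaser`, and `date` in the order defined by the schema, regardless of the order in which the XML-DB returned them.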
The XML data that is a result of the XQuery query is generated by outputting the results of the sub-queries to the databases that are stored in the memory in the intermediary data format, as XML data according to the virtual XML schema in the metadata for integrating databases.
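Generating the result XML by walking the virtual schema from the top can be sketched as a recursive function over the intermediary data. Everything structural here is an assumption made for illustration: the schema is written as nested `(tag, children)` tuples, the intermediary data as nested dictionaries and lists, and the `name` and `stock` values are invented sample data rather than values from the embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical virtual schema: (tag name, child schemas); leaves have no children.
SCHEMA = ("order", [
    ("id", []),
    ("item", [("item_code", []), ("name", []), ("stock", [])]),
])


def emit(schema, data):
    """Build an element for schema from data, walking the tree from the top."""
    tag, children = schema
    element = ET.Element(tag)
    if not children:
        element.text = str(data)
        return element
    for child in children:
        value = data.get(child[0])
        # A list means the child element repeats once per entry.
        for entry in value if isinstance(value, list) else [value]:
            if entry is not None:
                element.append(emit(child, entry))
    return element


# Intermediary data merged from the sub-query results (sample values).
merged = {"id": "121", "item": [
    {"item_code": "0345", "name": "widget", "stock": 10},
    {"item_code": "0872", "name": "gadget", "stock": 0},
]}
xml_text = ET.tostring(emit(SCHEMA, merged), encoding="unicode")
```

Because the output follows the schema rather than the order of the stored data, the tag order in the result is fixed by the virtual XML schema.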
Next, the method for optimizing the query order (the processing related to step S1304 in
When the database integration reference apparatus 20 according to the first embodiment is used and pieces of relevant data are sequentially obtained from a plurality of databases, the piece of data obtained first is obtained by narrowing down the data based on the conditions specified in the query from the user, whereas the other pieces of data that are obtained thereafter are obtained by narrowing down the data based on both the association with the previously obtained data and the conditions specified by the user. For this reason, when the data is not narrowed down sufficiently, a large amount of data is returned as a result of the queries to the databases. In this situation, not only does it take a long time to transfer the data, but the load on the network is also increased.
To explain this situation more specifically, as shown in
In this situation, when the amount of data obtained as a result of the first query is large, the amount of data obtained as a result of the next query, which uses the data resulting from the first query, also becomes large. Thus, even if the final query result to be returned to the user is the same, the amount of data collected in the database integration reference apparatus 20 during the process increases. In such a case, not only does it take longer to send the response to the user because the transfer of the data requires more time, but the load on the network is also increased. To cope with this problem, the database integration reference apparatus 20 determines the database to which the first query is made after examining to which one of the databases the SQL query should be issued first so as to make the amount of data in the query result smaller. This processing is performed by considering the four points (1) through (4) shown below, after obtaining, from the databases, the metadata of each of the databases themselves (which is different from the metadata for integration).
(1) Restrictive Conditions Related to Redundancy of Data
By referring to the metadata of the databases, it is checked whether the column conditioned in the XQuery query is the primary key of the table or whether a unique constraint is imposed on the column. If one of these conditions is satisfied, the column has no duplication of data. Thus, there is a high possibility of being able to narrow down the data.
(2) The Number of Pieces of Data
By referring to the metadata of the databases, it is checked whether the number of records in the table is large. This check is made because, when the number of records in the table is large, there is a higher possibility that a large number of records are returned as the query result.
(3) The Type of Data and the Number of Digits
By referring to the metadata of the databases, it is checked if the data type of the column is one with a small variety, for example, numerals or true/false values, or if the number of digits is small. In such situations, there is a higher possibility that the column has a large amount of duplication of data. Thus, there is a higher possibility that a large number of records are returned as the query result.
(4) The Type of Condition Specification in the Condition Expressions Specified by the User
It is checked whether the condition expression in the XQuery query is specified using an equality sign or an inequality sign. This check is made because, when the condition is specified using an equality sign, there is a higher possibility of being able to narrow down the data than when the condition is specified using an inequality sign.
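As an illustration of how the four checks (1) through (4) could be combined into a ranking of query conditions, the following Python sketch gives one point for each check that suggests a small result. The equal weights, the `large_table_threshold` value, the `ConditionInfo` fields, and the database names are all assumptions of this sketch; the embodiment does not specify how the score is computed.

```python
from dataclasses import dataclass


@dataclass
class ConditionInfo:
    """Facts about one query condition (hypothetical structure)."""
    database: str
    is_unique: bool         # (1) primary key or unique constraint on the column
    row_count: int          # (2) number of records in the table
    low_variety_type: bool  # (3) e.g. true/false values or few digits
    uses_equality: bool     # (4) condition written with an equality sign


def score(cond, large_table_threshold=100_000):
    """Return a score; a higher score suggests a smaller query result."""
    points = 0
    if cond.is_unique:
        points += 1
    if cond.row_count < large_table_threshold:  # assumed cut-off
        points += 1
    if not cond.low_variety_type:
        points += 1
    if cond.uses_equality:
        points += 1
    return points


def choose_first_database(conditions):
    """Pick the database involved in the highest-scoring condition."""
    return max(conditions, key=score).database


conditions = [
    ConditionInfo("rdb_a", False, 5_000_000, True, False),
    ConditionInfo("xmldb_b", True, 2_000, False, True),
]
first = choose_first_database(conditions)
```

With these sample values, the condition on `xmldb_b` satisfies all four checks, so the query would start there.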
The database integration reference apparatus 20 checks whether each of these four criteria is satisfied and gives a score to each of the query conditions according to the result of the checking. The database integration reference apparatus 20 starts the query with the database that involves the condition with the highest score. In the example shown in
After the database with which the query is started is determined using the optimization method, the elements are sequentially obtained through the processing that moves to an element respectively positioned immediately above, toward the uppermost level element in the XML at first, using the association information, as explained in the description of the procedure in the query processing.
As explained so far, according to the first embodiment, not only is a means of access that can be used in common among the databases provided, but an XML data view at a further upper level is also made available. In other words, the entire relevant data that exists in the plurality of databases is presented to the user as a virtual XML document. As a result of a query to extract a part of the XML document, data reference is performed in such a manner that an XML document is returned. Also, when the user issues a query, it is judged in what order, from which database, and with what query the data should be obtained, based on the metadata for integration that is prepared in advance. According to the result of the judgment, the necessary data is obtained, and the obtained data is constructed into an XML document and returned to the user. Thus, the user does not have to be concerned about the structure in which the data is stored and does not need to know in which one of the databases each piece of data is stored. Accordingly, it is possible to treat the plurality of databases as if they were one database.
Also, according to the first embodiment, even if pieces of data of the same type are stored in a plurality of databases and the user does not know in which one of the databases the piece of data having a certain value is stored, when the user issues an XML document query, the database integration reference apparatus 20 sends a query to every database that may store the piece of data, based on the metadata for integration, and finds the data automatically. With this arrangement, the user does not have to search the databases for the data. Thus, it is possible to treat the plurality of databases as if they were one database.
Further, according to the first embodiment, when data is obtained from the databases, a plan for issuing the queries is made so that the query results become as small as possible, based on the meta information of the databases and the contents of the queries, and the data is sequentially obtained from the databases according to the plan. With this arrangement, the data is narrowed down to the result data by manipulating the order in which the queries are made. Thus, it is possible to reduce the amount of data being transferred and to shorten the period of time required for the queries, and also to reduce the load on the network.
In addition, according to the first embodiment, after the database with which the query is started is determined, the data values corresponding to the elements are sequentially obtained, starting with the element of which the data value is obtained first, and in such a manner that the processing moves onto an upper-level element each time in the XML document tree structure. When the data value of the uppermost level element is obtained, the data values of all the lower level elements are sequentially obtained, while going down the structure from the uppermost level. This procedure is always the same regardless of the definition of the XML document structure and the contents of the queries. With this arrangement, it is possible to obtain, without any exception, the entire XML document that serves as the query result, regardless of the definition of the XML document structure and the contents of the queries. Also, it is possible to make the number of times queries are made to the databases small.
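The fixed procedure described above (climb from the starting element to the uppermost level element, then descend through all lower-level elements) can be sketched in Python as follows. The `children` mapping, the element names, and the depth-first order of the descent are assumptions of this sketch; the sketch only computes the order in which elements would be visited and does not issue any queries.

```python
def fetch_order(children, start):
    """Return the order in which elements are visited.

    children maps each element name to the list of its child element
    names (a hypothetical representation of the view's tree structure).
    """
    parent = {c: p for p, cs in children.items() for c in cs}

    # Phase 1: start at the chosen element and climb to the root.
    order = [start]
    node = start
    while node in parent:
        node = parent[node]
        order.append(node)

    # Phase 2: descend from the root through every lower-level element.
    def descend(name):
        for child in children.get(name, []):
            order.append(child)
            descend(child)

    descend(node)
    return order


tree = {"order": ["customer", "item"], "item": ["product"]}
visits = fetch_order(tree, "product")
```

In this example `item` and `product` are each visited twice, once on the way up and once on the way down, which corresponds to the two queries to the same database mentioned earlier.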
The first embodiment described above has the characteristics as described below.
It is assumed that an XML-DB stores therein a large number of pieces of XML document data with a predetermined fixed schema and has an interface so that, when having received a query, the XML-DB returns one or more pieces of XML document data that correspond to the conditions while the data remains in the current format. As many pieces of XML document data as satisfy the conditions are returned. When it is assumed that the XML-DB has such an interface, it is possible to consider that the schema in the pieces of XML document data returned from the XML-DB is fixed. Thus, it is possible to embed the fixed schema as a part of the schema of the data view in an XML format that is visibly presented to the user.
To embed the schema of the pieces of XML document data that are returned from the XML-DB into the schema of the data view in the XML format, a view generation rule defines how the XML tree structure returned from the XML-DB is connected to the XML tree structure generated from the data structure of another RDB, and thus what tree structure the resulting view has. The view generation rule also defines the entries that are used to make associations between these tree structures.
In the query processing, the XML document data returned from the XML-DB is embedded, without being modified, as a part of the XML document data that serves as the query result. In other words, the XML document data is treated in the same way as XML sub-trees structured from a plurality of RDBs are treated. It is safe to say that the tree structure that defines the schema of the XML document data view also defines the schema of the XML document data returned from the XML-DB, according to the first embodiment.
This method, however, can be applied only to an XML-DB that has the hypothetical interface described above. Also, it is not possible to apply this method when the XML document data returned from the XML-DB has a semi-structured characteristic. Further, the schema of the integrated data view that is presented to the user is also restricted by the schema of the XML document data returned from the XML-DB.
To solve the problem that remains even after the invention according to the first embodiment is applied, and also to present other functions that may be added to the first embodiment, more exemplary embodiments are presented below as a second embodiment of the invention. Firstly, a first characteristic of the second embodiment will be explained.
According to the first embodiment, it is assumed that the XML-DB stores therein a large number of pieces of XML document data with a predetermined fixed schema and has an interface so that, when having received a query, the XML-DB returns one or more pieces of XML document data that correspond to the conditions, while the data remains in the current format. Thus, this arrangement is not applicable to an XML-DB that only has an interface of another kind. Generally speaking, however, the interfaces in many XML-DBs are arranged in such a manner that one (or more than one) large piece of XML document data is stored, an instruction to extract a part of the XML document data is issued in the query language, and a part of the stored XML document data is returned. Additionally, when a path to the repetitive structure in the XML data is specified in the database information in the metadata for integrating the databases, it is necessary to correct the XPath so that the specified path is added at the beginning before the issuance.
To cope with this situation, as shown in
The processing of automatically modifying, before the issuance, the sub-query issued by this system, according to the path that is from the root node to the repetitive structure and is recorded in the view generation rule is executed by the query processing engine unit 22b. The path from the root node to the repetitive structure is stored in the metadata for integration 21a.
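A minimal Python sketch of this modification, adding the recorded path at the beginning of the sub-query, is shown here. The function name `prepend_repetitive_path` and the sample paths are hypothetical; the sketch only concatenates strings and does not validate the resulting XPath.

```python
def prepend_repetitive_path(sub_query, repetitive_path):
    """Add the path from the root node to the repetitive structure
    at the beginning of an XPath sub-query before it is issued."""
    if not repetitive_path:
        # No path is recorded for this database; issue the query as-is.
        return sub_query
    prefix = repetitive_path.rstrip("/")
    if sub_query.startswith("/"):
        # "/book" and "//book" can be appended to the prefix directly.
        return prefix + sub_query
    return prefix + "/" + sub_query


# Hypothetical values: the stored document repeats <book> under /library/shelf.
issued = prepend_repetitive_path("book[title='XML']", "/library/shelf/")
```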
Next, a second characteristic of the second embodiment will be explained.
These definitions are related to each other, and it is not possible to set the definitions without some kind of order. The nodes that are used to make an association need to be in a one-to-one correspondence. Thus, an XML-DB has a restriction as follows: a node used in the definition of association needs to be a terminal node, which is a child node of an intermediate node being the connection point in the definition of the schema. Because of this restriction, a problem arises where the level of flexibility in defining the schema of the view is low, and it is not possible to define a view with flexibility (see
To cope with this situation, as shown in
The query processing engine unit 22b calculates the number of appearances of each of the intermediate nodes, or the ratio of the numbers of appearances between the intermediate nodes, based on the maximum number of appearances of each of the intermediate nodes in the sub-tree corresponding to the specified XML-DB. It then judges whether it is possible to specify a node in an upper level or in a lower level as a node with which an association is made, within the range in which a one-to-one correspondence is possible. The maximum number of appearances of each of the intermediate nodes in the sub-tree corresponding to the specified XML-DB is stored in the metadata for integration 21a.
Next, a third characteristic of the second embodiment will be explained.
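One possible reading of this judgment is sketched below in Python: a candidate node stays in a one-to-one correspondence with the connection point only if every node on the path between them can appear at most once. This interpretation, the `max_occurs` mapping, the use of `None` for an unbounded count, and the node names are assumptions of the sketch rather than details given in the embodiment.

```python
def is_one_to_one(max_occurs, path):
    """Judge whether the last node on path can be used for an association.

    max_occurs maps a node name to its maximum number of appearances
    under its parent (None means unbounded). path lists the node names
    from just below the connection point down to the candidate node.
    """
    total = 1
    for name in path:
        limit = max_occurs.get(name)
        if limit is None:
            # An unbounded repetition breaks the one-to-one correspondence.
            return False
        total *= limit
    return total == 1


# Hypothetical maximum numbers of appearances stored in the metadata.
max_occurs = {"header": 1, "id": 1, "lines": 1, "line": None}
header_id_ok = is_one_to_one(max_occurs, ["header", "id"])
line_ok = is_one_to_one(max_occurs, ["lines", "line"])
```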
To cope with this situation, as shown in
The processing of changing, in the view schema definition in the view generation rule, the name of each of the nodes to a different name from the one used in the databases is executed by the query processing engine unit 22b. The name of each of the nodes and a corresponding name for the use in the databases as well as the relationship between the names are stored in the metadata for integration 21a.
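The renaming can be illustrated with a short Python sketch that applies a name mapping to an XML tree. The `name_map` dictionary (database name to view name) and the sample tags are hypothetical stand-ins for the relationship between names stored in the metadata for integration.

```python
import xml.etree.ElementTree as ET


def rename_nodes(root, name_map):
    """Replace each tag used in the database with its name in the view.

    Tags that have no entry in name_map are left unchanged.
    """
    for elem in root.iter():
        elem.tag = name_map.get(elem.tag, elem.tag)
    return root


# Hypothetical mapping: terse database names to readable view names.
name_map = {"cust": "customer", "nm": "name"}
doc = ET.fromstring("<cust><nm>Sato</nm><age>41</age></cust>")
renamed_xml = ET.tostring(rename_nodes(doc, name_map), encoding="unicode")
```

A sub-query sent to the database would need the inverse mapping, so that names written by the user in the view are translated back to the names used in the database.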
Next, a fourth characteristic of the second embodiment will be explained.
To cope with this situation, as shown in
The processing of inserting the tag of the specified imaginary node when the analysis of the XML document data serving as the query result is finished is executed by the query processing engine unit 22b. The tag information of the specified imaginary node is stored in the metadata for integration 21a.
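As a sketch of this insertion, the Python function below wraps selected children of a parent element in a new element that does not exist in the data returned from the database. The function name, the `wrapped_tags` parameter, and the sample `address` node are assumptions of this sketch; the embodiment does not state how the position of the imaginary node is described in the metadata.

```python
import xml.etree.ElementTree as ET


def insert_imaginary_node(parent, imaginary_tag, wrapped_tags):
    """Wrap the children of parent whose tags are in wrapped_tags
    inside a new element named imaginary_tag.

    The new element is placed where the first wrapped child was.
    """
    wrapper = ET.Element(imaginary_tag)
    new_children = []
    inserted = False
    for child in list(parent):
        if child.tag in wrapped_tags:
            if not inserted:
                new_children.append(wrapper)
                inserted = True
            wrapper.append(child)
        else:
            new_children.append(child)
    parent[:] = new_children
    return parent


result = ET.fromstring(
    "<person><name>Sato</name><street>1-1</street><city>Tokyo</city></person>"
)
insert_imaginary_node(result, "address", {"street", "city"})
with_imaginary = ET.tostring(result, encoding="unicode")
```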
Next, a fifth characteristic of the second embodiment will be explained.
To cope with this situation, as shown in
The processing of removing the tag of the node that is specified not to be displayed when the analysis of the XML document data serving as the query result is finished is executed by the query processing engine unit 22b. The tag information of the node that is specified not to be displayed is stored in the metadata for integration 21a.
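The removal could look like the following Python sketch, which deletes the tag of each hidden node and promotes its child elements to the parent. Whether the children are kept or the whole sub-tree is dropped is not stated in the embodiment; keeping the children, and the names `remove_hidden_tags` and `hidden_tags`, are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET


def remove_hidden_tags(parent, hidden_tags):
    """Remove the tags of nodes that are specified not to be displayed.

    The child elements of a hidden node are moved up to its parent.
    Any text directly inside the hidden node is dropped.
    """
    new_children = []
    for child in list(parent):
        remove_hidden_tags(child, hidden_tags)  # handle nested nodes first
        if child.tag in hidden_tags:
            new_children.extend(list(child))    # promote the grandchildren
        else:
            new_children.append(child)
    parent[:] = new_children
    return parent


result = ET.fromstring("<person><meta><id>7</id></meta><name>Sato</name></person>")
remove_hidden_tags(result, {"meta"})
without_hidden = ET.tostring(result, encoding="unicode")
```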
Next, a sixth characteristic of the second embodiment will be explained.
To cope with this situation, as shown in
The processing of displaying, as a mere character string, the information of the node for which it has been designated to cancel the schema checking when the analysis of the XML document data serving as the query result is finished, is executed by the query processing engine unit 22b. The tag information of the node for which it has been designated to cancel the schema checking is stored in the metadata for integration 21a.
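A minimal Python sketch of this treatment replaces the content of a designated node with its own serialized markup, so that the content is carried as a plain character string and no longer needs to match any schema. The function name `flatten_to_string` and the sample `note` element are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten_to_string(elem):
    """Turn the content of elem into a single character string.

    Child elements are serialized to text and then removed, so the
    node carries its former markup as ordinary character data.
    """
    inner = (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem
    )
    elem[:] = []       # drop the child elements
    elem.text = inner  # keep their markup as plain text
    return elem


note = ET.fromstring("<note><b>free</b> form</note>")
flatten_to_string(note)
flattened_text = note.text
```

When the element is serialized again, the angle brackets in the string are escaped, so a consumer of the view sees the content as text rather than as nested elements.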
According to the first embodiment and the second embodiment that have been explained, when the pieces of data that are distributed in a plurality of databases including an XML-DB and an RDB are referenced, it is possible to reference the data without being concerned about the physical distribution of the databases, simply by following the basic method of use of XQuery. In addition, because the flexibility level of the schema definition in the integrated data view is high, it is possible to make flexible queries using XQuery, as if a single database were being accessed.
So far, the first and the second embodiments of the present invention have been explained. The present invention may be, however, embodied in various forms other than the first and the second embodiments, as long as it is within the scope of the technical ideas defined in the claims. In the following sections, various other exemplary embodiments will be explained by dividing them into the categories of: (1) tagged document; (2) databases; (3) metadata for integration; (4) access processing; (5) system configuration etc.; and (6) program.
(1) Tagged Document
For example, in the first and the second embodiments, the example in which XML is used as the format of a tagged document is explained. However, the present invention is not limited to this example. It is acceptable to use other tagged document formats such as Hyper Text Markup Language (HTML) or Standard Generalized Markup Language (SGML).
In the description of the first and the second embodiments, an example is used in which “XQuery”, which is a query language for which the World Wide Web Consortium (W3C) is working on its standardization process, is used in the query sent to the XML data view, whereas “XPath (or an XPath-compatible query language)” is used in the query sent to the XML-DB. However, the present invention is not limited to this example. It is acceptable to use other query languages, including “XQuery” and “XPath (or an XPath-compatible query language)”, in each of both types of queries.
(2) Databases
In the description of the first and second embodiments, the example in which the XML-DB and the RDBs are integrated is explained. However, the present invention is not limited to this example. It is possible to apply the present invention in the same way to a case where other types of databases are integrated. For example, the database may be an object-oriented database or an object relational database. In an object-oriented database, the data is identified by a path in a hierarchical structure. Thus, by using processing and a function that convert the hierarchical structure into a hierarchical structure of a tagged document, it is possible to treat the object-oriented database as if it were an XML-DB. On the other hand, the data management method of an object relational database is compliant with that of an RDB. Thus, it is possible to treat an object relational database substantially in the same way as an RDB is treated.
(3) Metadata for Integration
In the description of the first and the second embodiments, the example in which one piece of metadata for integration is provided is explained. However, the present invention is not limited to this example. It is acceptable to provide a plurality of pieces of metadata for integration, depending on the method of integrating the databases. For example, a plurality of pieces of metadata for integration may be provided that correspond to different modes in which the query result is output.
(4) Access Processing
In the first embodiment, the example is based on an assumption that Globus Toolkit 4+OGSA-DAI WSRF 2.1 is used for the RDBs, whereas an application programming interface (API) that is compatible with XPath is used for the XML-DB, to access the plurality of different types of databases. However, the present invention is not limited to this example. How a query is made to the different types of databases is irrelevant. It is acceptable to access the databases with any method. In particular, the XPath-compatible API is a sub-set of XPath, which is an XML search language. Thus, it is possible to modify the arrangement so that the query processing is performed using XPath.
(5) System Configuration etc.
The constituent elements of the apparatuses shown in the drawings (especially, the database integration reference apparatus 20) are based on functional concepts. The constituent elements do not necessarily have to be physically arranged in the way shown in the drawings. In other words, the specific mode in which the apparatuses are distributed and integrated is not limited to the one shown in the drawing. A part or all of the apparatuses may be distributed or integrated functionally or physically in any arbitrary units, according to various loads and the status of use. A part or all of the processing functions offered by the apparatuses may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware with wired logic.
Of the various types of processing explained in the description of the first and the second embodiments, it is acceptable to manually perform a part or all of the processing that is explained to be performed automatically. Conversely, it is acceptable to automatically perform, using a publicly-known technique, a part or all of the processing that is explained to be performed manually. In addition, the processing procedures, the controlling procedures, the specific names, and the information including various types of data and parameters that are presented in the text and the drawings may be modified in any form, except when it is noted otherwise.
(6) Computer Program
The various types of processing explained in the description of the first and second embodiments may be realized through execution of a program, which is prepared in advance, in a computer system such as a personal computer, a server, or a work station.
As another exemplary embodiment, the functions in the first and the second embodiments may be realized by reading and executing a program recorded on a predetermined recording medium in a computer system. The predetermined recording medium may be a “portable physical medium” such as a Flexible Disk (FD), a Compact Disc Read Only Memory (CD-ROM), a Magneto Optical (MO) disk, a Digital Versatile Disk (DVD), or an Integrated Circuit (IC) card, or a “stationary physical medium” such as a hard disk drive (HDD) provided on the inside or the outside of a computer system, a Random Access Memory (RAM), or a Read-Only Memory (ROM), or a “communication medium” that stores therein a program for a short period of time when the program is transmitted, such as a public circuit that is connected via a modem, or a Local Area Network (LAN)/a Wide Area Network (WAN) to which another computer system and a server are connected. The predetermined recording medium may be any recording medium that records thereon a program that is readable by a computer system.
To be more specific, the program used in this exemplary embodiment is recorded on a recording medium such as a “portable physical medium”, a “stationary physical medium”, or a “communication medium” in such a manner that the program is computer-readable. The computer system realizes the same functions as described in the exemplary embodiments above, by reading the program from the recording medium and executing the read program. The program used in this exemplary embodiment is not limited to being executed by a computer system. The present invention is applicable to an example in which another computer system or a server executes the program or in which another computer system and a server collaborate to execute the program.
According to the present invention, it is possible to reference the pieces of data that are distributed in the plurality of different types of databases including the database that returns the query result as the data that is uniquely identified in the hierarchical structure, by outputting, in the integrated view, the query result obtained as a result of the queries that are made, in the query formats, to the databases. Thus, an effect is achieved where it is possible to make the queries without being concerned about the pieces of data being distributed. Accordingly, the level of flexibility in the database development work is enhanced.
According to the present invention, it is possible to reference the pieces of data that are distributed in the plurality of different types of databases including the tagged document database that returns the query result as the tagged document of which the structure is predetermined, by outputting, in the integrated view, the query result obtained as a result of the queries that are made, in the query formats, to the databases. Thus, an effect is achieved where it is possible to make the queries without being concerned about the data being distributed. Accordingly, the level of flexibility in the database development work is enhanced.
Further, according to the present invention, it is possible to store the specific repetitive structure included in a tagged document data within the tagged document database and to obtain the data as the query result, based on the stored repetitive structure. Thus, an effect is achieved where the range of tagged document databases that can be the targets of the integration is widened.
In addition, according to the present invention, the schema of the tagged document data returned from the tagged document database does not restrict the nodes that can be used for making associations with another database. Thus, there are more options of nodes that can be used for making associations. Accordingly, an effect is achieved where the level of flexibility in the design of the integrated data view is improved and also the level of flexibility in the upper-level application development is improved.
Further, according to the present invention, it is possible to determine the names of the elements defined in the schema of the integrated data view without dependency on the names of the elements defined in the schema of the tagged document data returned from the tagged document database. Thus, an effect is achieved where it is possible to determine the names of the elements defined in the schema of the integrated data view in such formats that are easy to understand for the users.
In addition, according to the present invention, it is possible to put the one or more elements that do not exist in the schema of the tagged document data returned from the tagged document database into the schema of the integrated data view. Thus, it is possible to determine, with flexibility, the schema of the integrated data view. Accordingly, an effect is achieved where the level of flexibility in the upper-level application development is significantly improved.
Furthermore, according to the present invention, it is possible to arrange so that the schema of the integrated data view does not include one or more of the elements that exist in the schema of the tagged document data returned from the tagged document database. Thus, it is possible to determine, with flexibility, the schema of the integrated data view. Accordingly, an effect is achieved where the level of flexibility in the upper-level application development is significantly improved.
Moreover, according to the present invention, even if the tagged document data returned from the tagged document database is indefinite or has a semi-structured characteristic, it is possible to integrate the tagged document database. Thus, an effect is achieved where the range of tagged document databases that can be the targets of the integration is widened.
Although the invention has been described with respect to a specific embodiment for a complete and clear disclosure, the appended claims are not to be thus limited but are to be construed as embodying all modifications and alternative constructions that may occur to one skilled in the art that fairly fall within the basic teaching herein set forth.
Number | Date | Country | Kind |
---|---|---|---|
2006-077649 | Mar 2006 | JP | national |