Various embodiments described herein relate to computer systems, methods and program products and, more specifically, to database management computer systems, methods and program products.
DataBase Management Systems (DBMSs) are specially designed software applications that interact with users, other applications and a database itself, to capture and analyze data. More specifically, a DBMS may allow the definition, creation, querying, update and administration of databases. Commercially available DBMSs include MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Microsoft Access, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base and FileMaker Pro.
Databases may be divided into relational databases and non-relational databases, based upon their organization. A relational database is a database that has a collection of tables of data items, which are formally described and organized according to a relational model. Data in a single table represents a relation. Moreover, tables may have additionally defined relationships with each other. In the relational model, each table schema identifies a column or group of columns, called the primary key, to uniquely identify each row. A relationship can then be established between a row in the table and a row in another table, by creating a foreign key, i.e., a column or group of columns in one table that points to a primary key of another table. Many relational databases are designed and implemented using Structured Query Language (SQL), so that relational databases may also be called SQL databases. SQL-based relational database management systems are presently provided by many vendors.
In contrast, a non-relational database, also referred to as a “Not Only SQL” or a “NoSQL” database, provides a mechanism for storage and retrieval of data that is modeled using techniques other than tabular relations used in relational databases. Non-relational databases may be more simple in design and may provide better horizontal scaling compared to relational databases. A non-relational database may also be referred to as a “hierarchical database”, because the data elements are provided in a hierarchy rather than using table relationships. Non-relational databases are finding significant and growing use in big data and real-time Web applications. Non-relational databases include column-based databases, such as Accumulo, Cassandra and HBase; document-based databases, such as Clusterpoint, Couchbase, MarkLogic and MongoDB; key-value-based databases, such as Dynamo, FoundationDB, MemcacheDB, Redis, Riak and FairCom c-treeACE; and graph-based databases, such as Allegro, Neo4J, OrientDB and Virtuoso.
Database design is a process for producing a database. A DBMS may facilitate database design. Database design may generally include three overall operations: conceptual modeling of the database, logical modeling of the database and physical design of the database. In conceptual modeling, a database modeling computer system, method and/or program product is used to identify relationship dependencies within the data to be stored in the database. In logical modeling, a logical structure of a database may be defined using, for example, keys, domains and/or data types. Once the database is conceptually and logically modeled, the physical design of the database can specify the physical configuration of the database in a storage media. The physical design may be performed by a database creation system. Common DBMSs may include both a database modeling system for the conceptual and/or logical modeling and a database creation system, although these functionalities may also be separately provided by separate systems.
A database may be designed using an Entity-Relationship (ER) model, which is a data model for describing a database in an abstract way. In ER modeling, an entity may be defined as a thing which is recognized as being capable of an independent existence and which can be uniquely identified. In English grammar structure, an entity may be thought of as a “noun”. A relationship captures how entities are related to each other. Thus, relationships can be thought of as “verbs”, linking two or more nouns. Entities and relationships can both have attributes, which may be thought of as “adjectives” for entities, and “adverbs” for relationships.
Various embodiments described herein can selectively link join-queried entities in an entity-relationship model of a database. The selective linking may be performed by defining a pre-join tag in the entity-relationship model of the database and selectively providing a respective pre-join tag between respective join-queried entities in the entity-relationship model of the database. In some embodiments, frequently join-queried entities in the entity-relationship model of the database are identified, and a respective pre join tag is provided between the respective frequently join-queried entities in the entity-relationship model of the database. The frequently join-queried entities may be identified by identifying join-queried entities that comprise a comprise a join-queried frequency that exceeds a threshold.
In some embodiments, prior to the linking of the frequently join-queried entities in the entity-relationship model of the database, the frequently join-queried entities are identified. Various techniques may be used to identify the frequently join-queried entities. In some embodiments, a non-relational database is mined to identify frequently occurring structures therein. In other embodiments, frequently occurring queries are identified from query logs of the database, entities that are associated with the frequently occurring queries are identified, and the frequently join-queried entities are identified from the frequently occurring queries and from the entities that are associated with the frequently occurring queries.
In some embodiments when the database is a non-relational database, the frequently join-queried entities may be identified from reverse key tables. In other embodiments when the database is a non-relational database, the frequently join-queried entities may be identified from composite keys. In still other embodiments when the database is a relational database, the frequently join-queried entities may be identified by identifying runtime joins that are performed on the relational database in response to queries.
In some embodiments, a non-relational database may be generated from the entity-relationship model by creating merged entities in the non-relational database based on the selective linking. The generating may be performed by creating an embedded structure in the non-relational database based on the merged entities and/or by creating a composite key in the non-relational database based on the merged entities.
In other embodiments, a relational database is generated from the entity-relationship model by selectively creating merged entities in the relational database based on the selective linking. The merged entities may be created by selectively merging two tables in the relational database based on the merged entities.
Other embodiments can selectively link frequently join-queried entities in an entity-relationship model of a relational database by identifying, from query logs of the relational database, tables that are frequently joined at run time in response to queries of the relational database; identifying entities that are associated with the tables that are frequently joined; and providing a respective pre-join tag between the respective entities that are associated with the tables. The entity-relationship model of the relational database including the pre-join tags may be converted into a non-relational database. The converting may take place by creating merged entities in the non-relational database based on the pre join tags. The frequently occurring queries may be executed on the non-relational database using the merged entities.
Various embodiments described herein may provide an apparatus that comprises processor circuitry and memory circuitry coupled to the processor circuitry and comprising computer readable program code, such as a database management system, that, when executed by the processor circuitry, causes the processor circuitry to perform operations comprising selectively linking join-queried entities in an entity-relationship model of a database according to any of the embodiments described herein. Other embodiments provide methods of selectively linking frequently join-queried entities in an entity-relationship model of a database according to any of the embodiments described herein, wherein the generating is performed by processor circuitry. Still other embodiments provide a computer program product that comprises a computer-readable storage medium having computer readable program code embodied in the medium that, when executed by processor circuitry of a computer system, causes the computer system to perform operations comprising selectively linking join-queried entities in an entity-relationship model of a database according to any of the embodiments described herein.
It is noted that aspects described with respect to one embodiment may be incorporated in different embodiments although not specifically described relative thereto. That is, all embodiments and/or features of any embodiments can be combined in any way and/or combination. Moreover, computer equipment, systems, methods, and/or computer program products according to embodiments will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional computer equipment, systems, methods, and/or computer program products be included within this description and protected by the accompanying claims.
The accompanying drawings, which are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this application, illustrate certain embodiment(s). In the drawings:
Various embodiments will be described more fully hereinafter with reference to the accompanying drawings. Other embodiments may take many different forms and should not be construed as limited to the embodiments set forth herein. Like numbers refer to like elements throughout.
Various embodiments described herein can allow join-queried entities in an entity-relationship model of a database to be selectively linked. As used herein, a “join” is an operation that combines records from two or more tables in a relational database or combines two or more collections in a non-relational database. A “join-query” is a query, the performance of which requires data from two or more tables in a relational database or from two or more collections in a non-relational database.
The selective linking may be performed using a pre-join tag between respective join-queried entities in the entity-relationship model of the database. In some embodiments, the selective linking may identify frequently join-queried entities in the entity-relationship model of the database and providing a respective pre join tag between the respective frequently join-queried entities in the entity-relationship model of the database. Frequently join-queried entities may be identified by identifying join-queried entities in the entity-relationship model of the database that comprise a join-query frequency that exceeds a threshold.
If the database is a relational database, the pre-join tags may be used to reduce the number of runtime join-queries that need be performed. Moreover, the pre-join tags may be used to convert the relational database into a non-relational database or to merge the relational database with a non-relational database, by creating merged entities in the non-relational database based on the pre-join tags. If the database is a non-relational database, the pre-join tags may also be used to improve query performance by creating merged entities in the non-relational database based on the pre-join tags. Improved querying of join-queried entities may thereby be provided.
Various embodiments described herein may arise from a recognition that a traditional entity-relationship model includes entities with attributes, relationships and keys, but does not allow the database modeler to link frequently join-queried entities in the entity-relationship model. In a database, querying of data related to different entities may use a “join” clause or other query form. However, execution of a join clause in a database may generate large numbers of input/output operations resulting in slow performance. Moreover, in non-relational databases, join queries are extremely difficult to execute.
In sharp contrast, various embodiments described herein allow a pre-join tag to be associated with entities that are frequently join-queried, so as to link the frequently join-queried entities in the entity-relationship model of the database. At the database creation phase, a physical model can be automatically created to enhance join queries by selectively merging entities into a common table or a common collection, in response to the pre join tags. Accordingly, physical models may be built that can reduce input/output costs for queries across multiple database tables or collections, and thereby allow improved query performance in a database management system.
A high level description of various embodiments described herein will now be provided using block diagrams and flowcharts.
A DBMS 130 may be stored in the memory 120. The DBMS 130 may be used to provide the definition, creation, querying, update and administration of one or more databases 160. The database(s) 160 may communicate with the DBMS 130 directly and/or may be connected to the DBMS 130 by any conventional public and/or private, real and/or virtual wired and/or wireless network including all or a portion of the global communication network known as the Internet. The database(s) 160 may be stored in a common memory 120 with the DBMS 130, as illustrated in
Still referring to
Specifically, referring to
At Block 220, frequently join-queried entities are identified. In general, join-queried entities in the entity-relationship model of the database that comprise a pre join frequency that exceeds a threshold, may be identified. Various techniques for identifying frequently join-queried entities will be described below in connection with
Referring to Block 230, a respective pre-join tag is selectively provided between respective join-queried entities in the entity-relationship model of the database, for example based upon the frequently join-queried entities that are identified. Thus, the frequently join-queried entities are linked in the entity-relationship model of the database. The operations of Blocks 210-230 may be performed, for example by a database modeling system 140.
Referring to Block 240, the database may then be generated or created, for example by a database creation system 150, using the pre-join tags that are included in the entity-relationship model of the database. Finally, at Block 250, queries may then be executed on the database that was generated using the pre join tags. The queries may be executed, for example, by the database query system 180 of
For example, at Block 310, query logs may be used to identify queries in which joined entities are used. Frequency of occurrence of these queries may then be determined, and those queries which have a frequency of occurrence that exceeds a threshold, such as a highest frequency occurrence, may be identified as candidates for linking by pre join tags. In other embodiments, as illustrated at Block 320, a non-relational database may be mined to identify frequently occurring structures therein, which then may be candidates for pre join tags. At Block 330, in a non-relational database, frequently join-queried entities may be identified from reverse key tables and/or at Block 340, frequently join-queried entities may be identified in a non-relational database from composite keys. At Block 350, in a relational database, runtime joins that are performed on the relational database in response to queries may be identified, and those runtime joins that are frequently performed, for example which exceed a given threshold, may be candidates for pre-join tags. Finally, at Block 360, other techniques may also be used to identify frequently join-queried entities.
More specifically, an “embedded structure” and a “composite key” are both structures that may be used by a NoSQL database to perform pre-join. However, not all NoSQL databases can perform an “embedded structure” and not all NoSQL databases can perform a “composite key”. Many NoSQL databases can perform an “embedded structure”, and a “composite key” may be used in column-based databases, such as HBase. Some embodiments may only choose one method in a NoSQL database.
An “embedded structure” is a structure that stores one entity into another. For example, assume the entity “Student” and the entity “Class”. They may appear as follows in a document database:
For a pre-join tag, it may be desirable to query the Student entity together with the Class entity, such as “get class with its related students”. In a NoSQL database, for example, a document database, the Student entity can be stored in the Class entity, as follows:
Since these two entities are stored together, by an embedded structure, the pre-join tag can be performed.
In another example, in HBase, each row has a row-key, and in an HBase table, all rows are sorted according to row keys. It is easier to get adjacent rows in one query in HBase. Using the example above, Class and Student entities are merged into one HBase table and then the Class id or name is stored into the row-key. Other information also may be stored in the row key, such as time stamp, etc., so it may be referred to as a “composite row key”. A composite row key can make a query easier. The table structure will be as shown in the following Table, which assumes Class A has id 1, B has 2, C has 3:
Referring to
A low level description of various embodiments described herein will now be provided, along with working examples. These examples shall be regarded as merely illustrative and shall not be construed as limiting the invention.
Specifically,
In contrast, when it is desired to query from multiple collections or entities in a non-relational database, a runtime join is generally not a practical choice, as it may be very inefficient to do a runtime join in non-relational databases. Non-relational databases are becoming increasingly popular. As was noted above, non-relational databases often have a hierarchical structure, as illustrated in the example of
It is often desirable to query from multiple collections or entities in a non-relational database. This query may be accomplished by merging one collection into another, to build a new physical collection in the non-relational database. For example, as illustrated in
Accordingly, as illustrated in
As was described in the High Level Description above, various embodiments described herein can selectively link join-queried entities, such as frequently join-queried entities, in an entity-relationship model of a database. In some embodiments, the selective linking is provided by defining a pre join tag, and selectively providing a respective pre-join tag between respective frequently join-queried entities in the entity-relationship model of the database. A specific example will now be provided.
A detailed explanation of an “ON clause” will now be provided. Since a join query is a query over two entities, an ON clause is a clause in a join query to figure out how to put these two entities together. For example, assume a Class entity with properties ID, Name and Type, and a Student entity with properties ID, Name, Age and In_Class_ID. A query may be performed over these two tables with a join query: “Select Class.Name as className, Student.Name as studentName from Student join Class on Student.In_Class_ID=Class.ID”. This join query will query over two tables and get student's name with its related class name. The ON clause is “on Student.In_Class_ID=Class.ID”. The On clause shows that Student entity and Class entity are put together. When they have the same value on Student.In_Class_ID and Class.ID, then Class.Name is selected as “className, Student.Name as studentName” as the result.
Referring back to
In yet another alternative, a pre join table may be built, for example as illustrated in
As shown in
Thus, to deal with join queries in relational databases, a join may be performed during the query (runtime join) or a join query may be performed by defining an N-N (multi-to-multi) relationship table. Unfortunately, however, since joins are not supported in non-relational databases, these non-relational databases may only have an opportunity to represent joins in a pre-join collection. Pre join collections may be established in non-relational databases manually. Unfortunately, this may be a cumbersome and incomplete process.
In sharp contrast, various embodiments described herein may allow an entity relationship model to be used on a different physical production environment, such as a non-relational database, by linking frequently join-queried entities in the entity-relationship model of the database. Thus, by adding pre-join characteristics to the entity-relationship model, the entity-relationship model may appear as illustrated in
To summarize, traditional entity-relationship models generally do not have the ability to include pre-join characteristics, so that they are not suitable to generate proper physical models that can automatically fit different physical environments with pre-join characteristics. Non-relational databases are widely used for data storage and data analysis. Non-relational databases generally are schema-free or schemaless. Thus, when data modelers need to migrate data between non-relational databases and a traditional relational database, they may need to define or extract the schema of the non-relational database manually. Moreover, existing transform tools for data and schema transform may ignore the data characteristics. However, since collections in non-relational database may have different characteristics, especially pre-join, these characteristics may have an important impact on the transformation. As schemaless data storage, non-relational databases may store relations between entities, but this relation may not be explicit. Non-relational databases may store the relationships as embedded collections, or may not store the relationships, but just may store the entities. Thus, when migrating data and structure from relational databases to non-relational databases, modelers may need to redesign the entity relationship structure to ensure efficiency of the target database. Accordingly, besides the data structure itself, pre-join characteristics may also be relevant for data and schema migration. Since traditional entity models do not include elements representing pre join characteristics, manual structure redesign may be needed to fit for the different environment when migrating.
In sharp contrast, various embodiments described herein can add a pre-join tag in the traditional entity-relationship model, as a new model element. The pre join characteristics can help to generate proper physical models that can automatically fit different physical environments. Data modelers can build models with pre-join data tags and/or extract the model from existing databases. The model can then be transformed to different physical models of databases.
Pre-join tags according to various embodiments described herein may be used during model definition (which may correspond to Block 210 of
In model definition, as was already described above, a database model such as illustrated in
Once the pre-join tags are defined (for example as was described at Block 210), frequently join-queried entities are then identified, for example as was described at Block 220. Frequently join-queried entities may be defined manually by modelers and/or may be generated automatically from relational or non-relational databases. Different techniques may be used to discover pre joined tags in relational and non-relational databases, as was previously illustrated in
For relational databases, pre-join tags may be generated by mining database query logs, as was generally described at Block 310 of
Other techniques may be used to generate a model from a non-relational database, as was described, for example, at Blocks 320-340 of
A frequent sub-structure, or frequently embedded structure, is a structure that frequently occurs in a NoSQL collection. For example, assume a NoSQL collection Class. Since NoSQL databases may be schemaless, there may be different structures in one NoSQL collection. Specifically, there may be some documents without sub-structure, like:
If documents with sub-structures occur frequently in a collection (the threshold may be set by users), a determination may be made that there is a frequent sub-structure in the collection.
Moreover, at Block 2250, a determination is made as to whether the database has a special structure, such as a reverse key table or a composite key for query. If so, at Block 2260, pre-join tags are built from the special structure. Specifically, some non-relational databases have reverse key tables for indexing and optimizing join queries. A mining process can be performed to recover the pre-join tags from these tables. Moreover, some non-relational databases may have composite keys for indexing and optimizing join queries. A mining process can be done to recover these pre join tags from these keys, as well. It will be understood that the operations of Blocks 2230 and 2250 may be performed out of the order shown in
Still other techniques may be used. For example, an entity-relationship model may be present and multiple non-relational databases may be related to this entity-relationship model, with no relational databases related to this entity-relationship model. In this case, pre-join tags may be mined from one or more non-relational databases, attached to the entity-relationship model, and then deployed for one or more of the other non-relational databases.
Finally, different strategies may be generated for data migration or schema building using pre-join tags. For example, when a model with pre join tag between two entities needs to be converted into a physical structure of a non-relational database, non-relational embedded structure can be built for query optimization, composite keys can be built, and/or two tables can be merged. In more detail, pre join tags may be used when an entity-relationship model of a relational database is applied to anon-relational database, as was illustrated, for example, in
Accordingly, by adding pre-join characteristics that represent join queries to the entity relationship model, computation and applications can be generated at the physical production environment data level and need to decide how to deploy the entity relationship model on physical environments of a relational or non-relational database.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) or combining software and hardware implementation that may all generally be referred to herein as a “circuit,” “circuitry,” “module,” “component,” and/or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product comprising one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer readable media may be used. The computer readable media may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an appropriate optical fiber with a repeater, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that when executed can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions when stored in the computer readable medium produce an article of manufacture including instructions which when executed, cause a computer to implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable instruction execution apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatuses or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting to other embodiments. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” and/or “including”, “have” and/or “having” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Elements described as being “to” perform functions, acts and/or operations may be configured to or otherwise structured to do so. As used herein, the term “and/or” or “/” includes any and all combinations of one or more of the associated listed items.
The corresponding structures, materials, acts, and equivalents of any means or step plus function elements in the claims below are intended to include any disclosed structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
Many different embodiments have been disclosed herein, in connection with the above description and the drawings. It will be understood that it would be unduly repetitious and obfuscating to literally describe and illustrate every combination and subcombination of these embodiments. Accordingly, all embodiments can be combined in any way and/or combination, and the present specification, including the drawings, shall support claims to any such combination or subcombination.
In the drawings and specification, there have been disclosed typical embodiments and, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the disclosure being set forth in the following claims.