a. Field
The instant invention relates to database systems and methods of processing database data.
b. Background
In other instances, different types of databases are required to match application needs. For example, an in-memory database may be needed for fast access to specific data. This requires the application developer to write program code to input data into two or more databases while attempting to keep the data in sync. Data must then be retrieved from the potential plurality of database types to access the data in the most efficient way. This again requires substantial programmatic work, and the application developer must understand each type of database engine, how it is used, and define specific fixed data structures for each.
In other implementations, developers must use complex, cumbersome and slow extract/transform/load processes between source and target databases, or even within a single database.
In yet other prior implementations, database views are supported within a single database engine. These can be used to generate custom data structures, but perform slowly and burden the entire database since they are performed in a single database.
The foregoing and other aspects, features, details, utilities, and advantages of the present invention will be apparent from reading the following description and claims, and from reviewing the accompanying drawings.
The embodiments set forth in the drawings are illustrative and exemplary in nature and not intended to limit the subject matter defined by the claims. The following detailed description of the illustrative embodiments can be understood when read in conjunction with the following drawings, where like structure is indicated with like reference numerals and in which:
Outputs from the one or more application client(s) 102 and the one or more source databases or files 104 are transformed (by pull or push) into a stream of data 105, optionally grouped into one or more transactions. Transactions are supported across one or more databases or partitions, guaranteeing that the entire transaction succeeds or fails as an atomic operation. The stream 105 is then coupled to one or more stream processing nodes for subsequent processing 106. The stream processing nodes 106 process data from the one or more source database streams 105 and, optionally, store one or more processing results in a database view 108. The stream processing nodes 106 may, for example, be arranged in any manner and may perform any data processing on data received from the one or more application clients 102 and/or the one or more source databases 104.
In this particular implementation, the database system 500 further includes a schema optimizer 114 and a query optimizer 116. The optimizers 114, 116 are used as a Query Optimizer, which decides which View (e.g., the best or a relatively better one) to use as a source of data for a query, and/or as a Schema Optimizer, which identifies and recommends View schemas to match application needs. Examples of Query and Schema Optimizers 114, 116 are further described below.
In one implementation of the database system of
In various implementations, such as the one shown in
Any of the databases of the database system 100 may comprise sharded or unsharded databases. Streams can be sharded to improve parallel distributed processing performance, with each Stream processing node handling a subset of the stream data based on a shard key. Views can also be sharded together with or separately from the stream sharding; where a View is sharded together with a stream using the same shard key, high-performance local access from the stream to the view is supported, avoiding network delays to retrieve data from other servers on the network. Target external databases can also be sharded.
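The co-sharding of a stream and its View by a common shard key can be illustrated with a brief sketch (not part of the claimed implementation; the function names, hash choice, and data are illustrative assumptions):

```python
import hashlib

NUM_SHARDS = 4

def shard_of(shard_key: str) -> int:
    # Consistent placement: the same key always hashes to the same shard,
    # so a View sharded with the stream stays local to its processing node.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

stream_partitions = {s: [] for s in range(NUM_SHARDS)}
view_partitions = {s: {} for s in range(NUM_SHARDS)}

for order_id, total in [("order-1", 10.0), ("order-2", 7.5), ("order-3", 3.2)]:
    s = shard_of(order_id)
    stream_partitions[s].append(order_id)            # sharded stream
    view_partitions[s][order_id] = {"total": total}  # View co-sharded with it

# A node handling shard s finds the View data for its stream records locally,
# with no network hop to another server:
for s, keys in stream_partitions.items():
    for k in keys:
        assert k in view_partitions[s]
```

Because both sides hash the shard key identically, the stream-to-view lookup never leaves the node, which is the "high performance local access" property described above.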
Any record in the database can be looked up by the key index in operation 206, or by any custom index or other data structure to enable navigation of the append-only log. The key index or other structure is used to find the position of the record in the append-only log file to look up the record.
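As a hedged illustration of the lookup described above (the class name, storage format, and data are assumptions for this sketch, not from the specification), a key index over an append-only log might look like:

```python
import json

class AppendOnlyLog:
    def __init__(self):
        self.log = []          # stand-in for the on-disk append-only log file
        self.key_index = {}    # key -> position of the latest record in the log

    def put(self, key, value):
        # Writes only ever append; the index is updated to the new position.
        position = len(self.log)
        self.log.append(json.dumps({"key": key, "value": value}))
        self.key_index[key] = position

    def get(self, key):
        # The key index yields the record's position in the log; older
        # versions remain in the log but are unreachable through the index.
        position = self.key_index.get(key)
        if position is None:
            return None
        return json.loads(self.log[position])["value"]

db = AppendOnlyLog()
db.put("person:1", {"name": "Ada", "state": "CA"})
db.put("person:1", {"name": "Ada", "state": "NY"})  # update appends, never rewrites
print(db.get("person:1"))  # latest version, located via the key index
```

A custom index (e.g., by city or state) would work the same way, mapping its own key to positions in the same log.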
In operation 208, an ordered queue including the order in which transactions were written is exposed in addition to using a key-value lookup as described in operation 206. A special client called a consumer can read data in order of operations using this facility. This allows for stream processing on changes to the database in the order performed. For example, the ordered queue exposed in operation 208 may be used to accumulate a sum of records (e.g., persons) located in a particular geographic region. In another implementation, another database may be constructed as a view of a first database, sorted by a different key (e.g., city, state, etc.). The additional database can contain the entirety of the original (primary) database or can contain all or a portion of the information in the record of the original database, or can contain a manipulated, enriched or transformed version of the data in the original database. In another implementation, the database may be replicated to a copy for reliability, within a single transaction to ensure that no data is lost. These implementations, however, are merely examples of types of stream processing that may be performed on changes to the database. Other variations are contemplated.
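The stream-processing variations described above can be sketched, under assumed record and field names, as a consumer that reads the ordered queue and maintains both an aggregate and a view keyed by a different field:

```python
changes = [  # ordered queue of committed operations (operation 208)
    {"key": "person:1", "value": {"name": "Ada", "city": "Boston"}},
    {"key": "person:2", "value": {"name": "Bob", "city": "Austin"}},
    {"key": "person:3", "value": {"name": "Cy",  "city": "Boston"}},
]

boston_count = 0
view_by_city = {}       # derived database keyed by a different key (city)

for change in changes:  # consumer processes changes in order of operations
    person = change["value"]
    if person["city"] == "Boston":
        boston_count += 1  # accumulate a sum of records in a region
    view_by_city.setdefault(person["city"], []).append(change["key"])

print(boston_count)            # running count for the region
print(view_by_city["Austin"])  # secondary view entry
```

The derived view here holds only record keys; as the text notes, it could equally hold full copies or enriched/transformed versions of the original records.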
The ordered queue enables synchronous processing with guaranteed transactions in the primary store and further enables downstream synchronous or asynchronous processing on a streaming basis, such as for real-time or near-real-time processing and manipulation of data. Thus, databases may be chained together, building different views of the data in the downstream instances as a result of stream processing. Further, the stream may also be used to trigger events, such as when a threshold is hit for some value. Since the databases all have access to one another, and to the streams, values can be joined together as needed during processing, from multiple streams and/or views, providing a flexible database system.
In this implementation, at least two interfaces are provided for the system: (1) a key-value or sorted key query facility or custom structure to meet application needs and (2) a consumer client of the ordered sequence of the updates as a reliable queue. The interfaces are exposed together to provide for stream processing in the database. This provides multiple access paths to a single database store to an application developer or other components in the system, both in order of operation or by specific keys or data structures as needed.
At stream processing node 506(B), the node reads an “Order” value from the relational database 504(A) for the “Shipped Order ID” received from stream processing node 506(A). The stream processing node 506(B) processes the data received from the stream processing node 506(A) and the relational source database 504(A) and creates a second database view 508(B). The second database view 508(B) may, for example, include a custom database view that is persistent and query-able, organized by the “Shipped Order ID” key value. Similarly, stream processing node 506(C) also receives the “Shipped Order ID” value from the stream processing node 506(A) and reads “Order Lines” values from the relational source database 504(A) based on the “Shipped Order ID” value received from the stream processing node 506(A). The stream processing node 506(C) also creates a third database view 508(C). The third database view 508(C) may, for example, include a custom database view that is persistent and query-able, organized by a “Shipped Order ID” key value.
The views for 508(B) and 508(C) can be read by a consumer to create a reliable relay stream to another data center or cloud environment 512, allowing data to be easily transported to remote locations or processing facilities. The stream processing nodes 506(B) and 506(C) may also write data to another view 510(A), such as an Aggregate Stream Database 510(A) and concurrently write to an external database such as a data warehouse external target 510(B). In one implementation, for example, the aggregate stream database 510(A) may include one or more summaries of records created by one or more of the stream processing nodes 506.
In one implementation, the database system 500 provides processing identifiers at one or more of the stream processing nodes to allow for consistent reads even in the event of asynchronous processing being performed by one or more of the stream processing nodes 506. In this implementation, for example, a timestamp, a general sequence ID or other identifier may be added to the processing results at each step to guarantee consistent reads of data throughout the system 500.
As further shown in
Where append-only log files are used, for example, consumer-grade non-volatile data storage such as non-volatile Flash data storage may be used in one or more of the DboD's in a commercial application because the number of re-write cycles to the data storage is relatively low compared with a corresponding system in which non-volatile data storage is repetitively re-written.
In one particular implementation, a custom high-level declarative definition language may be implemented for creating and managing processing streams within the database systems described herein. Use of such a language alleviates the need for a programmer to write custom complex code, reducing and simplifying the work required to create a stream flow and views. For example, the declarative definition language may comprise one or more statements for creating a stream from one or more sources (e.g., source files, databases, other streams, views or the like), creating a view (query-able or non-query-able), selecting one or more data structures for the view, and/or identifying an output external target for the process and/or view created.
One programming example for implementing the stream processing node 506(A) and view 508(A) shown and described above with reference to
In this example, a stream is first created via a “CREATE STREAM” command in which shipped orders from the binary log 504(B) are pulled to create the stream and the view 508(A) is created using shipped order IDs from customer orders.
Next, the “Get Order” stream processing node 506(B) and view 508(B) are implemented as follows:
In this example, a stream for stream processing node 506(B) is created via the “CREATE STREAM” command, in which shipped order IDs are received from stream processing node 506(A) and order data is retrieved from the MySQL relational database 504(A). View 508(B) is created via a “VIEW” command, and the shipped order IDs and order data are stored as defined.
The “Get Order Lines” stream processing node 506(C) and view 508(C) are implemented as follows:
In this example, a stream for stream processing node 506(C) is created via the “CREATE STREAM” command, in which shipped order IDs are received from stream processing node 506(A) and order line data is retrieved from the MySQL relational database 504(A). View 508(C) is created via a “VIEW” command, and the shipped order IDs and order line data are stored as defined.
In one implementation, a custom stream could be added here to purge one or more source databases.
Other processing including providing information to external targets, such as the aggregate stream 510(A) and view and the external data warehouse 510(B) may be implemented, such as by the following:
    INTERVAL 1 MINUTE AS 1_MINUTE)
      AS order_date_dim,
    product_id,
    SUM(line_total)
    (SELECT line_number, product_id, qty, unit_price, line_total
       FROM s_order_line
       WHERE order_id=s_order.order_id)
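The listing fragment above appears to bucket shipped order lines into one-minute intervals and sum line totals per product for the aggregate stream 510(A). A rough, illustrative equivalent of that aggregation (field names and sample data are assumptions, not from the specification) is:

```python
from collections import defaultdict
from datetime import datetime

order_lines = [  # illustrative rows joining s_order and s_order_line
    {"order_ts": datetime(2015, 2, 3, 12, 0, 15), "product_id": "p1", "line_total": 5.0},
    {"order_ts": datetime(2015, 2, 3, 12, 0, 45), "product_id": "p1", "line_total": 2.5},
    {"order_ts": datetime(2015, 2, 3, 12, 1, 10), "product_id": "p1", "line_total": 1.0},
]

totals = defaultdict(float)
for line in order_lines:
    # Truncate the timestamp to a one-minute interval (the "1_MINUTE" bucket)
    minute = line["order_ts"].replace(second=0, microsecond=0)
    totals[(minute, line["product_id"])] += line["line_total"]

print(totals[(datetime(2015, 2, 3, 12, 0), "p1")])
```

Each (interval, product) key accumulates SUM(line_total), mirroring the grouped aggregate the declarative fragment expresses.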
Optimizers (e.g., schema and query optimizers 114 and 116 shown in
In some implementations, a self-optimizing database machine is provided in which data structures are evolved (e.g., added, dropped and changed) and/or evolving data structures are suggested in response to database commands received from a client at a database. Further, algorithms may be pushed to one or more processors located at a drive and performed in parallel (e.g., via vector or other processing) to increase operational speed of a database. In one particular implementation, the new or modified database structures and/or algorithms may be compiled into object code (e.g., native low-level assembler code that can be run directly on the processor) to be executed on a processor co-located or embedded in a storage device. In another implementation the new or modified database structures and/or algorithms may be dynamically loaded into the chip pathways as custom gate arrays (e.g., in an FPGA) at a component of a database system (e.g., a processor co-located or embedded in a storage device storing a shard or a collection of micro-shards of a database).
In one implementation, a dynamic database system is provided including a non-transitory computer readable medium adapted for storing database data, wherein one or more data structures of the database or a stream view of the database is modified or created in response to monitoring database queries.
In the particular example implementation shown in
In this implementation, the Manage server (dbS/Manage) 728 accumulates statistics on queries run against the database cluster 734, the statistics being collected by the Client (dbS/Client) 724 and/or by monitoring a database transaction log.
By analyzing the query statistics, the Manage server 728 may monitor for such things as repetitive or slow queries issued to one or more of the databases 734. Statistics such as measurement of repetitive or slow queries are then used to feed an algorithm, such as a heuristic algorithm, to generate or recommend database structures that best fit the query needs. In this implementation, the Manage server 728 will evaluate query patterns, query frequency, average execution time for a query pattern, maximum/minimum execution time for a query; any of these statistics may be used in determining the optimum storage structure for the data.
Where the database queries in the stream are not resolved or completed satisfactorily (e.g., within a predetermined time period, within a predetermined efficiency, etc.), the Manage server 728 can dynamically determine a new data structure for the database or for a stream view (e.g., StreamView) in order to provide a new data structure that optimizes or otherwise improves the performance of the database query. The Manage server 728, for example, may dynamically determine and further dynamically change a data structure of a database and/or a stream view, or dynamically create a new stream view, to optimize or otherwise improve the operation of one or more database commands. The system 720 will automatically rebuild the data in the new structure. The Manage server 728 may also or alternatively dynamically determine a new or revised data structure for the database, stream view and/or a new stream view and identify or suggest a new stream view that is optimized or otherwise configured to improve the operation of the one or more database queries. Thus, the dynamic data structures implement a learning algorithm that constantly improves database performance by matching data structures to the queries that are made of the database.
In contrast to a query optimizer, which optimizes or otherwise suggests or provides a different query, the Manage server in this implementation is configured to alter or configure a new data structure (e.g., in the database or a stream view) to increase the speed, efficiency, or other operating parameter of a database command. This is done by creating one or more structures that match the queries being sent to the database cluster. In this manner, the actual data structure of a database or stream view is altered dynamically (or suggested dynamically) to improve one or more operating parameters of a database or stream view based on an analysis of one or more database queries. For example, where a particular query is repetitively received and fails to achieve a threshold or other requirement of a database (e.g., being performed within a time threshold), a revised database structure may be determined that increases the performance of the database with respect to future iterations of that (or similar) database command issued to the database and/or one or more stream views.
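The statistics-driven heuristic described above can be sketched as follows. This is a hedged illustration only: the thresholds, function names, and the choice of recommending an index-like structure are assumptions for the sketch, not the specification's algorithm.

```python
from collections import defaultdict

# Per-query-pattern statistics of the kind the Manage server accumulates:
# repetition count and total elapsed time.
stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0})

def record_query(pattern: str, elapsed_ms: float) -> None:
    s = stats[pattern]
    s["count"] += 1
    s["total_ms"] += elapsed_ms

def recommend_structures(min_count=100, max_avg_ms=50.0):
    # Recommend a new structure for any pattern that is both frequent
    # (repetitive) and slow on average.
    recs = []
    for pattern, s in stats.items():
        avg = s["total_ms"] / s["count"]
        if s["count"] >= min_count and avg > max_avg_ms:
            recs.append(f"structure matching: {pattern}")
    return recs

for _ in range(150):  # a repetitive, slow query pattern
    record_query("SELECT * FROM orders WHERE customer_id = ?", 80.0)
record_query("SELECT * FROM orders WHERE order_id = ?", 2.0)  # fast, rare

print(recommend_structures())
```

Only the frequent, slow pattern triggers a recommendation; a production heuristic would also weigh maximum/minimum execution times and query shape, as the text describes.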
In various implementations of systems, such as the systems 700, 720 of
One or both of the systems shown in
Replication in systems, such as systems 700, 720 shown in
Further, advanced sharding techniques, such as shard stripes, may be used to eliminate or minimize re-balancing. In one implementation, for example, shard striping may work with a monotonic shard key and a consistent hash and/or an arbitrary shard key and a consistent hash. An example of shard stripes is provided below with reference to
In one implementation, data is written to a reliable replication or transaction log as the first step. This stores the actual data, including new, updated or deleted data in a single message in the reliable collection of replication servers (StreamEngine). The data is marked by a position within a shard, micro-shard, and file, such that it can be accessed rapidly when required after writing. In other implementations, this can be done by a sequence number of transactions within the stream, or other unique identifier. This is called the data producer stream. One or more data consumers subsequently read the data stream for one shard, micro-shard, or a collection of shards or micro-shards. The consumer then performs any operations on the data required to construct a dynamic data structure, or modify data within an existing dynamic data structure. The dynamic data structures are the stream views (StreamView) in this implementation.
In a specific implementation of the stream engine, network multiplexing, sharding, micro-sharding, and reliability are combined to provide efficiency and reliability in writing to the stream. Network multiplexing sends one or more data messages to the stream in a batch, and subsequently receives a batch of acknowledgements in return, indicating that the data has been written to the stream successfully and reliably. The data producer can write to one or more shards or micro-shards within the stream for partitioning of the data. Reliability is provided by writing any stream message to at least two stream engine servers before acknowledging the write to the client.
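The batched, replicated write path just described can be sketched in miniature (the in-memory "servers" and message names are stand-ins; a real stream engine would replicate over the network before acknowledging):

```python
servers = [[], []]  # two stream engine replicas (in-memory stand-ins)

def write_batch(messages):
    # Network multiplexing: the producer submits a batch of messages and
    # receives a batch of acknowledgements in return.
    acks = []
    for msg in messages:
        for server in servers:   # write to at least two servers...
            server.append(msg)
        acks.append({"msg": msg, "ack": True})  # ...before acknowledging
    return acks

acks = write_batch(["m1", "m2", "m3"])
print(len(acks), len(servers[0]), len(servers[1]))
```

Each acknowledgement is issued only after the message exists on both replicas, which is the reliability guarantee the text attributes to the stream engine.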
A particular type of data in the stream may produce multiple stream views, each organized to optimize a specific query or queries. The stream view is distributed based on an optimized (or improved) pattern to provide improved performance for various application queries. This implementation ensures that all data is reliable, as the stream engine creates multiple copies of the data for each write. The processing of data by stream consumers into stream views can be performed on an asynchronous basis (i.e., the client does not wait for the stream view to be updated) or on a synchronous basis (i.e., the client does wait for the stream view to be updated). Each stream view maintains its position, for example, based on a last-acknowledged offset to an incoming log.
The system supports concurrent reads from the database, even if asynchronous updates to stream views are used. This is done by a) reading the data requested by the query from the stream view; b) reading the stream view position within the stream; and c) appending or applying any additional data that may reside in the stream that has not yet been applied to the stream view.
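Steps (a) through (c) above can be sketched directly (the offset scheme, key names, and values here are illustrative assumptions):

```python
stream = [  # ordered log of committed updates: (offset, key, value)
    (0, "k1", "v1"),
    (1, "k2", "v2"),
    (2, "k1", "v1-updated"),
]

view = {"k1": "v1", "k2": "v2"}  # (a) the materialized stream view...
view_position = 2                 # (b) ...which has applied offsets < 2

def consistent_read(key):
    value = view.get(key)  # (a) read the requested data from the view
    # (c) apply any stream entries the view has not yet absorbed
    for offset, k, v in stream:
        if offset >= view_position and k == key:
            value = v
    return value

print(consistent_read("k1"))  # reflects the update the view has not applied
```

Even though the view lags asynchronously at offset 2, the read returns the latest committed value by layering the unapplied stream tail on top of the view's answer.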
A stream view comprises a collection of one or more maps, which are data structures used to create a dynamic data structure. The maps are key-value pairs, or lists of data, and can be organized using any number of facilities provided for data storage optimization. These facilities can include indexes, customized views of the data, bitmap indexes, and column store formats. Thus, a stream view can be created from one or more maps, each with a specific type of internal data structure, matching the needs of the application as closely as possible.
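A stream view built from multiple maps can be sketched as a primary key-value map plus a secondary map acting as an index on another field (the class shape and field names are assumptions for illustration):

```python
class StreamView:
    def __init__(self):
        self.by_id = {}     # primary key-value map
        self.by_state = {}  # secondary map: an index on the "state" field

    def apply(self, key, record):
        # Keep both maps consistent: remove the old index entry, if any,
        # before installing the new record.
        old = self.by_id.get(key)
        if old is not None:
            self.by_state[old["state"]].discard(key)
        self.by_id[key] = record
        self.by_state.setdefault(record["state"], set()).add(key)

view = StreamView()
view.apply("p1", {"name": "Ada", "state": "CA"})
view.apply("p2", {"name": "Bob", "state": "CA"})
view.apply("p1", {"name": "Ada", "state": "NY"})  # update moves the index entry

print(sorted(view.by_state["CA"]))  # only "p2" remains under CA
```

Additional maps (bitmap indexes, column-store layouts, etc.) would be maintained the same way: every applied change updates each map the view is composed of.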
Dynamic changes to data structures (automated or suggested) can be used to provide a dynamic database and/or stream views for executing database commands in a relatively efficient manner. In many situations, for example, a database is not static but dynamic: data and usage patterns change over time, and new requirements for the database emerge. In the present application, various techniques are provided to dynamically change and/or suggest changes to the database or stream view data structure(s) to match an application over time. The generated data structures can also be sharded (partitioned) to allow for scaling across many replication or shard servers, determining an optimized partitioning scheme that matches the application usage of the database based on query analytics as described above. Thus, existing data structures may be modified, dropped and/or replaced, and new data structures added, so that flexible data structures as described herein can be designed to meet the needs of one or more applications. A dynamic engine can be designed around Big Data streaming with built-in, automated scalability and reliability. In various implementations, for example, one or more components of the database may monitor a stream of database queries or transaction logs to provide the dynamic adjustment of data structures.
In one implementation, for example, one or more database structures, such as relational sharding, micro-sharding, partitioned data, partitioned indices or other data structures, may be restructured on the fly (or database restructuring may be suggested for review by a database manager). For example, data structures and partitioning of data structures of the database or stream views may be restructured to increase or optimize single-shard actions relative to multi-shard actions for the database queries received. In one implementation, for example, the system may analyze a database shard stream; summarize, aggregate or otherwise process information from the stream; and then optimize one or more data structures of the database and/or shard streams. Thus, the application requirements can be matched more closely, providing optimized performance and efficiency of the database system.
In one particular implementation, the analysis and/or revised programmatic code to access dynamic database structures can be performed at various locations/levels of a database system to increase the operational efficiency of the operations. The results are stored as compiled low-level access procedures that perform one or many operations required to read from or write to a dynamic database structure. The procedure code to access dynamic database structures, for example, may be located (e.g., compiled into object code) to run on a processor co-located with a drive of a database. In one particular implementation, the procedure code algorithms (e.g., aggregation, vector processing, etc.) may be loaded into custom gate arrays in the chip itself, such as at an FPGA chip located at a particular component of a database system where the data structure is to be stored. Compiling, for example, may be implemented via a script, such as JavaScript, or other scripts to generate or alter a data structure on the fly.
The dynamic procedure algorithm used to access a dynamic data structure, for example, comprises a group of multiple database operations, such as the following:
Heuristics may also be used to analyze monitored queries from one or more database streams to determine a dynamic database structure in response to information included in a monitored database stream. In this way, any number of potential dynamic database structures can be evaluated in order to find the best performing option to satisfy queries from the stream.
Monitoring a database stream may be used to determine an optimized data structure, such as by monitoring one or more queries that are repetitively submitted to one or more databases (e.g., shard, micro-shards or other portions of the database).
Micro-shards may also be used along with parallel replication and/or parallel processing of database operations.
In this example process, a user defines a schema as tables or object models in operation (1). An application performs queries in operation 902. An application client, such as shown in
If a particular query (e.g., a repetitive query submitted to the database in a stream of database commands) is not performing well (e.g., failing to be performed within a predetermined threshold range of response times or the like), the console server evaluates statistics such as by using heuristics, algorithms or other analytical techniques to create, modify or eliminate a stream view data structure in operation 908.
Where a particular query is not efficiently obtaining a result from a database, for example, a console server can use heuristic algorithms or other techniques to evaluate the query and the existing database structure. The console server can then determine a modified or new data structure of a stream view for optimally handling the repetitive query via a new or modified stream view. If an existing stream view is no longer warranted, such as due to changing database data, conditions or other factors, the out-of-date stream view may also be eliminated. If a new or modified stream view is determined to satisfy acceptable processing characteristics, it is used in operation 912 to process the query and future iterations of that query (or suggested as a possible data structure modification for future implementation).
In some implementations of the process, the process can be implemented in a cyclical manner so that, after completion of operation 912 where a new stream view is defined and used, or after an existing stream view is modified and used or discarded, the process can loop back to operation 904 to monitor new queries. In other implementations, multiple processes can be performed in parallel or in series.
In one particular implementation, for example, the server 950 comprises a processor 952, input/output hardware 954, a non-transitory computer readable medium 956 configured to store data of a database, a non-transitory memory 958, and network interface hardware 960. Each of these components is operably coupled via a local interface 962, which may be implemented as a bus or other interface to facilitate communication among the components of the server 950.
The non-transitory computer-readable medium component 956 and memory 958 may be configured as volatile and/or nonvolatile computer readable media and, as such, may include random access memory (including SRAM, DRAM, and/or other types of random access memory), flash memory, registers, compact discs (CD), digital versatile discs (DVD), magnetic disks, and/or other types of storage components. Additionally, the non-transitory computer readable medium component 956 and memory component 958 may be configured to store, among other things, database data and analysis and/or revised programmatic code to access dynamic database structures. The analysis and/or revised programmatic code to access dynamic database structures, for example, can be performed at various locations/levels of a database system to increase the operational efficiency of the operations. In one particular implementation, for example, the analysis and/or revised programmatic code to access dynamic database structures may be stored as compiled low-level access procedures that perform one or many operations required to read from or write to a dynamic database structure. The procedure code to access dynamic database structures, for example, may be located (e.g., compiled into object code) to run on a processor, such as the processor 952, co-located with a drive of a database. In one particular implementation, the procedure code algorithms (e.g., aggregation, vector processing, etc.) may be loaded into custom gate arrays in the chip itself, such as at an FPGA chip located at a particular component of a database system where the data structure is to be stored. Compiling, for example, may be implemented via a script, such as JavaScript, or other scripts to generate or alter a data structure on the fly.
The processor 952 may include any processing component configured to receive and execute instructions (such as from the memory component 958). The input/output hardware 954 may include any hardware and/or software for providing input to the computing device 950, such as, without limitation, a keyboard, mouse, camera, sensor, microphone, speaker, touch-screen, and/or other device for receiving, sending, and/or presenting data. The network interface hardware 960 may include any wired or wireless networking hardware, such as a modem, LAN port, wireless fidelity (Wi-Fi) card, WiMax card, mobile communications hardware, and/or other hardware for communicating with other networks and/or devices.
It should be understood that the memory component 958 may reside local to and/or remote from the computing device 950 and may be configured to store one or more pieces of data for access by the computing device 950 and/or other components. It should also be understood that the components illustrated in
Although implementations of this invention have been described above with a certain degree of particularity, those skilled in the art could make numerous alterations to the disclosed implementations without departing from the spirit or scope of this invention. All directional references (e.g., upper, lower, upward, downward, left, right, leftward, rightward, top, bottom, above, below, vertical, horizontal, clockwise, and counterclockwise) are only used for identification purposes to aid the reader's understanding of the present invention, and do not create limitations, particularly as to the position, orientation, or use of the invention. Joinder references (e.g., attached, coupled, connected, and the like) are to be construed broadly and may include intermediate members between a connection of elements and relative movement between elements. As such, joinder references do not necessarily infer that two elements are directly connected and in fixed relation to each other. It is intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative only and not limiting. Changes in detail or structure may be made without departing from the spirit of the invention as defined in the appended claims.
This application claims the benefit of U.S. provisional patent application No. 61/988,861 entitled “Database Device and Processing of Data in a Database” and filed on May 5, 2014, which is hereby incorporated by reference in its entirety as if fully set forth herein. This application is also a continuation-in-part of U.S. patent application Ser. No. 14/613,356 entitled “Dynamic Modification of a Database Data Structure” and filed on Feb. 3, 2015, which claims the benefit of U.S. provisional application No. 61/935,301 entitled “Dynamic Modification of a Database Structure” and filed Feb. 3, 2014, each of which is hereby incorporated by reference as though fully set forth herein.
Provisional application data:

Number | Date | Country
---|---|---
61/988,861 | May 2014 | US
61/935,301 | Feb 2014 | US

Related U.S. application data:

Relation | Number | Date | Country
---|---|---|---
Parent | 14/613,356 | Feb 2015 | US
Child | 14/704,821 | | US