The present invention relates to a method for storing information in a data processing system and, in particular, to a method for storing data records in a key-value database.
An existing problem in computer-based data processing systems is that data is being generated and consumed at unprecedented scale, primarily due to the wide adoption of mobile devices and the Internet of Things. Current systems are inadequately equipped to deal with this “big-data” phenomenon.
This problem of overwhelming amounts of data has prompted various novel approaches. For example, scalable NoSQL (Not Only SQL) database systems have been developed to manage ever-increasing data volume. What is needed is a system and method to improve the storing of data records in such database systems. The methods of storing records in the present invention provide a solution to these and other problems of the prior art.
In this invention, we describe our system and method for storing data records in a key-value database. A key-value database is one of the categories of NoSQL databases. Data records consisting of a key and a value are distributed across multiple data files in a database system. The terms “data file” and “file” are used interchangeably. The term “file number” means a serial number identifying a data file, normally starting from 1. The term “user” refers to a person or a client program that may insert, read, update, or delete data in the database system. The term “server” can represent a physical computer with CPU, memory, and permanent storage medium, or a virtual server instance in a cloud environment. In the database system, one or more servers may be deployed to receive data from a user or provide data to the user. In each server, one or more data files can be used to store key-value data records. The key in a data record uniquely identifies the record in the whole database system. The value in the data record contains all data in the record except the key. The key may consist of multiple data items, i.e., the key may be a composite key.
In the present invention we store data records in multiple data files on a computer server, which may be a stand-alone server or one of the servers in a cluster environment where multiple servers are deployed. The reason to use multiple data files to store data on a computer server may be one of the following: (1) in a modern storage system, such as a solid state drive (SSD) or other types of electronic non-volatile computer storage medium, concurrent data write and read operations on multiple files have an advantage over operations on a single file; (2) each file can be associated with a tag or a range, allowing precise and faster retrieval of data that has the same association; (3) multiple files provide more fault tolerance because when a file is corrupted, only that file is affected and the remaining files stay intact.
The foregoing and other objects, features and advantages of the present invention will be apparent from the following description of the invention and embodiments thereof, as illustrated in the accompanying figures, wherein:
Referring to schematic diagram
To fetch the data record associated with a key, the key is provided to the key router, which checks if the key exists in it. If the result is negative (the key does not exist), then a negative answer is returned to the user. If the result is positive, the key router provides the correct file number and directs the system to fetch data in the corresponding file. A file number can be uniquely mapped to a file name, for example, “user_n”, where n is the file number. The data file name can be in any format, as long as the file name is uniquely associated with the file number. The key router can be kept in volatile memory or in non-volatile memory. The key router also provides the data file number when searching for a data record.
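The fetch path described above can be sketched as follows. This is a minimal in-memory illustration, not the implementation of the invention: the class name `KeyRouter`, the method names, and the helper `locate` are hypothetical, and a Python dictionary stands in for whatever structure the key router actually uses. Only the “user_n” naming pattern comes from the description.

```python
from typing import Dict, Optional


class KeyRouter:
    """Hypothetical in-memory key router: maps a key to a file number."""

    def __init__(self, prefix: str = "user") -> None:
        self._table: Dict[bytes, int] = {}
        self._prefix = prefix

    def assign(self, key: bytes, file_number: int) -> None:
        # File numbers are serial numbers that normally start from 1.
        if file_number < 1:
            raise ValueError("file numbers start from 1")
        self._table[key] = file_number

    def lookup(self, key: bytes) -> Optional[int]:
        # None is the "negative answer": the key is not in the router.
        return self._table.get(key)

    def file_name(self, file_number: int) -> str:
        # One possible unique mapping from file number to file name.
        return f"{self._prefix}_{file_number}"


def locate(router: KeyRouter, key: bytes) -> Optional[str]:
    """Return the name of the data file holding the key, or None if absent."""
    file_number = router.lookup(key)
    if file_number is None:
        return None
    return router.file_name(file_number)
```

A caller would open the returned file and read the record from it; a `None` result is reported to the user as "key not found" without touching any data file.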
The purpose of comparing the number of bits is to reduce resource consumption because shorter hash keys in hash table T use less memory and storage space. In the hash table T, key-value data pairs are stored. Note that the key-value data pair is different from the original key-value pair to be stored in our database system. In the hash table T, the key or the hash key is the combined hash value N1, or the original key K1 in the case that the number of bits in K1 is less than the number of bits in N1. The value part associated with the hash key in T is a file number which is a non-zero natural number. The file number may use one byte to identify a total of 256 data files, or two bytes to identify a total of 256×256=65,536 data files, or three bytes to identify a total of 256×256×256=16,777,216 data files, or more bytes to identify more files.
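The two size decisions in this paragraph can be illustrated with a short sketch. The function names `choose_hash_key` and `file_number_width` are hypothetical, and the sketch assumes that K1 and N1 are available as byte strings, so their bit lengths are simply eight times their byte lengths.

```python
def choose_hash_key(k1: bytes, n1: bytes) -> bytes:
    """Use K1 as the hash key only if it has strictly fewer bits than N1."""
    return k1 if len(k1) * 8 < len(n1) * 8 else n1


def file_number_width(total_files: int) -> int:
    """Smallest number of bytes whose 256**width values cover total_files."""
    if total_files < 1:
        raise ValueError("at least one data file is required")
    width = 1
    while 256 ** width < total_files:
        width += 1
    return width
```

For example, a two-byte key is kept as the hash key when N1 is four bytes long, while an eight-byte key is replaced by the four-byte N1.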
Comparing the number of bits in key K1 and the number of bits in the combined value N1 may be performed online when each data record enters the system or offline when a schema of the database is initially designed. Here, the number of bits in N1 does not need to be counted explicitly. It can be determined from how Hash1, Hash2, and Hash3 are combined to construct N1. Normally a hash function generates a fixed-length hash value. For example, an MD5 hash function creates a 128-bit hash value. Determining the number of bits in N1 can be generally expressed with the following formula:
B(N1)=B(P(Hash1))+B(P(Hash2))+B(P(Hash3))
where P(Hash) means selecting all the bits or only part of the bits in Hash, and B(P) means the total number of bits in P. For example, all the bits in Hash1 may be selected, while only the first 16 bits in Hash2 are selected, and none of the bits in Hash3 are selected.
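The formula can be checked with a small worked example. The sketch below is illustrative only: the description does not say which hash functions produce Hash1, Hash2, and Hash3, so MD5, SHA-1, and SHA-256 are assumed here, and the names `select_bits`, `combine`, and `build_n1` are hypothetical. Concatenating the selected bits is also an assumption about how N1 is combined.

```python
import hashlib
from typing import Iterable, Tuple


def select_bits(digest: bytes, nbits: int) -> Tuple[int, int]:
    """P(Hash): keep the first nbits of a digest; returns (value, nbits)."""
    total = len(digest) * 8
    if not 0 <= nbits <= total:
        raise ValueError("nbits must be between 0 and the digest bit length")
    return int.from_bytes(digest, "big") >> (total - nbits), nbits


def combine(parts: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Concatenate the selected bit strings; returns (N1, B(N1))."""
    n1, bits = 0, 0
    for value, nbits in parts:
        n1 = (n1 << nbits) | value
        bits += nbits
    return n1, bits


def build_n1(key: bytes) -> Tuple[int, int]:
    # Assumed stand-ins for Hash1, Hash2, Hash3 (not named in the text).
    hash1 = hashlib.md5(key).digest()     # 128 bits, all selected
    hash2 = hashlib.sha1(key).digest()    # only the first 16 bits selected
    hash3 = hashlib.sha256(key).digest()  # no bits selected
    return combine([
        select_bits(hash1, 128),
        select_bits(hash2, 16),
        select_bits(hash3, 0),
    ])
```

With these selections, B(N1) = 128 + 16 + 0 = 144 bits for every key, so the comparison against the bit length of K1 can be made once at schema design time.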
It should be noted that when the original key K1 is used as the hash key in the hash table T, mapping K1 to file numbers is lossless, meaning the system can always find the correct file number for a given key K1 in a data record. However, when the combined hash N1 is used as the hash key in the hash table T, mapping the original record key K1 to file numbers may be lossy, meaning the system is not guaranteed to find the correct file number for a given key K1 in a data record. Lossy mapping can occur when the number of all possible values of key K1 is greater than the number of all possible values of the combined hash value N1, because two different keys may then collide on the same hash value. In certain application scenarios, lossy mapping is acceptable since some data records can be dropped while entering the database system. Storing all data records in the database is not strictly required. Nevertheless, one can always use more hash values to construct the combined hash value N1 in order to decrease the possibility of lossy mapping.
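The effect of hash width on lossy mapping can be illustrated with a deliberately tiny hash. In the sketch below, truncated SHA-256 is an assumed stand-in for the combined hash N1, and the names `short_hash` and `count_collisions` are hypothetical. With an 8-bit hash there are only 256 possible hash keys, so any set of 300 distinct record keys must contain at least 44 keys that collide with an earlier key.

```python
import hashlib
from typing import Dict, Iterable


def short_hash(key: bytes, nbits: int) -> int:
    """First nbits of SHA-256(key); an assumed stand-in for N1."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest, "big") >> (256 - nbits)


def count_collisions(keys: Iterable[bytes], nbits: int) -> int:
    """Count keys whose hash key is already taken by a different key."""
    seen: Dict[int, bytes] = {}
    collisions = 0
    for key in keys:
        h = short_hash(key, nbits)
        if h in seen and seen[h] != key:
            collisions += 1  # this record could not be mapped losslessly
        else:
            seen[h] = key
    return collisions


keys = [f"record-{i}".encode() for i in range(300)]
narrow = count_collisions(keys, 8)   # at least 300 - 256 collisions
wide = count_collisions(keys, 64)    # far more hash keys than records
```

Widening the hash, which corresponds to using more hash values when constructing N1, makes collisions for the same key set extremely unlikely.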
The file number assigned to the value part associated with a hash key in hash table T may be determined with one of the methods described below.
The hash table T has the capability of associating a key with a value. Hash table T can be replaced with other data structures that can associate keys with values. For example, a binary search tree or a B-Tree can be used to replace the hash table T.
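As a rough illustration of swapping the hash table for an ordered structure, the sketch below keeps keys in a sorted list and uses binary search for lookups. This sorted list is only a simple stand-in for the tree structures mentioned above, not a binary search tree or B-Tree implementation, and the class name `SortedRouter` is hypothetical.

```python
import bisect
from typing import List, Optional


class SortedRouter:
    """Key-to-file-number map backed by a sorted list and binary search."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._files: List[int] = []

    def assign(self, key: bytes, file_number: int) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._files[i] = file_number  # key already present: overwrite
        else:
            self._keys.insert(i, key)
            self._files.insert(i, file_number)

    def lookup(self, key: bytes) -> Optional[int]:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._files[i]
        return None
```

The interface is the same as for a hash table (assign a file number to a key, look it up later), which is all the key router requires of the underlying structure.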
In the previous descriptions, we have not described how data records are stored in the data files. Any method can be used to store data records in a data file. In one embodiment, new data records are simply appended to the end of the data file. Simply appending records to the end of the file is fast for data writes but slow for data reads, since finding a data record in the file may require a scan of the whole file. In another embodiment, a data file can be organized with a B-Tree or B+Tree structure so that new data records are inserted into the data file according to the rules of such a tree structure. In another embodiment, a data file can be organized with a SEA (Sorted Elastic Array) structure, a sparse array, so that new data records are inserted in the data file according to the rules of the SEA structure. In another embodiment, new data records can first be cached in memory and sorted by key in memory, and finally written to the data files in the same key order.
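The append-only embodiment and its read cost can be sketched as follows. The record layout (two 4-byte big-endian lengths followed by the key and value bytes) and the function names are assumptions made for this illustration; the description does not prescribe an on-disk format. An in-memory `io.BytesIO` buffer stands in for a real data file.

```python
import io
import struct
from typing import BinaryIO, Optional

# Assumed record header: key length and value length, 4 bytes each.
_HEADER = struct.Struct(">II")


def append_record(f: BinaryIO, key: bytes, value: bytes) -> None:
    """Write one record at the end of the file (fast: no searching)."""
    f.seek(0, io.SEEK_END)
    f.write(_HEADER.pack(len(key), len(value)))
    f.write(key)
    f.write(value)


def scan_for(f: BinaryIO, key: bytes) -> Optional[bytes]:
    """Read records from the start until the key is found (slow: linear)."""
    f.seek(0)
    while True:
        header = f.read(_HEADER.size)
        if len(header) < _HEADER.size:
            return None  # reached the end without finding the key
        key_len, value_len = _HEADER.unpack(header)
        record_key = f.read(key_len)
        record_value = f.read(value_len)
        if record_key == key:
            return record_value


data_file = io.BytesIO()  # stands in for a data file such as "user_1"
append_record(data_file, b"k1", b"v1")
append_record(data_file, b"k2", b"value two")
```

Each write costs one seek and one sequential write, while a lookup for a key near the end of the file, or for a missing key, reads every record before it. The tree-based and sorted embodiments trade some write cost for faster reads.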
For reading, updating, and deleting a data record, the system can use the key router to find the file number and the corresponding file and then read, update, or delete the record in the data file. Locating the data file for a record key follows the same process as storing new records: the hash key is constructed for hash table T, and the value part stored under that hash key is retrieved as the file number. Deletion of records in data files can affect the decision in selecting a data file for storing new data records. In one embodiment, if a significant portion of records, for example 30%, have been deleted in a data file, then this file may be used as the next file or active file for storing new data records. In another embodiment, the data file that contains the least number of data records among all existing data files and has a total number of data records that is significantly less than other files, for example 20%, can be used as the next file or active file to store the new records. If such criteria are not met, then the presented methods in
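The two file-selection embodiments above can be sketched as follows. The `FileStats` type and the function names are hypothetical. The 30% and 20% figures are the examples given in the text; interpreting "significantly less than other files" as "at least 20% fewer live records than the next-smallest file" is an assumption of this sketch. Both functions return `None` when their criterion is not met, leaving the choice to whatever other selection method is in use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStats:
    file_number: int
    live_records: int
    deleted_records: int


def pick_by_deletion_ratio(files: List[FileStats],
                           threshold: float = 0.30) -> Optional[int]:
    """Pick the file with the highest deleted fraction at or above threshold."""
    best, best_ratio = None, threshold
    for f in files:
        total = f.live_records + f.deleted_records
        if total == 0:
            continue  # an empty file has no deletion ratio
        ratio = f.deleted_records / total
        if ratio >= best_ratio:
            best, best_ratio = f.file_number, ratio
    return best


def pick_least_populated(files: List[FileStats],
                         margin: float = 0.20) -> Optional[int]:
    """Pick the smallest file if it is at least `margin` below the next one."""
    if len(files) < 2:
        return None
    ordered = sorted(files, key=lambda f: f.live_records)
    smallest, runner_up = ordered[0], ordered[1]
    if runner_up.live_records == 0:
        return None  # nothing meaningful to compare against
    if smallest.live_records <= (1 - margin) * runner_up.live_records:
        return smallest.file_number
    return None
```

For instance, with files holding 100, 70, and 95 live records, the second function selects the file with 70 records, because 70 is more than 20% below 95.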
Multiple data files can be grouped by different entities. In relational database terms, the entities can be represented by “tables”. In NoSQL database terms, the entities can be represented by “collections”. For example, a “member” table or collection can have its own group of data files for storing membership data records. A “device” table or collection can have its own group of data files to manage all physical devices in an application scenario. The records, record keys, and key routers in different groups of data files are independent of one another.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.
While the invention has been disclosed in connection with preferred embodiments shown and described in detail, modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present invention should be limited only by the following claims.
The present Application claims the benefit of U.S. Provisional Application No. 62/606,918, filed Oct. 13, 2017 by the same inventor as the present application and directed to the same invention and containing the same disclosure as the present application.
Number | Date | Country
---|---|---
62606918 | Oct 2017 | US