This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2018-69864, filed on Mar. 30, 2018, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a data compression method, an apparatus for data compression, and a non-transitory computer-readable storage medium for storing a program.
For data storage in a relational database management system (RDBMS), an N-array storage model or a decomposition storage model is used. Meanwhile, as for a document DB storing semistructured data such as JavaScript (registered trademark) object notation (JSON) and extensible markup language (XML), the N-array storage model is usually used.
As a related art, there has been proposed a technology of inferring a schema of semistructured data, dynamically generating a cumulative schema, and combining the inferred schema with the cumulative schema.
As a related art, there has been proposed a technology of dividing attribute-specific data into files to be held, and holding a data structure as schema information.
As a related art, there has been proposed a technology of detecting a delimiter from a specified region, and coding a data string in the specified region based on the detected delimiter and structural information.
Examples of the related art include Japanese National Publication of International Patent Application No. 2015-508529 and Japanese Laid-open Patent Publication Nos. 2011-13758 and 2009-75887.
According to an aspect of the embodiments, a data compression method includes: specifying a structure of a group included in semistructured data, based on a data kind and a data type of each data in the group; setting a first identifier unique to each structure and setting a second identifier for a pair of the data kind and the data type of each data in the structure; storing the data in the group in different storage areas for each pair of the first identifier corresponding to the group and the second identifier corresponding to the data; and compressing the data for each storage area.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
Data using a decomposition storage model has higher compression efficiency than data using an N-array storage model. However, the schema of semistructured data changes when data is added or modified. Therefore, it is difficult to apply the decomposition storage model to semistructured data.
As one aspect of the embodiment, it is an object thereof to improve the compression efficiency of the semistructured data.
In the RDBMS, for example, one piece of data is called a record or a tuple. One record includes a plurality of attributes such as “name”, “birth date”, and “address”. A set of such records is called a table or a relation. In the RDBMS, operations such as insertion, deletion, and retrieval of records are executed on the table.
Such a table is a “set of records” as a design concept, but may also be interpreted as two-dimensional information consisting of rows and columns. Attributes of the records are called columns, and each of the records is called a row.
In the RDBMS, the N-array storage model is usually used for data storage. This is because the performance of inserting, deleting, and updating records is important in the RDBMS, and these operations are easier when data is arranged record by record on the storage.
On the other hand, in business intelligence and data warehouses used for data analysis, the decomposition storage model is often used. This is because data analysis often reads only specific attributes of a table. A database adopting the decomposition storage model is called a column-oriented database or a columnar database. Data stored using the decomposition storage model has high compression efficiency, and the data volume is reduced after compression. Thus, input/output (I/O) during read is reduced, and the performance is improved. Therefore, the column-oriented database is usually compressed.
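As a rough illustration of the difference between the two storage models, the following Python sketch regroups record-oriented data into one value list per attribute. The function name `to_columns` and the sample records are hypothetical and are not part of the embodiment. The sketch also assumes that every record has the same attributes.

```python
from typing import Any


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Regroup row-oriented records into one list of values per attribute."""
    columns: dict[str, list[Any]] = {}
    for record in records:
        for attribute, value in record.items():
            columns.setdefault(attribute, []).append(value)
    return columns


# Row-oriented layout: each record keeps all of its attributes together.
records = [
    {"name": "Alice", "birth_date": "1990-01-01", "address": "Tokyo"},
    {"name": "Bob", "birth_date": "1985-05-05", "address": "Osaka"},
]

# Column-oriented layout: values of the same attribute are stored together.
columns = to_columns(records)
```

Because each list in `columns` holds values of a single attribute, an analysis that reads only “name” does not have to touch the other attributes.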
With the recent growing demand for the column-oriented database, various column-oriented databases have been developed. Even among row-oriented databases adopting the N-array storage model, an increasing number of products are capable of adding a column-oriented database function as an option.
There are many compression technologies that may be applied to the decomposition storage model. For example, run-length encoding (RLE) compression, dictionary compression, and the like are used.
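The two techniques named above can be sketched in Python as follows. This is a minimal, simplified illustration of RLE and dictionary compression applied to a single column. The function names and the sample column are hypothetical, and production column-oriented databases use more elaborate encodings.

```python
from itertools import groupby


def rle_encode(values):
    """Replace each run of equal values with a (value, run length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary, codes, index = [], [], {}
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look every code up in the dictionary to restore the values."""
    return [dictionary[code] for code in codes]


column = ["JP", "JP", "JP", "US", "US", "JP"]
runs = rle_encode(column)
dictionary, codes = dictionary_encode(column)
```

Both encodings benefit from values of the same attribute being stored next to each other, which is the arrangement the decomposition storage model provides.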
A table structure (names of columns and data types) in the RDBMS is called a schema. The RDBMS is a database whose schema is defined before data is inserted. On the other hand, there is a database called a document-type DB, whose schema is not defined before data is inserted and into which semistructured data of JSON format, XML format, or the like may be inserted.
As for some document DBs, the data volume is reduced by using a data format realized by compressing an internal structure of a document stored using the N-array storage model. However, the compression efficiency is not sufficient compared with the column-oriented database.
Hereinafter, description is given of an example of semistructured data according to the embodiment. A document used in the following description includes JSON-format semistructured data, and processing in this embodiment may also be applied to semistructured data of XML format or the like, other than JSON format.
A document 1 includes four objects (1) to (4). (1) and (4) have the same structure and may be considered to be conforming to the same schema. Meanwhile, the other objects (2) and (3) have different structures: apart from the common “name” field, their fields differ.
As illustrated in
For “value”, an integer such as 0, 1, and −1 and a decimal number such as 0.1 may be used. For “string”, a string enclosed in double quotation marks, such as “string”, may be described. “Object” is data having elements enclosed in { }, such as {“name1”:“value1”, “name2”:“value2”}. Objects may be nested, such as {“name1”:{“name”:“value2”}}. Multistage (three stages or more) nesting is also applicable.
“Array” is data having a plurality of elements enclosed in [ ], such as [value, value, value]. Data types may be freely specified for the elements in the array. The elements in the same array do not have to have the same data type. The array may also be used as the element in the array.
In this embodiment, objects whose fields have the same field name but different field value types are considered to have different schemas. However, such objects may alternatively be considered to have the same schema.
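The following Python sketch illustrates one way to compare object structures by field name and field value type. The helper names `type_name` and `structure_of` are hypothetical. Sorting the pairs by field name, so that field order does not matter, is an assumption of this sketch rather than something stated in the embodiment.

```python
def type_name(value):
    """Map a parsed JSON value to a coarse data type label."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def structure_of(obj):
    """Describe an object by its (field name, data type) pairs."""
    # Assumption: field order is ignored by sorting the pairs.
    return tuple(sorted((name, type_name(value)) for name, value in obj.items()))


a = {"name": "A", "age": 30}
b = {"age": 41, "name": "B"}
c = {"name": "C", "age": "41"}  # same field names, but "age" is a string
```

Here `a` and `b` yield the same structure, while `c` yields a different one because the type of “age” differs.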
The acquisition unit 11 acquires a document or the like including semistructured data from another information processor or the like. The specification unit 12 specifies a structure of a group included in the semistructured data, based on the kind and type of data. The kind of data is specified from the field name or field ID, for example. The group is, for example, an object.
The setting unit 13 sets a first identifier unique to each structure, and sets a second identifier for a pair of data kind and data type of each data in the structure. The structure is, for example, a schema, which is specified by a field ID array to be described later. The first identifier is, for example, a schema ID to be described later. The second identifier is, for example, a field number to be described later.
When data (for example, the field value) is an array and elements in the array are groups, the setting unit 13 sets a first identifier different from the array for each of the groups in the array.
The generation unit 14 generates a first tree by hierarchizing the plurality of data kinds. The first tree is, for example, a field name/field ID tree to be described later. Upon acquisition of a new group, the generation unit 14 retrieves a data kind in the acquired group from the upper level of the first tree, and adds the data kind to the first tree when the data kind is not present in the first tree.
The generation unit 14 generates a second tree by hierarchizing the plurality of structures. The second tree is, for example, a field ID array/schema ID tree to be described later. Upon acquisition of a new group, the generation unit 14 retrieves a structure of the acquired group from the upper level of the second tree, and adds the structure to the second tree when the structure is not present in the second tree.
When a new document is added, the selection unit 15 selects a first identifier corresponding to a group in the document, based on a schema management table, and selects a second identifier corresponding to each data.
The storage unit 16 stores data in the group in different storage areas for each pair of a first identifier corresponding to the group and a second identifier corresponding to the data. The storage area is, for example, a file, a database, or the like. When the data is an array, the storage unit 16 stores the number of elements in the array and the elements in the array in different storage areas. When the data is an array and elements in the array are groups, the storage unit 16 stores the number of the elements in the array, a first identifier set for each of the groups in the array, and data in the group, in different storage areas.
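The interplay of the first identifier, the second identifier, and the storage areas can be sketched as follows for flat objects that contain only basic data types. The class `ColumnStore` and its members are hypothetical names. In-memory lists stand in for the files or databases mentioned above, and treating objects with a different field order as different structures is a simplifying assumption of this sketch.

```python
from collections import defaultdict


class ColumnStore:
    """Keeps one storage area per (schema ID, field number) pair."""

    def __init__(self):
        self.schema_ids = {}             # structure -> schema ID
        self.areas = defaultdict(list)   # (schema ID, field number) -> values

    def add(self, obj):
        # The structure is the sequence of (field name, data type) pairs.
        structure = tuple((n, type(v).__name__) for n, v in obj.items())
        # Reuse the schema ID of a known structure, or assign the next one.
        schema_id = self.schema_ids.setdefault(structure, len(self.schema_ids) + 1)
        # Field numbers start at 1 and follow the field order in the object.
        for field_number, value in enumerate(obj.values(), start=1):
            self.areas[(schema_id, field_number)].append(value)
        return schema_id


store = ColumnStore()
ids = [
    store.add({"name": "A", "age": 30}),
    store.add({"name": "B", "city": "Tokyo"}),
    store.add({"name": "C", "age": 25}),
]
```

After these calls, the first and third objects share schema ID 1, so their “age” values end up side by side in the storage area keyed by `(1, 2)`.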
The memory unit 17 stores the acquired documents, various trees to be described later, management information, uncompressed files, compressed files, and the like. The compression unit 18 compresses data for each storage area. The decompression unit 19 decompresses each of the compressed files to restore a document. The control unit 20 executes various control operations of the information processor 1.
The field name/field ID tree is a tree used to retrieve the field ID from the field name. For the field name/field ID tree, a data structure called a trie or a prefix tree is applied. The field ID/field name table is a table corresponding to the field name/field ID tree. The field ID/field name table may be implemented as an array or a B-tree structure.
The field ID array/schema ID tree is a tree used to retrieve the schema ID from the field ID array. The schema management table is a table corresponding to the field ID array/schema ID tree, and is used to manage the structure for each schema. The information illustrated in
The generation unit 14 retrieves the field name in each field from the field name/field ID tree. When the field name/field ID tree does not include a newly acquired field name, the setting unit 13 sets a field ID corresponding to the field name. The generation unit 14 adds the field name to the field name/field ID tree, and gives the set field ID to the field name.
When a new document is added, the setting unit 13 checks if the field name/field ID tree includes a field name in the added document. If not, the setting unit 13 gives a new field ID to the field name. As for the retrieval of the field name, the retrieval may be completed more quickly by retrieval from the field name/field ID tree rather than retrieval from the field ID/field name table.
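The lookup-or-add behavior described above can be sketched with a small character-level trie. The class and method names are hypothetical, and branching on single characters is an assumption of this sketch. The embodiment only states that a trie or prefix tree is applied.

```python
class _Node:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.field_id = None  # set only where a complete field name ends


class FieldNameTree:
    """Field name/field ID tree plus the matching field ID/field name table."""

    def __init__(self):
        self.root = _Node()
        self.id_to_name = {}  # field ID -> field name

    def lookup(self, name):
        """Return the field ID for name, or None if it is not registered."""
        node = self.root
        for ch in name:
            node = node.children.get(ch)
            if node is None:  # the search stops at the first missing branch
                return None
        return node.field_id

    def get_or_add(self, name):
        """Return the existing field ID, or register name under a new ID."""
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, _Node())
        if node.field_id is None:
            node.field_id = len(self.id_to_name) + 1
            self.id_to_name[node.field_id] = name
        return node.field_id


tree = FieldNameTree()
first = tree.get_or_add("name")
second = tree.get_or_add("age")
again = tree.get_or_add("name")
```

A lookup walks at most as many nodes as the field name has characters, regardless of how many field names are registered, whereas a scan of the table has to compare against each entry.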
For example, while the number of items under root is 8 in the field name/field ID tree illustrated in
For example, since the “name” field is common in all the objects in the document illustrated in
The setting unit 13 sets a unique schema ID for the field ID array, that is, for each structure of an object. The generation unit 14 attaches the schema ID to the end of the tree. Since (1) and (4) have the same schema, the same schema ID (1) is attached thereto.
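A field ID array/schema ID tree of this kind can be sketched as a trie whose branches are field IDs and whose end nodes carry the schema ID. The class name `SchemaTree` and the sample field ID arrays are hypothetical, and the nested dictionaries are only one possible representation.

```python
class SchemaTree:
    """Maps a field ID array to a schema ID through a trie of field IDs."""

    def __init__(self):
        self.root = {}           # field ID -> child node (nested dicts)
        self.next_schema_id = 1

    def find(self, field_ids):
        """Return the schema ID for field_ids, or None if it is unknown."""
        node = self.root
        for field_id in field_ids:
            if field_id not in node:  # unknown branch: stop searching early
                return None
            node = node[field_id]
        return node.get("schema_id")

    def find_or_add(self, field_ids):
        """Return the schema ID for field_ids, assigning a new one if needed."""
        node = self.root
        for field_id in field_ids:
            node = node.setdefault(field_id, {})
        if "schema_id" not in node:
            # The schema ID is attached at the end of the path.
            node["schema_id"] = self.next_schema_id
            self.next_schema_id += 1
        return node["schema_id"]


schemas = SchemaTree()
s1 = schemas.find_or_add([1, 2, 3])
s2 = schemas.find_or_add([1, 4])
s3 = schemas.find_or_add([1, 2, 3])  # same structure as the first object
```

Objects with the same field ID array reach the same end node and therefore receive the same schema ID, while a search for an unknown structure ends as soon as a branch is missing.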
As illustrated in
When a new document is added, the setting unit 13 checks if the field ID array/schema ID tree of
For example, description is given of retrieval processing when the schema management table does not include the structure of the object to be added. In the case of retrieval from the schema management table, the setting unit 13 determines that the schema management table does not include the structure of the object to be added, as a result of retrieval of each entry in the schema management table. On the other hand, in the case of retrieval from the field ID array/schema ID tree of
The storage unit 16 stores the data in the generated files. In the example illustrated in
It is also assumed that the document 7 is added after the data in the document 1 is stored. The selection unit 15 selects the schema ID corresponding to the object in the document 7, based on the schema management table, and selects the field number corresponding to each field. In the example illustrated in
The storage unit 16 also stores the schema IDs of the stored data as a document index in the order of data storage.
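The role of the document index can be illustrated with the following sketch, which rebuilds flat objects in their original order from per-field storage areas. The function name, the `field_names` mapping, and the sample data are hypothetical. The sketch assumes that each storage area holds its values in the order in which they were stored.

```python
def restore_in_order(document_index, field_names, areas):
    """Rebuild objects in storage order from per-field storage areas."""
    next_row = {}  # schema ID -> number of objects of that schema read so far
    documents = []
    for schema_id in document_index:
        row = next_row.get(schema_id, 0)
        next_row[schema_id] = row + 1
        documents.append({
            name: areas[(schema_id, number)][row]
            for number, name in enumerate(field_names[schema_id], start=1)
        })
    return documents


document_index = [1, 2, 1]  # schema IDs in the order of data storage
field_names = {1: ["name", "age"], 2: ["name", "city"]}
areas = {
    (1, 1): ["A", "C"], (1, 2): [30, 25],
    (2, 1): ["B"], (2, 2): ["Tokyo"],
}
restored = restore_in_order(document_index, field_names, areas)
```

Without the index, the values of each schema could still be read, but the order in which objects of different schemas appeared would be lost.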
The compression unit 18 compresses the files having the data stored therein, for each file. As described above, the file is generated for each pair of field name and data type. Therefore, since each data stored in one file has a common data type, the information processor 1 according to the embodiment may improve compression efficiency.
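Per-file compression can be sketched as follows. The embodiment does not name a specific compression algorithm or serialization format. The use of `zlib` and JSON here is an assumption made only for illustration, and the function names are hypothetical.

```python
import json
import zlib


def compress_areas(areas):
    """Compress every storage area independently of the others."""
    return {
        key: zlib.compress(json.dumps(values).encode("utf-8"))
        for key, values in areas.items()
    }


def decompress_areas(blobs):
    """Invert compress_areas, restoring the value list of each area."""
    return {
        key: json.loads(zlib.decompress(blob).decode("utf-8"))
        for key, blob in blobs.items()
    }


areas = {(1, 1): ["A", "C"], (1, 2): [30, 25]}
blobs = compress_areas(areas)
round_trip = decompress_areas(blobs)
```

Because every blob is produced from values of a single field and data type, a type-specific encoder could be substituted per area without changing the overall flow.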
When there is a nested object as in the case of the document 2 of
For example, the generation unit 14 expresses the fields in the “address” field in the document 2 of
As described above, when there is a nested object, the field ID is given for each field in the lower object. Therefore, in the schema management table, the field number is also given for each field in the lower object. As a result, as illustrated in
Next, description is given of processing when there is an array in semistructured data. An array included in the semistructured data is classified into the following (A) to (C).
(A) All elements in an array are standardized as a basic data type, including boolean value, string, integer, floating point, and the like. Such an array is referred to as a basic data type array.
(B) All elements in an array are objects. The objects of the respective elements may have different schemas. Such an array is referred to as an object type array.
(C) An array other than (A) and (B). For example, an array in which elements have different data types or a basic data type is mixed with objects.
Since the processing of this embodiment is not applicable to arrays corresponding to (C), description is given of processing for the arrays of (A) and (B).
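The classification into (A) to (C) can be sketched as a small Python function. The function name is hypothetical, and classifying an empty array as (A) is an assumption of this sketch, since the text does not address that case.

```python
BASIC_TYPES = (bool, int, float, str)


def classify_array(elements):
    """Return "A", "B", or "C" following the classification above."""
    if not elements:
        return "A"  # assumption: an empty array is handled like (A)
    if all(isinstance(e, dict) for e in elements):
        return "B"  # object type array; element schemas may differ
    first = type(elements[0])
    if first in BASIC_TYPES and all(type(e) is first for e in elements):
        return "A"  # basic data type array with one common type
    return "C"      # mixed types, or basic values mixed with objects


kinds = [
    classify_array(["admin", "dev"]),
    classify_array([{"name": "x"}, {"id": 1}]),
    classify_array([1, "a"]),
    classify_array([1, {"name": "x"}]),
]
```

Only arrays classified as (A) or (B) would be passed on to the storage processing. An array classified as (C) would have to be rejected or handled separately.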
Since the document 8 includes two kinds of field names, “user” and array “group”, the setting unit 13 sets a field ID for each of the field names. The generation unit 14 uses the field IDs set by the setting unit 13 to generate a field ID/field name table.
Although the generation unit 14 generates a field name/field ID tree and a field ID array/schema ID tree for the document including the basic data type array, illustration thereof is omitted.
Since the first array “group” in the document 8 of
Since the first array “group” in the document 8 of
As illustrated in
Since the storage unit 16 stores the number of elements in the array and the elements in the array in different files, the arrays different in the number of elements may be handled as one schema. Thus, the number of files may be reduced.
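Storing the number of elements and the elements separately can be sketched as follows. The key suffixes `"count"` and `"element"` are hypothetical stand-ins for the two files, and the function names and sample values are illustrative only.

```python
def store_basic_array(files, schema_id, field_number, elements):
    """Append the element count and the elements to two separate files."""
    files.setdefault((schema_id, field_number, "count"), []).append(len(elements))
    files.setdefault((schema_id, field_number, "element"), []).extend(elements)


def load_basic_arrays(files, schema_id, field_number):
    """Split the flat element file back into arrays using the counts."""
    counts = files[(schema_id, field_number, "count")]
    elements = iter(files[(schema_id, field_number, "element")])
    return [[next(elements) for _ in range(count)] for count in counts]


files = {}
store_basic_array(files, 1, 2, ["A", "B", "C"])  # array with three elements
store_basic_array(files, 1, 2, ["D"])            # same field, one element
arrays = load_basic_arrays(files, 1, 2)
```

Both arrays share the same pair of files even though their lengths differ, which is why arrays with different numbers of elements do not require separate schemas.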
Next, description is given of processing when an array in a document includes objects as elements. Although a plurality of objects in the array have different schemas in the following example, the same processing is applicable even when the plurality of objects in the array have the same schema.
Although the document 9 includes two arrays “roles”, which are different in the number of elements, the same field ID is given thereto.
In
In the example illustrated in
As in the case of the basic data type data, the storage unit 16 stores the objects in the array in different files for each pair of the schema ID and the field number in the schema management table. In the example illustrated in
For example, when a document includes a plurality of arrays having different schemas of the objects as elements, the number of schemas may be increased if the respective arrays are considered as different structures. In this embodiment, arrays having different schemas of the objects as elements are considered as the same schema, and the schema IDs are set for the objects in the array. Thus, an increase in the number of schemas may be avoided.
The information processor 1 executes compression preprocessing (Step S102). The compression preprocessing is described in detail later. The storage unit 16 stores a schema ID (P) in a document index file (Step S103). The compression unit 18 compresses data for each file (Step S104).
The generation unit 14 starts repetition processing for each field F directly under the level R (Step S201). When the document 2 of
The generation unit 14 executes second generation processing for the field F (Step S202). The second generation processing is processing of generating a field name/field ID tree and a field ID/field name table. Upon completion of the processing in Step S202 for each field F, the generation unit 14 terminates the repetition processing (Step S203).
The specification unit 12 specifies an object structure based on a combination of the field name and the data type in the object (Step S204).
The generation unit 14 uses the generated field name/field ID tree and field ID/field name table to generate a field ID array, and sets the generated field ID array as J (Step S205).
The generation unit 14 determines whether or not the field ID array/schema ID tree includes the generated field ID array (J) (Step S206). When the generation unit 14 determines that there is no field ID array (J) (NO in Step S206), the setting unit 13 sets a schema ID for the field ID array (J) and sets a field number for a pair of the field name and the data type of each data in the object (Step S207). As described above, the field ID array (J) is an array representing the structure of the object.
The generation unit 14 generates a field ID array/schema ID tree and a schema management table, to which the field ID array (J) is added (Step S208). When determining that there is the field ID array (J) (YES in Step S206), the generation unit 14 terminates the processing.
Since no field ID array/schema ID tree has been generated yet in the first round of processing, the generation unit 14 skips Step S206 and executes Steps S207 and S208.
If YES in Step S301 or Step S302, it is checked if the field name/field ID tree includes the field name of the field F (Step S303). When the field name/field ID tree does not include the field name (NO in Step S303), the setting unit 13 sets “field name of S. F” as the field name, and sets the field ID corresponding to the field name (Step S304). When Step S304 is called up for the first time, the setting unit 13 sets the field name of the field F without change since S is empty.
Then, the generation unit 14 adds the field name and the field ID set for the field F to the field name/field ID tree and the field ID/field name table (Step S305).
When the field name/field ID tree includes the field name of the field F (YES in Step S303), the processing is terminated.
If NO in Step S302, it is determined whether or not the field F is the object type (Step S306). When the field F is not the object type (NO in Step S306), the information processor 1 stops the processing since the data is not the processing target data (Step S307).
When the field F is the object type (YES in Step S306), the setting unit 13 sets F as the processing target level and sets “field name of S. F” as the prefix (Step S309). Then, the generation unit 14 recursively calls up the first generation processing (Step S310).
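The recursive handling of object type fields with a prefix can be sketched as follows. The function name is hypothetical, and joining the upper and lower field names with a period is this sketch's reading of “field name of S. F”. Arrays are treated as leaf fields here for brevity.

```python
def flatten_field_names(obj, prefix=""):
    """Collect field names, joining nested names to the prefix of their parent."""
    names = []
    for name, value in obj.items():
        # At the top level the prefix is empty, so the name is used unchanged.
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Object type field: recurse with the joined name as the new prefix.
            names.extend(flatten_field_names(value, full_name))
        else:
            names.append(full_name)
    return names


document = {
    "name": "A",
    "address": {"city": "Tokyo", "zip": {"code": "100"}},
}
flat_names = flatten_field_names(document)
```

Each nested field thus receives its own name, and hence its own field ID, no matter how deep the nesting is.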
When there is a nested object as illustrated in
The storage unit 16 starts repetition processing for each field (F) corresponding to the schema ID (P) (Step S402). The field number is I. The storage unit 16 determines whether or not the field F is the basic data type (Step S403). When the field F is the basic data type (YES in Step S403), the storage unit 16 stores the field value in the file (P-I) (Step S404).
When the field F is not the basic data type (NO in Step S403), the field F is an array, and thus the storage unit 16 stores the number of elements in the array in the file (P-I Number of Elements) (Step S405). Then, the storage unit 16 determines whether or not the field F is a basic data type array (Step S406). When the field F is the basic data type array (YES in Step S406), the storage unit 16 stores all the field values of the elements in the array in the file (P-I Element) (Step S407).
When the field F is not the basic data type array (NO in Step S406), the field F is the object type array, and thus the storage unit 16 starts repetition processing for each element G (object) in the array (Step S408).
The storage unit 16 recursively calls up the compression preprocessing (Step S409). In Step S409, as for the processing target element G (object), addition to the field ID array/schema ID tree and the schema management table, and the like are performed, and storage of the fields in the object is also performed.
The storage unit 16 stores the schema ID corresponding to the object stored in Step S409 in the file (P-I Schema ID) (Step S410). Upon completion of the processing for all the elements in the array (Steps S409 and S410), the storage unit 16 terminates the repetition processing (Step S411). Upon completion of the processing for all the fields corresponding to the schema ID (P) (Steps S403 to S411), the storage unit 16 terminates the repetition processing (Step S412).
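The storage flow of Steps S402 to S412 can be sketched as a recursive function. This is a simplified sketch under several assumptions: nested objects are taken to be flattened beforehand, schema IDs are assigned from a plain dictionary instead of the field ID array/schema ID tree, and the class name `Store` and the key suffixes `"count"`, `"element"`, and `"schema_id"` are hypothetical stand-ins for the files.

```python
BASIC = (bool, int, float, str)


def kind(value):
    return "array" if isinstance(value, list) else type(value).__name__


def is_basic_array(array):
    return all(isinstance(e, BASIC) for e in array) and len({type(e) for e in array}) <= 1


class Store:
    def __init__(self):
        self.schema_ids = {}  # structure -> schema ID (P)
        self.files = {}       # file key -> stored values

    def _append(self, key, value):
        self.files.setdefault(key, []).append(value)

    def store(self, obj):
        structure = tuple((name, kind(value)) for name, value in obj.items())
        p = self.schema_ids.setdefault(structure, len(self.schema_ids) + 1)
        for i, value in enumerate(obj.values(), start=1):   # field number I
            if isinstance(value, BASIC):                    # cf. Step S403
                self._append((p, i), value)                 # cf. Step S404
            elif isinstance(value, list) and is_basic_array(value):
                self._append((p, i, "count"), len(value))   # cf. Step S405
                for element in value:                       # cf. Step S407
                    self._append((p, i, "element"), element)
            elif isinstance(value, list) and all(isinstance(e, dict) for e in value):
                self._append((p, i, "count"), len(value))   # cf. Step S405
                for g in value:                             # cf. Steps S408 to S411
                    self._append((p, i, "schema_id"), self.store(g))
            else:
                raise ValueError("field is outside the processing target")
        return p


store = Store()
top = store.store({
    "user": "u1",
    "roles": [{"name": "admin"}, {"name": "dev", "level": 2}],
})
```

In this run the outer object receives schema ID 1, and the two objects in “roles” receive schema IDs 2 and 3, which are recorded in order under the key `(1, 2, "schema_id")`.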
The decompression unit 19 determines whether or not the field F is the basic data type (Step S603). When the field F is the basic data type (YES in Step S603), the decompression unit 19 decompresses the file (P-I) and reads data in the file (Step S604). When the field F is not the basic data type (NO in Step S603), that is, when the field F is an array, the decompression unit 19 decompresses the file (P-I Number of Elements) and reads data in the file (Step S605).
The decompression unit 19 determines whether or not the field F is the basic data type array (Step S606). When the field F is the basic data type array (YES in Step S606), the decompression unit 19 decompresses the file (P-I Element) and reads data from the decompressed file (Step S607).
When the field F is not the basic data type array (NO in Step S606), the field F is the object type array. In the object type array, the schema ID of each object in the array is stored in the file (P-I Element). Therefore, the decompression unit 19 starts repetition processing for each schema ID (P) in the file (P-I Element) (Step S608).
The decompression unit 19 recursively calls up the decompression processing with the schema ID in the file (P-I Element) as the processing target (Step S609). When the field in the object is the basic data type, the decompression unit 19 decompresses the file (P-I) in which the field value in the object is stored by the processing in Step S604, and reads data.
The decompression unit 19 terminates the repetition processing after executing Step S609 for all the schema IDs in the file (P-I Element) (Step S610). The decompression unit 19 terminates the repetition processing after executing Steps S603 to S610 for all the fields (Step S611).
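The recursive restoration can be sketched as follows, assuming that the files have already been decompressed into in-memory lists. The `fields` mapping stands in for the schema management table. Its kind labels, the key suffixes, including `"schema_id"` for the file that holds the schema IDs of the array elements, and the sample data are all hypothetical.

```python
def restore(schema_id, fields, files, cursors):
    """Rebuild the next object of schema_id, reading each file sequentially."""

    def read(key):
        position = cursors.get(key, 0)  # next unread position in this file
        cursors[key] = position + 1
        return files[key][position]

    obj = {}
    for i, (name, field_kind) in enumerate(fields[schema_id], start=1):
        if field_kind == "basic":                     # cf. Steps S603 and S604
            obj[name] = read((schema_id, i))
            continue
        count = read((schema_id, i, "count"))         # cf. Step S605
        if field_kind == "basic_array":               # cf. Steps S606 and S607
            obj[name] = [read((schema_id, i, "element")) for _ in range(count)]
        else:                                         # object type array
            ids = [read((schema_id, i, "schema_id")) for _ in range(count)]
            # cf. Steps S608 to S610: recurse once per element schema ID.
            obj[name] = [restore(g, fields, files, cursors) for g in ids]
    return obj


fields = {
    1: [("user", "basic"), ("roles", "object_array")],
    2: [("name", "basic")],
    3: [("name", "basic"), ("level", "basic")],
}
files = {
    (1, 1): ["u1", "u2"],
    (1, 2, "count"): [2, 1],
    (1, 2, "schema_id"): [2, 3, 2],
    (2, 1): ["admin", "guest"],
    (3, 1): ["dev"],
    (3, 2): [2],
}
cursors = {}
documents = [restore(1, fields, files, cursors) for _ in range(2)]
```

The shared `cursors` dictionary keeps one read position per file, so objects restored later continue where earlier ones stopped.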
As illustrated in
Even when there is an object type field, the information processor 1 may record a field name in which the upper and lower field names are connected, by performing the recursive processing while holding the upper field name.
Since the field “roles” is an array, the storage unit 16 stores the number of elements “2” in the array in the file. Then, the storage unit 16 calls up the first generation processing and the second generation processing by calling up the compression preprocessing for the elements in the array, thereby adding the fields in r2 to the field name/field ID tree and the schema management table. Thereafter, in the storage processing recursively called up, the storage unit 16 stores the field values of the elements “name” and “gender” in the object in the file. The storage unit 16 also stores the fields in r2 in the file.
As described above, the information processor 1 may store the elements in the object type array in the files.
The information processor 1a includes a compression tool 31 with the functions of the information processor 1 of the embodiment. The information processor 1a acquires a document including semistructured data. Then, the compression tool 31 performs the processing described above to store the semistructured data in a plurality of files, and compresses each of the files. The information processor 1a transmits the compressed file group to the information processor 1b.
The information processor 1b includes a decompression tool 32 with the functions of the information processor 1 of the embodiment. The information processor 1b acquires the compressed file group. Then, the decompression tool 32 performs the processing described above to decompress the compressed file group and restore the document.
The client terminal 2 transmits, to the information processor 1a, a document format message including semistructured data addressed to the server 4. The information processor 1a acquires the document format message. Then, the information processor 1a performs the processing described above to store the document format message in a plurality of files, and compresses each of the files. The information processor 1a transmits the compressed file group to the information processor 1b through the network 3.
The information processor 1b acquires the transmitted compressed file group. Then, the information processor 1b performs the processing described above to decompress the compressed file group and restore the document format message. The information processor 1b transmits the restored document format message to the server 4.
When the client terminal 2 continuously transmits messages, the information processor 1a performs storage and compression after receiving a predetermined number of document format messages, for example. The information processor 1b transmits the received document format messages to the server 4 after sequentially decompressing the messages.
<Hardware Configuration of Information Processor 1>
Next, description is given of an example of a hardware configuration of the information processor 1.
The processor 111 executes a program developed in the memory 112. As the program to be executed, a data compression program to perform the processing of the embodiment may be applied.
The memory 112 is, for example, a random access memory (RAM). The auxiliary storage device 113 is a storage device that stores various information, and a hard disk drive, a semiconductor memory, or the like may be applied, for example. The data compression program to perform the processing of the embodiment may be stored in the auxiliary storage device 113.
The communication interface 114 is connected to a communication network, such as a local area network (LAN) and a wide area network (WAN), to perform data conversion and the like associated with communication.
The medium connector 115 is an interface capable of connecting to a portable recording medium 118. As the portable recording medium 118, an optical disk (for example, a compact disc (CD) and a digital versatile disc (DVD)), a semiconductor memory, and the like may be applied. The portable recording medium 118 may record the data compression program to perform the processing of the embodiment.
The input device 116 is, for example, a keyboard, a pointing device, or the like, and receives input of instructions, information and the like from a user.
The output device 117 is, for example, a display device, a printer, a speaker, or the like, and outputs an inquiry or instruction to the user, processing results, and the like.
The memory unit 17 illustrated in
The memory 112, the auxiliary storage device 113, and the portable recording medium 118 are computer-readable non-transitory tangible storage media, rather than transitory media such as signal carriers.
<Others>
The embodiment is not limited to the one described above, but various changes, additions, and omissions may be made without departing from the scope of the embodiment.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2018-069864 | Mar 2018 | JP | national |