This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2017-054955, filed Mar. 21, 2017, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a storage system and a processing method.
With the spread of cloud computing, there is an increasing demand for a storage system that can store a large amount of data and can process data input and data output at a high speed. This trend has become stronger as interest in big data has been increasing. As one type of storage system capable of meeting such a demand, a storage system in which a plurality of memory nodes are connected to each other has been proposed.
In such a storage system, in which memory nodes are connected to each other, the performance of the entire storage system may be degraded by contention for exclusive locks on the memory nodes, which may occur, for example, at the time of writing data.
An embodiment provides a storage system and a processing method capable of enhancing access performance.
In general, according to an embodiment, a storage system includes a plurality of storage nodes, each including a local processor and one or more non-volatile memory devices, a first control node having a first processor and directly connected to a first storage node, and a second control node having a second processor and directly connected to a second storage node. The local processor of a node controls access to the non-volatile memory devices of said node and processes read and write commands issued from the first and second processors that are targeted for said node. Each of the first and second processors is configured to issue read commands to any of the storage nodes, and to issue write commands only to a group of storage nodes allocated thereto, such that none of the storage nodes can be targeted for writing by both the first and second processors.
Hereinafter, embodiments will be described with reference to the accompanying drawings.
First, a first embodiment will be described.
The storage system 1 illustrated in
A plurality of slots may be formed, for example, on a front surface of a case of the storage system 1, and a blade unit 1001 may be stored in each slot. Further, a plurality of board units 1002 may be contained in each blade unit 1001. A plurality of NAND flash memories 22 is mounted on each board unit 1002. The NAND flash memories 22 in the storage system 1 are connected in a matrix configuration through connectors of the blade units 1001 and the board units 1002. Because the NAND flash memories 22 are connected in a matrix configuration, the storage system 1 is able to provide a high-capacity data storage area.
As illustrated in
The NM 20 includes a node controller (NC) 21 and one or more NAND flash memories 22. The NAND flash memory 22 is, for example, an embedded multimedia card (eMMC®). The NC 21 executes access control to the NAND flash memory 22 and transmission control of data. The NC 21 has, for example, four input/output ports. The NCs 21 are connected to each other via the input/output ports to connect the NMs 20 in a matrix configuration. Connecting the NAND flash memories 22 in the storage system 1 in a matrix configuration means connecting the NMs 20 in a matrix configuration. By connecting the NMs 20 in a matrix configuration, the storage system 1 is able to provide a high-capacity data storage area 30 as described above.
The CU 10 executes input/output processing (including updating and deleting the data) of the data in/from the data storage area 30, which is constructed as described above, according to the request from the client device 2. In more detail, the CU 10 issues an input/output command of the data, which corresponds to the request from the client device 2, to the NM 20. Further, although not illustrated in
The CU 10 includes a CPU 11, a RAM 12, and an NM interface 13. Each function of the CU 10 is implemented by a program that is stored in the RAM 12 and executed by the CPU 11. The NM interface 13 executes communication with the NM 20, in more detail, with the NC 21. The NM interface 13 is connected with the NC 21 of any one of the plurality of NMs 20. That is, the CU 10 is directly connected with any one of the plurality of NMs 20 through the NM interface 13, and is indirectly connected with the other NMs 20 through the NC 21 of that NM 20. The NM 20 directly connected with the CU 10 differs for each CU 10. Further, although not illustrated in
As described above, the CU 10 is directly connected with any one of the plurality of NMs 20. Therefore, even when the CU 10 issues the input/output command of the data to an NM 20 other than the directly connected NM 20, the input/output command is first transmitted to the directly connected NM 20. Thereafter, the input/output command is transmitted up to the desired NM 20 through the NC 21 of each NM 20. For example, when it is assumed that an identifier (M, N) combining a row number and a column number is allocated to each of the NMs 20 connected in the matrix configuration, the NC 21 compares the identifier of its own NM 20 with the identifier designated as the transfer destination of the input/output command, and thereby determines, first, whether the input/output command is addressed to its own NM 20. When the input/output command is not addressed to its own NM 20, the NC 21 then determines to which of the adjacent NMs 20 the input/output command is to be transmitted, based on the relationship between the identifier of its own NM 20 and the identifier designated as the transfer destination, in more detail, the magnitude relationship of the row numbers and of the column numbers. As the technique of transmitting the input/output command up to the desired NM 20, various well-known techniques may be adopted. A path through an NM 20 that is not originally selected as a transmission destination may also be used as an auxiliary path.
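To make the forwarding rule concrete, the following is a minimal sketch in Python of one possible decision, assuming dimension-order (row-first) forwarding between adjacent NMs; the patent leaves the concrete routing technique open, so the function name and tuple representation are purely illustrative.

```python
def next_hop(own, dest):
    """Decide where an NC forwards a command in the NM matrix.

    own, dest: (row, column) identifiers of this NM and of the NM
    designated as the transfer destination. Returns None when the
    command is addressed to this NM; otherwise returns the identifier
    of the adjacent NM to which the command should be transmitted.
    """
    if own == dest:
        return None  # addressed to this NM: access the local NAND flash
    own_row, own_col = own
    dest_row, dest_col = dest
    # Resolve the row difference first, then the column difference,
    # based on the magnitude relationship of the identifiers.
    if own_row != dest_row:
        return (own_row + (1 if dest_row > own_row else -1), own_col)
    return (own_row, own_col + (1 if dest_col > own_col else -1))
```

A command addressed to NM (2, 1) and entering at NM (0, 0) would thus traverse (1, 0) and (2, 0) before arriving, and the processing result retraces an equivalent path back to the issuing CU 10.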
Further, the result of the input/output processing according to the input/output command, that is, the result of the access to the NAND flash memory 22 by the NM 20, is also transmitted up to the CU 10 that is the issuing source of the input/output command, via several other NMs 20, by the operation of the NCs 21, similarly to the transmission of the input/output command. For example, the identifier of the NM 20 to which the CU 10 is directly connected is included in the command as information on the issuing source, and as a result, that identifier may be designated as the transmission destination of the processing result.
As described above, the NM 20 includes the NC 21 and the one or more NAND flash memories 22. Further, as illustrated in
Here, referring to
It is assumed in this example that a predetermined CU 10 receives a writing request of data from the client device 2. Further, it is assumed that another CU 10 also receives a writing request of data from the client device 2 at substantially the same timing. In addition, it is assumed that these two CUs 10 select the same NM 20 as the storage destination of the key-value pair by, for example, a hash calculation using the key as a parameter or a round robin scheme. In general, in a storage device shared by a plurality of hosts (corresponding to the CUs 10), an exclusive lock is provided in order to secure data consistency, and only the host which acquires the exclusive lock may execute the writing of the data. For that reason, in the above assumed case, a lock contention between the two CUs 10 occurs. The contention of the locks causes the performance of the storage device to deteriorate.
Therefore, in the storage system 1, with regard to the writing of the data, the NMs 20 which may be selected as the writing destination are allocated among the CUs 10 without duplication for each CU 10, as illustrated in
In regard to the writing of the data, each CU 10 selects the storage destination of the key-value pair from among only the NMs 20 allocated thereto. In regard to the reading of the data, each CU 10 may read the keys from all of the NMs 20 and read the data from the NM 20 storing the corresponding key, or, when an index is managed, each CU 10 may specify the NM 20 which is the storage destination of the data by referring to the index and read the data from that NM 20.
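As an illustration, here is a minimal sketch of this duplication-free allocation, assuming a round robin scheme over a per-CU NM list; the class and attribute names are invented for the example and do not appear in the patent.

```python
import itertools

class ControlUnit:
    """Write-destination selection without an exclusive lock.

    Each CU receives a disjoint NM list (no NM appears in two lists),
    so two CUs can never select the same NM 20 as a writing destination,
    while every NM remains readable by every CU.
    """

    def __init__(self, nm_list, all_nms):
        self.nm_list = nm_list                   # writing destinations
        self.all_nms = all_nms                   # reading targets
        self._cycle = itertools.cycle(nm_list)   # round robin scheme

    def select_write_nm(self):
        return next(self._cycle)

    def readable_nms(self):
        return self.all_nms

# Four NMs, identified by (row, column), split between two CUs.
all_nms = [(0, 0), (0, 1), (1, 0), (1, 1)]
cu0 = ControlUnit(all_nms[:2], all_nms)
cu1 = ControlUnit(all_nms[2:], all_nms)
assert set(cu0.nm_list).isdisjoint(cu1.nm_list)  # no duplication
```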
As a result, the storage system 1 may enhance the access performance without the need for the exclusive lock.
As illustrated in
The interface unit 501 of the client device 2 receives requests for registration, acquisition, search, and the like of the record from a user. The server communication unit 502 executes communication with the CU 10 (through, for example, the load balancer).
The client communication unit 101 of the CU 10 executes communication with the client device 2 (through, for example, the load balancer). The NM selector 102 selects the NM 20 of the writing destination at the time of writing the data. The CU-side internal communication unit 103 executes communication with another CU 10 or an NM 20. The NM list 104 is a list of the NMs 20 of the writing destination allocated to each CU 10. The NM lists 104 are created such that one NM 20 is prevented from being included in a plurality of NM lists 104. The NM selector 102 selects the NM 20 of the writing destination based on the NM list 104. As the technique of selecting the NM 20, various well-known techniques, such as the round robin scheme or the load balancing scheme, may be adopted.
The NM-side internal communication unit 201 of the NM 20 executes communication with the CU 10 or another NM 20. The command executing unit 202 executes the access to the memory 203 according to the request from the CU 10. The memory 203 stores the data from the user. The memory 203 includes, for example, the volatile RAM 212 for temporarily storing the data in addition to the non-volatile NAND flash memory 22.
The CU 10 determines whether the request from the client device 2 is the writing of the data or the reading of the data (step A1). When the request from the client device 2 is the writing of the data (YES of step A1), the CU 10 selects an NM 20 as a writing target from among the NMs 20 on the NM list 104 (step A2). In addition, the CU 10 executes writing processing of the data with respect to the selected NM 20 (step A3).
Meanwhile, when the request from the client device 2 is the reading of the data (NO of step A1), the CU 10 selects an NM 20 as a reading target from among all of the NMs 20 (step A4). In addition, the CU 10 executes reading processing of the data with respect to the selected NM 20 (step A5).
First, the CU 10 determines whether the corresponding writing is first writing (step B1). When the corresponding writing is the first writing (YES of step B1), the CU 10 acquires coordinates of the NM 20 at the head of the NM list 104 from the NM list 104 (step B2).
When the corresponding writing is not the first writing (NO of step B1), the CU 10 subsequently determines whether writing is completed up to a final NM 20 on the NM list 104 (step B3). When the writing is completed up to the final NM 20 on the NM list 104 (YES of step B3), the CU 10 acquires the coordinates of the NM 20 at the head of the NM list 104 from the NM list 104 (step B2). Meanwhile, when the writing is not completed up to the final NM 20 on the NM list 104 (NO of step B3), the CU 10 acquires the coordinates of the NM 20 next to the previously written NM 20 on the NM list 104 from the NM list 104 (step B4).
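The selection in steps B1 to B4 can be expressed compactly; the following is a sketch under the assumption that the CU keeps only the index of the previously written NM (the state representation is illustrative):

```python
def select_write_nm(nm_list, state):
    """Steps B1-B4: choose the coordinates of the next writing-target NM.

    nm_list: ordered coordinates of the NMs allocated to this CU.
    state:   dict holding 'last_index'; None before the first writing.
    """
    last = state.get('last_index')
    if last is None or last == len(nm_list) - 1:
        # B1 YES or B3 YES: first writing, or writing is completed up to
        # the final NM, so take the NM at the head of the NM list (B2).
        index = 0
    else:
        # B3 NO: take the NM next to the previously written NM (B4).
        index = last + 1
    state['last_index'] = index
    return nm_list[index]

state = {}
nms = [(0, 0), (0, 1), (1, 0)]
assert [select_write_nm(nms, state) for _ in range(4)] == \
       [(0, 0), (0, 1), (1, 0), (0, 0)]  # wraps around to the head
```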
As such, the storage system 1 may enhance the access performance without the need for the exclusive lock.
However, in the above description, it is assumed that the CU 10 is directly connected with any one of the plurality of NMs 20. As described above, the CU 10 may, for example, communicate with all of the NMs 20 with respect to the reading of the data. Further, when the CU 10 communicates with an NM 20 other than the directly connected NM 20, one or more other NMs 20 are interposed between the CU 10 and that NM 20. Therefore, in order to enhance communication performance between the CU 10 and the NM 20, in more detail, in order to decrease the number of other NMs 20 interposed during the communication between the CU 10 and the NM 20, the CU 10 may be directly connected with, for example, two NMs 20 so as to prevent duplication between the CUs 10, as illustrated in
As a result, the storage system 1 may further enhance the access performance.
Subsequently, a second embodiment will be described. Here, the same reference numerals are used to refer to the same components as the first embodiment and the description of the same components will be omitted.
The storage system 1 according to the second embodiment is also able to provide a high-capacity data storage area by connecting the plurality of NMs 20 in a matrix. Further, the input/output processing of the data into/from the data storage area 30, which is requested from the client device 2, is executed by the plurality of CUs 10. Further, in the storage system 1 according to the embodiment, it is assumed that a column type database is constructed.
Herein, first, an outline of the storage system 1 according to the embodiment will be described with reference to
As illustrated in part (A) in
Therefore, in the storage system 1 according to the second embodiment, each NM 20 first searches for data which meets the search condition, in parallel, and returns only the found data to the CU 10. In more detail, the CU 10 sends the search request to each NM 20 (b1), and each NM 20 compares its data to be searched with the search condition (b2). Each NM 20 in which data meeting the search condition is found returns the data to the CU 10 (b3), and the CU 10 merges the data returned from the NMs 20 (b4).
In the storage system 1 according to the second embodiment, the amount of data on the internal network is reduced, and as a result, congestion is alleviated. Further, the search is performed in a distributed manner in the plurality of NMs 20, which reduces the load of the CU 10. As a result, the access performance of the storage system 1 may be enhanced.
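A minimal sketch of the scatter-gather flow (b1) to (b4), assuming each NM is represented by an object holding its local records; threads here stand in for the NMs working in parallel, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace

def search_all(nms, condition):
    """Steps b1-b4: scatter the search request, gather only matches."""
    def search_one(nm):
        # b2/b3: executed locally on each NM in the real system, so
        # records that do not match never cross the internal network.
        return [record for record in nm.records if condition(record)]

    with ThreadPoolExecutor() as pool:                 # b1: request all NMs
        per_nm = pool.map(search_one, nms)
    return [r for matches in per_nm for r in matches]  # b4: merge at the CU

# Two stand-in NMs holding (key, value) records.
nms = [SimpleNamespace(records=[("k1", "aaa"), ("k2", "bbb")]),
       SimpleNamespace(records=[("k3", "bbb")])]
print(search_all(nms, lambda rec: rec[1] == "bbb"))
# [('k2', 'bbb'), ('k3', 'bbb')]
```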
Subsequently, the description will focus on a data storage format in the column type database.
Now, as illustrated in
In this case, ideally, first, the data of column 2 of each record may be read (c1), and then the data of the other columns of record 2, which meets the search condition (the data of column 2 is ‘bbb’), may be read (c2). However, actually, data of columns which do not originally need to be read is also read.
Therefore, in the storage system 1 according to the second embodiment, secondly, the data storage format is devised to reduce the reading of data of columns which are not needed. As a result, the access performance of the storage system 1 may be enhanced. Hereinafter, the first and second points will be described in detail.
As illustrated in
At the time of creating the table, the user of the client device 2 designates a table name, the number of columns, a column name, and a data type for each column, as illustrated in part (A) in
At the time of dropping the table, the user of the client device 2 designates the table name as illustrated in part (B) in
At the time of registering the record, the user of the client device 2 designates the table name and the data for each column as illustrated in part (C) in
At the time of searching the record, the user of the client device 2 designates the table name, identification information of a column to be compared, and the search condition as illustrated in part (D) in
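Parts (A) to (D) of the figure are not reproduced here, but the following sketch suggests how the four commands might look as simple request objects; every field name and value is an illustrative assumption, not the patent's actual command format.

```python
# Hypothetical renderings of the four commands described above.
create_table = {"command": "create", "table": "t1", "num_columns": 3,
                "columns": [("col1", "int32"),    # fixed length type
                            ("col2", "text"),     # variable length type
                            ("col3", "int32")]}
drop_table = {"command": "drop", "table": "t1"}
register_record = {"command": "register", "table": "t1",
                   "data": [2, "bbb", 100]}       # one value per column
search_record = {"command": "search", "table": "t1",
                 "column": "col2",                # column to be compared
                 "condition": ("==", "bbb")}      # search condition
```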
Subsequently, the operation of the storage system 1 at the time of registering the record will be described with reference to
When the record registration command illustrated in part (C) in
The CU 10 first partitions the record for each column. Subsequently, the CU 10 stores the data of each column after the partitioning in different sectors (on the CU cache) so that only the data of the same column is inserted into the same sector, as illustrated in
For example, when the CU cache is full, the CU 10 creates the chunk and writes the created chunk in the NM 20. Referring to
As described above, when the client device 2 registers the record (
In more detail, the CU 10 sorts the sectors in the CU cache in a column order. After the sorting of the sectors, the CU 10 generates metadata regarding each sector in the chunk and stores the generated metadata in, for example, a sector at the head of the chunk. The metadata will be described below.
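The chunk construction can be sketched as follows, assuming for simplicity that each column's data fits in a single sector (in the actual format, a column may occupy several fixed-size sectors, tracked by the order field of the metadata described below); all names are illustrative:

```python
def build_chunk(records, num_columns):
    """Partition records for each column, keep each column's data in its
    own sector, sort the sectors in column order, and place metadata in
    the sector at the head of the chunk."""
    columns = [[] for _ in range(num_columns)]
    for record in records:                  # partition the record per column
        for i, value in enumerate(record):
            columns[i].append(value)
    sectors = []
    sector_info = []                        # (column number, order, elements)
    for col_no, values in enumerate(columns):   # sectors in column order
        sectors.append(values)
        sector_info.append((col_no, 0, len(values)))
    metadata = {"num_records": len(records), "sector_info": sector_info}
    return [metadata] + sectors             # metadata sector at the head

chunk = build_chunk([(1, "aaa", 50), (2, "bbb", 100)], num_columns=3)
```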
When the chunk is created, the CU 10 writes data for one chunk in any one of the plurality of NMs 20 (
As illustrated in
As illustrated in
The data type information is information on the data type of each column. In more detail, the data type information represents whether the data type is a fixed length or a variable length, and when the data type is the fixed length, the data type information also represents the length.
In the case of the fixed length data type, since the size may be known from the data type information, the actual data sector need not include size information of each data. Meanwhile, in the case of the variable length data type, the size information of each data is stored in the actual data sector.
Further, the sector information table is a table that stores a column number, an order, and the number of elements for each sector. The column number represents which column's data each sector stores. The order represents the order of the sectors storing the same column. The number of elements represents the number of data elements stored in each sector.
Referring to the sector information table, it may be known in which sector the data of each column of an n-th record in the chunk is stored. In the case of the fixed length data type, an address in the sector may also be known. For example, in the case of the sector information table illustrated in
Further, in the case of the variable length data type, the data may not fit in one sector. In this case, a plurality of sectors is used. In that case, for example, the number of elements of the head sector in which the data is stored may be identified as −1, the number of elements of the second sector as −2, and so on, by using the field of the number of elements in the sector information table.
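The lookup that the sector information table enables can be sketched as follows for a fixed length column; the function is a hedged illustration, assuming the table is held as a list of (column number, order, number of elements) tuples in sector order:

```python
def locate(sector_info, data_len, column, n):
    """Return (sector index, byte offset) of the n-th record's data for
    `column`, where `data_len` is the fixed length taken from the data
    type information. Raises IndexError if the record is not present."""
    skipped = 0                       # elements of `column` passed so far
    for sector_index, (col, _order, num_elements) in enumerate(sector_info):
        if col != column:
            continue
        if n < skipped + num_elements:
            # Fixed length type: the address in the sector is computable.
            return sector_index, (n - skipped) * data_len
        skipped += num_elements
    raise IndexError("column %d of record %d is not in this chunk"
                     % (column, n))

# Column 1 is spread over two sectors of 100 elements each (orders 0, 1).
info = [(0, 0, 100), (1, 0, 100), (1, 1, 100), (2, 0, 100)]
print(locate(info, data_len=4, column=1, n=150))  # (2, 200)
```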
In addition, the NM 20 manages chunk management information and a chunk registration order list on the memory (e.g., RAM 212) in order to manage the chunk.
The chunk management information represents information as to whether each chunk area is valid or invalid as illustrated in
The chunk registration order list stores the registration order of the chunk for each table as illustrated in
The NM 20 that manages the chunk management information and the chunk registration order list searches for an invalid chunk area by using the chunk management information at the time of writing the chunk. The NM 20 writes the chunk in the found chunk area. In this case, the NM 20 updates the chunk management information in order to make the chunk area valid and to register the table ID. Further, the NM 20 updates the chunk registration order list of the table by registering the chunk number of the newly valid chunk area at the head.
For example, when searching the records in a predetermined table is required, the NM 20 may recognize the chunks to be searched by referring to the chunk registration order list of the table. Further, for example, it is possible to search the chunks in order of new data or in order of old data by following the chunk registration order list from the head or from the end.
Further, at the time of dropping the table, the NM 20 makes the chunk area, to which a table ID to be dropped is allocated in the chunk management information, be invalid, and empties the chunk registration order list of the table.
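The chunk bookkeeping on the NM side can be sketched as follows; this is a minimal illustration of the writing and dropping behavior just described (and of steps H1 to H5 and J1 to J2 in the flowcharts below), with invented names, holding the two structures as a list and a dict on the RAM 212:

```python
class ChunkManager:
    """NM-side management of chunk areas and registration order."""

    def __init__(self, num_areas):
        # Chunk management information: table ID per area; None = invalid.
        self.area_table_id = [None] * num_areas
        # Chunk registration order list per table, newest chunk at the head.
        self.registration_order = {}

    def write_chunk(self, table_id, chunk, storage):
        for area, owner in enumerate(self.area_table_id):
            if owner is None:                      # found an invalid area
                storage[area] = chunk
                self.area_table_id[area] = table_id    # valid + table ID
                self.registration_order.setdefault(table_id, [])
                self.registration_order[table_id].insert(0, area)  # head
                return area
        raise RuntimeError("no empty chunk area")  # error case of step H3

    def drop_table(self, table_id):
        for area, owner in enumerate(self.area_table_id):
            if owner == table_id:
                self.area_table_id[area] = None    # make the area invalid
        self.registration_order[table_id] = []     # empty the order list
```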
Herein, referring to
The NM 20 repeats the following operation with respect to each chunk while following the chunk registration order list.
The NM 20 reads the metadata from the sector at the head of each chunk ((1) in
For example, as for the case of the chunk illustrated in
As such, by devising the data storage format, the storage system 1 may read only the necessary minimum number of sectors, which enhances the access performance of the storage system 1. Further, the NMs 20 execute the search in parallel to further enhance the access performance of the storage system 1.
As illustrated in
The interface unit 501 of the client device 2 receives the requests for the registration, acquisition, search, and the like of the record from the user similarly to the first embodiment. Further, herein, since it is assumed that the column type database is constructed, the interface unit 501 additionally receives the requests for creating and dropping the table. Since the server communication unit 502 is the same as that of the first embodiment, the description thereof will be omitted.
Since the client communication unit 101 and the CU-side internal communication unit 103 of the CU 10 are the same as those of the first embodiment, the description thereof will be omitted. The table manager 105 manages information of the tables created by the requests from the client device 2, that is, the table list 109 to be described below. Further, the table manager 105 requests the NM 20 to perform processing of the chunk management information and the chunk registration order list of the table as necessary. The table list 109 includes the name of each table and information on the columns. The CU cache manager 106 executes writing of data in the CU cache 110 and reading of the data from the CU cache 110. The CU cache manager 106 executes writing of data for one chunk in the NM 20, for example, in a case where a predetermined amount of data is accumulated in the CU cache 110.
The CU cache 110 is an area that temporarily stores the predetermined amount of data. The search processor 107 requests each NM 20 to perform the search. Further, the search processor 107 merges the search results from the respective NMs 20 to create a final result. The CU cache search executing unit 108 reads the record from the CU cache 110, compares the read record with the search condition, and acquires the record which meets the search condition.
Since the NM-side internal communication unit 201, the command executing unit 202, and the memory 203 of the NM 20 are the same as those of the first embodiment, the description thereof will be omitted. The chunk manager 204 manages the chunk management information and the chunk registration order list. The search executing unit 205 reads data of a column to be compared from the memory 203, compares the read data with the search condition, acquires the record which meets the search condition, and returns the acquired record to the CU 10.
When the table manager 105 receives a table creation request from the client communication unit 101 (step C1), the table manager 105 registers table information of the requested table in the table list 109 (step C2). Further, the table manager 105 requests the CU-side internal communication unit 103 to transmit a table information registration request to all of the CUs 10 except for its own CU 10 (step C3). In each CU 10, the table information is registered in the table list 109 by the table manager 105.
When the table manager 105 receives a table dropping request from the client communication unit 101 (step D1), the table manager 105 requests the CU-side internal communication unit 103 to transmit a table information dropping request to all of the CUs 10 except for its own CU 10 (step D2). In each CU 10, the table information is dropped from the table list 109 by the table manager 105.
Further, the table manager 105 requests the CU-side internal communication unit 103 to transmit the table information dropping request to all of the NMs 20 (step D3). In each NM 20, the chunk of the table becomes invalid by the chunk manager 204, and the chunk registration order list of the table is emptied by the chunk manager 204.
In addition, the table manager 105 drops the table information from the table list 109 (step D4).
The CU cache manager 106 determines whether an area has already been allocated in the CU cache 110 (step E1). When the allocation is not completed (NO of step E1), the CU cache manager 106 performs area allocation in the CU cache 110 (step E2).
The CU cache manager 106 determines whether the record to be registered has a size which is writable in the area (step E3). When the record to be registered does not have the writable size (NO of step E3), the CU cache manager 106 creates the chunk from registered data and requests the CU-side internal communication unit 103 to write the created chunk (step E4). When the writing is completed, the CU cache manager 106 releases the area. Subsequently, the CU cache manager 106 performs allocation of a new area in the CU cache 110 (step E5).
In addition, the CU cache manager 106 registers data in the area allocated to the CU cache 110 (step E6).
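A minimal sketch of steps E1 to E6, assuming a simple byte-count notion of whether a record fits in the area; AREA_SIZE, the sizing rule, and the write callback are all illustrative assumptions:

```python
AREA_SIZE = 4096  # illustrative capacity of one CU cache area

class CUCacheManager:
    """Buffers records in the CU cache and flushes full areas as chunks."""

    def __init__(self, write_chunk):
        self.area = None                # no area allocated yet
        self.used = 0
        self.write_chunk = write_chunk  # callback that writes to an NM

    @staticmethod
    def _size(record):
        return sum(len(str(value)) for value in record)  # rough sizing

    def register(self, record):
        if self.area is None:                    # E1 NO
            self.area, self.used = [], 0         # E2: allocate an area
        if self.used + self._size(record) > AREA_SIZE:    # E3 NO
            self.write_chunk(list(self.area))    # E4: chunk from registered
            self.area, self.used = [], 0         # E5: allocate a new area
        self.area.append(record)                 # E6: register the data
        self.used += self._size(record)
```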
When the search processor 107 receives the record search request from the client communication unit 101 (step F1), the search processor 107 requests the CU-side internal communication unit 103 to transmit the search request to the plurality of NMs 20 (step F2). The search processor 107 receives the search result of one NM 20 at a time from the CU-side internal communication unit 103 (step F3) until the search results of all of the NMs 20 are received (YES of step F4). The search processor 107 then creates the search result to be returned to the client device 2 from the search results of all of the NMs 20 (step F5). The search processor 107 transmits the created search result to the client communication unit 101 (step F6). The search result is returned to the client device 2 by the client communication unit 101.
When the search executing unit 205 receives the search request from the NM-side internal communication unit 201 (step G1), the search executing unit 205 acquires information on the chunk at the head from the chunk registration order list (step G2). Subsequently, the search executing unit 205 acquires the metadata of the chunk from the memory 203 (step G3). The search executing unit 205 acquires sector data of the column to be compared from the memory 203 based on the metadata (step G4) to compare the respective data in the sector with the search condition sequentially (step G5).
When the data meets the search condition (YES of step G6), the search executing unit 205 acquires the data of the other columns of the record, in which the data of the column to be compared meets the search condition, from the memory 203 based on the metadata (step G7). The search executing unit 205 stores the search result in the memory 203 (step G8).
The search executing unit 205 determines whether comparing all data in the sector is completed (step G9), and if comparing all of the data is not completed (NO of step G9), the search executing unit 205 returns to step G5 to process next data in the sector. Meanwhile, when comparing all of the data is completed (YES of step G9), the search executing unit 205 subsequently determines whether searching all columns to be compared in the chunk is completed (step G10). When searching all of the columns is not completed (NO of step G10), the search executing unit 205 returns to step G4 to process a next sector in the chunk.
When searching all of the columns is completed (YES of step G10), the search executing unit 205 acquires next chunk information from the chunk registration order list (step G11). When the next chunk information exists (YES of step G12), the search executing unit 205 returns to step G3 to process a next chunk. Meanwhile, when the next chunk information does not exist (NO of step G12), the search executing unit 205 reads all search results from the memory 203 (step G13), and then, requests the NM-side internal communication unit 201 to transmit the search result to the CU 10 as a request source (step G14).
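The NM-side loop of steps G1 to G14 can be sketched as follows, assuming the simplified one-sector-per-column chunk layout of the earlier build_chunk sketch; the real flow additionally handles multiple sectors per column and variable length data:

```python
def execute_search(chunks, registration_order, column, condition):
    """Per-NM search: chunks maps chunk number -> [metadata, sector, ...]."""
    results = []
    for chunk_no in registration_order:       # G2/G11: follow the list
        chunk = chunks[chunk_no]
        metadata = chunk[0]                   # G3: metadata at the head
        num_sectors = len(metadata["sector_info"])
        for s, (col, _order, _count) in enumerate(metadata["sector_info"]):
            if col != column:
                continue
            sector = chunk[1 + s]             # G4: column to be compared
            for i, value in enumerate(sector):        # G5: compare in order
                if condition(value):                  # G6: condition met
                    # G7: gather the other columns of the same record.
                    record = [chunk[1 + t][i] for t in range(num_sectors)]
                    results.append(record)            # G8: store the result
    return results                            # G13/G14: return to the CU
```

With the chunk built in the earlier sketch, execute_search({0: chunk}, [0], column=1, condition=lambda v: v == "bbb") returns [[2, 'bbb', 100]].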
When the chunk manager 204 receives the chunk writing request from the NM-side internal communication unit 201 (step H1), the chunk manager 204 searches for an empty chunk area (step H2). When no empty chunk area exists (NO of step H3), the chunk manager 204 terminates the processing of the requested chunk writing as an error.
When an empty chunk area exists (YES of step H3), the chunk manager 204 executes the writing in the chunk area (step H4). The chunk manager 204 changes the chunk management information of the chunk to valid, registers the table ID, and updates the chunk registration order list of the corresponding table (step H5).
When the chunk manager 204 receives a table dropping notification from the NM-side internal communication unit 201 (step J1), the chunk manager 204 invalidates, in the chunk management information, all of the chunks having the table ID of the dropped table, and empties the chunk registration order list of that table ID (step J2).
As described above, in the storage system 1, first, each NM 20 searches for the data which meets the search condition in parallel, and second, the data storage format is devised, thereby enhancing the access performance.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind
--- | --- | --- | ---
2017-054955 | Mar 2017 | JP | national