This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-87751, filed on May 25, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to an information processing apparatus, a computer-readable recording medium storing a program, and an information processing method.
In the cloud, serverless computing may be used. Serverless computing goes beyond the frame of usual cloud computing, such as hosting services, in that processing units called functions operate freely regardless of hardware resources, allowing the cloud to use the hardware thoroughly. Users are charged on a pay-per-use basis according to the number of requests for functions, which makes it easy to start small.
Serverless computing may have difficulty handling data intended to be persistent. It is basically impossible to specify where functions are to operate, so usual serverless computing generally uses public cloud storage services accessible from anywhere in the world. Public cloud storage services offer excellent durability and availability. However, some inexpensive storage services have long response times (latency, for example), while cloud databases (DBs) are expensive and do not allow for immediate scaling up.
As a storage of persistent data for serverless computing, a distributed data store has been introduced in recent years. The distributed data store implements DB properties (atomicity, consistency, isolation, and durability (ACID)) in a computing environment spread over a wide area, such as a multi-cloud or multi-cluster environment.
A basic distributed data store is composed of N (N>2) servers. Each server is individually placed in one of separate sites (data centers, for example). Even if some of the servers or networks fail, services may be continued by the remaining servers. In a distributed data store for serverless computing, the place where a function is executed varies, and the configuration (server placement sites, for example) of the distributed data store is changed so as to shorten the response time from the place where the function is executed.
Examples of the related art include as follows: U.S. Patent Application Publication No. 2016/0098225, Japanese Laid-open Patent Publication No. 2009-151403, and International Publication Pamphlet No. WO 2014/188682.
According to an aspect of the embodiments, there is provided an information processing apparatus including: a memory; and a processor coupled to the memory, the processor being configured to: in a network coupling a plurality of storage nodes, at least one proxy, and at least one client; collect information of accesses executed most by the at least one client via the at least one proxy on a path of each access; based on the information of accesses, calculate network distances between the plurality of storage nodes and the at least one proxy; and based on the network distances, determine a leader to be one of the plurality of storage nodes that is close to one of the at least one proxy accessed most frequently.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In distributed data stores, the configuration may be changed. However, during the configuration change, a plurality of servers may each assign themselves to serve as the leader, which could result in a situation called split brain, where DB consistency is broken. In order to avoid split brain, it is conceivable to change the configuration while services are suspended. However, suspension of services causes an issue of availability. By using a consensus algorithm, it is possible to avoid split brain and maintain availability at the same time. However, the leader is basically determined in elections among the servers and is not guaranteed to be located near the place where the function is executed.
In one aspect, an object is to improve the performance of a distributed data store.
An embodiment will be described below with reference to the drawings. The embodiment described below is merely illustrative and is not intended to exclude employment of various modification examples or techniques that are not explicitly described in the embodiment. For example, the present embodiment may be implemented by variously modifying the embodiment without departing from the gist of the embodiment. Each of the drawings is not intended to indicate that only the elements illustrated in the drawing are included. Thus, other functions or the like may be included.
Hereinafter, the same reference signs denote substantially the same portions in the drawings, and description thereof will be omitted.
The computer apparatus 1 includes an information processing apparatus 10, a display device 15, and a driving device 16.
The information processing apparatus 10 includes a processor 11, a memory 12, a storage device 13, and a network device 14.
The processor 11 is a processing device that performs, for example, various types of control and various operations. The processor 11 realizes various functions by executing an operating system (OS) and programs stored in the memory 12.
The programs to realize the functions of the processor 11 may be provided in a form in which the programs are recorded in a computer-readable recording medium such as, for example, a flexible disk, a compact disc (CD) (such as a CD-read-only memory (CD-ROM), a CD-recordable (CD-R), or a CD-rewritable (CD-RW)), a Digital Versatile Disc (DVD) (such as a DVD-ROM, a DVD-random-access memory (DVD-RAM), a DVD-R, a DVD+R, a DVD-RW, a DVD+RW, or a High Definition (HD) DVD), a Blu-ray disk, a magnetic disk, an optical disc, or a magneto-optical disk. The computer (the processor 11 according to the present embodiment) may read the programs from the above-described recording medium through a reading device (not illustrated) and transfer and store the read programs in an internal recording device or an external recording device. The programs may be recorded in a storage device (recording medium) such as, for example, the magnetic disk, the optical disc, or the magneto-optical disk and provided from the storage device to the computer via a communication path.
When the functions as the processor 11 are realized, the program stored in the internal storage device (the memory 12 in the present embodiment) may be executed by the computer (the processor 11 in the present embodiment). The computer may read and execute the program recorded in the recording medium.
The processor 11 controls operation of the entire information processing apparatus 10. The processor 11 may be a multiprocessor. The processor 11 may be any one of, for example, a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (PLD), and a field-programmable gate array (FPGA). The processor 11 may be a combination of two or more elements of the CPU, the MPU, the DSP, the ASIC, the PLD, and the FPGA.
The memory 12 is, for example, a storage device that includes a read-only memory (ROM) and a random-access memory (RAM). The RAM may be, for example, a dynamic RAM (DRAM). A program such as Basic Input/Output System (BIOS) may be written in the ROM of the memory 12. The software program in the memory 12 may be loaded and executed by the processor 11 as appropriate. The RAM of the memory 12 may be used as a primary recording memory or a working memory.
The storage device 13 is, for example, a device that stores data such that the data is able to be read from and written to the storage device 13. The storage device 13 may be, for example, a solid-state drive (SSD) 131, a serial attached SCSI hard disk drive (SAS-HDD) 132, or a storage class memory (SCM) (not illustrated).
The network device 14 is an interface device which couples the information processing apparatus 10 to the network switch 2 via an interconnect for communication with a network, such as the Internet 3 (described later with reference to
The display device 15 is a liquid crystal display, an organic light-emitting diode (OLED) display, a cathode ray tube (CRT), an electronic paper display, or the like and displays various types of information for, for example, an operator.
The driving device 16 is configured so that a recording medium is removably inserted thereto. The driving device 16 is configured to be able to read information recorded on a recording medium in a state in which the recording medium is inserted thereto. In this example, the recording medium is portable. For example, the recording medium is the flexible disk, the optical disc, the magnetic disk, the magneto-optical disk, a semiconductor memory, or the like.
The distributed data store system 100 includes a plurality of (in the example illustrated in
The computer apparatuses #1 and #2 may be respectively placed in different data centers and may each serve as a storage node 101 (described later with reference to
The computer apparatuses 1 as storage nodes are coupled to each other via dedicated lines, and the computer apparatuses 1 as the storage nodes and the computer apparatuses 1 as proxies are coupled via dedicated lines. The computer apparatuses 1 as the proxies and the computer apparatuses 1 as clients are coupled via the Internet 3. The Internet 3 may be replaced with a different type of network, such as wide area network (WAN).
In the example illustrated in
As illustrated in
The storage nodes 101, proxies 102, and clients 103 may be located at different sites, or may be operated at the same site. In the latter case, the storage nodes 101, proxies 102, and clients 103 are located in consideration of a fault domain. A fault domain refers to a set of hardware sharing a single point of failure.
The plurality of proxies 102 couple the clients 103 to the storage nodes 101 to relay communications therebetween. The proxies 102 are distributed over a wide area, and their URLs are open to the clients 103. Each URL corresponds to at least one node. For the purpose of load sharing, each proxy 102 may be composed of a plurality of nodes within the same site; in this case, the URL is resolved to any one of the plurality of IP addresses by DNS. Latencies and bandwidths to the storage nodes 101 differ between the proxies 102. Even in the case where each proxy 102 is composed of a plurality of nodes, the tendency (average values and the like) thereof does not exhibit a great difference. For each proxy 102, an access counter or the upstream and downstream transferred data sizes are monitored, and the values thereof in the last period A may be acquired by accessing the proxy 102.
The clients 103 are distributed in a wide area, and round-trip times or upstream and downstream bandwidths between each client 103 and the respective proxies 102 are measured in advance. Each client 103 is then selectively coupled to the proxy 102 (URL, for example) that is close to the client 103 itself.
The function as the management apparatus 104 may be included in any one of the proxies 102 or the storage nodes 101. The management apparatus 104 may access both the proxies 102 and storage nodes 101 and may use multiple nodes based on majority voting using a Raft algorithm or a Paxos algorithm.
To start an election of a storage node 101 as the leader, the management apparatus 104 executes a leader reassignment process, triggered either by monitoring according to a service level agreement (SLA) of the clients 103 or by regular execution.
In the leader reassignment process, the management apparatus 104 acquires S parameters (later described using
The storage node 101 having received TriggerElection RPC replies with approval = ACK (true) if it is the leader. On the other hand, if it is not the leader, the storage node 101 replies with denial = NACK (false) and the leader information that the storage node 101 itself knows.
When receiving the denial from the storage node 101, the management apparatus 104 updates the leader that it knows to the leader included in the reply and again sends TriggerElection RPC to the storage node 101 indicated in the reply.
The storage node 101 serving as the leader, having received TriggerElection RPC, suspends transmission of heartbeats (AppendEntry RPC) to the follower designated as Leader_new for the heartbeat-reception timeout period plus a margin. Not receiving AppendEntry RPC during the timeout period, that follower autonomously runs for leader.
AppendEntry RPC is one of RPCs used in the Raft algorithm and is a heartbeat message from the leader to followers as well as a message for data replication.
The leader reassignment process may be also executed by the following method.
Each storage node 101 starts the leader reassignment process regularly using a timer.
Each storage node 101 terminates the process if the storage node 101 itself is the leader or the node state thereof is Candidate.
On the other hand, if the storage node 101 is not the leader and the node state thereof is not Candidate, the storage node 101 acquires the S parameters and the D parameters. Next, the storage node 101 calculates the NW distances and determines Leader_new. The storage node 101 changes its node state to Candidate if the storage node 101 itself is Leader_new; that is, the storage node 101 of interest is a follower that should become the candidate node. Thereafter, the same process as the aforementioned leader election through the Raft algorithm is executed. For example, RequestVote RPC is sent to all the storage nodes 101, and if a majority of the storage nodes 101 approve, a new leader is elected.
RequestVote RPC is one of the RPCs used in the Raft algorithm and is a message through which the sender node asks a receiver node to vote for the sender in leader election.
The management apparatus 104 may function as a configuration management server that changes site assignment of each storage node 101. The function as the configuration management server may be assigned to any storage node 101. The configuration management server may access both the proxies 102 and the storage nodes 101. To enhance the reliability, the configuration management server may use multiple nodes based on majority voting using the Raft algorithm or Paxos algorithm.
For serving as the configuration management server, the management apparatus 104 executes the following change of site assignment based on monitoring according to the service level agreement (SLA) of the clients 103 or based on regular execution.
The management apparatus 104 acquires the S parameters and acquires the D parameters. Next, the management apparatus 104 calculates the NW distances and determines a set of storage nodes and Leader_new. If the set of storage nodes is the same as the current set of storage nodes, the change of site assignment is unnecessary and is terminated. If Leader_new is different from the current leader, the aforementioned leader reassignment process is executed.
The management apparatus 104 executes a joint consensus procedure of the Raft algorithm. For example, where the old set of storage nodes (before the change) is C1, the new set of storage nodes (after the change) is C2, and the union of the old and new sets is C1 ∪ C2, the set of storage nodes transits from the C1 configuration to the C2 configuration via the C1 ∪ C2 configuration. After the change to the new configuration, the management apparatus 104 executes the leader reassignment process to choose Leader_new.
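The joint-consensus transition described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: `replicate_config` is a hypothetical stand-in for committing a configuration entry through the Raft log, and the node names are assumed values.

```python
def replicate_config(config):
    """Placeholder for committing a configuration entry via the Raft log."""
    return set(config)

def change_membership(c1, c2):
    # Step 1: move to the joint configuration C1 ∪ C2, in which commits
    # require a majority of BOTH the old and new sets.
    joint = replicate_config(c1 | c2)
    # Step 2: once the joint entry is committed, move to C2 alone.
    final = replicate_config(c2)
    return joint, final

joint, final = change_membership({"SN1", "SN2", "SN3"}, {"SN2", "SN3", "SN4"})
```

The two-step transition ensures that at no point can two disjoint majorities (one under C1, one under C2) elect different leaders.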
For example, the information processing apparatus 10 collects information of accesses executed most by at least one client 103 via at least one proxy 102 on a path of each access. Based on the information of accesses, the information processing apparatus 10 calculates the network distances between the plurality of storage nodes 101 and the at least one proxy 102. Based on the network distances, the information processing apparatus 10 determines the leader to be one of the plurality of storage nodes 101 that is close to the proxy 102 accessed most frequently.
The information of accesses may include static parameters and dynamic parameters between the plurality of storage nodes 101 and the at least one proxy 102 for calculating the network distances.
The information processing apparatus 10 may determine the leader when any client 103 determines that an access performance value does not meet a request. The information processing apparatus 10 may determine the leader when any client 103 determines that a request for site change of the plurality of storage nodes 101 is met.
The S parameters are substantially static parameters. The S parameters may be acquired through measurement in advance or may be acquired through measurement as appropriate. The S parameters are parameters which are not completely constant but are re-measured much less frequently than the D parameters described later.
In the round-trip times illustrated in
At SN #1, for example, the round-trip time to the proxy #1 is 10 ms; the round-trip time to the proxy #2 is 100 ms; and the round-trip time to the proxy #3 is 180 ms.
In the upload bandwidths illustrated in
At SN #1, for example, the upload bandwidth from the proxy #1 is 500 MB/s; the upload bandwidth from the proxy #2 is 600 MB/s; and the upload bandwidth from the proxy #3 is 900 MB/s.
In the download bandwidths illustrated in
At SN #1, for example, the download bandwidth to the proxy #1 is 550 MB/s; the download bandwidth to the proxy #2 is 650 MB/s; and the download bandwidth to the proxy #3 is 950 MB/s.
In the message rates illustrated in
At SN #1, for example, the message rate to SN #2 is m1,2, and the message rate to SN #n is m1,n.
In the upstream and downstream bandwidths illustrated in
At SN #1, for example, the upstream and downstream bandwidth to SN #2 is u1,2, and the upstream and downstream bandwidth to SN #n is u1,n.
The D parameters are dynamically changing parameters.
In the download bandwidths illustrated in
In the example illustrated in
In the transferred data volume illustrated in
In the example illustrated in
By calculating the network distances, the leader node may be determined based on the condition of accesses to the proxies 102 when the sites of the storage nodes 101 are fixed, and the configuration of the storage nodes 101 may be determined based on the current network congestion and the condition of accesses to the proxies 102.
The leader node may be calculated based on RW_ratio of the download bandwidths illustrated in
As illustrated in
c1=r11*u1+r12*u2+r13*u3
c2=r21*u1+r22*u2+r23*u3
c3=r31*u1+r32*u2+r33*u3
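The distance formulas above can be sketched as follows. This is an illustrative sketch, assuming that r[i][j] denotes a per-access cost (a round-trip time, for example) between storage node i and proxy j, and that u[j] is the access volume observed at proxy j; all numeric values are hypothetical.

```python
def nw_distance(r_row, u):
    """Weighted network distance of one storage node:
    sum over proxies of (per-access cost * access volume)."""
    return sum(r * v for r, v in zip(r_row, u))

# r[i][j]: cost between storage node i+1 and proxy j+1 (assumed values)
r = [[10, 100, 180],
     [90, 20, 150],
     [170, 140, 30]]
u = [5, 1, 1]  # access volumes observed at proxies #1..#3

c = [nw_distance(row, u) for row in r]   # c1, c2, c3 as in the text
leader = c.index(min(c)) + 1             # node with the smallest distance
```

With these assumed values, the node closest to the busiest proxy (here proxy #1) gets the smallest distance and is chosen as the leader.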
For reflecting the network congestion, the S parameters may be measured just before determination of the configuration. The assignment and distances may be determined for each of three sites, four sites, . . . , and N sites.
Hereinbelow, the case of selecting three sites in total from N=4 sites will be described.
The storage nodes 101 include four nodes SN #1 to SN #4, and the number of ways of selecting three sites therefrom is four: (1, 2, 3), (1, 2, 4), (1, 3, 4), and (2, 3, 4).
For (1, 2, 3), when the leader is SN #1, the time taken to transmit T replica messages is (T/m12+T/m13). When T is 1, f1=(1/m12+1/m13) is calculated in the table illustrated in
When the leader is SN #2, f2=(1/m21+1/m23) is calculated in the table illustrated in
When the leader is SN #3, f3 is calculated in a similar manner.
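The replica-transmission times f1, f2, . . . can be sketched as follows. This is a hedged illustration: m[(i, j)] stands for the message rate between SN #i and SN #j, and the rate values are assumed for the example.

```python
def replica_time(leader, members, m):
    """Time to send one replica message (T = 1) from the leader to every
    follower: the sum of 1/m[(leader, follower)] over the other members."""
    return sum(1.0 / m[(leader, f)] for f in members if f != leader)

# m[(i, j)]: message rate from SN #i to SN #j (assumed symmetric values)
m = {(1, 2): 10.0, (1, 3): 5.0,
     (2, 1): 10.0, (2, 3): 4.0,
     (3, 1): 5.0,  (3, 2): 4.0}

members = (1, 2, 3)                       # the site set (1, 2, 3)
f = {i: replica_time(i, members, m) for i in members}
best = min(f, key=f.get)                  # candidate with shortest replica time
```

The candidate leader with the smallest f value replicates to its followers fastest for this site set.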
In
The weighting in the function B may be implemented as B(x, y)=a0*x+a1*y+a2*x*x+a3*y*y+a4*x*y by using polynomial regression, for example. Average B indicates the average value of the function B for a same set of sites.
In the case of four or more sites, the average B is greater than that in the case of three sites, but the reliability is higher. Application of weighting including the reliability therefore provides a combination of sites with the distance (the pair of the average B and reliability) minimized. This combination is the answer for site assignment.
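The polynomial weighting B(x, y) named above can be evaluated as in the following sketch. The coefficients a0 to a4 are assumed to have been fitted by polynomial regression beforehand; the coefficient and sample values here are hypothetical.

```python
def weight_b(x, y, a):
    """B(x, y) = a0*x + a1*y + a2*x^2 + a3*y^2 + a4*x*y, as in the text."""
    a0, a1, a2, a3, a4 = a
    return a0 * x + a1 * y + a2 * x * x + a3 * y * y + a4 * x * y

a = (1.0, 2.0, 0.1, 0.1, 0.05)    # hypothetical fitted coefficients
pairs = [(1.0, 2.0), (2.0, 1.0)]  # (x, y) samples for one set of sites

# Average B over the samples belonging to the same set of sites.
avg_b = sum(weight_b(x, y, a) for x, y in pairs) / len(pairs)
```

Comparing the pair (average B, reliability) across candidate site sets then yields the site assignment with the minimized distance.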
A first example of the performance monitoring process by the clients 103 side in the embodiment will be described according to the flowchart (steps S1 to S4) illustrated in
The performance monitoring process is performed by each client 103 itself or an agent. The agent is an independent process that exists and operates on the same server as the client 103.
The performance monitoring process may be started at regular intervals (once per minute or so) or may be triggered by degradation of any performance index (response time or the like) of the client 103. If the performance value does not meet the performance request, a leader reassignment request may be sent to the management apparatus 104.
The client 103 retrieves an IO performance value v (step S1).
The client 103 determines whether the IO performance value v meets the performance request (SLA) (step S2).
If the IO performance value v meets the performance request (see YES route in step S2), the process proceeds to step S4.
If the IO performance value v does not meet the performance request (see NO route in step S2), the client 103 sends a leader reassignment request to the management apparatus 104 (step S3).
The client 103 waits a certain period of time or waits for the next monitoring trigger (step S4), and the process returns to step S1.
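One pass of steps S1 to S3 can be sketched as follows. The function names and the convention that a lower value is worse are assumptions for illustration, not the claimed implementation.

```python
def monitor_once(get_io_performance, sla_threshold, send_reassignment_request):
    """Steps S1-S3: fetch the IO performance value v and, if it misses
    the SLA, send a leader reassignment request."""
    v = get_io_performance()            # S1: retrieve IO performance value v
    if v < sla_threshold:               # S2: v does not meet the SLA
        send_reassignment_request()     # S3: request leader reassignment
        return False
    return True                         # S2 YES route: SLA met

requests = []
ok = monitor_once(lambda: 40.0, 100.0, lambda: requests.append("reassign"))
```

Step S4 (waiting for the next monitoring trigger) would wrap this function in a loop or timer.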
Next, a first example of the leader reassignment process by the management apparatus 104 in the embodiment will be described according to the flowchart (steps S11 to S20) illustrated in
The management apparatus 104 receives a leader reassignment request from any client 103 (step S11).
The management apparatus 104 acquires the S parameters (step S12).
The management apparatus 104 acquires the D parameters (step S13).
The management apparatus 104 calculates the network (NW) distances based on the acquired S and D parameters (step S14).
The management apparatus 104 determines a new leader node, Leader_new, based on the calculated network distances (step S15).
The management apparatus 104 sets Leader_curr to the current leader node (step S16).
The management apparatus 104 sends TriggerElection RPC indicating Leader_new to the storage node 101 set as Leader_curr (step S17).
The management apparatus 104 waits to receive a response from the storage node 101 as Leader_curr (step S18).
The management apparatus 104 determines whether the response result=ACK (step S19).
If the response result is ACK (see YES route in step S19), the leader reassignment process is terminated.
If the response result is not ACK (see NO route in step S19), the management apparatus 104 sets Leader_curr to the current leader node included in the response result (step S20), and the process returns to step S17.
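The retry loop of steps S16 to S20 can be sketched as follows. Here the RPC exchange is simplified to a dictionary lookup: `nodes` maps each node to the leader that node currently knows, standing in for the ACK/NACK responses; all names are assumed.

```python
def reassign_leader(nodes, leader_curr):
    """Steps S16-S20: send TriggerElection to the presumed current leader
    and follow NACK redirects until some node replies ACK."""
    while True:
        known = nodes[leader_curr]      # S17/S18: node replies with its leader
        if known == leader_curr:        # S19: ACK, this node is the leader
            return leader_curr
        leader_curr = known             # S20: retry with the replied leader

# Every node knows SN3 is the real leader; we start from a stale guess SN1.
nodes = {"SN1": "SN3", "SN2": "SN3", "SN3": "SN3"}
found = reassign_leader(nodes, "SN1")
```

The loop terminates as soon as the management apparatus reaches the node that is actually the leader.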
Next, a first example of the leader reassignment process by each storage node 101 in the embodiment will be described according to the flowchart (steps S21 to S26) illustrated in
The storage node 101 receives a TriggerElection RPC request indicating Leader_new from the management apparatus 104 (step S21).
The storage node 101 determines whether the storage node 101 itself is the current leader node (step S22).
If the storage node 101 is not the current leader node (see NO route in step S22), the storage node 101 responds with NACK and information indicating the current leader to the management apparatus 104 (step S23). The leader reassignment process is terminated.
If the storage node 101 is the current leader node (see YES route in step S22), the storage node 101 responds with ACK and information indicating the current leader to the management apparatus 104 (step S24).
The storage node 101 determines whether the storage node 101 itself is Leader_new (step S25).
If the storage node 101 itself is Leader_new (see YES route in step S25), the leader reassignment process is terminated.
If the storage node 101 itself is not Leader_new (see NO route in step S25), the storage node 101 makes a setting to suspend AppendEntry RPC to Leader_new (step S26). The leader reassignment process is terminated.
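The receiving-side decisions of steps S22 to S26 can be sketched as follows. The return tuple (ack flag, known leader, node to which heartbeats are suspended) is an illustrative simplification of the RPC response and the internal setting; the node names are assumed.

```python
def handle_trigger_election(self_id, current_leader, leader_new):
    """Steps S22-S26 for a node that received TriggerElection RPC.
    Returns (ack, known_leader, suspend_heartbeat_to)."""
    if self_id != current_leader:             # S22 NO route
        return False, current_leader, None    # S23: NACK + known leader
    if self_id == leader_new:                 # S25 YES route: nothing to do
        return True, current_leader, None     # S24 only
    # S26: the leader suspends AppendEntry (heartbeats) to Leader_new so
    # that Leader_new times out and autonomously runs for leader.
    return True, current_leader, leader_new

resp = handle_trigger_election("SN1", "SN1", "SN2")
```

In the example, SN #1 is the current leader but not Leader_new, so it ACKs and suspends heartbeats to SN #2.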
Next, a second example of the leader reassignment process by each storage node 101 in the embodiment will be described according to the flowchart (steps S31 to S39) illustrated in
The storage node 101 determines whether the storage node 101 itself is the leader node (step S31).
If the storage node 101 itself is the leader node (see YES route in step S31), the leader reassignment process is terminated.
If the storage node 101 itself is not the leader node (see NO route in step S31), the storage node 101 determines whether the state thereof is Candidate (step S32).
If the state is Candidate (see YES route in step S32), the leader reassignment process is terminated.
If the state is not Candidate (see NO route in step S32), the storage node 101 acquires the S parameters (step S33).
The storage node 101 acquires the D parameters (step S34).
The storage node 101 calculates the network (NW) distances based on the acquired S and D parameters (step S35).
The storage node 101 determines a new leader node, Leader_new, based on the calculated network distances (step S36).
The storage node 101 determines whether the storage node 101 itself is Leader_new (step S37).
If the storage node 101 is not Leader_new (see NO route in step S37), the leader reassignment process is terminated.
If the storage node 101 is Leader_new (see YES route in step S37), the storage node 101 changes its state to Candidate (step S38).
The storage node 101 starts leader node election among the storage nodes 101 (step S39). The leader reassignment process is terminated.
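The timer-driven decision of steps S31 to S38 can be sketched as follows. This is a hedged sketch: `compute_leader_new` is a hypothetical stand-in for steps S33 to S36 (acquiring the S and D parameters and calculating the NW distances).

```python
def timer_tick(self_id, is_leader, state, compute_leader_new):
    """Steps S31-S38: decide whether this node should become Candidate."""
    if is_leader or state == "Candidate":   # S31/S32: terminate early
        return state
    leader_new = compute_leader_new()       # S33-S36 (parameters, distances)
    if self_id != leader_new:               # S37 NO route: terminate
        return state
    return "Candidate"                      # S38: then start election (S39)

state = timer_tick("SN2", False, "Follower", lambda: "SN2")
```

Only the node that the distance calculation designates as Leader_new transitions to Candidate; the actual election (step S39) then proceeds via RequestVote RPC as in the Raft algorithm.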
Next, the performance monitoring process for a site change request by the clients 103 side in the embodiment will be described according to the flowchart (steps S41 to S45) illustrated in
Each client 103 retrieves the IO performance value v (step S41).
The client 103 determines whether the IO performance value v meets the performance request (SLA) (step S42).
When the IO performance value v meets the performance request (see YES route in step S42), the process proceeds to step S45.
If the IO performance value v does not meet the performance request (see NO route in step S42), the client 103 determines whether the IO performance value v meets the condition for changing the configuration (step S43).
If the IO performance value v does not meet the condition for changing the configuration (see NO route in step S43), the process proceeds to step S45.
If the IO performance value v meets the condition for changing the configuration (see YES route in step S43), the client 103 sends a site change request to the management apparatus 104 (step S44).
The client 103 waits a certain period of time or waits for the next monitoring trigger (step S45), and the process returns to step S41.
Next, the leader reassignment process by the management apparatus 104 started due to the site change request in the embodiment will be described according to the flowchart (steps S51 to S63) illustrated in
The management apparatus 104 receives a site change request from any client 103 (step S51).
The management apparatus 104 acquires the S parameters (step S52).
The management apparatus 104 acquires the D parameters (step S53).
The management apparatus 104 calculates the network (NW) distances based on the acquired S and D parameters (step S54).
The management apparatus 104 determines a set SNS_new of storage nodes 101 and a new leader node, Leader_new, based on the calculated network distances (step S55).
The management apparatus 104 sets the current set SNS_curr of storage nodes 101 (step S56).
The management apparatus 104 determines whether SNS_curr is the same as SNS_new (step S57).
If SNS_curr is the same as SNS_new (see YES route in step S57), the process proceeds to step S63.
If SNS_curr is not the same as SNS_new (see NO route in step S57), the management apparatus 104 sets values of SNS_new-SNS_curr in an addition set SNS_add (step S58).
The management apparatus 104 sets values of SNS_curr-SNS_new in a deletion set SNS_del (step S59).
The management apparatus 104 reserves new nodes based on the values of the addition set SNS_add (step S60).
The management apparatus 104 conducts joint consensus for SNS_curr and SNS_new (step S61).
The management apparatus 104 releases unnecessary nodes based on the values of the deletion set SNS_del (step S62).
The management apparatus 104 executes the leader reassignment process (step S63). The leader reassignment process started due to the site change request is terminated.
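The set arithmetic of steps S57 to S62 can be sketched as follows. Node reservation, joint consensus, and release are represented by returned values rather than real cluster operations, and the node names are assumed.

```python
def change_sites(sns_curr, sns_new):
    """Steps S57-S62: compute the addition/deletion sets and the
    configuration transition plan for the site change."""
    if sns_curr == sns_new:                  # S57 YES route: no site change
        return set(), set(), [sns_curr]
    sns_add = sns_new - sns_curr             # S58: nodes to reserve
    sns_del = sns_curr - sns_new             # S59: nodes to release
    # S60-S62: reserve SNS_add, run joint consensus via the union,
    # then release SNS_del.
    plan = [sns_curr, sns_curr | sns_new, sns_new]
    return sns_add, sns_del, plan

add, delete, plan = change_sites({"SN1", "SN2", "SN3"}, {"SN2", "SN3", "SN4"})
```

After the plan completes, the leader reassignment process (step S63) chooses Leader_new within the new set.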
According to the information processing apparatus 10, the program, and the information processing method in one example of the embodiment described above, for example, the following operation effects may be provided.
The information processing apparatus 10 collects information of accesses executed most by the at least one client 103 via the at least one proxy 102 on a path of each access. Based on the information of accesses, the information processing apparatus 10 calculates the network distances between the plurality of storage nodes 101 and the at least one proxy 102. Based on the network distances, the information processing apparatus 10 determines the leader from the plurality of storage nodes 101, to be the storage node 101 that is close to the proxy 102 accessed most frequently.
This improves the performance of the distributed data store. For example, it is possible to improve the processing speed, throughput, and latency of reading and writing by the clients 103.
The information of accesses may include static parameters and dynamic parameters between the plurality of storage nodes 101 and the at least one proxy 102 for calculating the network distances. This allows for precise determination of the leader based on the network distances.
The information processing apparatus 10 may determine the leader when any client 103 determines that an access performance value does not meet a request. The information processing apparatus 10 may determine the leader when any client 103 determines that a request for site change of the plurality of storage nodes 101 is met. This allows determination of the leader to be carried out at appropriate timing.
The disclosed technique is not limited to the above-described embodiment. The disclosed technique may be carried out by variously modifying the technique without departing from the gist of the present embodiment. Each of the configurations and each of the processes of the present embodiment may be selectively employed or omitted as desired or may be combined with each other as appropriate.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2021-087751 | May 2021 | JP | national |