Certain embodiments of the present invention relate generally to switch-assisted data storage network traffic management in a data storage center.
Data storage centers typically employ distributed storage systems to store large quantities of data. To enhance the reliability of such storage, various data redundancy techniques such as full data replication or erasure coding (EC) are employed. Erasure coding based redundancy can provide improved storage capacity efficiency in large scale systems and thus is relied upon in many commercial distributed cloud storage systems.
Erasure coding can be described generally by the term EC(k,m), where a client's original input data for storage is split into k data chunks. In addition, m parity chunks are computed based upon a distribution matrix. Reliability from data redundancy may be achieved by separately placing each of the total k+m encoded chunks into k+m different storage nodes. As a result, should any m (or fewer than m) encoded chunks be lost due to failure of storage nodes or other causes such as erasure, the client's original data may be reconstructed from any k surviving encoded chunks of client storage data or parity data.
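To make the encode/reconstruct cycle concrete, the following is a minimal sketch for the special case EC(k, 1), using a single XOR parity chunk; commercial systems typically compute the m parity chunks with a Reed-Solomon distribution matrix instead, and all names here are illustrative rather than taken from any particular implementation.

```python
import functools

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ec_encode_k1(data: bytes, k: int) -> list:
    """Split data into k chunks and append one XOR parity chunk: EC(k, 1)."""
    size = -(-len(data) // k)  # ceiling division; last chunk is zero-padded
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\x00")
              for i in range(k)]
    parity = functools.reduce(xor_bytes, chunks)
    return chunks + [parity]   # k + m encoded chunks in total

def ec_reconstruct_k1(chunks: list) -> list:
    """Recover at most one lost chunk (marked None) from the k survivors."""
    lost = [i for i, c in enumerate(chunks) if c is None]
    assert len(lost) <= 1, "EC(k, 1) tolerates the loss of only one chunk"
    if lost:
        survivors = [c for c in chunks if c is not None]
        chunks[lost[0]] = functools.reduce(xor_bytes, survivors)
    return chunks

encoded = ec_encode_k1(b"client original input data", k=4)  # 5 chunks total
encoded[2] = None                           # simulate a failed storage node
restored = ec_reconstruct_k1(encoded)       # lost chunk rebuilt from the rest
```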
In a typical distributed data storage system, a client node generates separate data placement requests for each chunk of data, in which each placement request is a request to place a particular chunk of data in a particular storage node of the system. Thus, where redundancy is provided by erasure coding, the client node will typically generate k+m separate data placement requests, one data placement request for each of the k+m chunks of client data. The k+m data placement requests are transmitted through various switches of the distributed data storage system to the storage nodes for storage. Each data placement request typically includes a destination address, which is the address of the particular storage node assigned to store the data chunk carried as the payload of the request. The switches through which the data placement requests pass note the intended destination address of each data placement request and route the request to the assigned storage node for storage of the data chunk carried as its payload. For example, the destination address may be in the form of a TCP/IP (Transmission Control Protocol/Internet Protocol) address such that a TCP/IP connection is formed for each TCP/IP destination.
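As a rough sketch of this per-chunk scheme, each of the k+m requests pairs one encoded chunk with the address of its assigned storage node, and each request would then travel on its own connection. The type and field names below are hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass

@dataclass
class ChunkPlacementRequest:
    dest_addr: str    # TCP/IP address of the assigned storage node
    payload: bytes    # the encoded chunk carried by this request

def make_placement_requests(chunks, node_addrs):
    """Build one placement request per encoded chunk (k + m in total)."""
    assert len(chunks) == len(node_addrs)
    return [ChunkPlacementRequest(addr, chunk)
            for addr, chunk in zip(node_addrs, chunks)]

requests = make_placement_requests(
    [b"c1", b"c2", b"c3", b"c4", b"c5", b"c6"],          # k + m = 6 chunks
    ["NodeA", "NodeB", "NodeC", "NodeD", "NodeE", "NodeF"])
```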
Embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
In the description that follows, like components have been given the same reference numerals, regardless of whether they are shown in different embodiments. To illustrate an embodiment(s) of the present disclosure in a clear and concise manner, the drawings may not necessarily be to scale and certain features may be shown in somewhat schematic form. Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
As noted above, where redundancy is provided by erasure coding (EC), a prior client node will typically generate k+m separate data placement requests, one data placement request for each of the k+m chunks of client and parity data. Thus, for erasure coding, it is appreciated herein that a considerable amount of additional storage network traffic is generally generated, and as a result, substantial network bandwidth is frequently required to place the redundant chunks of EC encoded data on the storage nodes of the storage system. It is further appreciated that, due to the splitting of the original client data into a number of encoded chunks including parity data, the latency of the various data placement requests may also increase, since all data placement requests for the encoded chunks of data are typically handled concurrently in the same manner as they are routed from the client node to the storage nodes and the storage media of the storage nodes.
Moreover, the k+m separate data placement requests typically result in k+m separate acknowledgements, one acknowledgement for each data placement request as the data of the request is successfully stored in a storage node. Thus, the separate acknowledgements also contribute to additional storage network traffic and, as a result, can also increase storage network bandwidth requirements.
As explained in greater detail below, switch-assisted data storage network traffic management in accordance with one aspect of the present description may substantially reduce the added storage network traffic generated as a result of EC or other redundancy techniques and, as a result, substantially reduce both network bandwidth costs and data placement latency. More specifically, in distributed data storage systems employing multiple racks of data storage nodes, both intra-rack and inter-rack network traffic carrying data to be stored may be reduced notwithstanding that EC encoded data chunks or other redundancy methods are employed for reliability purposes. Moreover, both intra-rack and inter-rack network traffic acknowledging placement of EC encoded data chunks in assigned storage nodes may be reduced as well. However, it is appreciated that features and advantages of employing switch-assisted data storage network traffic management in accordance with the present description may vary, depending upon the particular application.
In one aspect of the present description, Software Defined Storage (SDS) level information is utilized to improve optimization of data flow within the storage network. In one embodiment, the SDS level information is a function of the hierarchical levels of hierarchical switches interconnecting storage nodes and racks of storage nodes in a distributed storage system. In addition, the SDS level information is employed by the various levels of hierarchical switches to improve optimization of data flow within the storage network.
For example, in one embodiment, a storage network of a distributed data storage system employs top of rack (ToR) switches at a first hierarchical level and end of row (EoR) switches at a second, higher hierarchical level than that of the ToR switches. SDS level information is employed by the various levels of hierarchical EoR and ToR switches to improve optimization of data flow within the storage network.
In another aspect, switch-assisted data storage network traffic management in accordance with one aspect of the present description can facilitate scaling of a distributed data storage system, in which the number of storage nodes, racks of storage nodes and hierarchical levels of switches increases in such scaling. As a result, reductions in both storage network bandwidth and latency achieved by switch-assisted data storage network traffic management in accordance with one aspect of the present description may be even more pronounced as the number of racks of servers or the number of levels of hierarchy deployed for the distributed storage systems increases.
In the example of
As previously mentioned, in a typical prior distributed data storage system, a client node such as the compute node of Rack3 (
As shown in
For example, the payload field 10 of the data placement RequestA, for example, contains the encoded chunk Chunk1, and the destination address field 14 of the data placement RequestA has the address of the assigned destination storage NodeA. The payload field 10 of the other data placement requests, RequestB-RequestF, each contain an encoded chunk, Chunk2-Chunk6, respectively and the destination address field 14 of each of the other data placement requests, RequestB-RequestF, has the address of the assigned destination storage node, NodeB-NodeF as shown in
The k+m data placement requests (RequestA-RequestF in
Thus, in the example of
It is noted that a prior placement of encoded data and parity chunks in the manner depicted in
Moreover, it is further noted that, in a typical prior distributed data storage system, placement of an encoded chunk in each storage node generates a separate acknowledgement which is routed through the storage network of the distributed data storage system. Thus, where the client node generates k+m separate data placement requests, one data placement request for each of the k+m encoded chunks of client data, the storage nodes typically generate k+m acknowledgements in return upon successful placement of the k+m encoded chunks. Each of the k+m separate data placement requests may identify the source of the data placement request for purposes of addressing acknowledgments to the requesting node.
Accordingly, in the example of
The k+m data placement acknowledgements (AckA-AckF in
Thus, in the example of
It is noted that prior concurrent acknowledgements of placement of encoded data and parity chunks in the manner depicted in
As shown in
The data structure of the consolidated data placement Request0 further includes, in association with the payload fields 100a-100f, a plurality of destination address fields such as the destination address fields 104a-104f to identify the destination address for the associated encoded chunks, Chunk1-Chunk6, respectively. Thus, each destination address field, 104a-104f, contains the address of a storage node which has been assigned to store the encoded chunk contained within the associated payload field 100a-100f, respectively. For example, the payload field 100a of the consolidated data placement Request0 contains the encoded chunk, Chunk1, and the associated destination address field 104a of the consolidated data placement Request0 contains the address of the assigned destination storage node which is NodeA (
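One way to picture this data structure is as paired lists of payloads and destination addresses carried in a single request; the sketch below is a hypothetical rendering of Request0, not a wire format taken from the description.

```python
from dataclasses import dataclass

@dataclass
class ConsolidatedPlacementRequest:
    # Parallel lists model the paired fields: payloads[i] corresponds to a
    # payload field 100a-100f and dest_addrs[i] to an address field 104a-104f.
    payloads: list      # encoded chunks Chunk1-Chunk6
    dest_addrs: list    # addresses of assigned storage nodes NodeA-NodeF

request0 = ConsolidatedPlacementRequest(
    payloads=[b"chunk1", b"chunk2", b"chunk3",
              b"chunk4", b"chunk5", b"chunk6"],
    dest_addrs=["NodeA", "NodeB", "NodeC", "NodeD", "NodeE", "NodeF"],
)
```

The whole structure can then travel to the end of row switch over a single connection, rather than one connection per chunk.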
The destination addresses of the destination address fields 104a-104f may be in the form of TCP/IP network addresses, for example. However, unlike the prior system depicted in
The consolidated placement request logic 110 (
As described above, the consolidated placement Request0 has a payload (payload fields 100a-100f) containing erasure encoded chunks (Chunk1-Chunk6, respectively) for storage in sets of storage nodes (storage NodeA-NodeF, respectively), as identified by associated destination address fields (104a-104f, respectively). Thus, instead of six individual data placement requests for six separate encoded chunks transmitted in six individual TCP/IP connections as described in connection with the prior system of
As explained in greater detail below in connection with
In the illustrated embodiment, the storage network depicted in
In the illustrated embodiment, the client NodeC is configured to generate the initial consolidated data placement Request0. However, it is appreciated that such request generation may be performed by other nodes of the storage system. For example, one or more of the logic of the top of rack SwitchC, or the end of row SwitchE, for example, may be configured to generate an initial consolidated data placement, such as the initial consolidated data placement Request0.
It is appreciated herein that, from the perspective of the data center network topology, two factors in an EC(k, m) encoded redundancy scheme include an intra-rack EC chunk factor r_i and an inter-rack EC chunk factor R. The factor r_i describes how many EC encoded chunks are placed in the same rack i, whereas the factor R describes how many storage racks are holding these chunks. For a data storage system employing an EC(k, m) encoding scheme, the following holds true:

k + m = Σ r_i, where 1 ≤ i ≤ R.
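As a worked check of this identity, assume for illustration that the six-chunk layout described above arises from EC(4, 2) placed across two racks of three nodes each, so that R = 2 and r_1 = r_2 = 3:

```python
# Hypothetical EC(4, 2) layout: six encoded chunks across two storage racks.
placement = {
    "Rack1": ["NodeA", "NodeB", "NodeC"],   # r_1 = 3 chunks in rack 1
    "Rack2": ["NodeD", "NodeE", "NodeF"],   # r_2 = 3 chunks in rack 2
}
k, m = 4, 2
R = len(placement)                                  # inter-rack factor: 2 racks
r = [len(nodes) for nodes in placement.values()]    # intra-rack factors r_i
assert sum(r) == k + m                              # 3 + 3 == 6 encoded chunks
```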
In the embodiment of
In one aspect of the present description, the inter-rack request generation logic 150 of the end of row SwitchE is configured to split encoded chunks from the consolidated multi-chunk placement Request0, and repackage them in a particular distributed multi-chunk data placement request such as the Request0A, as a function of the assigned storage nodes in which the encoded chunks of the consolidated multi-chunk placement Request0 are to be placed. In the example of
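A minimal sketch of this inter-rack split follows, assuming a hypothetical rack_of() lookup that maps a storage node address to the top of rack switch serving it; the request type simply pairs chunks with node addresses, as in the earlier sketches.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class MultiChunkRequest:          # same shape as Request0 or Request0A
    payloads: list = field(default_factory=list)
    dest_addrs: list = field(default_factory=list)

def split_by_rack(req: MultiChunkRequest, rack_of) -> dict:
    """Regroup a consolidated request into one distributed multi-chunk
    request per rack, keyed by that rack's top of rack switch."""
    per_rack = defaultdict(MultiChunkRequest)
    for chunk, addr in zip(req.payloads, req.dest_addrs):
        sub = per_rack[rack_of(addr)]     # r_i chunks accumulate for rack i
        sub.payloads.append(chunk)
        sub.dest_addrs.append(addr)
    return dict(per_rack)                 # e.g. {SwitchA: Request0A, ...}
```

The end of row switch would then transmit each resulting distributed multi-chunk request to its top of rack switch over a single connection.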
The inter-rack request generation logic 150 is further configured to determine (block 166,
In the example of
Thus, instead of three individual data placement requests for three separate encoded chunks transmitted in three individual TCP/IP connections between the end of row switch4 (
As explained in greater detail below in connection with
In the embodiment of
In one aspect of the present description, the intra-rack request generation logic 204 of the top of rack SwitchA is configured to split an encoded chunk from the distributed multi-chunk placement Request0A, and repackage it in a particular data chunk placement request such as the Request0A1, as a function of the assigned storage node in which the encoded chunk of the distributed multi-chunk placement Request0A is to be placed. In the example of
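A corresponding sketch of the intra-rack step is below; send_to() is a hypothetical stand-in for the switch's transmit path, and the request shapes match the earlier sketches.

```python
def fan_out_to_nodes(payloads, dest_addrs, send_to):
    """Unpack a distributed multi-chunk request into one single-chunk
    placement request per assigned storage node in the rack."""
    for chunk, addr in zip(payloads, dest_addrs):
        # One data chunk placement request (e.g. Request0A1) per node,
        # each carried on its own connection to its assigned node.
        send_to(addr, {"dest_addr": addr, "payload": chunk})
```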
The intra-rack request generation logic 204 is further configured to determine (block 218,
In the embodiment of
Intra-rack acknowledgment consolidation logic 222 (
In one embodiment, a storage node which receives a data chunk placement request addressed to it by a storage node address field, such as the address field 104a of a data chunk placement Request0A1, for example, in a TCP/IP connection such as the connection 214a, for example, may assume that it is the assigned storage node of the received data chunk placement request. In other embodiments, the receiving storage node may confirm that it is the assigned storage node of the received data chunk placement request by inspecting the storage node address field 104a of a received data chunk placement Request0A1, for example, and comparing the assigned address to its own address.
Upon successfully storing the data Chunk1 contained in the payload field 100a of the received data chunk placement Request0A1, placement acknowledgement generation logic 262 (
Referring to
Thus, in the embodiment of
Similarly, in the embodiment of
Thus, instead of three individual chunk placement acknowledgements separately acknowledging three separate encoded chunk placements and transmitted in three individual TCP/IP connections between the top of rack Switch1 (
In another aspect, the end of row SwitchE has inter-rack acknowledgment consolidation logic 170 (
Thus, in the embodiment of
The inter-rack acknowledgment consolidation logic 170 (
Thus, instead of six individual chunk placement acknowledgements separately acknowledging six separate encoded chunk placements and transmitted in six individual TCP/IP connections between the end of row Switch4 (
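The same consolidation pattern applies at both levels: a top of rack switch collapses the r_i chunk placement acknowledgements from its rack into one multi-chunk acknowledgement, and the end of row switch collapses the R multi-chunk acknowledgements into the single consolidated acknowledgement returned to the client. The sketch below is a minimal illustration of this pattern, with a hypothetical send_upstream callback.

```python
class AckConsolidator:
    """Collects a known number of acknowledgements and emits one upstream."""

    def __init__(self, expected: int, send_upstream):
        self.expected = expected        # r_i at a ToR switch, R at the EoR
        self.received = []
        self.send_upstream = send_upstream

    def on_ack(self, ack):
        self.received.append(ack)
        if len(self.received) == self.expected:
            # All placements below this switch are confirmed; acknowledge
            # them with a single consolidated message upstream.
            self.send_upstream({"status": "stored", "acks": self.received})

# Chaining consolidators mirrors the hierarchy: each ToR switch reports
# into the EoR switch, which answers the client with one consolidated ack.
client_acks = []
eor = AckConsolidator(expected=2, send_upstream=client_acks.append)  # R = 2
tor_a = AckConsolidator(expected=3, send_upstream=eor.on_ack)        # r_1 = 3
tor_b = AckConsolidator(expected=3, send_upstream=eor.on_ack)        # r_2 = 3
for chunk in ("Chunk1", "Chunk2", "Chunk3"):
    tor_a.on_ack({"chunk": chunk})
for chunk in ("Chunk4", "Chunk5", "Chunk6"):
    tor_b.on_ack({"chunk": chunk})
assert len(client_acks) == 1    # the client sees one consolidated Ack0
```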
As shown in
Such components in accordance with embodiments described herein can be used either in stand-alone memory components, or can be embedded in microprocessors and/or digital signal processors (DSPs). Additionally, it is noted that although systems and processes are described herein primarily with reference to microprocessor based systems in the illustrative examples, it will be appreciated that in view of the disclosure herein, certain aspects, architectures, and principles of the disclosure are equally applicable to other types of device memory and logic devices.
Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Thus, embodiments include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Operations described herein are performed by logic which is configured to perform the operations either automatically or substantially automatically, with little or no system operator intervention, except where indicated as being performed manually, such as by user selection. Thus, as used herein, the term “automatic” includes fully automatic, that is, operations performed by one or more hardware or software controlled machines with no human intervention such as user inputs to a graphical user selection interface. As used herein, the term “automatic” further includes predominantly automatic, that is, most of the operations (such as greater than 50%, for example) are performed by one or more hardware or software controlled machines with no human intervention such as user inputs to a graphical user selection interface, and the remainder of the operations (less than 50%, for example) are performed manually, that is, with human intervention such as user inputs to a graphical user selection interface to direct the performance of the operations.
Many of the functional elements described in this specification have been labeled as “logic,” in order to more particularly emphasize their implementation independence. For example, a logic element may be implemented as a hardware circuit comprising custom Very Large Scale Integrated (VLSI) circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A logic element may also be implemented in firmware or programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.
A logic element may also be implemented in software for execution by various types of processors. A logic element which includes executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified logic element need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the logic element and achieve the stated purpose for the logic element.
Indeed, executable code for a logic element may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, among different processors, and across several non-volatile memory devices. Similarly, operational data may be identified and illustrated herein within logic elements, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices.
The microprocessor 320 includes a cache 325 that may be part of a memory hierarchy to store instructions and data, and the system memory may include both volatile memory as well as the memory 340 depicted which may include a non-volatile memory. The system memory may also be part of the memory hierarchy. Logic 327 of the microprocessor 320 may include one or more cores, for example. Communication between the microprocessor 320 and the memory 340 may be facilitated by the memory controller (or chipset) 330, which may also facilitate in communicating with the storage drive 344 and the peripheral components 350. The system may include an offload data transfer engine for direct memory data transfers.
Storage drive 344 includes non-volatile storage and may be implemented as, for example, solid-state drives, magnetic disk drives, optical disk drives, a storage area network (SAN), network attached storage (NAS), a tape drive, flash memory, persistent memory domains and other storage devices employing a volatile buffer memory and a nonvolatile storage memory. The storage may comprise an internal storage device or an attached or network accessible storage. The microprocessor 320 is configured to write data in and read data from the memory 340. Programs in the storage are loaded into the memory 340 and executed by the microprocessor 320. A network controller or adapter enables communication with a network, such as an Ethernet, a Fibre Channel Arbitrated Loop, etc. Further, the architecture may, in certain embodiments, include a video controller configured to render information on a display monitor, where the video controller may be embodied on a video card or integrated on integrated circuit components mounted on a motherboard or other substrate. An input device is used to provide user input to the microprocessor 320, and may include a keyboard, mouse, pen-stylus, microphone, touch sensitive display screen, input pins, sockets, or any other activation or input mechanism known in the art. An output device is capable of rendering information transmitted from the microprocessor 320, or other component, such as a display monitor, printer, storage, output pins, sockets, etc. The network adapter may be embodied on a network card, such as a peripheral component interconnect (PCI) card, PCI-express, or some other input/output (I/O) card, or on integrated circuit components mounted on a motherboard or other substrate.
One or more of the components of the device 310 may be omitted, depending upon the particular application. For example, a network router may lack a video controller. Any one or more of the devices of
Example 1 is an apparatus for use with a hierarchical communication network of a storage system having a plurality of storage nodes configured to store data, comprising:
a first hierarchical switch at a first hierarchical level in the hierarchical communication network of the storage system, the first hierarchical switch having intra-rack request generation logic configured to detect receipt of a first distributed multi-chunk data placement request to place first storage data in a first set of storage nodes of the storage system, and in response to the first distributed multi-chunk data placement request, generate and transmit a first set of data chunk placement requests to assigned storage nodes of the first set of storage nodes, wherein each data chunk placement request is a request to place an individual erasure encoded chunk of data of the first storage data, in an assigned storage node of the first set of storage nodes.
In Example 2, the subject matter of Examples 1-8 (excluding the present Example) can optionally include:
a second hierarchical switch at the first hierarchical level in the hierarchical communication network of the storage system, the second hierarchical switch having intra-rack request generation logic configured to detect receipt of a second distributed multi-chunk data placement request to place second storage data in a second set of storage nodes of the storage system, and in response to the second distributed multi-chunk data placement request, generate and transmit a second set of data chunk placement requests to assigned storage nodes of the second set of storage nodes, wherein each data chunk placement request of the second set is a request to place an individual erasure encoded chunk of data of the second storage data, in an assigned storage node of the second set of storage nodes.
In Example 3, the subject matter of Examples 1-8 (excluding the present Example) can optionally include wherein the first hierarchical level of the first and second hierarchical switches is at a lower hierarchical level as compared to a second hierarchical level of the hierarchical communication network of the storage system, the apparatus further comprising:
a third hierarchical switch at the second hierarchical level, the third hierarchical switch having inter-rack request generation logic configured to detect a consolidated multi-chunk placement request to place storage data including the first and second storage data, in storage in a set of storage nodes of the storage system including the first and second sets of storage nodes, and in response to the consolidated multi-chunk placement request, generate and transmit the first distributed multi-chunk data placement request to the first hierarchical switch to place the first storage data in the first set of storage nodes of the storage system, and the second distributed multi-chunk data placement request to the second hierarchical switch to place the second storage data in the second set of storage nodes of the storage system.
In Example 4, the subject matter of Examples 1-8 (excluding the present Example) can optionally include:
a client node coupled to the third hierarchical switch, and having consolidated placement request logic configured to receive storage data including the first and second storage data, erasure encode the received data into erasure encoded chunks of the first and second storage data, at least some of which include parity data, and generate and transmit to the third hierarchical switch, the consolidated multi-chunk placement request having a payload of erasure encoded chunks of the first and second storage data, for storage in the first and second sets of storage nodes, respectively.
In Example 5, the subject matter of Examples 1-8 (excluding the present Example) can optionally include:
a first storage rack having the first set of storage nodes, each storage node of the first set having chunk placement logic configured to, in response to a data chunk placement request of the first set of data chunk placement requests, received by the assigned storage node of the first set of storage nodes, store an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data, each storage node of the first set further having placement acknowledgement generation logic configured to send to the first hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data,
a second storage rack having the second set of storage nodes, each storage node of the second set having chunk placement logic configured to, in response to a data chunk placement request of the second set of data chunk placement requests, received by the assigned storage node of the second set of storage nodes, store an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data, each storage node of the second set further having placement acknowledgement generation logic configured to send to the second hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data,
wherein the second hierarchical switch has intra-rack acknowledgement consolidation logic configured to receive a second plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data in an assigned storage node of the second set of storage nodes, and generate and transmit to the third switch in response to receipt of the second plurality of data chunk placement acknowledgments, a second multi-chunk data placement acknowledgement acknowledging storage of the second storage data in the second set of storage nodes of the storage system.
In Example 6, the subject matter of Examples 1-8 (excluding the present Example) can optionally include wherein the third hierarchical switch has inter-rack acknowledgment consolidation logic configured to receive the first and second multi-chunk data placement acknowledgements and generate and transmit to the client node of the storage system, a consolidated multi-chunk placement acknowledgment acknowledging storage of the first and second storage data in the first and second sets, respectively, of storage nodes of the storage system.
In Example 7, the subject matter of Examples 1-8 (excluding the present Example) can optionally include wherein the intra-rack request generation logic is further configured to erasure encode the first storage data of the first distributed multi-chunk data placement request received by the first hierarchical switch, to the erasure encoded chunks of data for the first set of data chunk placement requests to place the individual erasure encoded chunks of data of the first storage data, in assigned storage nodes of the first set of storage nodes.
In Example 8, the subject matter of Examples 1-8 (excluding the present Example) can optionally include said storage system having said hierarchical communication network.
Example 9 is a method, comprising:
detecting by a first hierarchical switch at a first hierarchical level in a hierarchical communication network of a storage system, a first distributed multi-chunk data placement request to place first storage data in a first set of storage nodes of the storage system, and
in response to the first distributed multi-chunk data placement request, transmitting by the first hierarchical switch, a first set of data chunk placement requests to assigned storage nodes of the first set of storage nodes, wherein each data chunk placement request is a request to place an individual erasure encoded chunk of data of the first storage data, in an assigned storage node of the first set of storage nodes.
In Example 10, the subject matter of Examples 9-15 (excluding the present Example) can optionally include:
detecting by a second hierarchical switch at the first hierarchical level in a hierarchical communication network of a storage system, a second distributed multi-chunk data placement request to place second storage data in a second set of storage nodes of the storage system, and
in response to the second distributed multi-chunk data placement request, transmitting by the second hierarchical switch, a second set of data chunk placement requests to assigned storage nodes of the second set of storage nodes, wherein each data chunk placement request of the second set of data chunk placement requests is a request to place an individual erasure encoded chunk of data of the second storage data, in an assigned storage node of the second set of storage nodes.
In Example 11, the subject matter of Examples 9-15 (excluding the present Example) can optionally include wherein the first hierarchical level of the first and second hierarchical switches is at a lower hierarchical level as compared to a second hierarchical level of a third hierarchical switch in the hierarchical communication network of the storage system, the method further comprising:
detecting by the third hierarchical switch, a consolidated multi-chunk placement request to place storage data including the first and second storage data, in storage in a set of storage nodes of the storage system including the first and second sets of storage nodes,
in response to the consolidated multi-chunk placement request, transmitting by the third hierarchical switch, the first distributed multi-chunk data placement request to the first hierarchical switch to place the first storage data in the first set of storage nodes of the storage system, and
in response to the consolidated multi-chunk placement request, transmitting by the third hierarchical switch, the second distributed multi-chunk data placement request to the second hierarchical switch to place the second storage data in the second set of storage nodes of the storage system.
In Example 12, the subject matter of Examples 9-15 (excluding the present Example) can optionally include:
receiving by a client node of the storage system for storage in the storage system, storage data including the first and second storage data,
erasure encoding the received data into erasure encoded chunks of the first and second storage data, at least some of which include parity data, and
transmitting to the third hierarchical switch, the consolidated multi-chunk placement request having a payload of erasure encoded chunks of the first and second storage data, for storage in the first and second sets of storage nodes, respectively.
In Example 13, the subject matter of Examples 9-15 (excluding the present Example) can optionally include:
in response to each data chunk placement request of the first set of data chunk placement requests, an assigned storage node of the first set of storage nodes storing an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data, and sending to the first hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data,
receiving by the first hierarchical switch, a first plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data in an assigned storage node of the first set of storage nodes, transmitting by the first hierarchical switch to the third switch in response to receipt of the first plurality of data chunk placement acknowledgments, a first multi-chunk data placement acknowledgement acknowledging storage of the first storage data in the first set of storage nodes of the storage system,
in response to each data chunk placement request of the second set of data chunk placement requests, an assigned storage node of the second set of storage nodes storing an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data, and sending to the second hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data,
receiving by the second hierarchical switch, a second plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment of the second set acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data in an assigned storage node of the second set of storage nodes, and
transmitting by the second hierarchical switch to the third switch in response to receipt of the second plurality of data chunk placement acknowledgments, a second multi-chunk data placement acknowledgement acknowledging storage of the second storage data in the second set of storage nodes of the storage system.
In Example 14, the subject matter of Examples 9-15 (excluding the present Example) can optionally include receiving by the third hierarchical switch, the first and second multi-chunk data placement acknowledgements and transmitting to a client node of the storage system, a consolidated multi-chunk placement acknowledgment acknowledging storage of the first and second storage data in the first and second sets, respectively, of storage nodes of the storage system.
In Example 15, the subject matter of Examples 9-15 (excluding the present Example) can optionally include erasure encoding by the first hierarchical switch, the first storage data of the first distributed multi-chunk data placement request received by the first hierarchical switch, to the erasure encoded chunks of data for the first set of data chunk placement requests to place the individual erasure encoded chunks of data of the first storage data, in assigned storage nodes of the first set of storage nodes.
Example 16 is an apparatus comprising means to perform a method as claimed in any preceding Example.
Example 17 is a storage system, comprising:
a hierarchical communication network having a plurality of storage nodes configured to store data, the network comprising:
a first hierarchical switch at a first hierarchical level in the hierarchical communication network of the storage system, the first hierarchical switch having intra-rack request generation logic configured to detect receipt of a first distributed multi-chunk data placement request to place first storage data in a first set of storage nodes of the storage system, and in response to the first distributed multi-chunk data placement request, generate and transmit a first set of data chunk placement requests to assigned storage nodes of the first set of storage nodes, wherein each data chunk placement request is a request to place an individual erasure encoded chunk of data of the first storage data, in an assigned storage node of the first set of storage nodes.
In Example 18, the subject matter of Examples 17-24 (excluding the present Example) can optionally include:
a second hierarchical switch at the first hierarchical level in the hierarchical communication network of the storage system, the second hierarchical switch having intra-rack request generation logic configured to detect receipt of a second distributed multi-chunk data placement request to place second storage data in a second set of storage nodes of the storage system, and in response to the second distributed multi-chunk data placement request, generate and transmit a second set of data chunk placement requests to assigned storage nodes of the second set of storage nodes, wherein each data chunk placement request of the second set is a request to place an individual erasure encoded chunk of data of the second storage data, in an assigned storage node of the second set of storage nodes.
In Example 19, the subject matter of Examples 17-24 (excluding the present Example) can optionally include wherein the first hierarchical level of the first and second hierarchical switches is at a lower hierarchical level as compared to a second hierarchical level of the hierarchical communication network of the storage system, the system further comprising:
a third hierarchical switch at the second hierarchical level, the third hierarchical switch having inter-rack request generation logic configured to detect a consolidated multi-chunk placement request to place storage data including the first and second storage data, in storage in a set of storage nodes of the storage system including the first and second sets of storage nodes, and in response to the consolidated multi-chunk placement request, generate and transmit the first distributed multi-chunk data placement request to the first hierarchical switch to place the first storage data in the first set of storage nodes of the storage system, and the second distributed multi-chunk data placement request to the second hierarchical switch to place the second storage data in the second set of storage nodes of the storage system.
In Example 20, the subject matter of Examples 17-24 (excluding the present Example) can optionally include:
a client node coupled to the third hierarchical switch, and having consolidated placement request logic configured to receive storage data including the first and second storage data, erasure encode the received data into erasure encoded chunks of the first and second storage data, at least some of which include parity data, and generate and transmit to the third hierarchical switch, the consolidated multi-chunk placement request having a payload of erasure encoded chunks of the first and second storage data, for storage in the first and second sets of storage nodes, respectively.
In Example 21, the subject matter of Examples 17-24 (excluding the present Example) can optionally include:
a first storage rack having the first set of storage nodes, each storage node of the first set having chunk placement logic configured to, in response to a data chunk placement request of the first set of data chunk placement requests, received by the assigned storage node of the first set of storage nodes, store an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data, each storage node of the first set further having placement acknowledgement generation logic configured to send to the first hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data,
wherein the first hierarchical switch has intra-rack acknowledgement consolidation logic configured to receive a first plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data in an assigned storage node of the first set of storage nodes, and generate and transmit to the third switch in response to receipt of the first plurality of data chunk placement acknowledgments, a first multi-chunk data placement acknowledgement acknowledging storage of the first storage data in the first set of storage nodes of the storage system,
a second storage rack having the second set of storage nodes, each storage node of the second set having chunk placement logic configured to, in response to a data chunk placement request of the second set of data chunk placement requests, received by the assigned storage node of the second set of storage nodes, store an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data, each storage node of the second set further having placement acknowledgement generation logic configured to send to the second hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data,
wherein the second hierarchical switch has intra-rack acknowledgement consolidation logic configured to receive a second plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data in an assigned storage node of the second set of storage nodes, and generate and transmit to the third switch in response to receipt of the second plurality of data chunk placement acknowledgments, a second multi-chunk data placement acknowledgement acknowledging storage of the second storage data in the second set of storage nodes of the storage system.
In Example 22, the subject matter of Examples 17-24 (excluding the present Example) can optionally include wherein the third hierarchical switch has inter-rack acknowledgment consolidation logic configured to receive the first and second multi-chunk data placement acknowledgements and generate and transmit to the client node of the storage system, a consolidated multi-chunk placement acknowledgment acknowledging storage of the first and second storage data in the first and second sets, respectively, of storage nodes of the storage system.
In Example 23, the subject matter of Examples 17-24 (excluding the present Example) can optionally include wherein the intra-rack request generation logic is further configured to erasure encode the first storage data of the first distributed multi-chunk data placement request received by the first hierarchical switch, to the erasure encoded chunks of data for the first set of data chunk placement requests to place the individual erasure encoded chunks of data of the first storage data, in assigned storage nodes of the first set of storage nodes.
In Example 24, the subject matter of Examples 17-24 (excluding the present Example) can optionally include a display communicatively coupled to the switch.
Example 25 is an apparatus for use with a hierarchical communication network of a storage system having a plurality of storage nodes configured to store data, comprising:
a first hierarchical switch at a first hierarchical level in the hierarchical communication network of the storage system, the first hierarchical switch having intra-rack request generation logic means configured for detecting receipt of a first distributed multi-chunk data placement request to place first storage data in a first set of storage nodes of the storage system, and in response to the first distributed multi-chunk data placement request, generating and transmitting a first set of data chunk placement requests to assigned storage nodes of the first set of storage nodes, wherein each data chunk placement request is a request to place an individual erasure encoded chunk of data of the first storage data, in an assigned storage node of the first set of storage nodes.
In Example 26, the subject matter of Examples 25-31 (excluding the present Example) can optionally include:
a second hierarchical switch at the first hierarchical level in the hierarchical communication network of the storage system, the second hierarchical switch having intra-rack request generation logic means configured for detecting receipt of a second distributed multi-chunk data placement request to place second storage data in a second set of storage nodes of the storage system, and in response to the second distributed multi-chunk data placement request, generating and transmitting a second set of data chunk placement requests to assigned storage nodes of the second set of storage nodes, wherein each data chunk placement request of the second set is a request to place an individual erasure encoded chunk of data of the second storage data, in an assigned storage node of the second set of storage nodes.
In Example 27, the subject matter of Examples 25-31 (excluding the present Example) can optionally include wherein the first hierarchical level of the first and second hierarchical switches is at a lower hierarchical level as compared to a second hierarchical level of the hierarchical communication network of the storage system, the apparatus further comprising:
a third hierarchical switch at the second hierarchical level, the third hierarchical switch having inter-rack request generation logic means configured for detecting a consolidated multi-chunk placement request to place storage data including the first and second storage data, in storage in a set of storage nodes of the storage system including the first and second sets of storage nodes, and in response to the consolidated multi-chunk placement request, generating and transmitting the first distributed multi-chunk data placement request to the first hierarchical switch to place the first storage data in the first set of storage nodes of the storage system, and the second distributed multi-chunk data placement request to the second hierarchical switch to place the second storage data in the second set of storage nodes of the storage system.
In Example 28, the subject matter of Examples 25-31 (excluding the present Example) can optionally include:
a client node coupled to the third hierarchical switch, and having consolidated placement request logic means configured for receiving storage data including the first and second storage data, erasure encoding the received data into erasure encoded chunks of the first and second storage data, at least some of which include parity data, and generating and transmitting to the third hierarchical switch, the consolidated multi-chunk placement request having a payload of erasure encoded chunks of the first and second storage data, for storage in the first and second sets of storage nodes, respectively.
In Example 29, the subject matter of Examples 25-31 (excluding the present Example) can optionally include:
a first storage rack having the first set of storage nodes, each storage node of the first set having chunk placement logic means configured for, in response to a data chunk placement request of the first set of data chunk placement requests, received by the assigned storage node of the first set of storage nodes, storing an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data, each storage node of the first set further having placement acknowledgement generation logic means configured for sending to the first hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data,
wherein the first hierarchical switch has intra-rack acknowledgement consolidation logic means configured for receiving a first plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the first set of erasure encoded chunks of data in an assigned storage node of the first set of storage nodes, and generating and transmitting to the third switch in response to receipt of the first plurality of data chunk placement acknowledgments, a first multi-chunk data placement acknowledgement acknowledging storage of the first storage data in the first set of storage nodes of the storage system,
a second storage rack having the second set of storage nodes, each storage node of the second set having chunk placement logic means configured for, in response to a data chunk placement request of the second set of data chunk placement requests, received by the assigned storage node of the second set of storage nodes, storing an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data, each storage node of the second set further having placement acknowledgement generation logic means configured for sending to the second hierarchical switch a data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data,
wherein the second hierarchical switch has intra-rack acknowledgement consolidation logic means configured for receiving a second plurality of data chunk placement acknowledgments, each data chunk placement acknowledgment acknowledging storage of an individual erasure encoded chunk of data of the second set of erasure encoded chunks of data in an assigned storage node of the second set of storage nodes, and generating and transmitting to the third switch in response to receipt of the second plurality of data chunk placement acknowledgments, a second multi-chunk data placement acknowledgement acknowledging storage of the second storage data in the second set of storage nodes of the storage system.
In Example 30, the subject matter of Examples 25-31 (excluding the present Example) can optionally include wherein the third hierarchical switch has inter-rack acknowledgment consolidation logic means configured for receiving the first and second multi-chunk data placement acknowledgements and generating and transmitting to the client node of the storage system, a consolidated multi-chunk placement acknowledgment acknowledging storage of the first and second storage data in the first and second sets, respectively, of storage nodes of the storage system.
In Example 31, the subject matter of Examples 25-31 (excluding the present Example) can optionally include wherein the intra-rack request generation logic means is further configured for erasure encoding the first storage data of the first distributed multi-chunk data placement request received by the first hierarchical switch, to the erasure encoded chunks of data for the first set of data chunk placement requests to place the individual erasure encoded chunks of data of the first storage data, in assigned storage nodes of the first set of storage nodes.
Example 32 is a machine-readable storage including machine-readable instructions, when executed, to implement a method or realize an apparatus as claimed in preceding Examples 1-31.
The described operations may be implemented as a method, apparatus or computer program product using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The described operations may be implemented as computer program code maintained in a “computer readable storage medium”, where a processor may read and execute the code from the computer readable storage medium. The computer readable storage medium includes at least one of electronic circuitry, storage materials, inorganic materials, organic materials, biological materials, a casing, a housing, a coating, and hardware. A computer readable storage medium may comprise, but is not limited to, a magnetic storage medium (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, DVDs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, Flash Memory, firmware, programmable logic, etc.), Solid State Devices (SSD), etc. The code implementing the described operations may further be implemented in hardware logic implemented in a hardware device (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.). Still further, the code implementing the described operations may be implemented in “transmission signals”, where transmission signals may propagate through space or through a transmission media, such as an optical fiber, copper wire, etc. The transmission signals in which the code or logic is encoded may further comprise a wireless signal, satellite transmission, radio waves, infrared signals, Bluetooth, etc. The program code embedded on a computer readable storage medium may be transmitted as transmission signals from a transmitting station or computer to a receiving station or computer. A computer readable storage medium is not comprised solely of transmission signals. Those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the present description, and that the article of manufacture may comprise any tangible information bearing medium known in the art.
In certain applications, a device in accordance with the present description, may be embodied in a computer system including a video controller to render information to display on a monitor or other display coupled to the computer system, a device driver and a network controller, such as a computer system comprising a desktop, workstation, server, mainframe, laptop, handheld computer, etc. Alternatively, the device embodiments may be embodied in a computing device that does not include, for example, a video controller, such as a switch, router, etc., or does not include a network controller, for example.
The illustrated logic of figures may show certain events occurring in a certain order. In alternative embodiments, certain operations may be performed in a different order, modified or removed. Moreover, operations may be added to the above described logic and still conform to the described embodiments. Further, operations described herein may occur sequentially or certain operations may be processed in parallel. Yet further, operations may be performed by a single processing unit or by distributed processing units.
The foregoing description of various embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Many modifications and variations are possible in light of the above teaching.