Embodiments of the invention relate to the fields of computing, storage and networking, and in particular to the interrelationships among them. Additionally, the invention relates to data centers.
Present systems in data centers consist of discrete computing, storage and networking functions. Each of these functions is separate in both the hardware that is used to build it and the software that runs on it. Status quo systems circa 2013 have limited utility due to the way that the functions themselves have been crafted.
One major drawback of present systems is that they consume enormously more power than the embodiments of the invention contained herein. This excess power consumption is a direct result of the separation of each of the elements of computing, storage and networking. In particular, since each of these elements is essentially contained within its own “box”, one can easily see the multiplicative consequences of added power supplies and components. Power is a large motivator in modern data centers because the data centers often consume AC wall power on the order of tens of megawatts. Thus, a ten percent reduction in power is considered of enormous value to the industry.
Power is not the only factor that data centers care about. Because the functions of computing, storage and networking in prior art are all rendered as effectively independent entities, a vast amount of three dimensional space is required to house these systems. This three dimensional size results in large purpose built buildings being crafted at substantially larger cost per square foot than conventional commercial construction.
Present data center systems also suffer enormous performance loss because of the aforementioned separation. To illuminate this point more fully, consider the fact that a typical hard disk in a typical data center storage server has a SAS interface, which in turn must be converted to Fibre Channel, then from Fibre Channel to PCIe, then from PCIe to Ethernet, and then from Ethernet to the CPU that operates the storage server. This chain introduces a large number of inefficiencies into the system, and it represents a simple example of the limitations that are inherent in the way that present data centers operate.
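Purely as an illustrative aid, and not as a description of any actual product, the following Python sketch tallies the interface crossings named above; the per-hop latency figures are invented placeholders chosen only to show how the conversions compound:

```python
# Hypothetical per-hop costs for the interface chain described above.
# The microsecond values are placeholders, not measurements.
HOPS = [
    ("SAS disk -> SAS HBA", 5.0),
    ("SAS -> Fibre Channel", 8.0),
    ("Fibre Channel -> PCIe", 6.0),
    ("PCIe -> Ethernet NIC", 6.0),
    ("Ethernet -> storage-server CPU", 10.0),
]

total = 0.0
for name, microseconds in HOPS:
    total += microseconds
    print(f"{name:32s} +{microseconds:5.1f} us (running total {total:5.1f} us)")
print(f"{len(HOPS)} conversions occur before the CPU ever sees the data")
```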
Another distinguishing factor of prior art data center systems over the present invention is that they offer little to no purpose built intelligence. This means that any particular operation, such as mining the storage system for some particular piece of information, is effectively accomplished by offloading the work to another agent. For example, a CPU server must perform all of the gathering of data from the storage system, and then calculate the intended result before it is sent across the network system. This represents a huge misuse of the abilities of each system, as will be seen in the context of the present invention.
When one looks at the status quo, one cannot help but realize there is an enormous need within the industry for serious change in the way computing, storage and networking systems are built to craft data centers. The impressive reduction in system cost of the embodiments contained herein is itself sufficient motivation for change, and the performance capabilities that the embodiments allow provide further motivation still. In the end, the present invention's new capabilities are what will provide a justification to move away from status quo systems to those of the present invention instead.
Given the deficiencies in current data center systems, what is needed are apparatus, methods, and systems to address these deficiencies. The following description and claims will demonstrate vastly superior methods and apparatus to directly address these needs while simultaneously preparing for future changes within the data center.
The present disclosure demonstrates to those skilled in the art how to make a much better data center. In particular, an apparatus is disclosed which directly couples computing, storage and networking elements together wherein each of those elements has its own local and highly efficient intelligence.
One of the major deficiencies in present systems is the sheer number of times that the physical interface changes to get or put computed information. For example, when one reads a granted patent from the United States Patent and Trademark Office, information must be fetched across a network in response to a request from the viewer's computer. If one breaks down all that goes on in that seemingly simple process, one will find literally tens of interfaces that have to be crossed. Each of these interfaces typically has a set of physical attributes associated with it, along with a set of software routines which must be run to effect the transfer. When considering the ‘cost’ of this example transaction, one has to look at how much power, time and physical volume that transaction consumes. Since almost no single party pays the entirety of those costs, it is not at all obvious that these costs are indeed high. When one adds the fact that systems have been made in much the same way since the dawn of computing, a form of confirmation bias is introduced into the thinking of those who design and manufacture such systems. The result is that present systems are designed in substantially the same way as systems from the 1960s.
The present invention described in detail below begins by re-imagining how a data center can be built using modern technology. This includes using the most modern possible CPUs, memory, storage, connectors, and other physical attributes of hardware. However, the embodiments also seek to alter how each of these elements is connected to the others in order to provide the most direct connection possible. The embodiments then further augment the resultant system with intelligence which is both localized to the element, and yet scalable and accessible as a vastly larger system. Lastly, the present invention intentionally re-imagines how software can be altered to take advantage of the direct connections and substantially altered data flow.
In general, the present invention can be thought of as a unification of three functions of the modern data center: Computing, Storage and Networking. The embodiments directly couple each of these three functions together with the minimum possible number of interfaces. This is then augmented by adding intelligence to each function to permit that function to operate with an expanded role. For example, adding local intelligence to the storage system permits data mining directly on the storage instead of across a network on some distant computing element. Because of the structure of the network function, and the fact that it is directly coupled, the resultant final system is highly scalable. In fact, it was a design intent to ensure that the system was scalable from single instances to many hundreds of thousands of instances.
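As a thought aid only, the following Python sketch contrasts the prior art pattern, in which all records cross the network before a distant CPU filters them, with filtering on the storage element's local intelligence; the record set, names and counts are hypothetical and do not describe any actual embodiment:

```python
# Hypothetical dataset standing in for records held on a storage element.
RECORDS = [{"id": i, "size": i * 100} for i in range(100_000)]

def prior_art_query(predicate):
    shipped = list(RECORDS)            # every record crosses the network
    return [r for r in shipped if predicate(r)], len(shipped)

def pushdown_query(predicate):
    matches = [r for r in RECORDS if predicate(r)]  # filtered at the storage
    return matches, len(matches)                    # only matches must move

if __name__ == "__main__":
    pred = lambda r: r["size"] > 9_900_000
    _, moved_prior = prior_art_query(pred)
    _, moved_new = pushdown_query(pred)
    print(f"records moved, prior art: {moved_prior}; on-storage filter: {moved_new}")
```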
A hugely distinguishing factor in the present invention is that the augmented intelligence contained within each function permits new classes of behavior to exist. Perhaps the best example of this in the present embodiment is the ability of the network to act in an intelligent way and become the “server” portion of the data center. To be clear, this ability does not exist in any present system except through emulation. The reason is that the network does not have its own intelligence; it ‘borrows’ the intelligence from the computing function. One way to think about this is that all present operating systems and file systems use the compute element to actually operate. Yes, they speak to the storage and across the network, but in the end, it is the compute function that is doing the work. In the preferred embodiments, this workload is actually shifted to the network itself. This new capability offers functionality that has never existed before: in particular, the ability to craft a ‘self-aware’ network that knows how to ‘serve’ its other functions based upon the context of the requests it has received. Examples of this new ability include having the network itself schedule the amount of time a set of CPUs will operate on a particular problem, as well as contextually switching the datasets needed in storage over to the appropriate CPU. Further, the network itself can directly manage each individual storage element (e.g. a NAND page), and is aware of and manages the idiosyncrasies of those storage elements. These kinds of capabilities will alter the definition of both operating systems and file systems as they are presently known, and will offer new synthetic capabilities that make data mining, sensor fusion and other similar kinds of problems vastly simpler to solve.
In brief summary, the present invention can be thought of as directly connecting newly intelligent agents of augmented computing, augmented storage and augmented networking together in ways that allow for direct scalability of the resulting platform while simultaneously vastly reducing power, cost and three dimensional volume, and vastly increasing the effective performance of the system. A more detailed understanding of the specific elements of the invention can be found below, and is more specifically delineated by the attached claims.
The accompanying drawings are included to provide further understanding of the invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention, and together with the description, serve to explain the principles of the invention. In the drawings:
Reference will now be made in detail to the preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be obvious to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present invention.
Prior art data centers are built from elements that collectively can be thought of as operating as a computer. Typical elements, as exampled in FIG. 1, include CPU servers 152-157 with network interface cards such as NIC 145, SAN interfaces 162-167, SAN switches 172, 174, SAN storage processor 180, and disk chassis 182-188.
Replication of the same elements over and over again does not force homogeneity of the system. Even if there are expert programmers available to program the supposedly homogenous system, the work they do is toward an application specific purpose, which relegates the system to a special purpose. Though the hardware might be more general purpose, the end effect is to craft an application specific system. In a data center that exclusively runs, say, Netflix, Inc. movie streaming, this is probably a good thing. But in a data center that wishes to run many different kinds of applications and process many different kinds of data, a purpose built system is not very effective. Note that the homogeneity is not the cause of this lack of effectiveness; the difficult-to-use programming model is.
Prior art data centers, such as those illustrated by FIG. 1, interconnect these discrete elements through a multiplicity of distinct interfaces.
Consider for example that the data center of FIG. 1 receives a request from the internet to serve a stored photo. The request enters through NIC 145 and is handed to one of CPU servers 152-157, say CPU server 155. That server in turn asks SAN storage processor 180 for the photo through one of SAN interfaces 162-167 and SAN switches 172, 174; the photo is read from one of disk chassis 182-188, travels back along the same path, and is finally sent out across the network to the original requestor.
The above illustrative example is actually greatly simplified over real life in a data center, but it demonstrates a particular degree of complexity. What is not obvious to many, including those skilled in the art, is the number of times that the requests and the data itself in that very simple transaction are reformatted, or changed in some substantial way. For example, each time an interface is crossed, the information is temporarily stored. This means that the information is also stalled, and while it waits it burns DC power. When an interface is changed, as when we transition from NIC 145 to CPU Server 155, the information is changed from Ethernet to PCIe, for example. This transition requires a reformatting of the information so that it fits the new standard. Typically this is done via encapsulation. Encapsulation can be thought of as Russian nesting dolls, where one doll fits inside another. Each layer of encapsulation adds a ‘wrapper’ doll to the information. This means a number of things. First, the data must be stored while it is being encapsulated; second, a processor of some form (hardware, or hardware and software) must add the new information which forms that particular layer of the encapsulation; third, an agent elsewhere is now responsible for unwrapping, or de-encapsulating, the information at each boundary crossing. When one counts up the number of times that this process happens on just this simple transaction, it is truly staggering. More importantly, it is incredibly wasteful of processing, bandwidth, hardware cost, latency, and DC power.
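The following sketch is offered purely as a thought aid and not as the on-wire format of any embodiment; the layer names and header sizes are commonplace illustrative values. It shows how each boundary crossing wraps the information in another "doll", costing a store and a copy, and obligating an agent at the far side to unwrap every layer:

```python
# Illustrative layer stack; names and header byte counts are examples only.
LAYERS = [("TCP", 20), ("IP", 20), ("Ethernet", 18), ("Tunnel", 38)]

def encapsulate(payload: bytes):
    copies = 0
    for _name, header_len in LAYERS:
        payload = bytes(header_len) + payload   # prepend a header: one copy
        copies += 1
    return payload, copies

def decapsulate(frame: bytes) -> bytes:
    for _name, header_len in reversed(LAYERS):
        frame = frame[header_len:]              # strip the outermost header
    return frame

if __name__ == "__main__":
    data = b"x" * 512
    frame, copies = encapsulate(data)
    assert decapsulate(frame) == data
    print(f"{copies} wrap steps add {len(frame) - len(data)} header bytes "
          f"to a {len(data)} byte payload, at every interface crossing")
```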
The modern data center is not a creation; it is an evolution. Along the way, the modern data center has adopted numerous well known standards to protect information security, authentication, and information integrity. The above typical example is so complex expressly because these systems have evolved from standards which themselves have evolved over a number of years. In many ways, this is much like a boat patched with duct tape: as each new challenge is found, more and more duct tape is applied to keep the boat functional for its intended purpose. Rarely does one re-evaluate the design of the boat to see if it could be done more efficiently.
In point of fact, the supposedly modern data center where the Power Usage Effectiveness (PUE) factor is “best”, such as in the efforts of the Open Compute consortium, is measured by a metric that poorly captures the efficiency that actually matters, because such systems fail to regard the underlying complexity as something that can be simplified. This is actually an exceptionally poor ‘redesign’ because it does not deal with why the large amounts of power were necessary in the first place. Instead, the PUE effort focuses on the financially tangible consideration of wasted power in the cooling systems. Such systems maintain the inherent software, hardware and system complexity of multiple different interfaces while merely changing the topology of the system.
Alteration of a system's topology is often very useful for specific sets of problems. However, if one does not also consider the inherent complexity within the rest of the system, vast swaths of inefficiency are left behind, and often forgotten because it is felt that it takes too much work to address those inherent complexities. This is particularly the case when those skilled in the art design modern data center components: they focus on reuse of existing standards, rarely asking whether, given the opportunity to design from scratch, they would build the same standards and interfaces. The system demonstrated herein addresses the homogeneity, topology, and complexity questions simultaneously to create new apparatus, methods and systems that are substantially more cost effective, substantially faster, substantially more capable and substantially smaller in physical volume and AC wall power consumption. The present invention also offers new capabilities that present systems lack, and which will provide benefit in ways that are as yet unforeseeable.
Typical high performance computers (HPCs) are built differently from data centers, yet they have similar kinds of functional behavior. They have a more refined and purpose built network to aid in the cooperative sharing of data between computing elements. In general, HPCs are large computers built from similar elements of computing, storage and networking. The major difference is the way that they intend to process information. In an HPC, as much information as possible is kept in the main memory of the HPC, and the storage system is generally only used to house copies of the main memory at specific points in time, known as checkpoints.
At its heart, an HPC is a purpose built computer that optimizes for the specific and unusual characteristics that work well for a particular class of problem. While this architecture is very homogenous and highly scalable, it does not possess augmented intelligences. Indeed, typical HPCs are bare bones CPU cores with main memory, and a switch fabric for interprocess communication between the nodes. HPCs would make terrible data center computers because their purposes are so vastly different. This prior art is important to note because it demonstrates the principle of simplification of the system in direct contrast to that of the data center.
Beyond that, one skilled in the art will understand that what truly makes the HPC function well is the way the network is deeply populated with very high bandwidth connections. This is in direct contrast to data centers, which use very sparse interconnection. Many erroneously believe that these interconnections are not needed. This belief is another form of the confirmation bias spoken of earlier. Distilled, when the performance of the system is below some particular threshold, the reason to have high bandwidth, deeply populated connections is not obvious. Once past that threshold, it becomes far more understandable why such connections are important. However, in present data centers, the massive complexity of the system inherently results in a lack of need for these kinds of interconnections expressly because the complexity limits the performance and utility of these connections. A slightly exaggerated analogy is the time before breaking the sound barrier was thought possible: because it was believed impossible, for the longest time no one really tried to break it. Only when one began to think beyond that barrier was the technology developed that showed the usefulness of breaking it. In the case of the high bandwidth, deeply populated network, so far only HPCs have really had the need, and have thus developed and deployed the technology. The prior art data center actively dismisses this as a necessity for its own use.
Turning away from prior art descriptions toward the present invention, the remaining figures will help to illustrate various aspects of the present invention.
The present invention is broadly illustrated in FIG. 3.
Beginning at SFP+ ports 303, a 10 GE ingress/egress means directly from the internet is made available. Similarly, QSFP ports 307 offer a 40 Gigabit Ethernet (40 GE) set of ports that provide 40 GE ingress/egress directly from the internet. The 10 GE or 40 GE ports are fed directly to the intelligent network switches 313, 317, which handle all of the work of the entire network shown in FIG. 3.
Requests for serving of content from the internet thus enter the preferred embodiment of the invention at one or more SFP+ ports 303 and/or QSFP ports 307. These requests then transit the switches 313, 317 and are sent to CPUs 342-347 to be processed. In one embodiment a single one of CPUs 342-347 receives the request. In another embodiment two or more of CPUs 342-347 receive the request. In still another embodiment all of CPUs 342-347 receive the request. The latter two embodiments can be used by the system to provide for load sharing. Importantly, CPUs 342-347 are logically equivalent to CPU servers 152, 155, 157 insofar as information processing is concerned.
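By way of illustration only, and using hypothetical names, the following sketch shows the three dispatch modes just described; a simple round-robin stands in for whatever load sharing policy an embodiment might actually employ:

```python
import itertools

CPUS = ["CPU342", "CPU343", "CPU344", "CPU345", "CPU346", "CPU347"]
_rr = itertools.cycle(CPUS)   # round-robin stand-in for a load sharing policy

def dispatch(request, mode="single", k=2):
    if mode == "single":
        targets = [next(_rr)]                    # one CPU receives the request
    elif mode == "several":
        targets = [next(_rr) for _ in range(k)]  # two or more CPUs receive it
    else:
        targets = list(CPUS)                     # all CPUs receive it
    return {cpu: request for cpu in targets}

print(dispatch("GET /photo/17", mode="single"))
print(dispatch("GET /photo/17", mode="several", k=3))
```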
CPU 323 coupled to FPGA 327 and CPU 333 coupled to FPGA 337, along with FPGA 327 coupled to NAND 329 and FPGA 337 coupled to NAND 339, together represent an intelligent permanent storage system for the present invention. Specifically, in one embodiment CPUs 323/333 obtain requests from other agents, either directly from the 10 GE or 40 GE sources, e.g. ports 303 or 307, or from the CPUs 342-347 as appropriate via either of PCIe switch 350 and/or intelligent network switches 313/317. These requests result in data being read or written, or in control information being presented to effect the same; the result is the logical equivalent of SAN storage processor 180 and disk chassis 182-188 shown in FIG. 1.
In the preferred embodiment data plane information is sent through the internal 10 GE network shown in FIG. 3, while control plane and IPC information may travel in parallel over the PCIe network via PCIe switch 350, as discussed further below.
Prior art systems are built with a plethora of interfaces. These interfaces are most often not directly compatible with one another, and thus encapsulation techniques are used to allow the information to transit each different kind of interface. The present invention is vastly simpler than prior art expressly because its conceptualization was from the basis of minimizing interfaces. Moreover, prior art nearly always uses international standards such as Ethernet, PCIe, SAS, SATA, Infiniband, and Fibre Channel. In fact, prior art systems use these differing standards as a pre-requisite because various manufacturers need to allow their small piece of equipment, such as a CPU server, a network switch or a SAN server, to interoperate with other equipment. One of the main advantages of the preferred embodiment is the elimination of as many types of interfaces, and as many transitions across the remaining interfaces, as humanly possible. The end result is a system substantially more efficient in power, volume and cost that replaces large swaths of data center equipment with a single small box. This is not about miniaturization. This is directly and intentionally about elimination of complexity.
The origination of the invention was as a result of a thought experiment about how to get rid of the sources of inefficiency within the data center. By analyzing the way data centers are built from the ground up, and from the top down, it was possible to discern a means to collapse large parts of the data center through the elimination of effectively redundant elements. However, the actual means for elimination, and the actual elements that were removed, are not at all obvious. What became clear was that if it were possible to build a fully populated network, to contain a large amount of storage, and to embed power efficient CPUs all in the same platform, while simultaneously allowing for sufficient ingress/egress of information and for expansion between boxes by a means that could scale to large data center sizes, and if it were further possible to add intelligence to each of the three basic elements of computing, storage and networking, then it would be possible to collapse all of the present data center infrastructure into a single, scalable, replicable platform. It also became clear that it was highly useful to add IPC means between elements to allow for future kinds of software which radically alter how data centers are built today. In some embodiments, the addition of the PCIe switch 350 is not required, but those embodiments will suffer performance and power penalties relative to the preferred embodiments once these future kinds of software are operating upon the invention.
Most engineers define systems with a set of criteria based upon a customer's need. This results in many systems being highly specialized. It is often true that such systems have limited use beyond their originally intended purpose. One artifact of this is that such systems do not conceive of a future in which present knowledge might substantially change. The present invention has a number of intentional design elements, both topological and functional, that are designed to be ready for an unknown future.
Like many present day servers, the present invention is upgradeable. It is upgradeable in several ways that differ from prior art. First, it utilizes a motherboard and card set structure rather than a backplane, which allows cards to be altered for differing types of functionality over time. Second, it has a set of cards that are storage elements, a set of cards which are compute elements and a set of cards which are network elements, with a relatively passive motherboard. Third, the present invention specifically adds valuable, changeable intelligences to the cards. Most data center systems involve software intelligence and a fixed hardware intelligence. The present invention allows for software changes, but more importantly for dynamically changeable hardware in the form of FPGAs. The FPGAs are capable of doing many different kinds of tasks because the hardware is based on bits stored in a changeable RAM. In fact, the present invention is capable of “sharing” its FPGA logic over time and enables the switching out of hardware algorithms while the system is running. One example use of this is video filters being applied to incoming data, wherein the FPGA houses one kind of filter for one moment, and then shifts to a different kind of filter the next. Part of the reason for the connection between CPU 323 and FPGA 327 and between CPU 333 and FPGA 337 is expressly to allow a kind of ‘co-processing’ augmentation between the CPUs and the FPGAs.
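As an illustrative sketch only, the following Python fragment models the swapping of hardware algorithms described above; the FpgaRegion class and the bitstream names are hypothetical stand-ins, not a real reconfiguration API:

```python
class FpgaRegion:
    """Hypothetical model of a reconfigurable region of FPGA logic."""

    def __init__(self):
        self.loaded = None

    def partial_reconfigure(self, bitstream_name: str):
        # Stand-in for streaming a partial bitstream into configuration RAM.
        self.loaded = bitstream_name

    def process(self, frame: str) -> str:
        return f"{frame} processed by {self.loaded}"

region = FpgaRegion()
for frame, wanted in [("frame0", "sharpen.bit"), ("frame1", "denoise.bit")]:
    if region.loaded != wanted:          # swap the hardware algorithm on demand
        region.partial_reconfigure(wanted)
    print(region.process(frame))
```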
Another vital aspect of the preferred embodiment is the intelligences referred to above. Most data center systems tend to consider the CPU server as the “intelligence” of the system; taken collectively, the entire communications network of such a prior art system has little or no intelligence of its own.
The synthesis of the computing, storage and networking into a single cohesive, intelligent and scalable platform can be better understood through a deeper dive into a preferred embodiment. This synthesis is what permits the at-scale advantage of the invention over prior art systems.
It should be noted that FIG. 4 illustrates one embodiment of NAND Complex 400 and is not intended to limit the invention.
Functionally, CPU 450 receives requests for information from the network via connector 460 and XFI ports 496/497. CPU 450 can best be thought of as fulfilling the role of SAN Storage Processor 180 of FIG. 1.
As mentioned in the discussion of the FPGAs above, FPGA 420 can be partially reconfigured while the system is running, so that the particular hardware algorithm a task requires can be loaded on demand.
If this were a typical hard disk based prior art data center, the seek times of a hard drive alone contained within disk chassis 182-188 would far outweigh the partial reconfiguration time of FPGA 420, let alone the time it would take to gather or process the data through SAN storage processor 180, SAN switches 172, 174, SAN interfaces 162-167 and finally at one of CPU servers 152-157. It can thus be seen that there are very large performance advantages to having intelligence inside the storage node rather than the way prior art systems are crafted.
Further, because the vast majority of prior art systems utilize hard disk technology instead of solid state storage technology, there are even larger gains of performance in the preferred embodiment of the present invention. There are some prior art systems that are starting to replace the hard disk drives with solid state drives (SSD). These systems rely upon standardized interfaces that the storage industry has established such as SAS, SATA and occasionally PCIe. The difficulty with all of these prior art systems is that the software stack which goes with these interfaces requires a relatively large overhead to obtain information. None of the interfaces were designed with the characteristics of solid state storage in mind; rather, they were all designed for hard drives and their unique and peculiar characteristics. There is an enormous amount of waste in software, required processing and latency across these systems that is not present in the preferred embodiment. The reason this is true is because of an intentional movement away from the standard interfaces and standard protocols to a minimalist protocol across the interface. Thus the preferred embodiment of the present invention has manifold improvement over prior art storage systems because it directly attaches the solid state elements to processing elements; provides a vastly simpler interface to accomplish work; has very large amounts of local DRAM storage for caching information; can store the entire ‘table of contents’ of information inside its local DRAM storage; because it can know where the information is located within the storage, it is capable of processing data without intervention from external sources; provides a dynamically reconfigurable FPGA that can process data at hardware speeds; and allows for autonomous storage operations without the intervention of external CPUs (e.g. CPU servers 152-157).
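To make the notion of a minimalist protocol concrete, the following sketch shows what a fixed-format read command to a storage element might look like; the opcode and field layout are invented for illustration and are not the actual protocol of any embodiment:

```python
import struct

# Hypothetical minimal command: opcode (1 byte) | NAND page address (8 bytes)
# | length in bytes (4 bytes). Big-endian, fixed format, no nested layers.
MINIMAL_READ = struct.Struct(">BQI")
OP_READ = 0x01

def make_read(page_addr: int, length: int) -> bytes:
    return MINIMAL_READ.pack(OP_READ, page_addr, length)

def parse(cmd: bytes):
    opcode, page_addr, length = MINIMAL_READ.unpack(cmd)
    return opcode, page_addr, length

cmd = make_read(page_addr=0x00DEAD00, length=4096)
print(len(cmd), "byte command:", parse(cmd))
```

A 13 byte fixed command of this kind stands in contrast to the multi-layer command descriptors of SAS or SATA, which is the point of the comparison above.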
There are four separate CPUs each of which is connected to DRAM and NAND storage elements. For example, CPU 510 connects to three separate 72-bit DRAM SORDIMMs 511-513 via a DDR3 DRAM interface, and to NAND 515 via an ONFI interface. Similarly, CPU 520 connects to three separate 72-bit DRAM SORDIMMs 521-523 via a DDR3 DRAM interface, and to NAND 525 via an ONFI interface; CPU 530 connects to three separate 72-bit DRAM SORDIMMs 531-533 via a DDR3 DRAM interface, and to NAND 535 via an ONFI interface; and CPU 540 connects to three separate 72-bit DRAM SORDIMMs 541-543 via a DDR3 DRAM interface, and to NAND 545 via an ONFI interface. CPUs 510, 520, 530 and 540 each connect to connector 550 with a set of four XFI interfaces to provide 10 GE connectivity and one PCIe interface to provide PCIe connectivity. CPU 510 also directly connects with a single interface, which could be either PCIe or XFI, or even a private interface, to each of the other CPUs 520, 530, and 540. In this way there is a non-switched high speed information conduction path between the CPUs of CPU complex 500. This is very useful for sharing of load, for coordinating messaging, or for coordinating information processing in HPC use. The NAND 515, 525, 535, 545 is used to provide boot code, security codes and store any required local static data that the CPUs 510, 520, 530, 540 might need. In the preferred embodiment CPUs 510, 520, 530 and 540 are large multi-core complexes that also provide the offloading of work to other internal hardware such as frame managers, TCP/IP Offload Engines (TOEs), search engines, security processors, and authentication units. The added intelligence of CPUs 510, 520, 530 and 540 provides enormous effective performance over prior art data center CPU servers 152-157. CPU complex 500 in the preferred embodiment also contains a local power supply 560 which takes in a single DC power supply and crafts the required power to the components of CPU complex 500.
The reason that NAND complex 400 and CPU complex 500 exist is so that a modular system can be created which allows for these elements to change over time. For example, as larger NAND devices become available, NAND complex 400 can change to accommodate those. If a better long term storage technology comes along, the NAND can be replaced with something else, for example Resistive RAM (RRAM). Thus NAND is used herein to mean an appropriate memory storage technology for non-volatile storage of data. Similarly, if better CPUs become available that are more power efficient, or have greater processing power, then CPU complex 500 may be changed to accommodate those. In point of fact, as technology advances, both complexes are expected to be revised repeatedly over the life of the platform.
In the preferred embodiment according to the present invention, a multiplicity of NAND complex 400 and CPU complex 500 can be installed into a chassis. The number of these is limited by the physical volume that the chassis allows for. For example, an industry standard 2U chassis might house 16 NAND complex 400 cards and 20 CPU complex 500 cards along with other required elements to craft a final system. What relatively quickly becomes obvious is that each of these cards needs a means to be powered, and a means to communicate with one another.
An important reason for separating NAND Complex 400 and CPU Complex 500, as well as Intelligent Ethernet Switch Fabric 610 and PCIe Switch Fabric 710, is so that they can be changed over time within a chassis. For example, while the present fabric cards are built around 10 GE and 40 GE technology, future fabric cards might employ faster speeds without requiring changes to the rest of the chassis.
The best mode of operation of the system can be understood by walking through an example.
Let us say that a request comes in from QSFP ports 307 via 40 GE into intelligent ethernet switch fabric 610 from the internet. Let us further define that the request is to send a photo stored somewhere in a NAND complex back across the internet to the original requestor. Note this is the same example used in reference to FIG. 1. In the first method of access, intelligent ethernet switch fabric 610 already has enough information to forward the request directly to NAND complex 400, which locates the photo and returns it through intelligent ethernet switch fabric 610 to the original requestor.
The second method of access is when intelligent ethernet switch fabric 610 does not have enough information to properly forward the request directly to NAND complex 400 and instead must ask for help from one or more CPU Complex 500. In this case, Intelligent Ethernet Switch Fabric 610 determines which CPU in a particular CPU Complex 500 is least busy, along with which route between Intelligent Ethernet Switch Fabric 610 and CPU Complex 500 is least busy, and sends the request on to one or more of CPUs 510, 520, 530 and/or 540 via one or more of XFI 632-639. For the sake of this example, assume that CPU 510 has been chosen to receive the request, and that XFI 633 is used to communicate the request. CPU 510 will do appropriate operations on the requested information and determine how to best complete the request. In the simplest of cases, CPU 510 knows that the data it needs is located in NAND Complex 400, and can use either or both of XFI 672, 673 to gather the data and craft a response back to Intelligent Ethernet Switch Fabric 610. It should be appreciated by those skilled in the art that the system is capable of far more complex operations as well. In another simple example, CPU 510 could determine that part or all of the information resides in another NAND Complex 400 that is interconnected to the present system via Intelligent Ethernet Switch Fabric 610, typically via QSFP Ports 307, and that CPU 510 can use one or more of XFI 632, 633 to communicate a request to another Intelligent Ethernet Switch Fabric 610, or another CPU Complex 500, to get the portion of the data it may house. One skilled in the art will also understand that it is possible that the present NAND Complex 400 has none of the data required and that CPU 510 will make as many requests through Intelligent Ethernet Switch Fabric 610 as needed to complete and aggregate the data back to the original requestor. That might be done as the data comes in from the various agents, or as a complete aggregation of all the data before sending it back to the original requestor on the internet.
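The two methods of access can be summarized, purely illustratively and with hypothetical table contents and load figures, as follows:

```python
# Hypothetical switch-resident location table and load counters.
LOCATION_TABLE = {"photo/17": "nand_complex_0"}
CPU_LOAD = {"CPU510": 3, "CPU520": 1, "CPU530": 4, "CPU540": 2}
XFI_LOAD = {"XFI632": 5, "XFI633": 1, "XFI638": 2, "XFI639": 4}

def route(request_key: str):
    target = LOCATION_TABLE.get(request_key)
    if target is not None:
        return ("direct", target)              # first method: forward directly
    cpu = min(CPU_LOAD, key=CPU_LOAD.get)      # second method: least busy CPU
    link = min(XFI_LOAD, key=XFI_LOAD.get)     # over the least busy route
    return ("via_cpu", cpu, link)

print(route("photo/17"))    # ('direct', 'nand_complex_0')
print(route("photo/99"))    # ('via_cpu', 'CPU520', 'XFI633')
```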
It should become clear that exceptionally sophisticated algorithms are possible with structures such as these. Computing can be borrowed between CPU elements in CPU Complex 500, or from CPU 450 and/or FPGA 420 in NAND Complex 400, or even from CPUs in other CPU Complex 500s which are attached via interconnection between two different Intelligent Ethernet Switch Fabrics. This method of rich interconnection makes it possible to do truly cooperative computing on information at many different levels in the whole of a data center. In contrast, modern data centers are exceptionally connection poor, with typically only a single or dual 10 GE link between a server and a network switch. While data centers are upgrading to 40 GE between servers, it should be understood that this is still a “starved” connection because communication uses switch bandwidth in ways that the switches are not designed to handle. The result is a model of computing which tends to keep the elements more tightly coupled within a small set of information in the data center. This in turn limits both performance and reliability. The present invention is strongly superior expressly because it permits full and direct intercommunication between many different kinds of elements. This eliminates a large number of latencies and power inefficiencies in the system.
The present invention is designed to use Ethernet as a standard mode of communication between elements. Ethernet is intended to be used as the best mode means of communicating large amounts of information to any element in the system with very low latency and high data rate. Note that the system in this context is a group of elements such as shown in FIG. 3.
The main advantage of having a network separate from the Ethernet network in the present invention is that it can opportunistically carry side band traffic between CPUs or other elements to update the contents of master tables, or to effect a particular quality of the larger information that is being sent down the Ethernet network. Simple examples of this include standard IPC traffic that multiple CPUs use to cooperate in an operating system. These messages are used to synchronize content, direct traffic, and report status and statistics. While this traffic can travel on the Ethernet network, and in fact will if there is a failure of communication on the PCIe network, the PCIe network is designed to provide access in parallel with the Ethernet network. Importantly, removing this class of traffic from the Ethernet network results in much better utilization of the network because of how Ethernet works, as those skilled in the art will appreciate.
In one embodiment of the present invention, the PCIe network will be used to provide contextual update frames between CPUs and between the CPU Complex 500 and NAND Complex 400. For example, it was noted before that CPU 510 might be chosen by Intelligent Ethernet Switch Fabric 610 to receive a request from the internet to provide a specific photo back to the requestor. CPU 510 can use the PCIe network to help offload the search for where the photo is by searching the information context which houses location data. That context could be housed in multiple CPUs in CPU Complex 500, or may be spread across several CPU Complex 500s. Having multiple CPUs cooperate to find the information helps to reduce the response time back to the original requestor. CPU 510 can also use the PCIe network to make inquiries to other CPUs in either the CPU Complex 500 or the NAND Complex 400 to accomplish a particular task such as providing status on a prior request. Those skilled in the art will recognize that having the low latency PCIe network in parallel to the Ethernet network provides a wide variety of different uses for the PCIe network that will depend on the specific task that the overall system is executing. For example, when the system is using CPU Complex 500 effectively as CPU servers and NAND Complex 400 effectively as storage servers, the use of the PCIe network is vastly different than if CPU Complex 500 and NAND Complex 400 are operating as a data mining engine. The context of the work being done by the overall system can thus be altered, even as it changes over time, to make the best use of each network.
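A minimal sketch of this traffic steering, with invented names and no claim to being the actual control software of any embodiment, might look like the following:

```python
pcie_up = True   # in a real system this would be maintained by link monitoring

def send(message: bytes, kind: str) -> tuple:
    # Control/IPC traffic rides the parallel PCIe network when available;
    # bulk data, and control traffic after a PCIe failure, rides Ethernet.
    if kind == "control" and pcie_up:
        return ("pcie", message)
    return ("ethernet", message)

print(send(b"photo bytes ...", kind="data"))         # ('ethernet', ...)
print(send(b"update master table", kind="control"))  # ('pcie', ...)
```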
It should be noted that both Intelligent Ethernet Switch Fabric 610 and PCIe Switch Fabric 710 may be physically realized as a collection of interconnected switch devices, Switch 810-832, rather than as single monolithic switches.
Switch 810-832 also have external connections, akin to SFP+ Ports 303 and QSFP Ports 307 of FIG. 3.
Switch 810-832 also have Switch CPUs 880-891 attached to them. Switch CPUs 880-891 are designed to provide additional intelligence over and above that which is native in Switch 810-832. For example, if Switch 810-832 were only capable of dealing with layer 3 switching commands, Switch CPUs 880-891 could be used to provide layer 4 functionality. Note this example is purely illustrative, and not intended to limit the present invention. Having CPUs dedicated to the task of administering the switch is highly valuable because it allows CPUs in other areas such as CPU Complex 500 or NAND Complex 400 to not have their work flow interrupted. Note that Switch CPUs 880-891 have their own memory subsystems attached, not shown, as one skilled in the art will recognize.
Switch CPUs 880-891 in the preferred embodiment are used to help setup, manage and tear down the routing tables contained in Switch 810-832, and to provide data plane processing of transactions that are too complex for Switch 810-832 to handle by themselves. This interaction might be one where Switch 810-832 offloads the transaction entirely to Switch CPUs 880-891, or where only part of the transaction is offloaded. Which method is used depends entirely upon the traffic flow into and out of Switch 810-832, and thus may be assumed to vary over time.
Those skilled in the art will recognize that many different interconnection topologies are possible. That is, the specific connections between NAND Complexes, CPU Complexes, SFP+ Ports, QSFP Ports, and Switch CPUs can be done in a variety of ways to optimize for a particular system characteristic. The present invention is not intended to be limited in any way through the examples shown in the above figures.
Those skilled in the art will recognize that many different interconnection topologies are possible for PCIe Switch Fabric 710. The preferred embodiment of the present invention uses the fat-tree switch approach to minimize latency and maximize performance. Fat-tree switches are known to be non-blocking so long as ports do not contend for a port that is already communicating with another port. So if two ports wish to have access to two different other ports, the fat-tree switch allows them to communicate completely in parallel with one another. Another embodiment uses two sets of fat-tree switches in order to handle the failure of a single leaf or spine switch. Alternate implementations are possible that require fewer switches to populate the final switch but also reduce the bandwidth and capability of the final switch. The fat-tree switch topology is intentionally chosen in the preferred embodiment to ensure as much non-blocking communication as possible between all PCIe ports.
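As a back-of-envelope illustration only, assuming an arbitrary switch radix rather than the radix of any actual part, the following sketch shows how a two-level fat-tree divides leaf ports between endpoints and spine uplinks:

```python
def two_level_fat_tree(k: int, leaves: int):
    # Each leaf devotes half its k ports downward to endpoints and half
    # upward to spine switches, which is what preserves full bisection
    # bandwidth (the non-blocking property described above).
    down_per_leaf = k // 2
    up_per_leaf = k - down_per_leaf
    endpoints = leaves * down_per_leaf
    spines = (leaves * up_per_leaf) // k   # spines needed to absorb uplinks
    return endpoints, spines

endpoints, spines = two_level_fat_tree(k=24, leaves=8)
print(f"8 leaf switches of radix 24 -> {endpoints} endpoints, {spines} spines")
```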
CPU Complex 1243 is included in FIG. 12.
Motherboard 1200 is shown with these seventeen NAND Complex 1201-1217 and twenty CPU Complex 1231-1250. In the preferred embodiment each NAND Complex and each CPU Complex would be on a separate, but replaceable card on Motherboard 1200. More specifically, in the preferred embodiment, the NAND Complex 1201-1217 are cards which can be removed from the front panel of the 2U Chassis that Motherboard 1200 is mounted in. The twenty CPU Complexes 1231-1250 can similarly be removed from Motherboard 1200, albeit through the top of the chassis, in the preferred embodiment. Each CPU Complex is another card that is mounted to Motherboard 1200. The PCIe Switch Fabric 710 would also be mounted on one or more cards that allow the entire PCIe Switch Fabric 710 to be removed from Motherboard 1200. Similarly, Intelligent Ethernet Switch Fabric 610 would be mounted on one or more cards that allow the entire Intelligent Ethernet Switch Fabric 610 to be removed from Motherboard 1200.
In the preferred embodiment, the reason that the NAND Complex, CPU Complex, PCIe Switch Fabric, and Intelligent Ethernet Switch Fabric are all on separate cards is because they are intended to be changed over time. There are a few important reasons these elements are replaceable. First, if a particular NAND Complex or CPU Complex card were to cease functioning, it could be replaced without the cost of replacing the entire structure. For systems which need to run in data centers, that is a very important aspect of the design. In fact, the design should allow for the practice known as “hot pluggable”, which means that a card can be replaced while the power is on and it will not harm the system. Another reason that cards are designed to be replaceable is to allow technology to change over time without obsoleting the product. For example, in 2013 technology, NAND is the preferred choice for solid state storage. In the future a technology like Phase Change Memory (PCM) might replace NAND as the preferred choice for storage. In such a case, the NAND Complex 400 shown in FIG. 4 could be redesigned around PCM while the rest of the system remains unchanged.
Those skilled in the art may recognize that the card based structure of the present invention is similar to the way that a modern data center is created. Take the network for example. The network of a data center, as illustrated in FIG. 1, is built from discrete switch boxes interconnected by cables; when a box fails or better switch technology arrives, the box can be replaced without discarding the rest of the data center. The card based structure of the present invention provides the same kind of replaceability within a single chassis.
It should be understood that the means of connection, the number of connections, the topology of the connections, and even what is being connected can be altered in the present invention. While this is somewhat analogous to prior art systems, no one system has all of the features described above contained in a single, scalable unit. The present invention was expressly designed to combine each of the three major functional elements of a data center (computing, storage and networking) into a tightly coupled, deeply vertically integrated structure that provides substantial, multiplicative value over any known prior art system, as cited above.
Chassis 1300 itself consists of the physical structure of the chassis of a suitable material such as steel, aluminum or plastic, along with Motherboard 1200 which in turn houses NAND Complex 1201 to 1217, CPU Complex 1231 to 1250, Fans 1391-1397, AC to DC Power Supply 1325, Uninterruptible Power Supply 1327, Power Supply Connector 1323, Connectors 1380 which contain SFP+ Connectors 303 and QSFP Connectors 307, PCIe Switch Fabric cards 1361 to 1365, and Intelligent Ethernet Switch Fabric cards 1371 to 1375. In some embodiments, AC to DC Power Supply 1325 is more than one supply, where each of the supplies may be hot-pluggably removed from Chassis 1300 while Chassis 1300 remains powered.
In one embodiment, one or more rows of Fans 1391 to 1397 are used to remove waste heat from the system, typically through vent holes in either Rear Panel 1340 or on the side panels; the vent holes are not shown in FIG. 13.
NAND Complex 1201 to 1217 collectively make up the storage subsystem of Chassis 1300. Similarly, CPU Complex 1231 to 1250 make up the computing subsystem of Chassis 1300. PCIe Switch Fabric cards 1361 to 1365 and Intelligent Ethernet Switch Fabric cards 1371 to 1375 collectively make up the networking subsystem of Chassis 1300.
Chassis 1300 would typically operate in a data center environment. These environments differ from that of the typical home computer because Chassis 1300 cannot be down at any time while it is in operation in the data center. Typically, a data center will employ one or more massive uninterruptible power supplies (UPS) which provide AC power to the CPU and storage servers as well as the network, protecting against power failure by supplying enough power to start massive on-site power generation equipment. This equipment is enormously expensive, and must be maintained so that it can be relied upon if a power failure to the building occurs.
The present invention offers an alternative approach which in some data center circumstances costs far less yet performs similarly to the massive generator and UPS systems found at most data centers. In particular, inside Chassis 1300, in one embodiment of the present invention, Uninterruptible Power Supply 1327 is added to provide sufficient power to allow Chassis 1300 to shut down in an orderly manner, or otherwise bridge power until either main power is restored or locally generated power is connected to Chassis 1300. Having Uninterruptible Power Supply 1327 is a significant departure from prior art systems because of the effect it has on the software required to run Chassis 1300's storage subsystem.
Present NAND storage control software is required to be complex because if power is lost while the NAND is being used, information can be lost. Companies spend enormous resources on the software used to control the storage subsystem, and in turn large amounts of software run on CPUs to control the subsystem. Worse still, prior art practices require that very complex structures be written to the NAND in case power is lost, so that information can be restored once power is restored. The present invention looks at the world radically differently. Instead, it asks: what happens if power to the storage subsystem can always be guaranteed? May the software and software structures then be simplified? The answer is an unqualified yes. By changing the operating rules of the storage subsystem, vastly simpler software and vastly less processing are required, which in turn boosts performance, decreases latency, and even results in longer life for NAND memory technology because it results in fewer writes and rewrites to the NAND. The present invention solves the power problem by providing sufficient power, once AC power has failed, to at least shut down the storage subsystem in an orderly manner. It does this by adding Uninterruptible Power Supply 1327. Those skilled in the art will understand that UPS 1327 could in fact be external to Chassis 1300. However, it is advantageous for it to be local to Chassis 1300 because of finer grain control over the use of the power, and the fact that even if someone were to trip over the power cord to Chassis 1300, power would still be available to accomplish an orderly shutdown.
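As a thought-experiment sketch only, with invented structures standing in for real NAND control software, the following fragment contrasts a journaled write path against the simplified path that guaranteed power permits:

```python
nand = {}      # hypothetical page store
journal = []   # hypothetical recovery journal used only by the prior art path

def write_prior_art(page, data):
    journal.append(("intent", page))   # extra NAND write: record intent
    nand[page] = data                  # the actual write
    journal.append(("commit", page))   # extra NAND write: mark it durable
    return 3                           # NAND program operations consumed

def write_with_guaranteed_power(page, data):
    nand[page] = data                  # power cannot vanish mid-write,
    return 1                           # so one program operation suffices

print("writes per update, prior art:", write_prior_art(7, b"img"))
print("writes per update, with UPS :", write_with_guaranteed_power(8, b"img"))
```

Fewer program operations per update is precisely the mechanism by which the guaranteed-power approach described above extends NAND life.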
Earlier, it was indicated that the network subsystem consists of PCIe Switch Fabric 710 and Intelligent Ethernet Switch Fabric 610. Chassis 1300 shows a more refined example of how those might be physically implemented. For example, in FIG. 13 PCIe Switch Fabric 710 is realized as PCIe Switch Fabric cards 1361 to 1365, and Intelligent Ethernet Switch Fabric 610 is realized as Intelligent Ethernet Switch Fabric cards 1371 to 1375.
Connectors 1380 in some embodiments are actually one or more cards that together house SFP+ Connectors 303 and/or QSFP Connectors 307. Being able to change the actual connectors that go out the back is another advantage of the present invention. Since the network switch technology can and will change over time, recognizing that connector technology will also change, and allowing for it, is an important aspect of the present invention. While not all embodiments require Connectors 1380 to be removable from Chassis 1300, allowing for this is a strong advantage over prior art systems.
Rack 1400 must communicate with the outside world, generally the internet. To do so, Chassis 1300 offers a number of 10 GE ports via SFP+ Connectors 303. In FIG. 14, these external connections are shown for each Chassis 1300 mounted in Rack 1400.
Internal 40 GE Backbone Connection 1460 is illustrated very simplistically in FIG. 14.
Data centers have two kinds of network bandwidth to consider: first, the bandwidth ingressing/egressing to/from the data center and the outside world; second, the bandwidth that is used inside the data center to move information about and complete whatever tasks the data center is supposed to accomplish. Typical data centers have a ratio of roughly 80% of the network traffic staying inside the data center versus roughly 20% of the network traffic going to the outside world. What quickly becomes clear is that the Internal 40 GE Backbone Connection 1460 is very important to how the overall system is created from many Rack 1400s, each consisting of many Chassis 1300s. The preferred embodiment of the present invention was expressly designed so that external network switches were not required. Rather, all the required switching would be done using the 40 GE network. To that end, the preferred embodiment has sufficient numbers of 40 GE connections via QSFP Ports 307 to allow each Chassis 1300 to connect with every other Chassis 1300 within the same Rack 1400, and also to allow every Chassis 1300 in Rack 1400 to communicate with Chassis 1300s in other Rack 1400s. In the simplest example, one can imagine three Rack 1400s where, in the center Rack 1400, each Chassis 1300 connects to the other Chassis 1300s in that Rack 1400, and also communicates with the Chassis 1300 at the same height in the leftmost and rightmost Rack 1400s, such that a 2D array is formed. Those skilled in the art will understand this simplistic example is just that: simplistic and an example. The preferred embodiment of Chassis 1300 was designed to allow every Chassis 1300 to communicate with every other Chassis 1300 in the same Rack 1400, and to allow each Chassis 1300 sufficient additional 40 GE ports to connect in more traditional network topologies such as 3D torus, spanning tree, fat tree or many other appropriate structures. The best mode uses a fat-tree within each Rack 1400 and a 3D torus between Rack 1400s.
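Purely as an illustrative port budget, with example counts that do not limit the invention, the following sketch estimates the QSFP ports a Chassis 1300 would need for full intra-rack interconnection plus 3D torus links between racks:

```python
def ports_needed(chassis_per_rack: int, torus_neighbors: int = 6):
    intra = chassis_per_rack - 1    # one 40 GE link to every peer in the rack
    inter = torus_neighbors         # +/-x, +/-y, +/-z neighbors of a 3D torus
    return intra, inter, intra + inter

intra, inter, total = ports_needed(chassis_per_rack=16)
print(f"per chassis: {intra} intra-rack + {inter} torus = {total} QSFP ports")
```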
Power is supplied to each Chassis 1300 via Power Connector 1323 on each Chassis 1300. There are many well known means of connecting power to each Chassis 1300, and they are not covered here.
When Rack 1400s are interconnected with one another, typically by interconnecting Internal 40 GE Backbone Connection 1460s between the Rack 1400s, large scalable data centers can now be built. The preferred embodiment was designed to allow for adding as many Rack 1400s with as many Chassis 1300s as are needed to solve the particular problem the data center is seeking to solve. Importantly, an aspect of this preferred embodiment is that the data center can be expanded or contracted post facto without requiring additional resources other than the extra Rack 1400s and Chassis 1300s being added (or taken away).
Because the network is designed to be predictable and homogenous, it can actually be relied upon by software. This is vastly superior to existing prior art, in which there is no requirement that the network be predictable or homogenous; the present invention enforces either a limited or a complete degree of structure, according to how the data center is to be set up.
Increasingly, large parts of data centers are being used to accomplish data mining. For example, when checking out at a grocery store, data mining is used to identify the items purchased in the present transaction in order to provide a coupon for future use based upon the present purchase and what is known about past purchases. These can be very sophisticated algorithms that identify uniquely marketable traits. Using data mining, people receive information that more accurately reflects the specific data being mined. Some of these algorithms are enormously complex and require extraordinarily large amounts of computing, storage and networking to be effective.
One of the major issues with prior art data centers is that they are almost never homogenous. The effect of this simple reality is that little to no effort is made in writing more efficient software that could make excellent use of the locality of data via a homogenous network. It is demonstrable in bioinformatics, for example, that the lack of homogeneity deeply affects the execution time, response time, and in some cases, the quality of the results. The preferred embodiment of the present invention actively and intentionally seeks to set, maintain and keep homogeneity in all aspects of the computing, storage and networking expressly so that a new class of software can be created which relies deeply upon that homogeneity to solve more complex problems in less time, with less power, and for less money than prior art systems do.
Although the invention has been described with reference to particular embodiments thereof, it will be apparent to one of ordinary skill in the art that modifications to the described embodiment may be made without departing from the spirit of the invention. Accordingly, the scope of the invention will be defined by the attached claims not by the above detailed description.