This application is related to co-pending application entitled “Method and Apparatus for Connecting Direct Access From Non-volatile Memory to Local Memory”, U.S. patent application Ser. No. 15/389,596, filed on Dec. 23, 2016, and to co-pending application entitled “Method and Apparatus for Accessing Non-volatile Memory As Byte Addressable Memory”, U.S. patent application Ser. No. 15/389,811, filed on Dec. 23, 2016, and to co-pending application entitled “Method and Apparatus for Integration of Non-volatile Memory”, U.S. patent application Ser. No. 15/389,908, filed on Dec. 23, 2016, which are incorporated by reference as if fully set forth. This application and the related co-pending applications listed above were all filed on the same date.
Graphics cards require interaction with a root complex of a host computing system to execute certain types of functions. For example, the transfer of data from non-volatile memory (NVM) to a graphics processing unit (GPU) local memory requires that the data is transferred from the NVM to a host memory, and then from the host memory to the local memory. This involves at least using a root complex of the host computing system. This taxes the root complex and increases traffic and congestion.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
Described herein are apparatus for connecting a first memory architecture to a graphics processing unit (GPU) through a local switch, where the first memory architecture can be a non-volatile memory (NVM) or other similarly used memories, for example, along with associated controllers. The apparatus includes the GPU(s) or discrete GPU(s) (dGPU(s)) (collectively GPU(s)), second memory architectures associated with the GPU(s), the local switch, first memory architecture(s), first memory architecture controllers or first memory architecture connector(s). In an implementation, the local switch is part of the GPU. The apparatus can also include a controller for distributing a large transaction among multiple first memory architectures. In an implementation, the first memory architectures can be directly connected to the GPU. In an implementation, the apparatus is user configurable. In an implementation, the apparatus is a solid state graphics (SSG) card. The second memory architecture can be a local memory, a high bandwidth memory (HBM), a double data rate fourth-generation synchronous dynamic random-access memory (DDR4), a double data rate type five synchronous graphics random access memory (GDDR5), a hybrid memory cube or other similarly used memories, for example, along with associated controllers. For purposes of illustration and discussion, the terms NVM and local memory will be used in the description without limiting the scope of the specification and claims.
In general, SSG card 110 includes a PCIe switch 134 for interfacing with PCIe switch 124. PCIe switch 134 can be connected to one or more non-volatile memory (NVM) controllers 136, such as for example, a NVM Express (NVMe) or Non-Volatile Memory Host Controller Interface Specification (NVMHCI) device, for accessing associated NVMs 138 and one or more dGPUs 130. Each dGPU 130 is further connected to an associated local memory 132. Each NVM controller 136 can manage and access an associated NVM 138 and in particular, can decode incoming commands from host computing system 105 or dGPU 130. In an implementation, SSG card 110 is user reconfigurable. Illustrative configurations for SSG card 110 are described in
Inclusion of PCIe switch 134 in SSG card 110 enables peer-to-peer connectivity that bypasses PCIe switch 124. For example, when dGPU 130 executes commands that require data transfer between local memory and one or more NVMs 1351 to 135k, then the processor 120 can instruct or enable direct data transfer from the associated local memory to one or more NVMs 1351 to 135k (arrow 140) or from one or more NVMs 1351 to 135k to the associated local memory (arrow 142). The direct data transfer can be initiated by an appropriate NVM controller 1341 to 134k via a local PCIe switch, such as for example, PCIe switch 1361 to 136n. In an implementation, the dGPU can have a hardware agent that can instruct the direct data transfer. This peer-to-peer data transfer or access can alleviate the disadvantages discussed herein. As shown in
Data striping segments logically sequential data, such as a file, so that consecutive segments are stored on different physical storage devices, such as NVMs. These segments are referred to as stripes. Flow diagram 1100 shows the sequence of steps needed to process a stripe. Operationally, to process stripe 0, host processor 1105 writes a command (for example, a data transfer command) to a submission queue 1112 in system memory 1110 (step 1) and writes to a doorbell register in, for example, NVM controller 11341 to signal that the data transfer command is available in submission queue 1112 (step 2). NVM controller 11341 fetches the data transfer command from submission queue 1112 (step 3) and executes the data transfer command between local memory 1120 and NVM 11321 (step 4). Upon execution of the data transfer command, NVM controller 11341 writes a completion entry in completion queue 1114 in system memory 1110 (step 5) and generates an interrupt for host processor 1105 (step 6). Host processor 1105 processes the completion entry (step 7) and writes to the doorbell register in NVM controller 11341 to signal completion (step 8). Steps 1-8 are then repeated for the remaining stripes. As shown, there are multiple interactions between host processor 1105, system memory 1110, host switch 1115, local memory 1120, NVM 11321-n and NVM controller 11341-n for each stripe. Each of these interactions involves at least the root complex of the host computing system, which taxes the root complex and increases traffic and congestion.
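The striping layout and the per-stripe host cost described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the described apparatus: the names `Stripe`, `stripe_layout` and `host_mediated_steps`, the fixed stripe size, and the round-robin placement of stripes on NVMs are all hypothetical choices made for the example. The only figure taken from the description is the eight steps performed per stripe.

```python
from typing import List, NamedTuple


class Stripe(NamedTuple):
    index: int       # position of the stripe in the logical data
    nvm: int         # index of the NVM that holds the stripe
    nvm_offset: int  # byte offset of the stripe inside that NVM
    length: int      # bytes in this stripe (the last one may be short)


def stripe_layout(total_bytes: int, stripe_bytes: int, num_nvms: int) -> List[Stripe]:
    """Place consecutive stripes on NVMs in round-robin order (assumed policy)."""
    if total_bytes < 0 or stripe_bytes <= 0 or num_nvms <= 0:
        raise ValueError("invalid size or device count")
    stripes = []
    for index, start in enumerate(range(0, total_bytes, stripe_bytes)):
        length = min(stripe_bytes, total_bytes - start)
        nvm = index % num_nvms
        # Each NVM receives every num_nvms-th stripe, packed back to back.
        nvm_offset = (index // num_nvms) * stripe_bytes
        stripes.append(Stripe(index, nvm, nvm_offset, length))
    return stripes


STEPS_PER_STRIPE = 8  # steps 1-8 of the host-mediated flow


def host_mediated_steps(num_stripes: int) -> int:
    """Total steps when steps 1-8 are repeated for every stripe."""
    return STEPS_PER_STRIPE * num_stripes
```

Under these assumptions, a 10-byte transfer with 4-byte stripes over two NVMs yields three stripes, so the host-mediated flow repeats its eight steps three times. The step count grows linearly with the number of stripes.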
As noted above, flow diagram 1200 shows the sequence of steps needed to process a stripe. Operationally, to process stripe 0, host processor 1205 writes a command (for example, a data transfer command) to a submission queue 1252 in RAID assist 1250 (step 1) and writes to a doorbell register in a striper 1256 in RAID assist 1250 to signal that the data transfer command is available in submission queue 1252 (step 2). Striper 1256 creates a set of parallel NVM transactions for data transfers between local memory 1220 and NVM 12321-NVM 1232n, respectively (step 3). Striper 1256 and NVM controllers 12341-n execute the data transfers without having to interact with system memory 1210 and host processor 1205 (step 4). Upon execution of the data transfer command for all stripes, striper 1256 writes a completion entry in completion queue 1254 in RAID assist 1250 (step 5) and generates an interrupt for host processor 1205 (step 6). Host processor 1205 processes the completion entry (step 7) and writes to the doorbell register in RAID assist 1250 to signal completion (step 8). As shown, RAID assist 1250 minimizes interaction with host processor 1205 and system memory 1210 until data transfer completion. Each of the stripe transactions is essentially transparent to host processor 1205 and system memory 1210.
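A toy software model may help illustrate how a striper can fan one command out into parallel per-NVM transactions and report a single completion. The sketch below is an assumption-laden illustration, not the described hardware: the class `RaidAssist`, its method names, the dictionary-shaped command, the round-robin placement, and the use of Python threads and `bytearray` objects to stand in for NVM controllers, NVMs and local memory are all hypothetical. It models only a transfer from local memory to the NVMs and omits the interrupt and the host-side handling in steps 6-8.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class RaidAssist:
    """Toy model of a RAID assist unit with its own queues and a striper."""

    def __init__(self, nvms, stripe_bytes):
        self.nvms = nvms                  # one bytearray per modeled NVM
        self.stripe_bytes = stripe_bytes
        self.submission_queue = deque()
        self.completion_queue = deque()

    def submit(self, command):
        """Step 1: the host writes one command into the submission queue."""
        self.submission_queue.append(command)

    def ring_doorbell(self, local_memory):
        """Steps 2-5: the striper fans the command out and records completion."""
        command = self.submission_queue.popleft()
        transactions = self._make_transactions(command)  # step 3
        with ThreadPoolExecutor(max_workers=len(self.nvms)) as pool:
            # Step 4: per-stripe transfers run without involving the host.
            list(pool.map(lambda t: self._execute(local_memory, *t), transactions))
        # Step 5: one completion entry covers all stripes.
        self.completion_queue.append(
            {"id": command["id"], "stripes": len(transactions)}
        )

    def _make_transactions(self, command):
        start, length = command["offset"], command["length"]
        transactions = []
        for i, pos in enumerate(range(0, length, self.stripe_bytes)):
            size = min(self.stripe_bytes, length - pos)
            nvm = i % len(self.nvms)  # round-robin placement (assumed)
            nvm_offset = (i // len(self.nvms)) * self.stripe_bytes
            transactions.append((start + pos, nvm, nvm_offset, size))
        return transactions

    def _execute(self, local_memory, src, nvm, nvm_offset, size):
        # Copy one stripe from local memory into its NVM.
        self.nvms[nvm][nvm_offset:nvm_offset + size] = local_memory[src:src + size]
```

In this model the host calls `submit` once and `ring_doorbell` once and then finds exactly one entry in `completion_queue`, however many stripes the striper creates. Each modeled NVM must be large enough to hold the stripes assigned to it.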
The processor 1302 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 1304 may be located on the same die as the processor 1302, or may be located separately from the processor 1302. The memory 1304 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 1306 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 1308 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 1310 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 1312 communicates with the processor 1302 and the input devices 1308, and permits the processor 1302 to receive input from the input devices 1308. The output driver 1314 communicates with the processor 1302 and the output devices 1310, and permits the processor 1302 to send output to the output devices 1310. It is noted that the input driver 1312 and the output driver 1314 are optional components, and that the device 1300 will operate in the same manner if the input driver 1312 and the output driver 1314 are not present.
In general, in an implementation, a solid state graphics (SSG) card includes at least one graphics processing unit (GPU), a second memory architecture associated with each GPU, at least one first memory architecture unit and a local switch coupled to each first memory architecture unit and the at least one GPU. A first memory architecture unit and the local switch directly process data transactions between a second memory architecture and the first memory architecture unit in response to a data transfer command. In an implementation, each first memory architecture unit is powered by an in-card power supply. In an implementation, each first memory architecture unit includes a first memory architecture and an associated first memory architecture controller. In an implementation, each first memory architecture unit includes a first memory architecture connector on the SSG card and a first memory architecture drive for connecting to the first memory architecture connector, where each first memory architecture drive includes a first memory architecture and an associated first memory architecture controller. In an implementation, each first memory architecture drive is powered by an off-card power supply. In an implementation, each first memory architecture unit includes a ball grid array (BGA) mount on the SSG card and a first memory architecture drive for connecting to the BGA mount, where each first memory architecture drive includes a first memory architecture and an associated first memory architecture controller. In an implementation, each BGA mount is powered by an in-card power supply. In an implementation, the local switch is integrated with one of the at least one GPU.
In an implementation, the at least one first memory architecture unit is a plurality of first memory architecture units and further includes a redundant array of independent drives (RAID) assist unit connected to each of the plurality of first memory architecture units and the local switch, the RAID assist unit segmenting and distributing a data transaction amongst the plurality of first memory architecture units. In an implementation, the at least one first memory architecture unit is a plurality of first memory architecture units and where one set of first memory architecture units have first memory architectures and associated first memory architecture controllers on the SSG card and another set of first memory architecture units have at least one of first memory architectures and associated first memory architecture controllers external to the SSG card.
In an implementation, a solid state graphics (SSG) card includes at least one graphics processing unit (GPU) including a first memory architecture controller, a second memory architecture associated with each GPU and at least one first memory architecture unit connected to the first memory architecture controller. The GPU directly processes data transactions between a second memory architecture and the at least one first memory architecture in response to a data transfer command. In an implementation, each first memory architecture unit is powered by an in-card power supply. In an implementation, each first memory architecture unit includes a first memory architecture connector on the SSG card and a first memory architecture drive for connecting to the first memory architecture connector, where each first memory architecture drive includes a first memory architecture. In an implementation, each first memory architecture drive is powered by an off-card power supply. In an implementation, each first memory architecture unit includes a ball grid array (BGA) mount on the SSG card and a first memory architecture drive for connecting to the BGA mount, where each first memory architecture drive includes a first memory architecture. In an implementation, each BGA mount is powered by an in-card power supply.
In an implementation, a method for transferring data includes receiving a data transfer command at a redundant array of independent drives (RAID) assist unit from a host processor via a local switch. A set of parallel memory transactions is created between a local memory and a plurality of first memory architectures. The set of parallel memory transactions is executed via the local switch, absent interaction with the host processor. The host processor is notified upon completion of the data transfer. In an implementation, the data transfer command is written into a submission queue in the RAID assist unit. In an implementation, an entry is written to a striper in the RAID assist unit to initiate creation of the set of parallel memory transactions. In an implementation, a completion queue is written into from the striper upon completion of the data transfer.
In general and without limiting implementations described herein, a computer readable non-transitory medium includes instructions which, when executed in a processing system, cause the processing system to execute a method for transferring data directly from a second memory architecture associated with a GPU to a first memory architecture.
In general and without limiting implementations described herein, a computer readable non-transitory medium includes instructions which, when executed in a processing system, cause the processing system to execute a method for distributively transferring data directly from a second memory architecture associated with a GPU to first memory architecture(s).
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
Number | Date | Country | |
---|---|---|---|
20180181518 A1 | Jun 2018 | US |