A computing system has a physical hardware topology that includes at least multiple endpoint devices and one or more general-purpose central processing units (CPUs). In some designs, each of the endpoint devices is a graphics processing unit (GPU) that uses a parallel data processor, and the endpoint devices are used in non-uniform memory access (NUMA) nodes that utilize the endpoint devices to process tasks. A virtualization layer is added between the hardware of the computing system and an operating system, and this virtualization layer creates a guest virtual machine (VM) with multiple endpoint devices. The guest VM utilizes a guest VM topology that is different from the physical hardware topology. For example, the guest VM topology uses a single emulated root complex, which lacks the connectivity that is actually used in the physical hardware topology. Therefore, paths between endpoint devices are misrepresented in the guest VM topology.
The hardware of a processor of an endpoint device executes instructions of a device driver in the guest VM. When scheduling tasks, the device driver being executed by this processor of the endpoint device uses latency information between endpoint devices provided by the guest VM. For example, the guest VM being executed by the processor of the endpoint device generates an operating system (OS) call to determine the latencies. These latencies are based on the guest VM topology, rather than the physical hardware topology. Therefore, when executing the device driver, the processor schedules tasks with mispredicted latencies between nodes of the computing system, such as between two processors located in the computing system. These mispredicted latencies between nodes result in an erroneous detection of a hung system, or result in scheduling that provides lower system performance.
In view of the above, methods and systems for efficiently scheduling tasks to multiple endpoint devices are desired.
While the invention is susceptible to various modifications and alternative forms, specific implementations are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention might be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention. Further, it will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements are exaggerated relative to other elements.
Systems and methods for efficiently scheduling tasks to multiple endpoint devices are contemplated. In various implementations, multiple endpoint devices are placed in a computing system. The endpoint devices include one or more of a general-purpose microprocessor, a parallel data processor or processing unit, local memory, and one or more link or other interconnect interfaces for transferring data with other endpoint devices. In an implementation, each of the endpoint devices is a GPU that uses a parallel data processor, and the endpoint devices are used in non-uniform memory access (NUMA) nodes that utilize the endpoint devices to process tasks. Therefore, the computing system has a physical hardware topology that includes the multiple endpoint devices and at least one or more general-purpose CPUs and system memory. A software layer, such as a virtualization layer, is added between the hardware of the computing system and an operating system of one of the processors of the computing system such as a particular CPU. In various implementations, this software layer creates and runs at least one guest virtual machine (VM) in the computing system with the multiple endpoint devices.
A particular endpoint device runs a guest device driver of the guest VM. When executing this guest device driver of the guest VM, a processor (e.g., a microprocessor, a data parallel processor, or other) of this particular endpoint device performs multiple steps. For example, the processor determines that a task is ready for data transfer between two endpoint devices of the guest VM. The guest VM utilizes a guest VM topology that is different from the physical hardware topology. The processor accesses a distance table storing indications of distance or latency information corresponding to one or more pairs of endpoint devices of the guest VM based on the physical hardware topology, rather than based on the guest VM topology. In various implementations, the table was built earlier by a topology manager and sent to the processor of the endpoint device for storage. In an implementation, the processor selects a pair of endpoint devices listed in the table that provides the smallest latency or smallest distance for data transfer based on the physical hardware topology. The processor then schedules the task on the selected pair of endpoint devices.
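As an illustration only, the following Python sketch mirrors this selection step. The table layout and the helper names (select_endpoint_pair, schedule_task) are assumptions chosen for clarity, not an actual driver interface.

```python
# A distance table keyed by pairs of endpoint device IDs; the values are
# indications of latency derived from the physical hardware topology,
# not from the guest VM topology.
distance_table = {
    (0, 0): 10, (0, 1): 30,
    (1, 0): 30, (1, 1): 10,
}

def select_endpoint_pair(candidate_pairs, table):
    """Return the candidate pair with the smallest indicated latency."""
    return min(candidate_pairs, key=lambda pair: table[pair])

def schedule_task(task, pair):
    """Placeholder for the driver's actual scheduling call."""
    print(f"scheduling {task} on endpoint pair {pair}")

# A task is ready for data transfer between two endpoints of the guest VM.
candidate_pairs = [(0, 1), (1, 0)]
schedule_task("task-A", select_endpoint_pair(candidate_pairs, distance_table))
```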
In the below description,
Turning now to
In an implementation, the physical hardware topology 110 includes hardware circuitry such as general-purpose central processing units (CPUs) 120 and 130, root complexes 122 and 132, and endpoint devices 124 and 134. Additionally, the physical hardware topology 110 includes the topology manager 140. The endpoint devices 124 and 134 include one or more of a general-purpose microprocessor, a parallel data processor or processing unit, local memory, and one or more link or other interconnect interfaces for transferring data with one another and with the CPUs 120 and 130 via the root complexes 122 and 132. In an implementation, each of the endpoint devices 124 and 134 is a graphics processing unit (GPU) that uses a parallel data processor. In another implementation, one or more of the endpoint devices is another type of parallel data processor such as a digital signal processor (DSP), a custom application specific integrated circuit (ASIC), or other. In various implementations, the endpoint devices 124 and 134 are used in non-uniform memory access (NUMA) nodes that utilize the endpoint devices 124 and 134 to process tasks.
The topology manager 140 generates the distance table 180 that stores indications of distances or latencies between pairs of endpoint devices. The indications of distances or latencies are based on the physical hardware topology 110, rather than the guest VM topology 160. In some implementations, the indication of distance or latency is a non-uniform memory access (NUMA) distance between two nodes such as between two different processors, between a particular processor and a particular memory, or other. The NUMA distance can be indicated by a PCIe locality weight, an input/output (I/O) link weight, or other. Typically, the lower the weight value, the shorter the distance and the smaller the latency between the two nodes. Other indications of distance and latency are possible and contemplated. As used herein, a “distance table” can be used interchangeably with a “latency table.”
In some implementations, the topology manager 140 uses a physical identifier (ID) to determine a value that identifies the location of a particular endpoint device in the physical hardware topology 110 of the computing system 100. In an implementation, the topology manager 140 determines a BDF (or B/D/F) value based on the PCI standard that locates the particular endpoint device in the physical hardware topology 110. BDF stands for Bus, Device, Function, and in the PCI standard specification, it is a 16-bit value. Based on the PCI standard, the 16-bit value includes 8 bits for identifying one of 256 buses, 5 bits for identifying one of 32 devices on a particular bus, and 3 bits for identifying one of 8 functions on a particular device. Other values for identifying a physical location of the endpoint device in the physical hardware topology are also possible and contemplated. The topology manager 140 then determines an indication of latency or distance between pairs of endpoint devices using the identified physical locations. For example, the topology manager 140 determines NUMA distances that the topology manager 140 places in a copy of the distance table 180.
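For illustration, the following non-authoritative Python sketch packs and unpacks the 16-bit BDF value described above: 8 bits of bus number, 5 bits of device number, and 3 bits of function number.

```python
def encode_bdf(bus: int, device: int, function: int) -> int:
    """Pack bus/device/function into the 16-bit PCI BDF layout."""
    assert 0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8
    return (bus << 8) | (device << 3) | function

def decode_bdf(bdf: int) -> tuple[int, int, int]:
    """Unpack a 16-bit BDF value into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7

# Example: bus 0xA3, device 0, function 0.
assert decode_bdf(encode_bdf(0xA3, 0, 0)) == (0xA3, 0, 0)
```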
Each of the CPUs 120 and 130 processes instructions of a predetermined algorithm. The processing includes fetching instructions and data, decoding instructions, executing instructions and storing results. In an implementation, the CPUs 120 and 130 use one or more processor cores with circuitry for executing instructions according to a predefined general-purpose instruction set architecture (ISA). Each of the root complexes 122 and 132 provides connectivity between a respective one of the CPUs 120 and 130 and one or more endpoint devices. As used herein, an “endpoint device” can also be referred to as an “endpoint.” For example, endpoint devices 124 and 134 can also be referred to as endpoints 124 and 134. In the illustrated implementation, each of the root complexes 122 and 132 is connected to a single endpoint, but in other implementations, one or more of the root complexes 122 and 132 is connected to multiple endpoints.
As used herein, a “root complex” refers to a communication switch fabric at the root of an inverted tree hierarchy, near a corresponding CPU, that is capable of communicating with multiple endpoints. For example, the root complex is connected to the corresponding CPU through a local bus, and the root complex generates transaction requests on behalf of the corresponding CPU to send to one or more endpoint devices that are connected via ports to the root complex. The root complex includes one or more queues for storing requests and responses corresponding to various types of transactions such as messages, commands, payload data, and so forth. The root complex also includes circuitry for implementing switches for routing transactions and for supporting a particular communication protocol. One example of a communication protocol is the Peripheral Component Interconnect Express (PCIe) communication protocol.
In various implementations, each of the endpoints 124 and 134 includes a parallel data processing unit, which utilizes a single instruction multiple data (SIMD) micro-architecture. As described earlier, in some implementations, the parallel data processing unit is a graphics processing unit (GPU). The SIMD micro-architecture uses multiple compute resources, with each of the compute resources having a pipelined lane for executing a work item of many work items. Each work item is a combination of a command and respective data. One or more other pipelined lanes use the same instructions for the command, but operate on different data. Each pipelined lane is also referred to as a compute unit.
The parallel data processing unit of the endpoint devices 124 and 134 uses various types of memories such as a local data store shared by two or more compute units within a group as well as a command cache and a data cache shared by each of the compute units. Local registers in register files within each of the compute units are also used. The parallel data processing unit additionally uses secure memory for storing secure programs and secure data accessible by only a controller within the parallel data processing unit. The controller is also referred to as a command processor within the parallel data processing unit. In various implementations, the command processor decodes requests to access information in the secure memory and prevents requestors other than itself from accessing content stored in the secure memory. For example, a range of addresses in on-chip memory within the parallel data processing unit is allocated for providing the secure memory. If an address within the range is received, the command processor decodes other attributes of the transaction, such as a source identifier (ID), to determine whether or not the request is sourced by the command processor.
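As one illustrative sketch of the check just described, a request that targets the secure range is honored only when its source identifier matches the command processor. The address range and identifier values below are assumptions, not values from any particular device.

```python
SECURE_BASE = 0x8000_0000    # assumed start of the secure address range
SECURE_LIMIT = 0x8010_0000   # assumed end of the secure address range
COMMAND_PROCESSOR_ID = 0x01  # assumed source ID of the command processor

def secure_request_allowed(address: int, source_id: int) -> bool:
    """Allow access to the secure range only for the command processor."""
    if SECURE_BASE <= address < SECURE_LIMIT:
        return source_id == COMMAND_PROCESSOR_ID
    return True  # addresses outside the secure range are not restricted here
```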
The memory 150 is any suitable memory device. Examples of the memory devices are dynamic random access memories (DRAMs), synchronous DRAMs (SDRAMs), static RAM, three-dimensional (3D) integrated DRAM, and so forth. It is also possible and contemplated that the physical hardware topology 110 includes one or more of a variety of other processing units. The multiple processing units can be individual blocks or individual dies on an integrated circuit (IC), such as a system-on-a-chip (SOC). Alternatively, the multiple processing units can be individual blocks or individual dies within a package, such as a multi-chip module (MCM).
A software layer, or virtualization layer, is added between the hardware of the physical hardware topology 110 and an operating system of one of the CPUs 120 and 130. In one instance, this software layer runs on top of a host operating system and spawns higher level guest virtual machines (VMs). This software layer monitors corresponding VMs and redirects requests for resources to appropriate application program interfaces (APIs) in the hosting environment. This type of software layer is referred to as a virtual machine manager (VMM) such as VMM 152 stored in memory 150. A virtual machine manager is also referred to as a virtual machine monitor or a hypervisor. The virtualization provided by the VMM 152 allows one or more guest VMs, such as guest VM 154, to use the hardware resources of the parallel data processors of the endpoint devices 124 and 134. Each guest VM executes as a separate process that uses the hardware resources of the parallel data processor.
In an implementation, the VMM 152 is used to generate the guest VM 154 that uses the guest VM topology 160. A guest device driver runs (or executes) as a process on one of the endpoint devices 124 and 134, along with a guest operating system, to implement the guest VM 154, which uses the hardware of the CPUs 120 and 130. In addition, the guest VM 154 uses the hardware of the endpoint devices 124 and 134. However, rather than use the hardware of the root complexes 122 and 132, the guest VM 154 uses an emulated root complex 170. Therefore, without help from the topology manager 140, the guest device driver of the guest VM 154 is unaware of the true connectivity between the endpoints 124 and 134. For example, the connectivity in the guest VM topology 160 uses the single emulated root complex 170 between them. However, in the physical hardware topology 110, the true, physical connectivity between the endpoints 124 and 134 passes through each of the root complexes 122 and 132 and, via those root complexes, through each of the CPUs 120 and 130.
As described earlier, the topology manager 140 generates the indications of distances or latencies stored in the distance table 180 based on the physical hardware topology 110, rather than the guest VM topology 160. When executed by one of the endpoint devices 124 and 134, the guest VM 154 uses a copy of the distance table 180 when scheduling tasks. As described earlier, a copy of the distance table 180 is stored in one or more of the CPUs 120 and 130 and the endpoint devices 124 and 134, or the copy is stored in a memory accessible by one or more of the CPUs 120 and 130 and the endpoint devices 124 and 134.
In one implementation, the topology manager 140 is implemented by a dedicated processor. An example of the dedicated processor is a security processor. In some implementations, the security processor is a dedicated microcontroller within an endpoint device that includes one or more of a microprocessor, a variety of types of data storage, a memory management unit, a dedicated cryptographic processor, a direct memory access (DMA) engine, and so forth. The interface to the security processor is carefully controlled, and in some implementations, direct access to the security processor by external devices is avoided. Rather, in an implementation, communication with the security processor uses a secure mailbox mechanism where external devices send messages and requests to an inbox. The security processor determines whether to read and process the messages and requests, and sends generated responses to an outbox. Other communication mechanisms with the security processor are also possible and contemplated.
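A small sketch of such a mailbox-style interface is shown below; the class and method names are hypothetical and only mirror the inbox/outbox flow described above.

```python
from collections import deque

class SecureMailbox:
    """External devices post to the inbox; the security processor decides
    which messages to process and places responses in the outbox."""

    def __init__(self):
        self.inbox = deque()
        self.outbox = deque()

    def post(self, message):
        # Called by an external device; no direct access to the processor.
        self.inbox.append(message)

    def service(self, handle):
        # Called by the security processor to drain and answer the inbox.
        while self.inbox:
            self.outbox.append(handle(self.inbox.popleft()))
```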
In other implementations, the functionality of the topology manager 140 is implemented across multiple security processors such as a security processor of the endpoint device 124 and another security processor of the endpoint device 134 where the endpoint devices 124 and 134 are used in the guest VM topology 160. For example, the endpoints 124 and 134 include the security processors (SPs) 125 and 135, respectively. In another implementation, the functionality of the topology manager 140 is implemented by one or more of the CPUs 120 and 130. In yet other implementations, the functionality of the topology manager 140 is implemented by a security processor of one of the CPUs 120 and 130 that runs the VMM 152. For example, the CPUs 120 and 130 include the security processors (SPs) 121 and 131, respectively. In further implementations, the functionality of the topology manager 140 is implemented by a combination of one or more of these security processors 121, 131, 125 and 135.
Regardless of the particular combination of hardware selected to perform the functionality of the topology manager 140, it is noted that the functionality of the topology manager 140 is also implemented by the selected combination of hardware executing instructions of one or more of a variety of types of software. The variety of types of software include a host device driver running on one of the CPUs 120 and 130, a particular application running on one of the CPUs 120 and 130, a device driver within the guest VM 154, the guest VM 154, a variety of types of firmware, and so on.
In an implementation, the distance table 180 includes indications of distances or latencies between pairs of endpoint devices. A single pair of endpoint devices 124 and 134 is shown as an example, but in other implementations, each of the physical hardware topology 110 and the guest VM topology 160 uses multiple pairs of endpoint devices. As shown, the distance table 180 includes physical identifiers (IDs) of the endpoint devices 124 and 134 as well as corresponding indications of latencies. In the illustrated implementation, the endpoint 124 has the physical device identifier (PID) 83, which is a hexadecimal value, and the virtual device identifier (VID) 0. The endpoint 134 has a PID value of A3, which is also a hexadecimal value, and a VID value of 1. The shaded entries of the distance table 180 indicate the distances or latencies that are set based on the use of the topology manager 140. The shaded entries illustrate the distances or latencies that would differ if the distance table 180 was generated based on the guest VM topology 160, rather than the physical hardware topology 110. The differing values of these entries are described below in the upcoming description of the tables 200 (of
Referring to
The range of latencies in the mappings 210 is shown as a smallest value of 10 and a largest value of 255. The smallest indication of latency of 10 corresponds to a connection that includes an endpoint device sending a transaction to itself. The largest indication of latency of 255 corresponds to a connection that does not exist. In other words, there is no path between a particular pair of endpoint devices. A connection, or path, for data transfer between a pair of CPUs connected to one another is shown to have an indication of latency of 12. A path for data transfer between a pair of endpoint devices with a single root complex between them is shown to have an indication of latency of 15. A path for data transfer between a pair of endpoint devices with two root complexes and two CPUs between them is shown to have an indication of latency of 30. An example of this path is provided earlier regarding the path between the endpoint devices 124 and 134 (of
Rather than show each type of path as a physical hardware topology grows and becomes more complex, an entry of the mappings 210 shows a formula that can be potentially used. For example, as the number of root complexes and corresponding endpoint devices grows, in some cases, the indication of latency grows based on the formula 30+(N−2)×12. In other words, when a first endpoint sends a transaction to a second endpoint across 4 CPUs and 2 root complexes, the indication of latency is 30+(4-2)×12, or 54. The distance tables 220 and 230 correspond to the physical hardware topology 110 and the guest VM topology 160 (of
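To make the mappings 210 concrete, the following Python sketch reproduces the example latency indications and the 30+(N−2)×12 formula described above; the helper and its argument names are illustrative assumptions only.

```python
NO_PATH = 255  # indication used when no path exists between a pair

def latency_indication(num_cpus_on_path: int = 0, same_endpoint: bool = False,
                       shared_root_complex: bool = False) -> int:
    """Return the indication of latency for a path, per the mappings 210."""
    if same_endpoint:
        return 10                      # endpoint sends a transaction to itself
    if shared_root_complex:
        return 15                      # single root complex between endpoints
    if num_cpus_on_path >= 2:
        # two root complexes plus N CPUs between the endpoints
        return 30 + (num_cpus_on_path - 2) * 12
    return NO_PATH

assert latency_indication(num_cpus_on_path=2) == 30
assert latency_indication(num_cpus_on_path=4) == 54   # the 30 + (4 - 2) x 12 example
```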
Without the use of the topology manager, the endpoint devices, such as endpoint devices 124 and 134 of the computing system 100 (of
Turning now to
A guest device driver runs as a process on one of the endpoint devices 324-356, along with a guest operating system, to implement the guest VM, which uses the hardware of the CPUs 320 and 330. In addition, the guest VM uses the hardware of the endpoint devices 324-356. However, rather than use the hardware of the root complexes 322-352, the guest VM uses an emulated root complex 380. The virtual device identifiers (VIDs) 0-7 are assigned to the endpoint devices 324-356. The corresponding physical device IDs (PIDs) are shown in the physical hardware topology 310. In various implementations, the topology manager 360 includes the functionality of the topology manager 140, and additionally, the topology manager 360 is implemented by one of a variety of implementations described earlier for the topology manager 140. The topology manager 360 performs steps to generate a distance table based on the physical hardware topology 310, rather than the guest VM topology 370. The details of this distance table are provided in the below description.
Referring to
Each entry of the ID mapping table 410 stores a mapping between a physical device ID (PID) of an endpoint device and a corresponding virtual device identifier (VID). The values of these IDs are shown in the computing system 300 (of
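Purely as an illustration, such a mapping table can be represented as shown below. The first two PID values reuse the example values from the distance table 180, and the remaining PIDs are assumed placeholders rather than values taken from the figures.

```python
# Assumed VID -> PID mappings for one guest VM; real values come from the
# physical hardware topology, as described above.
id_mapping_table = {
    0: 0x83,
    1: 0xA3,
    2: 0xB3,  # placeholder PIDs for the remaining virtual devices
    3: 0xC3,
}

def pid_for_vid(vid: int) -> int:
    """Translate a guest VM virtual device ID to its physical device ID."""
    return id_mapping_table[vid]
```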
Turning now to
Referring to
Turning now to
The virtual device identifiers (VIDs) 8-11 are assigned to the endpoint devices 324, 334, 346 and 356. The corresponding physical device IDs (PIDs) are shown in the physical hardware topology 310. The topology manager 360 performs steps to generate a distance table based on the physical hardware topology 310, rather than the guest VM topology 770. The details of this distance table are provided in the below description.
Referring to
The distance table 820 uses the PIDs of endpoint devices to provide the indications of latencies between pairs of endpoint devices used in a guest VM. The indications of latencies in the distance table 820 are based on a physical hardware topology, rather than a guest VM topology. In contrast, the indications of latencies in the distance table 830 are based on a guest VM topology, rather than a physical hardware topology. The shaded entries of the distance tables 820 and 830 indicate the latencies that are adjusted based on the use of the topology manager (such as topology manager 360 of
Turning now to
Multiple endpoint devices are placed in a computing system. The endpoint devices include one or more processors, local memory, and one or more link or other interconnect interfaces for transferring data with other endpoint devices. In an implementation, each of the endpoint devices is a GPU that uses a parallel data processor. In some implementations, the GPUs are used in non-uniform memory access (NUMA) nodes that utilize the GPUs to process tasks. The computing system also includes one or more general-purpose CPUs, system memory, and one or more of a variety of peripheral devices besides the endpoint devices. It is also possible and contemplated that the computing system includes one or more of a variety of other processing units.
A software layer is added between the hardware of the computing system and an operating system of one of the processors of the computing system such as a particular CPU. In various implementations, this software layer creates and runs at least one guest virtual machine (VM) in the computing system with the multiple endpoint devices. A particular endpoint device runs a guest device driver of the guest VM. When executing this guest device driver, a processor of this particular endpoint device determines a task is ready for data transfer between two endpoint devices of the guest VM that utilizes a first hardware topology (block 902). The processor accesses a distance table of latency information of one or more pairs of endpoints of the guest VM based on a second hardware topology different from the first hardware topology (block 904). In an implementation, the first hardware topology uses an emulated root complex, whereas, the second hardware topology includes the actual physical root complexes and corresponding connections. In various implementations, the distance table was built earlier by a topology manager (such as topology manager 140 of
The processor compares a latency of the selected pair to latencies of other pairs of endpoints provided in the distance table (block 908). If the latency of the selected pair is not the smallest latency (“no” branch of the conditional block 910), then the control flow of method 900 returns to block 906 where the processor selects a next pair of endpoints. If the latency of the selected pair is the smallest latency (“yes” branch of the conditional block 910), then the processor schedules the task on the selected pair of endpoints (block 912). Therefore, in an implementation, the processor selects the pair of endpoints based on determining that a particular latency of the latency information corresponding to the pair of endpoints is less than any latency of the latency information corresponding to each other pair of endpoints of the second hardware topology.
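The selection loop of blocks 906-912 can be sketched as follows; the distance table is assumed to be a simple mapping keyed by endpoint pairs, and the schedule callback is a placeholder for the driver's actual scheduling mechanism.

```python
def schedule_on_smallest_latency(task, pairs, distance_table, schedule):
    """Mirror the control flow of blocks 906-912 of method 900."""
    for pair in pairs:                                   # block 906: select a pair
        is_smallest = all(                               # block 908: compare latencies
            distance_table[pair] <= distance_table[other]
            for other in pairs if other != pair)
        if is_smallest:                                  # "yes" branch of block 910
            schedule(task, pair)                         # block 912: schedule the task
            return pair
    return None                                          # no pair was selected
```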
For each of the methods 1000 and 1100 (of
The endpoint device that runs a particular guest VM retrieves a list of physical device identifiers (IDs) of multiple endpoint devices of a virtual hardware topology of the guest VM (block 1004). Within this endpoint device, in an implementation, one or more of a security processor and a device driver or an application running on a separate processor accesses a mapping table that stores mappings between virtual IDs of endpoint devices used in the guest VM and the corresponding physical IDs. In another implementation, the security processor of this endpoint device retrieves the physical IDs from a CPU that runs a host driver or an application that accesses mappings between the virtual IDs and the physical IDs. One of the various implementations of the topology manager finds a physical location in the physical hardware topology for endpoint devices corresponding to the list of physical device IDs (block 1006). Further details of an indication of this physical location are provided in the below description. The topology manager determines latencies between each pair of endpoint devices corresponding to the list of physical device IDs (block 1008). As described earlier, an example of an indication of latency is a NUMA distance. The topology manager inserts the indications of latencies and the physical device IDs in a table (block 1010). Since the physical IDs of only the endpoint devices used by the guest VM are used, this table is a trimmed distance table that includes latency information only for the endpoint devices used by the guest VM. The above steps performed in blocks 1004-1010 can be repeated for each guest VM used in the computing system.
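A hedged sketch of blocks 1006-1010 is shown below, with the list of physical IDs retrieved in block 1004 passed in as guest_vm_pids. The locate() and determine_latency() callbacks are stand-ins for the physical-location lookup (for example, a BDF value) and the NUMA-distance computation, and are not part of any interface described here.

```python
def build_trimmed_table(guest_vm_pids, locate, determine_latency):
    """Build a trimmed distance table for the endpoints used by one guest VM."""
    # Block 1006: find the physical location of each endpoint by physical ID.
    locations = {pid: locate(pid) for pid in guest_vm_pids}
    table = {}
    for pid_a in guest_vm_pids:
        for pid_b in guest_vm_pids:
            # Block 1008: determine an indication of latency for the pair.
            latency = determine_latency(locations[pid_a], locations[pid_b])
            # Block 1010: insert the latency and the physical IDs into the table.
            table[(pid_a, pid_b)] = latency
    return table
```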
In some implementations, the topology manager determines a value for a particular endpoint device, using the physical ID, that determines a location of the endpoint device in the physical hardware topology of the computing system. For example, the topology manager determines a BDF (or B/D/F) value based on the PCI standard that locates the particular endpoint device in the physical hardware topology. The BDF value stands for Bus, Device, Function, and in the PCI standard specification, it is a 16-bit value. Based on the PCI standard, the 16-bit value includes 8 bits for identifying one of 256 buses, 5 bits for identifying one of 32 devices on a particular bus, and 3 bits for identifying a particular function of 8 functions on a particular device. Other values for identifying a physical location of the endpoint device in the physical hardware topology are also possible and contemplated.
Turning now
The topology manager accesses, using the physical IDs, a table of latencies between pairs of endpoint devices based on the physical hardware topology (block 1106). The topology manager creates a trimmed table using latency information corresponding to the physical IDs retrieved from the table (block 1108). The topology manager sends the trimmed table to the guest driver of the guest VM running on the given endpoint device (block 1110).
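Similarly, blocks 1106-1110 can be sketched as trimming an existing full distance table. Here, full_table is assumed to be keyed by pairs of physical IDs, and send_to_guest_driver() is a placeholder for whatever delivery mechanism an implementation uses.

```python
def trim_and_send(full_table, guest_vm_pids, send_to_guest_driver):
    """Trim the full distance table to one guest VM's endpoints and deliver it."""
    trimmed = {
        (pid_a, pid_b): full_table[(pid_a, pid_b)]   # block 1108: keep only relevant entries
        for pid_a in guest_vm_pids
        for pid_b in guest_vm_pids
    }
    send_to_guest_driver(trimmed)                    # block 1110: send to the guest driver
    return trimmed
```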
It is noted that one or more of the above-described implementations include software. In such implementations, the program instructions that implement the methods and/or mechanisms are conveyed or stored on a computer readable medium. Numerous types of media which are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, Programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage. Generally speaking, a computer accessible storage medium includes any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium includes storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media further includes volatile or non-volatile memory media such as RAM (e.g. synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g. Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc. Storage media includes microelectromechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.
Additionally, in various implementations, program instructions include behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high-level programming language such as C, or a hardware design language (HDL) such as Verilog or VHDL, or a database format such as GDS II stream format (GDSII). In some cases, the description is read by a synthesis tool, which synthesizes the description to produce a netlist including a list of gates from a synthesis library. The netlist includes a set of gates, which also represent the functionality of the hardware including the system. The netlist is then placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks are then used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the instructions on the computer accessible storage medium are the netlist (with or without the synthesis library) or the data set, as desired. Additionally, the instructions are utilized for purposes of emulation by a hardware-based emulator from such vendors as Cadence®, EVE®, and Mentor Graphics®.
Although the implementations above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.