A processor, or set of processors, executes instructions from an instruction set, e.g., the instruction set architecture (ISA). The instruction set is the part of the computer architecture related to programming, and generally includes the native data types, instructions, register architecture, addressing modes, memory architecture, and exception handling, and external input and output (IO). It should be noted that the term instruction herein may refer to a macro-instruction, e.g., an instruction that is provided to the processor for execution, or to a micro-instruction, e.g., an instruction that results from a processor's decoder decoding macro-instructions.
Various examples in accordance with the present disclosure will be described with reference to the drawings, in which:
The present disclosure relates to methods, apparatus, systems, and non-transitory computer-readable storage media for implementing address translation extensions for confidential computing hosts.
In the following description, numerous specific details are set forth. However, it is understood that examples of the disclosure may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
References in the specification to “one example,” “an example,” “examples,” etc., indicate that the example described may include a particular feature, structure, or characteristic, but every example may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same example. Further, when a particular feature, structure, or characteristic is described in connection with an example, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other examples whether or not explicitly described.
A (e.g., hardware) processor (e.g., having one or more cores) may execute instructions (e.g., a thread of instructions) to operate on data, for example, to perform arithmetic, logic, or other functions. For example, software may request an operation and a hardware processor (e.g., a core or cores thereof) may perform the operation in response to the request. Certain operations include accessing one or more memory locations, e.g., to store and/or read (e.g., load) data. A system may include a plurality of cores, e.g., with a proper subset of cores in each socket of a plurality of sockets, e.g., of a system-on-a-chip (SoC). Each core (e.g., each processor or each socket) may access data storage (e.g., a memory). Memory may include volatile memory (e.g., dynamic random-access memory (DRAM)) or (e.g., byte-addressable) persistent (e.g., non-volatile) memory (e.g., non-volatile RAM) (e.g., separate from any system storage, such as, but not limited to, a hard disk drive). One example of persistent memory is a dual in-line memory module (DIMM) (e.g., a non-volatile DIMM), for example, accessible according to a Peripheral Component Interconnect Express (PCIe) standard.
In certain examples of computing, a virtual machine (VM) (e.g., guest) is an emulation of a computer system. In certain examples, VMs are based on a specific computer architecture and provide the functionality of an underlying physical computer system. Their implementations may involve specialized hardware, firmware, software, or a combination. In certain examples, a virtual machine monitor (VMM) (also known as a hypervisor) is a software program that, when executed, enables the creation, management, and governance of VM instances and manages the operation of a virtualized environment on top of a physical host machine. A VMM is the primary software behind virtualization environments and implementations in certain examples. When installed over a host machine (e.g., processor) in certain examples, a VMM facilitates the creation of VMs, e.g., each with separate operating systems (OS) and applications. The VMM may manage the backend operation of these VMs by allocating the necessary computing, memory, storage, and other input/output (IO) resources, such as, but not limited to, an input/output memory management unit (IOMMU) (e.g., an IOMMU circuit). The VMM may provide a centralized interface for managing the entire operation, status, and availability of VMs that are installed over a single host machine or spread across different and interconnected hosts.
However, it may be desirable to maintain the security (e.g., confidentiality) of information for a virtual machine from the VMM and/or other virtual machine(s). Certain processors (e.g., a system-on-a-chip (SoC) including a processor) utilize their hardware to isolate virtual machines, for example, with each referred to as a “trust domain”. Certain processors support an instruction set architecture (ISA) (e.g., ISA extension) to implement trust domains. For example, Intel® trust domain extensions (Intel® TDX) utilize architectural elements to deploy hardware-isolated virtual machines (VMs) referred to as trust domains (TDs).
In certain examples, a hardware processor and its ISA (e.g., a trust domain manager thereof) isolate TD VMs from the VMM (e.g., hypervisor) and/or other non-TD software (e.g., on the host platform). In certain examples, a hardware processor and its ISA (e.g., a trust domain manager thereof) implement trust domains to enhance confidential computing by helping protect the trust domains from a broad range of software attacks and reducing the trust domain's trusted computing base (TCB). In certain examples, a hardware processor and its ISA (e.g., a trust domain manager thereof) enhance a cloud tenant's control of data security and protection. In certain examples, a hardware processor and its ISA (e.g., a trust domain manager thereof) implement trust domains (e.g., trusted virtual machines) to enhance a cloud-service provider's (CSP) ability to provide managed cloud services without exposing tenant data to adversaries.
In certain examples, a hardware processor and its ISA (e.g., a trust domain manager thereof) also support device input/output (IO). For example, an ISA (e.g., Intel® TDX 2.0) may support trust domain extensions (TDX) with device input/output (IO) (e.g., TDX-IO). In certain examples, a hardware processor and its ISA (e.g., a trust domain manager thereof) that support device input/output (IO) (e.g., TDX-IO) enable the use (e.g., assignment) of a physical function (PF) and/or a virtual function (VF) of a device to (e.g., only) a specific TD.
Certain trust domains (TDs) are used to host confidential computing workloads isolated from hosting environments. Certain trust domain technology (e.g., TDX 1.0) architecture enables isolation of the TD (e.g., central processing unit (CPU)) context and memory from the hosting environment, but does not support trusted IO (e.g., direct memory access (DMA) or memory-mapped I/O (MMIO)) to TD private memory, e.g., leading to higher overheads as trust domains are to use a software mechanism for protecting data sent to IO devices (e.g., storage, network, etc.), for example, where all IO data is sent through bounce buffers in TD shared memory using para-virtualized interfaces. However, in certain examples, this precludes the use of some IO models, such as, but not limited to, shared virtual memory, direct IO assignments, and compute offload to an accelerator, field-programmable gate array (FPGA), and/or graphics processing unit (GPU). Thus, from an IO perspective, certain trust domain technology (e.g., TDX 1.0) suffers from the limitations of 1) functionality (e.g., security), because protection can only be extended to devices having the capability of end-to-end encryption (e.g., hardware (H/W) or software (S/W) stack based) and there is no support for state-of-the-art IO virtualization/programming models, and 2) performance, because copying for bounce buffers (and software-based encryption) incurs significant performance overheads, especially with increased speed/bandwidth of IO devices (e.g., accelerators). Certain trust domain technology (e.g., TDX 1.0) also suffers from the limitation that DMAs from devices are done to unprotected memory (e.g., shared memory, which may or may not be encrypted and integrity-protected memory pages), and hence the TD (e.g., software running in the TD) is to copy the data between the unprotected (e.g., shared) memory and TD private memory.
In certain examples, these additional copies introduce significant overhead on software and/or are used when the data that needs to be sent or received from the device is previously encrypted or integrity-protected (e.g., using software-managed keys), for example, in the case of network traffic or storage. In certain examples, this scheme of “bouncing” the data through shared buffers does not support shared virtual memory (SVM) usages since devices cannot access the shared (e.g., IA) page tables in TD private memory and does not support accelerator offload models where clear-text data from the TD private memory needs to be operated upon by the accelerator.
Certain trust domain technology (for example, trust domain extensions (TDX) with device input/output (IO) (e.g., TDX-IO)) defines the hardware, firmware, and/or software extensions to enable direct and trusted (e.g., confidential) IO between TDs and corresponding IO (e.g., TDX-IO) enlightened devices, and thus overcomes the above limitations. In certain examples, an IOMMU (e.g., a VT-d engine thereof) on a system-on-a-chip (SoC) is the critical hardware enabling trusted direct memory access (trusted DMA) between these device(s) (e.g., in a TD's trusted computing base (TCB)) and the private memory of one or more TDs.
Certain examples herein are directed to VT-d/IOMMU extensions for enabling TDX-IO. Certain examples herein are directed to TDX-IO IOMMU (e.g., virtualization technology for directed I/O (VT-d)) extensions to a processor and/or its ISA. Certain examples herein extend an IOMMU (e.g., circuitry) to enable direct device assignment to one or more TDs and/or enable IO (e.g., PCIe) devices to access a TD's confidential memory. Certain examples herein extend an IOMMU (e.g., circuitry) with (i) a new security attributes of initiator (SAI) protected (e.g., access controlled to only trusted firmware or TDX Module and/or SEAM) architectural register set, (ii) a trusted root table pointer for enabling trusted DMA walks to TD private memory from device(s) in the TD's TCB (e.g., a TD assigned device), a trusted invalidation queue (e.g., and register(s) for its base address, head, tail, and events) for enabling trusted invalidations, and a trusted page request queue (e.g., and register(s) for its base address, head, tail, and events) for enabling trusted page-requests, and thereby secure page and/or IO resource reassignment, and/or (iii) a control (e.g., TDX_MODE) register for securely transitioning the IOMMU in and out of trust domain (e.g., tdx_mode) operation.
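By way of illustration only, a queue described by base address, head, and tail registers may be modeled as a ring buffer. The following is a minimal sketch and not a definitive implementation; the class name `DescriptorQueue`, the `from_seam` flag, and the rule that the trusted queue accepts submissions only from a SEAM initiator are assumptions made for this example.

```python
class DescriptorQueue:
    """Hypothetical ring buffer addressed by head and tail indices."""

    def __init__(self, size, trusted):
        self.slots = [None] * size   # stands in for memory at the queue base address
        self.head = 0                # next descriptor to be consumed
        self.tail = 0                # next free slot to be written
        self.trusted = trusted       # True for the trusted queue

    def submit(self, descriptor, from_seam):
        # Assumption: only a SEAM initiator may write the trusted queue.
        if self.trusted and not from_seam:
            raise PermissionError("trusted queue requires a SEAM initiator")
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            raise BufferError("queue full")
        self.slots[self.tail] = descriptor
        self.tail = nxt

    def consume(self):
        if self.head == self.tail:
            return None              # queue is empty
        descriptor = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return descriptor
```

In this sketch one slot is always left unused so that equal head and tail values unambiguously mean the queue is empty.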
In certain examples, a T attribute (e.g., a trusted execution environment (TEE) bit or an “ide_t” bit) or an XT attribute (e.g., eXtended trusted execution environment (XT) bits or XT0/XT1 bits) (e.g., in an incoming Peripheral Component Interconnect Express (PCIe) standard's integrity and data encryption (IDE) transaction layer packet (TLP) prefix) in a memory access request (e.g., a request by an IO device to a private/shared memory of a trust domain) (i) signifies whether a DMA request (e.g., transaction) originates from a trusted IO context, and/or (ii) is used to select between walking the untrusted (e.g., VMM) maintained (e.g., VT-d) translation tables (e.g., from root pointer) or the trusted (e.g., TDM) (e.g., TDX Module) maintained (e.g., VT-d) translation tables (e.g., from trusted root pointer). In certain examples, a translation table includes a mapping of a virtual address to a physical address. Certain examples herein are directed to IOMMU/host extensions to support trusted DMAs to a TD's private memory, e.g., including the definition of new architectural states to manage trusted DMA-translation table entries, support trusted Address Translation Services (trusted ATS), support trusted Page Request Services (trusted PRS), support TEE-Polarity of Completer (TPC) in ATS transactions, and/or support eXtended TEE (XT) mode.
In certain examples, a VMM is not trusted to access “trusted” translation table(s) for a trust domain or a plurality of trust domains (e.g., not trusted with the mappings of a (e.g., guest) trust domain (e.g., physical) address to a host (e.g., physical) address), and a trust domain manager is to instead manage the translation tables for the trust domain or the plurality of trust domains. In certain examples, an IOMMU is to restrict access to the “trusted” translation tables, for example, to ensure that only trusted access(es) by an IO device is allowed, e.g., to ensure that the IO device is in the trusted computing base of the trust domain (or the plurality of trust domains).
In certain examples, an IOMMU includes an IO cache (e.g., IO translation lookaside buffer (IOTLB), context-cache, PASID-cache, first-stage and second-stage paging structure caches) to perform a translation, walk, etc. In certain examples, respective IOMMU caches are tagged to separate between trusted and untrusted (e.g., VT-d) mappings. In certain examples, for different transactions to memory (e.g., originating from the I/O device or the IOMMU itself), the IOMMU generates a command which is used to selectively allow addresses to TD private memory, e.g., where this catches various security threats from untrusted VMM/operating system(OS) VT-d tables/IOMMU programming and/or malicious devices.
In certain examples, the IOMMU enhancements enable TDX-IO, and thus are improvements to the functioning of a SoC (e.g., processor) (e.g., of a computer) itself as they allow for confidential computing in the cloud space (e.g., with (e.g., all) direct, performant IO models supported as well), particularly with the rise of heterogeneous computing with accelerators and IO devices in the cloud.
In certain examples, IOMMU enhancements include one or more of: an access controlled register set in a corresponding IOMMU, two (e.g., “trusted” and “untrusted”) root pointers, two (e.g., “trusted” and “untrusted”) invalidation queues, two (e.g., “trusted” and “untrusted”) page request queues, “trusted” tags in the IOMMU caches (e.g., translation table cache(s)), and/or new faults for trusted/untrusted DMA walks. In certain examples, these are architectural changes and are also documented in a corresponding IOMMU specification. In certain examples, these architectural changes can be seen by monitoring a DMA path of trusted transactions to and/or from system memory. In certain examples, IOMMU enhancements enable accelerator offload models for trust domains and allow these accelerators to access a TD's private memory. In certain examples, ATS support enables high-performance I/O (e.g., for next-gen datacenters) and makes various customer scenarios viable (e.g., direct peer-to-peer between TDX-IO devices, compute express link (CXL) cache, etc.). In certain examples, PRS support enables a simplified programming model for data-accelerators and enables efficient memory management with the use of shared virtual memory. In certain examples, TEE-Polarity of Completer (TPC) support enables efficient device caching/sharing and direct peer-to-peer scenarios. In certain examples, eXtended TEE (XT) mode support enables (i) a mechanism to convey TEE or non-TEE intent on the memory requests, and (ii) appropriate access checks based on the conveyed intent.
It should be understood that the functionality herein may be added to other confidential computing technology as a computing solution for IO devices, for example, to AMD® Secure Encrypted Virtualization (e.g., SEV) (e.g., Secure Encrypted Virtualization-Encrypted State (SEV-ES) and/or SEV-Secure Nested Paging (SEV-SNP)) or ARM® Realm Management Extension (RME). In certain examples, the confidential computing technology (e.g., AMD® SEV) uses one key per virtual machine to isolate guests and the hypervisor from one another, for example, where the keys are managed by a trust domain manager (e.g., AMD Secure Processor). In certain examples, the confidential computing (e.g., SEV) requires enablement in the guest operating system and hypervisor. In certain examples, the guest changes allow the virtual machine to indicate which pages in memory should be encrypted. In certain examples, the hypervisor changes use hardware virtualization instructions and communication with the trust domain manager (e.g., AMD Secure processor) to manage the appropriate keys in the memory controller. In certain examples, the confidential computing technology (e.g., ARM® Confidential Compute Architecture (ARM® CCA)) enables the construction of protected execution environments called realms, for example, where realms allow lower-privileged software, such as an application or a virtual machine, to protect its content and execution from attacks by higher-privileged software, such as an OS or a hypervisor.
Turning now to
In certain examples, each core includes (e.g., or logically includes) a set of registers, e.g., registers 103-0 for core 102-0, registers 103-N for core 102-N, etc. Registers 103 may be data registers and/or control registers, e.g., for each core (e.g., or each logical core of a plurality of logical cores of a physical core). In certain examples, each core includes its own cache and/or coupling to a next level(s) cache, for example, the cache hierarchy shown in
In certain examples, IO device 106 includes one or more accelerators (e.g., accelerator circuits 106-0 to 106-N (e.g., where N is any positive integer greater than one, although single accelerator circuit examples may also be utilized)).
Although the example shown in
Memory 108 may include operating system (OS) and/or virtual machine monitor code 110, user (e.g., program) code 112, non-trust domain memory 114 (e.g., pages), trust domain memory 116 (e.g., pages), uncompressed data (e.g., pages), compressed data (e.g., pages), or any combination thereof. In certain examples of computing, a virtual machine (VM) is an emulation of a computer system. In certain examples, VMs are based on a specific computer architecture and provide the functionality of an underlying physical computer system. Their implementations may involve specialized hardware, firmware, software, or a combination. In certain examples, the virtual machine monitor (VMM) (also known as a hypervisor) is a software program that, when executed, enables the creation, management, and governance of VM instances and manages the operation of a virtualized environment on top of a physical host machine. A VMM is the primary software behind virtualization environments and implementations in certain examples. When installed over a host machine (e.g., processor) in certain examples, a VMM facilitates the creation of VMs, e.g., each with separate operating systems (OS) and applications. The VMM may manage the backend operation of these VMs by allocating the necessary computing, memory, storage, and other input/output (IO) resources, such as, but not limited to, an input/output memory management unit (IOMMU). The VMM may provide a centralized interface for managing the entire operation, status, and availability of VMs that are installed over a single host machine or spread across different and interconnected hosts.
Memory 108 may be memory separate from a core and/or device 106. Memory 108 may be DRAM. Compressed data may be stored in a first memory device (e.g., far memory) and/or uncompressed data may be stored in a separate, second memory device (e.g., as near memory). A coupling (e.g., input/output (IO) fabric interface 104) may be included to allow communication between device 106, core(s) 102-0 to 102-N, memory 108, etc.
In certain examples, the hardware initialization manager (non-transitory) storage 118 stores hardware initialization manager firmware (e.g., or software). In one example, the hardware initialization manager (non-transitory) storage 118 stores Basic Input/Output System (BIOS) firmware. In another example, the hardware initialization manager (non-transitory) storage 118 stores Unified Extensible Firmware Interface (UEFI) firmware. In certain examples (e.g., triggered by the power-on or reboot of a processor), computer system 100 (e.g., core 102-0) executes the hardware initialization manager firmware (e.g., or software) stored in hardware initialization manager (non-transitory) storage 118 to initialize the system 100 for operation, for example, to begin executing an operating system (OS) and/or initialize and test the (e.g., hardware) components of system 100.
In certain examples, computer system 100 includes an input/output memory management unit (IOMMU) 120 (e.g., circuitry), e.g., coupled between one or more cores 102-0 to 102-N and IO fabric interface 104. In certain examples, IO fabric interface 104 is a Peripheral Component Interconnect Express (PCIe) interface or a Compute Express Link (CXL) interface. In certain examples, IOMMU 120 provides address translation, for example, from a virtual address to a physical address. In certain examples, IOMMU 120 includes one or more registers 121, for example, data registers and/or control registers (e.g., the registers discussed in reference to
A device 106 may include any of the depicted components, for example, one or more instances of an accelerator circuit 106-0 to 106-N. In certain examples, a job (e.g., corresponding descriptor for that job) is submitted to the device 106 and the device performs one or more (e.g., decompression or compression) operations. In certain examples, device 106 includes a local memory 134. In certain examples, device 106 is a TEE IO capable device, for example, with the host (e.g., processor including one or more of cores 102-0 to 102-N) being a TEE capable host. In certain examples, a TEE capable host implements a TEE security manager.
In certain examples, a trusted execution environment (TEE) security manager (TSM) (e.g., implemented by a trust domain manager 101) is to: (i) provide interfaces to the VMM to assign memory, processor, and other resources to trust domains (e.g., trusted virtual machines), (ii) implement the security mechanisms and access controls (e.g., IOMMU translation tables, etc.) to protect confidentiality and integrity of the trust domains' (e.g., trusted virtual machines') data and execution state in the host from entities not in the trusted computing base of the trust domains (e.g., trusted virtual machines), (iii) use a protocol to manage the security state of the trusted device interface (TDI) to be used by the trust domains (e.g., trusted virtual machines), (iv) establish/manage integrity and data encryption (IDE) encryption keys for the host and, if needed, schedule key refreshes (e.g., where the TSM programs the IDE encryption keys into the host root ports and communicates with the DSM to configure the IDE encryption keys in the device), or (v) any single one or combination thereof.
In certain examples, a device security manager (DSM) 136 is to (i) support authentication of device identities and measurement reporting, (ii) configure the IDE encryption keys in the device (e.g., where the TSM provides the keys for the initial configuration and subsequent key refreshes to the DSM), (iii) provide device interface management for locking TDI configuration, reporting TDI configurations, and attaching and detaching TDIs to and from trust domains (e.g., trusted virtual machines), (iv) implement access control and security mechanisms to isolate trust domain (e.g., trusted virtual machine) provided data from entities not in the TCB of a trust domain (e.g., a trusted virtual machine), or (v) any single one or combination thereof.
In certain examples, a standard defines a virtual machine monitor (VMM) (e.g., or VM thereof), TSM (e.g., trust domain manager 101), and device security manager (DSM) 136 interaction flow.
In certain examples, IOMMU 120 and trust domain manager(s) 101 cooperate to allow for direct memory access (e.g., directly) between (e.g., to and/or from) IO device(s) 106 and trust domain memory 116 (e.g., a region for only a single trust domain and/or another region shared by a plurality of trust domains).
In order to establish the trust relationship between a device and a TD, certain TDX-IO architectures require the TD and/or a trust domain manager (e.g., circuit and/or code) (e.g., Trusted Execution Environment (TEE) security manager (TSM)) to create a secure communication session between the device and the trust domain manager (e.g., for the trust domain manager to allow a particular trust domain to use the device or a subset of function(s) of the device). In order to establish the trust relationship between a device and a TD, certain TDX-IO architectures require the TD and/or a trust domain manager (e.g., circuit and/or code) (e.g., Trusted Execution Environment (TEE) security manager (TSM)) use (i) a Distributed Management Task Force (DMTF) Secure Protocol and Data Model (SPDM) standard to authenticate the device (e.g., and collect device measurement), and (ii) use a Peripheral Component Interconnect Special Interest Group (PCI-SIG) TEE Device Interface Security Protocol (TDISP) standard (e.g., to communicate with a device security manager (DSM) to manage the device's function(s)).
In certain examples, a SPDM messaging protocol defines a request-response messaging model between two endpoints to perform the message exchanges outlined in SPDM message exchanges, for example, where each SPDM request message shall be responded to with an SPDM response message as defined in the SPDM specification. In certain examples, an endpoint's (e.g., device's) “measurement” describes the process of calculating the cryptographic hash value of a piece of firmware/software or configuration data and tying the cryptographic hash value with the endpoint identity through the use of digital signatures. This allows an authentication initiator to establish the identity and measurement of the firmware/software or configuration running on the endpoint.
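By way of illustration only, the hash portion of a measurement may be sketched as follows. This is not the SPDM message format; the function names, the choice of SHA-384, and the comparison against a known reference value are assumptions made for this example, and the digital signature that ties the hash to the endpoint identity is omitted.

```python
import hashlib
import hmac


def measure(blob: bytes) -> str:
    """Return a hex digest of a firmware/configuration blob (SHA-384 assumed)."""
    return hashlib.sha384(blob).hexdigest()


def matches_reference(blob: bytes, reference_hex: str) -> bool:
    """Compare a fresh measurement with an expected reference value."""
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(measure(blob), reference_hex)
```

A verifier in this sketch would hold the reference value for an approved firmware image and reject an endpoint whose reported measurement differs.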
In certain examples, to help enforce the security policies for the TDs, a new mode of a processor called Secure-Arbitration Mode (SEAM) is introduced to host an (e.g., manufacturer provided) digitally signed, but not encrypted, security-services module. In certain examples, a trust domain manager (TDM) 101 is hosted in a reserved, memory space identified by a SEAM-range register (SEAMRR). In certain examples, the processor only allows access to SEAM-memory range to software executing inside the SEAM-memory range, and all other software accesses and direct-memory access (DMA) from devices to this memory range are aborted. In certain examples, a SEAM module does not have any memory-access privileges to other protected, memory regions in the platform, including the System-Management Mode (SMM) memory or (e.g., Intel® Software Guard Extensions (SGX)) protected memory.
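By way of illustration only, the access rule for the SEAM-memory range described above may be sketched as a range check. The function and parameter names are hypothetical, and representing the SEAMRR as a base address plus a size is an assumption made for this example.

```python
def in_range(addr, base, size):
    """True if addr lies within [base, base + size)."""
    return base <= addr < base + size


def seam_access_allowed(target, initiator_ip, is_device_dma, seam_base, seam_size):
    """Decide whether an access may proceed under the SEAM-range rule.

    target        : address being accessed
    initiator_ip  : address of the executing software (ignored for DMA)
    is_device_dma : True when the access is a DMA from a device
    """
    if not in_range(target, seam_base, seam_size):
        return True    # the SEAM-range rule does not apply to this address
    if is_device_dma:
        return False   # DMA to the SEAM-memory range is aborted
    # Only software executing inside the range may access the range.
    return in_range(initiator_ip, seam_base, seam_size)
```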
In certain examples, the host 202 is coupled to device 216 via a coupling 104, e.g., via a secured link 104A (e.g., a link according to a PCIe/Compute Express Link (CXL) standard).
In certain examples, the host 202 is coupled to device 216 according to a transport level (e.g., SPDM) specification and/or an application level (e.g., TDISP) specification. In certain examples, device 106 includes a device security manager (DSM) 136 with a device secret(s), e.g., device certificate 212, session key, device “measurement” values, etc. In certain examples, device 106 implements one or more physical function(s).
In certain examples, device 106 includes a first device interface (I/F) 214 on the device side, and one or more second device interface(s) 216. In certain examples, the device 106 supports intra context isolation between these interfaces.
In certain examples, device 106 (e.g., according to a single-root input/output virtualization (SR-IOV) standard) is shared by a plurality of virtual machines (e.g., trust domains). In certain examples, a physical function has the ability to move data in and out of the device, while virtual functions (for example, a first virtual function and a second virtual function) are lightweight (e.g., PCI express (PCIe)) functions that support data flowing but also have a restricted set of configuration resources.
In certain examples, IO device 106 is to perform a direct memory access request to a private memory of a trust domain (e.g., trust domain 206-1 or trust domain 206-2) under the control of the IOMMU 120.
In certain examples, a trust domain has both a private memory (e.g., in trust domain memory 116 in
Example extensions and changes to the IOMMU 120 with respect to different architectural components are discussed below.
In certain examples, IOMMU 120 (e.g., circuitry) reports Trusted Nested DMA Translation support (TNEST) through a trusted extended capability register (e.g., register 314D).
In certain examples, IOMMU 120 (e.g., circuitry) supports two parallel DMA-translation tables, e.g., table 322 representing the device interfaces assigned to certain VMs (e.g., referred to as untrusted DMA-translation tables) and table 324 representing the device interfaces assigned to TDs (e.g., referred to as trusted DMA-translation tables). In certain examples, DMA-translation tables consist of multi-level tables including scalable-mode root-table, scalable-mode context-table, scalable-mode PASID directory, and scalable-mode PASID table. In a first example, DMA-translation tables are indexed by the PCIe Requester-ID and in a second example, DMA-translation tables are indexed by the PCIe Requester-ID and PASID.
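By way of illustration only, indexing a multi-level DMA-translation table by Requester-ID and, optionally, PASID may be sketched as follows. The nested dictionaries and the key names "default" and "pasid" are hypothetical stand-ins for the root-table, context-table, and PASID structures; only the split of the 16-bit Requester-ID into bus, device, and function numbers follows the PCIe convention.

```python
def split_requester_id(rid):
    """Split a 16-bit PCIe Requester-ID into (bus, device, function)."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return bus, dev, fn


def lookup(root_table, rid, pasid=None):
    """Walk root table -> context entry -> optional PASID entry.

    Returns the matching entry, or None if any level is missing.
    """
    bus, dev, fn = split_requester_id(rid)
    context = root_table.get(bus, {}).get((dev << 3) | fn)
    if context is None:
        return None
    if pasid is None:
        return context.get("default")       # indexed by Requester-ID only
    return context.get("pasid", {}).get(pasid)  # indexed by Requester-ID and PASID
```

In this sketch the trusted and untrusted tables would simply be two separate `root_table` objects, each reached through its own root pointer.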
In certain examples, registers (e.g., T_RTADDR_REG) associated with the trusted DMA-translation tables are protected and can only be written with the SEAM Security Attribute of Initiator (SAI). In certain examples, such a protection scheme ensures that only the trust domain manager (e.g., TDX-module) can program these trusted registers and no other untrusted software entities on the platform.
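By way of illustration only, write protection of the trusted registers may be sketched as a check on the initiator of each write. The class name, the string encoding of the SAI, the untrusted register name `RTADDR_REG`, and the choice to silently drop a rejected write are assumptions made for this example.

```python
class ProtectedRegisterFile:
    """Hypothetical register file in which trusted registers need the SEAM SAI."""

    TRUSTED = frozenset({"T_RTADDR_REG"})

    def __init__(self):
        self.regs = {"RTADDR_REG": 0, "T_RTADDR_REG": 0}

    def write(self, name, value, sai):
        """Apply the write and return True, or drop it and return False."""
        if name not in self.regs:
            raise KeyError(name)
        if name in self.TRUSTED and sai != "SEAM":
            return False             # untrusted initiator: write is dropped
        self.regs[name] = value
        return True
```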
In certain examples, untrusted DMA-translation tables are stored in a regular (e.g., not trust domain) memory and are managed/programmed by the VMM, and trusted DMA-translation tables are stored in the protected memory and are managed/programmed by the trust domain manager (e.g., TDX-module). In certain examples, IOMMU (e.g., circuitry) uses Translation Agent (TA)-polarity of 0b when accessing untrusted DMA-translation tables and TA-polarity of 1b when accessing trusted DMA-translation tables.
In certain examples, DMA-translation tables are programmed with the first-stage page-table pointer, second-stage page-table pointer, or both. In certain examples, first-stage/second-stage page-tables connected via untrusted DMA-translation tables are stored in a regular (e.g., not trust domain) memory and first-stage/second-stage page-tables connected via trusted DMA-translation tables are stored in a protected memory.
In certain examples, IOMMU (e.g., circuitry) uses the T attribute of an untranslated request to select between untrusted or trusted DMA-translation tables. For example, if the T attribute is 0b, the IOMMU circuitry translates the untranslated address using the untrusted DMA-translation tables (and/or IO cache 302 that has a copy of the untrusted translation), and if the T attribute is 1b, the IOMMU circuitry translates the untranslated address using the trusted DMA-translation tables (and/or IO cache 302 that has a copy of the trusted translation). In certain examples, on a successful translation, the IOMMU circuitry uses the T attribute of the request to tag the IOMMU caches and generate the final TA-polarity of the DMA read/write request. In certain examples, the final TA-polarity of the DMA read/write request is generated as (T attribute of untranslated request & !GPA.SHARED).
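By way of illustration only, the table selection and the expression (T attribute of untranslated request & !GPA.SHARED) may be written out as follows. The function names and the use of the strings "trusted" and "untrusted" are hypothetical; the logic follows the description above.

```python
def select_tables(t_attr):
    """Pick the DMA-translation tables from the T attribute (0 or 1)."""
    return "trusted" if t_attr else "untrusted"


def final_ta_polarity(t_attr, gpa_shared):
    """Compute T & !GPA.SHARED as 0 or 1.

    The result is 1 only for a trusted request to a non-shared GPA.
    """
    return int(bool(t_attr) and not gpa_shared)
```

Under this expression, a trusted request that targets a shared guest physical address still yields a final TA-polarity of 0.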
In certain examples, IOMMU (e.g., circuitry) uses an eXtended TEE (XT) attribute (e.g., as shown in
In certain examples, the untranslated address is a guest physical address (GPA) which gets translated to a host physical address (HPA) using the trusted DMA-translation tables (e.g., and second-stage page tables).
In certain examples, the untranslated address is a guest virtual address (GVA) which gets translated to a GPA using the (e.g., first-stage) page tables and then gets translated again to a HPA using the (e.g., second-stage) page tables (GVA→GPA→HPA).
In certain examples, the untranslated address is a guest IO virtual address (GIOVA) which gets translated to GPA using (e.g., first-stage) page tables and then gets translated again to HPA using the (e.g., second-stage) page tables (GIOVA→GPA→HPA).
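As a non-limiting illustration, the two-stage translation (GVA→GPA→HPA or GIOVA→GPA→HPA) can be sketched as two successive page-granular lookups. The dictionary-based page tables, the 4 KiB page size, and the function names are assumptions made for this sketch only.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages (hypothetical choice for this sketch)
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(addr: int, table: dict) -> int:
    """Translate one stage: map the page number, keep the page offset."""
    page, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if page not in table:
        raise LookupError(f"no mapping for page {page:#x}")
    return (table[page] << PAGE_SHIFT) | offset


def nested_translate(untranslated: int, first_stage: dict, second_stage: dict) -> int:
    """First stage yields a GPA; second stage yields the HPA."""
    gpa = translate(untranslated, first_stage)
    return translate(gpa, second_stage)
```

When the untranslated address is already a GPA, only the second call to `translate` is needed.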
In certain examples, root-complex circuitry uses the same T or XT attribute as the untranslated request to generate the read completions.
Depicted IO cache 302 includes an input and/or output for a memory access (e.g., read and/or write) request (e.g., from an IO device 106), for example, from root complex 306 of computer system 100. In certain examples, IOMMU utilizes a PCIe root port 208 in
In certain examples, IO cache 302 is to, for a hit in the IO cache 302 (e.g., its cache of one or more mappings from untrusted DMA translation table 322 and/or from trusted DMA translation tables 324) for an input of an (e.g., virtual) address from the device (e.g., endpoint), output the corresponding host (e.g., physical) address, and/or for a miss in the IO cache 302 (e.g., its cache of mappings) for an input of a (e.g., virtual) address, perform a (e.g., page) walk in memory to determine the corresponding host (e.g., physical) address for that input of address from the device.
However, it may be desirable to not allow an IO device 106 to access protected private memory (e.g., trust domain memory 116 in
In certain examples, IOMMU 120 maintains a cache 302 of one or more (e.g., a proper subset of) translations from trusted translation tables 324 (e.g., with these cached “trusted” translations also protected by T or XT attribute) and/or one or more (e.g., a proper subset of) translations from untrusted translation tables 322.
In certain examples, a request from a TEE-IO device (e.g., marked with T attribute or “ide_t” (e.g., =1) or XT attribute (e.g., !=00b) as discussed herein) (e.g., as checked by check 304) is to be sent to an IO cache 302 of “trusted” translations and/or (e.g., and for a miss in that cache) to a set of trusted translation tables 324 (e.g., also stored within protected memory 116 or within IOMMU 120) (e.g., managed by the trust domain manager 101 (e.g., TDX-module)) that are separate from a set of untrusted translation tables 322 (e.g., in non-trust domain memory 114 or within IOMMU 120) (e.g., managed by the VMM 110B). In certain examples, IOMMU 120 maintains a (e.g., trusted) translation table for each device.
In certain examples, use of separate untrusted translation tables 322 and trusted translation tables 324 means that a separate set of one or more registers is to be utilized for each, for example, with “non-trusted” root table address register 312 storing the pointer for the base address of the non-trusted root table in untrusted translation tables 322 and trusted root table address register (T_RTADDR_REG) 316 storing the pointer for the base address of the trusted root table in trusted translation tables 324 (e.g., where a root table stores a plurality of root entries and each root entry contains a context table pointer to reference the context table for the IO device).
In certain examples, a request from a non-TEE-IO device (e.g., marked with T attribute or “ide_t” (e.g., =0) or XT attribute (e.g., =00b) as discussed herein) (e.g., as checked by check 304) is to be sent to an IO cache 302 of “untrusted” translations and/or (e.g., and for a miss in that cache) to a set of untrusted translation tables 322 (e.g., stored in non-trust domain memory 114).
Certain I/O memory controllers (e.g., IOMMU 120) (e.g., in Scalable Mode as discussed below in reference) allow IO devices to access memory using the virtual address (VA) in the DMA requests (e.g., with or without a process address space identifier (PASID) prefix). In certain examples, I/O memory controller (e.g., IOMMU) translates a VA to a corresponding physical address (PA) using a PASID configured in the translation tables or using a PASID received in the DMA request.
In certain examples, I/O memory controller (e.g., IOMMU 120) pushes a translation into built-in IO cache (e.g., the data storage therein that stores the virtual address to physical address mappings) after a successful page table walk.
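As a non-limiting illustration, the hit/miss behavior of an IO cache that is filled after a successful page walk can be sketched as follows. The class name, the `walk` callback, and the choice to key entries by the T attribute together with the page number are assumptions made for this sketch.

```python
class IoCache:
    """Toy translation cache: a hit returns the cached host page,
    a miss triggers a page walk and caches the result."""

    def __init__(self, walk):
        self._walk = walk      # hypothetical page-walk callback: (t_attr, page) -> host page
        self._entries = {}     # (t_attr, page) -> host page, tagged by T attribute
        self.walks = 0         # number of page walks performed (for illustration)

    def lookup(self, t_attr: int, page: int) -> int:
        key = (t_attr, page)
        if key not in self._entries:
            # Miss: walk the tables, then push the translation into the cache.
            self.walks += 1
            self._entries[key] = self._walk(t_attr, page)
        return self._entries[key]
```

Tagging each entry with the T attribute keeps a trusted translation from being returned for an untrusted request to the same page number.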
In certain examples, translation tables 322 (e.g., a copy thereof stored in IOMMU 120 and/or IO cache 302) includes a DMA remapping structure (e.g., that starts with a root table) according to examples of the disclosure. Depicted (scalable) root table includes a bus entry (e.g., 0 to 255) that points to an entry for a device (e.g., function) in (upper or lower scalable) context table that points to a PASID directory whose entry then points to a PASID table whose entry contains a value that includes a first-stage page table (FSPT) pointer and/or a second-stage page table (SSPT) pointer.
In certain examples, trusted translation tables 324 (e.g., a copy thereof stored in IOMMU 120 and/or IO cache 302) includes a DMA remapping structure (e.g., that starts with a root table) according to examples of the disclosure. Depicted (scalable) root table includes a bus entry (e.g., 0 to 255) that points to an entry for a device (e.g., function) in (e.g., lower or upper scalable) context table that points to PASID directory whose entry then points to a PASID table whose entry contains a value that includes a pointer to a secure extended page table (secEPT) 326 (for example, that maps memory protected using a TD key (e.g., TD KeyID)) or a combination of secEPT and a shared extended page table (sharedEPT) 328 (e.g., that maps TD's private and shared memory).
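As a non-limiting illustration, the walk from the root table to a PASID table entry can be sketched with nested lookups. The dictionary representation, the function name, and the split of the PASID into a directory index (upper bits) and a table index (lower 6 bits) are assumptions made for this sketch.

```python
def walk_remapping_structure(root_table: dict, bus: int, devfn: int, pasid: int) -> dict:
    """Follow root table -> context table -> PASID directory -> PASID table.

    Returns the PASID table entry, which in this sketch is a dict holding
    page-table pointers (e.g., first-stage and/or second-stage).
    """
    context_table = root_table[bus]            # root entry selected by bus number
    pasid_directory = context_table[devfn]     # context entry selected by device/function
    pasid_table = pasid_directory[pasid >> 6]  # assumed: upper PASID bits index the directory
    return pasid_table[pasid & 0x3F]           # assumed: lower 6 bits index the table
```

A missing entry at any level raises `KeyError` in this sketch, which stands in for a translation fault.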
In certain examples, each inbound request appearing at the address-translation hardware (e.g., IOMMU 120) is required to identify the device originating the request. The (e.g., 16 bit) attribute identifying the originator of an I/O transaction may be referred to as the source ID. In certain examples, for PCI Express (PCIe) devices, the source ID is the requester identifier in the PCI Express transaction layer header, e.g., where the requester identifier of a device, which is composed of its PCI Bus number/Device number/Function number, is assigned by configuration software and uniquely identifies the hardware function that initiated the request.
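As a non-limiting illustration, a 16-bit source ID can be composed from the Bus, Device, and Function numbers using the conventional PCI layout of 8 bits, 5 bits, and 3 bits respectively. The function names are hypothetical.

```python
def source_id(bus: int, device: int, function: int) -> int:
    """Pack Bus[15:8], Device[7:3], Function[2:0] into a 16-bit source ID."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def split_source_id(sid: int) -> tuple:
    """Inverse of source_id: recover (bus, device, function)."""
    return (sid >> 8) & 0xFF, (sid >> 3) & 0x1F, sid & 0x7
```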
In certain examples, TDX-IO framework (e.g., as shown in the figures) enables heterogeneous confidential computing with secure, efficient, and low-overhead data movement to/from IO-agents. In certain examples, IOMMU enables direct device assignment of PCIe TDIs (Trusted Execution Environment Device Interfaces) to the TDs. In certain examples, the IOMMU supports trusted DMAs to TD's private memory using nested page-tables, supports new architectural states for IOMMU's trusted DMA-translation table entries, supports PCIe Address Translation Services (ATS), supports PCIe Page Request Services (PRS), supports TEE-Polarity of Completer (TPC), and/or supports eXtended TEE (XT) mode.
In certain examples, a system (e.g., IOMMU) uses a state machine, e.g., to manage one or more states for an entry in the trusted translation tables of the IOMMU.
However, in certain examples, it may be desirable to have different and/or additional states than those states (or their equivalents) shown in
In certain examples, once TEE-IO device has been accepted in TD's TCB and performs DMAs, various IOMMU caches 302 will get populated (e.g., an IO TLB, first-stage/second-stage paging structure caches, etc., see, e.g., cache 302 storing a copy of certain (e.g., most recently used) data of the tables 324 in
In certain examples, a DMAR.ADD request adds an entry to the trusted translation tables of the IOMMU, a DMAR.ACCEPT request transitions a configured entry to active use where the entry can be used to generate a translation, a DMAR.BLOCK request is to block certain uses of a translation, and a DMAR.REMOVE request removes (e.g., invalidates) an entry from the trusted translation tables of the IOMMU. In certain examples, the requests are from a virtual machine monitor (VMM) (e.g., DMAR.ADD request, DMAR.BLOCK request, DMAR.REMOVE request) or from a trust domain (e.g., DMAR.ACCEPT request).
In certain examples, use of a CONFIGURED state enables (i) trust domain manager (e.g., TDX-module) to create and/or configure the corresponding DMA-translation table entry without making it active and/or operational, and/or (ii) a TD to authenticate an IO (e.g., TEE-IO) device and verify configuration of DMA-translation table (e.g., working with trust domain manager) and request the entry to transition to a PRESENT state.
In certain examples, use of one or more BLOCKED states enables trust domain manager (e.g., TDX-module) to (i) block an entry (e.g., block it from being used to provide a translation), (ii) queue invalidations, and/or (iii) process queued invalidations before removing and/or re-purposing an entry.
In certain examples, the entry is cached in the IO cache 302 during the address translation (e.g., page walk) only when the entry is in the PRESENT state.
In certain examples, a blocked (inv_pending) state 508A is where the entry is blocked, but the invalidations associated with blocking the entry are pending. In certain examples, these invalidations are to invalidate the IO cache and/or to invalidate the cached entries of the trusted translation tables. In certain examples, a blocked (inv_queued) state 508B is where the entry is blocked and invalidations are queued, but the invalidations may not have been processed yet. In certain examples, blocked (inv_completed) state 508C is where the entry is blocked, and invalidations associated with blocking the entry are also completed.
In certain examples, a DMAR.ADD request adds an entry to the trusted translation tables of the IOMMU, a DMAR.ACCEPT request transitions a configured entry to active use where the entry can be used to generate a translation, a DMAR.BLOCK request is to block certain uses of (e.g., access to) a translation (e.g., depending on the status of the invalidation of the entry) (e.g., so that invalidation can begin), DMAR.INVALIDATE request is to queue the invalidation request (e.g., in trusted invalidation queue 1010) and block certain uses of (e.g., access to) a translation (e.g., depending on the status of the invalidation of the entry), DMAR.PROCESSINV request is to process a queued invalidation request (e.g., from trusted invalidation queue 1010) and block any use of (e.g., access to) a translation, and a DMAR.REMOVE request removes (e.g., invalidates) an entry from the trusted translation tables of the IOMMU. In certain examples, the requests are from a virtual machine monitor (e.g., DMAR.ADD request, DMAR.BLOCK request, DMAR.INVALIDATE request, DMAR.PROCESSINV request, DMAR.REMOVE request) or from a trust domain (e.g., DMAR.ACCEPT request).
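As a non-limiting illustration, the entry life cycle driven by these requests can be sketched as a table-driven state machine. The state names, the single linear path through the states, and the function names are assumptions made for this sketch; other examples may permit additional transitions.

```python
# Assumed transitions (hypothetical linear path through the states).
TRANSITIONS = {
    ("NOT_PRESENT", "DMAR.ADD"): "CONFIGURED",
    ("CONFIGURED", "DMAR.ACCEPT"): "PRESENT",
    ("PRESENT", "DMAR.BLOCK"): "BLOCKED_INV_PENDING",
    ("BLOCKED_INV_PENDING", "DMAR.INVALIDATE"): "BLOCKED_INV_QUEUED",
    ("BLOCKED_INV_QUEUED", "DMAR.PROCESSINV"): "BLOCKED_INV_COMPLETED",
    ("BLOCKED_INV_COMPLETED", "DMAR.REMOVE"): "NOT_PRESENT",
}


def step(state: str, request: str) -> str:
    """Apply one request; reject requests not allowed in the current state."""
    try:
        return TRANSITIONS[(state, request)]
    except KeyError:
        raise ValueError(f"{request} not permitted in state {state}") from None


def can_translate(state: str) -> bool:
    """Only an entry in the PRESENT state may be used to generate a translation."""
    return state == "PRESENT"
```

Under these assumptions, an entry cannot be removed or re-purposed until its queued invalidations have been processed, because DMAR.REMOVE is accepted only from the blocked (inv_completed) state.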
In certain examples, these states are implemented for a scalable-mode PASID table entry. The following tables 1A and 1B map examples of these states and specify example IOMMU behavior on an incoming DMA transaction.
The following discussion of
In certain examples, the trust domain (TD) is required to communicate with the VMM 110B (e.g., the TD is not allowed direct communication with the IOMMU), so the VMM 110B is to send requests to the TDM, and the TDM is then to communicate with the IOMMU on behalf of the VMM and/or TD.
In certain examples, a state machine does not include the configured state. In certain examples, IOMMU's (e.g., IO cache's) non-leaf entries (e.g., scalable-mode root-table entry, scalable-mode context-table entry, and scalable-mode PASID directory entry, e.g., as shown in trusted DMA translation tables 324 in
In certain examples, a blocked (inv_pending) state 808A is where the entry is blocked, but the invalidations associated with blocking the entry are pending. In certain examples, these invalidations are to invalidate the IO cache and/or to invalidate the cached entries of the trusted translation tables. In certain examples, a blocked (inv_queued) state 808B is where the entry is blocked and invalidations are queued, but the invalidations may not have been processed yet. In certain examples, blocked (inv_completed) state 808C is where the entry is blocked, and invalidations associated with blocking the entry are also completed.
In certain examples, a DMAR.ADD request adds an entry to the trusted translation tables, a DMAR.BLOCK request is to block certain uses of (e.g., access to) a translation (e.g., depending on the status of the invalidation of the entry) (e.g., so that invalidation can begin), DMAR.INVALIDATE request is to queue the invalidation request (e.g., in trusted invalidation queue 1010) and block certain uses of (e.g., access to) a translation (e.g., depending on the status of the invalidation of the entry), DMAR.PROCESSINV request is to process a queued invalidation request (e.g., from trusted invalidation queue 1010) and block any use of (e.g., access to) a translation, and a DMAR.REMOVE request removes (e.g., invalidates) an entry from the trusted translation tables. In certain examples, the requests are from a virtual machine monitor (VMM).
In certain examples, IOMMU 120 (e.g., circuitry) reports Trusted ATS Translation support (TDT) through a trusted extended capability register (e.g., register 314D).
In certain examples, IOMMU 120 (e.g., circuitry) supports two parallel DMA-translation tables, e.g., table 322 representing the device interfaces assigned to certain VMs (e.g., referred to as untrusted DMA-translation tables) and table 324 representing the device interfaces assigned to TDs (e.g., referred to as trusted DMA-translation tables).
In certain examples, DMA-translation tables are programmed with the first-stage page-table pointer, second-stage page-table pointer, or both. In certain examples, first-stage/second-stage page-tables connected via untrusted DMA-translation tables are stored in a regular (e.g., not trust domain) memory and first-stage/second-stage page-tables connected via trusted DMA-translation tables are stored in a protected memory.
In certain examples, IOMMU (e.g., circuitry) uses the T attribute of an ATS translation request to select between untrusted or trusted DMA-translation tables, e.g., table 322 representing the device interfaces assigned to certain VMs (e.g., referred to as untrusted DMA-translation tables) and table 324 representing the device interfaces assigned to TDs (e.g., referred to as trusted DMA-translation tables). In certain examples, if the T attribute is 0b, IOMMU (e.g., circuitry) translates the untranslated address using the untrusted DMA-translation tables 322 (and/or IO cache 302 that has a copy of the untrusted translation). In certain examples, if T attribute is 1b, IOMMU (e.g., circuitry) translates the untranslated address using the trusted DMA-translation tables 324 (and/or IO cache 302 that has a copy of the trusted translation).
In certain examples, IOMMU (e.g., circuitry) uses the XT attribute of an ATS translation request to select between untrusted or trusted DMA-translation tables, e.g., table 322 representing the device interfaces assigned to certain VMs (e.g., referred to as untrusted DMA-translation tables) and table 324 representing the device interfaces assigned to TDs (e.g., referred to as trusted DMA-translation tables). In certain examples, if the XT attribute is 00b, IOMMU (e.g., circuitry) translates the untranslated address using the untrusted DMA-translation tables 322 (and/or IO cache 302 that has a copy of the untrusted translation). In certain examples, if the XT attribute is not 00b, IOMMU (e.g., circuitry) translates the untranslated address using the trusted DMA-translation tables 324 (and/or IO cache 302 that has a copy of the trusted translation).
In certain examples, IOMMU (e.g., circuitry) uses the same T or XT attribute as an ATS translation request to generate the ATS translation completion.
In certain examples, IOMMU (e.g., circuitry) uses the T attribute of an ATS translated request to select between the untrusted or trusted DMA-translation tables, e.g., table 322 representing the device interfaces assigned to certain VMs (e.g., referred to as untrusted DMA-translation tables) and table 324 representing the device interfaces assigned to TDs (e.g., referred to as trusted DMA-translation tables). In certain examples, on a successful translation enable check, IOMMU (e.g., circuitry) is to generate the final TA-polarity of the DMA read/write request. In certain examples, the final TA-polarity of the DMA read/write request is generated as (T attribute of translated request & IS_TEE_PAGE(HPA)).
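As a non-limiting illustration, the final TA-polarity for an ATS translated request can be sketched as follows. Modeling IS_TEE_PAGE as membership of the host page number in a set of TEE pages, along with the 4 KiB page size and the names used, is an assumption made for this sketch.

```python
def is_tee_page(hpa: int, tee_pages: set, page_shift: int = 12) -> bool:
    """Assumed model of IS_TEE_PAGE(HPA): the host page is in the TEE page set."""
    return (hpa >> page_shift) in tee_pages


def translated_ta_polarity(t_attr: int, hpa: int, tee_pages: set) -> int:
    """Final TA-polarity = T attribute of translated request & IS_TEE_PAGE(HPA)."""
    return t_attr & int(is_tee_page(hpa, tee_pages))
```

The polarity is 1b only when the request is marked trusted and the host page is a TEE page, so an untrusted request never obtains TEE polarity even if it names a TEE page.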
In certain examples, IOMMU (e.g., circuitry) uses untrusted host physical address (HPA) permission table (HPT) to validate ATS translated request with the T attribute of 0b and trusted HPA permission table (HPT) to validate ATS translated request with the T attribute of 1b.
In certain examples, IOMMU (e.g., circuitry) uses the XT attribute of an ATS translated request to select between the untrusted or trusted DMA-translation tables, e.g., table 322 representing the device interfaces assigned to certain VMs (e.g., referred to as untrusted DMA-translation tables) and table 324 representing the device interfaces assigned to TDs (e.g., referred to as trusted DMA-translation tables). In certain examples, on a successful translation enable check, IOMMU (e.g., circuitry) is to generate the final TA-polarity of the DMA read/write request. In certain examples, the final TA-polarity of the DMA read/write request is generated as (Bit-0 of XT attribute of translated request & IS_TEE_PAGE(HPA)).
In certain examples, IOMMU (e.g., circuitry) uses untrusted host physical address (HPA) permission table (HPT) to validate ATS translated request with the XT attribute as 00b and trusted HPA permission table (HPT) to validate ATS translated request with the XT attribute as not 00b.
In certain examples, register's access policy groups are changed for security, e.g., when in the TDX_MODE of operation. In certain examples, an IOMMU includes a trusted root table address register (T_RTADDR_REG) 316, a register (TDX_MODE_REG) 314A to set the IOMMU 120 into (or out of) TDM (e.g., TDX) mode, an enhanced command register (ECMD_REG) 314B as an interface to submit an enhanced command (e.g., to place it into or out of TDX mode) to the IOMMU, and/or a global command register (GCMD_REG) 314C to submit global commands for IOMMU memory.
In certain examples, registers include a control register (TDX_MODE) 314A (e.g., within IOMMU 120) to set the IOMMU 120 within TDM (e.g., TDX) mode, e.g., to use register 316, registers in
In certain examples, a “standard” command, register, etc. refers to a command, register, etc. that is not used for a trust domain, e.g., not used to implement input/output extensions for trust domains.
In certain examples, IOMMU 120 (e.g., circuitry) supports two parallel invalidation queues, e.g., untrusted invalidation queue 1006 (e.g., stored in memory 114) for queuing invalidations associated with the device interfaces assigned to certain (e.g., legacy) VMs and trusted invalidation queue 1010 (e.g., stored in memory 116) for queuing invalidations associated with device interfaces assigned to trust domains (e.g., trusted VMs).
In certain examples, the untrusted invalidation queue 1006 is stored in a regular (e.g., not trust domain) memory and is managed by the VMM 110B, and the trusted invalidation queue 1010 is stored in a protected (e.g., trust domain) memory and is managed by the trust domain manager 101 (e.g., TDX-module).
In certain examples, the IOMMU 120 (e.g., circuitry) supports two separate sets of registers that are associated with each invalidation queue. In certain examples, registers associated with the trusted invalidation queue are protected and can only be written by the trust domain manager (e.g., with the SEAM SAI).
In certain examples, the IOMMU 120 (e.g., circuitry) uses the T attribute of 0b (or XT attribute of 00b) on ATS invalidate request, when processing the DevTLB invalidation descriptor queued in the untrusted invalidation queue 1006 and uses the T attribute of 1b (or XT attribute of 01b) on ATS invalidate request, when processing the DevTLB invalidation descriptor queued in the trusted invalidation queue 1010.
In certain examples, the IOMMU 120 (e.g., circuitry) compares the T (or XT) attribute of an ATS invalidate completion against the original T (or XT) attribute of the ATS invalidate request, and only treats it as a valid completion when the attributes match. In certain examples, this is achieved by maintaining/storing a T (or XT) attribute associated with each Invalidation Tag (ITag) (e.g., stored in the ITAG tracker 1014) specified in an ATS invalidate request and comparing this T (or XT) attribute along with comparing the ITag on an ATS invalidate completion.
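As a non-limiting illustration, tracking the attribute per ITag and validating completions against it can be sketched as follows. The class and method names are hypothetical, and the behavior of retaining a pending request after a mismatched completion is an assumption made for this sketch.

```python
class ItagTracker:
    """Remember the T (or XT) attribute sent with each outstanding ITag."""

    def __init__(self):
        self._pending = {}  # itag -> attribute used on the invalidate request

    def on_request(self, itag: int, attr: int) -> None:
        self._pending[itag] = attr

    def on_completion(self, itag: int, attr: int) -> bool:
        """Accept a completion only if both the ITag and the attribute match."""
        if itag not in self._pending or self._pending[itag] != attr:
            return False  # unknown ITag or mismatched attribute: not a valid completion
        del self._pending[itag]
        return True
```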
In certain examples, the IOMMU 120 (e.g., circuitry) uses trusted registers associated with the trusted invalidation queue 1010 to log the DevTLB invalidation timeouts or other errors.
In certain examples, IOMMU 120 includes a set of registers for an invalidation queue. In certain examples, it is desirable to keep a VMM 110B (or OS or other component that is not part of a trust domain) from invalidating private memory as well as reading any data structure, register, etc. that has corresponding data for invalidating that private memory (e.g., in trust domain memory 116 in
In certain examples, different trust domains are mapped through one or more corresponding trusted translation tables 324 and/or corresponding IOMMU registers 1012A-1012I.
In certain examples, a request (e.g., command) for an invalidation of (e.g., a page of) protected private memory 116 (e.g., as discussed herein) is to be sent (e.g., by the trust domain manager 101 (e.g., TDX-module)) to trusted invalidation queue 1010. In certain examples, trusted invalidation queue tail register (T_IQT_REG) 1012B (e.g., for TDX-IO) is to store an indication of the tail (e.g., last valid) entry in trusted invalidation queue 1010, trusted invalidation queue head register (T_IQH_REG) 1012A (e.g., for TDX-IO) is to store an indication of the head (e.g., first valid) entry in trusted invalidation queue 1010, and trusted invalidation queue address register (T_IQA_REG) 1012C (e.g., for TDX-IO) is to store an indication of the base address (e.g., and size) of the trusted invalidation queue 1010, e.g., with these registers accessible (e.g., only) by the trust domain manager 101 and/or these registers within the IOMMU 120.
In certain examples, a request (e.g., command) for an invalidation of (e.g., a page of) non-private memory 114 (e.g., as discussed herein) is to be sent (e.g., by the virtual machine monitor 110B) to untrusted invalidation queue 1006. In certain examples, “non-trusted” invalidation queue tail register (IQT_REG) 1008B (e.g., not for TDX-IO) is to store an indication of the tail (e.g., last valid) entry in untrusted invalidation queue 1006, “non-trusted” invalidation queue head register (IQH_REG) 1008A (e.g., not for TDX-IO) is to store an indication of the head (e.g., first valid) entry in untrusted invalidation queue 1006, and “non-trusted” invalidation queue address register (IQA_REG) 1008C (e.g., not for TDX-IO) is to store an indication of the base address (e.g., and size) of the untrusted invalidation queue 1006, e.g., with these registers accessible (e.g., only) by the VMM 110B and/or these registers within the IOMMU 120.
In certain examples, the invalidation requests are serviced, e.g., and the corresponding register(s) are updated, for example, updating the head and tail pointers accordingly. In certain examples, an invalidation request is (i) to take memory (e.g., a page) from a first virtual machine (e.g., or trust domain) and give it to another virtual machine (e.g., or trust domain) (e.g., after clearing the data of the first virtual machine from that memory), (ii) to delete a virtual machine (e.g., or trust domain), and/or (iii) in response to a global reset request.
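As a non-limiting illustration, an invalidation queue with head and tail indications can be sketched as a ring buffer. The class name, the convention that the head indexes the next descriptor to process and the tail indexes the next free slot, and the one-slot-empty full condition are assumptions made for this sketch rather than a description of the register semantics.

```python
class InvalidationQueue:
    """Toy ring buffer: software appends at the tail, hardware consumes at the head."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.head = 0  # assumed: next descriptor to be processed
        self.tail = 0  # assumed: next free slot for a new descriptor

    def submit(self, descriptor) -> None:
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            raise BufferError("invalidation queue full")
        self.slots[self.tail] = descriptor
        self.tail = nxt  # corresponds to updating the tail indication

    def process_one(self):
        if self.head == self.tail:
            return None  # queue empty
        descriptor = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)  # corresponds to updating the head
        return descriptor
```

Keeping one instance per queue (trusted and untrusted) mirrors the separate register sets, so that descriptors queued by the VMM never share head and tail state with descriptors queued by the trust domain manager.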
In certain examples (e.g., as shown in
In certain examples, trust domain manager 101 (e.g., TDX-module) manages trusted IOMMU registers 1012A-1012I, register 316, and trusted translation tables 324.
In certain examples, VMM 110B manages other IOMMU registers 1008A-1008I, register 312, and other translation tables 322.
In certain examples, computer system 100 (e.g., IOMMU 120 thereof) sends an ATS invalidate request 1002 to IO device 106, and, on completion of the invalidation, the IO device 106 sends an ATS invalidate completion 1004 indication to computer system 100 (e.g., IOMMU 120).
In certain examples, IOMMU 120 (e.g., circuitry) reports Trusted PRS support (TPRS) through a trusted extended capability register (e.g., register 314D).
In certain examples, IOMMU 120 (e.g., circuitry) supports two parallel page-request queues, e.g., untrusted page-request queue 1106 (e.g., in memory 114) utilized for storing page-requests associated with the device interfaces assigned to certain (e.g., legacy) VMs and trusted page-request queue 1110 (e.g., in memory 116) utilized for storing page-requests associated with device interfaces assigned to trust domains (e.g., trusted VMs).
In certain examples, the untrusted page-request queue 1106 is stored in a regular (e.g., not trust domain) memory and is managed by the VMM 110B, and the trusted page-request queue 1110 is stored in a protected (e.g., trust domain) memory and is managed by the TDM 101 (e.g., TDX-module).
In certain examples, IOMMU (e.g., circuitry) supports two separate sets of registers, one associated with each page-request queue, e.g., registers 1108A-1108H for untrusted page-request queue 1106 and registers 1112A-H for trusted page-request queue 1110.
In certain examples, the registers associated with the trusted page-request queue 1110 are protected and can only be written by the trust domain manager (e.g., with the SEAM SAI).
In certain examples, IOMMU (e.g., circuitry) populates the untrusted page-request queue 1106, when the page-request message is received with T attribute of 0b (or XT attribute of 00b) and populates trusted page-request queue 1110 when the page-request message is received with T attribute of 1b (or XT attribute of 01b).
In certain examples, software services the page requests, e.g., by handling the page-fault and generating a page-response with a success or a failure code.
In certain examples, IOMMU (e.g., circuitry) is to generate a page-request group response message with T attribute of 0b (or XT attribute of 00b) when software queues a page-response to the untrusted invalidation queue (e.g., untrusted page-response queue), and to generate a page-request group response message with T attribute of 1b (or XT attribute of 01b) when software queues a page-response to the trusted invalidation queue (e.g., trusted page-response queue).
In certain examples, trusted execution environments (TEEs) have access to TEE resources (e.g., protected memory and/or memory mapped IO (MMIO) of TEE-IO device) and non-TEE resources (e.g., shared memory and/or MMIO of legacy device).
In certain examples, an IO device may need to determine whether the translated address is associated with the TEE memory or the non-TEE memory (e.g., to share a cache line between TEE and non-TEE domains, to support direct peer-to-peer between IO devices, and/or to enable efficient data-sharing across the trusted and untrusted device contexts). Certain examples herein are directed to an extension to IOMMU circuitry to return the TEE-polarity of DMA-Target/Completer as part of ATS Translation Completion.
In certain examples, IOMMU (e.g., circuitry) reports TEE-Polarity of Completer support (TPCS) through the trusted extended capability register (e.g., register 314D).
In certain examples, a new bit (e.g., TPCE-bit) is utilized in an IOMMU's scalable-mode context table entry that enables generation of TPC-bit as “TEE Exclusive” attribute in ATS Translation Completion.
In certain examples, when TPCE-bit in IOMMU's scalable-mode context table entry is 0b, the TPC-bit is always generated as 0b. In certain examples, when TPCE-bit in IOMMU's scalable-mode context table entry is 1b (e.g., enabled), on a successful processing of ATS Translation Request, TPC-bit is generated as (T (or Bit-0 of XT attribute) of ATS Translation Request & (!GPA.SHARED)). In certain examples, a TPC-bit is generated as (T (or Bit-0 of XT attribute) of ATS Translation Request & IS_TEE_PAGE (HPA)), e.g., where this results in TPC-bit being generated as 1b for TEE resources (e.g., protected memory and/or TEE Device Interface) and 0b for non-TEE resources (e.g., shared memory and/or legacy device interface) assigned to TEE.
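As a non-limiting illustration, generation of the TPC-bit can be sketched as follows. The function name and the boolean `target_is_tee` argument (standing in for either !GPA.SHARED or IS_TEE_PAGE(HPA)) are assumptions made for this sketch.

```python
def tpc_bit(tpce: int, t_attr: int, target_is_tee: bool) -> int:
    """Generate the TPC-bit for an ATS Translation Completion.

    tpce:          TPCE-bit from the scalable-mode context table entry.
    t_attr:        T attribute (or Bit-0 of the XT attribute) of the request.
    target_is_tee: assumed stand-in for !GPA.SHARED or IS_TEE_PAGE(HPA).
    """
    if tpce == 0:
        return 0  # TPC-bit is always 0b when TPCE is not enabled
    return t_attr & int(target_is_tee)
```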
In certain examples, IOMMU caches are also tagged with TPC-bit along with TEE-bit.
In certain examples, a standard (e.g., PCI-SIG) defines mechanisms to convey TPC-bit as part of ATS Translation Completion. In example ATS packet 1202, the TPC-bit is conveyed as part of (e.g., payload for) ATS Translation Completion through the “TEE Exclusive” attribute, e.g., TEE Exclusive attribute to replace the global field (when enabled via ATS registers on the IO device).
eXtended TEE (XT) Mode Support in the IOMMU
In certain examples, trusted execution environments (TEEs) have access to TEE memory (e.g., protected memory and/or memory mapped IO (MMIO) of TEE-IO device) and non-TEE memory (e.g., shared memory and/or MMIO of legacy device).
In certain examples, an IO device may be interested in explicitly targeting TEE memory or non-TEE memory (e.g., conveying an intent to store digital-rights management (DRM) content to only TEE memory). In certain examples, this intent is conveyed through an eXtended TEE (XT) attribute on the memory request (e.g., untranslated request, ATS translation request, ATS translated request). For example, if the XT attribute is 00b, the request originated from non-TEE (e.g., not trust domain or not TEE-IO device) and must target the non-TEE memory. If the XT attribute is 01b, the request originated from TEE (e.g., trust domain or TEE-IO device) and can target TEE or non-TEE memory based on the address translation performed by the IOMMU. If the XT attribute is 10b, the request originated from TEE (e.g., trust domain or TEE-IO device) and must target non-TEE memory. If the XT attribute is 11b, the request originated from TEE (e.g., trust domain or TEE-IO device) and must target TEE memory. Certain examples herein are directed to an extension to IOMMU circuitry to process the memory requests received with the XT attribute.
In certain examples, the host (e.g., processor) may be interested in learning the requested TEE-polarity of the Completer (e.g., keyID look-up for Scalable Multi-Key TME and/or direct peer-to-peer between IO devices). Certain examples herein allow an IO device to fill the XT attribute on the ATS translated request based on the TEE-polarity of Completer received in the ATS Translation Completion. If the TEE Exclusive attribute is 0b (e.g., TPC=0b), the IO device generates ATS Translated Request with the XT attribute of 10b. If the TEE Exclusive attribute is 1b (e.g., TPC=1b), the IO device generates ATS Translated Request with the XT attribute of 11b.
In certain examples, IOMMU (e.g., circuitry) reports support for XT mode (XTS) through the trusted extended capability register (e.g., register 314D).
In certain examples, a new bit (e.g., XTE-bit) is utilized in an IOMMU's scalable-mode context table entry that enables processing of XT attribute from the memory requests.
In certain examples, when XTE-bit in IOMMU's scalable-mode context table entry is 0b, only XT0 bit is used for address translation and XT1 bit is treated as Reserved (and must be 0b). In certain examples, when XTE-bit in IOMMU's scalable-mode context table entry is 1b (e.g., enabled), on a successful address translation, TEE-polarity of Target/Completer is checked against the incoming XT attribute. For example, if the XT attribute is 00b or 10b, the target must be non-TEE memory. If the XT attribute is 11b, the target must be TEE memory. If the XT attribute is 01b, the target can be TEE or non-TEE memory. If the request is an ATS translated request, the XT attribute must not be 01b. A memory request failing any of these checks is blocked by the IOMMU 120 (e.g., circuitry).
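As a non-limiting illustration, the checks above can be modeled as a single predicate. This is a minimal software sketch under stated assumptions: the function name and parameters are hypothetical, and the handling of a non-zero XT1 bit when XTE is 0b (treated here as a failed check) is an assumption, since the text only states that the bit is Reserved and must be 0b.

```python
def xt_check_passes(xt: int, target_is_tee: bool, ats_translated: bool, xte: int) -> bool:
    """Return True if the request passes the XT checks, False if it is blocked."""
    if xt not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("XT attribute is a 2-bit value")
    if xte == 0:
        # XT1 is Reserved and must be 0b; no TEE-polarity check is modeled here.
        # Assumption: a set reserved bit is treated as a failed check.
        return (xt >> 1) == 0
    if ats_translated and xt == 0b01:
        return False              # ATS translated requests must not carry 01b
    if xt in (0b00, 0b10):
        return not target_is_tee  # target must be non-TEE memory
    if xt == 0b11:
        return target_is_tee      # target must be TEE memory
    return True                   # 01b: target can be TEE or non-TEE memory
```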
In certain examples, IOMMU caches are also tagged with the XT attribute.
In certain examples, Table 2A describes the meaning of the XT attribute, and Tables 2B, 2C, and 2D describe the IOMMU (e.g., circuitry) processing for the untranslated request, the ATS translation request, and the ATS translated request, respectively.
In example PCIe packet 1204, the XT attribute is conveyed as part of the Integrity and Data Encryption (IDE) TLP prefix.
In example PCIe packet 1206, the XT attribute is conveyed as part of the OHC-C (Orthogonal Header Content—C) field.
In certain examples, the IOMMU 120 gets a new input (e.g., T attribute or “ide_t” as the state of the T bit in the IDE prefix of TLP (e.g., not a control packet) received, e.g., where the T attribute, when set, indicates the TLP originated from within a trust domain) from devices. In certain examples, for a TLP received without the IDE prefix, this input is 0b.
In certain examples, the IOMMU 120 gets a new input (e.g., XT attribute (XT0/XT1 bits) in the IDE TLP prefix or OHC-C field received, e.g., where the XT attribute, when not 00b, indicates the TLP originated from within a trust domain) from devices. In certain examples, for a TLP received without the IDE TLP prefix or OHC-C field, this input is 00b.
In certain examples, the IOMMU 120 generates an output (“TA-Polarity”) which indicates if the physical address at the final applicable output can have a trust domain (e.g., TDX) KeyID (kid).
In certain examples, to signal the setting of the T (or XT) attribute to be sent in the PCIe TLP, the IOMMU 120 outputs a signal T (or XT) attribute which is forwarded by the HIOP (e.g., OTC thereof) to the on-chip system fabric (OSF) agent. In certain examples, the IOMMU 120 sets T attribute to 1b (or XT attribute to 01b) when the message was generated in response to descriptors from the trusted invalidation queue (e.g., trusted invalidation queue 1010 in
In certain examples, the secondary interface is also used to generate Message Signaled Interrupts (MSI) writes, e.g., writes to special memory ranges and the TA-Polarity for these writes is assumed to be 0.
In certain examples, the secondary interface is also used to generate writes to store the value obtained from the “Status Data” field of an invalidation wait descriptor to the address specified by the “Status Address” field of that invalidation wait descriptor. In certain examples, the TA-Polarity for these writes is always 0 irrespective of which invalidation queue (normal or trusted) the invalidation wait descriptor was processed from.
In certain examples, a new signal (value) called TA-Polarity is added to this interface to indicate if the physical address of the access to the memory subsystem can have a TDM (e.g., TDX) KeyID.
In certain examples, the memory interface is used by the IOMMU 120: (i) for fetches to translation table entries as part of page walk originating from the untrusted as well as trusted translation tables, (ii) to perform accessed/dirty (A/D) bit updates atomically in first and second stage paging structures, (iii) to perform atomic updates to the posted interrupt descriptor (PID), (iv) for fetches to invalidation descriptor from the untrusted as well as trusted invalidation queue, and/or (v) writes to the untrusted as well as trusted page request queue.
In certain examples, one or more registers are used to implement the disclosure herein, for example, by decoding and executing an instruction that stores a (e.g., control) value into one or more registers.
In certain examples, if an implementation cannot ensure that the registers (e.g., trusted IOMMU registers 1012A-1012I and 316) are reserved and store zero values (RsvdZ) when ECAP_REG.TDXIO 1100 is 0, it should guarantee that writes to these registers (where applicable) are effectively no-operations (No-Ops) from the IOMMU operation point of view.
In certain examples, the ECAP_REG.TDXIO is 1 only when all the following qualifications/dependencies are satisfied: (i) default hardware reset of ECAP_REG.TDXIO is 1, (ii) ECAP_REG.SMTS=1 (scalable mode support present), (iii) Effective Host Address Width (e.g., after hardware autonomous width (HAW) defeature inclusion with the maximum physical platform address (MAX_PA)) is 52 bit, and (iv) TDX-IO Defeature (see below) is OFF. In certain examples, the TDX-IO feature can be fully defeatured using a bit (e.g., bit 3 for TDX-IO) of a Capability Defeature Register (e.g., as one of the registers in a processor and/or IOMMU).
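As a non-limiting illustration, the four qualifications above can be combined into a single predicate. This is a minimal sketch; the function and parameter names are hypothetical and model the inputs as plain integers and booleans rather than register fields.

```python
def ecap_tdxio(reset_default: int, smts: int, effective_haw_bits: int,
               tdxio_defeatured: bool) -> int:
    """Return the effective ECAP_REG.TDXIO value (1 or 0).

    All four conditions from the text must hold:
      (i)   hardware reset default of ECAP_REG.TDXIO is 1
      (ii)  ECAP_REG.SMTS is 1 (scalable mode support present)
      (iii) effective host address width is 52 bits
      (iv)  the TDX-IO defeature is OFF
    """
    return int(
        reset_default == 1
        and smts == 1
        and effective_haw_bits == 52
        and not tdxio_defeatured
    )
```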
In certain examples, a set of registers is used for command submission (e.g., called “Enhanced Command”) to an IOMMU with appropriate success/failure and thereby fault reporting, for example, with these extended as below to support the SET_TDX_MODE command in TDX-IO.
In certain examples, separate Trusted Enhanced Command Register (T_ECMD_REG), Trusted Enhanced Command Extended Operand Register (T_ECEO_REG), Trusted Enhanced Command Status 0-1 Register (T_ECSTS0_REG, T_ECSTS1_REG), Trusted Enhanced Command Capability Register 0-3 (T_ECCAP0_REG, T_ECCAP1_REG, T_ECCAP2_REG, T_ECCAP3_REG), and Trusted Enhanced Command Response Register (T_ECRESP_REG) are used to send/receive Trusted Commands (e.g., TDX-IO Commands) to the IOMMU.
In certain examples, the registers include Protected Memory Enable Register (PMEN), Protected Low-Memory Base Register (PLMBASE), Protected Low-Memory Limit Register (PLMLIMIT), Protected High-Memory Base Register (PHMBASE), and Protected High-Memory Limit Register (PHMLIMIT). In certain examples, the PMEN, when set, is to enable DMA-protected memory regions set up through the PLMBASE, PLMLIMIT, PHMBASE, and PHMLIMIT registers.
In certain examples, PMEN, PLMBASE, PLMLIMIT, PHMBASE, and PHMLIMIT registers are shadowed in the HIOP, for example, where the HIOP also shadows the IOMMU SAI policy group registers of the IOMMU. In certain examples, the IOMMU SAI policy group registers are located at offset 0xF10 in the IOMMU VTBAR.
In certain examples, TDX-IO makes these registers into protected registers (e.g., covered by the SEAM_OS_W policy group). In certain examples, to avoid having to add new policy groups to the HIOP shadow logic and to avoid the HIOP shadow logic from having to use a different offset (e.g., than 0xF10), the IOMMU locates the SEAM_OS_W policy group registers of read access control (RAC), write access control (WAC), and control policy (CP) at certain offsets (e.g., offsets 0xF10, 0xF18, and 0xF20, respectively).
In certain examples, setting Set Root Table Pointer (SRTP) bit via global command register (GCMD_REG) 314C is unchanged from a non-IO VT-d specification definition, for example, it latches the legacy root pointer to an internal copy (e.g., along with the internal/external drain, global invalidation, etc.) with no other side effects from unexpected register values etc.
In certain examples, when in TDX mode, the trust domain manager (e.g., TDX-module) takes ownership of the RTADDR_REG as well as the GCMD_REG (write access controlled to SEAM), e.g., such that the trust domain manager (e.g., TDX-module) ensures that the RTADDR_REG programmed by the VMM has translation mode set to either scalable mode or abort.
In certain examples, an Enhanced Command (ECMD) register (e.g., enhanced command register (ECMD_REG) 314B) is a new VT-d command submission interface to the IOMMU 120 with corresponding response (e.g., success/failure) feedback to S/W based on the applicable error/compatibility checks. This is a cleaner contract between H/W and S/W as compared to other register-based commands (e.g., SRTP via GCMD) where the commands always execute irrespective of error checks and involved side effects on other IOMMU states that would ultimately invoke failure/fault detection in the data path operations. In certain examples, software is updated about the erroneous/incompatible command processing by the IOMMU.
In certain examples (e.g., along with architectural support for various performance monitoring (Perfmon) commands for IOMMU), ECMD supports new command “Set TDX Mode” (e.g., architectural) for enabling/disabling TDX Mode on an IOMMU. In certain examples, flows (e.g., SRTP, Set Interrupt Remap Table Pointer (SIRTP), etc.) transfer over to the ECMD. In certain examples, the ECMD register (used for submitting commands) is placed in the SEAM_OS_W policy group. In certain examples, in addition to the ECMD, GCMD, Protected Memory Range (PMR) related registers, and RTADDR are in SEAM_OS_W policy group.
In certain examples, the ECMD_REG.CMD=SET_TDX_MODE command processing in the IOMMU (e.g., along with all associated operations) is as in the following pseudocode (where // is before comments/notes):
In certain examples, ECCAP0.STDXS is dependent/qualified on ECAP_REG.TDXIO being 1, e.g., without TDX-IO capability, there is no Set TDX Mode command support. In certain examples, for TDX-IO, the trust domain manager (e.g., TDX Module) is to also reset the performance counter configurations as part of IOMMU initialization steps for transitioning to TDX_MODE, e.g., through the ECMD command “RESET_PERFORMANCE_COUNTER_CONFIGURATION” which results in all counters being disabled and all configuration, filter, freeze, and overflow status registers set to their default value (e.g., to prevent any telemetry-based attacks on trusted DMA request translations).
In certain examples, for supporting TDX-IO capability, an IOMMU has two sets of invalidation queues (IQ), for example, a non-trust domain (e.g., “normal”) IQ maintained by the VMM (e.g., untrusted invalidation queue 1006 in
In certain examples, when ECAP_REG.TDXIO is 1, the IOMMU round robins between the trusted and the untrusted invalidation queues independent of the INT_TDX_MODE_REG.TM value, e.g., if ECAP_REG.TDXIO is 0, then the IOMMU defaults to fetching only from the existing untrusted IQ.
In certain examples with TDX-IO capability, if there is one active IQ (untrusted or trusted) being fetched and processed at a time, and there is an associated fault, it would be recorded, and actions taken as per the IQ fault related registers. In certain examples, no security is associated with fault reporting as MSIs are handled by VMM/host OS. In certain examples, a pending fault will stop all IQ/TIQ related processing until it is dealt with by software.
In certain examples, the IOMMU operations when ECAP_REG.TDXIO=1 can be summarized as follows:
In certain examples, the round robin behavior is kept irrespective of TDX Mode to simplify the hardware. In certain examples, when ECAP_REG.TDXIO=1, if TDX Mode=0, trusted IQ is always empty as per TDX-module expected behavior/requirements and hence only the first IF condition will be satisfied if applicable.
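As a non-limiting illustration, the round-robin arbitration between the untrusted and trusted invalidation queues can be modeled as below. This is a minimal software sketch and not the summarized hardware flow itself; the class `IqArbiter`, its attributes, and the string queue labels are hypothetical. It models three behaviors stated in the text: fetching only from the untrusted IQ when ECAP_REG.TDXIO is 0, alternating between the two queues when ECAP_REG.TDXIO is 1 regardless of TDX mode, and halting all fetches while a fault is pending.

```python
from collections import deque
from typing import Optional, Tuple


class IqArbiter:
    """Software model of invalidation queue (IQ) selection."""

    def __init__(self, tdxio_capable: bool) -> None:
        self.tdxio_capable = tdxio_capable
        self.untrusted: deque = deque()
        self.trusted: deque = deque()
        self.fault_pending = False       # cleared by software in this model
        self._trusted_goes_first = False

    def fetch(self) -> Optional[Tuple[str, object]]:
        """Return (queue_name, descriptor) for the next fetch, or None."""
        if self.fault_pending:
            return None                  # a pending fault stops all IQ processing
        if not self.tdxio_capable:
            if self.untrusted:
                return ("untrusted", self.untrusted.popleft())
            return None
        order = ("trusted", "untrusted") if self._trusted_goes_first else ("untrusted", "trusted")
        for name in order:
            queue = self.trusted if name == "trusted" else self.untrusted
            if queue:
                # The queue that was not just served gets first look next time.
                self._trusted_goes_first = (name == "untrusted")
                return (name, queue.popleft())
        return None
```

When TDX mode is 0 the trusted queue is expected to stay empty, so the loop above naturally serves only the untrusted queue without a separate mode check.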
The following discusses architecture level changes in certain IOMMUs to support trusted translations/walks for requests coming in with T attribute=1b or XT attribute !=00b.
In certain examples, an IO cache (e.g., IO TLB) is extended with a new tag bit “trusted”. In certain examples, when the IO cache is filled, this tag bit is set to (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0). In certain examples, when IO cache is looked up, the (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0) of the transaction is compared to the trusted bit to detect a match. In certain examples, the parity generation and/or verification on IO cache tags includes the Trusted bit. In certain examples, the same behavior also applies to translation type cache (TTC) (e.g., at the micro-architectural level) read and/or match as well in the IO cache pipeline.
In certain examples, a PASID table entry cache (PTC) is extended with a new tag bit—Trusted. In certain examples, when PTC is filled, this tag bit is set to (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0). In certain examples, when PTC is looked up, the (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0) of the transaction is compared to the Trusted bit to detect a match. In certain examples, the parity generation and verification on PTC tags should include the Trusted bit.
In certain examples, a context entry cache (CTC) is extended with a new tag bit, Trusted. In certain examples, when the CTC is filled, this tag bit is set to (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0). In certain examples, when the CTC is looked up, the (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0) of the transaction is compared to the Trusted bit to detect a match. In certain examples, the parity generation and verification on CTC tags should include the Trusted bit. In certain examples, this logically extends to TTC as well when the tag/lookup array is shared with the CTC.
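As a non-limiting illustration, the Trusted tag bit used by the IO cache, PTC, and CTC can be modeled as an extra component of the cache key, so that a trusted fill never satisfies an untrusted lookup and vice versa. This is a minimal software sketch; `trusted_tag` and `TaggedCache` are hypothetical names, and reading the expression "(ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or XT0)" as TDXIO & TM & (T | XT0) is an assumption about its intended grouping.

```python
from typing import Dict, Hashable, Optional, Tuple


def trusted_tag(tdxio: int, tm: int, t: int = 0, xt0: int = 0) -> int:
    """Compute the Trusted tag bit for a fill or lookup.

    Assumed grouping: TDXIO & TM & (T | XT0), where a request carries either
    the T attribute or the XT attribute.
    """
    for bit in (tdxio, tm, t, xt0):
        if bit not in (0, 1):
            raise ValueError("all inputs are single bits")
    return tdxio & tm & (t | xt0)


class TaggedCache:
    """Cache whose entries match only when the Trusted bit also matches."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[Hashable, int], object] = {}

    def fill(self, key: Hashable, trusted: int, value: object) -> None:
        self._entries[(key, trusted)] = value

    def lookup(self, key: Hashable, trusted: int) -> Optional[object]:
        return self._entries.get((key, trusted))
```

For instance, an entry filled with `trusted_tag(1, 1, t=1)` is returned only to lookups whose computed tag is also 1.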
In certain examples, on an IO cache miss (e.g., the mapping is not in the IO cache, so a walk is to be performed from the translation tables), when IOMMU is to access the root table to perform an operation, the IOMMU selects between the HARDWARE_RTADDR_REG and the HARDWARE_T_RTADDR_REG based on (ECAP_REG.TDXIO & INT_TDX_MODE_REG.TM & T or (XT0|XT1)) of the associated incoming request. In certain examples, when in TDX mode, if the request received for translation was with T attribute of 1b or XT attribute of not 00b, then the HARDWARE_T_RTADDR_REG is selected else the HARDWARE_RTADDR_REG is selected in all other cases.
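As a non-limiting illustration, the root table pointer selection on an IO cache miss can be sketched as below. The function name `select_root_table` is hypothetical, and the expression is read as TDXIO & TM & (T | (XT0 | XT1)), which is an assumption about the intended grouping.

```python
def select_root_table(tdxio: int, tm: int, t: int = 0, xt: int = 0b00) -> str:
    """Return the name of the root table address register used for the walk."""
    if tdxio not in (0, 1) or tm not in (0, 1) or t not in (0, 1):
        raise ValueError("TDXIO, TM, and T are single bits")
    if xt not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("XT attribute is a 2-bit value")
    xt_any = int(xt != 0b00)  # XT0 | XT1
    trusted = tdxio & tm & (t | xt_any)
    return "HARDWARE_T_RTADDR_REG" if trusted else "HARDWARE_RTADDR_REG"
```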
In certain examples, UR is an unsupported request, CA is completer abort, IR is interrupt remapping, and NA is not applicable.
In certain examples, if the remapping hardware is not able to successfully process the translation-request (e.g., with or without PASID), a translation-completion without data is returned, for example, with a status code of UR (Unsupported Request) returned in the completion if the remapping hardware is configured to not support translation requests from this endpoint, and/or a status code of CA (Completer Abort) is returned if the remapping hardware encountered errors when processing the translation-request.
In certain examples, in TDX_MODE, the domain ID is partitioned between TD VMs and non-TD VMs. In certain examples, non-TD VMs use domain IDs with bit L of domain ID set to 0 and TD VMs use domain IDs with bit L of domain ID set to 1. In certain examples, L is the most significant bit (MSB) of the effective domain ID width as enumerated by ECAP.ND field. In certain examples, the ECAP.ND enumerates a (e.g., 16-bit wide) domain ID (e.g., not accounting for de-feature) and hence L bit will be that MSB (e.g., bit 15 of bits 15-0). In certain examples, in TDX mode, when a page walk is being performed for untrusted requests (e.g., request with T attribute of 0b or XT attribute of 00b), if a PASID table entry is found with domain ID bit L set to 1 then it is treated as a terminal fault and such PASID table entries are not cached. In certain examples, this prevents a VMM from maliciously re-using a domain ID allocated to TDs and PASID allocated to TDs with an untrusted device to trigger a first/second stage paging structure entry cache hit which is looked up by domain-ID, PASID (e.g., for first-stage caches), and address. In certain examples, as Domain ID partitioning is done, no separate “Trusted” bit tags are required for the set of FS and SS caches. In certain examples, the following fault check is used for TDX-IO security:
In certain examples, the error reporting for this terminal fault is like error reporting for reserved bits.
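As a non-limiting illustration, the domain ID partitioning check for untrusted walks can be sketched as below. This is a minimal software model; the function name and parameters are hypothetical, and the domain ID width defaults to the 16-bit example given in the text.

```python
def untrusted_walk_faults(domain_id: int, domain_id_width: int = 16,
                          tdx_mode: bool = True, request_trusted: bool = False) -> bool:
    """Return True if the PASID table entry causes a terminal fault.

    In TDX mode, a walk for an untrusted request (T = 0b or XT = 00b) faults
    when bit L of the domain ID is 1, where L is the most significant bit of
    the effective domain ID width. Such an entry is also not cached.
    """
    if not 0 <= domain_id < (1 << domain_id_width):
        raise ValueError("domain ID does not fit in the effective width")
    if not tdx_mode or request_trusted:
        return False
    l_bit = domain_id_width - 1
    return bool((domain_id >> l_bit) & 1)
```

With a 16-bit domain ID, IDs 0x0000 to 0x7FFF fall in the non-TD partition and IDs 0x8000 to 0xFFFF fall in the TD partition.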
In certain examples, the following fault check is used for TDX-IO security: when ECAP_REG.TDXIO is 1, if TDX mode is enabled and the walk is for T or (XT0|XT1)=1, then the PASID Granular Translation Type (PGTT) is (e.g., must be) a certain value or values, e.g., 010b (e.g., 2nd level only) or 011b (e.g., nested), and if not one of those values (e.g., those two values), then cause a terminal fault.
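As a non-limiting illustration, this PGTT check can be sketched as below. The function and constant names are hypothetical; the allowed values 010b and 011b are the example values given above.

```python
ALLOWED_TRUSTED_PGTT = frozenset({0b010, 0b011})  # second-level only, nested


def pgtt_terminal_fault(tdxio: int, tdx_mode: int, t: int, xt: int, pgtt: int) -> bool:
    """Return True if the PGTT value causes a terminal fault for this walk."""
    trusted_walk = bool(tdxio and tdx_mode and (t or xt != 0b00))
    return trusted_walk and pgtt not in ALLOWED_TRUSTED_PGTT
```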
In certain examples, remapping hardware includes a field that indicates the maximum DMA virtual addressability supported by the remapping hardware. In certain examples, the Maximum Guest Address Width (MGAW) is computed as (N+1), where N is the value reported in this field. For example, a hardware implementation supporting 48-bit MGAW reports a value of 47 (101111b) in this field. In certain examples, if the value in this field is X, untranslated and translated DMA requests to addresses above 2^(X+1)-1 are always blocked by hardware, and translation requests to addresses above 2^(X+1)-1 from allowed devices return a null Translation Completion Data Entry with R=W=0.
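As a non-limiting illustration, the MGAW arithmetic above can be worked through in a few lines. The function names are hypothetical.

```python
def mgaw_bits(field_value: int) -> int:
    """MGAW is (N + 1), where N is the value reported in the field."""
    return field_value + 1


def max_guest_address(field_value: int) -> int:
    """Highest address that is not blocked: 2^(N+1) - 1."""
    return (1 << (field_value + 1)) - 1


def dma_request_blocked(address: int, field_value: int) -> bool:
    """Untranslated and translated DMA requests above the limit are blocked."""
    return address > max_guest_address(field_value)
```

With the field value 47 from the example, the MGAW is 48 bits and the highest permitted address is 2^48 - 1.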
In certain examples, guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure, e.g., and adjusted guest address widths supported by hardware are reported through the SAGAW field.
In certain examples, implementations support a MGAW at least equal to the physical addressability (e.g., host address width) of the platform.
In certain examples, remapping hardware includes a (e.g., 5-bit) field indicating the supported adjusted guest address widths (SAGAW), e.g., which represents the levels of page-table walks for the (e.g., 4 KB) base page size supported by the hardware implementation. In certain examples, a value of 1 in any of these bits indicates the corresponding adjusted guest address width is supported, e.g., where the adjusted guest address widths corresponding to various bit positions within this field are:
In certain examples, software is to ensure that the adjusted guest address width used to setup the page tables is one of the supported guest address widths reported in this field.
In certain examples, for TDs, guest physical addresses (GPA) with most significant bit set to 1 are called shared GPA and with most significant bit set to 0 are private GPA. In certain examples, the SHARED bit is evaluated as follows:
In certain examples, the S_BIT calculation does not need to include SAGAW and MGAW as these are separate VT-d checks and would raise fault if AW and SAGAW did not comply with each other and/or input GPA width is greater than what is allowed by MGAW and AW. In certain examples, the expected S/W behavior is that TDX-module would verify SAGAW and MGAW from a capabilities (CAP) register to support multiple (e.g., 4 and/or 5) level EPT before setting TDX Mode=1.
In certain examples, the SHARED bit being 1 in a first-stage paging entry (e.g., FS-PML5E, FS-PML4E, FS-PDPE with PS bit 0, FS-PDE with PS bit 0) with the Present (P) field set is treated as a terminal fault. In certain examples, for data reads and writes, an FS-PDPE can have SHARED bit 1 if PS is set to 1 (i.e., it maps a 1 GB page), an FS-PDE can have SHARED bit 1 if PS is set to 1 (i.e., it maps a 2 MB page), and an FS-PTE can have SHARED bit 1. In certain examples, for instruction fetches, if the SHARED bit is set to 1 in an FS-PDPE with page size (PS) set to 1 (mapping a 1 GB page), in an FS-PDE with PS set to 1 (mapping a 2 MB page), or in an FS-PTE, then a terminal fault is caused. In certain examples, this fault check enforces that a TD can locate FSPT paging structures only in private GPA and that data reads/writes, but not instruction fetches, can be done to shared memory. In certain examples, the fault is a terminal fault and signaled as set fault-log (SFS) SFS.11 (e.g., for both leaf and non-leaf paging structures). In certain examples, SHARED will always evaluate to 0 if TDX mode is not enabled or if the walk is for a transaction with T or (XT0|XT1)=0.
In certain examples, SSPT walks require that all second-stage (SS) paging structure entries (e.g., except the root SS paging structure entry and the final address of the translation) do not (e.g., must not) have TD private KeyID if the walk was started with a GPA with SHARED set to 1. In certain examples, this fault check prevents a VMM from locating SS paging structure entries or final translation from SS paging to be mapped to TD private memory. In certain examples, the TDX_MODE_REG.L indicates the number of physical address bits starting at HAW-1 that are reserved for encoding TDX Key IDs. If, for example, HAW is 46 and L is 6, the bits 45:40 if set in a physical address indicate that the physical address has a private Key ID.
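As a non-limiting illustration, the private KeyID bit range implied by HAW and TDX_MODE_REG.L can be computed as below. The function names are hypothetical; the sketch only models the bit arithmetic, not the fault check itself.

```python
def private_keyid_mask(haw: int, l: int) -> int:
    """Mask of the L physical address bits ending at bit HAW-1."""
    if not 0 <= l <= haw:
        raise ValueError("L must be between 0 and HAW")
    return ((1 << l) - 1) << (haw - l)


def has_private_keyid(physical_address: int, haw: int, l: int) -> bool:
    """True if any of the reserved KeyID bits is set in the address."""
    return bool(physical_address & private_keyid_mask(haw, l))
```

With HAW of 46 and L of 6, the mask covers bits 45:40, matching the example above.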
In certain examples, this is evaluated as follows:
In certain examples, the IOMMU relies on the KeyID filter to abort a memory request from the device or an access from the IOMMU itself to access its translation structures with TDX KeyID unless the IOMMU allows memory request to have a TDX KeyID. In certain examples, this is accomplished by a logical signal from the IOMMU called TA-Polarity.
In certain examples, the TA-Polarity value is driven by the IOMMU as follows to indicate whether the access can have a TDX KeyID:
In certain examples, a VMM hands control of the IOMMU to the trust domain manager (e.g., TDX-module) if it discovers TDX-IO capable device(s) in the scope of the IOMMU, e.g., by invoking a function in the TDX-module. The following sections specify an example programming sequence and restrictions the TDX-module is to (e.g., must) observe for:
In certain examples, when TDX mode is enabled, the SEAM_OS_W registers are not writeable by the VMM, e.g., the VMM is provided an application programming interface (API) function to program the following registers if needed:
In certain examples, the VMM may request that an IOMMU TDX mode be cleared, e.g., where the TDX module follows the following sequence.
In certain examples, an IOMMU includes a set of registers for untrusted components and a separate set of (e.g., protected) registers for trusted components. The following discusses certain example registers, but it should be understood that an untrusted and trusted (“T”) instance of each can be utilized in the same IOMMU.
The operations 4700 include, at block 4702, managing one or more hardware isolated virtual machines as a respective trust domain with a region of protected memory by a trust domain manager of a hardware processor core. The operations 4700 further include, at block 4704, sending a request for a direct memory access of a protected memory of a trust domain from an input/output device to input/output memory management unit (IOMMU) circuitry comprising trusted direct memory access translation data and coupled between the hardware processor core and the input/output device. The operations 4700 include, at block 4706, in response to a field in the request being set to indicate the input/output device is in a trusted computing base of the trust domain and an entry in the trusted direct memory access translation data being set into an active state by the trust domain manager, allowing the direct memory access by the input/output device. The operations 4700 (optionally) include, at block 4708, in response to the entry in the trusted direct memory access translation data being set into a not active state by the trust domain manager, blocking, by the IOMMU circuitry, the direct memory access by the input/output device.
In certain examples, a (e.g., TDX-IO) register (e.g., in an IOMMU) is read and/or written to by an instruction, for example, according to a method for processing a register instruction according to examples of the disclosure. A processor (e.g., or processor core) may perform operations of a method, e.g., in response to receiving a request to execute an instruction from software. Operations may include processing a “TDX-IO” instruction by performing a: fetch of an instruction (e.g., having an instruction opcode corresponding to the command mnemonic), decode of the instruction into a decoded instruction, retrieve data associated with the instruction, (optionally) schedule the decoded instruction for execution, execute the decoded instruction to set the register, and thus control the functionality of the TDX-IO commands, and commit a result of the executed instruction.
Exemplary architectures, systems, etc. that the above may be used in are detailed below. Exemplary instruction formats that may cause any of the operations herein are detailed below.
At least some examples of the disclosed technologies can be described in view of the following examples:
Example 2. The apparatus of example 1, wherein the IOMMU circuitry is to, in response to the entry in the cache of trusted direct memory access translation data for the guest address being set into a blocked state for a pending, but not yet queued, invalidation of the entry by the trust domain manager, allow the determination of the host physical address corresponding to the guest address from the entry in the cache of trusted direct memory access translation data, and the direct memory access to the host physical address by the input/output device.
Example 3. The apparatus of any one of examples 1-2, wherein the IOMMU circuitry is to, in response to the entry in the cache of trusted direct memory access translation data for the guest address being set into a blocked state for a queued, but not yet completed, invalidation of the entry by the trust domain manager, allow the determination of the host physical address corresponding to the guest address from the entry in the cache of trusted direct memory access translation data, and the direct memory access to the host physical address by the input/output device.
Detailed below are descriptions of example computer architectures. Other system designs and configurations known in the arts for laptop, desktop, and handheld personal computers (PCs), personal digital assistants, engineering workstations, servers, disaggregated servers, network devices, network hubs, switches, routers, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, cell phones, portable media players, hand-held devices, and various other electronic devices, are also suitable. In general, a variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.
Processors 4870 and 4880 are shown including integrated memory controller (IMC) circuitry 4872 and 4882, respectively. Processor 4870 also includes interface circuits 4876 and 4878; similarly, second processor 4880 includes interface circuits 4886 and 4888. Processors 4870, 4880 may exchange information via the interface 4850 using interface circuits 4878, 4888. IMCs 4872 and 4882 couple the processors 4870, 4880 to respective memories, namely a memory 4832 and a memory 4834, which may be portions of main memory locally attached to the respective processors.
Processors 4870, 4880 may each exchange information with a network interface (NW I/F) 4890 via individual interfaces 4852, 4854 using interface circuits 4876, 4894, 4886, 4898. The network interface 4890 (e.g., one or more of an interconnect, bus, and/or fabric, and in some examples is a chipset) may optionally exchange information with a coprocessor 4838 via an interface circuit 4892. In some examples, the coprocessor 4838 is a special-purpose processor, such as, for example, a high-throughput processor, a network or communication processor, compression engine, graphics processor, general purpose graphics processing unit (GPGPU), neural-network processing unit (NPU), embedded processor, or the like.
A shared cache (not shown) may be included in either processor 4870, 4880 or outside of both processors, yet connected with the processors via an interface such as P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Network interface 4890 may be coupled to a first interface 4816 via interface circuit 4896. In some examples, first interface 4816 may be an interface such as a Peripheral Component Interconnect (PCI) interconnect, a PCI Express interconnect or another I/O interconnect. In some examples, first interface 4816 is coupled to a power control unit (PCU) 4817, which may include circuitry, software, and/or firmware to perform power management operations with regard to the processors 4870, 4880 and/or co-processor 4838. PCU 4817 provides control information to a voltage regulator (not shown) to cause the voltage regulator to generate the appropriate regulated voltage. PCU 4817 also provides control information to control the operating voltage generated. In various examples, PCU 4817 may include a variety of power management logic units (circuitry) to perform hardware-based power management. Such power management may be wholly processor controlled (e.g., by various processor hardware, and which may be triggered by workload and/or power, thermal or other processor constraints) and/or the power management may be performed responsive to external sources (such as a platform or power management source or system software).
PCU 4817 is illustrated as being present as logic separate from the processor 4870 and/or processor 4880. In other cases, PCU 4817 may execute on a given one or more of cores (not shown) of processor 4870 or 4880. In some cases, PCU 4817 may be implemented as a microcontroller (dedicated or general-purpose) or other control logic configured to execute its own dedicated power management code, sometimes referred to as P-code. In yet other examples, power management operations to be performed by PCU 4817 may be implemented externally to a processor, such as by way of a separate power management integrated circuit (PMIC) or another component external to the processor. In yet other examples, power management operations to be performed by PCU 4817 may be implemented within BIOS or other system software.
Various I/O devices 4814 may be coupled to first interface 4816, along with a bus bridge 4818 which couples first interface 4816 to a second interface 4820. In some examples, one or more additional processor(s) 4815, such as coprocessors, high throughput many integrated core (MIC) processors, GPGPUs, accelerators (such as graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays (FPGAs), or any other processor, are coupled to first interface 4816. In some examples, second interface 4820 may be a low pin count (LPC) interface. Various devices may be coupled to second interface 4820 including, for example, a keyboard and/or mouse 4822, communication devices 4827 and storage circuitry 4828. Storage circuitry 4828 may be one or more non-transitory machine-readable storage media as described below, such as a disk drive or other mass storage device which may include instructions/code and data 4830 and may implement the storage 4828 in some examples. Further, an audio I/O 4824 may be coupled to second interface 4820. Note that other architectures than the point-to-point architecture described above are possible. For example, instead of the point-to-point architecture, a system such as multiprocessor system 4800 may implement a multi-drop interface or other such architecture.
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high-performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip (SoC) that may be included on the same die as the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Example core architectures are described next, followed by descriptions of example processors and computer architectures.
Thus, different implementations of the processor 4900 may include: 1) a CPU with the special purpose logic 4908 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores, not shown), and the cores 4902(A)-(N) being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 4902(A)-(N) being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 4902(A)-(N) being a large number of general purpose in-order cores. Thus, the processor 4900 may be a general-purpose processor, coprocessor, or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 4900 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, complementary metal oxide semiconductor (CMOS), bipolar CMOS (BiCMOS), P-type metal oxide semiconductor (PMOS), or N-type metal oxide semiconductor (NMOS).
A memory hierarchy includes one or more levels of cache unit(s) circuitry 4904(A)-(N) within the cores 4902(A)-(N), a set of one or more shared cache unit(s) circuitry 4906, and external memory (not shown) coupled to the set of integrated memory controller unit(s) circuitry 4914. The set of one or more shared cache unit(s) circuitry 4906 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, such as a last level cache (LLC), and/or combinations thereof. While in some examples interface network circuitry 4912 (e.g., a ring interconnect) interfaces the special purpose logic 4908 (e.g., integrated graphics logic), the set of shared cache unit(s) circuitry 4906, and the system agent unit circuitry 4910, alternative examples use any number of well-known techniques for interfacing such units. In some examples, coherency is maintained between one or more of the shared cache unit(s) circuitry 4906 and cores 4902(A)-(N). In some examples, interface controller units circuitry 4916 couple the cores 4902 to one or more other devices 4918 such as one or more I/O devices, storage, one or more communication devices (e.g., wireless networking, wired networking, etc.), etc.
In some examples, one or more of the cores 4902(A)-(N) are capable of multi-threading. The system agent unit circuitry 4910 includes those components coordinating and operating cores 4902(A)-(N). The system agent unit circuitry 4910 may include, for example, power control unit (PCU) circuitry and/or display unit circuitry (not shown). The PCU may be or may include logic and components needed for regulating the power state of the cores 4902(A)-(N) and/or the special purpose logic 4908 (e.g., integrated graphics logic). The display unit circuitry is for driving one or more externally connected displays.
The cores 4902(A)-(N) may be homogeneous in terms of instruction set architecture (ISA). Alternatively, the cores 4902(A)-(N) may be heterogeneous in terms of ISA; that is, a subset of the cores 4902(A)-(N) may be capable of executing an ISA, while other cores may be capable of executing only a subset of that ISA or another ISA.
In
By way of example, the example register renaming, out-of-order issue/execution architecture core of
The front-end unit circuitry 5030 may include branch prediction circuitry 5032 coupled to instruction cache circuitry 5034, which is coupled to an instruction translation lookaside buffer (TLB) 5036, which is coupled to instruction fetch circuitry 5038, which is coupled to decode circuitry 5040. In one example, the instruction cache circuitry 5034 is included in the memory unit circuitry 5070 rather than the front-end circuitry 5030. The decode circuitry 5040 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode circuitry 5040 may further include address generation unit (AGU, not shown) circuitry. In one example, the AGU generates an LSU address using forwarded register ports, and may further perform branch forwarding (e.g., immediate offset branch forwarding, LR register branch forwarding, etc.). The decode circuitry 5040 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one example, the core 5090 includes a microcode ROM (not shown) or other medium that stores microcode for certain macroinstructions (e.g., in decode circuitry 5040 or otherwise within the front-end circuitry 5030). In one example, the decode circuitry 5040 includes a micro-operation (micro-op) or operation cache (not shown) to hold/cache decoded operations, micro-tags, or micro-operations generated during the decode or other stages of the processor pipeline 5000. The decode circuitry 5040 may be coupled to rename/allocator unit circuitry 5052 in the execution engine circuitry 5050.
The execution engine circuitry 5050 includes the rename/allocator unit circuitry 5052 coupled to retirement unit circuitry 5054 and a set of one or more scheduler(s) circuitry 5056. The scheduler(s) circuitry 5056 represents any number of different schedulers, including reservation stations, central instruction window, etc. In some examples, the scheduler(s) circuitry 5056 can include arithmetic logic unit (ALU) scheduler/scheduling circuitry, ALU queues, address generation unit (AGU) scheduler/scheduling circuitry, AGU queues, etc. The scheduler(s) circuitry 5056 is coupled to the physical register file(s) circuitry 5058. Each of the physical register file(s) circuitry 5058 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one example, the physical register file(s) circuitry 5058 includes vector registers unit circuitry, writemask registers unit circuitry, and scalar register unit circuitry. These register units may provide architectural vector registers, vector mask registers, general-purpose registers, etc. The physical register file(s) circuitry 5058 is coupled to the retirement unit circuitry 5054 (also known as a retire queue or a retirement queue) to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) (ROB(s)) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit circuitry 5054 and the physical register file(s) circuitry 5058 are coupled to the execution cluster(s) 5060.
The execution cluster(s) 5060 includes a set of one or more execution unit(s) circuitry 5062 and a set of one or more memory access circuitry 5064. The execution unit(s) circuitry 5062 may perform various arithmetic, logic, floating-point or other types of operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some examples may include a number of execution units or execution unit circuitry dedicated to specific functions or sets of functions, other examples may include only one execution unit circuitry or multiple execution units/execution unit circuitry that all perform all functions. The scheduler(s) circuitry 5056, physical register file(s) circuitry 5058, and execution cluster(s) 5060 are shown as being possibly plural because certain examples create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline that each have their own scheduler circuitry, physical register file(s) circuitry, and/or execution cluster—and in the case of a separate memory access pipeline, certain examples are implemented in which only the execution cluster of this pipeline has the memory access unit(s) circuitry 5064). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
In some examples, the execution engine unit circuitry 5050 may perform load store unit (LSU) address/data pipelining to an Advanced Microcontroller Bus (AMB) interface (not shown), and address phase and writeback, data phase load, store, and branches.
The set of memory access circuitry 5064 is coupled to the memory unit circuitry 5070, which includes data TLB circuitry 5072 coupled to data cache circuitry 5074 coupled to level 2 (L2) cache circuitry 5076. In one example, the memory access circuitry 5064 may include load unit circuitry, store address unit circuitry, and store data unit circuitry, each of which is coupled to the data TLB circuitry 5072 in the memory unit circuitry 5070. The instruction cache circuitry 5034 is further coupled to the level 2 (L2) cache circuitry 5076 in the memory unit circuitry 5070. In one example, the instruction cache 5034 and the data cache 5074 are combined into a single instruction and data cache (not shown) in L2 cache circuitry 5076, level 3 (L3) cache circuitry (not shown), and/or main memory. The L2 cache circuitry 5076 is coupled to one or more other levels of cache and eventually to a main memory.
The core 5090 may support one or more instruction sets (e.g., the x86 instruction set architecture (optionally with some extensions that have been added with newer versions); the MIPS instruction set architecture; the ARM instruction set architecture (optionally with additional extensions such as NEON)), including the instruction(s) described herein. In one example, the core 5090 includes logic to support a packed data instruction set architecture extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
In some examples, the register architecture 5200 includes writemask/predicate registers 5215. For example, in some examples, there are 8 writemask/predicate registers (sometimes called k0 through k7) that are each 16-bit, 32-bit, 64-bit, or 128-bit in size. Writemask/predicate registers 5215 may allow for merging (e.g., allowing any set of elements in the destination to be protected from updates during the execution of any operation) and/or zeroing (e.g., zeroing vector masks allow any set of elements in the destination to be zeroed during the execution of any operation). In some examples, each data element position in a given writemask/predicate register 5215 corresponds to a data element position of the destination. In other examples, the writemask/predicate registers 5215 are scalable and consist of a set number of enable bits for a given vector element (e.g., 8 enable bits per 64-bit vector element).
The register architecture 5200 includes a plurality of general-purpose registers 5225. These registers may be 16-bit, 32-bit, 64-bit, etc. and can be used for scalar operations. In some examples, these registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
In some examples, the register architecture 5200 includes scalar floating-point (FP) register file 5245 which is used for scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set architecture extension or as MMX registers to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
One or more flag registers 5240 (e.g., EFLAGS, RFLAGS, etc.) store status and control information for arithmetic, compare, and system operations. For example, the one or more flag registers 5240 may store condition code information such as carry, parity, auxiliary carry, zero, sign, and overflow. In some examples, the one or more flag registers 5240 are called program status and control registers.
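By way of a non-limiting illustration, the condition codes listed above may be derived from an 8-bit addition as sketched below in Python. The function name add8_flags and the dictionary keys are hypothetical and chosen only for this sketch; the parity and overflow rules follow the conventional x86 definitions, which is an assumption not stated in this description.

```python
def add8_flags(a: int, b: int) -> dict:
    """Add two 8-bit values and derive illustrative condition codes."""
    a &= 0xFF
    b &= 0xFF
    full = a + b              # unbounded sum, used to detect carry out of bit 7
    result = full & 0xFF      # value actually kept in an 8-bit destination
    return {
        "result": result,
        "carry": full > 0xFF,
        # Assumed x86 convention: parity is set for an even number of 1 bits.
        "parity": bin(result).count("1") % 2 == 0,
        # Auxiliary carry: carry out of the low nibble (bit 3 into bit 4).
        "aux_carry": ((a & 0xF) + (b & 0xF)) > 0xF,
        "zero": result == 0,
        "sign": bool(result & 0x80),
        # Signed overflow: operands share a sign that the result does not.
        "overflow": bool(~(a ^ b) & (a ^ result) & 0x80),
    }


if __name__ == "__main__":
    print(add8_flags(0x7F, 0x01))
    print(add8_flags(0xFF, 0x01))
```

In this sketch, adding 0x7F and 0x01 sets the sign and overflow indications without a carry, while adding 0xFF and 0x01 sets the carry and zero indications without overflow.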
Segment registers 5220 contain segment selectors for use in accessing memory. In some examples, these registers are referenced by the names CS, DS, SS, ES, FS, and GS.
Machine specific registers (MSRs) 5235 control and report on processor performance. Most MSRs 5235 handle system-related functions and are not accessible to an application program. Machine check registers 5260 consist of control, status, and error reporting MSRs that are used to detect and report on hardware errors.
One or more instruction pointer register(s) 5230 store an instruction pointer value. Control register(s) 5255 (e.g., CR0-CR4) determine the operating mode of a processor (e.g., processor 4870, 4880, 4838, 4815, and/or 4900) and the characteristics of a currently executing task. Debug registers 5250 control and allow for the monitoring of a processor or core's debugging operations.
Memory (mem) management registers 5265 specify the locations of data structures used in protected mode memory management. These registers may include a global descriptor table register (GDTR), an interrupt descriptor table register (IDTR), a task register, and a local descriptor table register (LDTR).
Alternative examples may use wider or narrower registers. Additionally, alternative examples may use more, fewer, or different register files and registers. The register architecture 5200 may, for example, be used in registers 103, 121, or physical register file(s) circuitry 5058.
An instruction set architecture (ISA) may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed and/or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because there are fewer fields included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an example ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. In addition, though the description below is made in the context of x86 ISA, it is within the knowledge of one skilled in the art to apply the teachings of the present disclosure in another ISA.
Examples of the instruction(s) described herein may be embodied in different formats. Additionally, example systems, architectures, and pipelines are detailed below. Examples of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.
The prefix(es) field(s) 5301, when used, modifies an instruction. In some examples, one or more prefixes are used to repeat string instructions (e.g., 0xF2, 0xF3, etc.), to provide segment overrides (e.g., 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, etc.), to perform bus lock operations (e.g., 0xF0), and/or to change operand (e.g., 0x66) and address sizes (e.g., 0x67). Certain instructions require a mandatory prefix (e.g., 0x66, 0xF2, 0xF3, etc.). Certain of these prefixes may be considered “legacy” prefixes. Other prefixes, one or more examples of which are detailed herein, indicate, and/or provide further capability, such as specifying particular registers, etc. The other prefixes typically follow the “legacy” prefixes.
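By way of a non-limiting illustration, the leading “legacy” prefix bytes of an instruction may be separated from the bytes that follow as sketched below in Python. The table LEGACY_PREFIXES, its labels, and the function split_legacy_prefixes are hypothetical names for this sketch, and the byte-to-role mapping assumes the conventional x86 assignment of these values. The sketch does not distinguish a mandatory prefix from an ordinary one, which in practice depends on the opcode that follows.

```python
# Assumed conventional x86 roles for the prefix byte values discussed above.
LEGACY_PREFIXES = {
    0xF0: "lock",
    0xF2: "repne",
    0xF3: "rep",
    0x2E: "seg-cs",
    0x36: "seg-ss",
    0x3E: "seg-ds",
    0x26: "seg-es",
    0x64: "seg-fs",
    0x65: "seg-gs",
    0x66: "operand-size",
    0x67: "address-size",
}


def split_legacy_prefixes(code: bytes):
    """Return (prefix labels, remaining bytes) for one encoded instruction."""
    labels = []
    i = 0
    # Consume bytes for as long as they are recognized legacy prefixes.
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        labels.append(LEGACY_PREFIXES[code[i]])
        i += 1
    return labels, code[i:]


if __name__ == "__main__":
    print(split_legacy_prefixes(bytes([0xF3, 0xA4])))
    print(split_legacy_prefixes(bytes([0x66, 0x2E, 0x90])))
```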
The opcode field 5303 is used to at least partially define the operation to be performed upon a decoding of the instruction. In some examples, a primary opcode encoded in the opcode field 5303 is one, two, or three bytes in length. In other examples, a primary opcode can be a different length. An additional 3-bit opcode field is sometimes encoded in another field.
The addressing information field 5305 is used to address one or more operands of the instruction, such as a location in memory or one or more registers.
The content of the MOD field 5442 distinguishes between memory access and non-memory access modes. In some examples, when the MOD field 5442 has a binary value of 11 (11b), a register-direct addressing mode is utilized, and otherwise a register-indirect addressing mode is used.
The register field 5444 may encode either the destination register operand or a source register operand or may encode an opcode extension and not be used to encode any instruction operand. The content of register field 5444, directly or through address generation, specifies the locations of a source or destination operand (either in a register or in memory). In some examples, the register field 5444 is supplemented with an additional bit from a prefix (e.g., prefix 5301) to allow for greater addressing.
The R/M field 5446 may be used to encode an instruction operand that references a memory address or may be used to encode either the destination register operand or a source register operand. Note the R/M field 5446 may be combined with the MOD field 5442 to dictate an addressing mode in some examples.
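By way of a non-limiting illustration, the three fields of the MOD R/M byte may be extracted as sketched below in Python. The function name decode_modrm is hypothetical, and the bit positions (MOD in bits 7:6, reg in bits 5:3, R/M in bits 2:0) assume the conventional x86 layout, which is not spelled out in the text above.

```python
def decode_modrm(byte: int) -> dict:
    """Split a MOD R/M byte into its MOD, reg, and R/M fields."""
    mod = (byte >> 6) & 0b11   # assumed position: bits 7:6
    reg = (byte >> 3) & 0b111  # assumed position: bits 5:3
    rm = byte & 0b111          # assumed position: bits 2:0
    return {
        "mod": mod,
        "reg": reg,
        "rm": rm,
        # A MOD value of 11b selects register-direct addressing.
        "register_direct": mod == 0b11,
    }


if __name__ == "__main__":
    print(decode_modrm(0xC3))
    print(decode_modrm(0x44))
```

In this sketch, the byte 0xC3 decodes to a register-direct form, while 0x44 decodes to a memory form in which the R/M field would be interpreted together with the MOD field.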
The SIB byte 5404 includes a scale field 5452, an index field 5454, and a base field 5456 to be used in the generation of an address. The scale field 5452 indicates a scaling factor. The index field 5454 specifies an index register to use. In some examples, the index field 5454 is supplemented with an additional bit from a prefix (e.g., prefix 5301) to allow for greater addressing. The base field 5456 specifies a base register to use. In some examples, the base field 5456 is supplemented with an additional bit from a prefix (e.g., prefix 5301) to allow for greater addressing. In practice, the content of the scale field 5452 allows for the scaling of the content of the index field 5454 for memory address generation (e.g., for address generation that uses 2^scale*index+base).
Some addressing forms utilize a displacement value to generate a memory address. For example, a memory address may be generated according to 2^scale*index+base+displacement, index*scale+displacement, r/m+displacement, instruction pointer (RIP/EIP)+displacement, register+displacement, etc. The displacement may be a 1-byte, 2-byte, 4-byte, etc. value. In some examples, the displacement field 5307 provides this value. Additionally, in some examples, a displacement factor usage is encoded in the MOD field of the addressing information field 5305 that indicates a compressed displacement scheme for which a displacement value is calculated and stored in the displacement field 5307.
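By way of a non-limiting illustration, the scaled-index form of address generation (two raised to the scale, multiplied by the index register content, plus the base register content, plus the displacement) may be sketched in Python as follows. The function names decode_sib and effective_address and the list regs standing in for a register file are hypothetical. The SIB bit positions assume the conventional x86 layout, and special encodings (e.g., an index value meaning that no index is used) are deliberately omitted.

```python
def decode_sib(byte: int):
    """Split a SIB byte into (scale, index, base); bit positions are assumed."""
    scale = (byte >> 6) & 0b11
    index = (byte >> 3) & 0b111
    base = byte & 0b111
    return scale, index, base


def effective_address(regs, sib: int, displacement: int = 0, width: int = 64) -> int:
    """Compute 2**scale * regs[index] + regs[base] + displacement."""
    scale, index, base = decode_sib(sib)
    address = (regs[index] << scale) + regs[base] + displacement
    # Wrap to the address width so that negative displacements behave sensibly.
    return address & ((1 << width) - 1)


if __name__ == "__main__":
    regs = [0, 0x10, 0, 0x1000, 0, 0, 0, 0]  # hypothetical register contents
    sib = 0b10_001_011                       # scale=2, index=1, base=3
    print(hex(effective_address(regs, sib, displacement=8)))
```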
In some examples, the immediate value field 5309 specifies an immediate value for the instruction. An immediate value may be encoded as a 1-byte value, a 2-byte value, a 4-byte value, etc.
Instructions using the first prefix 5301(A) may specify up to three registers using 3-bit fields depending on the format: 1) using the reg field 5444 and the R/M field 5446 of the MOD R/M byte 5402; 2) using the MOD R/M byte 5402 with the SIB byte 5404 including using the reg field 5444 and the base field 5456 and index field 5454; or 3) using the register field of an opcode.
In the first prefix 5301(A), bit positions 7:4 are set as 0100. Bit position 3 (W) can be used to determine the operand size but may not solely determine operand width. As such, when W=0, the operand size is determined by a code segment descriptor (CS.D) and when W=1, the operand size is 64-bit.
Note that the addition of another bit allows for 16 (2^4) registers to be addressed, whereas the MOD R/M reg field 5444 and MOD R/M R/M field 5446 alone can each only address 8 registers.
In the first prefix 5301(A), bit position 2 (R) may be an extension of the MOD R/M reg field 5444 and may be used to modify the MOD R/M reg field 5444 when that field encodes a general-purpose register, a 64-bit packed data register (e.g., an SSE register), or a control or debug register. R is ignored when MOD R/M byte 5402 specifies other registers or defines an extended opcode.
Bit position 1 (X) may modify the SIB byte index field 5454.
Bit position 0 (B) may modify the base in the MOD R/M R/M field 5446 or the SIB byte base field 5456; or it may modify the opcode register field used for accessing general purpose registers (e.g., general purpose registers 5225).
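By way of a non-limiting illustration, the W, R, X, and B bits of the first prefix may be extracted, and one of them used to widen a 3-bit register field to 4 bits, as sketched below in Python. The function names decode_first_prefix and extend_register are hypothetical and are used only for this sketch.

```python
def decode_first_prefix(byte: int):
    """Return the W, R, X, B bits, or None if bits 7:4 are not 0100."""
    if (byte >> 4) != 0b0100:
        return None
    return {
        "W": (byte >> 3) & 1,  # bit position 3
        "R": (byte >> 2) & 1,  # bit position 2, extends the MOD R/M reg field
        "X": (byte >> 1) & 1,  # bit position 1, extends the SIB index field
        "B": byte & 1,         # bit position 0, extends R/M, SIB base, or opcode reg
    }


def extend_register(extension_bit: int, field3: int) -> int:
    """Prepend one prefix bit to a 3-bit field, giving a value from 0 to 15."""
    return ((extension_bit & 1) << 3) | (field3 & 0b111)


if __name__ == "__main__":
    bits = decode_first_prefix(0x4C)
    print(bits)
    print(extend_register(bits["R"], 0b001))
```

The added bit doubles the addressable range, so a 3-bit field that alone selects one of 8 registers selects one of 16 when combined with the corresponding prefix bit.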
In some examples, the second prefix 5301(B) comes in two forms—a two-byte form and a three-byte form. The two-byte second prefix 5301(B) is used mainly for 128-bit, scalar, and some 256-bit instructions; while the three-byte second prefix 5301(B) provides a compact replacement of the first prefix 5301(A) and 3-byte opcode instructions.
Instructions that use this prefix may use the MOD R/M R/M field 5446 to encode the instruction operand that references a memory address or encode either the destination register operand or a source register operand.
Instructions that use this prefix may use the MOD R/M reg field 5444 to encode either the destination register operand or a source register operand, or to be treated as an opcode extension and not used to encode any instruction operand.
For instruction syntax that supports four operands, vvvv, the MOD R/M R/M field 5446 and the MOD R/M reg field 5444 encode three of the four operands. Bits[7:4] of the immediate value field 5309 are then used to encode the third source register operand.
Bit[7] of byte 2 5717 is used similarly to W of the first prefix 5301(A) including helping to determine promotable operand sizes. Bit[2] is used to dictate the length (L) of the vector (where a value of 0 is a scalar or 128-bit vector and a value of 1 is a 256-bit vector). Bits[1:0] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00=no prefix, 01=66H, 10=F3H, and 11=F2H). Bits[6:3], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, the field is reserved and should contain a certain value, such as 1111b.
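By way of a non-limiting illustration, the fields of byte 2 described above may be unpacked as sketched below in Python. The function name decode_second_prefix_byte2, the table IMPLIED_PREFIX, and the dictionary keys are hypothetical. The sketch simply un-inverts vvvv and does not decide which of the three uses of that field applies, since that depends on the instruction.

```python
# Bits[1:0] mapped to the equivalent legacy prefix byte (None means no prefix).
IMPLIED_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}


def decode_second_prefix_byte2(byte: int) -> dict:
    """Unpack W, vvvv, L, and the implied legacy prefix from byte 2."""
    vvvv = (byte >> 3) & 0b1111
    return {
        "W": (byte >> 7) & 1,
        "vvvv_raw": vvvv,
        # vvvv is stored in 1s complement form, so invert it to get the register.
        "vvvv_register": (~vvvv) & 0b1111,
        # L = 1 selects a 256-bit vector; L = 0 is a scalar or 128-bit vector.
        "is_256_bit": bool((byte >> 2) & 1),
        "implied_prefix": IMPLIED_PREFIX[byte & 0b11],
    }


if __name__ == "__main__":
    print(decode_second_prefix_byte2(0xF5))
    print(decode_second_prefix_byte2(0x78))
```

In this sketch, a raw vvvv value of 1111b un-inverts to register 0, which is also the reserved pattern used when the field encodes no operand.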
Instructions that use this prefix may use the MOD R/M R/M field 5446 to encode the instruction operand that references a memory address or encode either the destination register operand or a source register operand.
Instructions that use this prefix may use the MOD R/M reg field 5444 to encode either the destination register operand or a source register operand, or to be treated as an opcode extension and not used to encode any instruction operand.
For instruction syntax that supports four operands, vvvv, the MOD R/M R/M field 5446, and the MOD R/M reg field 5444 encode three of the four operands. Bits[7:4] of the immediate value field 5309 are then used to encode the third source register operand.
The third prefix 5301(C) can encode 32 vector registers (e.g., 128-bit, 256-bit, and 512-bit registers) in 64-bit mode. In some examples, instructions that utilize a writemask/opmask (see discussion of registers in a previous figure, such as
The third prefix 5301(C) may encode functionality that is specific to instruction classes (e.g., a packed instruction with “load+op” semantic can support embedded broadcast functionality, a floating-point instruction with rounding semantic can support static rounding functionality, a floating-point instruction with non-rounding arithmetic semantic can support “suppress all exceptions” functionality, etc.).
The first byte of the third prefix 5301(C) is a format field 5811 that has a value, in one example, of 62H. Subsequent bytes are referred to as payload bytes 5815-5819 and collectively form a 24-bit value of P[23:0] providing specific capability in the form of one or more fields (detailed herein).
In some examples, P[1:0] of payload byte 5819 are identical to the low two mm bits. P[3:2] are reserved in some examples. Bit P[4] (R′) allows access to the high 16 vector register set when combined with P[7] and the MOD R/M reg field 5444. P[6] can also provide access to a high 16 vector register when SIB-type addressing is not needed. P[7:5] consist of R, X, and B which are operand specifier modifier bits for vector register, general purpose register, memory addressing and allow access to the next set of 8 registers beyond the low 8 registers when combined with the MOD R/M register field 5444 and MOD R/M R/M field 5446. P[9:8] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00=no prefix, 01=66H, 10=F3H, and 11=F2H). P[10] in some examples is a fixed value of 1. P[14:11], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, the field is reserved and should contain a certain value, such as 1111b.
P[15] is similar to W of the first prefix 5301(A) and second prefix 5301(B) and may serve as an opcode extension bit or operand size promotion.
P[18:16] specify the index of a register in the opmask (writemask) registers (e.g., writemask/predicate registers 5215). In one example, the specific value aaa=000 has a special behavior implying no opmask is used for the particular instruction (this may be implemented in a variety of ways including the use of an opmask hardwired to all ones or hardware that bypasses the masking hardware). When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in one example, preserving the old value of each element of the destination where the corresponding mask bit has a 0. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one example, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, it is not necessary that the elements that are modified be consecutive. Thus, the opmask field allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While examples are described in which the opmask field's content selects one of a number of opmask registers that contains the opmask to be used (and thus the opmask field's content indirectly identifies that masking to be performed), alternative examples instead or additionally allow the mask write field's content to directly specify the masking to be performed.
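By way of a non-limiting illustration, the difference between merging and zeroing may be sketched in Python as follows, with lists standing in for vector registers and an integer standing in for an opmask. The function name masked_op and its parameters are hypothetical, and element-wise addition is used only as a stand-in for the base operation.

```python
def masked_op(dest, src_a, src_b, mask: int, zeroing: bool, op=lambda x, y: x + y):
    """Apply op per element where the mask bit is 1; otherwise merge or zero."""
    out = []
    for i, old in enumerate(dest):
        if (mask >> i) & 1:
            out.append(op(src_a[i], src_b[i]))  # element is updated
        elif zeroing:
            out.append(0)                       # zeroing: masked-off element becomes 0
        else:
            out.append(old)                     # merging: old destination value is kept
    return out


if __name__ == "__main__":
    dest = [9, 9, 9, 9]
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    print(masked_op(dest, a, b, mask=0b0101, zeroing=False))
    print(masked_op(dest, a, b, mask=0b0101, zeroing=True))
```

With a mask of 0101b, only elements 0 and 2 are updated, which also shows that the modified elements need not be consecutive.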
P[19] can be combined with P[14:11] to encode a second source vector register in a non-destructive source syntax which can access an upper 16 vector registers using P[19]. P[20] encodes multiple functionalities, which differs across different classes of instructions and can affect the meaning of the vector length/rounding control specifier field (P[22:21]). P[23] indicates support for merging-writemasking (e.g., when set to 0) or support for zeroing and merging-writemasking (e.g., when set to 1).
Examples of encoding of registers in instructions using the third prefix 5301(C) are detailed in the following tables.
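By way of a non-limiting illustration, the 24-bit payload value P[23:0] of the third prefix may be split into the fields described above as sketched below in Python. The function names bits and decode_third_prefix_payload and the dictionary keys (e.g., R_prime, bit19, bit20, length_rounding) are hypothetical labels for this sketch. The assignment of R, X, and B to P[7], P[6], and P[5], respectively, is inferred from the description and should be treated as an assumption. No inversion is applied to any field, and the assembly of the payload bytes into P is left to the caller.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return value[hi:lo] as an unsigned integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_third_prefix_payload(p: int) -> dict:
    """Split a 24-bit payload P[23:0] into raw (un-inverted) fields."""
    return {
        "mm": bits(p, 1, 0),
        "R_prime": bits(p, 4, 4),
        "B": bits(p, 5, 5),        # assumed position
        "X": bits(p, 6, 6),        # assumed position
        "R": bits(p, 7, 7),
        "pp": bits(p, 9, 8),
        "vvvv": bits(p, 14, 11),
        "W": bits(p, 15, 15),
        "aaa": bits(p, 18, 16),    # opmask register index
        "bit19": bits(p, 19, 19),  # combined with vvvv for upper 16 registers
        "bit20": bits(p, 20, 20),  # class-specific functionality
        "length_rounding": bits(p, 22, 21),
        "zeroing": bits(p, 23, 23),
    }


if __name__ == "__main__":
    p = ((1 << 23) | (0b10 << 21) | (0b011 << 16) | (1 << 15)
         | (0b1111 << 11) | (1 << 10) | (0b01 << 8) | (1 << 7) | 0b01)
    print(decode_third_prefix_payload(p))
```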
Program code may be applied to input information to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microprocessor, or any combination thereof.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
Examples of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Examples may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
One or more aspects of at least one example may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “intellectual property (IP) cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Accordingly, examples also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such examples may also be referred to as program products.
Emulation (including binary translation, code morphing, etc.).
In some cases, an instruction converter may be used to convert an instruction from a source instruction set architecture to a target instruction set architecture. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
References to “one example,” “an example,” etc., indicate that the example described may include a particular feature, structure, or characteristic, but every example may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same example. Further, when a particular feature, structure, or characteristic is described in connection with an example, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other examples whether or not explicitly described.
Moreover, in the various examples described above, unless specifically noted otherwise, disjunctive language such as the phrase “at least one of A, B, or C” or “A, B, and/or C” is intended to be understood to mean either A, B, or C, or any combination thereof (i.e., A and B, A and C, B and C, and A, B and C).
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.