Example embodiments of the present disclosure relate generally to event analysis and, more particularly, to a scalable and secure method to retrieve data from an integrated circuit (IC) in response to a trigger event.
Data centers, integral to modern computing infrastructure, rely heavily on integrated circuits (ICs) within their server farms. These ICs, composed of multiple cores running diverse loads, are prone to unpredictable crashes that are difficult to diagnose. Traditional methods, such as using Reliability, Availability, and Serviceability (RAS) records, often prove inadequate for helping users understand crash details. In portable computing devices, the constraints of cost and complexity limit the options for troubleshooting. Meanwhile, in server farms, solutions like Baseboard Management Controllers (BMCs) present their own challenges, including potential security vulnerabilities. The existing solutions are thus found to be either inadequate, slow, insecure, or expensive, leading to a pressing need for more effective approaches to manage and diagnose system failures in various computing environments.
Applicant has identified a number of deficiencies and problems associated with crash analysis. Many of these identified problems have been solved by developing solutions that are included in embodiments of the present disclosure, many examples of which are described in detail herein.
Systems, methods, and computer program products are therefore provided for automated data retrieval from an integrated circuit (IC).
In one aspect, a scan island is presented. The scan island comprising: a data extraction module, configured to: extract data from a plurality of scan chains and a plurality of random-access memories (RAMs) associated with an integrated circuit (IC) in response to a trigger event; and store the data in an external non-volatile storage media; and a clock and reset module configured to coordinate operations of the scan island, wherein the clock and reset module comprises: a free-running independent clock configured to facilitate continuous operation of the scan island upon occurrence of the trigger event; and a local reset module configured to re-initialize the scan island in a known state upon occurrence of the trigger event without external intervention, wherein the scan island is a partition of the IC that is isolated for data retrieval.
In some embodiments, the trigger event comprises at least a malfunction of the IC.
In some embodiments, the data comprises information associated with the trigger event, wherein the information comprises at least one of information associated with the IC, a debug configuration state of the IC, a configuration state of one or more components of the IC, firmware and/or software measurements of the one or more components of the IC, an error state, or configuration information associated with the scan island.
In some embodiments, the scan island is further configured to: reboot the IC upon storing the data in the external non-volatile storage media.
In some embodiments, the scan island is further configured to: determine whether the reboot is in response to the trigger event; and transmit the data from the external non-volatile storage media to an original equipment manufacturer (OEM) server for post-event analysis in an instance in which the reboot is in response to the trigger event.
In some embodiments, the scan island further comprises a data security module, wherein the data security module is configured to: determine that a first portion of the data is associated with a first user; mask, using the scan island, the first portion of the data upon determining that the first portion of the data is associated with the first user, wherein the first portion of the data is masked using a dynamic mask pattern generator, wherein the first portion of the data is dynamically masked during extraction of the data; and store the data in an external non-volatile storage media, wherein the first portion of the data is masked.
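By way of illustration only, the dynamic masking described above may be sketched as follows. The function names, the SHA-256-based pattern generator, and the XOR masking are hypothetical stand-ins chosen for the sketch; a hardware implementation might instead use, for example, an LFSR-driven pattern applied on the scan-out path.

```python
import hashlib

def mask_pattern(seed: bytes, length: int) -> bytes:
    """Derive a deterministic mask pattern of the requested length from a seed.

    Hypothetical stand-in for the dynamic mask pattern generator; a hardware
    implementation might use an LFSR or similar pattern source instead.
    """
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])

def extract_with_masking(payload: bytes, belongs_to_first_user: bool, seed: bytes) -> bytes:
    """Mask a scan-chain/RAM payload during extraction when it is owned by the first user."""
    if not belongs_to_first_user:
        return payload
    pattern = mask_pattern(seed, len(payload))
    # Apply the mask during extraction itself, so the first user's raw state
    # never reaches the external storage media in the clear.
    return bytes(b ^ p for b, p in zip(payload, pattern))
```

Because the pattern is derived per extraction, varying the seed allows the mask to track dynamically changing ownership boundaries within the IC.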
In some embodiments, the data security module is further configured to: determine a first subset of scan chains and a first subset of RAMs associated with the first portion of the data; bypass the first subset of scan chains and the first subset of RAMs when extracting the data upon determining that the first portion of the data is associated with the first user; and store the data in an external non-volatile storage media, wherein the first portion of the data is excluded.
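The bypass-based exclusion described above may be sketched as follows. The dictionary-based model of scan chains and RAMs, and all identifiers, are purely illustrative.

```python
def extract_excluding(chains: dict, ram_dumps: dict,
                      bypass_chains: set, bypass_rams: set) -> dict:
    """Extract all scan chains and RAM contents except those owned by the first user.

    `chains` and `ram_dumps` map identifiers to raw payloads; the bypass sets
    name the first subset of scan chains and RAMs to skip during extraction.
    """
    data = {}
    for cid, payload in chains.items():
        if cid not in bypass_chains:        # bypass the first user's scan chains
            data[f"chain:{cid}"] = payload
    for rid, payload in ram_dumps.items():
        if rid not in bypass_rams:          # bypass the first user's RAMs
            data[f"ram:{rid}"] = payload
    return data
```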
In some embodiments, the data security module is further configured to: determine whether the first user is authorized to access the first portion of the data; and transmit the first portion of the data to the first user in an instance in which the first user is authorized to access the first portion of the data.
In some embodiments, the data security module is further configured to: establish a network session with the first user upon determining that the first portion of the data is associated with the first user; transmit, via the network session, a request to the first user to submit an authentication token to access the first portion of the data, wherein the request comprises a unique identifier associated with the IC and a session identifier associated with the network session; receive, via the network session, the authentication token from the first user in response to the request; validate the authentication token following receipt of the authentication token; and determine that the first user is authorized to access the first portion of the data based on at least validating the authentication token.
In some embodiments, the authentication token comprises at least one of the session identifier, the unique identifier, a first set of parameters for extraction of the data, or a digital signature of the first user, wherein the digital signature of the first user is generated using a first private key associated with the first user.
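A minimal sketch of the token validation described above follows. An HMAC over the token body stands in for the digital signature generated with the first user's private key; a real deployment would verify an asymmetric signature instead, and all field names are hypothetical.

```python
import hashlib
import hmac
import json

def validate_token(token: dict, expected_session: str,
                   expected_chip_id: str, user_key: bytes) -> bool:
    """Validate an authentication token of the shape described above.

    Illustrative only: an HMAC stands in for the first user's private-key
    digital signature, and the field names are assumptions for the sketch.
    """
    body = {k: token[k] for k in ("session_id", "chip_id", "extraction_params")}
    payload = json.dumps(body, sort_keys=True).encode()
    expected_sig = hmac.new(user_key, payload, hashlib.sha256).hexdigest()
    return (
        token["session_id"] == expected_session      # bound to this network session
        and token["chip_id"] == expected_chip_id     # bound to this specific IC
        and hmac.compare_digest(token["signature"], expected_sig)
    )
```

Binding the token to both the session identifier and the unique IC identifier prevents a captured token from being replayed against a different session or device.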
In some embodiments, the data extraction module is further configured to: extract the first portion of the data based on at least the first set of parameters in response to the trigger event.
In some embodiments, the data security module is further configured to: generate an encryption key; encrypt the first portion of the data using the encryption key; encrypt the encryption key using a first public key associated with the first user; and transmit the encrypted first portion of the data and the encrypted encryption key to the first user.
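The key-generation and key-wrapping steps described above follow an envelope-encryption pattern, sketched below. The SHA-256 counter-mode keystream is a toy stand-in for a real symmetric cipher (e.g., AES-GCM), and `wrap_key_for_user` stands in for public-key encryption under the first user's public key (e.g., RSA-OAEP); all names are hypothetical.

```python
import hashlib
import os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy SHA-256 counter-mode keystream cipher; illustration only, not for real use."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        ks = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out += bytes(c ^ k for c, k in zip(chunk, ks))
    return bytes(out)

def envelope_encrypt(portion: bytes, wrap_key_for_user):
    """Envelope-encrypt the first user's data portion.

    `wrap_key_for_user` stands in for encrypting the data key under the
    first user's public key, so only that user can unwrap it.
    """
    data_key = os.urandom(32)                       # generate an encryption key
    ciphertext = keystream_xor(data_key, portion)   # encrypt the data portion
    wrapped_key = wrap_key_for_user(data_key)       # encrypt the key for the user
    return ciphertext, wrapped_key
```

Only the wrapped key and the ciphertext are transmitted, so intermediaries such as a BMC never see the first user's portion in the clear.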
In some embodiments, the data security module is further configured to: attest the first portion of the data prior to encrypting the first portion of the data using the encryption key, thereby ensuring integrity of the data.
In some embodiments, the data security module is further configured to: determine that one or more portions of the data is associated with one or more users; determine whether the one or more users is authorized to access the one or more portions of the data; and transmit the one or more portions of the data to the one or more users in an instance in which the one or more users is authorized to access the one or more portions of the data.
In some embodiments, the scan island further comprises a data processing module, wherein the data processing module is configured to: filter the data based on at least security and isolation policies associated with the scan island; and reformat the data from an initial format to a standardized format upon filtering the data, wherein the data processing module is associated with the scan island.
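Illustratively, the filter-then-reformat behavior of the data processing module may be sketched as follows. The triple-based initial format, the region-based policy model, and the choice of JSON as the standardized format are assumptions made for the sketch.

```python
import json

def process_records(records, allowed_regions):
    """Filter raw records against a security/isolation policy, then reformat.

    `records` are (region, name, value) triples in an assumed initial format;
    the standardized output here is a sorted JSON document (illustrative choice).
    """
    kept = [r for r in records if r[0] in allowed_regions]   # policy filter
    standardized = {f"{region}.{name}": value for region, name, value in kept}
    return json.dumps(standardized, sort_keys=True)
```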
In another aspect, a method for automated data retrieval from an integrated circuit (IC) using a scan island is presented. The method comprising: extracting, using a data extraction module within a scan island, data from a plurality of scan chains and a plurality of random-access memories (RAMs) associated with an IC in response to a trigger event; storing, using the data extraction module, the data in an external non-volatile storage media; continuously operating, using a free-running independent clock within the scan island, the scan island upon occurrence of the trigger event; and re-initializing, using a local reset module within the scan island, the scan island in a known state following the trigger event without external intervention, wherein the scan island is a partition of the IC that is isolated for data retrieval.
In yet another aspect, a computer program product for automated data retrieval from an integrated circuit (IC) using a scan island is presented. The computer program product comprising a non-transitory computer-readable medium comprising code configured to cause an apparatus to: extract, using a data extraction module within a scan island, data from a plurality of scan chains and a plurality of random-access memories (RAMs) associated with an IC in response to a trigger event; store, using the data extraction module, the data in an external non-volatile storage media; continuously operate, using a free-running independent clock within the scan island, the scan island upon occurrence of the trigger event; and re-initialize, using a local reset module within the scan island, the scan island in a known state following the trigger event without external intervention, wherein the scan island is a partition of the IC that is isolated for data retrieval.
The above summary is provided merely for purposes of summarizing some example embodiments to provide a basic understanding of some aspects of the present disclosure. Accordingly, it will be appreciated that the above-described embodiments are merely examples and should not be construed to narrow the scope or spirit of the disclosure in any way. It will be appreciated that the scope of the present disclosure encompasses many potential embodiments in addition to those here summarized, some of which will be further described below.
Having described certain example embodiments of the present disclosure in general terms above, reference will now be made to the accompanying drawings. The components illustrated in the figures may or may not be present in certain embodiments described herein. Some embodiments may include fewer (or more) components than those shown in the figures.
Data centers often utilize a significant number of integrated circuits (ICs) within their server farms. These ICs are made up of multiple cores running diverse loads simultaneously, which can lead to unpredictable crashes that are challenging to replicate or diagnose. The standard approach of using Reliability, Availability, and Serviceability (RAS) records, which contain minimal register and error code details, usually falls short of providing a comprehensive understanding of crash details and depends heavily on replication of the issue. As a result, resources may be unavailable for extended periods while investigators attempt to identify the underlying cause of the problem. Expanding the number of registers and records is not a practical solution, as it presumes a precognitive understanding of potential failures.
In computing environments such as laptops or portable devices, product cost significantly influences the platform design. Incorporating complicated external microcontrollers is not a viable option due to their expense. These systems are often used in real-world settings where the typical user either lacks the ability or the financial means to employ sophisticated tools to troubleshoot a system failure. Requiring the system to be taken to a service station for comprehensive analysis can be a drawn-out process, leading to a poor user experience due to the time and inconvenience factors involved.
In server farms, Baseboard Management Controllers (BMCs) can be used on platforms to execute basic telemetry and recovery operations. Nonetheless, BMCs are generally simpler and potentially less secure as compared to the IC. Differences in security policies, trust levels, and ownership among the BMC, the primary processor vendor, and the data center owner add complexities to the management of data extracted from the IC. Additionally, these scenarios are vulnerable to man-in-the-middle attacks if the data is extracted in a remote environment, posing a security risk to all parties involved, including the original equipment manufacturer (OEM). Therefore, the existing solutions, being either inadequate, slow, insecure, or expensive, are not practical for large-scale deployment in remote systems.
Embodiments of the invention present a scalable and secure method to retrieve data from an IC in response to a trigger event (e.g., an IC malfunction), as well as techniques for the processing and transmission of the data while ensuring confidentiality requirements are met. To this end, embodiments of the present invention use a scan island—a specific area within the IC that is isolated to ensure uninterrupted functionality—to perform specific operations in response to a trigger event. The scan island may include a data extraction module that is configured to extract data associated with the IC when a trigger event occurs. In addition, the scan island may include a data security module that is used to enable a fine-tuned exclusion and/or masking of data depending on the dynamically changing boundaries and asset distribution within the IC at any given time. The data security module may be used to ensure permissioned access based on data ownership among multiple users. The data security module may also be used to attest the extracted data to prevent spoofing attacks on the extracted data. In example embodiments, the scan island may include a data filtering module and a data compression module to further facilitate the processing of data. Once processed, the data is then stored in a non-volatile storage media for distribution. Upon storing the data in the non-volatile storage media, the IC is rebooted. When rebooting, embodiments of the invention determine whether the IC reboot is in response to the trigger event. If the reboot is in response to the trigger event, the data from the non-volatile storage media is transmitted to the authorized parties, such as an OEM server, for post-event analysis.
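The trigger-to-transmission flow described above may be sketched as follows. All names, and the plain-field modeling of the storage media and reboot cause, are hypothetical; the sketch only illustrates the ordering of operations (extract and process, store, reboot, then transmit only when the reboot was caused by the trigger event).

```python
class ScanIslandFlow:
    """Sketch of the trigger-to-transmission flow; module boundaries are hypothetical."""

    def __init__(self):
        self.storage = None        # models the external non-volatile storage media
        self.reboot_cause = None   # models the recorded cause of the last reboot
        self.transmitted = None    # models data sent to the OEM server

    def on_trigger(self, raw_data: bytes, process) -> None:
        """Handle a trigger event: process (mask/filter/compress), store, then reboot."""
        processed = process(raw_data)         # data security / processing modules
        self.storage = processed              # persist before volatile state is lost
        self.reboot_cause = "trigger_event"   # the IC is then rebooted

    def on_boot(self) -> None:
        """After reboot, transmit stored data only if the reboot followed a trigger event."""
        if self.reboot_cause == "trigger_event":
            self.transmitted = self.storage   # e.g., to an OEM server for analysis
```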
Embodiments of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the present disclosure are shown. Indeed, the present disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Where possible, any terms expressed in the singular form herein are meant to also include the plural form and vice versa, unless explicitly stated otherwise. Also, as used herein, the term “a” and/or “an” shall mean “one or more,” even though the phrase “one or more” is also used herein. Furthermore, when it is said herein that something is “based on” something else, it may be based on one or more other things as well. In other words, unless expressly indicated otherwise, as used herein “based on” means “based at least in part on” or “based at least partially on.” Like numbers refer to like elements throughout.
As used herein, “operatively coupled” may mean that the components are electronically coupled and/or are in electrical communication with one another, or optically coupled and/or are in optical communication with one another. Furthermore, “operatively coupled” may mean that the components may be formed integrally with each other or may be formed separately and coupled together. Furthermore, “operatively coupled” may mean that the components may be directly connected to each other or may be connected to each other with one or more components (e.g., connectors) located between the components that are operatively coupled together. Furthermore, “operatively coupled” may mean that the components are detachable from each other or that they are permanently coupled together.
As used herein, “determining” may encompass a variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, ascertaining, and/or the like. Furthermore, “determining” may also include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and/or the like. Also, “determining” may include resolving, selecting, choosing, calculating, establishing, and/or the like. Determining may also include ascertaining that a parameter matches a predetermined criterion, including that a threshold has been met, passed, exceeded, satisfied, etc.
As used herein, the term “first” is not intended to impart a numerical or serial limitation. Instead, the term “first” is used solely to distinguish or identify a particular element or set of elements from other similar elements within the context of the invention. For example, the term “first user” may refer to one particular user, who may be the sole user or one user among a plurality of users, and does not signify any particular sequence, order, or hierarchy among the users.
It should be understood that the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any implementation described herein as “exemplary” is not necessarily to be construed as advantageous over other implementations.
Furthermore, as would be evident to one of ordinary skill in the art in light of the present disclosure, the terms “substantially” and “approximately” indicate that the referenced element or associated description is accurate to within applicable engineering tolerances.
The server systems 102 may house multiple servers, each containing various computing resources, otherwise referred to as compute resources or compute nodes. These resources may include central processing units (CPUs), such as NVIDIA Grace™ CPUs, and graphics processing units (GPUs), such as NVIDIA® H100 Tensor Core GPUs. The servers may also include memory, such as high-bandwidth memory (HBM) for GPUs, and storage devices, such as NVMe (Non-Volatile Memory Express) SSDs for fast data access. Each server within the server systems 102 may be configured to handle specific types of workloads, such as general-purpose computing, data processing, specialized tasks like artificial intelligence (AI) and machine learning (ML) applications, and/or the like. For example, NVIDIA® Tensor Core GPUs may be used to accelerate AI and ML workloads by performing parallel processing of large datasets. The server systems 102 may be connected to one or more datacenter switches 106, allowing the server systems 102 to communicate with other systems within the datacenter or external networks 108. The configuration of the server systems 102 may be scalable, allowing for additional servers, such as those with NVIDIA® GPUs and CPUs, to be added or removed as needed based on computing requirements.
In some embodiments, the server systems 102 may include top-of-rack (ToR) switches 102A. The ToR switches 102A may connect each server system 102 to the broader datacenter network, typically using high-speed networking protocols such as Ethernet or InfiniBand® protocols. The ToR switches 102A may reduce cable complexity by aggregating server connections within the rack and then linking to higher-layer switches, such as datacenter switches 106, within the datacenter. Each ToR switch may be connected to every server within its rack through short cables, and the switch may then uplink to the datacenter switches 106. ToR switches 102A in the server systems 102 may also support various network features such as VLAN segmentation, load balancing, and quality of service (QoS) management, ensuring optimized traffic flow within the rack and the datacenter as a whole. In some configurations, ToR switches 102A may offer redundancy by employing multiple uplinks to datacenter switches 106, providing fault tolerance in case of a switch or connection failure. Additionally or alternatively, the ToR switches 102A may be operatively coupled to the advanced datacenter processing units (NIC/DPUs) 104, enabling efficient offloading of data processing and security tasks, further reducing the computational burden on the server CPUs and improving overall data flow within the rack.
The NIC/DPU 104 may integrate network interface controller (NIC) and data processing unit (DPU) functionalities to enhance the efficiency of data center operations. The NIC/DPU 104 may be configured to offload various network, storage, and security tasks from the server systems 102, in particular, CPUs in the server systems 102, allowing the CPUs to focus on compute-intensive workloads. The NIC/DPU 104 may facilitate high-speed data transmission, optimize data flow, and enable advanced network services with minimal impact on server performance. The NIC component within the NIC/DPU 104 may handle standard network functions, such as packet transmission and reception, supporting high-speed Ethernet or InfiniBand® protocols. By facilitating fast data transfers between the server systems 102 and external networks 108, the NIC enables efficient communication across the datacenter environment. The NIC may also support offloading network protocol processing, reducing the overhead on server systems 102, in particular, CPUs in the server systems 102, and improving overall data throughput. The DPU component of the NIC/DPU 104 may extend these capabilities by offloading more advanced processing tasks, such as data encryption and decryption, packet inspection and filtering, virtualization support, and/or the like. In example embodiments, the NIC/DPU 104 may be NVIDIA BlueField®-2 DPUs, which provide a high-performance platform for data center acceleration. The BlueField-2 architecture may include up to 8 Arm cores, enabling the NIC/DPU 104 to execute network, storage, and security tasks independently of the server systems 102, in particular, CPUs in the server systems 102. By performing these tasks closer to the data source, the NIC/DPU 104 may reduce data movement across the network, lower latency, and enhance overall system efficiency.
The NIC/DPU 104 may also include a dedicated memory subsystem, such as dynamic random-access memory (DRAM), to support local processing and ensure high-speed data access. Additionally, the NIC/DPU 104 may be configured to manage NVMe over Fabrics (NVMe-oF) storage protocols, allowing for efficient remote storage access and fast data retrieval. The combined NIC and DPU functionalities within the NIC/DPU 104 may support various advanced networking features, including traffic shaping and load balancing, remote direct memory access (RDMA), virtual machine and container isolation, and/or the like.
The datacenter switches 106 may manage the data flow between the server systems 102 and the external networks 108. The datacenter switches 106 may be responsible for routing and distributing data between servers within the datacenter and facilitating communication with external networks. Datacenter switches 106 may be configured to support various high-speed network protocols, such as Ethernet or InfiniBand® protocols, depending on the performance and bandwidth requirements of the datacenter. The datacenter switches 106 may include optical switches, which use light signals for data transmission, offering high bandwidth and low latency for long-distance communication. Alternatively, the datacenter switches 106 may include electrical switches, which rely on electronic signals and may be used for shorter distances or when lower latency is a priority. In some configurations, hybrid switches may be used, combining both optical and electrical components to balance performance and flexibility. The datacenter switches 106 may be advanced networking switches, such as NVIDIA Quantum-2 switches, configured to provide high throughput capabilities. The datacenter switches 106 may operate at different layers of the network stack, including Layer 2 (data link layer) and Layer 3 (network layer), to perform switching and routing functions. Multiple datacenter switches 106 may be interconnected to provide redundancy and load balancing for reliable data transfer even if one switch fails. The datacenter switches 106 may support scalable configurations, allowing the network architecture to expand as additional server systems 102 or external networks 108 are introduced.
In certain embodiments, the number and arrangement of datacenter switches 106 within the datacenter network architecture 100 may be based on the overall network topology deployed in the datacenter environment. The choice of network topology may influence the scalability, performance, fault tolerance, and bandwidth distribution of the network, thus affecting how many switches are required and how they are interconnected. Examples of network topology may include fat-tree topology, SlimFly topology, dragonfly topology, HyperX topology, torus topology, Clos (folded-Clos) topology, and/or the like. For instance, in a fat-tree topology, the network is structured as a multi-tiered hierarchy with equal-cost paths between any two endpoints. The fat-tree topology may be built using three layers of switches: leaf switches at the bottom layer, directly connected to the server systems 102, spine switches in the middle layer, which interconnect the leaf switches, and core switches at the top, which interconnect multiple sets of spine switches. In a SlimFly topology, the datacenter switches 106 may be arranged to minimize the average path length between servers, reducing communication latency. The total number of datacenter switches 106 may be fewer than in fat-tree topology, but their arrangement may be more complex to optimize the number of direct and indirect connections between nodes. Dragonfly topology may organize switches into groups (or “pods”), with high-bandwidth connections within each group and lower-bandwidth connections between groups. The datacenter switches 106 may be arranged into several pods, with each pod containing a set of leaf switches connected to server systems 102 and local spine switches. In addition, there may be fewer inter-pod connections than intra-pod connections. In a HyperX topology, switches may be arranged in a multi-dimensional grid, with each switch connected to multiple neighboring switches in different dimensions.
The total number of switches may scale with the number of dimensions and network size. In a torus topology, the datacenter switches 106 may be connected in a loop or ring structure. Torus topology may offer reduced wiring complexity and built-in redundancy, as each switch is connected to multiple adjacent switches. In larger datacenters, a higher-dimensional torus (e.g., 3D or 4D torus) may be implemented, where switches are arranged in a multi-layered grid. In a Clos topology, also known as a folded-Clos or CLOS architecture, the datacenter switches 106 may be arranged in multiple layers of switching stages, with each stage containing multiple switches. In this configuration, each server system 102 may connect to a set of leaf switches, which in turn connect to multiple spine switches. Additional spine and leaf switches may be added as the network grows, with the number of datacenter switches 106 increasing in proportion to the number of server systems and external networks connected.
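For the three-layer fat-tree arrangement described above, the number of switches at each layer follows directly from the pod structure. A sketch of the standard k-ary fat-tree sizing (k even, k/2 leaf and k/2 spine switches per pod, (k/2)² core switches; the function and its names are illustrative):

```python
def fat_tree_switch_counts(k: int) -> dict:
    """Switch and host counts for a standard k-ary fat-tree (k must be even)."""
    assert k % 2 == 0, "a k-ary fat-tree requires an even k"
    leaf = k * (k // 2)      # k pods, each with k/2 leaf switches facing the servers
    spine = k * (k // 2)     # k pods, each with k/2 spine switches above the leaves
    core = (k // 2) ** 2     # (k/2)^2 core switches interconnecting the pods
    hosts = (k ** 3) // 4    # maximum number of attached servers
    return {"leaf": leaf, "spine": spine, "core": core, "hosts": hosts}
```

For example, a 48-port fat-tree (k = 48) supports 27,648 servers, which illustrates why switch count grows quickly with datacenter scale.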
The external networks 108 represent a range of connectivity options that facilitate communication between the datacenter and various external systems, such as other datacenters, cloud service providers, and/or the like. These external networks 108 may include local area networks (LANs), which connect devices within a limited geographical area, as well as WANs that span larger distances and connect multiple LANs. Additionally, external networks 108 may include cloud networks, which provide scalable resources and services hosted remotely, and private networks, which offer secure communication channels for sensitive data transfer. Other types of external networks may include virtual private networks (VPNs) that enable secure access over the internet and Content Delivery Networks (CDNs) that optimize the delivery of content to end-users. Each of these external networks may utilize various communication protocols, such as Ethernet, InfiniBand®, or MPLS (Multiprotocol Label Switching) protocols, to ensure reliable and efficient data transfer.
It should be noted that the description provided herein is merely one embodiment of the datacenter network architecture and the associated components, including the datacenter switches 106 and the NIC/DPU 104. Various modifications, alterations, and adaptations may be made without departing from the scope of the disclosure. The specific configurations, components, and functionalities described are illustrative and may be replaced or modified in other embodiments depending on the particular requirements of the datacenter environment. For example, different network topologies, alternative processing units, or variations in server configurations may be used to achieve similar objectives. As such, the scope of the invention should not be limited by the described embodiment.
The CPU 202 may manage overall operations within a server system (e.g., server system 102). The CPU 202 may execute instructions, process data, and control communication between the other components, including the memory module 204, switches 206, and GPUs 208. The CPU 202 may be connected to the memory module 204, providing fast access to data required for computational tasks. The CPU 202 may communicate with the GPUs 208 through the switches 206, enabling the CPU 202 to offload specialized computing tasks such as graphics rendering, AI, and ML workloads, and/or the like. Additionally, the CPU 202 may manage external communication via external connections 212, facilitating data exchange between the server system 102 and external networks 108 or other systems.
In some embodiments, the server system architecture 200 may be scalable to include multiple CPUs that are the same or similar to CPU 202, each managing its own set of resources such as memory, GPUs, and network connections. In such configurations, each CPU may communicate with other CPUs within the system via high-speed interconnects, such as NVLink® interconnects, to coordinate processing tasks and balance workloads. Such a distributed architecture may improve performance by allowing parallel processing across multiple CPUs, which may be particularly useful for data-intensive applications such as AI, ML, and high-performance computing. The server system architecture 200 may allow for the addition of more CPUs as needed, depending on the computing requirements of the workload.
The memory module 204 may provide fast data access for the CPU 202, allowing the CPU to efficiently execute instructions and process data. The memory module 204 may include various types of memory, such as DRAM or high-bandwidth memory (HBM), depending on the specific performance requirements. The memory module 204 may be directly connected to the CPU 202 to minimize latency and enable high-speed data transfers between the memory and the CPU. The size and type of the memory module 204 may be scalable, allowing for adjustments based on the workload and data processing needs of the server system. Multiple memory modules that are the same or similar to the memory module 204 may be included in the architecture to support additional CPUs or to increase memory capacity as required by the computing tasks.
The switches 206 may facilitate communication between the CPU 202, GPUs 208, and other components within the server system 102. These switches 206 may be responsible for routing data between these components, ensuring efficient data flow and coordination during processing tasks. The switches 206 may include various types of technologies, such as Peripheral Component Interconnect Express (PCIe) switches, which connect the CPU to multiple GPUs, enabling high-speed data transfers; Ethernet switches for managing communication with external networks; InfiniBand® switches designed for low-latency, high-throughput data transfers between servers in a high-performance computing environment; and/or the like. The architecture of the switches 206 may be scalable, accommodating additional components as needed to meet increasing performance demands. Furthermore, the switches 206 may provide features such as load balancing and fault tolerance, which improve the reliability and efficiency of data transmission within the server system.
The GPUs 208 may provide specialized processing capabilities for parallel computation tasks, such as those involved in AI, ML, and data-intensive computing workloads. Each GPU 208 may be connected to the CPU 202 via the switches 206, allowing the CPU 202 to offload certain tasks to the GPUs 208 for faster processing. The GPUs 208 may be configured to communicate with one another, either directly or through interconnect switches 210, to enable coordinated parallel processing and data sharing. The GPUs 208 may include HBM for faster access to data during computation. The number and type of GPUs 208 in the system may be scalable, allowing the architecture to accommodate varying performance needs depending on the specific workload. For example, the GPUs 208 may include NVIDIA® H100 Tensor Core GPUs optimized for deep learning and AI inference, or NVIDIA® A100 GPUs designed for high-performance computing and data analytics. The GPUs 208 may be used individually or in combination to meet the demands of various computational tasks.
In specific embodiments, the CPU 202 and/or the GPUs 208, or portions or components thereof, may be embodied as or include a chip or chipset. In other words, the CPU 202 and/or the GPUs 208 may include physical packages (e.g., chips) including materials, components, and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The CPU 202 and/or the GPUs 208 may therefore, in some cases, be configured to implement an embodiment of the disclosure on a single chip or as a single “system on a chip (SoC).” As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein. In this configuration, the CPU may be coupled to a GPU via die-to-die (D2D) interconnects, chip-to-chip (C2C) interconnects, such as a Ground-Referenced Signaling (GRS) interconnect, and/or the like, allowing for low-latency communication and high bandwidth between the CPU and GPU. Additionally, the CPU can connect to multiple GPUs using both D2D/C2C interconnects and high-speed interconnects, such as PCIe interconnects.
The interconnect switches 210 may facilitate communication between the GPUs 208, enabling high-speed data transfer and coordination for parallel processing tasks. These switches may include various types of interconnect technologies, such as NVIDIA® NVSwitches or other high-performance fabric switches, depending on the configuration. The high-speed interconnect switches 210 may allow multiple GPUs 208 to be interconnected in a fully integrated fabric, providing low-latency, high-bandwidth communication between the GPUs for efficient execution of AI, ML, and high-performance computing tasks. The interconnect switches 210 may support scalability, allowing additional GPUs 208 to be added as needed. These interconnect switches 210 may also manage data flow between GPUs 208 and the CPU 202 via the switches 206, optimizing data throughput for complex computational workloads. In some configurations, the interconnect switches 210 may support hybrid or optical interconnect technologies to enhance performance based on system requirements.
The external connections 212 may provide interfaces between the server system 102 and external networks (e.g., external networks 108 shown in
It should be understood that the server system architecture 200 described herein is merely one embodiment, and various modifications, substitutions, and alternatives may be made without departing from the scope of the disclosure. The specific components, configurations, and functionalities described are illustrative examples and may vary depending on the specific requirements of the server system or datacenter environment. For example, different types of CPUs, GPUs, memory modules, interconnect switches, and external connections may be used, and the architecture may be adapted to support alternative technologies or configurations. The server system architecture 200 may also be implemented in other forms or combined with additional hardware or software components to meet particular performance, scalability, or workload needs. Accordingly, the invention is not limited to the described embodiment.
In this embodiment, CPU 302 within server system 102 may be connected to GPU 304 and GPU 306 via an interconnect 310. Interconnect 310 may utilize corresponding ports 302a on CPU 302, 304a on GPU 304, and 306a on GPU 306 to facilitate efficient data flow between the CPU and GPUs. The ports used for interconnect 310 may vary depending on the type of connection. For example, if interconnect 310 is a D2D or C2C interconnect, such as a GRS interconnect, ports 302a, 304a, and 306a may be configured as GRS-compatible ports, which provide a high-bandwidth, low-latency pathway for data exchange directly on the chip or package. These GRS ports enable low-power, high-speed connections that minimize latency and optimize data exchange efficiency, allowing CPU 302 to handle workload distribution and data synchronization across GPUs 304 and 306 effectively.
Alternatively or additionally, interconnect 310 may be a high-speed PCIe interconnect, operatively coupling CPU 302 with GPUs 304 and 306 through PCIe-compatible ports 302a, 304a, and 306a. In this embodiment, the ports may be configured as multi-lane PCIe ports, such as PCIe x16, which provides a high-bandwidth, scalable data transfer channel. The PCIe interconnect 310 may support dynamic link width adjustments, enabling bandwidth to scale according to the intensity of the processing tasks, thereby optimizing resource allocation and supporting efficient data transfer across components. Such a configuration may be advantageous for workloads with variable data transfer needs, as it allows efficient sharing of PCIe bandwidth across other components within the server system, enhancing overall system performance.
As shown in
In this embodiment, CPU 302 is connected to GPUs 304 and 306 via interconnects 310, configured in a manner similar to the coupling described in
As illustrated in
It should be noted that
Although the term “circuitry” as used herein with respect to components 412-422 is described in some cases using functional language, it should be understood that the particular implementations necessarily include the use of particular hardware configured to perform the functions associated with the respective circuitry as described herein. It should also be understood that certain of these components 412-422 may include similar or common hardware. For example, two sets of circuitries may both leverage use of the same processor, network interface, storage medium, or the like to perform their associated functions, such that duplicate hardware is not required for each set of circuitries. It will be understood in this regard that some of the components described in connection with the system 402 may be housed together, while other components are housed separately (e.g., a controller in communication with the system 402).
While the term “circuitry” should be understood broadly to include hardware, in some embodiments, the term “circuitry” may also include software for configuring the hardware. For example, in some embodiments, “circuitry” may include processing circuitry, storage media, network interfaces, input/output devices, and the like. In some embodiments, other elements of the system 402 may provide or supplement the functionality of particular circuitry. For example, the processor 412 may provide processing functionality, the memory 414 may provide storage functionality, the communications circuitry 418 may provide network interface functionality, and the like.
In some embodiments, the processor 412 (and/or co-processor or any other processing circuitry assisting or otherwise associated with the processor) may be in communication with the memory 414 via a bus for passing information among components of, for example, the system 402. The memory 414 may be non-transitory and may include, for example, one or more volatile and/or non-volatile memories, or some combination thereof. In other words, for example, the memory 414 may be an electronic storage device (e.g., a non-transitory computer readable storage medium). The memory 414 may be configured to store information, data, content, applications, instructions, or the like, for enabling an apparatus, e.g., system 402, to carry out various functions in accordance with example embodiments of the present disclosure.
Although illustrated in
The processor 412 may be embodied in a number of different ways and may, for example, include one or more processing devices configured to perform independently. Additionally, or alternatively, the processor 412 may include one or more processors configured in tandem via a bus to enable independent execution of instructions, pipelining, and/or multithreading. The processor 412 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. The use of the term “processing circuitry” may be understood to include a single core processor, a multi-core processor, multiple processors internal to the apparatus, and/or remote or “cloud” processors. Accordingly, although illustrated in
In an example embodiment, the processor 412 may be configured to execute instructions stored in the memory 414 or otherwise accessible to the processor 412. Alternatively, or additionally, the processor 412 may be configured to execute hard-coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 412 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present disclosure while configured accordingly. Alternatively, as another example, when the processor 412 is embodied as an executor of software instructions, the instructions may specifically configure the processor 412 to perform one or more algorithms and/or operations described herein when the instructions are executed. For example, these instructions, when executed by processor 412, may cause the system 402 to perform one or more of the functionalities thereof as described herein.
In some embodiments, the system 402 further includes input/output circuitry 416 that may, in turn, be in communication with the processor 412 to provide an audible, visual, mechanical, or other output and/or, in some embodiments, to receive an indication of an input from a user or another source. In that sense, the input/output circuitry 416 may include means for performing analog-to-digital and/or digital-to-analog data conversions. The input/output circuitry 416 may include support, for example, for a display, touchscreen, keyboard, mouse, image capturing device (e.g., a camera), microphone, and/or other input/output mechanisms. The input/output circuitry 416 may include a user interface and may include a web user interface, a mobile application, a kiosk, or the like. The input/output circuitry 416 may be used by a user to view and/or adjust likelihood of malfunction indications (e.g., a user may indicate whether a malfunction has been resolved via the input/output circuitry).
The processor 412 and/or user interface circuitry comprising the processor 412 may be configured to control one or more functions of a display or one or more user interface elements through computer-program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 412 (e.g., the memory 414, and/or the like). In some embodiments, aspects of input/output circuitry 416 may be reduced as compared to embodiments where the system 402 may be implemented as an end-user machine or other type of device designed for complex user interactions. In some embodiments (like other components discussed herein), the input/output circuitry 416 may be eliminated from the system 402. The input/output circuitry 416 may be in communication with memory 414, communications circuitry 418, and/or any other component(s), such as via a bus. Although more than one input/output circuitry and/or other component can be included in the system 402, only one is shown in
The communications circuitry 418, in some embodiments, includes any means, such as a device or circuitry embodied in either hardware, software, firmware or a combination of hardware, software, and/or firmware, that is configured to receive and/or transmit data from/to a network and/or any other device, circuitry, or module in communication with the IC 420. In this regard, the communications circuitry 418 may include, for example, a network interface for enabling communications with a wired or wireless communication network. For example, in some embodiments, communications circuitry 418 may be configured to receive and/or transmit any data that may be stored by the memory 414 using any protocol that may be used for communications between computing devices. For example, the communications circuitry 418 may include one or more network interface cards, antennae, transmitters, receivers, buses, switches, routers, modems, and supporting hardware, software, and/or firmware, or any other device suitable for enabling communications via a network. Additionally, or alternatively, in some embodiments, the communications circuitry 418 may include circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). These signals may be transmitted by the system 402 using any of a number of wireless personal area network (PAN) technologies, such as Bluetooth® v1.0 through v5.0, Bluetooth Low Energy (BLE), infrared wireless (e.g., IrDA), ultra-wideband (UWB), induction wireless transmission, or the like. In addition, it should be understood that these signals may be transmitted using Wi-Fi, Near Field Communications (NFC), Worldwide Interoperability for Microwave Access (WiMAX) or other proximity-based communications protocols.
The communications circuitry 418 may additionally or alternatively be in communication with the memory 414, the input/output circuitry 416, and/or any other component of system 402, such as via a bus. The communications circuitry 418 of the system 402 may also be configured to receive and transmit information with the IC 420 and components associated therewith.
In some embodiments, the IC 420 may refer to a semiconductor device that integrates multiple functional components onto a single substrate to execute a broad array of electronic tasks. The IC 420 may include, but is not limited to, processing units, memory storage, input/output interfaces, and specialized controllers, which collectively support the efficient operation of various functions within the system 402. The IC 420 may contain both digital and analog circuits, potentially organized as a single chip or as a network of interconnected modules distributed across multiple chips, depending on system requirements and design constraints. In specific embodiments, the architecture of the IC 420 may include multiple functional modules configured to work in coordination, allowing the IC 420 to execute complex tasks within constrained power and space environments. By consolidating key processing, graphics, and memory management functions, the IC 420 may be configured to provide a high-performance solution suited for applications that require both computational efficiency and compact design.
In an example embodiment, the IC 420 may function as an integral component within the datacenter network architecture 100. For instance, in the datacenter network architecture 100, IC 420 may support efficient data processing and high-speed communication as part of the server systems 102, facilitating the transfer and management of data across the datacenter network through various interconnected components, such as datacenter switches 106 and ToR switches 102A.
In another example embodiment, the IC 420 may function as an integral component within the NIC/DPU 104. For instance, as part of the NIC/DPU 104, the IC 420 may assist in performing advanced data processing tasks by supporting NIC functionalities, such as high-speed packet transmission, protocol offloading, and low-latency communication. Additionally, the IC 420 may augment the DPU capabilities of the NIC/DPU 104 by executing computationally intensive operations, including encryption, packet inspection, and virtualization.
In yet another example embodiment, the IC 420 may function as an integral component within the server system architecture 200. For instance, within the server system architecture 200, IC 420 may improve computational efficiency by working alongside CPUs 202, memory modules 204, and GPUs 208, as well as interconnect switches 210 and external connections 212. The IC 420 may enable efficient data handling and processing by facilitating communication between CPUs and GPUs through high-speed interconnects, such as PCIe or NVLink®, to offload tasks and balance workloads. This coordination may allow the server system architecture 200 to execute complex, data-intensive computations, such as those required for AI, ML, or large-scale data analytics applications.
Additionally, the IC 420 may be configured to support high-performance clusters within the server system architecture 200, incorporating multiple GPUs or CPUs in an SoC manner to provide integrated parallel processing capabilities. For example, within high-performance clusters like those illustrated in
In an example embodiment, the IC 420 may be a Tegra® IC developed to support multimedia processing and advanced computational requirements in mobile and embedded systems. The Tegra® integrates an ARM-based CPU, a high-performance GPU, and additional components such as memory controllers and power management units. These elements enable the Tegra® to manage demanding applications, including high-definition graphics processing and data-intensive computations, while maintaining energy efficiency. For instance, in specific implementations, the Tegra X1® may incorporate a multi-core CPU alongside a GPU with hundreds of cores, facilitating tasks in gaming, machine learning, and autonomous control systems.
In some embodiments, the IC 420 may include a scan island 422. The scan island 422, as part of the IC 420, may represent a specialized section or module designed to facilitate the testing and analysis of the IC. Unlike other components within the IC 420, the scan island 422 may be specifically configured to enable access to internal states of the IC via associated scan chains and RAMs, providing a means to monitor, diagnose, or evaluate the behavior of the circuit. In specific embodiments, the scan island 422 may be configured to access a series of interconnected scan cells or registers (referred to herein as scan chains) associated with the IC 420. Such an arrangement allows for the sequential scanning of data through the cells, enabling the observation and control of individual elements within the IC 420. The scan island 422 may be used to perform various testing procedures, such as fault detection, performance analysis, or validation of the IC's functionality.
In various components of the datacenter environment, ICs (e.g., IC 420) may incorporate scan islands (e.g., scan island 422) to facilitate testing and analysis tailored to their specific operational roles. For instance, ICs within NIC/DPUs 104 may use scan islands to validate high-speed packet processing, encryption, and protocol offloading mechanisms, ensuring reliable performance of advanced networking features. Similarly, ICs integrated into server systems 102, working alongside CPUs, GPUs, and memory modules, may utilize scan islands to diagnose and monitor high-speed interconnects like PCIe or NVLink®, enabling efficient data transfer for compute-intensive tasks such as AI and machine learning. In datacenter switches 106, scan islands may support testing of packet switching and routing logic, identifying potential bottlenecks or faults that could impact overall network performance. By providing access to internal states and enabling detailed diagnostics, scan islands across these diverse ICs promote operational reliability and maintain the efficiency of the interconnected datacenter environment.
The scan island 422 may interface with the system 402 through specific communication pathways, allowing for the extraction of data or the injection of test patterns. This interaction may be facilitated by at least the processor 412, communications circuitry 418 and/or the input/output circuitry 416 within the system 402, ensuring a secure and efficient connection with the IC 420. Although depicted as a singular entity within the IC 420, the scan island 422 may include multiple interconnected sections or modules, each designed to target specific aspects of the IC 420. The design and configuration of the scan island 422 may vary based on the specific requirements of the IC 420, reflecting the complexity of the tasks it is intended to perform.
In some embodiments, the system 402 may include hardware, software, firmware, and/or a combination of such components, configured to execute various processes and methods described herein. It should be appreciated that in some embodiments, the IC 420 (including the scan island 422) may perform one or more of such example actions in combination with another circuitry of the system 402, such as the memory 414, processor 412, input/output circuitry 416, and communications circuitry 418. For example, in some embodiments, the IC 420 may utilize processing circuitry, such as the processor 412 and/or the like, to form a self-contained subsystem to perform one or more of its corresponding operations. In this regard, some or all of the example processes and algorithms discussed herein can be performed by at least one processor 412 and/or the IC 420.
In some embodiments, components including the processor 412, memory 414, input/output circuitry 416, communication circuitry 418, and other applicable components may be embedded with the IC 420 on the same board or substrate, allowing these elements to operate closely and efficiently. In alternative embodiments, the processor 412, memory 414, and other associated components may be embedded separately from the IC 420, connected through high-speed interconnects or buses to enable efficient data transfer between distinct modules. In yet other embodiments, some of the components, such as the processor 412 and communication circuitry 418, may be integrated into the IC 420, while others, like memory 414 and input/output circuitry 416, may remain external.
In various embodiments described herein, non-transitory computer readable storage media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and/or other computer-readable program code portions that can be executed to direct operation of the system 402 to implement various operations, including the examples shown herein. As such, a series of computer-readable program code portions may be embodied in one or more computer-program products and can be used, with a device, system 402, database, and/or other programmable apparatus, to produce the machine-implemented processes discussed herein. It is also noted that all or some of the information discussed herein can be based on data that is received, generated and/or maintained by one or more components of the system 402. In some embodiments, one or more external systems (such as a remote cloud computing and/or data storage system) may also be leveraged to provide at least some of the functionality discussed herein.
While the term “module” should be understood broadly to encompass a distinct functional unit within a system, in some embodiments, the term “module” may include both hardware and software components that collectively perform a specific function or set of functions. For example, in some embodiments, a “module” may include processing circuitry, algorithms, routines, storage media, network interfaces, input/output mechanisms, and the like. In some embodiments, other elements of the system may interact with or supplement the functionality of a particular module. For example, the clock and reset module 502 may synchronize the operations of other modules, the data security module 506 may ensure the integrity of data processed by other modules, and the like. The term “module” may thus represent a cohesive and purpose-driven element within the system, capable of operating independently or in conjunction with other modules to achieve the desired outcomes.
As shown in
As shown in
As shown in
The scan chains 522 may be a series of interconnected registers REG_1, REG_2, . . . , REG_n that enable the controlled shifting of data within the IC 420. The scan chains 522 may be used during the data extraction process, allowing for the precise retrieval of information from various parts of the IC 420. The scan chains 522 may be configured to capture specific data patterns, states, or conditions within the IC, providing valuable insights into its operation, performance, or potential issues. As such, the scan chains 522 play an important role in the diagnostic and monitoring capabilities of the scan island 422, facilitating the targeted extraction of data in response to trigger events or other operational needs. In some embodiments, the scan chains 522 may be routed through the rest of the IC, allowing registers with similar exposure permission to be grouped in common chains, while those with dissimilar exposure are grouped in different scan chains. In this regard, in example embodiments, fuses can be used to prevent data from certain chains from appearing on the scan output, depending on IC manufacturer, OEM, or end user requirements.
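The register-shifting and fuse-gated exposure behavior described above can be illustrated with a brief software model. This is a sketch only: the class and field names are hypothetical, and an actual scan chain is a hardware serial shift path rather than a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class ScanChain:
    """Software model of one scan chain: REG_1..REG_n values shifted out in order."""
    registers: list[int] = field(default_factory=list)  # register contents
    exposed: bool = True  # fuse-controlled exposure permission for this chain

    def shift_out(self) -> list[int]:
        """Serially shift all register values out of the chain.

        If the chain's exposure fuse forbids it, the contents are masked so
        they never appear on the scan output.
        """
        if not self.exposed:
            return [0] * len(self.registers)  # masked per fuse policy
        dumped, self.registers = self.registers, []
        return dumped


# A chain whose fuse permits exposure dumps its contents; a fused-off chain does not.
open_chain = ScanChain(registers=[0xA5, 0x3C, 0x7E])
fused_chain = ScanChain(registers=[0xDE, 0xAD], exposed=False)
print(open_chain.shift_out())   # [165, 60, 126]
print(fused_chain.shift_out())  # [0, 0]
```

Grouping registers with similar exposure permission into common chains, as described above, lets a single fuse decision cover an entire chain rather than individual registers.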
The RAMs RAM_1, RAM_2, . . . , RAM_n 524 may serve as volatile storage for the IC 420 during operation. When a trigger event occurs, the data within the RAMs 524 is extracted along with the data from the scan chains 522. The data extraction provides a snapshot of the state of the IC at the time of the malfunction, capturing valuable information about the conditions, variables, and processes that were active when the issue occurred. In some embodiments, the data extracted from the RAMs may include variables, buffers, temporary calculations, and other transient information that reflects the dynamic operation of the IC 420. By analyzing this data, analysts can gain insights into the underlying cause of the malfunction, identify patterns or anomalies, and develop targeted solutions or preventative measures. Similar to the exposure-based chain segregation, differing policies may be applied to on-chip RAM segregation, allowing for data isolation and concealment as needed during a RAM dump.
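The policy-gated RAM dump can be sketched in the same spirit. The function and policy names below are hypothetical; the point is only that a per-RAM segregation policy decides which contents appear in the snapshot.

```python
def dump_rams(rams: dict[str, bytes], expose: dict[str, bool]) -> dict[str, bytes]:
    """Snapshot on-chip RAM contents, concealing any RAM the policy does not expose."""
    return {name: data for name, data in rams.items() if expose.get(name, False)}


# RAM_2 is concealed by policy, so only RAM_1's contents appear in the dump.
rams = {"RAM_1": b"\x01\x02", "RAM_2": b"\xff\xee"}
policy = {"RAM_1": True, "RAM_2": False}
print(dump_rams(rams, policy))  # {'RAM_1': b'\x01\x02'}
```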
Once the data is extracted, it may be further processed by other modules within the scan island 422, such as the data processing module 514 and the data compression module 516, before being stored in non-volatile storage media for distribution. The data extraction module 512 thus plays a central role in the scalable and secure method of retrieving data from the IC 420, acting as the starting point for the entire process.
As shown in
As shown in
As shown in
As shown in
As shown in
As shown in
As described herein, the data processing module 514 may be configured to filter and reformat the data extracted from the IC 420. At step 610, the data processing module 514 may read data from the memory bank (e.g., memory bank 520), accessing the information previously extracted and stored by the data extraction module 512. Next, at step 612, the data processing module 514 may filter the data, applying specific criteria or rules to isolate relevant information or exclude unnecessary details. In some embodiments, this filtering may be based on security and isolation policies associated with the scan island. Following the filtering, at step 614, the data processing module 514 may re-format the data, converting it from its initial format to a standardized format that facilitates subsequent analysis or transmission. Then, at step 616, the data processing module 514 may store the processed data in the memory bank, preserving the refined and restructured information for further use. Finally, at step 618, the data processing module 514 may complete its operation by informing the next stage of data availability, signaling to other modules or components within the scan island that the processed data is ready for further actions such as compression, encryption, or transmission.
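The read/filter/re-format/store/signal sequence above can be sketched as follows. The memory-bank key names and the `exposable` filter criterion are assumptions for illustration; the document does not specify the filtering rules or data layout.

```python
import json


def processing_stage(memory_bank: dict) -> None:
    """Sketch of the filter and re-format stage (steps 610-618); key names hypothetical."""
    raw = memory_bank["extracted"]                        # step 610: read from memory bank
    kept = [r for r in raw if r.get("exposable", False)]  # step 612: filter per isolation policy
    memory_bank["processed"] = json.dumps(kept, sort_keys=True)  # steps 614/616: re-format and store
    memory_bank["processed_ready"] = True                 # step 618: signal data availability


bank = {"extracted": [{"reg": "REG_1", "value": 7, "exposable": True},
                      {"reg": "REG_2", "value": 9, "exposable": False}]}
processing_stage(bank)
print(bank["processed"])  # only the exposable record, in a standardized JSON format
```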
As described herein, the data compression module 516 may be configured to compress the data extracted from the IC 420. At step 620, the data compression module 516 may read data from the memory bank (e.g., memory bank 520), accessing the information that has been processed and stored by the preceding stages (e.g., data processing module 514). Next, at step 622, the data compression module 516 may compress the data, applying specific algorithms or techniques to reduce its size without losing essential information. Such compression may be useful for efficient storage and transmission, particularly when dealing with large volumes of data. Following the compression, at step 624, the data compression module 516 may store the compressed data in the memory bank, ensuring that the more compact form of the data is preserved for subsequent actions such as encryption or transmission. Finally, at step 626, the data compression module 516 may complete its operation by informing the next stage of data availability, signaling to other modules or components within the scan island that the compressed data is ready for further processing.
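A minimal sketch of this stage, using a general-purpose lossless codec as a stand-in for whatever algorithm the hardware implements (the document does not name one), and the same hypothetical memory-bank keys:

```python
import zlib


def compression_stage(memory_bank: dict) -> None:
    """Sketch of the compression stage (steps 620-626); key names hypothetical."""
    data = memory_bank["processed"]                     # step 620: read from memory bank
    memory_bank["compressed"] = zlib.compress(data, 9)  # steps 622/624: compress and store
    memory_bank["compressed_ready"] = True              # step 626: signal next stage


bank = {"processed": b"IC state snapshot " * 64}
compression_stage(bank)
# Lossless: decompressing reproduces the original snapshot exactly, in less space.
print(zlib.decompress(bank["compressed"]) == bank["processed"])  # True
print(len(bank["compressed"]) < len(bank["processed"]))          # True
```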
As described herein, the data security module 506 may be configured to secure data extracted from the IC 420. At step 628, the data security module 506 may read data from the memory bank (e.g., memory bank 520), accessing the information that has been compressed and stored by the preceding stages. Next, at step 630, the data security module 506 may encrypt the data, applying specific cryptographic algorithms to ensure the confidentiality and integrity of the information. Such encryption may be useful for protecting the data against unauthorized access or alteration. Following the encryption, at step 632, the data security module 506 may attest the data, such as by using a verification mechanism that ensures the authenticity and integrity of the data, thereby preventing spoofing attacks on the extracted data. In addition, the data security module 506 may provide permissioned access to the data based on data ownership among multiple users, thereby providing a robust layer of security. Then, at step 634, the data security module 506 may store the encrypted and attested data in the memory bank, preserving the secured form of the data for subsequent transmission or analysis. Finally, at step 638, the data security module 506 may complete its operation by informing the next stage of data availability, signaling to other modules or components within the scan island that the encrypted and attested data is ready for further actions such as transmission to the OEM server for post-event analysis.
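The encrypt-then-attest sequence (steps 630-632) can be sketched with standard-library primitives. The SHA-256 counter-mode keystream below is an illustrative stand-in only, as the document does not name a cipher; a production design would use a vetted authenticated cipher such as AES-GCM. The HMAC tag plays the role of the attestation that lets a verifier detect spoofing or tampering.

```python
import hashlib
import hmac


def encrypt_and_attest(data: bytes, enc_key: bytes, attest_key: bytes) -> tuple[bytes, bytes]:
    """Encrypt the dump (step 630) and attach an attestation tag (step 632).

    Illustrative stand-in cipher: XOR with a SHA-256 counter-mode keystream.
    """
    keystream = b""
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(enc_key + counter.to_bytes(8, "big")).digest()
        counter += 1
    ciphertext = bytes(a ^ b for a, b in zip(data, keystream))
    # HMAC over the ciphertext lets the receiver verify authenticity and integrity.
    tag = hmac.new(attest_key, ciphertext, hashlib.sha256).digest()
    return ciphertext, tag


ciphertext, tag = encrypt_and_attest(b"compressed dump", b"enc-key", b"attest-key")
# The XOR keystream is symmetric, so applying the same operation recovers the plaintext.
recovered, _ = encrypt_and_attest(ciphertext, b"enc-key", b"attest-key")
print(recovered)  # b'compressed dump'
```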
As described herein, the input/output circuitry 508 may be configured to manage data transmission associated with the IC 420. At step 640, the input/output circuitry 508 may read data from the memory bank (e.g., memory bank 520), accessing the information that has been encrypted, attested, and stored by the preceding stages. This data may represent the final processed form of the information extracted from the IC, which should be ready for external transmission. Next, at step 642, the input/output circuitry 508 may transmit the data for storage to an external non-volatile storage, ensuring that the data is preserved in a stable and secure form outside of the scan island. The external storage may provide a robust repository for the data, facilitating subsequent retrieval and analysis by authorized parties such as the OEM server. In some embodiments, the transmission may involve specific protocols or interfaces to maintain the security and integrity of the data during the transfer.
Example Method for Automated Data Retrieval from an IC
As shown in block 704, the method may include extracting, using a scan island, data from a plurality of scan chains and a plurality of RAMs associated with the IC in response to receiving the alert. As described herein, the scan island, a partition of the IC isolated for data retrieval, may utilize the data extraction module to retrieve specific information from interconnected registers (scan chains) and RAMs, providing insights into various operational states and conditions within the IC. In some embodiments, the data extracted from the scan chains and the RAMs may include information associated with the trigger event. For example, the information associated with the trigger event may include general data related to the IC, such as its model, manufacturer, serial number, or other identifying characteristics used for tracking, auditing, or diagnostic purposes. In another example, the information associated with the trigger event may include a debug configuration state of the IC, such as specific settings, parameters, or states related to debugging functions within the IC. The debug configuration state may include details about active debug modes, breakpoints, watchpoints, or other diagnostic tools that were engaged at the time of the trigger event. In yet another example, the information associated with the trigger event may include a configuration state of one or more components of the IC, including settings and states of individual components within the IC, such as processors, memory units, or peripheral devices that may reveal how each component was set up and operating at the time of the trigger event. In still other examples, the information associated with the trigger event may include firmware and/or software measurements of the one or more components of the IC, such as performance metrics, version information, or other data related to the firmware or software running on the IC's components.
In still other examples, the information associated with the trigger event may include an error state that captures specific details about any errors, faults, or exceptions that occurred within the IC at the time of the trigger event. Here, the error state may include error codes, descriptions, timestamps, or other diagnostic information that can pinpoint the nature and origin of the problem. In yet another example, the information associated with the trigger event may include configuration information associated with the scan island.
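The categories of trigger-event information enumerated above may, for illustration, be grouped into a single record structure. The following Python sketch is a hypothetical grouping; all field names and example values are assumptions introduced here, not part of the disclosure.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class TriggerEventRecord:
    """Hypothetical record of the information extracted on a trigger event."""
    ic_model: str                 # identifying data: model, manufacturer, etc.
    serial_number: str            # used for tracking, auditing, diagnostics
    debug_state: dict             # active debug modes, breakpoints, watchpoints
    component_states: dict        # per-component configuration at trigger time
    firmware_measurements: dict   # versions/metrics of firmware and software
    error_state: dict             # error codes, descriptions, timestamps
    scan_island_config: dict = field(default_factory=dict)

rec = TriggerEventRecord(
    ic_model="XC-1000",
    serial_number="SN-0042",
    debug_state={"breakpoints": []},
    component_states={"core0": "halted"},
    firmware_measurements={"fw_version": "1.4.2"},
    error_state={"code": 0xDEAD, "timestamp": 1727328000},
)
assert asdict(rec)["error_state"]["code"] == 0xDEAD
```

A structured record of this kind makes the error state, including its codes and timestamps, directly addressable during post-event analysis.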
In some embodiments, the data extraction process may depend on various factors, including data ownership and specific parameters defined by the users who own the data, as described in further detail below.
As shown in block 706, the method may include storing the data in an external non-volatile storage media. Upon processing, the input/output circuitry may transmit the data to the external non-volatile storage media, ensuring stable and secure preservation.
As shown in block 708, the method may include rebooting the IC upon storing the data in the external non-volatile storage media. In some embodiments, this reboot may be part of a recovery process and may serve to reset the IC to a functional state.
As shown in block 710, the method may include determining whether the reboot is in response to the trigger event. This determination ensures that the data transmission to the OEM server is specifically linked to the identified trigger event, maintaining the relevance and accuracy of the post-event analysis.
As shown in block 712, the method may include transmitting the data from the external non-volatile storage media to an original equipment manufacturer (OEM) server for post-event analysis in an instance in which the reboot is in response to the trigger event. In some embodiments, if the reboot is indeed in response to the trigger event, then the data is transmitted to the OEM server for post-event analysis, facilitating diagnostics, troubleshooting, and potential improvements to the IC's design or operation. On the other hand, in embodiments where the reboot is not in response to the trigger event (e.g., the reboot is performed as part of the IC's regular operation), the data is not transmitted from the external non-volatile storage media to the OEM server.
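The gating of blocks 708 through 712, in which transmission to the OEM server occurs only when the reboot stems from the trigger event, may be sketched as follows. The reboot-cause labels, storage layout, and `send` callback are hypothetical interfaces introduced for illustration.

```python
def maybe_transmit(reboot_cause: str, storage: dict, send) -> bool:
    """Blocks 710-712: forward stored data to the OEM server only when
    the reboot was caused by the trigger event (hypothetical interface)."""
    if reboot_cause != "trigger_event":  # e.g., a routine scheduled reboot
        return False                     # data stays in non-volatile storage
    send(storage["crash_data"])          # post-event analysis path
    return True

sent = []
storage = {"crash_data": b"secured dump"}
assert maybe_transmit("trigger_event", storage, sent.append)
assert sent == [b"secured dump"]
assert not maybe_transmit("scheduled", storage, sent.append)
assert len(sent) == 1  # routine reboots trigger no transmission
```

This keeps the OEM-bound traffic linked to identified trigger events, preserving the relevance of the post-event analysis while avoiding transmissions during ordinary operation.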
Example Method for Managing Data Extraction from an IC
As shown in block 806, the method may include masking, using the scan island, the first portion of the data upon determining that the first portion of the data is associated with the first user, wherein the first portion of the data is masked. In some embodiments, to mask the first portion of the data, the method may employ a dynamic mask pattern generator (DMPG) to generate a specific pattern, such as a sequence of zeros or a random sequence, that effectively masks or conceals a first portion of the data extracted from the IC. The DMPG may be configured and optimized according to the specific characteristics of the scan chain, such as the length of the region that needs to be concealed. This allows for precise control over the masking process, enabling real-time adjustments or “ad-hoc” changes to the concealment configuration. The arming or disarming of the DMPG can be decided at run time, providing flexibility and responsiveness to the varying requirements of data security and concealment within the system. By employing the DMPG, the method may ensure that sensitive or client-specific data is protected from unauthorized access or exposure.
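The masking operation described above may be sketched as follows, with the scan chain modeled as a list of bits. The function name and parameters are illustrative; a hardware DMPG would overlay the pattern as the chain is shifted out rather than operating on a list.

```python
import secrets

def dmpg_mask(chain_bits: list[int], start: int, length: int,
              pattern: str = "zeros") -> list[int]:
    """DMPG sketch: overwrite the region of the scan chain that holds
    client data with zeros or a random sequence, preserving chain length."""
    if pattern == "zeros":
        fill = [0] * length
    else:  # "random" concealment pattern
        fill = [secrets.randbits(1) for _ in range(length)]
    return chain_bits[:start] + fill + chain_bits[start + length:]

chain = [1, 1, 0, 1, 0, 1, 1, 0]
masked = dmpg_mask(chain, start=2, length=4, pattern="zeros")
assert masked == [1, 1, 0, 0, 0, 0, 1, 0]  # region [2, 6) concealed
assert len(masked) == len(chain)           # chain length is preserved
```

Because the `start` and `length` parameters can be changed per extraction, the sketch mirrors the run-time, ad-hoc reconfiguration the DMPG provides.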
As shown in block 806, the method may include storing the data in an external non-volatile storage media. The external non-volatile storage provides a secure and reliable means of preserving the data, facilitating its transmission to other components such as an original equipment manufacturer (OEM) server for post-event analysis, or enabling further processing within the scan island or other parts of the system. The storage process may also include specific measures to ensure data security, such as through implementing encryption or access controls, further enhancing the confidentiality and integrity of the stored data.
As shown in block 854, the method may include bypassing the first subset of scan chains and the first subset of RAMs when extracting the data upon determining, using the data security module, that the first portion of the data is associated with the first user, wherein the first portion of the data is excluded. In some embodiments, as an alternative to masking data using DMPG, Dynamic Sub-chain Bypass Control (DSBC) may be used to bypass or skip specific registers (e.g., first subset of scan chains and first subset of RAMs) in the scan chain during the scanning and data extraction process. By bypassing these registers during the scanning and data extraction process, the first portion of the data (e.g., client data) is effectively skipped over, and thus concealed from the scanning process.
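The bypass alternative may be sketched as follows, with sub-chains modeled as named bit lists. The names and the dictionary interface are assumptions for illustration; in hardware, DSBC would route the shift path around the bypassed registers.

```python
def dsbc_extract(sub_chains: dict[str, list[int]],
                 bypass: set[str]) -> list[int]:
    """DSBC sketch: shift out every sub-chain except those flagged for
    bypass, so client registers never enter the extracted stream."""
    stream = []
    for name, bits in sub_chains.items():
        if name in bypass:
            continue  # bypassed registers are skipped entirely
        stream.extend(bits)
    return stream

chains = {
    "core_state": [1, 0, 1],
    "client_keys": [1, 1, 1],   # first user's data, to be excluded
    "error_regs": [0, 1],
}
out = dsbc_extract(chains, bypass={"client_keys"})
assert out == [1, 0, 1, 0, 1]  # client_keys never appears in the stream
```

Unlike masking, which emits placeholder bits in place of the concealed region, bypassing shortens the extracted stream, which is why it offers the more space-efficient option noted below.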
As shown in block 856, the method may include storing the data in an external non-volatile storage media.
The choice between masking techniques, such as DMPG, and register bypass techniques, such as DSBC, may depend on various factors such as the specific requirements for data concealment, the design of the IC, area optimization considerations, the desired level of granularity in control, and/or the like. Both masking and bypassing provide mechanisms to prevent unauthorized access to client data within the IC. By either masking the data with DMPG or bypassing it with DSBC, sensitive information can be protected from potential exposure or misuse. For example, in situations that require a more space-efficient option, DSBC may be a preferred choice as it offers the advantage of minimal spatial and computational requirements.
As shown in block 902, the method may include establishing, using the data security module, a network session with the first user upon determining that the first portion of the data is associated with the first user. Establishing the network session may ensure a secure connection between the system and/or the IC, and the authorized user (e.g., the first user), facilitating a controlled environment for data access. In specific embodiments, the network session may be encrypted or otherwise secured to prevent unauthorized access.
As shown in block 904, the method may include transmitting, via the network session, a request to the first user to submit an authentication token to access the first portion of the data, wherein the request comprises a unique identifier associated with the IC and a session identifier associated with the network session.
As shown in block 906, the method may include receiving, via the network session, the authentication token from the first user in response to the request. In some embodiments, the authentication token may include the session identifier, the unique identifier, and/or a first set of parameters for extraction of the data. Here, the session identifier may refer to a unique value associated with the network session established with the first user, which may serve as a reference to identify and manage the specific communication session, thereby ensuring that the data exchange is consistent and secure. The unique identifier associated with the IC may refer to a serial number, a hardware ID, or another unique attribute that identifies the specific IC. By including the unique identifier in the authentication token, the system can verify that the request pertains to the correct IC and prevent unauthorized access to other devices. The first set of parameters for extraction of the data may refer to specific rules, conditions, or methods for extracting the data from the IC. These parameters may include specifications about specific data to extract, data formatting requirements, data transmission parameters, or other operational details.
In some embodiments, the authentication token may be signed by the first user using a first private key. The resulting digital signature generated by the first private key may serve as a cryptographic seal, confirming the integrity of the authentication token and establishing the identity of the first user. The first private key may be a critical component of a cryptographic key pair, securely stored and exclusively controlled by the first user. The corresponding public key (e.g., first public key), stored in a secure area of the IC, may be used to verify the digital signature on the authentication token.
As shown in block 908, the method may include validating, using the data security module, the authentication token following receipt of the authentication token. The validation process may involve several steps to ensure the authenticity and integrity of the token. In specific embodiments, the validation process may include verifying the digital signature of the first user, checking the token's expiration date, and ensuring that the token has not been revoked. In specific embodiments, the digital signature may be verified using the corresponding public key (e.g., the first public key) associated with the first user. Successful verification may authenticate the first user and confirm the legitimacy of the authentication token. Additionally or alternatively, the validation may include comparing the unique identifier associated with the IC and the session identifier associated with the network session, both of which are contained within the token, against known and expected values. These checks help confirm that the token is legitimate and has been issued by a trusted entity, and that it pertains to the correct IC and network session. If any of these validation checks fail, the token may be rejected, and access to the first portion of the data may be denied. Successful validation of the authentication token confirms that the first user is authorized to access the first portion of the data.
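The token construction and validation steps of blocks 904 through 908 may be sketched as follows. HMAC-SHA256 is a symmetric stand-in for the asymmetric signing and public-key verification the disclosure describes, and the identifiers, time-to-live, and revocation structure are all hypothetical.

```python
import hashlib
import hmac
import json
import time

# Symmetric stand-in for the first user's private key / first public key pair.
USER_KEY = b"first user's signing key (HMAC stands in for asymmetric signing)"

def make_token(session_id, ic_id, params, ttl=300):
    """Block 904-906: the user assembles and signs the authentication token."""
    body = {"session": session_id, "ic": ic_id,
            "params": params, "expires": time.time() + ttl}
    payload = json.dumps(body, sort_keys=True).encode()
    return payload, hmac.new(USER_KEY, payload, hashlib.sha256).digest()

def validate_token(payload, sig, expected_session, expected_ic, revoked=()):
    """Block 908: verify signature, expiration, revocation, and identifiers."""
    if not hmac.compare_digest(
            hmac.new(USER_KEY, payload, hashlib.sha256).digest(), sig):
        return False                                   # signature check failed
    body = json.loads(payload)
    if body["expires"] < time.time():
        return False                                   # token expired
    if (body["session"], body["ic"]) in revoked:
        return False                                   # token revoked
    # Compare identifiers against known and expected values.
    return body["session"] == expected_session and body["ic"] == expected_ic

payload, sig = make_token("sess-7", "IC-SN-0042", {"region": "debug_regs"})
assert validate_token(payload, sig, "sess-7", "IC-SN-0042")
assert not validate_token(payload, sig, "sess-8", "IC-SN-0042")   # wrong session
assert not validate_token(payload + b" ", sig, "sess-7", "IC-SN-0042")  # tampered
```

Any single failed check rejects the token, which matches the rule that a failure of any validation step denies access to the first portion of the data.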
As shown in block 910, the method may include determining that the first user is authorized to access the first portion of the data based on at least validating the authentication token. Such a determination is made based on successful validation confirming that the user has the necessary permissions to access the specified data.
Upon successfully authenticating the first user, in some embodiments, the method may include extracting, using the data extraction module, the first portion of the data from the IC. Here, the extraction of the first portion of the data is conducted in accordance with the first set of parameters, delineating the specific data to be extracted, the method of extraction, and the desired format. Upon extracting the first portion of the data, the method may include attesting, using the data security module, the first portion of the data. Attestation may refer to the process of verifying the integrity and authenticity of the data, ensuring that the data has not been altered or tampered with since its extraction from the IC. In example embodiments, attestation may be achieved by generating a cryptographic hash or checksum of the first portion of the data and then digitally signing the hash. The resulting digital signature may serve as a secure and verifiable proof of the data's integrity at the time of attestation. The attestation process provides an additional layer of security and trust, reinforcing the confidence in the accuracy and reliability of the data being accessed.
In some embodiments, once attested, the first portion of the data may then be encrypted using the data security module. In this regard, the method may include generating, using the data security module, an encryption key that may be used to encrypt the first portion of the data. The encryption key may be produced to ensure uniqueness and security, in accordance with established cryptographic standards. Once produced, the method may include encrypting the first portion of the data utilizing a specified encryption algorithm, thereby transforming the data into a secure format. To further enhance security, the method may include encrypting the encryption key itself using the first public key associated with the first user, thereby ensuring that the encryption key is accessible only to the holder of the corresponding private key (e.g., the first user).
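The envelope scheme described above, in which a fresh symmetric key encrypts the data and the user's public key then wraps that key, may be sketched as follows. The XOR-with-SHA-256-keystream cipher below is a deliberately insecure toy stand-in used only to show the key flow; a real data security module would use an established algorithm (e.g., AES-GCM for the data and RSA or ECC for key wrapping), and the symmetric "public key" here merely stands in for true asymmetric encryption.

```python
import hashlib
import secrets

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256 counter keystream. NOT secure; a toy stand-in
    for a real cipher. XOR is symmetric, so the same call also decrypts."""
    out = bytearray()
    for block in range(0, len(data), 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        chunk = data[block:block + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)

# The data security module generates a unique key per extraction...
data_key = secrets.token_bytes(32)
ciphertext = toy_cipher(data_key, b"first portion of the extracted data")

# ...then wraps the data key for the first user. Here toy_cipher with
# user_public stands in for public-key encryption of the data key.
user_public = secrets.token_bytes(32)
wrapped_key = toy_cipher(user_public, data_key)

# The user unwraps with the corresponding key and decrypts the payload.
recovered_key = toy_cipher(user_public, wrapped_key)
plaintext = toy_cipher(recovered_key, ciphertext)
assert plaintext == b"first portion of the extracted data"
```

The design choice the sketch illustrates is that only the wrapped key and ciphertext ever leave the scan island, so only the holder of the corresponding private key can recover the data key and, through it, the first portion of the data.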
As shown in block 912, the method may include transmitting the first portion of the data to the first user in an instance in which the first user is authorized to access the first portion of the data. Here, both the encrypted first portion of the data and the encrypted encryption key may be transmitted to the first user. Upon receipt, the first user may decrypt the encryption key using a private key that corresponds to the public key that was previously used to encrypt the encryption key. Once the encryption key is obtained, it can be used to decrypt the encrypted first portion of the data, thereby granting the first user access to the first portion of the data in its original form.
In instances where the data extracted from the IC is associated with multiple users, the method may include determining that one or more portions of the data are associated with one or more users. Following this determination, the method may include utilizing the data security module to determine whether the one or more users are authorized to access the one or more portions of the data. The authorization process may be distinct for each user and the specific portion of the data they are claiming access to. In some embodiments, the authorization process may involve validating the authentication token, verifying user credentials, checking access permissions, ensuring compliance with any applicable legal or contractual obligations, and/or the like. In example embodiments, validating the authentication token may include verifying the digital signature of each user, checking the token's expiration date, and ensuring that the token has not been revoked. As described herein, the authentication token may be signed by each user using their respective private keys. The digital signature of each user may be independently verified using corresponding public keys associated with each user. Successful verification may authenticate each user and confirm the legitimacy of the authentication token. In some embodiments, the authorization process may also consider the context of the request, such as the current state of the IC, the nature of the trigger event, or other situational factors that may affect access rights. Additional security measures, such as multi-factor authentication or encryption, may be employed to further safeguard the data.
If the one or more users are determined to be authorized to access the one or more portions of the data, the method may include transmitting the corresponding portions of the data to the authorized users. This transmission may be conducted securely, using encryption or other security protocols, and may be tailored to each user's specific access rights and needs. By implementing this multi-user authorization process, the method ensures that each party is authenticated for the specific portion of the data they are claiming access to, maintaining the confidentiality, integrity, and appropriate distribution of the information. This approach allows for fine-grained control over data access, accommodating complex scenarios where multiple authorized parties are involved, each with potentially different rights and permissions concerning the data.
Many modifications and other embodiments of the present disclosure set forth herein will come to mind to one skilled in the art to which these embodiments pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Although the figures only show certain components of the methods and systems described herein, it is understood that various other components may also be part of the disclosures herein. In addition, the method described above may include fewer steps in some cases, while in other cases may include additional steps. Modifications to the steps of the method described above, in some cases, may be performed in any order and in any combination, and some steps may be performed in parallel.
Therefore, it is to be understood that the present disclosure is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
This application is a continuation-in-part of U.S. patent application Ser. No. 18/372,773, filed on Sep. 26, 2023, entitled “SYSTEM FOR AUTOMATED DATA RETRIEVAL FROM AN INTEGRATED CIRCUIT FOR EVENT ANALYSIS,” which is hereby incorporated by reference in its entirety for all purposes.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 18372773 | Sep 2023 | US |
| Child | 19003046 | | US |