The present disclosure is directed in general to automotive safety systems. In one aspect, the present disclosure relates to methods and systems for fault management in system-on-chips (SoCs).
Increasing levels of system integration have resulted in more and more processor cores and resources being bundled on a single chip. These processor cores have multiple applications being executed at the same time. With such system designs, where multiple applications are integrated on the same chip and working concurrently, there is an increase in the number of faults in the chip. To achieve fault-free operation, it is mandatory to detect and recover from a fault within a fault handling time interval. Typically, a fault collection and reaction system is included on the chip to categorize faults and generate appropriate reactions. Existing fault collection and reaction system architectures use a centralized fault collection application to collect and react to faults, which results in a high-severity response for each fault irrespective of the application that caused the fault. Moreover, the existing fault collection and reaction systems rely on a central core and software to handle faults from all applications. This increases fault handling time and instability in the chip and reduces system availability for other operations. Such centralized fault handling architectures have several disadvantages for the new SoC trends, including loss of fault granularity when complex fault routing is solved by fault aggregation, limited scalability to system-of-systems solutions, and limited robustness with respect to faults in the fault handling process itself, which requires thorough testing.
As seen from the foregoing, existing fault collection systems are extremely difficult to implement at a practical level by virtue of the challenges of providing fault handling on complex SoC and system-of-systems designs which meets the applicable performance, design, complexity, and cost constraints.
The present invention may be understood, and its numerous objects, features and advantages obtained, when the following detailed description of a preferred embodiment is considered in conjunction with the following drawings.
A fault handling apparatus, system, method, and program code are described for System(s) on Chip (SoC) systems where each SoC subsystem includes one or more fault collection hardware units which are deployed and connected to one another to form fault handling schemas that provide a fault-tolerant architecture for an arbitrary depth of hierarchy and an arbitrary number of faults originating in the SoC system. In selected embodiments, the fault collection hardware units are connected in a plurality of daisy-chain configurations arranged in an escalation tree for comprehensive fault collection and escalation. By using daisy-chained configurations of a single fault collection hardware unit design in arbitrary topologies of SoC subsystems, fault collection and monitoring are enabled from distributed components. And by connecting the daisy-chained fault collection hardware units in an escalation tree configuration, fault collection, monitoring, and escalation are enabled in a hierarchical organization of SoC system hardware resources. The arbitrary hierarchy depth enables efficient safe-stating while keeping the unaffected applications running on the SoC system. The disclosed deployment topologies scale well to future complex SoC architectures supporting both distributed and hierarchical organization of SoC system hardware resources.
In selected embodiments, the disclosed fault handling system provides fail-safe fault handling in SoCs implementing a system of systems while keeping available as much of the SoC as possible, thereby ensuring functional safety for the SoC while safe-stating or shutting down only the affected subset of resources. In addition, the disclosed fault handling system guarantees reaching a safe state in a fault tolerant manner.
At the core of the disclosed fault handling system is a fault collection unit (FCU), which is a dedicated IP design that collects, monitors, and escalates faults as needed. Each FCU is connected to receive fault inputs (for collecting faults) and an overflow input (for further connectivity). The FCU generates escalation fault outputs, including an overflow output and local control outputs that can be configured as needed. Both the fault inputs and the escalation fault output have associated signals for Execution Environment (EENV) IDs (EIDs) to direct escalation reactions toward the resource-owning EENVs. EIDs are associated with fault inputs either dynamically, by the EID inputs, or statically, where the EID is configured within the FCU. The overflow output signals that the FCU is registering more faults than it can handle.
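By way of a non-limiting illustration, the externally visible interface of such an FCU can be summarized in software. The following C sketch is a minimal behavioral model only; all names (fcu_t, eid_t, FCU_MAX_FAULT_INPUTS, and the field names) are hypothetical and do not denote the actual hardware implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define FCU_MAX_FAULT_INPUTS 64   /* assumed input width; configurable per instance */

typedef uint8_t eid_t;            /* Execution Environment ID; 0 assumed to be the Global/MEENV ID */

typedef struct {
    /* Per-fault-input configuration */
    bool  edge_based[FCU_MAX_FAULT_INPUTS];  /* edge-latched vs. level-sensitive input */
    bool  static_eid[FCU_MAX_FAULT_INPUTS];  /* EID configured within the FCU vs. taken from the EID input */
    eid_t cfg_eid[FCU_MAX_FAULT_INPUTS];     /* statically configured EID, used when static_eid is set */
    bool  immediate[FCU_MAX_FAULT_INPUTS];   /* immediate vs. delayed escalation */
    uint32_t timeout;                        /* preconfigured countdown for delayed escalation */

    /* Inputs sampled each evaluation step */
    bool  fault_in[FCU_MAX_FAULT_INPUTS];    /* fault inputs (for collecting faults) */
    eid_t eid_in[FCU_MAX_FAULT_INPUTS];      /* dynamic EID inputs */
    bool  overflow_in;                       /* overflow input (for further connectivity) */

    /* Outputs */
    bool  fault_out;                         /* escalation fault output */
    eid_t eid_out;                           /* EID associated with the escalation output */
    bool  overflow_out;                      /* signals more faults than the FCU can handle */
    uint32_t ctrl_out;                       /* generic local control outputs */

    /* Internal state */
    bool  latched[FCU_MAX_FAULT_INPUTS];     /* latched edge-based faults, cleared via register write */
    uint32_t timer;                          /* running countdown timer */
    bool  overflow_state;                    /* cleared by reset or register write */
} fcu_t;
```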
As disclosed herein, FCUs can be connected in different topologies to implement fault collection and fault escalation. Fault collection in a subsystem can be achieved by a single FCU or by multiple FCUs connected in a daisy chain. Fault escalation can be achieved by forming an escalation tree from FCUs. The escalation ensures that a safe state is ultimately reached. Moreover, the escalation process is fault tolerant and thus guarantees that fault handling reaches a safe state even under the presence of a single fault during the fault handling process.
Finally, the proposed FCU implementation and its deployment support SoC architectures in the form of a system of systems. The fault collection and escalation supports SoCs with blocks of IPs, called subsystems, hosting several virtual ECUs, called EENVs. The fault escalation enables grouping of virtual ECUs (vECUs) at multiple levels and therefore scales up well to future SoC/ECU architectures hosting multiple instances of present-day SoCs and logical groupings of vECUs in a hierarchical structure.
In summary, the scalability, fault tolerance, and versatility of the disclosed fault handling solution are achieved by the FCU features and by the topologies for connecting the FCUs together. Each FCU may be fully configurable to serve either in an escalation tree or in a daisy chain. The topologies can be combined to achieve the desired behavior described herein.
To provide a contextual understanding of the disclosed SoC system fault handling apparatus, system, method, and program code, it is helpful to understand the entire effect chain of faults, including errors, failures, and reactions. To this end, the following definitions of fault-related blocks and entities are provided:
A “HW resource” refers to an atomic hardware component in an SoC that performs a certain function. A HW resource cannot be further divided into HW resources.
A “HW Subsystem (SSys)” is a group of HW resources that are logically or physically bound together. The logical bond can be the same functionality (e.g., timers) and the physical bond can be the physical connections (e.g., location in SoC, common power supply line). A HW resource can belong to only one subsystem.
A “Managing Subsystem (MSSys)” is a subsystem that can control resources in other subsystems.
An “Execution Environment (EENV)” is a collection of HW resources that together have the capability to perform certain functionality, including software execution. An EENV must be capable of executing an application, and therefore it must contain resources that execute SW, such as cores, memory ranges, etc.
A “Group of EENVs (GEENV)” is a group of EENVs. Since a GEENV satisfies the definition of an EENV, a GEENV is an EENV. This allows nesting: a GEENV can be a group of GEENVs, as illustrated in the sketch following these definitions.
A “Pure EENV” is an EENV that is not a GEENV.
A “Managing EENV (MEENV)” is an EENV that is dedicated to managing the EENVs within a GEENV. Unlike EENV resource allocation, MEENV resources are predefined; an MEENV uses resources in an MSSys. All MSSys resources are used by the MEENV, and no resource from the MSSys can be allocated to any other EENV within the group of EENVs managed by the MEENV.
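By way of a non-limiting illustration, the recursive relationship among EENVs, GEENVs, and MEENVs defined above can be captured in a simple data model. The following C sketch is illustrative only; the names (eenv_t, eenv_is_pure, and the field names) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative data model of the EENV hierarchy: a pure EENV groups no other
 * EENVs and owns its HW resources directly, while a GEENV groups EENVs (or
 * other GEENVs, allowing arbitrary nesting) and is itself an EENV that is
 * managed by a dedicated MEENV drawn from the managing subsystem (MSSys). */
typedef struct eenv eenv_t;

struct eenv {
    unsigned eid;        /* EENV ID (EID) as interpreted at this hierarchy level */
    eenv_t  *meenv;      /* managing EENV when this EENV is a GEENV; NULL otherwise */
    eenv_t **members;    /* grouped EENVs when this EENV is a GEENV; NULL otherwise */
    size_t   n_members;  /* 0 for a pure EENV */
};

/* A "Pure EENV" is an EENV that groups no other EENVs. */
static inline bool eenv_is_pure(const eenv_t *e) { return e->n_members == 0; }
```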
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
In the automotive sector, SoC designs increasingly host multiple control applications that were originally developed as independent Electronic Control Units (ECUs). Automotive vendors tend to implement zonal architectures in cars because they reduce cost and weight. A zonal architecture requires control functions to be implemented close to the physical function locations. A zonal controller therefore hosts several ECU functions that were previously controlled from independent ECUs.
In order to reduce the likelihood of harm to humans in the case of a failure, automotive ECUs are subjected to functional safety standards, such as ISO 26262 [ISO11], [ISO18], which is the international functional safety standard for the development of electrical and electronic systems in road vehicles. Under such standards, an ECU that experiences a fault must be brought gracefully to a safe state. Ideally, the safe state should, as much as possible, keep the ECU functionality alive and degrade only the affected function so that the ECU remains available. This goal applies also when a zonal controller hosts several ECUs. Both individual ECU availability and independence of ECU safe-stating shall be ensured in order to achieve high overall ECU availability. Both fault detection and fault reaction shall be designed to achieve this goal.
Hosting several virtual ECUs (vECUs) on a single SoC (and thus on a single physical ECU) poses additional challenges for fault handling. For example, the ISO 26262 standard requires that every ECU reduce the occurrence of dangerous failures to an acceptable level. Faults can be detected by both hardware and software detection mechanisms. The scope of fault effects can differ, and thus the scopes of the respective reactions need to differ as well.
In addition, the design trend for future SoCs is to host complex systems of virtual ECUs—a system of systems. As the number of control functions grows and the dependency on those functions increases with new vehicle designs, the demand for higher availability of those functions increases, even in the presence of faults. This will require higher independence of functions and a cascading hierarchy of function groups that ensure only a local or a partial loss of functionality in the case of a failure.
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
When an MEENV (e.g., 36) processes events from EENVs (e.g., EENV1, EENV2, EENV3), it must recognize the origin of any event. Therefore, each EENV is assigned an ID, the EENV ID (EID). The EID is interpreted by the MEENV (e.g., 36) of the GEENV (e.g., 33) where the corresponding EENV is present at the same abstraction level as the MEENV. That means that the semantics of an EID changes at each level. In the example in
It is desirable for faults in the SoC to be responded to within the hierarchy of the SoC so as to contain the fault effect within the smallest group of HW resources and thereby ensure high availability of systems. Since each HW resource can experience a fault, each HW Subsystem (SSys) may contain a fault detection mechanism. Since the EENV is the owner of the HW resource, the EENV should respond to such a detected fault, but if the fault cannot be responded to by the EENV, then the fault should be managed by some higher authority, such as the MEENV of the respective GEENV. Since a GEENV is an EENV, the fault escalation can continue up the hierarchy from EENV to MEENV in the respective GEENV.
A HW resource that can be shared among EENVs must be designed and integrated for sharing; otherwise, the HW resource cannot be shared. For example, a HW resource that is spatially shared among EENVs can provide multiple interfaces that are used by different EENVs as if each EENV were the owner of that resource. In this arrangement, each EENV is the owner of its respective interface only, but the EENV perceives ownership of the resource.
In addition or in the alternative, a HW resource may be shared by temporal partitioning, which is managed by the EENVs or by a shared HW resource that is connected to the temporally shared resource. For instance, a storage controller may be used by multiple EENVs via a spatially shared resource, with the accesses managed by the shared resource applying time switching to access the controller.
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
The SoC 41 may also include a single instance of an interconnect subsystem which connects bus masters in different subsystems to slaves in other subsystems.
The SoC 41 may also include one or more memory subsystems (Memory SS) instances (e.g., Memory SS0, Memory SS1). Each memory subsystem consists of a memory and a memory controller. A memory subsystem is shared among EENVs via the Interconnect subsystem so that every bus master, from its respective EENV, gets its time window to access the memory subsystem.
The SoC 41 may also include one or more communication subsystems (Comm SS) instances (e.g., Comm SS0, Comm SS1). Each communications subsystem contains HW resources for performing CAN, LIN, and FlexRay communication. In the communication subsystem(s), each HW resource is non-shareable.
The SoC 41 may also include one or more Machine Learning Accelerator (ML ACC) instances (e.g., ML ACC0, ML ACC1). Each accelerator subsystem provides two independent interfaces, where each interface can be owned by a different EENV.
The SoC 41 may also include one or more Ethernet instances (Ethernet0, Ethernet1). Each Ethernet subsystem provides two independent interfaces.
The SoC 41 may also include one or more Analog-to-Digital Converter (ADC) instances (e.g., ADC0 and ADC1). Each ADC subsystem can be used by a single EENV only.
The SoC 41 may also include one or more I/O subsystems, such as a GPIO(s) (GPIO0, GPIO1), UART(s), SPI(s), I2C(s), I3C(s), and Timer(s). HW Resources in the I/O subsystems are non-shareable.
The SoC 41 may also include a Security subsystem instance that provides four independent interfaces.
The SoC 41 may also include a single Management Subsystem (MGT Subsystem) instance that contains a core, memory, peripherals, and the necessary infrastructure to run the MEENV.
As will be appreciated, the SoC subsystems may use any suitable partitioning to group subsystems with different EENVs (e.g., EENV1, EENV2, EENV3, and MEENV). For example, and as indicated with the labeling below each SoC subsystem, a first EENV (EENV1) may include CPU0, CPU1, Interconnect interfaces, Memory SS0, a Security interface, ML Accelerator0, Ethernet0, CAN and LIN from Comm SS0, and some I/O subsystems (e.g., timers, SPI, GPIO0). In addition, a second EENV (EENV2) may include CPU2, Interconnect interfaces, Memory SS1, a Security interface, ML Accelerator1, Ethernet1, CAN and LIN from Comm SS1, and some I/O subsystems (e.g., timers, I2C, GPIO1). In addition, a third EENV (EENV3) may include CPU3, an Interconnect interface, Memory SS1, ADC0, I3C, and SPI. Finally, a fourth EENV (MEENV) may include the Management Subsystem.
The employment of the architecture in
Furthermore, two or more instances of this SoC architecture can be integrated into a single SoC forming a system of SoCs. If each SoC has its own linear physical memory space, EENVs that are not GEENVs can own resources from a single SoC only.
Another SoC architecture supporting a hierarchical GEENV structure is a tile-based architecture. In such cases, the SoC consists of tiles connected into a 2D grid network, where each tile is a full-featured computing system. Resources in tiles can be used to form EENVs. Moreover, a group of tiles can be used to form a GEENV. Furthermore, the GEENV can belong to a supergroup of tiles forming another GEENV—a group of GEENVs, and so on.
Based on the example architecture of an SoC and EENV partitioning in
Given the variety and complexity of SoC subsystem partitioning possibilities, newer SoC architectures pose challenges for fault handling while simultaneously supporting the demand for higher availability of the vECU control functions under presence of faults. To address these challenges, selected embodiments of the present disclosure introduce a single hardware fault collection unit (FCU) block that is deployed in individual SoC subsystems so that multiple FCUs are connected to one another to form fault handling schemas that provide a fault-tolerant architecture for an arbitrary depth of hierarchy and an arbitrary number of faults originating in the SoC system. In
At each SoC subsystem, an instance of the first FCU (FCU1) reacts to faults and monitors the reactions for their successful completion. To that end, detected faults from different HW resources in a subsystem shall be both indicated to the owning EENV and at the same time collected for monitoring. If the pure EENV that owns the failing HW resource cannot handle the fault, the fault must be further escalated to the MEENV (e.g., via FCU2) in order to either shut down or safe-state the owning EENV, including all its HW resources, or the containing subsystem must be safe-stated and all affected EENVs informed or even shut down. This escalation can continue through the hierarchy of GEENVs.
For an improved understanding of selected embodiments of the present disclosure, reference is now made to
At the FCU 51, each fault input (FAULT) is associated with a fault EENV ID (FAULT EID), which may be a non-negative integer identifying the EENV owning the failing resource. In addition, there may be a special ID associated with each MEENV. For example, the MEENV ID may be zero. The MEENV ID is also called the Global ID because it means that the fault has a global effect and does not affect only one EENV. Each EID may be statically configured within the FCU for each fault input or dynamically generated by the SoC subsystem; this behavior is configurable for each fault input. An EID is also generated for the fault escalation output. Fault inputs are either active (to signal a fault) or inactive. Faults can also be signaled by edges, and such inputs need to be configured as edge-based. An activated fault signal becomes inactive once the fault is resolved in the subsystem: the fault signal is either deactivated (for level-based faults) or cleared (for edge-based faults).
In operation, the FCU 51 may monitor active faults to determine if they are deactivated in a timely fashion. To this end, each FCU 51 may contain a timer that measures how long a fault is active. For example, if a count-down timer reaches zero, the fault along with its EID is escalated further via the FCU outputs 55 (Fault, Fault EID). In effect, a fault that is monitored by the FCU timer is configured for delayed escalation. Alternatively, a fault input can be configured for immediate escalation, in which case such a fault is escalated immediately upon fault activation. Finally, the FCU 51 may be configured to generate the overflow signal 54 in response to two or more faults with different EIDs being collected and active in the FCU 51. Hence, the FCU 51 can handle only faults of a single EENV at a time.
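The monitoring and escalation behavior described above may be illustrated, again without limitation, as one evaluation step of the hypothetical fcu_t model sketched earlier. The fcu_step function below is an assumed behavioral approximation, not the disclosed circuit.

```c
/* One evaluation step of the behavioral FCU model (builds on the fcu_t
 * sketch above; illustrative only). Collects active faults, detects EID
 * conflicts (overflow), and escalates immediately or when the countdown
 * timer expires. */
void fcu_step(fcu_t *f)
{
    bool any_active = false;
    bool eid_valid = false;
    eid_t active_eid = 0;

    /* An asserted overflow input propagates directly to the overflow output. */
    if (f->overflow_in)
        f->overflow_state = true;

    for (int i = 0; i < FCU_MAX_FAULT_INPUTS; i++) {
        if (f->edge_based[i] && f->fault_in[i])
            f->latched[i] = true;        /* latched until cleared by register write */
        bool active = f->edge_based[i] ? f->latched[i] : f->fault_in[i];
        if (!active)
            continue;

        eid_t eid = f->static_eid[i] ? f->cfg_eid[i] : f->eid_in[i];
        if (eid_valid && eid != active_eid)
            f->overflow_state = true;    /* concurrent faults of different EENVs */
        active_eid = eid;
        eid_valid = true;
        any_active = true;

        if (f->immediate[i]) {           /* immediate escalation upon activation */
            f->fault_out = true;
            f->eid_out = eid;
        }
    }

    if (any_active) {
        if (f->timer == 0) {
            f->timer = f->timeout;       /* start timing the active fault(s) */
        } else if (--f->timer == 0) {    /* fault not resolved in time */
            f->fault_out = true;         /* delayed escalation */
            f->eid_out = active_eid;
        }
    } else {
        f->timer = 0;                    /* all faults resolved; stop the timer */
        f->fault_out = false;            /* escalation output de-asserts */
    }

    f->overflow_out = f->overflow_state;
}
```

In this sketch, two concurrently active faults with different EIDs drive the overflow state, matching the constraint that the FCU handles faults of only a single EENV at a time.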
In selected embodiments, the FCU 51 can be configured to generate additional generic control signals 56 that may be used for programmable local reactions. While the FCU 51 monitors whether a fault is resolved by the EENV owning the faulty resource, the FCU 51 does not react to the fault itself; it only escalates. However, the control signals 56 can be used by the FCU 51 when immediate local safe-stating is needed to prevent dangerous events. In such a case, the FCU 51 can serve as the initiator of the local safe-stating, such as by using the control signals 56 to pull the reset signal of the failing resource. Still, the owning EENV is expected to finalize the fault reaction, including clearing the fault.
Without belaboring the specific circuit implementations that would be understood by those skilled in the art, the high-level requirements for each FCU instance 51 may include generating an overflow output signal 54 which indicates that two faults with different EIDs are present at the FCU inputs 53 or that the overflow input signal 52 was asserted. In particular, each FCU instance 51 may be connected to receive an overflow input 52 which, when asserted, puts the FCU 51 in an overflow state which results in the assertion of the overflow output 54. The overflow status may be cleared by an FCU reset operation or by writing to a register in the FCU.
In selected embodiments, the FCU 51 may have a configurable number of fault inputs 53, where each FCU fault input shall have a corresponding configurable Fault EID input which is associated only with that FCU fault input. The fault EID may be statically configured within the FCU 51, or the EID input for the respective fault input may be used. Each FCU fault input 53 may be configurable as edge-based or level-based. For level-based inputs, the fault is asserted while the fault input is active. Edge-based fault inputs latch the edge and keep the fault asserted until cleared in an FCU register.
In selected embodiments, the FCU 51 may have an FCU output 55 with a fault escalation output that is asserted in the case of escalation. In addition, each FCU escalation output shall have its associated EID output.
Every FCU fault input 53 shall be configurable for immediate escalation or delayed escalation. For immediate escalation, the fault is immediately escalated to the FCU escalation output; for delayed escalation, the fault is escalated after a preconfigured time. To this end, the FCU 51 includes a countdown timer that monitors the duration of assertion of faults with the same EID. When the time elapses, the FCU 51 triggers escalation with the respective EID, resulting in delayed escalation. Concurrent faults with different EIDs lead to overflow; hence, all faults being timed by the FCU necessarily have the same EID, because if the EIDs were different, the FCU 51 would have already moved to the overflow state. The countdown timer is stopped when all FCU fault inputs are de-asserted (deactivated or cleared within the FCU 51).
In addition, the FCU 51 may have logical outputs that can be configured to be asserted for an arbitrary combination of asserted fault inputs and the escalation output.
As disclosed herein, multiple instances of the FCU 51 may be deployed across a plurality of SoC subsystems in different SoC architectures to enable fault-tolerant scalable fault collection, fault monitoring, fault escalation, and escalation reaction processing. For example, reference is now made to
The daisy-chain connected FCUs 61-63 can be used to extend the number of faults monitored within a subsystem or to collect faults in a subsystem with sparse HW resources. In the case that a subsystem has many different detected faults and each FCU is implemented to provide 64 inputs, the FCUs can be daisy-chained to monitor 128, 192, . . . faults.
Furthermore, daisy-chain connected FCUs 61-63 can be used to enable monitoring of multiple faults at the same time. An FCU transitions to, and signals, the overflow state if two inputs with different EIDs are asserted at the same time. When multiple FCUs are used in a subsystem, concurrent faults with different EIDs asserted at different FCUs do not result in overflow because no two fault indications with different EIDs are present at a single FCU. Hence, the daisy chain extends the number of faults affecting different EENVs that can be concurrently present within the subsystem without an escalation.
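A non-limiting software sketch of such a daisy chain, reusing the hypothetical fcu_t model and fcu_step function above, may connect the escalation and overflow outputs of each FCU to the next FCU in the chain:

```c
#define CHAIN_LEN 3   /* e.g., FCUs 61-63 */

/* Builds on the hypothetical fcu_t/fcu_step sketches above. */
void daisy_chain_step(fcu_t chain[CHAIN_LEN])
{
    for (int k = 0; k < CHAIN_LEN; k++) {
        fcu_step(&chain[k]);
        if (k + 1 < CHAIN_LEN) {
            /* The escalation output of FCU k drives a fault input of FCU k+1,
             * carrying the EID along; that input would be configured for
             * immediate (pass-through) escalation with a dynamic EID. */
            chain[k + 1].fault_in[0] = chain[k].fault_out;
            chain[k + 1].eid_in[0]   = chain[k].eid_out;
            /* Overflow outputs are chained through the overflow inputs. */
            chain[k + 1].overflow_in = chain[k].overflow_out;
        }
    }
    /* The terminating FCU's fault_out, eid_out, and overflow_out represent
     * the subsystem as a whole toward the next level of the hierarchy. */
}
```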
In the example in
In addition to using daisy-chained FCUs, multiple instances of FCUs may be deployed to collect faults, monitor their resolution, and escalate the fault reaction when required. For example, reference is now made to
As illustrated with reference to the example GEENV 31 shown in
Next, there can be multiple SoCs in a single package along with the management subsystem MSSys managing those GEENVs. If a GEENV does not manage its fault in time, the fault is escalated to the FCU 77 of the package. As disclosed herein, the Fault EID can change its meaning from being an ID of a pure EENV (e.g., EENV1) to an ID of some subgroup of pure EENVs, depending on the implemented bit-width of the Fault EID and the needs of the architecture. The Fault EID might simply identify the GEENV from which the escalation is coming. This can be achieved either by using the Global EID or the EID of the GEENV; the selected approach affects the overflow mechanism. In the case of using the Global EID, multiple escalations from different GEENVs will be perceived as events of a single EID and overflow will not happen; further escalation will happen only if the respective MEENV does not handle all escalations in time. In the latter case of using a different EID for each GEENV, overflow will happen if two or more GEENVs escalate their faults. Overall, the selected hierarchy implementation is use-case dependent.
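For illustration, a two-level escalation tree built from the hypothetical fcu_t model may remap EIDs at the level boundary, with the choice between the Global EID and per-GEENV EIDs selecting the overflow behavior discussed above:

```c
#define N_LEAVES 3    /* e.g., one leaf FCU per subsystem or GEENV */

/* Builds on the hypothetical fcu_t/fcu_step sketches above. The parent's
 * fault inputs are assumed configured for dynamic EIDs (static_eid = false). */
void tree_step(fcu_t leaves[N_LEAVES], fcu_t *parent, bool use_global_eid)
{
    for (int i = 0; i < N_LEAVES; i++) {
        fcu_step(&leaves[i]);
        parent->fault_in[i] = leaves[i].fault_out;
        /* EID semantics change at the level boundary: either collapse all
         * escalations to the Global EID (0) or tag each leaf/GEENV with its
         * own EID at the parent's abstraction level. */
        parent->eid_in[i] = use_global_eid ? (eid_t)0 : (eid_t)(i + 1);
    }
    fcu_step(parent);
    /* An asserted parent->fault_out means the MEENV did not resolve the
     * escalation in time; at the top of the tree this would reset or
     * safe-state the device. */
}
```

With use_global_eid set, concurrent escalations from different leaves carry the same EID and do not overflow the parent; with per-GEENV EIDs, two concurrent escalations drive the parent into overflow.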
With reference to the escalation tree configuration of FCUs 71-77 shown in
With the disclosed escalation tree configuration of FCUs, the fault monitoring and escalation schema has a fault-tolerant capability because, when any fault reaction fails, the escalation ensures reaching a safe state. Stated more generally, consider an escalation tree that includes a first level Fault Handling 1 (FH1) through an nth level Fault Handling n (FHn), where n is an integer and 1≤n<∞. For n=1, a fault of a HW resource is reported to the EENV that owns it and to the FCU. In the case of escalation, the FCU escalation output shall reset or safe-state the entire SoC package. All SoC resources belong to a single EENV and there is no MEENV, so the FCU resets the whole SoC.
In comparison, for n=2, the FCU instance at the second level Fault Handling 2 (FH2) invokes a reaction at the respective MEENV, and if the reaction does not work, it escalates to a reaction that resets the SoC. The FCUs at FH1 monitor the reactions within pure EENVs and escalate to the level-two FCU if needed.
In order to analyze the scenarios of fault handling failure, the reaction failure at FHx is denoted as FHx_r, and the escalation failure at FHx is denoted as FHx_e.
For n=2, the fault handling fails if the following condition is satisfied:
FH1_r ∧ (FH1_e ∨ (FH2_r ∧ FH2_e))
In turn, this condition can be transformed to:
(FH1_r ∧ FH1_e) ∨ (FH1_r ∧ FH2_r ∧ FH2_e)
Hence, the fault handling fails if either both the FH1 reaction and the FH1 escalation fail (the conjunction FH1_r ∧ FH1_e), or the FH1 reaction fails and, after escalation, the FH2 reaction and the FH2 escalation both fail (the conjunction FH1_r ∧ FH2_r ∧ FH2_e).
The fault handling failure can be generalized in two forms. The first form is recursive. If we let FHx* mean the failure of fault handling from FHx up to FHn, this means that a combination of faults in fault handling made the fault reaction and escalation schema fail. We can then write:
FHx* = FHx_r ∧ (FHx_e ∨ FH(x+1)*), with the base case FHn* = FHn_r ∧ FHn_e
The other form unrolls the recursiveness:
FH1* = (FH1_r ∧ FH1_e) ∨ (FH1_r ∧ FH2_r ∧ FH2_e) ∨ . . . ∨ (FH1_r ∧ FH2_r ∧ . . . ∧ FHn_r ∧ FHn_e)
This formula represents the fault handling failure in the canonical disjunctive normal form. When a conjunction formula evaluates to true, it defines a scenario for fault handling failure. Individual conjunction formulas have the form:
(FH1_r ∧ FH2_r ∧ . . . ∧ FHx_r) ∧ FHx_e
The left part (FH1_r ∧ FH2_r ∧ . . . ∧ FHx_r) expresses that the fault reactions have failed at every level from FH1 up to FHx. The right part (FHx_e) expresses that, in addition, the escalation at FHx has failed, so the fault is not propagated to the next level and no safe state is reached.
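The equivalence of the recursive and unrolled forms can be checked mechanically by enumerating all combinations of reaction and escalation failures. The following self-contained C program is an illustrative brute-force check for n=3; it is not part of the disclosed system.

```c
/* Brute-force check (for n = 3) that the recursive and unrolled forms of the
 * fault handling failure condition are equivalent. r[x] / e[x] encode the
 * reaction / escalation failure at level x+1. Illustrative only. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define N 3

static bool recursive(const bool r[N], const bool e[N], int x)
{
    if (x == N - 1)
        return r[x] && e[x];                      /* FHn* = FHn_r ∧ FHn_e */
    return r[x] && (e[x] || recursive(r, e, x + 1));
}

static bool unrolled(const bool r[N], const bool e[N])
{
    bool fail = false;
    for (int x = 0; x < N; x++) {                 /* disjunction over (r1 ∧ ... ∧ rx) ∧ ex */
        bool conj = e[x];
        for (int i = 0; i <= x; i++)
            conj = conj && r[i];
        fail = fail || conj;
    }
    return fail;
}

int main(void)
{
    for (unsigned m = 0; m < (1u << (2 * N)); m++) {
        bool r[N], e[N];
        for (int x = 0; x < N; x++) {
            r[x] = (m >> x) & 1u;
            e[x] = (m >> (N + x)) & 1u;
        }
        assert(recursive(r, e, 0) == unrolled(r, e));
    }
    puts("recursive and unrolled failure conditions agree for n = 3");
    return 0;
}
```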
Referring back to
To enable fault escalation, each FCU1 instance may be connected directly or indirectly to the FCU2 instance in the Management Subsystem (MGT Subsystem), thereby forming an escalation tree. For example, each FCU1 instance from the CPU subsystems may be directly connected (not shown) to provide a Fault, Fault EID, and overflow signal to the FCU2 instance.
By contrast, the I/O subsystem includes sparse IP instances distributed within the SoC 41, such as GPIOs, UARTs, SPI, I2C, I3C, and Timers. The corresponding FCU1 instances can be connected in a daisy chain which ends in a terminating FCU1 instance that is directly connected to the FCU2 instance in the Management Subsystem to propagate any escalation from the I/O subsystem. For example, a first daisy chain may be formed with connections (not shown) between FCU1 instances in the UARTs, GPIO1, I3C, SPIs, and Timers subsystems to terminate in the FCU1 instance of the Timers block. In another example, a second daisy chain may be formed with connections (not shown) between FCU1 instances in the GPIO0 and I2C subsystems to terminate in the FCU1 instance of the I2C block.
In operation, the FCU1 instance in the Management Subsystem monitors the resolution of faults that the MEENV responds to. If the MEENV itself cannot handle a fault, that fault cannot be escalated to the MEENV. Therefore, escalations from the FCU1 instance of the Management Subsystem are configured at the FCU2 instance for immediate escalation. Since there is no upper-level Fault Handling 3 that would resolve an escalation from the GEENV managed by the MEENV, any escalation from the FCU2 instance leads to a device reset.
As disclosed herein, each of the FCU1 and FCU2 instances may use the same FCU design. The FCU versatility allows the collection of faults in a very flexible manner. In addition, FCU instances are connected either in a daisy chain or in a tree topology, where the former enables efficient fault collection and the latter enables escalation.
By now it should be appreciated that there has been provided a method, architecture, circuit, and system-on-chip for monitoring, handling, and escalating faults from a plurality of SoC subsystems, including a first management SoC subsystem and a first plurality of SoC subsystems connected together on a shared semiconductor substrate. In the disclosed system, a plurality of fault collection unit (FCU) instances is deployed with at least one FCU instance at each SoC subsystem, where each FCU instance is configured to monitor one or more fault input signals at one or more fault inputs, to generate a local control signal for one or more hardware resources controlled by the FCU instance, and to escalate any unresolved fault on a fault output. In this system, a first plurality of FCU instances deployed at the first plurality of SoC subsystems are each connected in a fault escalation tree with an escalation FCU instance deployed at the first management SoC subsystem by connecting the fault outputs from the first plurality of FCU instances to the one or more escalation fault inputs of the escalation FCU instance deployed at the first management SoC subsystem. In addition, a second plurality of FCU instances deployed at the first plurality of SoC subsystems are connected in a fault collection daisy-chain by connecting the fault output from each FCU instance in the second plurality of FCU instances to a fault input of a succeeding FCU instance in the fault collection daisy-chain except for a terminating FCU instance in the fault collection daisy-chain which has a fault output connected to a fault input of the escalation FCU instance deployed at the first management SoC subsystem. In selected embodiments, each FCU instance is further configured to generate an overflow signal at an overflow output of said FCU instance in response to receiving an overflow signal at an overflow input of said FCU instance. In other embodiments, each fault input signal received at a fault input of an FCU instance includes a fault signal and an associated fault execution environment ID (EID) value which identifies a collection of hardware resources that perform a specified software function and that generated the fault signal. In such embodiments, each FCU instance may be configured to generate an overflow signal at an overflow output of said FCU instance if two or more fault input signals having different fault EID values are received at the one or more fault inputs of said FCU instance. In selected embodiments, a first fault input signal at a first fault input of a first FCU instance at a first SoC subsystem signals that a first fault is detected at one or more failing hardware resources of the first subsystem. In such embodiments, the first FCU instance may be further configured to monitor the first fault input signal to detect if the first fault was handled by a first execution environment (EENV) which owns the one or more failing hardware resources within a specified time window. In such embodiments, the escalation FCU instance deployed at the first management SoC subsystem may be configured to monitor the one or more escalation fault inputs and to generate a fault escalation signal to a managing execution environment (MEENV) which manages one or more EENVs when the first EENV has not handled the first fault.
In another form, there is provided a fault collection and reaction method, architecture, circuit, and system-on-chip which includes a plurality of SoC subsystems with a plurality of fault collection unit (FCU) instances deployed with at least one FCU instance at each SoC subsystem. In the disclosed method, a first fault signal is generated by one or more hardware resources at a first SoC subsystem in response to a first fault. In addition, the disclosed method provides the first fault signal to a first FCU instance at the first SoC subsystem and to a first execution environment (EENV) which owns the one or more failing hardware resources at the first SoC subsystem. In addition, the disclosed method monitors, at a first fault input of the first FCU instance, the first fault signal to detect if the first EENV handles the first fault within a specified fault handling time interval. In addition, the disclosed method generates, by the first FCU instance, a first fault output signal if the first FCU instance detects that the first EENV does not handle the first fault within the specified fault handling time interval. In addition, the disclosed method provides, over a first fault output of the first FCU instance, the first fault output signal to a second FCU instance at one of the plurality of SoC subsystems. In selected embodiments, the second FCU instance is an escalation FCU instance that is connected in a fault escalation tree with the first FCU instance, and where the first fault output signal generated by the first FCU instance is provided to the escalation FCU instance that is deployed at a first management SoC subsystem for the plurality of SoC subsystems. In other selected embodiments, the first fault output signal generated by the first FCU instance is provided to a second FCU instance at a second SoC subsystem connected in a daisy-chain with the first FCU instance to collect and monitor faults from the first and second SoC subsystems. In selected embodiments, the disclosed method generates, by the first FCU instance, an overflow signal at an overflow output of the first FCU instance in response to receiving an overflow signal at an overflow input of the first FCU instance. In other selected embodiments, the first FCU instance is connected and configured to monitor a plurality of fault signals at a plurality of fault inputs, to generate a local control signal for one or more hardware resources controlled by the first FCU instance, and to escalate any unresolved fault on the first fault output. In such embodiments, each fault signal received at a fault input of the first FCU instance may include a fault signal and an associated fault execution environment ID (EID) value which identifies a collection of hardware resources that perform a specified software function and that generated the fault signal. In such embodiments, the first FCU instance may be further configured to generate an overflow signal at an overflow output of the first FCU instance if two or more fault signals having different fault EID values are received at the plurality of fault inputs of the first FCU instance. In selected embodiments, the escalation FCU instance may also monitor a plurality of fault outputs generated by the first FCU instance and one or more additional FCU instances, and may then generate a fault escalation signal to a managing execution environment (MEENV) which manages one or more execution environments (EENVs) when the first EENV has not handled the first fault.
In yet other embodiments, there is provided a fault collection and handling system, method, architecture, and circuit for a system-on-chip (SoC) which includes a plurality of SoC subsystems. The disclosed fault collection and handling system includes a first plurality of fault collection unit (FCU) instances deployed with at least one FCU instance at each of the plurality of SoC subsystems. In addition, the disclosed fault collection and handling system includes an escalation FCU instance deployed at a first management SoC subsystem. As disclosed, each of the first plurality of FCU instances includes one or more fault inputs and a fault output, and is configured to monitor one or more fault input signals at the one or more fault inputs, to generate a local control signal for one or more hardware resources controlled by the FCU instance, and to generate a fault output signal to escalate any unresolved fault on the fault output. In addition, the first plurality of FCU instances is connected in a fault escalation tree with the escalation FCU instance by connecting the fault outputs from the first plurality of FCU instances to the one or more escalation fault inputs of the escalation FCU instance. In selected embodiments, the disclosed fault collection and handling system may also include a second plurality of FCU instances deployed at the plurality of SoC subsystems, where the second plurality of FCU instances are connected in a fault collection daisy-chain by connecting the fault output from each FCU instance in the second plurality of FCU instances to a fault input of a succeeding FCU instance in the fault collection daisy-chain except for a terminating FCU instance in the fault collection daisy-chain which has a fault output connected to a fault input of the escalation FCU instance deployed at the first management SoC subsystem. In selected embodiments, each of the plurality of FCU instances is further configured to generate an overflow signal at an overflow output of said FCU instance in response to receiving an overflow signal at an overflow input of said FCU instance. In selected embodiments of the disclosed fault collection and handling system, each of the one or more fault input signals may include a fault signal and an associated fault execution environment ID (EID) value which identifies a collection of hardware resources that perform a specified software function and that generated the fault signal. In addition, the fault output signal may include a fault signal and an associated fault execution environment ID (EID) value which identifies a collection of hardware resources that perform a specified software function and that generated the fault signal.
The block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams, and combinations of blocks in the block diagrams, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from this invention and its broader aspects. Therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of this invention. Furthermore, it is to be understood that the invention is solely defined by the appended claims. It will be understood by those with skill in the art that if a specific number of an introduced claim element is intended, such intent will be explicitly recited in the claim, and in the absence of such recitation no such limitation is present. For non-limiting example, as an aid to understanding, the following appended claims contain usage of the introductory phrases “at least one” and “one or more” to introduce claim elements. However, the use of such phrases should not be construed to imply that the introduction of a claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an”; the same holds true for the use in the claims of definite articles. As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
Aspects of the present invention are described hereinabove with reference to block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. In certain implementations, a system on a chip or SoC may be implemented.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Although the described exemplary embodiments disclosed herein focus on an example fault collection and handling system, the present invention is not necessarily limited to the example embodiments illustrated herein. For example, various embodiments of the disclosed fault handling techniques may be applied in any suitable fault handling systems, and not just automotive vehicle systems, and may use additional or fewer circuit components than those specifically set forth. Thus, the particular embodiments disclosed above are illustrative only and should not be taken as limitations upon the present invention, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Accordingly, the foregoing description is not intended to limit the invention to the particular form set forth, but on the contrary, is intended to cover such alternatives, modifications and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims so that those skilled in the art should understand that they can make various changes, substitutions and alterations without departing from the spirit and scope of the invention in its broadest form.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or element of any or all the claims. As used herein, the terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Number | Date | Country | Kind
202341089675 | Dec 2023 | IN | national