Computer servers of modern safety- and security-critical applications are challenged by arbitrary faults. Such faults can include malicious cyber threats (e.g., spoofing, unauthorized data access, state modification, deadlock, or instruction stream alteration), exploitation of design flaws, and vulnerabilities in a global supply chain. In addition to design flaws, an under-constrained design methodology can create opportunities for unanticipated system stimuli that can cause unspecified consequences. Further, supply chain assurance is a growing concern, as fewer trusted foundries may exist, and counterfeit, cloned, over-produced, and recycled components have entered the supply chain of programs with a thorough chain-of-custody from trusted suppliers. Computer servers are a common target for malicious attack because they are critical shared resources. Thus, they are at risk, with broad consequences in disruption of service or data compromise.
The systems and methods disclosed herein provide reliable fault tolerance solutions. One example embodiment is a system for trusted integration of untrusted components. The example system includes at least three electrical components and voting (consensus) circuitry. The components have varied hierarchical implementations for providing common output given common input. The voting circuitry is configured to receive, as input, outputs from the components and provide a consensus output that is a majority of the outputs received from the components. The electrical components of the system can be digital components and the voting circuitry can be analog circuitry. The varied hierarchical implementations of the components can include, for example, any of differing processor instruction sets, differing register sets, and differing address schemes.
In some embodiments, each component can include a processor having an input queue, state memory, state machine, and output queue. The output queues can provide input to the voting circuitry. The state machines can be configured to interpret headers in data of the input queue, where the headers indicate the source and nature of the data. Each state memory can be configured to permit direct memory access for fault recovery. In the event that the voting circuitry detects a fault by a given component, the state memory of the given component can be overwritten with data from a state memory of a component that satisfied the consensus. In some embodiments, the components that satisfied the consensus can be enabled to proceed to a next state while the given component is recovered, provided that enough components satisfied the consensus to protect against additional faults.
The system can include a timer to ensure completion of output queue data of each output queue before the processors can proceed to a next state. Such a timer can be associated with the voting circuitry, and the voting circuitry can be configured to provide an interrupt to cause the processors to proceed to the next state. The voting circuitry can be configured to reboot a component that fails repeatedly, and to reboot the system in an event a consensus output cannot be obtained.
The voting circuitry can include, for each bit of output across the components, a voting input stage, a transfer stage, and an accumulating stage. The voting input stage can include at least three input switched capacitors corresponding to the components. The input switched capacitors can be configured to receive, as input, a bit of output across the components. The transfer stage can include transfer switched capacitors corresponding to the input switched capacitors. The transfer switched capacitors can be configured to charge a voting capacitor corresponding to each input switched capacitor during a state of a clock signal. The accumulating stage can include accumulating switched capacitors connecting the voting capacitors in series. The accumulating switched capacitors can cause the charges of the voting capacitors to be accumulated during an alternate state of the clock signal. The accumulated charge of the voting capacitors can represent the consensus output of the bit of output across the components.
Another example embodiment is a method of providing trusted integration of untrusted components. The method includes integrating at least three electronic components into a system. The components have varied hierarchical implementations for providing common output given common input. The method further includes providing outputs from the components as input to voting circuitry to provide a consensus output that is a majority of the outputs of the components. Providing outputs from the components as input to voting circuitry can include, for each bit of the outputs across the components, (i) providing one bit of output from the at least three components as inputs to at least three voting inputs, each in the form of high or low logical bits, (ii) converting the voting inputs to analog voltages, resulting in analog voting voltages, and (iii) accumulating the analog voting voltages, resulting in an accumulated analog voting voltage. The accumulated analog voting voltage represents the consensus output of the bit of output across the components.
Another example embodiment is a system for trusted integration of untrusted components. The example system includes at least three processors, each processor including an input queue, state memory, state machine, and output queue having varied hierarchical implementations for providing common output given common input. The system also includes voting circuitry that includes, for each bit of output across the components, a voting input stage, a transfer stage, and an accumulating stage. The voting input stage includes at least three input switched capacitors corresponding to the components. The input switched capacitors are configured to receive, as input, a bit of output across the output queues. The transfer stage includes transfer switched capacitors corresponding to the input switched capacitors. The transfer switched capacitors are configured to charge a voting capacitor corresponding to each input switched capacitor during a state of a clock signal. The accumulating stage includes accumulating switched capacitors connecting the voting capacitors in series. The accumulating switched capacitors are configured to cause the charges of the voting capacitors to be accumulated during an alternate state of the clock signal. The accumulated charge of the voting capacitors represents the consensus output of the bit of output across the output queues.
The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
A description of example embodiments follows.
Commoditized commercial-off-the-shelf (COTS) processors are well supported by modern operating systems and offer long product lifecycles for implementation in servers. Client-server applications typically employ state-machine-based implementation of a software server process. In the Internet of Things (IoT), for example, high availability and reliability of a server is paramount for critical applications across distributed computing. Data, materials, and services are interconnected throughout the world, adding many new dimensions to well-established concerns of service disruption by equipment failure, environmental catastrophe, or malicious intrusion.
Trust Vulnerabilities
Computer servers of modern safety- and security-critical applications are challenged by arbitrary faults. Such faults can include malicious cyber threats, exploitation of design flaws, and vulnerabilities in a global supply chain. Cyber-attacks can include spoofing, unauthorized data access, state modification, deadlock, or instruction stream alteration. Malware has been met by a subscription business model of detection and patching against an accumulated catalog of threats, but this solution will always lag malware development and it degrades computational performance. In addition to design flaws, an under-constrained design methodology can create opportunities for unanticipated system stimuli that can cause unspecified consequences. Extended iterations of custom design and trusted fabrication at the high complexity of modern processors inevitably suffer from new exploitable flaws. Supply chain assurance is a growing concern, as fewer trusted foundries may exist, and counterfeit, cloned, over-produced, and recycled components have entered the supply chain of programs with a thorough chain-of-custody from trusted suppliers. Further, malicious Trojan logic or selectively adulterated fabrication can escape manufacturing testing and be deployed for ultimate activation/failure. Finally, insider threat in the development process is difficult to eliminate, even with trusted foundries.
Verification Methodology for Trusted Logic
ASIC design methodology of functional verification by comparison to an independently developed model is commonly used to flag bugs. That is, equivalent but diverse models developed from a single specification must agree in function. This concept is as useful for software as it is for hardware. Complex control path architectures with many corner cases, as is the case for a processor, are much harder to fully verify than pipelined, regular data path architectures. With time-to-market being a pressing need, modern complex commercial ASICs are released after constrained-random verification coverage, which samples the distinct test cases that customers are most likely to exercise and that are most critical to them, rather than after exhaustive verification coverage, which would require an unacceptable number of years of verification. There is widespread acceptance in industry today that every complex ASIC tape-out has remaining unfound bugs, however minor. Formal verification methods can be used to ascertain that specific vulnerabilities do not exist, but this continues to be limited by computational complexity and by the need to characterize both the model and a known vulnerability. Synthesizable assertions can also be extended from ASIC/FPGA verification to validation and deployed operation to assure that unspecified behavior does not occur. This has been employed in custom solutions for trusted microelectronics.
Fault-Tolerance Approaches to Trusted Server Operation
Computer servers are a common target for malicious attack because they are critical shared resources. Thus, they are at risk, with broad consequences in disruption of service or data compromise. Fault-tolerant approaches for highly-available services are means of exploiting distributed computing for replication and consensus of server state machines. Recovery can occur by acquiring a consensus state from a non-faulty processor replica. Faults can be arbitrary; that is, the precise cause does not require determination for a solution to be rendered. Fault-tolerant computing has matured in space applications, where a single event upset of digital computation is not uncommon. It is also useful for critical data applications for which distributed computing is not co-located, providing protection from earthquake, tsunami, power grid outage, or other disasters. Fault-tolerant computing concepts can be extended to modern multicore processor architectures, which can be adequate for faults due to single event upset. However, this does not consider other formidable vulnerabilities. Equivalent, but diverse, model comparison used in verification methodology can be extended to fault-tolerant computing. Binary diversity on multicore processors can be used for detection of software intrusion. The notion of binary diversity is that any fault due to a cyber-attack or malware would not occur in the same way or at the same time across different cores. This is of conceptual interest, but inadequate for the many other possible vulnerabilities on identically replicated silicon design. That is, it is not sufficient to ensure Byzantine resilience from any arbitrary fault(s).
Diverse System Integration for Trusted Fault-Tolerance
Fault-tolerant principles posit that 2F+1 replicated state machines in consensus can permit F faults at every comparison with stable operation. For trusted operation, a distinct set of faults that can be detected by comparison of state machine replica output must be a superset of possible vulnerabilities. However, vulnerabilities can exist at various levels of an architecture's implementation. Therefore, implementation diversity of replicated state machines at appropriate layers of vulnerability can provide trusted operation for a fault tolerant architecture. A sufficiently diverse fault-tolerant solution can address all levels of vulnerability, e.g., compiler, operating system, processor architecture, digital logic design, fabrication technology, and foundry. Rather than presuming that trusted operation is designed into trusted components, one can consider the trusted integration of untrusted COTS components. This can apply to hardware and software. COTS voting replicas that have varied hierarchical implementation can be integrated into a single, trusted fault-tolerant server if all replicated state machines see the same input at the same time and have consensus on state machine output. This greatly simplifies the distributed computing paradigm of fault tolerance, where a state machine would otherwise never be certain if all others have seen the same input and in the same order.
A diversity of multiple untrusted COTS system components (hardware and/or software) engaged in redundant operation can be integrated as a single consensus-based trusted system with a high degree of fault tolerance to, for example, unforeseen environmental interference, cyber-attack, supply chain counterfeit, inserted Trojan logic, or component design flaws. The degree of fault tolerance can be increased by increasing the degree of diversity of redundant operational nodes or by increasing the number of diversely implemented operational nodes.
Input is captured on Input FIFOs (queues) 105 of sufficient size for identically-ordered sequential processing at the server application bandwidth. Data units on the FIFOs 105 can have headers indicating the source and nature of payload data. These data units can be constructed for input to an amalgamated server to facilitate generalization from any incorporating system input transceiver or bus. Each processor 115 has dedicated state memory 110 for reference and update when evaluating input. This memory 110 can also provide a simplified recovery mechanism when there is a fault by permitting Direct Memory Access (DMA) from the state memory 110 of a consensus processor 115. A timer in a voting (consensus) circuit 125 can ensure completion of all candidate state machine output 120. Upon providing candidate state machine output to FIFOs 120 and notifying the voting circuit 125, processors 115 can await an interrupt from the voting circuit 125 to proceed to the next state. The voting circuit 125 can concurrently step through each data word on all candidate output FIFOs 120, performing exclusive-OR to check for a violation of consensus. Checksum comparison is not advised, since it is a mere indication of data uniqueness and can be spoofed.
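The word-by-word consensus check described above can be sketched in software as follows. This is a minimal illustration only, not the disclosed hardware: the function names `words_agree`, `vote_word`, and `vote_fifos` are hypothetical, output FIFOs are modeled as equal-length lists of integer data words, and the majority word is taken as the consensus for each position.

```python
from collections import Counter
from typing import List, Optional, Sequence, Set, Tuple


def words_agree(a: int, b: int) -> bool:
    """Two data words agree exactly when their exclusive-OR is zero."""
    return (a ^ b) == 0


def vote_word(candidates: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Return (majority word or None, indices of replicas that dissent)."""
    value, count = Counter(candidates).most_common(1)[0]
    if count <= len(candidates) // 2:
        # No strict majority: consensus cannot be obtained for this word.
        return None, list(range(len(candidates)))
    dissenters = [i for i, w in enumerate(candidates) if not words_agree(w, value)]
    return value, dissenters


def vote_fifos(fifos: Sequence[Sequence[int]]) -> Tuple[Optional[List[int]], Set[int]]:
    """Step through all candidate output FIFOs concurrently, one word at a time."""
    if len({len(f) for f in fifos}) != 1:
        raise ValueError("all candidate output FIFOs must hold the same number of words")
    consensus: List[int] = []
    faulty: Set[int] = set()
    for words in zip(*fifos):
        value, dissenters = vote_word(words)
        if value is None:
            return None, set(range(len(fifos)))
        consensus.append(value)
        faulty.update(dissenters)
    return consensus, faulty
```

For example, calling `vote_fifos` on three FIFOs in which only the third replica differs in its second word returns the two majority words together with the index of the third replica as faulty. Comparing full data words, rather than checksums, follows the text's caution that checksums can be spoofed.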
In the case that the voting circuit 125 has detected a fault, it can enable DMA of state memory 110 from a replica that satisfied consensus. After DMA completion, the voting circuit 125 can trigger a next state in the processors 115 by interrupt. DMA latency to correct the state variables of the faulty processor can be masked by allowing non-faulty processors to concurrently proceed to the next state if at least 2F+1 non-faulty processors remain available.
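The recovery step can be modeled in software as a copy of state from a consensus replica into each faulty replica. The following is a rough sketch under stated assumptions: the `Replica` class, `recover_state`, and `can_proceed_during_recovery` are hypothetical names, state memory is modeled as a dictionary, and the dictionary copy stands in for the DMA transfer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Replica:
    name: str
    state_memory: Dict[str, int] = field(default_factory=dict)


def recover_state(replicas: List[Replica], faulty: Set[int]) -> None:
    """Overwrite each faulty replica's state with a copy from a consensus replica."""
    healthy = [i for i in range(len(replicas)) if i not in faulty]
    if len(healthy) <= len(replicas) // 2:
        raise RuntimeError("no consensus majority available to recover from")
    donor = replicas[healthy[0]]
    for i in sorted(faulty):
        # The copy stands in for DMA from the donor's state memory.
        replicas[i].state_memory = dict(donor.state_memory)


def can_proceed_during_recovery(total: int, faulty_count: int, faults_to_tolerate: int) -> bool:
    """True if the non-faulty replicas alone still number at least 2F+1."""
    return total - faulty_count >= 2 * faults_to_tolerate + 1
```

Under this model, a five-replica system with one faulty replica and F = 1 could let the four non-faulty replicas proceed while recovery completes, whereas a three-replica system with one faulty replica would wait for the transfer to finish.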
In the case that a processor 115 is not able to deliver state output or a processor 115 repeatedly fails, the voting circuit 125 can include a hard-wired configuration to reboot the processor 115. When processors 115 fail to reach majority consensus or a majority fail to deliver state output, the voting circuit 125 can include a hard-wired configuration to reboot the system.
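The reboot rules above can be expressed as a small decision function. This sketch is illustrative only: the names `Action` and `reboot_decision` are hypothetical, and the `max_failures` threshold for "repeatedly fails" is an assumed parameter, since the text does not specify how many failures trigger a processor reboot.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class Action(Enum):
    NONE = "none"
    REBOOT_PROCESSOR = "reboot_processor"
    REBOOT_SYSTEM = "reboot_system"


def reboot_decision(
    delivered: Sequence[bool],
    consensus_reached: bool,
    failure_counts: Sequence[int],
    max_failures: int = 3,  # assumed threshold; not specified in the text
) -> Tuple[Action, List[int]]:
    """Return the action to take and the processors it applies to."""
    n = len(delivered)
    if len(failure_counts) != n:
        raise ValueError("one failure count is required per processor")
    missing = sum(1 for d in delivered if not d)
    # System reboot: no majority consensus, or a majority delivered no output.
    if not consensus_reached or missing > n // 2:
        return Action.REBOOT_SYSTEM, []
    # Processor reboot: output not delivered, or repeated failure.
    targets = [
        i for i in range(n)
        if not delivered[i] or failure_counts[i] >= max_failures
    ]
    if targets:
        return Action.REBOOT_PROCESSOR, targets
    return Action.NONE, []
```

In hardware these rules would be hard-wired rather than computed; the function only makes the precedence explicit, with the system-level reboot checked before any per-processor reboot.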
Because an aspect of this solution's strength is in its diversity, it follows that differing processor instruction sets, register sets, and addressing schemes can contribute to the many ways that the same state machine output can be accomplished. This can be ideal for trusted fault-tolerant server operation of a state machine replica. For the fault-tolerant server, it does not matter how each replica arrives at its output, only that the replicas do arrive at output consensus. However, it should not be inferred that processor diversity would also apply to the granularity of atomic operations evaluated at processor I/O in general purpose computing. This technique assures the defined application-specific objective of the hardware/software amalgamation, rather than cycle-accurate operation of untrusted components at an arbitrary level of implementation.
Example Hierarchical Diversity for Trusted Fault Tolerance
An example configuration for PCB integration can implement an SQL database server handling requests from clients for access to an SQL database. This is a simplified example to demonstrate the merit of the conceptual architecture. A diversity of processors may be run on different real-time operating systems:
Three processors are selected for this example to handle at most one fault at any state machine consensus, but the example can be scaled to any 2F+1 arrangement.
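The 2F+1 relationship can be made concrete with a few lines of arithmetic. The helper names below (`replicas_required`, `faults_tolerated`, `majority_size`) are hypothetical and serve only to illustrate the scaling rule stated in the text.

```python
def replicas_required(max_faults: int) -> int:
    """Replicas needed to tolerate max_faults faults under the 2F+1 rule."""
    if max_faults < 0:
        raise ValueError("max_faults must be non-negative")
    return 2 * max_faults + 1


def faults_tolerated(replicas: int) -> int:
    """Largest F such that 2F+1 does not exceed the number of replicas."""
    if replicas < 1:
        raise ValueError("replicas must be positive")
    return (replicas - 1) // 2


def majority_size(replicas: int) -> int:
    """Smallest number of agreeing replicas that forms a strict majority."""
    return replicas // 2 + 1
```

With these definitions, `replicas_required(1)` gives the three processors of this example, and `replicas_required(2)` gives five. Adding a fourth replica to a three-replica system does not raise the number of tolerated faults, which is one reason an odd count is the natural choice.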
Voting Circuit
Diversely implemented nodes of a redundant state-based functional system can submit votes by charging switched capacitors of a voting circuit. Integration of nodes can place these charges in tandem, so that the voltage potential between ground and the last node is the consensus to be routed when a threshold majority is met, e.g., a voltage above or below the logic threshold for Complementary Metal-Oxide-Semiconductor (CMOS) logic. All nodes can sample the consensus output, and if the consensus output differs from a node's state, the node can revise its state based on the consensus output.
The illustrated circuit can be a bitwise analog voting circuit with a totem of switched capacitors. On one phase of a driving clock, C, the capacitors are isolated from each other while parallel CMOS switches transfer the voting charge of each replica's bit to its individual switched capacitor in the stack. On the alternate phase, CMOS switches connect the capacitors in series for evaluation of the aggregate (accumulated) voltage of stacked consensus, VTRUST. Note that the analog circuit can support three or more voting inputs, one per voting replica. An odd number can be used to reduce the chance of a split vote having ambiguous logic output. 2F+1 voting replicas would provide fault-tolerant consensus for F faults. Thus, five replicas would be needed for Byzantine resilience in the case of two possible faults. Each voting input stage can be implemented with a CMOS switch connecting a voltage divider. While the number of voting replicas, N, can vary with the number of coincident faults that the system is to tolerate, the resistive voltage divider at each voting input can be scaled (N−1):1. This ensures that a unanimous vote of logic high at circuit inputs accumulates to no more than the supply voltage, logic high, at output. Thus, the resistor proportions on each voltage divider are directly related to how many voting replicas are to be integrated for consensus voting to tolerate a particular number of faults at once.
The CMOS switch can be considered to be “off” at the voting input stage when a logic low is input. In such a case, no current is drawn from the supply across the voltage divider and there is no voltage drop on the lower resistor, yielding ground voltage at the voting terminal (top of the lower resistor in the voltage divider). The resulting voltage contribution to the consensus stack for VTRUST will be nil on the next phase of the driving clock. The CMOS switch can be considered to be “on” at the voting input stage when a logic high is input; that is, the CMOS switch shorts from transistor source to drain. When that happens, current flows from the power supply through the voltage divider to ground. The contribution to VTRUST on the consensus stack will be (1/N)*VCC, or 1/Nth of logic high. If VTRUST is over a CMOS threshold voltage for logic “1”, then the bitwise consensus can be logic “1”. Otherwise, the consensus can be logic “0” at the digital output of the analog circuit. Thus, the circuit can employ an implicit comparison of the aggregate voltage of consensus to logic “0” or “1” when the output drives CMOS digital logic, and no analog comparator is needed.
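The voltage arithmetic of a single voted bit can be checked with an idealized numerical model. This is a sketch under assumptions, not a circuit simulation: it ignores charge sharing, parasitics, and switch resistance; the function names are hypothetical; the 3.3 V supply is an arbitrary example value; and the logic threshold is assumed to sit at half the supply voltage, which the text does not specify.

```python
from typing import Sequence


def divider_tap_voltage(vcc: float, n_replicas: int) -> float:
    """Voltage at the voting terminal of an (N-1):1 divider when its switch is on."""
    upper = float(n_replicas - 1)  # relative resistance above the tap
    lower = 1.0                    # relative resistance below the tap
    return vcc * lower / (upper + lower)  # equals VCC / N


def v_trust(bits: Sequence[int], vcc: float = 3.3) -> float:
    """Accumulated voltage of the series stack for one bit position."""
    tap = divider_tap_voltage(vcc, len(bits))
    return sum(tap if b else 0.0 for b in bits)


def consensus_bit(bits: Sequence[int], vcc: float = 3.3,
                  threshold_fraction: float = 0.5) -> int:
    """Logic value seen by a CMOS input driven by VTRUST (assumed threshold)."""
    return 1 if v_trust(bits, vcc) > threshold_fraction * vcc else 0
```

In this model a unanimous high vote from three replicas accumulates to the full supply voltage, two high votes accumulate to two thirds of it and read as logic “1”, and a single high vote accumulates to one third and reads as logic “0”.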
Three redundant processors 405a-c are illustrated in
The timing diagram illustrates that the three input values 205a-c are changed to high at time T2. At time T3, when the driving clock C is high, the three voting capacitors 220a-c are shown as being high. This is because the transfer stage of circuit 200 charges the voting capacitors 220a-c corresponding to each input switched capacitor 210a-c during a high state of the clock signal. At time T4, when the driving clock C is low (and “not C” is high), the accumulated charge (VTRUST) 230 is shown as being high. This is because the accumulating stage of circuit 200 causes the charges of the voting capacitors 220a-c to be accumulated during a low state of the clock signal.
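The two clock phases in the timing description can be traced with a simple behavioral model. This is an idealized sketch: the class name `VotingStack` and its `step` method are hypothetical, each high vote is assumed to charge its voting capacitor to VCC/N, the 3.3 V supply is an example value, and VTRUST is assumed to hold its previous value during the transfer phase.

```python
from typing import List, Sequence


class VotingStack:
    """Behavioral model of the transfer and accumulate phases for one bit."""

    def __init__(self, n_replicas: int, vcc: float = 3.3) -> None:
        self.n = n_replicas
        self.vcc = vcc
        self.cap_voltages: List[float] = [0.0] * n_replicas
        self.v_trust = 0.0

    def step(self, clock_high: bool, inputs: Sequence[int]) -> float:
        if len(inputs) != self.n:
            raise ValueError("one input bit is required per replica")
        if clock_high:
            # Transfer phase: each voting capacitor is charged from its input.
            self.cap_voltages = [self.vcc / self.n if b else 0.0 for b in inputs]
        else:
            # Accumulate phase: capacitors are placed in series and summed.
            self.v_trust = sum(self.cap_voltages)
        return self.v_trust


stack = VotingStack(3)
stack.step(clock_high=False, inputs=[0, 0, 0])  # before the inputs change
stack.step(clock_high=True, inputs=[1, 1, 1])   # like T3: capacitors charge
final = stack.step(clock_high=False, inputs=[1, 1, 1])  # like T4: VTRUST goes high
```

The trace mirrors the described sequence: the capacitors go high while the clock is high, and VTRUST goes high only on the following low phase of the clock.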
While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.
This application claims the benefit of U.S. Provisional Application Nos. 62/385,440 and 62/385,435, both filed on Sep. 9, 2016. The entire teachings of the above applications are incorporated herein by reference.
Number | Date | Country
---|---|---
62385440 | Sep 2016 | US
62385435 | Sep 2016 | US