Statement Regarding Federally Sponsored Research or Development
Not applicable.
Reference to a “Sequence Listing”, a Table, or Computer Program
Not applicable.
The invention relates generally to the field of reducing faults in circuitry, specifically in relation to using bio-inspired hardware to predict and respond to anticipated faults in circuitry.
Biomedical circuits are growing in complexity as they are deployed in powerful devices for critical situations and applications. Examples include surgeries, prosthetics, monitoring of vital signs, artificial organs, imaging, therapeutic equipment such as kidney dialysis, and diagnostics such as lab-on-a-chip. Such biomedical circuits are expected to work efficiently and reliably, preferably without failure or downtime. Embryonic Hardware (EmHW) is a promising methodology for designing components and subsystems used in biomedical systems, such as adders, multipliers, adder-accumulators, Fourier-transform spectrometers, and other subsystems for Digital Signal Processing (DSP). Components and subsystems designed using EmHW are ideally expected to be highly efficient and capable of self-healing. In EmHW, a cell is configured with certain properties and functionalities, and the same cellular configuration is extended to implement a set of functions in a process called differentiation. Each cell can perform the same set of functionalities. This allows the system to replace faulty cells and interchange them with healthy cells whenever needed.
An EmHW structure implements a function through an array of active cells and spare cells. In case some active cells fail, spare cells are used to replace the faulty cells in a similar way as in the biological recovery of stem cells to guarantee the desired performance. The general EmHW cellular structure has six components/modules: control module, input/output module, address module, a configuration module, detection module, and function module. As shown in
Although very promising, EmHW systems may experience failure in any hardware component, which may reduce their performance. Hardware failure may occur while a system is running critical, real-time tasks. Such failures may occur due to aging of the hardware, or due to the impact of surrounding conditions such as temperature, humidity, and radiation. The sources causing failure can be internal or external to the system. A fault occurs when an error affects one or more hardware components of the system. An error may also propagate to other components and produce compounded errors. A system failure occurs when an error propagates to the service interfaces and deviates the system function from the intended one. The time delay between fault activation and failure is defined as failure latency. Faults are classified based on persistence, effect, boundary, and source. In terms of persistence, a fault can be permanent, intermittent, or transient. An intermittent fault occurs frequently but not continuously, manifesting as a repetitive crash of the system or device; such errors may be produced by devices or wires. A permanent fault occurs once and then persists, so it manifests as a repeated error. A transient fault occurs only once, for a short duration, at a random time, and does not persist as a permanent fault.
A self-healing mechanism is used for recovering from faults without any human intervention, especially in settings where maintenance is costly, such as biomedical emergencies and aerospace. Self-healing is defined as the ability of a system to recover from faults or failures without external intervention. The terms self-healing and self-repairing are used interchangeably. Repairing and healing refer to the reintegration of recovered/fixed cells or blocks into the system, or to the replacement of faulty cells by active cells. In other words, the system is able to monitor, maintain, and repair its own operation. Self-healing of EmHW increases the reliability of a biomedical system, allowing it to operate at the desired performance over a long lifetime.
Current self-healing methods depend on fault detection to trigger the healing. While useful, their major drawback is that by the time self-healing begins, the system has already experienced a fault, and the fault may cause a missed operation or loss of data. Therefore, the system needs to predict faults and recover from them early to avoid affecting performance.
The fundamental concepts in self-healing are fault, error, and failure. A fault is an abnormal physical condition in a hardware system that produces an error. An error is a manifestation of a fault in a hardware system. Failure is the inability of the system to perform its functions due to inherent errors or disorder in its environment. A failure might happen due to error propagation to the system level. Failure can also manifest as a type of communication failure because of broken wires, loosening connectors, circuit-board faults, failing communication transceivers, communication timing issues, and electromagnetic interference. Hardware faults may affect system performance. EmHW can experience faults such as open-circuit, short-circuit, noise, and delay faults. Self-healing of the embryonic system aims to recover the EmHW from both permanent and transient faults. A permanent fault occurs once and continues for a long time. It can result from stuck-at-one, stuck-at-zero, open-circuit, or short-circuit conditions. Transient faults can be frequent, but they occur for a short time. They arise from causes such as pulse skew, delay, and bit flips.
Faults can affect an EmHW's performance due to their occurrence in, and effect on, an internal module. In the presence of faults, a module may not work as intended. For example, consider the Address Coordinate module in
The existing techniques for self-healing in EmHW are based on cell elimination regardless of the type of fault. The main challenges that these techniques face are area overhead, flexibility, scalability, and mapping the spare cells. Self-healing methods are based on using spare components to repair faulty components. A typical mechanism of existing methods is shown in
There are self-healing methods known in the art. Zhai Zhang et al. previously presented a Fault-Cell Reutilization Self-Healing Strategy (FCRSS) technique which focuses on transient faults through reusing a faulty cell. (Z. Zhang, Q. Yao, Y. Xiaoliang, Y. Rui, C. Yan, and W. Youren, “A self-healing strategy with fault-cell reutilization of bio-inspired hardware,” Chin. J. Aeronautics, vol. 32, no. 7, pp. 1673-1683, 2019). Their method has two stages of self-healing: elimination and reconfiguration. During the elimination stage, the cell, which has a transient fault, is used as a transparent cell to replace the functions of the cells on the right or left side, depending on the design. In the transparent state, the cell is reconfigured to realize re-utilization of the faulty cell. This method is simulated using a 4-bit adder in a cell array of 3×4. The main challenges of the Zhang method are that the time complexity is high, it is not robust, and the area overhead is high.
Boesen et al. have also suggested a self-healing approach for EmHW. (See M. R. Boesen, J. Madsen, and P. Pop, “Application-aware optimization of redundant resources for the reconfigurable self-healing eDNA hardware architecture,” in Proc. IEEE NASA/ESA Conf. Adaptive Hardware Syst., 2011, pp. 66-73). Their method is based on using spare cells for recovering faulty cells. Three techniques are used for distributing spare cells: 0-Faults-Anticipated (OFA), Uniform Distribution (UD), and Minimum spare-cell Distance (MD). In the OFA method, spare cells are added at the edge columns or rows. In the UD method, spare cells are distributed uniformly in the architecture. In the MD method, the distribution ensures that each active cell has a neighboring spare cell within distance d; if d=1, each active cell has a spare cell one cell away. The main challenges of this method are its area overhead and lack of flexibility in complex systems.
Wang Youren et al. present a self-healing method of an embryonic cellular structure array (See W. Youren and Y. Shanshan, “New self-repairing digital circuit based on embryonic cellular array,” in Proc. IEEE 8th Int. Conf. Solid-State Integr. Circuit Technol., 2006, pp. 1997-1999). Their disclosed method consists of a two-dimensional cellular array and the cellular circuit is based on a Look-Up Table (LUT). Spare cells are used for recovery, and these cells are added as one column. In the case of a faulty cell, the spare column is used for recovery. The technique is tested on a multiplier case study. The drawbacks of this method are that it does not work for multiple faults and it has a high area overhead.
The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing.
Disclosed herein is a mechanism for self-healing, fault-prediction, and fault-prediction assisted self-healing of bio-inspired Embryonic Hardware (EmHW). The EmHW system is disclosed and validated for an arithmetic-logic unit. Bio-inspired EmHW is modeled as a cellular structure for a hardware system, and it mimics mechanisms from nature to provide self-repair and self-organization in the same manner as biological cells. Designing biomedical circuits using EmHW is beneficial for supporting fault recovery and for reorganizing the system into an optimal structure as needed.
The fault prediction mechanism is part of a complete technique starting from fault prediction to self-healing without external intervention. A flow-chart showing the disclosed method is provided in
The Applicant believes this disclosure to be the first disclosed method for predicting faults in EmHW. Machine learning is utilized in fault prediction. Machine learning offers different neural-network structures, such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). The machine-learning technique for fault prediction of EmHW consists of the following stages: Fast Fourier Transform (FFT) to obtain the fault frequency signature, Principal Component Analysis (PCA) or Relative Principal Component Analysis (RPCA) to retain the most important data at a lower dimension, and Economic Long Short-Term Memory (ELSTM) to learn and classify faults.
The second stage of the complete system is the self-healing method, which heals the predicted fault. The data from the fault prediction technique is utilized by the self-healing technique to recover from a fault. The self-healing technique gets the fault time and location information from the fault prediction unit and uses this information to recover from the fault. After repairing faults, the process repeats. In the case of no faults, the system applies the fault prediction mechanism after a certain delay Δt, and this time delay is tunable. The self-healing mechanism for EmHW is based on time multiplexing and two-level spare cells.
This method utilizes PCA, RPCA, and ELSTM to provide a fault prediction accuracy of more than 99 percent with a lower execution time. Further, implementing the fault prediction mechanism on an FPGA ensures that the method is practical and scalable, and that its performance is stable and robust.
The following description sets forth exemplary methods, parameters, and the like. It should be recognized, however, that such description is not intended as a limitation on the scope of the present disclosure but is instead provided as a description of exemplary embodiments.
In the following description of the disclosure and embodiments, reference is made to the accompanying drawings in which are shown, by way of illustration, specific embodiments that can be practiced. It is to be understood that other embodiments and examples can be practiced, and changes can be made, without departing from the scope of the disclosure.
In addition, it is also to be understood that the singular forms “a,” “an,” and “the” used in the following description are intended to include the plural forms as well unless the context clearly indicates otherwise. It is also to be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It is further to be understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used herein, specify the presence of stated features, integers, steps, operations, elements, components, and/or units but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, units, and/or groups thereof.
Some portions of the detailed description that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps (instructions) leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic, or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It is convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. Furthermore, it is also convenient at times to refer to certain arrangements of steps requiring physical manipulations of physical quantities as modules or code devices without loss of generality.
However, all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that, throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
Certain aspects of the present invention include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware, or hardware, and, when embodied in software, they could be downloaded to reside on, and be operated from, different platforms used by a variety of operating systems.
The present invention also relates to a device for performing the operations herein. This device may be specially constructed for the required purposes or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, computer-readable storage medium such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application-specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Self-Healing Mechanism. The disclosed self-healing mechanism is designed on a 2-D EmHW structure using two levels of cells. The bottom level contains the normal EmHW structure, while the upper level consists of spare cells, as shown in
In one embodiment, this approach is extended to allow each active cell as a spare cell for its neighbor. Each one has the ability to perform two tasks: its task and the task of the neighbor cell. Time-Division multiplexing is used where each cell has the capacity to perform two tasks within the same clock when a fault happens, as shown in
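The time-division multiplexing described above can be illustrated with a short Python sketch. This is not the disclosed RTL; the cell tasks, fault map, and neighbor direction are hypothetical, chosen only to show a healthy cell executing both its own task and its faulty neighbor's task within the same clock.

```python
# Illustrative sketch (not the disclosed RTL): with time-division multiplexing,
# a healthy cell performs its own task and its faulty left-hand neighbor's task
# within the same clock. Cell tasks and the fault map are hypothetical.

def run_clock(tasks, faulty, inputs):
    """Return per-cell outputs for one clock; cell i also executes the task of
    cell i-1 when that neighbor is marked faulty."""
    outputs = [None] * len(tasks)
    for i, task in enumerate(tasks):
        if i in faulty:
            continue                      # a faulty cell produces nothing itself
        outputs[i] = task(inputs[i])      # the cell's own task
        if i - 1 in faulty:               # cover the faulty left neighbor
            outputs[i - 1] = tasks[i - 1](inputs[i - 1])
    return outputs

tasks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_clock(tasks, faulty={1}, inputs=[10, 10, 10]))  # [11, 20, 7]
```

Here cell 2 absorbs the task of faulty cell 1, so no output is lost even though one cell has failed.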
Fault Prediction Mechanism. Fault prediction is a significant process for fault recovery purposes. The mechanism may take the form of multiple embodiments, ranging from a simple method to more advanced ones, in order to find the most efficient method. The first embodiment comprises FFT and Multilayer Perceptron (MLP) as shown in
Dataset. The various embodiments of the fault prediction mechanism have been tested using data extracted from the EmHW system. The dataset includes the signal variation of the I/O module, address module, configuration module, control module, and function module. The parameters used for fault prediction are voltage, current, noise, delay, and temperature. These parameters are studied on EmHW to characterize the system behavior against aging, open-circuit, and short-circuit faults. Electromigration and stress migration are among the sources of open- and short-circuit faults. Electromigration is caused by intense current-density stress, and it leads to a sudden delay increase or to open or short faults. The electromigration issue happens in the interconnection, and it can be described as the physical displacement of metal ions in the interconnect wires. This displacement results from a large flow of electrons (a large current density) interacting with the metal ions. Voids and hillocks form due to this movement, producing short- or open-circuit connections. As electromigration is accelerated close to the metal grain boundaries, contact holes and vias become susceptible to this effect. Stress migration occurs because of excessive structural stress. This phenomenon is similar to electromigration in that it leads to a sudden delay increase or to short or open faults: the metal atoms migrate in the interconnects due to mechanical stress. Stress migration results from thermo-mechanical stresses originating from the different thermal expansion rates of different materials. The final dataset has 15230 samples, including 550 samples for the non-faulty state. The dataset is generated in-house and used for training and testing the disclosed method.
For testing, the time-series data is divided into segments before applying the operation. Sampling is done at 1 kHz, and each recording is divided into 15 s segments. Thus, each segment consists of 15000 samples.
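The segmentation above can be sketched as follows; the recording itself is synthetic and its 60 s duration is an assumption made only for illustration.

```python
import numpy as np

# Sketch of the segmentation described above: a 1 kHz recording split into
# 15 s windows of 15000 samples each. The 60 s recording is synthetic.
fs = 1000                                  # sampling rate, Hz
seg_len = fs * 15                          # 15000 samples per 15 s segment

signal = np.random.randn(fs * 60)          # hypothetical 60 s recording
n_segments = len(signal) // seg_len
segments = signal[:n_segments * seg_len].reshape(n_segments, seg_len)
print(segments.shape)                      # (4, 15000)
```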
Fast Fourier Transformation Stage. FFT transfers a signal from its original domain (such as time or space) to a representation in the frequency domain, which can help diagnose or pinpoint hardware faults. Raw time-domain data representing hardware faults is not, by itself, an accurate enough input for machine learning; machine-learning methods need richer, more accurate representations of faults so that learning can be efficient. Here, FFT is used for representing fault signals in the frequency domain. The advantage is a more representative frequency-domain signature of each fault: each hardware fault exhibits a unique frequency signature. The FFT is one way of computing the Discrete Fourier Transform (DFT), but faster.
The FFT is performed using advanced algorithms that carry out the same operation as the DFT in much less time. For instance, a DFT computation of N points using the definition directly takes O(N^2) arithmetic operations, while the FFT computes the same result in only O(N log N) operations. In the disclosed fault prediction methods, the FFT output signals and the first b frequencies are used as the feature data for the next step, PCA or RPCA, where b ≪ the number of samples. The PCA and RPCA are used to improve the diagnostic accuracy and the computational efficiency of hardware-fault prediction. Therefore, in this stage, the role of FFT is to obtain the frequency domain of the signal, which feeds the component analysis stage. Consider a discrete signal xi,n, which can be voltage, current, temperature, humidity, etc., where i=1, 2, 3, . . . , m and n=0, 1, 2, 3, . . . , N−1. The FFT of this signal is denoted Xi,k, with i=1, 2, 3, . . . , m and k=0, 1, 2, 3, . . . , b−1, where b is the retained harmonics size and m is the training-sample size. The mathematical equations of the FFT are:
and the transformation equation can be divided into even and odd sections.
Using the substitution WN2=WN/2, and naming the first and second terms H1(k) and H2(k), respectively:
X(k)=H1(k)+WNkH2(k), k=0, 1, . . . , N−1 (4)
Where H1(k) and H2(k) are the N/2-point DFTs of the sequences h1(m) and h2(m), respectively. H1(k) and H2(k) are periodic with period N/2; therefore H1(k+N/2)=H1(k) and H2(k+N/2)=H2(k). In addition, the factor WNk+N/2=−WNk.
Where N is the number of sampling points in an output discrete signal. By these equations, the FFT transform of the input signal will be calculated to represent the signature of the fault in the frequency domain.
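A minimal sketch of this stage, assuming a synthetic test signal and NumPy's FFT as the implementation, extracts the first b harmonics as the frequency signature; b=42 follows the text, while the DC-plus-2-Hz signal is hypothetical.

```python
import numpy as np

# Sketch of the FFT stage using NumPy: transform one 15 s, 1 kHz segment and
# keep the first b harmonics as the fault frequency signature. The test signal
# (DC plus a 2 Hz component) is hypothetical; b = 42 follows the text.
fs, N = 1000, 15000
t = np.arange(N) / fs                      # 15 s of samples
x = 1.0 + 0.5 * np.sin(2 * np.pi * 2 * t)  # DC + 2 Hz component

X = np.fft.rfft(x)                         # one-sided spectrum
b = 42
signature = np.abs(X[:b])                  # first b harmonics as features
print(signature.shape)                     # (42,)
print(int(np.argmax(signature[1:])) + 1)   # dominant non-DC harmonic: 30
```

With a bin spacing of fs/N = 1/15 Hz, the 2 Hz component lands exactly in bin 30, illustrating how a spectral peak becomes a feature in the signature vector.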
Component Analysis Stage. Principal-component analysis is used to reduce the data dimension while retaining the most important data. The benefit of this stage is reducing the training complexity and classification time of the next stage. There are two techniques for this purpose, PCA and RPCA, each of which can be used to reduce the data size of the FFT result. The result from this stage is applied to the fault classification stage, and the classification process is performed with minimum complexity.
Principal Component Analysis. We expand the data using FFT to obtain more fault information and the signature of each fault. Therefore, the role of PCA is to retain only the most important data. The results comprise the important components at a lower dimension. The idea of PCA is to convert a correlated set of sample variables into uncorrelated variables.
PCA uses orthogonal transformation to achieve this reduction. Assume a set of sample vectors x={x1, x2, x3, . . . , xn} and orthogonal normalized basis Ai where i=1, 2, . . . , +∞. The orthogonal basis can be written as
Each sample vector can be given as an infinite superposition of basis vectors where a basis has the same dimension. The sample vector is expressed as:
PCA approximates the original sample by a finite number of basis vectors while reducing the error to a minimum. Thus, the estimated sample vector uses only the first d basis vectors, and it can be calculated via the orthogonal basis by:
The error depends on the difference between the original value and the estimated value. Therefore, subtracting Equation 9 from Equation 8 gives:
From Equation 10, the error can be calculated using the expectation (E) of the difference between the original and estimated values. The error can be calculated in two ways, the first being:
The second method of calculating error can be obtained by using Equation 9, where AiTx=Σm=1∞AiTαmAm=αi and xTAi=Σm=1∞AmTαmAi=αi. Using these equations to substitute in Equation 9 and the result will be:
Using the error value, the basis coefficients are adjusted so that the error becomes as small as possible. The error can be calculated using Equation 11 or Equation 12, where X=E[xxT]. The minimum error value is obtained under the constraint AiTAi=1. The eigenvalue is calculated by taking the partial derivative and setting the result equal to zero. Therefore, the eigenvalue satisfies:
XAi=λiAi (13)
Where λ is the eigenvalue which is used to represent the importance of each component. The minimum error value can be achieved when the basis vector is the eigenvectors of E(xxT). These eigenvectors can be calculated using a scatter matrix S,
The eigenvectors' values are used for representing the components. The first mode or component of the sample vectors is referred by the eigenvector which corresponds to the largest eigenvalue. The second component refers to the eigenvector which corresponds to the second largest eigenvalue, and the sequence of the other components is define in the same definition Consequently, the sample vectors go towards a lower dimension which presents the benefit of using the PCA technique to the next stage of learning.
Relative Principal Component Analysis. RPCA is another method for data-size reduction. The RPCA method is used to extract more effective principal components than PCA due to its uniform distribution. This technique is based on relative weights to avoid obtaining false information. For the purposes of explaining RPCA, assume M is a set generated by a measurable set S with a standard deviation of σ and a mean of μ. M can be presented in the form of the compatible sets with A=Ai, where μ(M)=1. The entropy can be obtained by:
For a feature A and training dataset D, the uncertainty level in classifying the set D is given by the empirical entropy H(D). The uncertainty level in classifying D conditioned on feature A is H(D|A). The difference between H(D) and H(D|A) represents the information gain, i.e., the reduction in classification uncertainty, and is given by:
g(D, A)=H(D)−H(D|A) (16)
For training dataset D, |D| denotes the number of samples. The set D has L classes, each class is given by Cl where l=1, 2, . . . , L, and |Cl| is the number of samples in Cl.
Assume feature A has n values, A={α1, α2, α3, . . . , αn}, partitioning D into n subsets D={D1, D2, D3, . . . , Dn}.
Where Djl is the intersection of Class Cl and subset Dj, the empirical entropy of dataset can be expressed by:
And the conditional entropy is given by:
The information gain of dataset D by feature of A, is called the corresponding relative transformation MA
MA=g(D, A)=H(D)−H(D|A) (23)
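Equations (16) and (23) can be illustrated with a small numeric sketch; the toy labels and feature values below are hypothetical.

```python
import math
from collections import Counter

# Numeric sketch of Equations (16)/(23): empirical entropy H(D) minus the
# conditional entropy H(D|A) gives the information gain of feature A. The toy
# labels and feature values are hypothetical.

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature):
    n = len(labels)
    h_cond = 0.0
    for v in set(feature):                 # partition D by the value of A
        subset = [l for l, f in zip(labels, feature) if f == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(labels) - h_cond        # g(D, A) = H(D) - H(D|A)

labels  = ['fault', 'fault', 'ok', 'ok']
feature = ['high',  'high',  'low', 'low']  # perfectly informative feature
print(info_gain(labels, feature))           # 1.0
```

A feature that perfectly separates the classes yields the maximum gain of one bit, while an uninformative feature would yield a gain of zero.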
The process to obtain MA is repeated for each feature to get the corresponding relative transformation Mi=g(D, i). To obtain the relative principal components, assume X∈Rs×f, where s is the number of samples and f is the number of features. The normalized value is needed and is given by:
When M=I, RPCA is equivalent to PCA. Therefore, M is beneficial for taking the relative importance of variables into account. In order to get the values of the Principal Components (PCs) of XR, the correlation matrix is used, which can be expressed by
ΣxR=E{[XR]T[XR]} (28)
Assume the eigenvalues are ordered λ1R≥λ2R≥λ3R≥ . . . ≥λbR, where λjR denotes an eigenvalue and PjR is the corresponding eigenvector of λjR. The eigenvalues are obtained from the characteristic equation:
|λRI−ΣxR|=0 (29)
and each eigenvector satisfies:
(λjRI−ΣxR)PjR=0 (30)
A new lower dimensional matrix Ta×m can be obtained by:
Ta×m=Xa×bPb×m (31)
Where Pb×m={P1R, P2R, P3R, . . . , PmR}, m is the number of PCs, and PiR={PiR(1), PiR(2), PiR(3), . . . , PiR(b)}. The number of retained relative principal components can be calculated using the Cumulative Percentage of Variance (CPV), which measures the amount of variation captured by the first m latent variables, where the threshold P can be chosen by the user.
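A hedged sketch of the RPCA idea as described follows, with a hypothetical diagonal weight matrix M standing in for the relative transformation (M=I reduces this to plain PCA) and a 95% CPV threshold, which is an assumed value, used to choose the number of retained components.

```python
import numpy as np

# Hedged sketch of the RPCA idea as described: weight each feature by a
# relative-importance matrix M before the usual eigendecomposition; with M = I
# this reduces to plain PCA. The weights, data, and 95% CPV threshold are
# hypothetical.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 5))
x = x - x.mean(axis=0)

M = np.diag([1.0, 0.5, 2.0, 1.0, 0.1])     # assumed relative weights
xr = x @ M                                  # relatively transformed samples

corr = xr.T @ xr / len(xr)                  # correlation-style matrix, Eq. (28)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]           # descending eigenvalues
cpv = np.cumsum(eigvals[order]) / eigvals.sum()
m = int(np.searchsorted(cpv, 0.95)) + 1     # PCs needed to reach 95% CPV
T = xr @ eigvecs[:, order[:m]]              # lower-dimensional matrix
print(T.shape)
```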
Multilayer Perceptron (MLP). The MLP is commonly used in artificial neural networks. An MLP network includes multiple layers, divided into three types. The first layer is called the input layer and the last layer is called the output layer. The layers between the input and output layers are called hidden layers. Each layer consists of multiple nodes, and each node connects to all nodes of the next layer. These connections transmit signals from one node to another, as shown in
Where Xj is the output of the jth node in the prior layer, n is the number of nodes, Wji is the weight from the jth node to the ith node in the following layer, f is the activation function, and b is the bias.
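The node computation just described can be sketched as a single layer; the sizes, zero weights, and sigmoid activation below are illustrative assumptions, not the disclosed network.

```python
import numpy as np

# Sketch of the MLP node computation above: a node's output is the activation
# f applied to the weighted sum of the prior layer's outputs plus a bias. The
# sigmoid activation and the sizes below are illustrative assumptions.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b, f=sigmoid):
    """x: prior-layer outputs X_j (n,); W: weights W_ji (n, n_out); b: biases."""
    return f(x @ W + b)

x = np.array([0.5, -1.0, 2.0])             # outputs of the prior layer
W = np.zeros((3, 2))                       # zero weights, for a checkable result
b = np.array([0.0, 1.0])
print(layer(x, W, b))                      # sigmoid of the biases only
```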
Economic LSTM Recurrent Neural Network. The ELSTM method hardware structure is shown in
f(t)=σ(Wf·If+bf) (34)
f(t)=σ([Wcf, Wxf, Uhf]·[x(t), c(t−1), h(t−1)]+bf) (35)
u(t)=tanh(Wu·Iu+bu) (36)
u(t)=tanh([Wcu, Wxu, Uuu]·[x(t), c(t−1), h(t−1)]+bu) (37)
C(t)=f(t) ⊙C(t−1)+(1−f(t)) ⊙U(t) (38)
h(t)=f(t) ⊙tanh (C(t)) (39)
Where If is the input of the first phase and f(t)∈Rd×h×n, where d is the width, h is the height, and n is the number of channels of f(t). x(t) is the input, x(t)∈Rd×h×r, which may represent a given problem domain such as faults, speech, or images, and r is the number of input channels. The output of the block at time (t−1) is h(t−1), and the stack memory state c(t−1) represents the internal state at time (t−1). In the same manner as f(t), h(t−1) and c(t−1)∈Rd×h×n. The weights Wxf, Wcf, and Uhf are the convolutional weights, and all kernels have a dimension size of (m×m). bf is the bias, a vector of dimension n×1. Furthermore, Iu is the input of the second ELSTM stage, and u(t) is the output of the update gate, where u(t)∈Rd×h×n with the same dimension as f(t). bu∈Rn×1 has the same dimension as bf∈Rn×1. The weights Wcu, Wxu, and Uuu are used for the update-output computation. The final memory state is C(t), the final output is h(t), and the ⊙ symbol denotes elementwise multiplication. ELSTM performs learning over a long-term history, which is beneficial for fault prediction. ELSTM also has economical hardware components, which reduce computation time and power consumption.
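A hedged numeric sketch of Equations (34)-(39) follows, with dense matrices standing in for the convolutional weights; all sizes and values are hypothetical, and zero weights are used only so the result is easy to check by hand.

```python
import numpy as np

# Hedged numeric sketch of Equations (34)-(39): a coupled-gate cell in which
# the forget gate f(t) also drives the update via (1 - f(t)) and gates the
# output. Dense matrices stand in for the convolutional weights; all sizes and
# values are hypothetical.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elstm_step(x, h_prev, c_prev, Wf, bf, Wu, bu):
    i = np.concatenate([x, c_prev, h_prev])   # [x(t), c(t-1), h(t-1)]
    f = sigmoid(Wf @ i + bf)                  # forget gate, Eqs. (34)/(35)
    u = np.tanh(Wu @ i + bu)                  # update, Eqs. (36)/(37)
    c = f * c_prev + (1.0 - f) * u            # memory state, Eq. (38)
    h = f * np.tanh(c)                        # output, Eq. (39)
    return h, c

n, r = 2, 2                                   # state size and input size
x = np.ones(r)
h0 = c0 = np.zeros(n)
Wf = np.zeros((n, r + 2 * n))
Wu = np.zeros((n, r + 2 * n))
h1, c1 = elstm_step(x, h0, c0, Wf, np.zeros(n), Wu, np.zeros(n))
print(h1, c1)                                 # all zeros with zero weights
```

Note the single gate f(t) controls both forgetting and updating, which is the "economic" aspect: one gate fewer to compute than in a conventional LSTM.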
Of the discussed embodiments, the latter two embodiments are the most efficient. The trade-offs between the two are training time and classification accuracy. These embodiments are now discussed in greater detail.
Implementation. The self-healing mechanism is implemented on an FPGA. An Arithmetic Logic Unit (ALU) has been implemented on EmHW to study the behavior of the disclosed method. ALU operations are used in many applications such as biomedical systems, aircraft systems, and signal processing. The EmHW is implemented using 64 cells for performing ALU operations, and the disclosed method is applied. The disclosed method is implemented on an Altera Arria 10 GX FPGA. The disclosed method has the ability to recover 125% faulty cells, including spare cells. The area overhead is 34%, while the fault recovery is high. Thus, the disclosed method provides greater lifetime extension of the EmHW. The resource consumption of the disclosed method on the FPGA is shown in
Reliability is one of the significant evaluation parameters for a self-healing technique; it is the ability of the system to execute a function correctly within a certain time duration. The probability of success for the system can be given by p(t)=exp(−λt), where all units (cells) are identical in structure, p(t) is hypothesized to follow an exponential failure distribution, and λ is the failure rate. Spare cells are used, and each cell can also perform two functions in the same clock period for recovering neighboring faulty cells. The system reliability is evaluated by the following equation:
Where n is the number of active units for m functions. The traditional method is based on isolating the faulty component and keeping the circuit working, typically with lower performance. For example, in a system with 16 cells, if the system has two faulty cells, the two faulty cells are isolated from the rest of the cells. Thus, the system works with only 14 healthy cells, and the performance of the system is degraded. The reliability performance of the traditional and disclosed methods under different failure rates is studied, and a comparison is presented as shown in
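The exponential reliability model p(t)=exp(−λt) can be evaluated numerically; for an exponential distribution, the mean time to failure is 1/λ. The failure rate and time horizon below are assumed values used only for illustration.

```python
import math

# Numeric sketch of the exponential reliability model above: each identical
# cell survives to time t with probability p(t) = exp(-lambda * t), and for
# this distribution the mean time to failure is 1/lambda. The failure rate and
# horizon below are assumed values.
lam = 1e-5                       # failures per hour (assumption)
t = 10_000.0                     # hours

p = math.exp(-lam * t)           # single-cell survival probability
mttf = 1.0 / lam                 # MTTF of the exponential model
print(round(p, 4), round(mttf))  # 0.9048 100000
```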
MTTF = ∫0∞ R(t) dt (41)
The analysis of the self-healing mechanism in terms of MTTF for the traditional and disclosed method is shown in
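Equation (41) can be evaluated numerically when R(t) has no convenient closed form. The sketch below (illustrative only) approximates the MTTF integral by the trapezoidal rule and checks it against the known closed form 1/λ for a single exponential unit; the failure rate and truncation horizon are hypothetical.

```python
import math

def mttf_numeric(reliability, t_max: float, dt: float = 0.1) -> float:
    """Approximate MTTF = integral of R(t) dt from 0 to infinity (Eq. 41)
    by the trapezoidal rule, truncating the upper limit at t_max."""
    n = int(t_max / dt)
    total = 0.0
    for i in range(n):
        t0, t1 = i * dt, (i + 1) * dt
        total += 0.5 * (reliability(t0) + reliability(t1)) * dt
    return total

lam = 0.01  # hypothetical failure rate
approx = mttf_numeric(lambda t: math.exp(-lam * t), t_max=2000.0)
print(f"numeric MTTF = {approx:.2f}, closed form 1/lambda = {1.0 / lam:.2f}")
```

For R(t)=exp(−λt) the truncation error at t_max=2000 is exp(−20), which is negligible, so the numeric result matches 1/λ = 100 closely.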
A comparison between the disclosed self-healing mechanism and the prior methods is shown in
The disclosed fault prediction methods have been implemented for EmHW. Training and testing are carried out with 80% of the data used as the training set and 20% used as the validation set. The FFT is used to extract features and represent the data in the frequency domain. The signal is converted by the FFT into harmonics 0-49 using a sampling rate of 1000, and each recording is divided into 15 s segments. Harmonic 0 represents the DC component of the signal. The disclosed method is tested 100 times, and the first 42 harmonics are found to be sufficient for fault diagnosis. For example, the FFT of the voltage signal in normal mode without any fault is shown in
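The harmonic extraction step can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it uses a naive DFT (an FFT library would be used in practice), and the test signal, a DC offset plus a 2 Hz tone sampled at 1000 for 15 s, is hypothetical. With a 15 s segment, harmonic k corresponds to frequency k/15, so the 2 Hz tone lands at harmonic 30 and the DC offset at harmonic 0.

```python
import cmath
import math

def harmonics(signal, n_harmonics=42):
    """Magnitudes of DFT harmonics 0..n_harmonics-1 of one signal segment.
    Harmonic 0 is the DC component. Naive DFT, shown for clarity only."""
    n = len(signal)
    mags = []
    for k in range(n_harmonics):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) / n)  # normalize so DC magnitude equals the offset
    return mags

# Hypothetical 15 s segment at a sampling rate of 1000:
# DC offset of 0.5 plus a unit-amplitude 2 Hz tone
fs, seconds = 1000, 15
seg = [0.5 + math.sin(2 * math.pi * 2 * t / fs) for t in range(fs * seconds)]
mags = harmonics(seg)
```

The resulting 42-element magnitude vector is the kind of frequency-domain feature vector fed to the classifier stages described above.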
Sensitivity refers to the ratio between the number of correctly identified classes (true positives, TP) and the total sum of TP and false negatives (FN). Sensitivity can be expressed as:
Specificity measures the proportion of actual negatives that are correctly identified.
Precision is the ratio between the number of correctly identified classes (TP) and the sum of the correctly and incorrectly identified classes (TP+FP).
There is a tension between sensitivity and precision, and the two should be balanced: increasing precision tends to decrease sensitivity. Sensitivity improves as FN decreases, which tends to increase FP and thereby reduces precision.
Accuracy of the test reflects its ability to differentiate the classes correctly.
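The four metrics defined above can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical counts chosen only to illustrate the formulas.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Sensitivity, specificity, precision, and accuracy from
    confusion-matrix counts, per the definitions in the text."""
    sensitivity = tp / (tp + fn)               # TP / (TP + FN)
    specificity = tn / (tn + fp)               # TN / (TN + FP)
    precision = tp / (tp + fp)                 # TP / (TP + FP)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, precision, accuracy

# Hypothetical counts for one fault class
sens, spec, prec, acc = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"precision={prec:.3f} accuracy={acc:.3f}")
```

Note the tension discussed above: lowering FN raises sensitivity, but if the extra positive calls are wrong (higher FP), precision falls.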
The results show that the MLP-based method has the worst performance in terms of sensitivity, specificity, precision, accuracy, training time, and number of parameters. The low performance of this method is due to updating the network parameters without feature extraction for the input. In this method, the accuracy is 83.12% and the training time is 6.8 minutes with a very large number of parameters, 8,246,130, as shown in
The disclosed methods have been implemented in VHDL on an Altera Arria 10 GX FPGA (10AX115N2F45E1SG) using an operating frequency of 120 MHz. The hardware resource consumption of Lookup Tables (LUTs), DSPs, buffers, block RAM, Flip-Flops (FFs), etc., for each block is studied. The hardware resource consumption for implementing the FFT stage is presented in
The methods, devices, and systems described herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present invention, as described herein.
Although the description herein uses terms first, second, etc., to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another.
The terminology used in the description of the various described embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various described embodiments and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the techniques and their practical applications. Others skilled in the art are thereby enabled to best utilize the techniques and various embodiments with various modifications as are suited to the particular use contemplated.
Although the disclosure and examples have been fully described with reference to the accompanying figures, it is to be noted that various changes and modifications will become apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of the disclosure and examples as defined by the claims.
This application discloses several numerical ranges in the text and figures. The numerical ranges disclosed inherently support any range or value within the disclosed numerical ranges, including the endpoints, even though a precise range limitation is not stated verbatim in the specification, because this disclosure can be practiced throughout the disclosed numerical ranges.
The above description is presented to enable a person skilled in the art to make and use the disclosure, and it is provided in the context of a particular application and its requirements. Various modifications to the preferred embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosure. Thus, this disclosure is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein. Finally, the entire disclosure of the patents and publications referred to in this application is hereby incorporated herein by reference.
This application claims priority to U.S. Provisional Application No. 63/137,222 titled “Hardware Fault Prediction and Self-Healing in Embryonic Hardware System” filed on Jan. 14, 2021.
Number | Date | Country
---|---|---
63137222 | Jan 2021 | US