OPTIMAL METASTABILITY-CONTAINING SORTING VIA PARALLEL PREFIX COMPUTATION

Information

  • Patent Application
  • Publication Number: 20210349687
  • Date Filed: October 31, 2019
  • Date Published: November 11, 2021
Abstract
In order to provide smaller, faster and less error-prone circuits for sorting possibly metastable inputs, a novel sorting circuit is provided. According to the invention, the circuit is metastability-containing.
Description

The present invention relates to metastability-containing circuits for sorting an arbitrary number of inputs.


INTRODUCTION

Metastability is a fundamental obstacle when crossing clock domains, potentially resulting in soft errors with critical consequences. As it has been shown that metastability cannot be avoided deterministically, synchronizers are employed to reduce the error probability to tolerable levels. This approach trades precious time for reliability: the more time is allocated for metastability resolution, the smaller the probability of metastability-induced faults.


Recently, a different approach has been proposed, coined metastability-containing (MC) circuits (S. Friedrichs, M. Függer and C. Lenzen, “Metastability-Containing Circuits,” in IEEE Transactions on Computers, vol. 67, no. 8, pp. 1167-1183, 1 Aug. 2018). It accepts a limited amount of metastability in the input to a digital circuit and ensures limited metastability of its output, so that the result is still useful. In particular, metastability can be contained when sorting inputs arising from time-to-digital converters, i.e., measurement values can be correctly sorted without first resolving metastability using synchronizers.


RELATED WORK

Sorting Networks: Sorting networks sort n inputs from a totally ordered universe by feeding them into n parallel wires that are connected by 2-sort elements, i.e., subcircuits sorting two inputs; these can act in parallel whenever they do not depend on each other's output. A correct sorting network sorts all possible inputs, i.e., the wires are labeled 1 to n such that the ith wire outputs the ith element of the sorted list of inputs. The size of a sorting network is its number of 2-sort elements and its depth is the maximum number of 2-sort elements an input may pass through until reaching the output.
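To make the notion concrete, the following Python sketch (our illustration, not part of the patent) applies a fixed list of 2-sort element positions to an input vector; the five-comparator, depth-3 network used in the example is the standard sorting network for n=4.

# Illustrative sketch (not from the patent): a sorting network is a fixed sequence
# of 2-sort element positions applied to n parallel wires.
def apply_network(values, comparators):
    v = list(values)
    for i, j in comparators:                      # 2-sort element on wires i < j
        v[i], v[j] = min(v[i], v[j]), max(v[i], v[j])
    return v

NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]   # size 5, depth 3
assert apply_network([3, 1, 4, 1], NET4) == [1, 1, 3, 4]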


Parallel Prefix Computation: Ladner and Fischer (R. E. Ladner, M. J. Fischer, “Parallel prefix computation”, JACM, vol. 27, no. 4, pp. 831-838, 1980) studied the parallel application of an associative operator to all prefixes of an input string of length l (over an arbitrary alphabet). They give parallel prefix computation (PPC) circuits of depth O(log l) and size O(l) (given a constant-size circuit implementing the operator). A number of additional constructions have been developed for adders, and special cases of the construction by Ladner and Fischer were discovered (in all likelihood) independently, cf. [24]. However, no other construction simultaneously achieves asymptotically optimal depth and size.


It is an object of the invention to provide smaller, faster and less error-prone circuits for sorting possibly metastable inputs.


This object is achieved by a circuit according to independent claim 1. Advantageous embodiments are defined in the dependent claims.


According to an aspect of the invention, CMOS implementations of basic gates realize Kleene logic. The task of comparing inputs can be decomposed into performing a four-valued comparison on each prefix pair of two input strings, followed by inferring the corresponding output bits. Plugging the resulting 2-sort(B) circuits for B-bit inputs into a sorting network for n values readily yields an MC sorting circuit for n valid strings.


The above reduces the task of MC sorting to a parallel prefix computation (PPC) problem, for which circuits that are simultaneously (asymptotically) optimal in depth and size are known due to a celebrated result by Ladner and Fischer (Richard E. Ladner and Michael J. Fischer. Parallel Prefix Computation. JACM, 27(4):831-838, 1980). According to an aspect of the invention, the inventive circuits can be derived using their framework, which allows for a trade-off between depth and size of the 2-sort circuit. Most prominently, optimizing for depth reduces the depth of the circuit to the optimal ⌈log B⌉, at the expense of increasing the size by a factor of up to 2. However, relying on the construction from Ladner et al. as-is results in a very large fan-out. In a further aspect, the invention proposes reducing fan-out to any number f≥3 without affecting depth, increasing the size by a factor of only 1+O(1/f) (plus at most 3B/2 buffers). In particular, our results imply that the depth of an MC sorting circuit can match the delay of a non-containing circuit, while maintaining constant fan-out and a constant-factor size overhead.


Post-layout area and delay of the designed circuits compare favorably with a baseline provided by a straightforward non-containing implementation.






FIG. 1 shows standard transistor-level implementations of inverter (left), NAND (center), and NOR (right) gates in CMOS technology. The latter can be turned into AND and OR, respectively, by appending an inverter.



FIG. 2 shows a finite state machine determining which of two Gray code inputs g, h∈𝔹^B is larger. In each step, the machine receives gihi as input. State encoding is given in square brackets.



FIG. 3 shows an example for a computation of the 2-sort(9) circuit arising from the inventive construction for fan-out f=3. The inputs are g=101010110 and h=101M10000; see Table 10 for sM(i)(g, h) and the output. We labeled each ⋄M by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Gray boxes indicate recursive steps of the PPC construction; see also FIG. 7 for a larger PPC circuit using the one here in its “right” top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted.



FIG. 4 shows the recursion tree T4 (center). Right nodes are depicted black, left nodes gray, and leaves are depicted white. The recursive patterns applied at left and right nodes are shown on the left and right, respectively. At the root and its left child, we have that B′=B/2; for other nodes, B′ gets halved for each step further down the tree (where the leaves simply wire their single input to their single output). The left pattern comes in different variants. The basic construction does not incorporate the gray buffers; these will be needed later to reduce fan-out. The gray wire with index B′+1 is present only if B′ is odd; this never occurs in PPC(C,Tb), but becomes relevant when initially applying the left pattern exclusively for k∈ℕ steps (see below), reducing the size of the resulting circuit at the expense of increasing its depth by k.



FIG. 5 shows a comparison of the balanced recursion from Ladner and Fischer and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas “upper bound” refers to the bound given below; the fan-out 3 curves show that the unbalanced strategy performs better also for the construction (for f=3 and k=0) we derive next.



FIG. 6 shows construction of PPC(C,T4)′. On the left, we see the recursion tree, with the aggregation trees separated and shown at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also necessitate duplicating some gates in the aggregation trees.



FIG. 7 shows PPC(3)(C,T4). Right recursion steps Rr are marked with dark gray, left recursion steps with light gray. The steps at the root (above) and aggregation trees (below) are not marked explicitly. Duplicated gates are depicted in a layered fashion. Dashed lines indicate that a wire is not participating in a recursive step.



FIG. 8 shows a dependence of the size of the modified construction on f. For comparison, the upper bound from Corollary 5.7 on the circuit with unbounded fan-out is shown as well.



FIG. 9 shows an XMUX circuit according to an embodiment of the invention, used to implement ⋄M and outM.



FIG. 10 shows constructing 2-sort(B) from outM and PPCM(B−1).



FIG. 11 shows an excerpt from a simulation for 4-bit inputs, where X=M. The rows show (from top to bottom) the inputs g and h, both outputs of the simple non-containing circuit, and both outputs of our design. Inputs g and h are randomly generated valid strings. Columns 1 and 3 show that the simpler design fails to implement a 2-sort(4) circuit.



FIG. 12 shows a comparison of the inventive solution PPC Sort to a standard non-containing one. For the latter, the unexpected delay reduction at B=16 is the result of automatic optimization with more powerful gates, which the inventive solution does not use.





DETAILED DESCRIPTION

We set [N]:={0, . . . , N−1} for N∈ℕ and [i, j]:={i, i+1, . . . , j} for i, j∈ℕ, i≤j. We denote 𝔹:={0,1} and 𝔹M:={0,1,M}. For a B-bit string g∈𝔹M^B and i∈[1,B], denote by gi its i-th bit, i.e., g=g1g2 . . . gB. We use the shorthand gi,j:=gi . . . gj, where i, j∈[1,B] and i≤j. Let par(g) denote the parity of g∈𝔹^B, i.e., par(g)=Σ_{i=1}^{B} gi mod 2. For a function f and a set A, we abbreviate f(A):={f(y)|y∈A}.


A standard binary representation of inputs is unsuitable: uncertainty of the input values may be arbitrarily amplified by the encoding. E.g., a value unknown to be 11 or 12, which are encoded as 1011 resp. 1100, would result in the bit string 1MMM, i.e., a string that is metastable in every position in which the two encodings differ. However, 1MMM may represent any number in the interval from 8 to 15, amplifying the initial uncertainty of being in the interval from 11 to 12. An encoding that does not lose precision for consecutive values is Gray code.


A B-bit binary reflected Gray code, rgB:[N]→𝔹^B, is defined recursively. For simplicity (and without loss of generality) we set N:=2^B. A 1-bit code is given by rg1(0)=0 and rg1(1)=1. For B>1, we start with the first bit fixed to 0 and counting with rgB−1(⋅) (for the first 2^{B−1} codewords), then toggle the first bit to 1, and finally “count down” rgB−1(⋅) while fixing the first bit again, cf. Table 1. Formally, this yields for x∈[N]












rgB(x) := 0 rgB−1(x) if x∈[2^{B−1}], and rgB(x) := 1 rgB−1(2^B−1−x) if x∈[2^B]\[2^{B−1}].







As each B-bit string is a codeword, the code is a bijection and the encoding function also defines the decoding function. Denote by ⟨⋅⟩:𝔹^B→[N] the decoding function of a Gray code string, i.e., for x∈[N], ⟨rgB(x)⟩=x.
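As a minimal illustration (ours, not the patent's), the recursive definition and the decoding function translate directly into Python:

# Binary reflected Gray code: encoding rg_B and its inverse, following the
# recursive definition above.
def rg(B: int, x: int) -> str:
    """Return the B-bit binary reflected Gray code word for x in [0, 2**B)."""
    if B == 1:
        return str(x)
    if x < 1 << (B - 1):
        return "0" + rg(B - 1, x)
    return "1" + rg(B - 1, (1 << B) - 1 - x)

def decode(g: str) -> int:
    """Recover x from its Gray code word (the decoding map denoted by angle brackets above)."""
    B = len(g)
    if B == 1:
        return int(g)
    rest = decode(g[1:])
    return rest if g[0] == "0" else (1 << B) - 1 - rest

assert rg(4, 7) == "0100" and decode("0100") == 7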


For two binary reflected Gray code strings g, h∈𝔹^B, we define their maximum and minimum as







(maxrg{g, h}, minrg{g, h}) := (g, h) if ⟨g⟩≥⟨h⟩, and (h, g) if ⟨g⟩<⟨h⟩.











For example:





maxrg{0011,0100}=maxrg{rg4(2),rg4(7)}=0100,





minrg{0111,0101}=minrg{rg4(5),rg4(6)}=0111.


Inputs to the sorting circuit may have some metastable bits, which means that the respective signals behave out-of-spec from the perspective of Boolean logic. However, they are valid strings in the sense of the invention. Valid strings have at most one metastable bit. If this bit resolves to either 0 or 1, the resulting string encodes either x or x+1 for some x, cf. Table 2.


More formally, if B∈ℕ and N=2^B, the set of valid strings of length B is defined as








SrgB := rgB([N]) ∪ ⋃_{x∈[N−1]} {rgB(x)*rgB(x+1)}.







The operator * is called the superposition and is defined as









∀i∈{1, . . . , B}: (x*y)i := xi if xi=yi, and M else.









The specification of maxrg and minrg may be extended to valid strings in the above sense by taking all possible resolutions of metastable bits into account. More particularly, in order to extend the specification of maxrg and minrg to valid strings, the metastable closure (Stephan Friedrichs, Matthias Függer, and Christoph Lenzen. Metastability-Containing Circuits. Transactions on Computers, 67, 2018) is used. The metastable closure of an operator on binary inputs extends it to inputs that may contain metastable bits, by considering all possible stable resolutions of the inputs, applying the operator and taking the superposition of the results.
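The following Python sketch (an illustration, not the patent's implementation) makes resolutions, superposition, and the metastable closure concrete for strings over {0, 1, M}:

from itertools import product

def resolutions(s: str):
    # All stable strings obtained by resolving each M to 0 or 1.
    choices = [("0", "1") if c == "M" else (c,) for c in s]
    return ["".join(p) for p in product(*choices)]

def superposition(strings):
    # Bitwise superposition: keep a bit where all strings agree, else M.
    return "".join(b[0] if len(set(b)) == 1 else "M" for b in zip(*strings))

def closure(op):
    # Metastable closure: apply op to all resolutions and superpose the results.
    def op_M(*args):
        outs = [op(*r) for r in product(*(resolutions(a) for a in args))]
        return superposition(outs)
    return op_M

# Example consistent with Table 3: AND_M(M, 0) = 0, but AND_M(M, 1) = M.
AND_M = closure(lambda a, b: str(int(a) & int(b)))
assert AND_M("M", "0") == "0" and AND_M("M", "1") == "M"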


The closure is the best one can achieve w.r.t. containing metastability with clocked logic using standard registers, i.e., when fM(x)i=M, no such implementation can guarantee that the ith output stabilizes in a timely fashion.


If one wants to construct a circuit computing the maximum and minimum of two valid strings, allowing one to build sorting networks for valid strings, one also needs to define what it means to ask for the maximum or minimum of valid strings. To this end, suppose a valid string is rgB(x)*rgB(x+1) for some x∈[N−1], i.e., the string contains a metastable bit that makes it uncertain whether the represented value is x or x+1. If one waits for metastability to resolve, the string will stabilize to either rgB(x) or rgB(x+1). Accordingly, it makes sense to consider rgB(x)*rgB(x+1) “in between” rgB(x) and rgB(x+1), resulting in the following total order on valid strings (cf. Table 2).


Definition (<). A total order < is defined on valid strings as follows. For g, h∈𝔹^B, g<h ⇔ ⟨g⟩<⟨h⟩. For each x∈[N−1], we define rgB(x)<rgB(x)*rgB(x+1)<rgB(x+1). We extend the resulting relation on SrgB×SrgB to a total order by taking the transitive closure. Note that this also defines ≤, via g≤h ⇔ (g=h∨g<h).


We intend to sort with respect to this order. It turns out that implementing a 2-sort circuit w.r.t. this order amounts to implementing the metastable closure of maxrg and minrg. In other words, maxMrg and minMrg are the max and min operators w.r.t. the total order on valid strings shown in Table 2, e.g.,





maxMrg{1001,1000}=rg4(15)=1000,





maxMrg{0M10,0010}=rg4(3)*rg4(4)=0M10, and





maxMrg{0M10,0110}=rg4(4)=0110.


Hence, our task is to implement maxMrg and minMrg.
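A brute-force reference for maxMrg and minMrg follows directly from the closure; the sketch below (ours, reusing rg, decode, resolutions and superposition from the earlier illustrations) reproduces the examples above:

def max_min_Mrg(g: str, h: str):
    # Closure of (maxrg, minrg): resolve, compare decoded values, superpose.
    maxs, mins = [], []
    for gg in resolutions(g):
        for hh in resolutions(h):
            if decode(gg) >= decode(hh):
                maxs.append(gg); mins.append(hh)
            else:
                maxs.append(hh); mins.append(gg)
    return superposition(maxs), superposition(mins)

assert max_min_Mrg("1001", "1000")[0] == "1000"
assert max_min_Mrg("0M10", "0010")[0] == "0M10"
assert max_min_Mrg("0M10", "0110") == ("0110", "0M10")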


Definition (2-sort(B)). For B∈ℕ, a 2-sort(B) circuit is specified as follows.


Input: g, h∈SrgB


Output: g′, h′∈SrgB


Functionality: g′=maxMrg{g, h}, h′=minMrg{g, h}.



FIG. 1 shows standard transistor-level implementations of inverter (left), NAND (center), and NOR (right) gates in CMOS technology. The latter can be turned into AND and OR, respectively, by appending an inverter.


The invention seeks to use standard components and combinational logic only. In particular, the behavior of basic gates on metastable inputs may be specified via the metastable closure of their behavior on binary inputs, cf. Table 3, using the standard notational convention that a+b=ORM(a, b) and ab=ANDM(a, b). In this logic, most familiar identities hold: AND and OR are associative, commutative, and distributive, and De Morgan's laws hold. However, naturally the law of the excluded middle becomes void. For instance, in general, OR(x, NOT(x))≠1, as OR(M, M)=M.
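This three-valued gate behavior is easy to capture in software; the sketch below (ours, consistent with the closure semantics of Table 3 described above) is reused in later illustrations:

def not_m(a):
    return {"0": "1", "1": "0", "M": "M"}[a]

def and_m(a, b):
    # Kleene AND: a 0 on either input forces the output, otherwise M dominates.
    if a == "0" or b == "0":
        return "0"
    return "1" if a == "1" and b == "1" else "M"

def or_m(a, b):
    return not_m(and_m(not_m(a), not_m(b)))       # De Morgan holds in Kleene logic

assert or_m("M", "M") == "M"                      # law of the excluded middle fails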


It can be shown that the basic CMOS gates shown in FIG. 1 behave according to this logic, i.e. that they implement the truth tables given in Table 3, thereby justifying the model.



FIG. 2 shows a finite state machine determining which of two Gray code inputs g, h∈𝔹^B is larger. In each step, the machine receives gihi as input. State encoding is given in square brackets.


More particularly, FIG. 2 depicts a finite state machine performing a four-valued comparison of two Gray code strings. In each step of processing inputs g, h∈𝔹^B, it is fed the pair of ith input bits gihi. In the following, we denote by s(i)(g, h) the state of the machine after i steps, where s(0)(g,h):=00 is the starting state. For ease of notation, we will omit the arguments g and h of s(i) whenever they are clear from context.


Because the parity keeps track of whether the remaining bits are compared w.r.t. the standard or “reflected” order, the state machine performs the comparison correctly w.r.t. the meaning of the states indicated in FIG. 2.


For all i∈[1,B], we have that

maxrg{g, h}i minrg{g, h}i = out(⋄_{j=1}^{i−1} gjhj, gihi).






In order to extend this approach to potentially metastable inputs, all involved operators are replaced by their metastable closure: for i∈[1, B], (i) compute sM(i), (ii) determine maxMrg{g, h}i and minMrg{g, h}i according to Table 4, and finally (iii) exploit associativity of the operator computing the state sM(i) in the PPC framework. Thus, we only need to implement ⋄M and outM (both of constant size), plug them into the framework, and immediately obtain an efficient circuit.


The reader may raise the question why we compute sM(i) for all i∈[0,B−1] instead of computing only sM(B) with a simple tree of ⋄M elements, which would yield a smaller circuit. Since sM(B) is the result of the comparison of the entire strings, it could be used to compute all outputs, i.e., we could compute the output by outM(sM(B), gihi) instead of outM(sM(i−1), gihi). However, in case of metastability, this may lead to incorrect results: e.g., for g=0M1 and h=001, we have that sM(3)=00*01=0M and outM(0M, g2h2)=MM, yet minMrg{g, h}2=0 (see Tables 6 and 7).


While it is not obvious that this approach yields correct outputs, it may be formally proven that: (P1) ⋄M is associative. (P2) repeated application of ⋄M computes sM(i). (P3) applying outM to sM(i−1) and gihi results for all valid strings in maxMrg{g, h}i minMrg{g, h}i. This yields the desired correctness. Regarding the first point, we note the statement that ⋄M is associative does not depend on B. In other words, it can be verified by checking for all possible x, y, z∈𝔹M^2 whether (x⋄My)⋄Mz=x⋄M(y⋄Mz).


While it is tractable to manually verify all 3^6=729 cases (exploiting various symmetries and other properties of the operator), it is tedious and prone to errors. Instead, it was verified that both evaluation orders result in the same outcome by a short computer program, proving the desired associativity of the operator.
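Such a check is easily scripted; the sketch below (an assumption about how one might do it, since Table 6 is not reproduced here) takes the 81-entry truth table of ⋄M as a Python dictionary keyed by pairs over 𝔹M^2 and tests both evaluation orders on all 729 triples:

def check_associative(op_table):
    # op_table maps (x, y) -> x ⋄M y for all 81 pairs of 2-bit values over {0,1,M}.
    domain = sorted({x for x, _ in op_table})
    return all(
        op_table[(op_table[(x, y)], z)] == op_table[(x, op_table[(y, z)])]
        for x in domain for y in domain for z in domain
    )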


For the convenience of the reader, Table 6 gives the truth table of ⋄M. It can be shown that repeated application of this operator to the input pairs gjhj, j∈[1, i], actually results in sM(i). This is closely related to the elegant recursive structure of Binary Reflected Gray Code, leading to the important observation that if in a valid string there is a metastable bit at position m, then the remaining B−m following bits are the maximum code word of a (B−m)-bit code.


It may be observed that for g∈SrgB, if there is an index 1≤m<B such that gm=M, then gm+1,B=10^{B−m−1}.


The reasoning is based on distinguishing two main cases: one is that sM(i) contains at most one metastable bit, the other that sM(i)=MM. Each of these cases can be proven by technical statements.


It may further be observed that if |res(sM(i))|≤2 for any i∈[B+1], then res(sM(i))=⋄_{j=1}^{i} res(gjhj).


The operator out: 𝔹^2×𝔹^2→𝔹^2 is the operator given in Table 4 computing maxrg{g, h}i minrg{g, h}i out of s(i−1) and gihi. For convenience of the reader, we provide the truth table of outM in Table 7.



FIG. 3 shows an example for a computation of the 2-sort(9) circuit arising from the inventive construction for fan-out f=3. The inputs are g=101010110 and h=101M10000; see Table 10 for sM(i)(g, h) and the output. More particularly, Table 10 shows an example run of the FSM in FIG. 2 on inputs g=101010110 and h=101M10000. We drop sM(9), as it is not needed to compute g′9h′9. We labeled each ⋄M by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Gray boxes indicate recursive steps of the PPC construction; see also FIG. 7 for a larger PPC circuit using the one here in its “right” top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted.


In order to derive a small circuit from the above, we make use of the PPC framework by Ladner and Fischer. They described a generic method that is applicable to any finite state machine translating a sequence of B input symbols to B output symbols, to obtain circuits of size O(B) and depth O(log B). They reduce the problem to a parallel prefix computation (PPC) task by observing that each input symbol defines a restricted transition function, whose compositions evaluated on the starting state yield the state of the machine after the corresponding number of steps. This matches our needs, as we need to determine sM(i) for each i∈[B]. However, their generic construction involves large constants. Fortunately, we have established that ⋄M: 𝔹M^2×𝔹M^2→𝔹M^2 is an associative operator, permitting us to directly apply the circuit templates for associative operators they provide for computing sM(i)=(⋄M)_{j=1}^{i} gjhj for all i∈[B]. Accordingly, only these templates are discussed.


We revisit the part of the framework relevant to our construction, also providing a minor improvement on their results in the process. To this end, we first formally specify the PPC task for the special case of associative operators.


Definition (PPC(B)). For associative ⊕: D×D→D and B∈ℕ, a PPC(B) circuit is specified as follows.


Input: d∈D^B,


Output: π∈D^B,


Functionality: πi=⊕_{j=1}^{i} dj for all i∈[1, B].


In our case, ⊕=⋄M and D=𝔹M^2. The method by Ladner and Fischer provides a family of recursive constructions of PPC circuits. They are obtained by combining two different recursive patterns.



FIG. 4 shows the recursion tree T4 (center). Right nodes are depicted black, left nodes gray and leaves are depicted white. The recursive patterns applied at left and right nodes are shown on the left and right, respectively. At the root and its left child, we have that B′=B/2; for other nodes, B′ gets halved for each step further down the tree (where the leaves simply wire their single input to their single output). The left pattern comes in different variants. The basic construction does not incorporate the gray buffers; these will be needed to reduce fan-out. The gray wire with index B′+1 is present only if B′ is odd; this never occurs in PPC(C,Tb), but becomes relevant when initially applying the left pattern exclusively for k∈ℕ steps, reducing the size of the resulting circuit at the expense of increasing its depth by k.


More particularly, suppose that C and P are circuits implementing ⊕ and PPC(⌈B/2⌉) for some B∈ℕ, respectively. Then applying the recursive pattern given at the left of FIG. 4 (i) with B′:=B and without the rightmost gray wire if B is even and (ii) with B′:=B−1 if B is odd yields a PPC(B) circuit. It has depth 2d(C)+d(P) and size at most (B−1)|C|+|P|. Moreover, the last output is at depth at most d(C)+d(P) of the circuit.


The second recursive pattern, shown in FIG. 4c, avoids increasing the depth of the circuit beyond the necessary d(C) for each level of recursion. Assume for now that B is a power of 2. We represent the recursion as a tree Tb, where b:=log B, given in the center of FIG. 6. It has depth b with all leaves at this depth, and there are two types of non-leaf nodes: right nodes (filled in black) have two children, a left and a right node, whereas left nodes (filled in gray) have a single child, which is a right node. Tb is essentially a Fibonacci tree in disguise.


Definition. T0 is a single leaf. T1 consists of the (right) root and two attached leaves. For b≥2, Tb can be constructed from Tb−1 and Tb−2 by taking a (right) root r, attaching the root of Tb−1 as its right child, a new left node l as the left child of r, and then attaching the root of Tb−2 as (only) child of l.


The recursive construction is now defined as follows. A right node applies the pattern given in FIG. 4 to the right, where Rl is the circuit (recursively) defined by the subtree rooted at the left child, Rr is the circuit (recursively) defined by the subtree rooted at the right child, and B′=2^{b−d−1}, where d∈[b] is the depth of the node. A left node applies the pattern on the left, where the recursively used circuit Rc is defined by the subtree rooted at its child and B′=2^{b−d}, where d∈[b] is the depth of the node. The base case for a single input and output is simply a wire connecting the input to the output, for both patterns. As b=log B and each recursive step cuts the number of inputs and outputs in half, the base case applies if and only if the node is a leaf. Note that the figure shows the recursive patterns at the root and its left child, where B′=2^{b−1} is always even (i.e., in this recursive pattern, the gray wire with index B′+1 is never present); when applying the patterns to nodes further down the tree, the number of inputs and B′ are scaled down by a factor of 2 for every step towards the leaves.
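Functionally, the two patterns can be summarized by the following Python sketch (our illustration of the data flow, not a gate-level description): the left pattern pairs neighbouring inputs and recurses on half as many values, while the right pattern recurses on both halves and combines the left half's last prefix with every prefix of the right half.

def ppc_left(xs, op):
    # Left pattern: pair neighbours, recurse (in T_b the inner call is a right
    # node), then fill in the odd prefixes with one extra op each.
    if len(xs) == 1:
        return xs[:]
    paired = [op(xs[2 * i], xs[2 * i + 1]) for i in range(len(xs) // 2)]
    if len(xs) % 2:
        paired.append(xs[-1])                     # odd tail input passed through
    inner = ppc_right(paired, op)
    out = []
    for i, x in enumerate(xs):
        out.append(inner[i // 2] if i % 2 else (x if i == 0 else op(inner[i // 2 - 1], x)))
    return out

def ppc_right(xs, op):
    # Right pattern: recurse on both halves, then combine the left half's last
    # prefix with every prefix of the right half (the high fan-out wire).
    if len(xs) == 1:
        return xs[:]
    mid = (len(xs) + 1) // 2
    left, right = ppc_left(xs[:mid], op), ppc_right(xs[mid:], op)
    return left + [op(left[-1], r) for r in right]

import operator
assert ppc_right(list(range(1, 9)), operator.add) == [1, 3, 6, 10, 15, 21, 28, 36]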


In the following, denote by PPC(C,Tb) the circuit that results from applying the recursive construction described above to the base circuit C implementing ⊕. Moreover, we refer to the ith input and output of the subcircuit corresponding to node ν∈Tb as diν and πiν, respectively.


It may be shown that if C implements ⊕, then PPC(C,Tb) is a PPC(2^b) circuit of depth b·d(C).


It remains to bound the size of the circuit. Denote by Fi, i∈ℕ, the ith Fibonacci number, i.e., F1=F2=1 and Fi+1=Fi+Fi−1 for all 2≤i∈ℕ. Then it may be shown that PPC(C,Tb) has size (2^{b+2}−F_{b+5}+1)|C|.
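This closed form is easy to check numerically; the sketch below (ours, assuming gates are counted per recursive pattern as described above: B′−1 gates for a left step on B′ inputs, and one combining gate per output of the right half for a right step) evaluates the size recurrence and compares it with 2^{b+2}−F_{b+5}+1.

from functools import lru_cache

@lru_cache(None)
def size_right(n):        # right pattern on n inputs (n a power of two)
    if n == 1:
        return 0
    return size_left(n // 2) + size_right(n // 2) + n // 2

@lru_cache(None)
def size_left(n):         # left pattern on n inputs; the inner call is a right node
    if n == 1:
        return 0
    return (n - 1) + size_right(n // 2)

def fib(i):               # F_1 = F_2 = 1
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a

for b in range(1, 11):
    assert size_right(2 ** b) == 2 ** (b + 2) - fib(b + 5) + 1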


Asymptotically, the subtractive term of F_{b+5} is negligible, as F_{b+5}∈(1/√5+o(1))((1+√5)/2)^{b+5}⊆O(1.62^b); however, unless B is large, the difference is substantial. We also get a simple upper bound for arbitrary values of B. To this end, we “split” in the recursion such that the left branch is “complete,” while applying the same splitting strategy on the right. This is where our construction differs from and improves on the method of Ladner et al. They perform a balanced split and obtain an upper bound of 4B on the circuit size.


It follows that for B∈ℕ and a circuit C implementing ⊕, setting b:=⌈log B⌉, a PPC(B) circuit of depth ⌈log B⌉·d(C) and size smaller than (5B−2^b−F_{b+3})|C|≤(4B−F_{b+3})|C| exists.


We remark that one can give more precise bounds by making case distinctions regarding the right recursion, which for the sake of brevity we omit here. Instead, we computed the exact numbers for B≤70.



FIG. 5 shows a comparison of the balanced recursion from Ladner et al. and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas “upper bound” refers to the above-given bound; the fan-out 3 curves show that the unbalanced strategy performs better also for the construction (for f=3 and k=0) we derive next.


The construction derived from iterative application of the above results can be combined with PPC(C,Tb), achieving the following trade-off; note that if B=2^b for b∈ℕ, then F_{⌈log B⌉−k+3} can be replaced by F_{b−k+5}.


Suppose C implements ⊕. For all k∈[⌈log B⌉+1] and B∈ℕ, there is a PPC(B) circuit of depth (⌈log B⌉+k)·d(C) and size at most







((2 + 1/2^{k−1}) B − F_{⌈log B⌉−k+3}) |C|.







FIG. 6 shows the construction of PPC(C,T4)′. On the left, we see the recursion tree, with the aggregation trees separated and shown at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also necessitate duplicating some gates in the aggregation trees.


The optimal depth construction incurs an excessively large fan-out of Θ(B), as the last output of left recursive calls needs to drive all the copies of C that combine it with each of the corresponding right call's outputs. This entails that, despite its lower depth, it will not result in circuits of smaller physical delay than simply recursively applying the construction from FIG. 4a. Naturally, one can insert buffer trees to ensure a constant fan-out (and thus constantly bounded ratio between delay and depth), but this increases the depth to Θ(log^2 B+d(C)·log B).


We now modify the recursive construction to ensure a constant fan-out, at the expense of a limited increase in size of the circuit. The result is the first construction that has size O(B), optimal depth, and constant fan-out.


In the following, we denote by f≥3 the maximum fan-out we are trying to achieve, where we assume that gates or memory cells providing the input to the circuit do not need to drive any other components. For simplicity, we consider C to be a single gate.


We proceed in two steps. First, we insert 2B buffers into the circuit, ensuring that the fan-out is bounded by 2 everywhere except at the gate providing the last output of each subcircuit corresponding to a right node. In the second step, we will resolve this by duplicating such gates sufficiently often, recursively propagating the changes down the tree. Neither of these changes will affect the output of the circuit or its depth, so the main challenges are to show our claim on the fan-out and bounding the size of the final circuit.


Step 1: Almost Bounding Fan-Out by 2


Before proceeding to the construction in detail, we need some structural insight on the circuit.


For node ν∈Tb, define its range Rν and left-count αν recursively as follows.

    • If ν is the root, then Rν=[1,2^b] and αν=0.
    • If ν is the left child of p with Rp=[i, i+j], then Rν=[i, i+(j+1)/2] and αν=αp.
    • If ν is the right child of right node p with Rp=[i, i+j], then Rν=[i+(j+1)/2+1, i+j] and αν=αp.
    • If ν is the right child of left node p, then Rν=Rp and αν=αp+1.


Suppose the subcircuit of PPC(C,Tb) represented by node ν∈Tb in depth d∈[b+1] has range Rν=[i, i+j]. Then

    • (i) it has 2^{b−d} inputs,
    • (ii) j=2^{b−d+αν}−1,
    • (iii) if ν is a right node, all its inputs are outputs of its children's subcircuits, and
    • (iv) if ν is a left node or leaf, only its even inputs are provided by its child (if it has one) and for odd k∈[1,2^{b−d}], we have that







dkν = ⊕_{ℓ=i+(k−1)·2^{αν}}^{i+k·2^{αν}−1} dℓ.







This leads to an alternative representation of the circuit PPC(C,Tb), see FIG. 6, in which we separate gates in the recursive pattern from FIG. 4a that occur before the subcircuit Rc. Adding the buffers we need in our construction, this results in the modified patterns given in FIG. 6b. The separated gates appear at the bottom of FIG. 6a: for each leaf ν of Tb, there is a tree of depth αν aggregating all of the circuit's inputs from its range. Each non-root node in an aggregation tree provides its output to its parent. In addition, one of the two children of an inner node in the tree must provide its output as an input to one of the subcircuits corresponding to a node of Tb, cf. Property (iv) above.


From this representation, we will derive that the following modifications of PPC(C,Tb) result in a PPC(2b) circuit PPC(C,Tb)′, for which a fan-out larger than 2 exclusively occurs on the last outputs of subcircuits corresponding to nodes of Tb.

    • 1) Add a buffer on each wire connecting a non-root node of any of the aggregation trees to its corresponding subcircuit (see FIG. 6a).
    • 2) For the subcircuit corresponding to left node l with range Rl=[i, i+j], add for each even k≤j (i.e., every even output index except the last one, j+1) a buffer before output πkl (see bottom of FIG. 6b).
    • 3) For each right node r with range [i, i+j], add a buffer before output π(j+1)/2r (see top of FIG. 6b).


With the exception of gates providing the last output of subcircuits corresponding to nodes of Tb (blue in FIG. 6b), the fan-out of PPC(C,Tb)′ is at most 2. Buffers or gates driving an output of the circuit drive nothing else.


It remains to count the inserted buffers. The following helper statement will be useful for this, but also later on.


Denote by Lb⊆Tb the set of leaves of Tb. Then |Lb|=F_{b+2} and Σ_{ν∈Lb} 2^{αν}=2^b.


Next, consider the recurrence given by L′0=1, L′1=2, and L′b=L′b−1+2L′b−2 for b≥2; the factor of 2 assigns twice the weight to the subtree rooted at the child of the root's left child, thereby ensuring that each leaf is accounted for with weight 2^{αν}. This recurrence has solution 2^b.
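Both identities can be checked with a few lines of Python (ours; fib is the helper from the size-check sketch above, and the recursion mirrors the definition of Tb and of the left-counts αν):

def leaf_alphas(b, alpha=0):
    # T_0 is a single leaf; T_1 is a right root with two leaf children; otherwise
    # the root's left child (a left node) leads to T_(b-2) with left-count alpha+1,
    # and the right child is the root of T_(b-1).
    if b == 0:
        return [alpha]
    if b == 1:
        return [alpha, alpha]
    return leaf_alphas(b - 2, alpha + 1) + leaf_alphas(b - 1, alpha)

for b in range(0, 12):
    alphas = leaf_alphas(b)
    assert len(alphas) == fib(b + 2) and sum(2 ** a for a in alphas) == 2 ** b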


Denote by s the size of a buffer. Then





|PPC(C,Tb)′|=|PPC(C,Tb)|+(2^b+2^{b−1}−F_{b+3})s.


Step 2: Bounding Fan-Out by f


In the second step, we need to resolve the issue of high fan-out of the last output of each recursively used subcircuit in PPC(C,Tb)′. Our approach is straightforward. Starting at the root of Tb and progressing downwards, we label each node ν with a value a(ν) that specifies a sufficient number of additional copies of the last output of the subcircuit represented by ν to avoid fan-out larger than f. At right nodes, this is achieved by duplicating the gate computing this output sufficiently often, marked blue in FIG. 6b (top). For left nodes, we simply require the same number of duplicates to be provided by the subcircuit represented by their child (i.e., we duplicate the blue wire in the bottom recursive pattern shown in FIG. 6b). Finally, for leaves, we will require a sufficient number of duplicates of the root of their aggregation tree; this, in turn, may require making duplicates of their descendants in the aggregation tree.


We start by defining a(υ) and then argue how to use these values for modifying the circuit to obtain our fan-out f circuit. Afterwards, we will analyze the increase in size of the circuit compared to PPC(C,Tb)′.


Definition (a(ν)). Fix b∈ℕ0. For ν∈Tb in depth d∈[b+1], define








a


(
v
)


:

=

{



0



if





v





is





the





root








a


(
p
)


+

2

b
-
d



f




if





v





is





the





left





child





of





p







a


(
p
)


f




if





v





is





the





right





child





of





right





node





p





p



if





v





is





the






(
only
)






child





of





left





node






p
.










Suppose that for each leaf ν∈Tb, there are ⌊a(ν)⌋ additional copies of the root of the aggregation tree, and for each right node ν∈Tb, we add ⌊a(ν)⌋ gates that compute (copies of) the last output of their corresponding subcircuit of PPC(C,Tb)′. Then we can wire the circuit such that all gates that are not in aggregation trees have fan-out at most f, and each output of the circuit is driven by a gate or buffer driving only this output.


It remains to modify the aggregation trees so that sufficiently many copies of the roots' output values are available.


Consider an aggregation tree corresponding to leaf ν∈Tb and fix f≥3. We can modify it such that the fan-out of all its non-root nodes becomes at most f, there are ⌊a(ν)⌋ additional gates computing the same output as the root, and at most f·a(ν)/(f−2)+(2^{αν}−1)/(f−1) gates are added.


Finally, we need to count the total number of gates we add when implementing these modifications to the circuit.


For f≥3, define PPC(f)(C,Tb) by modifying PPC(C,Tb)′ as described above. Then, with λ1:=(1+√5)/4, |PPC(f)(C,Tb)| is bounded by










|PPC(C,Tb)′| + 2^b (1/(2f−2) + 2/(f−2) + O(λ1^b/f^2)) |C|.






We summarize our findings in the following:


Suppose that C implements ⊕, buffers have size s and depth at most d(C), and set λ1:=(1+√5)/4. Then for all k∈[b+1], b∈ℕ0 and f≥3, there is a PPC(2^b) circuit of fan-out f, depth (b+k)·d(C), and size at most








(2^{b+1} + 2^{b−k} (2 + (5f−6)/(2f^2−6f+4) + O(λ1^b/f^2))) |C| + (2^b + 2^{b−k−1}) s.






Due to space constraints, we refrain from analyzing the size of the construction for values of B that are not powers of 2. However, in FIG. 8 we plot the exact bounds (without buffers) for k=0 and selected values of f against B.



FIG. 7 shows, as an example for the overall resulting construction, PPC(3)(C,T4). Right recursion steps Rr are marked with dark gray, left recursion steps with light gray. The steps at the root (above) and aggregation trees (below) are not marked explicitly. Duplicated gates are depicted in a layered fashion. Dashed lines indicate that a wire is not participating in a recursive step.



FIG. 8 shows a dependence of the size of the modified construction on f. For comparison, the upper bound on the circuit with unbounded fan-out is shown as well.



FIG. 9 shows an XMUX circuit according to an embodiment of the invention, used to implement ⋄M and outM.


First, we need to specify implementations of the sub circuits computing ⋄M and outM.


From Tables 5a and 5b, for s, b∈𝔹^2 we can extract the Boolean formulas





(s⋄b)1=s1s2+s1b1+s2b1





(s⋄b)2=s1s2+s1b2+s2b2





out(s,b)1=s1b2+s2b1+b1b2





out(s,b)2=s1b2+s2b1+b1b2.


In general, realizing a Boolean formula f by replacing negation, multiplication, and addition by inverters, AND, and OR gates, respectively, does not result in a circuit implementing fM.1 However, we can easily verify that the above formulas are disjunctions of all prime implicants of their respective functions. As one can manually verify, these formulas evaluate to the truth tables given in Tables 6 and 7, and in this special case the resulting circuits do implement the closure, provided the gates behave as in Table 3, which the implementations given in FIG. 1 do. Using distributive laws (recall that these also hold in Kleene logic), the above formulas can be rewritten as





(s⋄b)1=s1(s2+b1)+s2b1





(s⋄b)2=s2(s1+b2)+s1b2





out(s,b)1=b1(b2+s2)+b2s1





out(s,b)2=b2(b1+s1)+b1s2.



1 For instance, (s⋄b)1=s1b1+s2b1 as Boolean formula, but the two expressions differ when evaluated on s1=s2=1 and b1=M. The circuits resulting from the different formulas are implementations of a multiplexer (with select bit b1) and its closure, respectively.
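The footnote's distinction can be reproduced with the three-valued gate sketches introduced earlier (our illustration; variable names are ours): the naive multiplexer formula loses containment on a metastable select bit, whereas the disjunction of all prime implicants matches the closure on that input.

def mux_naive(sel, x, y):
    return or_m(and_m(x, sel), and_m(y, not_m(sel)))

def mux_all_pi(sel, x, y):
    # Adds the consensus term x·y, i.e., all prime implicants of the multiplexer.
    return or_m(or_m(and_m(x, sel), and_m(y, not_m(sel))), and_m(x, y))

MUX_M = closure(lambda sel, x, y: x if sel == "1" else y)
assert mux_naive("M", "1", "1") == "M"                      # not containing
assert mux_all_pi("M", "1", "1") == "1" == MUX_M("M", "1", "1")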


We see that, in fact, a single circuit with suitably wired (and possibly negated) inputs can implement all four operations. As for sel1=sel2 the circuit implements a multiplexer with select bit sel1, we refer to it as extended multiplexer, or XMUX for short. Its functionality is specified by





XMUX(sel1,sel2,x,y):=y(x+sel2)+x sel1.


Table 8 lists how to map inputs to compute ⋄M and outM.


We note that this circuit is not a particularly efficient XMUX implementation; a transistor-level implementation would be much smaller. However, our goal here is to verify correctness and give some initial indication of the size of the resulting circuits—a fully optimized ASIC circuit is beyond the scope of this article. The size of the implementation may be slightly reduced by moving negations. Due to space limitations, we refrain from detailing this modification here, but note that FIG. 12 and table 9 consider it.


We now have all the pieces in place to assemble a containing 2-sort(B) circuit. As stated above, ⋄M is associative. Thus, from a given implementation of ⋄M (e.g., two copies of the circuit from FIG. 9 with appropriate wiring and negation, cf. Table 8) we can construct PPCM (B−1) circuits of small depth and size, as shown above. We can combine such a circuit with an outM implementation (again, two XMUXes with appropriate wiring and negation will do) to obtain our 2-sort(B) circuit.



FIG. 10 shows constructing 2-sort(B) from outM and PPCM(B−1).


The correctness of this construction follows from the above explanations, where we can plug in any PPCM(B−1) circuit. For the circuits derived by relying on the XMUX circuit from FIG. 9, we independently confirmed this via simulation.
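Independently of the VHDL flow described next, such a check can also be phrased in software against the brute-force specification (a sketch reusing rg, superposition and max_min_Mrg from the earlier illustrations; two_sort stands for any candidate implementation):

def valid_strings(B):
    # S_rg^B: all codewords plus superpositions of consecutive codewords.
    vs = [rg(B, x) for x in range(2 ** B)]
    vs += [superposition([rg(B, x), rg(B, x + 1)]) for x in range(2 ** B - 1)]
    return vs

def check_two_sort(two_sort, B):
    for g in valid_strings(B):
        for h in valid_strings(B):
            if two_sort(g, h) != max_min_Mrg(g, h):
                return (g, h)                     # counterexample found
    return None                                   # all pairs of valid strings agree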


More particularly, we implemented the design given in FIG. 10 on register transfer level using the PPCM(B−1) circuit described above for k=0. Quartus by Altera was used for design entry, which in our case mainly consists of checking correct implementation. After design entry, we used ModelSim by Altera for behavioral simulation. Note that we must not simulate the preprocessed Quartus output, because processing may compromise metastability-containing behavior. Instead, we simulate pure VHDL. Metastable signals are simulated using the VHDL signal value ‘X’, because its behavior matches the worst-case behavior assumed for M.



FIG. 11 shows an excerpt from a simulation for 4-bit inputs, where X=M. The rows show (from top to bottom) the inputs g and h, both outputs of the simple non-containing circuit, and both outputs of our design. Inputs g and h are randomly generated valid strings. Columns 1 and 3 show that the simpler design fails to implement a 2-sort(4) circuit.


For the implementation of PPCM(B−1) we used the basic circuits, i.e., we did not make use of the extension to constant fan-out. We exhaustively checked the design from FIG. 10 for B up to 12 (and all k accordingly). Simulation shows that the design works correctly for several levels of recursion, e.g., when regarding B=1 and B=2 as simple base cases, B=12 implies 3 levels of recursion for both patterns. We refrained from simulating the constant fan-out construction, because it simply replicates intermediate results without adding functionality.


After behavioral simulation we continue with a comparison of our design and a standard sorting approach Bin-comp(B). As mentioned earlier, the 2-sort(B) implementation given in FIG. 9 is slightly optimized by pulling out a negation from the operators in every recursive step [3]. After design entry as described above we use Encounter RTL Compiler for synthesis and Encounter for place and route. Both tools are part of the Cadence tool set and in both steps we use NanGate 45 nm Open Cell Library as a standard cell library.


Since metastability-containing circuits may include additional gates that are not required in traditional Boolean logic, Boolean optimization may compromise metastability containing properties. Accordingly, we were forced to disable optimization during synthesis of the circuits.



FIG. 12 shows a comparison of the inventive solution PPC Sort to a standard non-containing one. For the latter, the unexpected delay reduction at B=16 is the result of automatic optimization with more powerful gates, which the inventive solution does not use.


As a binary benchmark Bin-comp was used: In short, Bin-comp consists of a simple VHDL statement comparing two binary encoded inputs and outputting the maximum and the minimum, accordingly. It follows the same design process as 2-sort, but then undergoes optimization using a more powerful set of basic gates. For example, the standard cell library provides prebuilt multiplexers. These multiplexers are used by Bin-comp, but not by 2-sort. We stress that these more powerful gates provide optimized implementations of multiple Boolean functions, yet each of them is still counted as a single gate. Thus, comparing our design to the binary design in terms of gate count, area, and delay disfavors our solution. Moreover, we noticed that the optimization routine switches to employing more powerful gates when going from B=8 to B=16 (cf. FIG. 12), resulting in a decrease of the delay of the Bin-comp implementation.


Nonetheless, our design performs comparably to the non-containing binary design in terms of delay, cf. FIG. 12 and Table 9. This is quite notable, as further optimization of our design is possible by optimizing it on the transistor level, with significant expected gains. The same applies to gate count and area, where a notable gap remains. Recall, however, that the Bin-comp design hides complexity by using more advanced gates and does not contain metastability.


We emphasize that we refrained from optimizing the design by making use of all available gates or devising transistor-level implementations, since such an approach is tied to the utilized library or requires design of standard cells.


In conclusion, we demonstrated that efficient metastability-containing sorting circuits are possible. Our results indicate that optimized implementations can achieve the same delay as non-containing solutions, without a dramatic increase in circuit size. This is of high interest to an intended application motivating us to design MC sorting circuits: fault-tolerant high-frequency clock synchronization. Sorting is a key step in envisioned implementations of the Lynch-Welch algorithm with improved precision of synchronization. The complete elimination of synchronizer delay is possible due to the efficient MC sorting networks presented in this article, enabling an increase in the rate at which clock corrections are applied and significantly reducing the negative impact of phase drift of local clock sources on the precision of the algorithm.


More generally speaking, MC circuits like those presented here are of interest in mixed signal control loops whose performance depends on very short response times. When analog control is not desirable, traditional solutions incur synchronizer delay before being able to react to any input change. Using MC logic saves the time for synchronization, while metastability of the output corresponds to the initial uncertainty of the measurement; thus, the same quality of the computational result can be achieved in shorter time. Note that our circuits are purely combinational, so they can be used in both clocked and asynchronous control logic.


Examples of such control loops are clock synchronization circuits, but MC has been shown to be useful for adaptive voltage control and fast routing with an acceptably low probability of data corruption as well.

Claims
  • 1. A sorting circuit, characterized in that the circuit is metastability-containing.
  • 2. The circuit of claim 1, comprising one or more sub circuits for comparing each prefix pair of at least two input strings.
  • 3. The circuit of claim 2, further comprising a sub circuit for inferring the output bits, based on the result of the comparison.
  • 4. The circuit of claim 2, wherein the input strings are Gray coded.
  • 5. The circuit of claim 4, wherein the Gray code is a binary reflected Gray code.
  • 6. The circuit of claim 1, wherein the sorting circuit is a 2-sort circuit for sorting two input strings.
  • 7. The circuit of claim 1, wherein the sorting circuit is a sorting network for sorting n strings.
  • 8. The circuit of claim 1, wherein a size of the sorting circuit is within the order of the size of the input strings O(B).
  • 9. The circuit of claim 1, wherein buffers are used to bound the fan-out.
  • 10. The circuit of claim 9, wherein the number of buffers used is twice the length (B) of an input string.
  • 11. The circuit of claim 9, wherein a fan-out of the sorting circuit is constant.
  • 12. The circuit of claim 1, wherein a depth of the sorting circuit is within the order of ┌log B┐, wherein B is the number of bits in an input string.
  • 13. A transistor-level implementation of the circuit of claim 1.
Priority Claims (1)
Number Date Country Kind
18000850.0 Oct 2018 EP regional
PCT Information
Filing Document Filing Date Country Kind
PCT/EP2019/079861 10/31/2019 WO 00