The present invention relates to error correction circuits for detecting and correcting errors in transmitted or stored digital data and in particular to an error correction circuit providing high-speed linear programming detection and error correction.
The reliable transmission and storage of digital data, for example binary symbols transmitted as electrical impulses over data communication networks or stored on digital storage media, may employ error detection and/or correction circuitry and protocols to guard the transmitted or stored data against corruption.
Generally, error detection and correction is obtained by providing redundant bits in the transmitted or stored data. A naïve redundancy scheme may simply duplicate the transmission or storage of the data; however, sophisticated redundancy systems provide a limited number of detection and correction bits (henceforth “check bits”) each of which serves to detect and/or correct errors in multiple other data bits. For example, an 8-bit message might have a single ninth check bit termed a parity bit. This parity bit is set or reset so as to make the total number of set bits in the message and parity bit an even number. It will be understood that corruption of any one of the message or parity bits caused by a crossover (i.e., a change of a bit value during transmission or storage) will be readily detected by checking whether the total number of bits set is even. If not, an error in transmission or storage can be assumed.
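The single-parity-bit scheme just described can be sketched in a few lines (a minimal illustration; the function names are ours, not part of the invention):

```python
def parity_bit(message):
    """Check bit that makes the total number of set bits even."""
    return sum(message) % 2

def passes_parity(message_plus_check):
    """True if the message and its parity bit together have even parity."""
    return sum(message_plus_check) % 2 == 0

message = [1, 0, 1, 1, 0, 0, 1, 0]        # 8-bit message
word = message + [parity_bit(message)]    # append the ninth check bit
assert passes_parity(word)                # uncorrupted word checks out
word[3] ^= 1                              # a single crossover during transmission
assert not passes_parity(word)            # ...is readily detected
```

Note that this scheme detects any single-bit crossover but cannot tell which bit flipped, which motivates the multiple-check-bit codes discussed next.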
More generally, multiple check bits can be added to any stored or transmitted data word allowing both detection and correction of errors in the message bits. These extra bits increase the numeric spacing or Hamming distance between legitimate message symbols represented by the data and check bits taken together. Errors are detected if the received symbol is positioned between legitimate symbols and error correction is obtained by choosing the closest legitimate symbol. In this respect it will be understood that error correction simply chooses the most likely correct symbol.
Sophisticated error detection and correction systems may employ “low density parity check codes” in which overlapping subsets of data bits and check bits are subject to independent constraints (e.g. even parity), the constraints thus each applying to only a small subset of the bits. Low-density parity check codes allow transmission of data very close to the Shannon limit that relates the amount of information that can be transmitted over a channel of a given bandwidth in the presence of given noise interference.
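As a toy illustration of overlapping low-density constraints (the small matrix below is our own example, not a code from the invention), each row of a parity-check matrix H constrains only a small subset of the bits:

```python
import numpy as np

# Each row is one even-parity constraint over a small subset of the 6 bits.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]])

def satisfies_all_checks(H, x):
    """True if every constrained subset of bits in x has even parity."""
    return bool(np.all((H @ x) % 2 == 0))

x = np.array([1, 1, 0, 0, 1, 0])     # a valid codeword of this toy code
assert satisfies_all_checks(H, x)
x[0] ^= 1                            # one corrupted bit violates some checks
assert not satisfies_all_checks(H, x)
```

Because the subsets overlap (e.g., bit 1 appears in two checks), a violated check implicates only a few bits, which is what decoding algorithms exploit.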
Decoding information that has been encoded using low density parity check codes is computationally demanding, involving the determination of a most likely transmitted string in the face of errors and subject to the overlapping constraints of the originally transmitted message. One method of performing this decoding, termed “belief propagation”, iteratively communicates values of the received bits in each subset (as maintained in a buffer) to a circuit that attempts to reconcile the received values according to their parity constraint and retransmits updated reconciled values to the buffer. Each bit of the buffer receives updates from multiple reconciling circuits, and a mechanism is provided to integrate the different and often conflicting updates. This cycle is performed iteratively.
Belief propagation is parallelizable to the extent that the calculations associated with each reconciliation step and each integration step may be implemented simultaneously and independently by separate computing elements. This is important for high-speed message processing.
Unfortunately, although belief propagation is often empirically successful, there is no guarantee that it will converge to a set of bits that meets the constraint of the parity rules. Further, belief propagation is subject to an “error floor” representing a limit to its ability to detect and correct errors.
The present invention provides a decoder for low density parity check codes and other similar coding systems that both demonstrably converges and may be parallelized for execution of different portions of the decoding simultaneously for high-speed processing.
Specifically, the present invention provides an error correction circuit including a buffer memory for holding a received string of bits derived from a transmitted string of bits, the latter subject to a probability of transmission error. “Transmitted” shall be understood in this context to include the process of data storage as well as the process of data transmission. A parity rule memory holds a set of parity rules for the transmitted string of bits, each parity rule describing a predefined intended relationship between a subset of the bits as originally transmitted. The buffer memory and parity rule memory communicate with a linear programming optimizer, the latter which generates a corrected string of bits from the received string of bits using a linear programming process configured to maximize a probability that the corrected string of bits represents the transmitted string of bits, subject to the parity rules for the bits.
It is thus a feature of at least one embodiment of the invention to provide a robust error correction and detection decoding that can be proven to converge. Current belief propagation is not subject to this rigorous understanding. It is further a feature of at least one embodiment of the invention to provide an error correction and detection system that does not appear to be subject to an error floor, making it suitable for extremely high reliability error detection and correction.
The linear programming optimizer may iteratively repeat two steps, a first step adjusting values of the corrected string of bits based on iteratively changing replica parity subvectors (henceforth “replicas”) and a second step updating the iteratively changing replicas based upon their deviation from the actual parity rules.
It is thus a feature of at least one embodiment of the invention to provide a decoding system that reduces data flow paths by modification of replicas, thus offering potential for simpler hardware implementation.
The first step of adjusting values of the corrected string of bits may adjust each bit of the corrected string of bits as a function of the iteratively changing replicas independent of the value of the other bits of the corrected string of bits.
It is thus a feature of at least one embodiment of the invention to permit parallel processing of the modifications of the replicas for improved scalability with long data words and improved execution speed.
The electronic circuit may provide multiple independently executing computational elements associated with different replicas to substantially simultaneously adjust the different replicas.
It is thus a feature of at least one embodiment of the invention to provide a circuit exploiting the parallelism of the replica iterations.
The electronic circuit may provide multiple independently executing computational elements associated with different values of the corrected string of bits to substantially simultaneously adjust the different values of the corrected string of bits.
It is thus a feature of at least one embodiment of the invention to provide a circuit exploiting the parallelism of the correction of the received string of bits.
The second step of updating the iteratively changing replicas may define a projection of the iteratively changing replicas onto a parity polytope, a convex hull whose vertices are defined by the parity rules.
It is thus a feature of at least one embodiment of the invention to provide a mechanism for efficiently projecting solutions to the parity polytope, the latter representing a relaxation to the decoding process permitting linear programming to be applied.
The first and second steps may implement an alternating direction method of multipliers.
It is thus a feature of at least one embodiment of the invention to exploit a robust and well-understood mathematical technique in a novel way to provide parallelism in the linear programming solution to error detection and correction.
The maximized probability may model a binary symmetric channel.
It is thus a feature of at least one embodiment of the invention to provide a simple and widely applicable model for the transmission link.
The parity rules may provide a low-density parity check taking as arguments fewer than 20 percent of the bits of the transmitted string of bits. Alternatively or in addition, the parity rules may provide for even parity for a subset of bits of the transmitted string of bits.
It is thus a feature of at least one embodiment of the invention to provide a system suitable for a common class of error correction protocols.
These particular features and advantages may apply to only some embodiments falling within the claims and thus do not define the scope of the invention.
Referring now to
Data communicated on the data communication system 10 may comprise a string of binary bits, termed a “packet” 20, including a content portion 22 and an error correcting portion 24, which may be segregated as shown or interleaved arbitrarily. The error correcting portion 24 provides message redundancy to improve one or both of error detection and error correction in the event that bits of the packet 20 are corrupted or flipped, that is, a bit value changed from 1 to 0 or 0 to 1.
Referring now also to
The present invention provides an error decoder circuit 38 that may be located to receive the packet 20 after transmission at any of the devices 12, 16, 34 and 36, and as so located operate to detect any errors caused by transmission or storage of the packet 20 and, to the extent possible, correct those errors based on an analysis of the transmitted bits of the content portion 22 and the error correcting portion 24.
Referring now to
Generally, each bit 19 will have one of two states generally referred to as zero or one, each parity rule 43 will be represented by a matrix, each replica 45 as a vector, and each Lagrangian multiplier value 47 and each likelihood value 49 will be represented as a binary word of predetermined length. The error decoder circuit 38 will generally receive, through an input channel 39, packets 20 transmitted from another device and, using a process that will be described below and with access to other data of the memory structures 42, 44, 46, and 48, will detect errors in the received packet 41 for later use by a downstream device.
The error decoder circuit 38 may also provide multiple computational units 50 each being, for example, a limited instruction set processor or discrete processing circuitry including but not limited to a gate array or the like. These computational units 50 may execute substantially independently and simultaneously according to a stored program, implemented in software, firmware or as is implicit in the circuit organization and connection, to read and write to each of the memory structures 40, 42, 44, 46, and 48 to permit parallel execution of the error correcting process as will be described. Generally an arbitrary number of computational units 50 may be used and allocated appropriately.
In the embodiment discussed herein, the computational units 50 may include message bit computational units 52 and replica computational units 54. The message bit computational units 52 may be provided in number equal to the number of bits in the binary packet 20 (and hence the number of bits i in the buffer 40) and replica computational units 54 may be provided equal in number to the number j of the replicas 45 of the replica store 44. The invention is not limited to this particular embodiment and it will be understood from the discussion below that the staging of the processing of data by the error decoder circuit 38 may permit a sharing of functions between the message bit computational units 52 and the replica computational units 54 when those computational units are general-purpose processors and that some sequential processing may be permitted with a concomitant loss in speed.
Referring now to
In the example of
Linear programming will be understood to be a mathematical technique for problems that can be expressed as the maximization (or minimization) of a linear function of a vector of variables subject to constraints that define a convex polytope over which the function must be maximized (or minimized). In this case the linear programming problem may be expressed as:
minimize γ1x1+γ2x2+ . . . +γNxN

subject to each xi meeting the parity rules
where γi is a negative log likelihood ratio defined as:

γi=log [W(x̃i|0)/W(x̃i|1)]

and where the numerator and denominator express the probability that x̃i was received given that the value to the right of the vertical line was transmitted. This definition assumes a binary symmetric transmission channel where the bits are equally likely to flip from 0 to 1 as from 1 to 0. A more complete explanation of this formulation is provided in the attached Appendix.
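For a binary symmetric channel with crossover probability p, this negative log likelihood ratio can be computed directly (a sketch; the variable names are illustrative):

```python
import math

def gamma_i(received_bit, p):
    """Negative log likelihood ratio log[W(y|0) / W(y|1)] for a binary
    symmetric channel with crossover probability p."""
    w_given_0 = 1 - p if received_bit == 0 else p       # P(receive y | sent 0)
    w_given_1 = p if received_bit == 0 else 1 - p       # P(receive y | sent 1)
    return math.log(w_given_0 / w_given_1)

# Receiving a 0 makes "0 was sent" more likely, so gamma is positive,
# which steers the linear program toward x_i = 0 (and vice versa).
assert gamma_i(0, 0.1) > 0
assert gamma_i(1, 0.1) < 0
assert abs(gamma_i(0, 0.1) + gamma_i(1, 0.1)) < 1e-12   # symmetric channel
```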
Referring now to
As indicated by process block 72, the parity rule store 42 next may be loaded with the known parity rules for the communication system and, at process block 74, the replica store 44 may be initialized to match the parity rules 43 of the parity rule store 42. Finally the Lagrangian multiplier store 46 may be initialized, for example, to all zero values.
Referring now to
The terms of this equation are more fully defined and developed in the attached Appendix.
The above first step, which updates the values of the bits 19, is limited to changing the bits 19 to values between zero and one. This process may be performed in parallel by multiple message bit computational units 52, each operating simultaneously on a different bit 19, for multiple or all bits 19.
Referring now to
In a second substep, the vector vj is then projected onto a closest surface of the convex hull (parity polytope) whose vertices are defined by the parity rules 43.
Referring momentarily also to
The coordinates of the point 82 of the projection define the new replica vector zj. This updating of the replica vector zj for each of the replica vectors j may also be accomplished in parallel by multiple replica computational units 54 which do not require the solutions to the other updated replica vectors zj to complete their updating.
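A brute-force rendering of this projection for small replica sizes d may clarify the geometry (illustrative only: it enumerates every even-parity vertex and runs Frank-Wolfe with exact line search, where a practical circuit would use the efficient projection described in the Appendix; all function names are ours):

```python
import numpy as np
from itertools import combinations

def even_parity_vertices(d):
    """All length-d binary vectors with an even number of ones
    (the vertices of the parity polytope)."""
    verts = []
    for r in range(0, d + 1, 2):
        for ones in combinations(range(d), r):
            e = np.zeros(d)
            e[list(ones)] = 1.0
            verts.append(e)
    return np.array(verts)

def project_onto_parity_polytope(v, iters=2000):
    """Approximate Euclidean projection of v onto conv(even-parity vertices)
    by Frank-Wolfe with exact line search.  Exponential in d: a brute-force
    reference only, not the efficient two-slice projection."""
    E = even_parity_vertices(len(v))
    z = E[0].copy()                        # start at the all-zeros vertex
    for _ in range(iters):
        s = E[np.argmin(E @ (z - v))]      # vertex minimizing the linearized objective
        d_dir = s - z
        denom = float(d_dir @ d_dir)
        if denom < 1e-15:
            break
        step = min(1.0, max(0.0, float((v - z) @ d_dir) / denom))
        z += step * d_dir
    return z

# A point already inside the polytope projects (approximately) to itself.
v = np.array([0.9, 0.8, 0.2])
z = project_onto_parity_polytope(v)
assert np.linalg.norm(z - v) < 0.05
```

An even-parity vertex such as [1, 1, 0] is reached exactly in one Frank-Wolfe step, since the line search selects the vertex itself.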
Referring now to
λj=λj+μ(Pjx−zj). (5)
Again, this process may be executed in parallel by multiple replica computational units 54.
At decision block 88, convergence is checked by determining whether the greatest difference between a parity rule 43 and its replica 45 (among all the parity rules) is less than a predetermined value ε according to the inequality:

maxj∥Pjx−zj∥∞&lt;ε.
If this condition is satisfied, the current values of xi are returned as a solution to the correction problem as indicated by process block 90. Otherwise an additional iterated loop is performed beginning again at process block 76.
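The complete iteration of process blocks 76 through 90 can be sketched in software (a simplified illustration: the toy parity-check matrix, the penalty value μ, and the brute-force vertex projection are our own choices for the sketch, not part of the claimed circuit):

```python
import numpy as np
from itertools import combinations

def even_parity_vertices(d):
    """All length-d binary vectors with an even number of ones."""
    verts = []
    for r in range(0, d + 1, 2):
        for ones in combinations(range(d), r):
            e = np.zeros(d)
            e[list(ones)] = 1.0
            verts.append(e)
    return np.array(verts)

def project_parity(v, E, iters=200):
    """Frank-Wolfe projection of v onto the parity polytope with vertex set E
    (brute force over vertices -- illustration only)."""
    z = E[0].copy()
    for _ in range(iters):
        s = E[np.argmin(E @ (z - v))]      # best vertex for the linearized objective
        d_dir = s - z
        denom = float(d_dir @ d_dir)
        if denom < 1e-15:
            break
        z += min(1.0, max(0.0, float((v - z) @ d_dir) / denom)) * d_dir
    return z

def admm_decode(H, gamma, mu=1.0, max_iter=300, tol=1e-5):
    """Iterate the two steps: adjust bit values from the replicas, then
    re-project each replica onto the parity polytope; update multipliers
    and stop when every replica matches its parity subvector."""
    M, N = H.shape
    checks = [np.flatnonzero(H[j]) for j in range(M)]        # N_c(j)
    deg = H.sum(axis=0).astype(float)                        # |N_v(i)|
    E_by_d = {len(c): even_parity_vertices(len(c)) for c in checks}
    z = [np.zeros(len(c)) for c in checks]                   # replicas
    lam = [np.zeros(len(c)) for c in checks]                 # multipliers
    x = np.zeros(N)
    for _ in range(max_iter):
        # Bit update: average dual-adjusted replicas, step against gamma, clip.
        acc = np.zeros(N)
        for j, c in enumerate(checks):
            acc[c] += z[j] - lam[j] / mu
        x = np.clip((acc - gamma / mu) / deg, 0.0, 1.0)
        # Replica update: project P_j x + lam_j / mu onto the parity polytope,
        # then update the multipliers and measure the worst residual.
        residual = 0.0
        for j, c in enumerate(checks):
            z[j] = project_parity(x[c] + lam[j] / mu, E_by_d[len(c)])
            r = x[c] - z[j]
            lam[j] += mu * r
            residual = max(residual, float(np.abs(r).max()))
        if residual < tol:                                   # decision block 88
            break
    return x

# Noiseless demonstration: the received word is already a codeword, so the
# linear program's optimum is that codeword and the loop recovers it exactly.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]])
sent = np.array([1, 1, 0, 0, 1, 0])
p = 0.1
gamma = np.where(sent == 0, 1.0, -1.0) * np.log((1 - p) / p)
decoded = (admm_decode(H, gamma) > 0.5).astype(int)
assert np.array_equal(decoded, sent)
```

With corrupted bits the same loop drives the bits toward the most likely codeword, subject to the tightness of the LP relaxation for the code at hand.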
The applicants believe that the solution provided by this linear programming technique avoids the error floor inherent in belief propagation techniques; linear programming decoders have not been observed to display error floors. Accordingly the present invention both provides certainty of convergence and may present a preferred decoding technique for high-reliability applications where frame error rates (also known as word error rates) of 10^−10 or lower are required.
When introducing elements or features of the present disclosure and the exemplary embodiments, the articles “a”, “an”, “the” and “said” are intended to mean that there are one or more of such elements or features. The terms “comprising”, “including” and “having” are intended to be inclusive and mean that there may be additional elements or features other than those specifically noted. It is further to be understood that the method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order discussed or illustrated, unless specifically identified as an order of performance. It is also to be understood that additional or alternative steps may be employed.
References to “an electronic circuit”, “a controller”, a “computer” and “a processor” and the like can be understood to include one or more circuits, controllers, computers or processors that can communicate in a stand-alone and/or a distributed environment(s), and can thus be configured to communicate via wired or wireless communications with other processors, where such one or more processor can be configured to operate on one or more processor-controlled devices that can be similar or different devices. Furthermore, references to memory, unless otherwise specified, can include one or more processor-readable and accessible memory elements and/or components that can be internal to the processor-controlled device, external to the processor-controlled device, and can be accessed via a wired or wireless network. Similarly the term program or the like may refer to conventional computer programming in programming languages stored in memory or implicit programming contained in the interconnection of electronic devices, for example, with programmable field gate arrays. The term electronic circuit should be broadly construed to cover programmable and other circuits including computers, field programmable gate arrays, as well as application-specific integrated circuits and should generally include circuits including both optical and electronic features used for implementing circuit type functions.
It is specifically intended that the present invention not be limited to the embodiments and illustrations contained herein and the claims should be understood to include modified forms of those embodiments including portions of the embodiments and combinations of elements of different embodiments as come within the scope of the following claims. All of the publications described herein, including patents and non-patent publications, are hereby incorporated herein by reference in their entireties.
In this paper we consider a binary linear LDPC code of length N defined by an M×N parity-check matrix H. Each of the M parity checks, indexed by J={1, 2, . . . , M}, corresponds to a row in the parity check matrix H. Codeword symbols are indexed by the set I={1, 2, . . . , N}. The neighborhood of a check j, denoted by Nc(j), is the set of indices i∈I that participate in the jth parity check, i.e., Nc(j)={i|Hj,i=1}. Similarly for a component i∈I, Nv(i)={j|Hj,i=1}. Given a vector χ∈{0,1}N, the jth parity-check is said to be satisfied if Σi∈Nc(j)χi is even. In other words, the set of values assigned to the χi for i∈Nc(j) has even parity. We say that a length-N binary vector χ is a codeword, χ∈C, if and only if (iff) all parity checks are satisfied. In a regular LDPC code there is a fixed constant d such that for all checks j∈J, |Nc(j)|=d. Also for all components i∈I, |Nv(i)| is a fixed constant. For simplicity of exposition we focus our discussion on regular LDPC codes. Our techniques and results extend immediately to general LDPC codes and to high density parity check codes as well.
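The neighborhood notation can be illustrated directly from a parity-check matrix (the small matrix below is an arbitrary example of ours, not a code from the paper):

```python
import numpy as np

H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]])

def Nc(H, j):
    """Neighborhood of check j: variable indices i with H[j, i] = 1."""
    return [int(i) for i in np.flatnonzero(H[j])]

def Nv(H, i):
    """Neighborhood of variable i: check indices j with H[j, i] = 1."""
    return [int(j) for j in np.flatnonzero(H[:, i])]

assert Nc(H, 0) == [0, 1, 3]        # check 0 constrains bits 0, 1, 3
assert Nv(H, 1) == [0, 1]           # bit 1 participates in checks 0 and 1

# Check j is satisfied iff the bits indexed by Nc(j) have even parity.
x = np.array([1, 1, 0, 0, 1, 0])
assert sum(x[Nc(H, 0)]) % 2 == 0
```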
To denote compactly the subset of coordinates of χ that participate in the jth check we introduce the matrix Pj. The matrix Pj is the binary d×N matrix that selects out the d components of χ that participate in the jth check. For example, say the neighborhood of the jth check is Nc(j)={i1, i2, . . . , id}, where i1&lt;i2&lt; . . . &lt;id. Then, for all k∈[d] the (k, ik)th entry of Pj is one and the remaining entries are zero. For any codeword χ∈C and for any j, Pjχ is an even parity vector of dimension d. In other words we say that Pjχ∈𝒫d for all j∈J (a “local codeword” constraint), where 𝒫d is defined as

𝒫d={e∈{0,1}d|∥e∥1 is even}. (1)

Thus, 𝒫d is the set of codewords (the codebook) of the length-d single parity-check code.
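The selector matrix Pj can be built mechanically from the check neighborhood (a sketch with illustrative names):

```python
import numpy as np

def selector_matrix(check_neighborhood, N):
    """Build the d x N binary matrix P_j that picks out the components of x
    participating in check j (neighborhood indices assumed sorted)."""
    d = len(check_neighborhood)
    P = np.zeros((d, N), dtype=int)
    for k, ik in enumerate(check_neighborhood):
        P[k, ik] = 1
    return P

P = selector_matrix([0, 2, 5], 6)
x = np.array([1, 1, 0, 0, 1, 0])
assert list(P @ x) == [1, 0, 0]      # the components x_0, x_2, x_5
```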
We begin by describing maximum likelihood (ML) decoding and the LP relaxation proposed by Feldman et al. Say a vector χ̃ is received over a discrete memoryless channel described by channel law (conditional probability) W: X×X̃→ℝ≥0, with Σχ̃W(χ̃|χ)=1 for all χ∈X. Since the development is for binary codes |X|=2. There is no restriction on X̃. Maximum likelihood decoding selects a codeword χ∈C that maximizes W(χ̃|χ), the probability that χ̃ was received given that χ was sent. For a discrete memoryless channel W, W(χ̃|χ)=Πi∈IW(χ̃i|χi). Equivalently, we select a codeword that maximizes Σi∈I log W(χ̃i|χi). Let γi be the negative log-likelihood ratio, γi:=log [W(χ̃i|0)/W(χ̃i|1)]. Since log W(χ̃i|χi)=−γiχi+log W(χ̃i|0), ML decoding reduces to determining a χ∈C that minimizes γTχ=Σi∈Iγiχi. Thus, ML decoding requires minimizing a linear function over the set of codewords.
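ML decoding as minimization of γTχ over the codebook can be made concrete by exhaustive search (exponential in N, so for illustration only; the toy parity-check matrix is our own):

```python
import numpy as np
from itertools import product

def ml_decode(H, gamma):
    """Minimize gamma^T x over all codewords of H (brute force)."""
    best, best_cost = None, float("inf")
    for bits in product([0, 1], repeat=H.shape[1]):
        x = np.array(bits)
        if np.all((H @ x) % 2 == 0):              # x is a codeword
            cost = float(gamma @ x)
            if cost < best_cost:
                best, best_cost = x, cost
    return best

H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]])
sent = np.array([1, 1, 0, 0, 1, 0])
received = sent.copy()
received[0] ^= 1                                  # one bit flipped in transit
p = 0.1                                           # BSC crossover probability
gamma = np.where(received == 0, 1.0, -1.0) * np.log((1 - p) / p)
assert np.array_equal(ml_decode(H, gamma), sent)  # ML corrects the single flip
```

This toy code has minimum distance 3, so ML decoding corrects any single flip; the point of the LP relaxation is to approach this behavior without the exponential search.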
Feldman et al. [3] show that ML decoding is equivalent to minimizing a linear cost over the convex hull of all codewords. In other words, minimize γTχ subject to χ∈conv(C). The feasible region of this program is termed the “codeword” polytope. However, this polytope cannot be described tractably. Feldman's approach is first to relax each local codeword constraint Pjχ∈𝒫d to Pjχ∈ℙd where

ℙd=conv(𝒫d)=conv({e∈{0,1}d|∥e∥1 is even}). (2)
The object ℙd is called the “parity polytope”. It is the codeword polytope of the single parity-check code (of dimension d). Thus, for any codeword χ∈C, Pjχ is a vertex of ℙd for all j. When the constraints Pjχ∈ℙd are intersected for all j∈J the resulting feasible space is termed the “fundamental” polytope. Putting these ingredients together yields the LP relaxation that we study:
minimize γTχ s.t. Pjχ∈ℙd ∀j∈J. (3)
The statement of the optimization problem in (3) makes it apparent that a compact representation of the parity polytope ℙd is crucial for efficient solution of the LP. Study of this polytope dates back some decades. In [36] Jeroslow gives an explicit representation of the parity polytope and shows that it has a number of vertices and facets exponential in d. Later, in [37], Yannakakis shows that the parity polytope has a small lift, meaning that it is the projection of a polynomially faceted polytope in a dimension polynomial in d. Indeed, Yannakakis' representation requires a quadratic number of variables and inequalities. This is one of the descriptions discussed in [3] to state the LP decoding problem.
Yannakakis' representation of a vector u∈ℙd consists of variables μs∈[0, 1] for all even s≤d. Variable μs indicates the contribution of binary (zero/one) vectors of Hamming weight s to u. Since u is a convex combination of even-weight binary vectors, Σeven s≤dμs=1. In addition, variables zi,s are used to indicate the contribution to ui, the ith coordinate of u, made by binary vectors of Hamming weight s. Overall, the following set of inequalities over O(d2) variables characterizes the parity polytope (see [37] and [3] for a proof).
This LP can be solved with standard solvers in polynomial time. However, the quadratic size of the LP prohibits its solution with standard solvers in real-time or embedded decoding applications. In Section IV-B we show that any vector u∈d can always be expressed as a convex combination of binary vectors of Hamming weight r and r+2 for some even integer r. Based on this observation we develop a new formulation for the parity polytope that consists of O(d) variables and constraints. This is a key step towards the development of an efficient decoding algorithm. Its smaller description complexity also makes our formulation particularly well suited for high-density codes whose study we leave for future work.
In this section we present the ADMM formulation of the LP decoding problem and summarize our contributions. In Section III-A we introduce the general ADMM template. We specialize the template to our problem in Section III-B. We state the algorithm in Section III-C and frame it in the language of message-passing in Section III-D.
To make the LP (3) fit into the ADMM template we relax χ to lie in the hypercube, χ∈[0, 1]N, and add the auxiliary “replica” variables zj∈ℙd for all j∈J. We work with the following decoupled parameterization of the decoding LP:
minimize γTχ

subject to Pjχ=zj ∀j∈J

zj∈ℙd ∀j∈J

χ∈[0,1]N. (4)
The alternating direction method of multipliers works with an augmented Lagrangian which, for this problem, is

Lμ(χ,z,λ)=γTχ+Σj∈JλjT(Pjχ−zj)+(μ/2)Σj∈J∥Pjχ−zj∥22. (5)
Here λj∈ℝd for j∈J are the Lagrange multipliers and μ&gt;0 is a fixed penalty parameter. We use λ and z to succinctly represent the collection of λjs and zjs respectively. Note that the augmented Lagrangian is obtained by adding the two-norm term of the residual to the ordinary Lagrangian. The Lagrangian without the augmentation can be optimized via a dual subgradient ascent method [38], but our experiments with this approach required far too many message passing iterations for practical implementation. The augmented Lagrangian smoothes the dual problem, leading to much faster convergence rates in practice [39]. For the interested reader, we provide a discussion of the standard dual ascent method in the appendix.
Let X and Z denote the feasible regions for variables χ and z respectively: X=[0, 1]N and we use z∈Z to mean that z1×z2× . . . ×z|J|∈ℙd×ℙd× . . . ×ℙd, the |J|-fold product of ℙd. Then we can succinctly write the iterations of ADMM as
χk+1:=argminχ∈X Lμ(χ,zk,λk)

zk+1:=argminz∈Z Lμ(χk+1,z,λk)

λjk+1:=λjk+μ(Pjχk+1−zjk+1).
The ADMM update steps involve fixing one block of variables and minimizing over the other. In particular, χk and zk are the kth iterates, and the updates to the χ and z variables are performed in an alternating fashion. We use this framework to solve the LP relaxation proposed by Feldman et al. and hence develop a distributed decoding algorithm.
The χ-update corresponds to fixing z and λ (obtained from the previous iteration or initialization) and minimizing Lμ(χ, z, λ) subject to χ∈[0, 1]N. Taking the gradient of (5), setting the result to zero, and limiting the result to the hypercube X=[0, 1]N, the χ-update simplifies to

χ:=Π[0,1]N(P−1(ΣjPjT(zj−λj/μ)−γ/μ))

where P=ΣjPjTPj is a diagonal matrix whose ith diagonal entry is |Nv(i)|, and Π[0,1]N denotes the Euclidean projection onto the hypercube [0,1]N.
Component-wise, the update rule corresponds to taking the average of the corresponding replica values, zj, adjusted by the scaled dual variable, λj/μ, and taking a step in the negative log-likelihood direction. For any j∈Nv(i) let zj(i) denote the component of zj that corresponds to the ith component of χ, in other words the ith component of PjTzj. Similarly let λj(i) be the ith component of PjTλj. With this notation the update rule for the ith component of χ is

χi:=Π[0,1]((1/|Nv(i)|)(Σj∈Nv(i)(zj(i)−λj(i)/μ)−γi/μ)).
Each variable update can be done in parallel.
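The component-wise update just described can be sketched as follows (illustrative names; zvals and lamvals hold zj(i) and λj(i) for the checks j∈Nv(i)):

```python
def x_update_component(zvals, lamvals, gamma_i, mu):
    """Average the dual-adjusted replica entries z_j(i) - lambda_j(i)/mu,
    take a step against the negative log-likelihood, and clip to [0, 1]."""
    deg = len(zvals)                                  # |N_v(i)|
    t = sum(z - l / mu for z, l in zip(zvals, lamvals)) / deg
    t -= gamma_i / (mu * deg)
    return min(1.0, max(0.0, t))

# All replicas agree on 1 and the channel is neutral: the bit stays at 1.
assert x_update_component([1.0, 1.0], [0.0, 0.0], gamma_i=0.0, mu=1.0) == 1.0
# A strongly positive log-likelihood ratio pushes the bit toward 0.
assert x_update_component([0.5, 0.5], [0.0, 0.0], gamma_i=10.0, mu=1.0) == 0.0
```

Since each component depends only on its own neighborhood's replicas and multipliers, all components can be updated concurrently.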
The z-update corresponds to fixing χ and λ and minimizing Lμ(χ, z, λ) subject to zj∈ℙd for all j∈J. The relevant observation here is that the augmented Lagrangian is separable with respect to the zjs and hence the minimization step can be decomposed (or “factored”) into separate problems, each of which can be solved independently. This decouples the overall problem, making the approach scalable.
We start from (5) and concentrate on the terms that involve zj. For each j∈J the update is to find the zj∈ℙd that minimizes

λjT(Pjχ−zj)+(μ/2)∥Pjχ−zj∥22.
Since the values of χ and λ are fixed, so are Pjχ and λj/μ. Setting v=Pjχ+λj/μ and completing the square we get that the desired update z*j is

z*j=argminz̃∈ℙd∥v−z̃∥22=Πℙd(v).

The z-update thus corresponds to |J| projections onto the parity polytope.
The complete ADMM-based algorithm is specified in the Algorithm 1 box. We declare convergence when the replicas differ from the χ variables by less than some tolerance ε&gt;0.
We now present a message-passing interpretation of the ADMM decoding algorithm, Algorithm 1. We establish this interpretation using the “normal” factor graph representation [41]
(sometimes also called “Forney-style” factor graphs). One key difference between normal factor graphs and ordinary factor graphs is that the variables in the normal factor graph representation are associated with the edges of a regular factor graph [42], and the constraints of the normal graph representation are associated with both factor and variable nodes of the regular representation. See [41], [43] for details. In representing the ADMM algorithm as a message-passing algorithm the χ and the replicas z are the variables in the normal graph.
We denote by χij(k) the replica associated with the edge joining node i∈I and node j∈J, where k indicates the kth iteration.
The z-update can be rewritten as

mj→(k+1)=Πℙd(m→j(k)+λ′j(k)).
The λ′j update is

λ′j(k+1)=λ′j(k)+(m→j(k)−mj→(k)).
With this interpretation, it is clear that the ADMM algorithm decouples the decoding problem and can be performed in a distributed manner.
In this section we develop our efficient projection algorithm. Recall that 𝒫d={e∈{0, 1}d|∥e∥1 is even} and that ℙd=conv(𝒫d). Generically we say that a point v∈ℙd if and only if there exists a set of ei∈𝒫d such that v=Σiαiei where Σiαi=1 and αi≥0. In contrast to this generic representation, the initial objective of this section is to develop a novel “two-slice” representation of any point v∈ℙd: namely, that any such vector can be written as a convex combination of vectors with Hamming weight r and r+2 for some even integer r. We will then use this representation to construct an efficient projection.
We open the section in Section IV-A by describing the structured geometry of ℙd that we leverage, and laying out the results that will follow in ensuing sections. In Section IV-B, we prove a few necessary lemmas illustrating some of the symmetry structure of the parity polytope. In Section IV-C we develop the two-slice representation and connect the l1-norm of the projection of any v∈ℝd onto ℙd to the (easily computed) “constituent parity” of the projection of v onto the unit hypercube. In Section IV-D we present the projection algorithm.
In this section we discuss the geometry of ℙd. We develop intuition and foreshadow the results to come. We start by making a few observations about ℙd.
First, we can classify the vertices of ℙd by their weight. We do this by defining 𝒫dr, the constant-weight analog of 𝒫d, to be the set of weight-r vertices of ℙd:

𝒫dr={e∈{0,1}d|∥e∥1=r}, (6)

i.e., the constant-weight-r subcode of 𝒫d. Since all elements of 𝒫d are in some 𝒫dr for some even r, 𝒫d=∪0≤r≤d:r even 𝒫dr. This gives us a new way to think about characterizing the parity polytope,

ℙd=conv(∪0≤r≤d:r even 𝒫dr).
Second, we define ℙdr to be the convex hull of 𝒫dr,

ℙdr=conv(𝒫dr)=conv({e∈{0,1}d|∥e∥1=r}). (7)

This object is a “permutahedron”, so termed because it is the convex hull of all permutations of a single vector; in this case a length-d binary vector with r ones. Of course,

ℙd=conv(∪0≤r≤d:r even ℙdr).
Third, define the affine hyperplane consisting of all vectors whose components sum to r as

Hdr={x∈ℝd|1Tx=r}

where 1 is the length-d all-ones vector. We can visualize ℙdr as a “slice” through the parity polytope defined as the intersection of Hdr with ℙd. In other words, a definition of ℙdr equivalent to (7) is

ℙdr=ℙd∩Hdr,

for r an even integer.
Finally, we note that the slices ℙdr are all parallel. This follows since the difference of any two vectors lying in one of these permutahedra is orthogonal to 1. We can think of the line segment that connects the origin to 1 as the major axis of the parity polytope, with each “slice” orthogonal to that axis.
The above observations regarding the geometry of ℙd are illustrated in
Third, in Sec. IV-C we show that, given a point v∈ℝd that we wish to project onto ℙd, it is easy to identify the constituent parity of the projection. To express this formally, let Πℙd(v) be the projection of v onto ℙd. Then, our statement is that we can easily find the even integer r such that Πℙd(v) can be expressed as a convex combination of vectors in 𝒫dr and 𝒫dr+2.
Finally, in Sec. IV-D we develop our projection algorithm. Roughly, our approach is as follows. Given a vector v∈ℝd we first compute r, the constituent parity of its projection. Given the two-slice representation, projecting onto ℙd is equivalent to determining an α∈[0, 1], a vector a∈ℙdr, and a vector b∈ℙdr+2 such that the l2 norm of v−αa−(1−α)b is minimized.
In [45] we showed that, given α, this projection can be accomplished in two steps. We first project v onto αℙdr={χ∈ℝd|0≤χi≤α, Σi=1dχi=αr}, a scaled version of ℙdr, scaled by the convex weighting parameter. Then we project the residual onto (1−α)ℙdr+2. The object αℙdr is an l1 ball with box constraints. Projection onto αℙdr can be done efficiently using a type of waterfilling. Since the function mina∈ℙdr,b∈ℙdr+2∥v−αa−(1−α)b∥22 is convex in α we can perform a one-dimensional line search (using, for example, the secant method) to determine the optimal value for α and thence the desired projection.
In contrast to the original approach, in Sec. D we develop a far more efficient algorithm that avoids the pair of projections and the search for α. In particular, taking advantage of the convexity in α, we use majorization to characterize the convex hull of ℙrd and ℙr+2d in terms of a few linear constraints (inequalities). As projecting onto the parity polytope is equivalent to projecting onto the convex hull of the two slices, we use the characterization to express the projection problem as a quadratic program, and develop an efficient method that directly solves the quadratic program. Avoiding the search over α yields a considerable speed-up over the original approach taken in [45].
Let us first describe some of the essential features of the parity polytope that are critical to the development of our efficient projection algorithm. First, note the following
Proposition 1: u∈ℙd if and only if Σu∈ℙd for every d×d permutation matrix Σ.
This proposition follows immediately because the vertex set of ℙd is invariant under permutations of the coordinate axes.
Since we will be primarily concerned with projections onto the parity polytope, let us consider the optimization problem
minimizez∥v−z∥2 subject to z∈ℙd. (8)
The optimal z* of this problem is the Euclidean projection of v onto ℙd, which we denote by z*=Πℙd(v). Again using the symmetric nature of ℙd, we can show the useful fact that if v is sorted in descending order, then so is Πℙd(v).
Proposition 2: Given a vector v∈ℝd, the component-wise ordering of Πℙd(v) is the same as that of v.
Proof: We prove the claim by contradiction. Write z*=Πℙd(v) and suppose that for indices i and j we have vi>vj but z*i<z*j. Since all permutations of z* are in the parity polytope, we can swap components i and j of z* to obtain another vector in ℙd. Under the assumptions z*j>z*i and vi−vj>0 we have z*j(vi−vj)>z*i(vi−vj). This inequality implies that (vi−z*i)2+(vj−z*j)2>(vi−z*j)2+(vj−z*i)2, and hence the Euclidean distance between v and z* is greater than the Euclidean distance between v and the vector obtained by swapping the components, contradicting the optimality of z*.
These two propositions allow us to assume through the remainder of this section that our vectors are presented sorted in descending order unless explicitly stated otherwise.
The permutation invariance of the parity polytope also lets us employ powerful tools from the theory of majorization to simplify membership testing and projection. The fundamental theorem we exploit is based on the following definition.
Definition 1: Let u and w be d-vectors sorted in decreasing order. The vector w majorizes u if
Σk=1quk≦Σk=1qwk for all q∈{1, . . . , d−1}, and Σk=1duk=Σk=1dwk.
Our results rely on the following Theorem, which states that a vector lies in the convex hull of all permutations of another vector if and only if the former is majorized by the latter (see [44] and references therein).
Theorem 1: Suppose u and w are d-vectors sorted in decreasing order. Then u is in the convex hull of all permutations of w if and only if w majorizes u.
To gain intuition for why this theorem might hold, suppose that u is in the convex hull of all of the permutations of w. Then u=Σi=1npiΣiw with the Σi being permutation matrices, pi≧0, and 1Tp=1. The matrix M=Σi=1npiΣi is doubly stochastic, and one can immediately check that if u=Mw and M is doubly stochastic, then w majorizes u.
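As an illustration of Definition 1 and Theorem 1, the following Python sketch implements the majorization test (the function name majorizes is ours) and checks it on a vertex w of the weight-2 slice and the average u of all permutations of w, which lies in their convex hull:

```python
def majorizes(w, u, tol=1e-9):
    """Return True if w majorizes u (Definition 1), for arbitrary d-vectors."""
    ws = sorted(w, reverse=True)
    us = sorted(u, reverse=True)
    pw = pu = 0.0
    for k in range(len(ws) - 1):
        pw += ws[k]
        pu += us[k]
        if pu > pw + tol:          # a partial sum of u exceeds that of w
            return False
    return abs(sum(ws) - sum(us)) <= tol   # total sums must agree

w = [1, 1, 0, 0]                   # a vertex of the weight-2 slice
u = [0.5, 0.5, 0.5, 0.5]           # the average of all permutations of w
assert majorizes(w, u)             # so w majorizes u (Theorem 1)
assert not majorizes(u, w)         # but not the other way around
```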
To apply majorization theory to the parity polytope, begin with one of the permutahedra ℙsd. We recall that ℙsd is equal to the convex hull of all binary vectors of weight s, equivalently the convex hull of all permutations of the vector consisting of s ones followed by d−s zeros. Thus, by Theorem 1, a sorted u∈[0, 1]d is in ℙsd if and only if
Σk=1quk≦min(q, s) for all q∈{1, . . . , d−1}, and Σk=1duk=s.
The parity polytope ℙd is simply the convex hull of all of the ℙsd with s even. Thus, we can use majorization to provide an alternative characterization of the parity polytope to that of Yannakakis or Jeroslow.
Lemma 1: A sorted vector u lies in ℙd if and only if there exist non-negative coefficients {μs}s even, s≦d such that
Σs evenμs=1, (11)
Σk=1quk≦Σs evenμsmin(q, s) for all q∈{1, . . . , d}, (12)
Σk=1duk=Σs evenμss. (13)
Proof: First, note that every vertex of ℙd of weight s satisfies these conditions with μs=1 and μs′=0 for s′≠s; since any u∈ℙd is a convex combination of such vertices, every u∈ℙd must satisfy (11)-(13). Conversely, if u satisfies (11)-(13), then u is majorized by the vector
w=Σs evenμsbs,
where bs is a vector consisting of s ones followed by d−s zeros. The vector w is contained in ℙd, as are all of its permutations. Thus, by Theorem 1, we conclude that u is also contained in ℙd.
While Lemma 1 characterizes the containment of a vector in ℙd, the relationship is not one-to-one; for a particular u∈ℙd there can be many sets {μs} that satisfy the lemma. We will next show that there is always an assignment with only two non-zero μs.
For a∈ℝ, let └a┘even denote the "even-floor" of a, i.e., the largest even integer r such that r≦a. Define the "even-ceiling" ┌a┐even similarly. For a vector u we term └∥u∥1┘even the constituent parity of the vector u. In this section we will show that if u∈ℙd has constituent parity r, then it can be written as a convex combination of binary vectors with weights equal to r and r+2. This result is summarized by the following
Lemma 2: ("Two-slice" lemma) A vector u lies in ℙd if and only if u can be expressed as a convex combination of vectors in ℙrd and ℙr+2d, where r=└∥u∥1┘even.
Proof: Consider any (sorted) u∈ℙd. Lemma 1 tells us that there is always (at least one) set {μs} that satisfies (11)-(13). Letting r be defined as in the lemma statement, we define α to be the unique scalar between zero and one that satisfies the relation ∥u∥1=αr+(1−α)(r+2):
α=(r+2−∥u∥1)/2. (14)
Then, we choose the following candidate assignment: μr=α, μr+2=1−α, and all other μs=0. We show that this choice satisfies (11)-(13), which will in turn imply that there is a ur∈ℙrd and a ur+2∈ℙr+2d such that u=αur+(1−α)ur+2.
First, by the definition of α, (11) and (13) are both satisfied. Further, for the candidate set the relations (12) and (13) simplify to
Σk=1quk≦αmin(q, r)+(1−α)min(q, r+2) for all q∈{1, . . . , d}, (15)
Σk=1duk=αr+(1−α)(r+2). (16)
Showing that (15) is satisfied is straightforward for the cases q≦r and q≧r+2. First consider any q≦r. Since min(q, r)=min(q, r+2)=q, uk≦1 for all k, and there are only q terms, (15) must hold. Second, consider any q≧r+2. We use (16) to write Σk=1quk=αr+(1−α)(r+2)−Σk=q+1duk. Since uk≧0 this is upper bounded by αr+(1−α)(r+2), which we recognize as the right-hand side of (15) since r=min(q, r) and r+2=min(q, r+2).
It remains to verify only one more inequality in (15), namely the case when q=r+1, which is
Σk=1r+1uk≦αr+(1−α)(r+1)=r+1−α.
To show that the above inequality holds, we maximize the right-hand side of (12) for q=r+1 across all valid choices of {μs} and show that the resulting maximum is exactly r+1−α. Since u∈ℙd is guaranteed to have at least one valid choice of {μs}, the left-hand side is at most this maximum, and so our candidate choice satisfies the inequality.
The logic is as follows. Since u∈ℙd, any valid choice for {μs} must satisfy (12) which, for q=r+1, is
Σk=1r+1uk≦Σs evenμsmin(s, r+1). (17)
To see that across all valid choices of {μs} the largest value attainable for the right-hand side is precisely r+1−α, consider the linear program
maximize Σs evenμsmin(s, r+1)
subject to Σs evenμs=1
Σs evenμss=αr+(1−α)(r+2)
μs≧0.
The first two constraints are simply (11) and (13). Recognizing αr+(1−α)(r+2)=r+2−2α, the dual program is
minimize(r+2−2α)λ1+λ2
subject to λ1s+λ2≧min(s,r+1)∀s even.
Setting μr=α, μr+2=1−α, the other primal variables to zero, λ1=½, and λ2=r/2 satisfies the Karush-Kuhn-Tucker (KKT) conditions for this primal/dual pair of LPs. The associated optimal cost is r+1−α. Thus, the right-hand side of (17) is at most r+1−α.
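This primal/dual pair can be verified numerically. The following Python sketch (helper name ours) checks, for a range of d, r, and α, that (λ1, λ2)=(½, r/2) is dual feasible and that its cost equals the cost r+1−α of the primal choice μr=α, μr+2=1−α, so that by weak duality both are optimal:

```python
def check_certificate(d, r, alpha):
    """Verify the dual certificate (1/2, r/2) for the LP in the proof."""
    lam1, lam2 = 0.5, r / 2.0
    # Dual feasibility: lam1*s + lam2 >= min(s, r+1) for every even s <= d.
    for s in range(0, d + 1, 2):
        assert lam1 * s + lam2 >= min(s, r + 1) - 1e-12
    # Dual cost (r+2-2*alpha)*lam1 + lam2 equals the claimed optimum r+1-alpha.
    dual_cost = (r + 2 - 2 * alpha) * lam1 + lam2
    assert abs(dual_cost - (r + 1 - alpha)) < 1e-12
    # The primal choice mu_r = alpha, mu_{r+2} = 1-alpha attains the same value.
    primal_cost = alpha * min(r, r + 1) + (1 - alpha) * min(r + 2, r + 1)
    assert abs(primal_cost - dual_cost) < 1e-12

for d in range(2, 12, 2):
    for r in range(0, d - 1, 2):
        for alpha in (0.0, 0.25, 0.5, 1.0):
            check_certificate(d, r, alpha)
```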
We have proved that if u∈ℙd then the choice of r=└∥u∥1┘even and α as in (14) satisfies the requirements of Lemma 1, and so we can express u as u=αur+(1−α)ur+2. The converse, namely that a vector u which is a convex combination of vectors in ℙrd and ℙr+2d lies in ℙd, holds because conv(ℙrd∪ℙr+2d)⊂ℙd.
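Lemma 2, together with the redundancy observations in its proof, yields a simple membership test: a vector u∈[0, 1]d lies in ℙd exactly when, for r the even floor of ∥u∥1, the sum of its r+1 largest components minus the sum of the rest is at most r. A minimal Python sketch of this test follows (function name ours; the boundary case r≧d is handled separately):

```python
import math

def in_parity_polytope(u, tol=1e-9):
    """Two-slice membership test for the parity polytope (Lemma 2)."""
    d = len(u)
    if any(x < -tol or x > 1 + tol for x in u):     # must lie in the unit box
        return False
    r = int(math.floor(sum(u) + tol))
    r -= r % 2                                      # constituent parity of u
    if r >= d:                                      # forces u = 1, of even weight d
        return d % 2 == 0
    z = sorted(u, reverse=True)
    # Only the q = r+1 inequality of (15) is non-redundant; with alpha
    # eliminated it reads: (top r+1 entries) - (remaining entries) <= r.
    return sum(z[: r + 1]) - sum(z[r + 1:]) <= r + tol

assert in_parity_polytope([0.75, 0.75, 0.75, 0.25])    # interior point
assert not in_parity_polytope([1.0, 0.0, 0.0, 0.0])    # odd-weight vertex
assert in_parity_polytope([1.0, 1.0, 0.0, 0.0])        # even-weight vertex
```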
A useful consequence of Theorem 1 is the following corollary.
Corollary 1: Let u be a vector in [0, 1]d. If Σk=1duk is an even integer then u∈ℙd.
Proof: Let Σk=1duk=s. Since u is majorized by the sorted binary vector of weight s, Theorem 1 gives u∈ℙsd which, in turn, implies u∈ℙd.
We conclude this section by showing that we can easily compute the constituent parity of Πℙd(v) without explicitly computing the projection of v.
Lemma 3: For any vector v∈ℝd, let z=Π[0,1]d(v). Then
└∥z∥1┘even≦∥Πℙd(v)∥1≦┌∥z∥1┐even.
That is, we can compute the constituent parity of the projection of v by projecting v onto [0, 1]d and computing the even floor.
Proof: Let ρU=┌∥z∥1┐even and ρL=└∥z∥1┘even. We prove the following fact: given any y′∈ℙd with ∥y′∥1>ρU, there exists a vector y∈[0, 1]d such that ∥y∥1=ρU, y∈ℙd, and ∥v−y∥22<∥v−y′∥22. The implication of this fact will be that any vector in the parity polytope with l1 norm strictly greater than ρU cannot be the projection of v. Similarly we can also show that any vector with l1 norm strictly less than ρL cannot be the projection onto the parity polytope.
First we construct the vector y based on y′ and z. Define the set of "high" values to be the coordinates on which y′i is greater than zi, i.e., ℋ:={i∈[d]|y′i>zi}. Since by assumption ∥y′∥1>ρU≧∥z∥1, we know that |ℋ|≧1. Consider the test vector t defined component-wise as ti=zi for i∈ℋ and ti=y′i for i∉ℋ.
Note that ∥t∥1≦∥z∥1≦ρU<∥y′∥1. The vector t differs from y′ only on ℋ. Thus, by changing (reducing) components of y′ in the set ℋ we can obtain a vector y such that ∥y∥1=ρU. In particular, there exists a vector y with ∥y∥1=ρU such that y′i≧yi≧zi for i∈ℋ and yi=y′i for i∉ℋ. Since the l1 norm of y is even and y is in [0, 1]d, we have by Corollary 1 that y∈ℙd.
We next show that for all i∈ℋ, |vi−yi|≦|vi−y′i|. The inequality will be strict for at least one i, yielding ∥v−y∥22<∥v−y′∥22 and thereby proving the claim.
We start by noting that y′∈ℙd, so y′i∈[0, 1] for all i. Hence, if zi<y′i for some i we must also have zi<1, in which case vi≦zi since zi is the projection of vi onto [0, 1]. Therefore, if y′i>zi then zi≧vi. Thus for all i∈ℋ we get y′i≧yi≧zi≧vi, where the first inequality is strict for at least one i. Since yi=y′i for i∉ℋ, this means that |vi−yi|≦|vi−y′i| for all i, where the inequality is strict for at least one value of i. Overall, ∥v−y∥22<∥v−y′∥22 with both y∈ℙd (by construction) and y′∈ℙd (by assumption). Thus, y′ cannot be the projection of v onto ℙd, and so the l1 norm of the projection satisfies ∥Πℙd(v)∥1≦ρU. A similar argument shows that ∥Πℙd(v)∥1≧ρL, and so ∥Πℙd(v)∥1 must lie in [ρL, ρU].
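The constituent-parity computation of Lemma 3 reduces to a clip and an even floor, as in the following Python sketch (function name ours):

```python
def constituent_parity(v):
    """Even floor of ||clip(v)||_1; by Lemma 3 this is the constituent
    parity of the projection of v onto the parity polytope."""
    z = [min(1.0, max(0.0, x)) for x in v]   # project v onto [0,1]^d
    r = int(sum(z))
    return r - (r % 2)                       # even floor

# Clipping [1.3, 0.7, 0.4, -0.2] gives z = [1, 0.7, 0.4, 0] with ||z||_1 = 2.1,
# so the projection has l1 norm between 2 and 4, and constituent parity 2.
assert constituent_parity([1.3, 0.7, 0.4, -0.2]) == 2
```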
In this section we formulate a quadratic program (Problem PQP) for the projection problem and then develop an algorithm (Algorithm 2) that efficiently solves the quadratic program.
Given a vector v∈ℝd, sorted in decreasing order, set r=└∥Π[0,1]d(v)∥1┘even. By Lemma 3, r is the constituent parity of the projection z*=Πℙd(v), and by Proposition 2, z* is also sorted in decreasing order.
The constraint that z* has to be sorted in decreasing order can be stated as Sz*≧0, where S is the (d−1)×d difference matrix with Si,i=1, Si,i+1=−1, and all other entries zero, and 0 is the all-zeros vector.
In addition, Lemma 2 implies that z* is a convex combination of vectors of Hamming weight r and r+2. Using inequality (15) we get that a d-vector z∈[0, 1]d, with
Σk=1dzk=αr+(1−α)(r+2) (18)
for some α∈[0, 1], is a convex combination of vectors of weight r and r+2 if it satisfies the following bounds:
Σk=1qz(k)≦αmin(q, r)+(1−α)min(q, r+2) for all q∈{1, . . . , d}, (19)
where z(k) denotes the kth largest component of z. As we saw in the proof of Lemma 2, the fact that the components of z are no more than one implies that the inequalities (19) are satisfied for all q≦r. Also, (18) enforces the inequalities for q≧r+2. Therefore, the inequalities in (19) for q≦r and q≧r+2 are redundant, and only the case q=r+1 need be kept. Note that in addition we can eliminate the variable α by solving (18), giving
α=(r+2−Σk=1dzk)/2
(see also (14)). Therefore, for a sorted vector v, we can write the projection onto ℙd as the optimization problem
minimizez∥v−z∥22
subject to z∈[0, 1]d, Sz≧0,
0≦(r+2−Σk=1dzk)/2≦1, (20)
Σk=1r+1zk≦r+1−(r+2−Σk=1dzk)/2. (21)
The last two constraints can be simplified as follows. First, constraint (20) simplifies to r≦Σk=1dzk≦r+2. Next, defining the vector
ƒr:=(1, . . . , 1, −1, . . . , −1)T,
with r+1 entries equal to +1 followed by d−r−1 entries equal to −1, we can rewrite inequality (21) as ƒrTz≦r. Using these simplifications yields the final form of our quadratic program:
Problem PQP:
minimizez∥v−z∥22 (22)
subject to ƒrTz≦r, (23)
0≦z≦1, (24)
r≦1Tz≦r+2, (25)
Sz≧0. (26)
The projection algorithm we develop efficiently solves the KKT conditions of PQP. The objective function is strongly convex and the constraints are linear. Hence, the KKT conditions are not only necessary but also sufficient for optimality. To formulate the KKT conditions, we first construct the Lagrangian with dual variables β, μ, γ, ξ, θ, and ζ:
L=½∥v−z∥22+β(ƒrTz−r)+μT(z−1)−γTz−θTSz+ξ(1Tz−r−2)−ζ(1Tz−r).
The KKT conditions are then given by stationarity of the Lagrangian, complementary slackness, and feasibility:
z=v−βƒr−μ+γ−(ξ−ζ)1+STθ,
0≦β⊥ƒrTz−r≦0,
0≦μ⊥z≦1,
0≦γ⊥z≧0,
0≦θ⊥Sz≧0,
0≦ξ⊥1Tz−r−2≦0,
0≦ζ⊥1Tz−r≧0. (27)
A vector z that satisfies the stationarity, complementary slackness, and feasibility conditions (27) is equal to the projection of v onto ℙd.
To proceed, set
βmax:=(vr+1−vr+2)/2
and define the parameterized vector
z(β):=Π[0,1]d(v−βƒr).
The following lemma implies that the optimizer of PQP, i.e., z*=Πℙd(v), is z(βopt) for some βopt∈[0, βmax].
Lemma 4: There exists a βopt∈[0, βmax] such that z(βopt) satisfies the KKT conditions of the quadratic program PQP.
Proof: Note that when β>βmax we have that zr+1(β)<zr+2(β), so z(β) is ordered differently from v and ƒrTz(β)<r. Consequently z(β) cannot be the projection onto ℙd for β>βmax. At the other boundary of the interval, when β=0 we have z(0)=Π[0,1]d(v). If ƒrTz(0)≦r, then βopt=0 satisfies the complementary slackness condition on β and we are done.
Assume now that ƒrTz(0)>r. Taking the directional derivative with respect to β increasing, we obtain the following:
(d/dβ)ƒrTz(β)=−|{k|0<zk(β)<1}|≦0,
proving that ƒrTz(β) is a non-increasing function of β. Therefore, by the intermediate value theorem and the observation that ƒrTz(β)<r for β>βmax, there exists a βopt∈(0, βmax] such that ƒrTz(βopt)=r.
First note that z(βopt) is feasible for Problem PQP. We need only verify (25), i.e., that r≦1Tz(βopt)≦r+2. Recalling that the components of z(βopt) are non-negative and that the entries of ƒr are at most +1, we have
1Tz(βopt)≧ƒrTz(βopt)=r.
The components of z(βopt) are all less than one, so Σk=1r+1zk(βopt)≦r+1. Combining this with the equality ƒrTz(βopt)=r tells us that Σk=r+2dzk(βopt)≦1. We therefore find that 1Tz(βopt) is no more than r+2.
To complete the proof, we need only find dual variables to certify the optimality. Setting ξ, ζ, and θ to zero, and μ and γ to the values required to satisfy (27) provides the necessary assignments to satisfy the KKT conditions.
Lemma 4 thus certifies that all we need to do to compute the projection is to compute the optimal β. To do so, we use the fact that the function ƒrTz(β) is a piecewise linear function of β. For a fixed β, define the active set 𝒜(β) to be the indices where z(β) is strictly between 0 and 1,
𝒜(β):={k|1≦k≦d, 0<zk(β)<1}. (30)
Let the clipped set 𝒞(β) be the indices where z(β) is equal to one,
𝒞(β):={k|1≦k≦d, zk(β)=1}. (31)
Let the zero set 𝒵(β) be the indices where z(β) is equal to zero,
𝒵(β):={k|1≦k≦d, zk(β)=0}. (32)
Note that with these definitions, we have
ƒrTz(β)=Σk∈𝒜(β)ƒr,kvk−|𝒜(β)|β+|𝒞(β)∩{1, . . . , r+1}|−|𝒞(β)∩{r+2, . . . , d}|, (33)
which is linear in β for a fixed assignment of the three sets.
Our algorithm simply increases β until the active set changes, keeping track of the sets 𝒜(β), 𝒞(β), and 𝒵(β). We break the interval [0, βmax] into the locations where the active set changes, and compute the value of ƒrTz(β) at each of these breakpoints until ƒrTz(β)<r. At this point, we have located the appropriate active set for optimality and can find βopt by solving the linear equation (33) for ƒrTz(β)=r.
The breakpoints themselves are easy to find: they are the values of β at which some component zk(β) becomes equal to one or equal to zero. First, define the following sets
ℬ1:={vi−1|1≦i≦r+1},
ℬ2:={vi|1≦i≦r+1},
ℬ3:={−vi|r+2≦i≦d},
ℬ4:={−vi+1|r+2≦i≦d}.
The sets ℬ1 and ℬ2 concern the r+1 largest components of v; ℬ3 and ℬ4 the d−r−1 smallest components. The set of breakpoints is
ℬ:=(ℬ1∪ℬ2∪ℬ3∪ℬ4∪{0, βmax})∩[0, βmax].
There are thus at most 2d+2 breakpoints.
To summarize, our Algorithm 2 sorts the input vector, computes the set of breakpoints, and then marches through the breakpoints until it finds a value of βi∈ℬ with ƒrTz(βi)≦r. Since we will also have ƒrTz(βi−1)>r, the optimal β will lie in [βi−1, βi] and can be found by solving (33). In the algorithm box for Algorithm 2, b is the largest and a is the smallest index in the active set. We use V to denote the sum of the elements in the active set and Λ the total sum of the vector at the current breakpoint. Some of the awkward if statements in the main for loop take care of the cases when the input vector has many repeated entries.
Algorithm 2 requires two sorts (sorting the input vector and sorting the breakpoints), and then an inspection of at most 2d breakpoints. Thus, the total complexity of the algorithm is linear plus the time for the two sorts.
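Because ƒrTz(β) is non-increasing in β, the breakpoint march can also be replaced by a simple bisection when an exact breakpoint solve is not required. The following Python sketch takes that route; it is a simplified stand-in for Algorithm 2, not a transcription of it, and all names are ours:

```python
def project_parity_polytope(v, iters=60):
    """Project v onto the parity polytope via z(beta) = clip(v - beta*f_r),
    searching for f_r^T z(beta) = r. Bisection on beta stands in for the
    exact breakpoint march of Algorithm 2."""
    d = len(v)
    order = sorted(range(d), key=lambda k: v[k], reverse=True)
    vs = [v[k] for k in order]                       # sorted copy (Proposition 2)
    clip = lambda x: min(1.0, max(0.0, x))
    r = int(sum(clip(x) for x in vs))
    r -= r % 2                                       # constituent parity (Lemma 3)
    if r >= d:                                       # projection is all-ones
        zs = [1.0] * d
    else:
        f = [1.0] * (r + 1) + [-1.0] * (d - r - 1)   # f_r from the PQP formulation
        z_of = lambda b: [clip(x - b * fk) for x, fk in zip(vs, f)]
        g = lambda b: sum(fk * zk for fk, zk in zip(f, z_of(b))) - r
        if g(0.0) <= 0.0:
            zs = z_of(0.0)                           # beta_opt = 0: box projection suffices
        else:
            lo, hi = 0.0, max(abs(x) for x in vs) + 1.0   # g(hi) <= 0 is guaranteed
            for _ in range(iters):                   # g is non-increasing in beta
                mid = 0.5 * (lo + hi)
                if g(mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            zs = z_of(hi)
    z = [0.0] * d
    for pos, k in enumerate(order):                  # undo the sort
        z[k] = zs[pos]
    return z

# For v = (1, 0, 1, 1) the projection is (0.75, 0.25, 0.75, 0.75).
z = project_parity_polytope([1.0, 0.0, 1.0, 1.0])
assert all(abs(a - b) < 1e-6 for a, b in zip(z, [0.75, 0.25, 0.75, 0.75]))
```

A vector already in the polytope, such as (1, 1, 0, 0), is returned unchanged, since then ƒrTz(0)≦r and βopt=0.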
If we work with an (un-augmented) Lagrangian
L0(χ, z, λ):=γTχ+Σj∈𝒥λjT(Pjχ−zj),
the dual subgradient ascent method consists of the iterations:
χk+1:=argminχ∈XL0(χ, zk, λk)
zk+1:=argminz∈ZL0(χk, z, λk)
λjk+1:=λjk+μ(Pjχk+1−zjk+1).
Note here that the χ and z updates are run with respect to the k iterates of the other variables, and can be run completely in parallel.
The χ-update corresponds to solving the very simple LP:
minimizeχ∈[0,1]N(γ+ΣjPjTλjk)Tχ.
This results in the assignment:
χik+1=H(−(γ+ΣjPjTλjk)i),
where H(·) is the Heaviside function.
For the z-update, we have to solve the following LP for each j∈𝒥:
maximize (λjk)Tzj
subject to zj∈ℙd. (34)
Maximizing a linear function over the parity polytope can be performed in linear time. First, note that an optimal solution necessarily occurs at a vertex, which is a binary vector with an even Hamming weight. Let r be the number of positive components in the cost vector λjk. If r is even, the vector v∈ℙd which is equal to 1 where λjk is positive and zero elsewhere is a solution of (34), as making any additional component nonzero decreases the cost, as does zeroing out any of the components equal to 1. If r is odd, we only need to compare the cost of the vector equal to 1 in the r−1 largest components and zero elsewhere to the cost of the vector equal to 1 in the r+1 largest components and zero elsewhere.
The procedure to solve (34) is summarized in Algorithm 3. Note that finding the smallest positive element and largest nonnegative element can be done in linear time. Hence, the complexity of Algorithm 3 is O(d).
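The vertex search just described can be sketched in Python as follows (function name ours); min and max over the positive and non-positive coordinates replace a sort, keeping the procedure linear time:

```python
def maximize_over_parity_polytope(lam):
    """Maximize lam^T z over the parity polytope; the optimum is a vertex,
    i.e., a binary vector of even Hamming weight."""
    d = len(lam)
    pos = [i for i in range(d) if lam[i] > 0]
    if len(pos) % 2 == 0:
        support = pos                      # take every positive coordinate
    else:
        # Odd count: either drop the smallest positive entry or add the
        # largest non-positive one, whichever loses less cost.
        drop = min(pos, key=lambda i: lam[i])
        rest = [i for i in range(d) if lam[i] <= 0]
        add = max(rest, key=lambda i: lam[i]) if rest else None
        if add is not None and lam[add] > -lam[drop]:
            support = pos + [add]
        else:
            support = [i for i in pos if i != drop]
    z = [0.0] * d
    for i in support:
        z[i] = 1.0
    return z

# Even count of positive entries: take them all (cost 3 + 2 = 5).
assert maximize_over_parity_polytope([3.0, 2.0, -1.0, -4.0]) == [1.0, 1.0, 0.0, 0.0]
# Odd count: dropping the +1 entry (cost 5) beats adding the -4 entry (cost 2).
assert maximize_over_parity_polytope([3.0, 2.0, 1.0, -4.0]) == [1.0, 1.0, 0.0, 0.0]
```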
While this subgradient ascent method is quite simple, it requires vastly more iterations than the ADMM method, and thus we did not pursue it any further.
These references are incorporated herein by reference in their entireties.