Robust Regression Analysis Techniques Using Exponential Random Variables

Information

  • Patent Application
  • 20150286612
  • Publication Number
    20150286612
  • Date Filed
    April 07, 2014
    10 years ago
  • Date Published
    October 08, 2015
    9 years ago
Abstract
Embodiments relate to methodologies and program product is provided for conducting regression analysis. In one embodiment the method includes obtaining data related to a statistical process including a plurality of points in a plurality of dimensions and organizing the plurality of points and the plurality of dimensions in a matrix. The method also includes calculating a vector of a particular measurement such that the measurement equal the number of the plurality of points and calculating a least absolute deviation by determining the number of non-zero entries provided in the matrix.
Description
STATEMENT REGARDING PRIOR DISCLOSURES BY THE INVENTOR OR A JOINT INVENTOR

The following disclosure(s) are submitted under 35 U.S.C. 102(b)(1)(A): David P. Woodruff, Qin Zhang, “Subspace Embeddings and lp-Regression Using Exponential Random Variables”, submitted May 23, 2013 (v1), last revised Mar. 17, 2014 (v2); 27 pages.


BACKGROUND

The present disclosure relates generally to regression analysis techniques and more particularly to regression analysis techniques using exponential random variables.


In statistics, regression analysis provides a statistical process for estimating the relationships among variables. These relationships can be used to establish techniques for modeling and analyzing several variables at once. The


focus of such analysis is based on the relationships existing between a dependent variable and one or a plurality of independent variables. Therefore, regression analysis is useful in understanding how values of the dependent variables change when any one of the independent variables is changed especially when other independent variables are held at a constant.


Embedding can be also used to aid regression analysis. In mathematics, an embedding is where a mathematical structure is contained within another group that is in turn a subgroup. An embedding can provide a one-to-one function that is a homeomorphism onto its image. An “oblivious subspace embedding” (OSE) is a type of embedding that provides distribution over matrices S such that for any low-dimensional subspace V, with high probability over the choice of S, ∥Sx∥2 approximately equals ∥x∥2 (up to 1+eps multiplicative error) for all x in V simultaneously.


Oblivious subspace embeddings have proven to be an essential ingredient for quickly and approximately solving numerical linear algebra problems such as used in regression analysis. Prior art provides that such embeddings could be used to approximately solve least squares regression and low rank approximation time. OSE can also be used for speeding up algorithms for several numerical linear algebra problems. Problems that benefit from OSE's may include approximate least squares regression, low-rank approximation, approximating leverage scores, and constructing good preconditioners. A precondition can be defined as a condition or predicate that must always be true. In computing environments, a precondition must be true prior to the execution of some section or all areas of the code or before an operation in a formal specification. Traditionally, if a precondition is violated, the effect of the calculation in statistical data or execution of the code in computing environments becomes undefined and thus may or may not carry out its intended work.


BRIEF SUMMARY

Methodologies and program product is provided for conducting regression analysis. In one embodiment the method includes obtaining data related to a statistical process including a plurality of points in a plurality of dimensions and organizing the plurality of points and the plurality of dimensions in a matrix. The method also includes calculating a vector of a particular measurement such that the particular measurement equals the number of the plurality of points and calculating a least absolute deviation by determining the number of non-zero entries provided in the matrix. The regression analysis is then performed and analysis estimates provided based on the calculated absolute value deviation


Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein. For a better understanding of the disclosure with the advantages and the features, refer to the description and to the drawings.





BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features, and advantages of the disclosure are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:



FIG. 1 depicts a flow analysis in accordance with an embodiment; and



FIG. 2 depicts a block diagram of a computer system for practicing the teachings herein according to an embodiment.





DETAILED DESCRIPTION


FIG. 1 provides a flow diagram associated with one embodiment of the present invention. FIG. 1 along with discussions provided presently provides one or more techniques for presenting a quick and robust solution to regression analysis problems. Such regression problems (also known as absolute deviation regression) may involve oblivious subspace embedding (OSE). In one embodiment, the solution is provided by exponential random variables and is designed to be less sensitive to outliers than square regression solutions.


In one embodiment of the present invention, an optimized technique for providing a low-distortion oblivious subspace embedding is provided. An example of this is provided and will be discussed in conjunction with results for pε[1,2). In this embodiment, the results will be referenced as ΠM and can be computed in O(nnz(M)) time, assuming that nnz(M)≧dω+γ where ω<3 is the exponent of matrix multiplication and γ is an arbitrarily small constant.


In one embodiment, a matrix ΠεRO(d1+γ)×n (γ is an arbitrary small constant) for 1≦p<2 such that given MεRn×d, with constant probability is calculated such that:





Ω(1/(d log d)1/p)·∥Mx∥p≦∥ΠMx∥2≦O((d log d)1/p)·∥Mx∥p, ∀xεRd.


Similarly, a matrix ΠεRO(d log d)×n can be calculated such that given MεRn=d, with constant probability:





Ω(1/(d log d))·∥Mx∥1≦∥ΠMx∥1≦O(d log d)·∥Mx∥1, ∀xεRd.


In order to provide the solutions as discussed above, certain concepts relating to oblivious subspace embedding (OSE) should be discussed. An oblivious subspace embedding with distortion is a distribution over linear maps S:custom-characterncustom-charactert for which for any fixed d-dimensional subspace of Rn, represented as the column space of an n×d matrix M and with constant probability ∥Mx∥p≦∥SMx∥p≦κ∥Mx∥p that is provided simultaneously for all vectors xεcustom-characterd. The goal is to minimize t, κ, and the time to compute S·M. Here for an n-dimensional vector v, ∥υ∥p=(Σt=1n1|p)1/p is the p norm.


Oblivious subspace embedding (OSE) provide a quickly way to approximately solving numerical linear algebra problems especially when solving least squares regression and low rank approximation. Optimizations can be made to streaming models that make this possible. For example, in a least squares regression problem, a matrix M with n×d contents is usually over-constrained (i.e., n>>d, as well as a vector bεcustom-charactern. The goal is to output x*=argminx∥Mx−b∥l2 that is, to find the vector x* so that Mx* is the (Euclidean) projection of b onto the column space of M. This example can be solved in O(nd2) time by computing the normal equations. Prior art provides solutions so that a vector x* can be found with ∥Mx′−b∥2≦(1+ε)∥Mx′−b∥2 in O(nd log d)+poly(d/ε) time, providing substantial improvement. The application of oblivious subspace embeddings is immediate: given M and b, compute SM and Sb, and solve the problem minx∥SMx−Sb∥2. If κ=(1+ε) and t<<n, then a relative error approximation value can be obtained by solving a much smaller instance of regression.


The bottleneck of these algorithms for lp-regression was a preprocessing step, in which one first well conditions the matrix M by choosing a different basis for its column space. Some prior art got around this for the important case of p=1 by designing an oblivious subspace embedding S for which ∥Mx∥1≦∥SMx∥1≦d log d|M|x|1 in which S has O(d log d) rows. In such a case, S was chosen to be a matrix of Cauchy random variables. (A Cauchy distribution is defined as one that has the probability density function of f(x; 0,1)=1/(π(1+x2}.) The key point of the embedding is that one can instead run the expensive conditioning step on the matrix SM, which is much smaller than M, obtaining a d×d change of basis matrix R−1. Then one can show the matrix SMR−1 is well-conditioned. This reduced the running time for l1-regression to ndω−1+poly(d/ε), where ω<3 is the exponent of matrix multiplication. The dominant term is the ndω−1, which is just the cost of computing SM when S is a matrix of Cauchy random variables. However, the prior art solution is inefficient and is also limited in terms of values it can accommodate.


A more structured family of subspace embeddings can be provided by improving the running time for L1-regression to O(nd log n)+poly(d/ε). An alternate construction will provide a family of subspace embeddings that was obtained by partitioning the matrix M into n/poly(d) blocks. Using this approach, it is also possible to obtain an O(nd log n)+poly(d/ε) time algorithm for Lp-regression for every 1≦p≦∞.


The results discussed are suitable for dense matrices, but they are not optimal if the number of non-zero entries of M, denoted nnz(M), is much smaller than nd. Indeed, in practical applications M is often a sparse matrix, and one could hope to achieve a running time of O(nnz(M))+poly(d/ε). A family of sparse oblivious subspace embeddings S with poly(d/ε) rows, for which ∥Mx∥2≦∥SMx∥2≦(1+ε)∥Mx∥2 for all x. Importantly, the time to compute SM is only nnz(M), that is, proportional to the sparsity of the input matrix. The poly(d/ε) factors can be optimized in this way. Combining this idea with the partitioning of M into blocks in the FCT2, they were able to achieve a running time of O(nnz(M)log n)+poly(d/ε) for lp-regression for any constant p, 1≦p<∞. Table 1 can be used to demonstrate the results achieved using this approach.


In Table 1, the L1 oblivious subspace embeddings is provided for a transform with properties of ω<3 as the exponent of matrix multiplication, where γ is an arbitrarily small constant. In this embodiment, the distortion O˜(d3)1 is optimal for p=1, provided that ΠM can be computed in O(nnz(M)) time. The result is provided as a function of the distortion









TABLE 1







Results for L1 oblivious subspace embeddings











Time
Distortion
Dimemsion





[17]
ndω−1
O~(d)
O~(d)


[5]
ndlogd
O~(d2+γ)
O~(d5)


[7] + [14]
nnz(M)logn
O~ d(x+1)/2) (x ≧ 1)
O~ (n/dx)


[7] + [5] + [14]
nnz(M)logn
O~(d3)
O~(d)


[7] + [17] + [14]
nnz(M)logn
O~(d1+ω/2)
O~(d)


[12]
nnz(M)
O~(d3)
O~(d5)


[12] + FJLT
nnz(M) + O~(d6)
O~(d3)
O~(d)


This paper
nnz(M) + O~(dω+γ)
O~(d2)
O~(d)









The results as provided further shows that numerical values are improved in respect to O˜(d2). Previous results for L1 oblivious subspace embeddings can be compared to what is provided in Table 1. This shows that the OSE results under this one embodiment lead to an improvement of (1+ε)-approximation (results for Lp regression for every pε[1,2]). One numerical example can now be provided: for an Lp regression problem specified by MεRn×(d−1), bεRn and p, let M˜=[M,−b)εRn×d. In addition, let φ(t,d) be the time of solving Lp regression problem on t vectors in d dimensions. Using the embodiment discussed, an algorithm for Lp (1≦p<2) regression can be obtained with running time






O(nnz( M)log n+d7−p/2 log3−p/2d+φ(O(d2+p log(1/ε)ε2),d)).


One technique used in the prior art for achieving O(nnz(M)log n)+poly(d) time for Lp-regression with t=poly(d) sketching dimension has the form S·D·M, where S is a t×n hashing matrix, that is, a matrix for which in each column there is a single randomly positioned entry which is randomly either 1 or −1, and D is a diagonal matrix of p-stable random variables. p-stable random variables have the property that if X and Y are independent p-stable random variables, then for scalars a and b, aX+aY is distributed as (|a|p+|b|p)1/pZ, where Z is itself a p-stable random variable. For p=2, they are Gaussian random variables, while for p=1 they are Cauchy random variables.


One inherent limitation with using p-stable random variables is that in analyses of them for regression [5, 12, 17], one needs to bound the concentration of |X|p, the p-th power of the absolute value of a p stable random variable. To upper bound the distortion of the embedding, one needs to bound Pr[|X|p>t median(|X|p)], while to lower bound the distortion one needs to bound Pr[|X|p<1/t median(|X|p)], for t≧1. The smaller these bounds, the smaller the distortion is of the embedding. For p-stable random variables for 1≦p<2, both of these bounds are Ω(1/t), which means |X|p is heavy-tailed in both directions.


In the present embodiment, a starting point for improving these sketches is to use exponential random variables instead of p-stable random variables. Exponential random variables have stability properties with respect to the minimum operation, that is, if are exponentially distributed and λi>0 are scalars, than min{u11, . . . , unλn} is distributed as u/λ, where λ=Piλi. This property can also be used to estimate the harmonic mean of n numbers and for estimating the p-norm for p>2. By replacing the diagonal matrix D in the prior art, for Lp-regression with a diagonal matrix with 1/p diagonal entries 1/ui for independent exponential random variables u1, . . . , un, the sketch coincides with that of sketching dimension t.


By using random variables 1/ui1/p on the diagonal, the restriction of p-stable random variables in the context of regression can be bypassed. In the analysis for regression, the concentration of Xp=1/u can be provided, where u is an exponential random variable. While Pr[Xp>t·median(Xp)] is still Ω(1/t), now Pr[|X|p<1/t·median(Xp)] is only exp(−t), that is, the lower tail is now exponentially small. This property ultimately leads to an improved distortion bound for Lp subspace embeddings into Lp for pε[1,2), and in turn the improved running time for Lp-regression for pε[1,2). The use of exponential random variables requires p≧2 since the variance of 1/u1/p does not exist unless p>2 and he embeds vectors into L with constant distortion, while the analysis for pε[1,2) instead embeds vectors into L2 with a distortion that depends on the dimension of the underlying subspace. To prove this works, the fact that S provides a subspace embedding for L′2 can be shown as provided in FIG. 1 and further optimized in [12,14]. The low dimensional L′2 space is then embedded into L1 in a variety of ways as known by those skilled in the art such as through the use of a Fast Johnson Lindenstrauss Transform. (In mathematics, the Johnson-Lindenstrauss lemma concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space.) This illustrates the versatility of exponential random variables.


In another embodiment, given a matrix MεRn×d, if M1, . . . , Md are the columns of M, and M1, . . . , Mn are the rows of M. then li=∥Mip (i=1, . . . , n) can be defined as the leverage scores of M. Consider, range(M)={y|y=Mx, xεRd}. In addition, w.l.o.g., as will be discussed below, ∥x∥1=1, xεRd; by scaling the results such that it will hold for all xεRd. Define ∥M∥p to be the element-wise Lp norm of M. That is, ∥Mx∥p=(Σiε|d|∥Mipp)1/p=(Σjε|n|∥Mjpp)1/p.


Well-Conditioning of A Matrix—The following definitions can be used in the well conditioning of matrices.


Definition 1—((α, β, p)-well-conditioning—Given a matrix MεRn×d and pε[1,∞), let q be the dual of p, that is, 1/p+1/q=1. In this case, the first formula is derived in the following manner if M is (α, β, p)-well-conditioned:





x∥q≦∥Mx∥p for any xεRd, and  (1)






kMk
p≦α. Define Δp′(M)=custom-character.  (2)


It is well known that for a d-dimensional subspace (Rn, ∥·∥p) is (d1/p, 1, p)-well-conditioned. Thus by definition, ∥x∥q≦∥Ax∥p for any xεRd, and kAkp≦d1/p. In addition, the definition provides for the property that ∥A∥kp=1 for all iε[d].


Definition 2 (lp-conditioning)—Given a matrix MεRn×d and pε[1,∞), let ζpmax(M)=max∥x∥2≦1∥Mxkp and ζpmin (M)=min∥x∥2≧1∥Mx∥p. Define Δp(M)=ζpmax(M)=/ζpmin(M) to be the Lp-norm condition number of M. The following lemma states the relationship between the two definitions.


Lemma 1—Given a matrix MεRn×d and pε[1,∞),






d
−|1/2−1/p|Δp(M)≦Δp′(M)≦dmax{1/2,1/p}Δp(M).


Oblivious Subspace Embeddings—To design an oblivious subspace embedding (OSE), given a parameter d for a a distribution D over m×n matrices (such that for any d-dimensional subspace S⊂Rn, with probability 0.99 over the choices of Π˜D), the following formula can be used for all xεS,





½·∥x∥2≦∥Πx∥2≦3/2·∥x∥2.


In this example, the OSE's only works for the 2-norm, but as can be appreciated similar results can be achieved for Lp-norms for all pε[1,∞)/{2}. Two important parameters needs to be minimized in this case in the construction of OSE's. These are:

    • 1. The number of rows of Π, that is, m. This is the dimension of the embedding.
    • 2. The number of non-zero entries in the columns of Π, denoted by s. This affects the running time of the embedding.


      In several OSE constructions, in particular, where it is shown that there exist OSE's with (m,s)=O(d2), 1) and (m,s)=O(d1+γ), O(1)) this will be true for any constant γ>0.


Distributions—In order to address the problems and subsequently solutions related to distributions of functions, the following discussion will be helpful. Given two variables X, Y, then it is true that X≃Y if X and Y have the same distribution. In addition, for p-stable Distribution, then a distribution Dp is p-stable, if for any n dimensional vector α=(α1, . . . , αn)εRn and Xni.i.d˜Dp,





Σiε|n|αiXi≃∥α∥p)X,


where X˜Dp. It is well-know that p-stable distribution exists if and only if pε[1,2]. For p=2 it is the Gaussian distribution and for p=1 it is the Cauchy distribution. Aa random variable X is p-stable if X is chosen from a p-stable distribution.


Exponential Distributions merit their own discussion. An exponential distribution has support xε[0,∞), probability density function (PDF) f(x)=e−x and cumulative distribution function F(x)=1−e−x. A random variable X is exponential if X is chosen from the exponential distribution. There are several properties relating to exponential distributions that need to be addressed:


Property 1—The exponential distribution has the following properties.

    • 1. (max stability) If u1, . . . , un are exponentially distributed, and αi>0 (i=1, . . . , n) are real numbers, then





max{α1/u1, . . . ,αn/un}≃(Σiε[n]αi)/u,

      • where u is exponential.
    • 2. Lower tail bound—for any X that is exponential, for a large enough constant c0,






Pr[X≦t]≦c
0
t, for ∀t≧0.


Property 2—The second property holds since the median of the exponential distribution is a constant ln 2 (that is, Pr[x≦ln 2]=50%), and the PDFs on x=0, x=ln 2 are f(0)=1, f(ln 2)=½, differing by a factor of 2. Given two random variables X,Y chosen from two probability distributions Xcustom-characterY, if for ∀tεR then Pr[X≧t]≧Pr[Y≧t]. The following lemma shows a relationship between the p-stable distribution and the exponential distribution.


Lemma 2—For any pε[1,2), there exists a constant κp such that





|Xpcustom-characterκb·1/U1/p,


where Xp is p-stable and U is an exponential. In other words, if Xp is p-stable with pε[1,2), then






Pr[X>x]˜c
p
x
−p,


for some constant cp when x→∞. By Property 1 if U is exponential, then






Pr[1/U1/p>x]=Pr[U<1/xp]≦c0x−p,


for some constant c0. Therefore there exists a constant κp such that |Xp|custom-characterκp·1/U1/p.


It should be noted that in the discussion presented here, to aid understanding, several events E0, E1, . . . (along their analysis) are defined such that each of those events holds with probability 0.99, and there are no more than 10 of them. Thus by a union bound all of them hold simultaneously with probability 0.9. This is so that those conditions will not affect the analysis. However, it is appreciated by those skilled in the art that other arrangements can be provided in alternative embodiments and this is only used to achieve understanding and clarity.


Subspace Embedding—It is important that the topic of subspace embedding will now be discussed in relation to its different components. First, the algorithm itself should be provided. The subspace embedding matrix Π=SD is provided such that D is an n×n diagonal matrix with 1/u11/p, . . . , 1/un1/p on the diagonal, where ui (i=1, . . . , n) are i.i.d. exponentials. In this example, S is chosen to be an (m,s)−OSE with (m,s)=O(d1+γ). O(1)) (γ is an arbitrary small constant). More preciously, the first pick random hash functions is h:[n]×[s]→[m/s], σ:[n]×[s]→{−1,1}. For each (i,j)ε[n]×[s] S(j−1)s+h(i,j),i=σ(i,j)/√{square root over (s)}, where √{square root over (s)} is just a normalization factor.


Consequently, the following theorem can be proven:


Theorem 1—Let A be the basis of a d-dimensional subspace of (custom-charactern,∥·∥p). Given the above choices of S and D, let Π=SDεRO(d1+γ)×n). For any 1≦p<2:





Ω(1/(d log d)1/p)·∥Ax∥p≦∥ΠAx∥2≦O((d log d)1/p)·∥Ax∥p, ∀xεcustom-characterd.


Again, since the inequality holds for all xεRd, the theorem holds if the basis is replaced by any full-rank matrix M in d-dimensional subspace of (Rn∥·∥p).


By observing that ∥ΠAx∥2≦∥ΠAx∥p≦d(1+γ)(1/p−1/2)∥ΠA∥k2 for pε[1,2), Theorem 1 directly implies the following.


Corollary 1—Under the same assumptions as Theorem 1, for any 1≦p<2





Ω(1/(d log d)1/p)·∥Ax∥p≦∥ΠAx∥p≦O(d(1+γ)(1/p−1/2)(d log d)1/p)·∥Ax∥p, ∀xεcustom-characterd


where γ is an arbitrary small constant. This can be further improved for the distortion for p=1 by pre-multiplying a Fast Johnson-Lindenstrauss Transform (FJLT) matrix proposed.


Lemma 3—Given a fixed set X of 2ck vectors in Rt (c is a universal constant), there exists a FJLT matrix Φ:Rt→Rk such that for all xεX, for both pε{1,2}, αp∥x∥2/2≦∥Φx∥p≦2αp∥x∥2, where:


α1=k√{square root over (2π−1)} and α2=k. The sketch can be computed in time O(tkω−1).


By a net-argument like before, and setting k=θ(d log d), one can show that α1∥x∥2/2≦kΦxk1≦2α1∥x∥2 holds for all xεRd. Now combining this with Theorem 1, theorem 2 can be obtained.


Theorem 2—Let M be a full-rank matrix in a d-dimensional subspace of (custom-charactern, ∥·∥1). Given the above choices of S and D, let Π=SD.





Ω(1/(d log d)·∥Mx∥1≦∥ΦΠMx∥1≦O(d log d)·∥Mx∥1, ∀xεcustom-characterd.


The embedding matrix ΦΠ can be computed in time O(nnz(M)+dω+γ log d), where γ is an arbitrary small constant. In one example, given the embedded matrix ΦΠ, the distortion for p=1 is tight up to a factor of θ(√{square root over ( d). The worst case M is the same as the “bad” example before where the M=(Id,0)T and Id is the d×d identity matrix. Lemma 3 provides










ΦΠ





Mx



1

=

θ


(
k
)






Suppose that the top d rows of M get perfectly hashed, then





∥ΠMx∥2=(Σiε[d](xi/ui)2)12/


where ui are i.i.d. exponentials. Let i*=argmaxiε[d]1/ui. It is known from Property 1 that with a constant probability, 1/ui*=Ω(d). Now if x is chosen such that xi*=1 and xi=0 for all i6=i*, then kΠMxk2=d. On the other hand, with a constant probability, for Ω(d) of iε[d], 1/ui=Θ(1). Let K(|K|=Ω(d)) denote this set of indices. If x is chosen such that








x
1

=




1
/


K




for





all





i



K





and






x
1



=


0





for





all





i




[
d
]


\



K






,






then









Π





Mx



2


=


1
/



K




=


O


(

1
/

d


)


.







Therefore the distortion is at least Ω(d3/2). Theorem 1 can be proven as well. If E0 is defined to be the event that ∥SDAx∥2=(1±12)∥DAx∥2 which is conditioned on. By this choice of S, E0 holds with probability 0.99. The global parameter is set as ρ=c1d log d.


The formulas and calculations discussed so far do not provide an overestimation. This is can be easily ascertained by looking at







S
=


1

s





(


S
1

,





,

S
s


)

T



,




where each SiεR(m/s)×n with one±1 on each column. For any xεRd, let γ=AxεRn. Let D0εRn×n be a diagonal matrix with i.i.d. p-stable random variables on the diagonal. Let E1 be the event that for all iε[s], for iε[s], ∥SiD0y∥p≦c4(d log d)1/p·∥y∥p for all yεrange(A), where c4 is some constant. Since s=O(1) and S1, . . . , Ss are independent. By the previous discussion, it is known that E1 holds with probability 0.99.














SDy


2








3
/
2

·



Dy


2








(

conditioned





on






ɛ
0


)

















3
/
2

·

κ
p








D



y



2







(

Lemma





2

)
















3
·

κ
p








SD



y



2







(

conditioned





on






ɛ
0


)
















3
·

κ
p








SD



y



p
















3
·

κ
p

·

1

5








2
=
1

5












S
t



D



y



p







(

triangle





inequality

)



















3
·

κ
p

·

1

5


·
5
·



c
d



(

d





log





d

)



1
/
p


·



y


p








(

conditioned





on






ɛ
1


)























cs


(

d





log





d

)



1
/
p


·



y


p



,





(


s
=

O


(
1
)



,


κ
p

=

O


(
1
)




)








(
1
)







where cg is a sufficiently large constant.


Similarly there is also no underestimation because

    • For any xεcustom-characterd, let y=Axεcustom-charactern.





SDy∥2≧½·∥Dy∥2 (conditioned on ε0)≧½·∥Dy∥≧½·1/ρ1/p·∥y∥p.  (2)

    • The last inequality holds with probability (1−e−p), since ∥Dy∥˜∥y∥p/u1/p where u is an exponential, and Pr[u≧ρ]≦e−p for an exponential.


Similar as before, a net-argument can be used. Let ball B={yεRn|y=Ax, ∥y∥p≦1}. Let BεB be an d an-net of B with size at most (3/ε)d. If the choice is made that ε=1/(4c5(ρd2 log d)1/p), then probability 1−e−ρ·(3/ε)d≧0.99, ∥SDy′∥2≧1/(2ρ1/p)·∥y′∥p holds for all y′εBε. Let ε2 denote this event which is conditioned for yεB\Bε, let y′εBε such that ∥y−y′∥p≦ε. By the triangle inequality












SDy


2







SDy




2

-





SD


(

y
-

y



)




2

.






(
3
)













SD


(

y
-

y



)




2








c
5



(

d





log





d

)



1
/
p


·




y
-

y





p















c
5



(

d





log





d

)



1
/
p


·
ε















c
5



(

d





log





d

)



1
/
p


·
ε
·

d

1
/
p







y


p








=




1
/

(

4






ρ

1
/
p



)


·




y


p

.









(
4
)









    • By (2) (3) ad (4), conditioned on ε2, we have for all γεrange(A), it holds that








SDy∥2≧1/(2ρ1/p)·∥y∥p−1/(4ρ1/p)∥y∥p≧1/(4ρ1/p)∥y∥p.  (5)


Regression—In one embodiment, the prior discussion can be applied t regression.


Lemma 4—Given a matrix MεRn×d with full column rank and pε[1,∞), it takes at most O(nd3 log n) time to find a matrix RεRd×d such that MR−1 is (α, β, p)-well-conditioned with αβ≦2d1+max{½,1/p}.


Lemma 5—Given a matrix Mεcustom-charactern×d, pε[1,∞), ε>0, and a matrix RεRd×d such that MR−1 is (α, β, p)-well-conditioned, it takes O(nnz(M)·log n) time to compute a sampling matrix ΠεRt×n such that with probability 0.99:





(1−ε)∥Mx∥p≦∥ΠMx∥p≦(1+ε)∥Mx∥p, ∀xεcustom-characterd.


t is O(custom-character)pd log(1/ε)/ε2 for 1≦p<2 and O(custom-character)pdp/2 log(1/ε)/ε2) for p>2.


Lemma 6—Given an ′p regression problem specified by MεRn×(d−1), b εRn, and pε[1,∞, let Π be a (1±ε)-distortion embedding matrix of the subspace spanned by M's columns and b from Lemma 5, and let x̂ be an optimal solution to the sub-sampled problem minxεRd kΠMx−Πbkp. Then x̂ is a







1
+
ε


1
-
ε





approximation solution to the original problem.


Lemma 7—Given MεRn×d with full column rank, pε[1,2), and ΠεRm×n whose entries are i.i.d. p-stables, if m=cd log d for a sufficiently large constant c, then with probability 0.99:





Ω(1)·∥Mx∥p≦∥ΠMx∥p≦O((d log d)1/p·∥Mx∥p, ∀xεcustom-characterd.


In addition, ΠM can be computed in time O(ndω−1) where ω is the exponent of matrix multiplication.


Lemma 8—Let ΠΣRm×n be a subspace embedding matrix of the d-dimensional normed space spanned by the columns of matrix MεRn×d such that





μ1·∥Mx∥p≦∥ΠMx∥2≦μ2·∥Mx∥p, ∀xεcustom-characterd.  (6)


If R be the “R” matrix in the QR-decomposition of ΠM, then MR−1 is (α, β, p)-well-conditioned with αβ≦d1/pμ2i for any pε[1,2).













MR

-
1



x



p








1
/

μ
1


·




Π






MR

-
1



x



2








(

by






(
6
)


)













=



1
/

μ
1


·



Qx


2








(


Π






MR

-
1



=


QRR

-
1


=
Q


)













=



1
/

μ
1


·



x


2








(

Q





has





orthonormal





columns

)










And












MR

-
1



x



p








1
/

μ
2


·




Π






MR

-
1



x



2








(

by






(
6
)


)













=


1
/

μ
2


·



Qx


2













=


1
/

μ
2


·



x


2









Then by Lemma 1 it holds that






custom-characterp′(MR−1)≦dmax{1/2,1/p}Δn(MR−1)=d1/pμ21.


Theorem 3—There exists an algorithm that given an Lp regression problem specified by MεRn×(d−1), bεRn and pε[1,2), with a constant probability computes a (1+ε)-approximation to an Lp regression problem in time O(nnz(( M)log n+d7−p/2 log3−p/2 d+φ(d2+p log(1/ε)/ε2),d)) where M=[M,−b] and φ(t,d) is the time to solve ′p-regression problem on t vectors in d dimensions. In order to provide a proof of this, let Π be the subspace embedding matrix. By theorem:





12)=(Ω(1/(d log d)1/p),O((d log d)1/p)).


In one embodiment, as provided in FIG. 1, for 1≦p<2) as shown in block 110 first ΠM is computed. In block 120, the QR decomposition is computed for ΠM such that RεRd×d is the “R” in the QR-decomposition. In block 130, Given R, Lemma 5 is used to find a sampling matrix Π1εRt1×n such that





(1−½)·∥Mx∥p≦∥Π1Mx∥p≦(1+½)·∥Mx∥p, ∀xεcustom-characterd.  (7)


In block 140, the Lemma 7 is used to compute a matrix Π2εRt2×t1 for Π1M such that





Ω(1)·∥Π1Mx∥p≦∥Π2Π1Mx∥p≦O((d log d)1/p)·∥Π1Mx∥p, ∀xεcustom-characterd.


Let Π32Π1εRt2=n. By (7) and kzk2≦kzkp≦m1/p−1/2 kzk2 for any zεRm, then





Ω(1/t21/p−1/2)·∥Mx∥p≦∥Π3Mx∥2≦O(d log d)1/p·∥ Mx∥p, ∀xεcustom-characterd.


In block 150, the QR-decomposition of Π3M is computed so that R1εRd×d be the “R” in the QR-decomposition.


In block 160, given R1, Lemma 5 is used to find a sampling matrix Π4εRt3×n such that Π4 is a (1±½) distortion embedding matrix of the subspace spanned by M. In block 170, Lemma 4 is used to compute a matrix R2εRd×d such that Π4MR2−1 is (α, β, p)-well-conditioned with αβ≦2d1+1/p. Finally, in block 180, Given R2, Lemma 5 is used to find a sampling matrix Π5εRt4×n such that Π5 is a (1±ε) distortion embedding matrix of the subspace spanned by M.


In block 190, {circumflex over (x)} is computed which is the optimal solution to the sub-sampled solution mincustom-characterdΠ5Mx−Π5p. In one embodiment, the correctness of the algorithm is guaranteed by Lemma 6. Now for analyzing the running time, block 110 calculated traditionally can costs time O(nnz(M)), by choice of Π Block 120 costs time O(md2)=O(d3+γ) using standard QR-decomposition, where γ is an arbitrarily small constant. Block 130 costs time O(nnz(M−)log n) by Lemma 5, giving a sampling matrix Π1εRt1×n with t1=O(d4 log2 d). Block 140 costs time O(t1dω−1)=O(d3+ω log2 d) where ω is the exponent of matrix multiplication, giving a matrix Π3εRt2×n with t2=O(d log d). Block 150 costs time O(t2d2)=O(d3 log d). Step 6 costs time O(nnz(M)log n) by Lemma 5, giving a sampling matrix Π4εRt3×n with t3=O(d4−p/2 log2−p/2d). Block 170 costs time O(t3d3 log t3)=O(d7−p/2 log3−p/2 d). Block 180 costs time O(nnz(M)log n) by Lemma 5, giving a sampling matrix Π5εRt4×n with t4=O(d2+p(1/ε)/ε2). Block 190 costs time φ(t4,d), which is the time to solve Lp-regression problem on t4 vectors in d dimensions. To sum up, the total running time is






O(nnz( M)log n+d7−p/2 log3−p/2 d+φ(O(d2+p log(1/ε)/ε2),d)).


However, in an algorithm together as suggested in the embodiments above and along with several variants for L1 regression as proposed, all running times are improved especially when in the form of O nnz(M−)log n+poly(d)+φ(O(poly(c) log(1/ε)/ε2), d)). Among all those variants, the power of d in poly(d) (ignoring log factors) in the second term is at least 7, and the power of d in poly(d) in the third term is at least 3.5. In addition, in the proposed techniques and using the algorithms as suggested both terms get improved.


Referring now to FIG. 2, a block diagram of an exemplary computer system 300 for use with the teachings herein is shown. The methods described herein can be implemented in hardware software (e.g., firmware), or a combination thereof. In an exemplary embodiment, the methods described herein are implemented in hardware, and is part of the microprocessor of a special or general-purpose digital computer, such as a personal computer, workstation, minicomputer, or mainframe computer. The system 300 therefore includes general-purpose computer 301.


In an exemplary embodiment, in terms of hardware architecture, as shown in FIG. 2, the computer 301 includes a processor 305, memory 310 coupled via a memory controller 335, a storage device 320, and one or more input and/or output (I/O) devices 340, 345 (or peripherals) that are communicatively coupled via a local input/output controller 335. The input/output controller 335 can be, for example, but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The input/output controller 335 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components. The storage device 320 may include one or more hard disk drives (HDDs), solid state drives (SSDs), or any other suitable form of storage.


The processor 305 is a computing device for executing hardware instructions or software, particularly that stored in memory 310. The processor 305 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 301, a semiconductor based microprocessor (in the form of a microchip or chip set), a macroprocessor, or generally any device for executing instructions. The processor 305 may include a cache 370, which may be organized as a hierarchy of more cache levels (L1, L2, etc.).


The memory 310 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette or the like, etc.). Moreover, the memory 310 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 310 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 305.


The instructions in memory 310 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 2, the instructions in the memory 310 include a suitable operating system (OS) 311. The operating system 311 essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.


In an exemplary embodiment, a conventional keyboard 350 and mouse 355 can be coupled to the input/output controller 335. Other output devices such as the I/O devices 340, 345 may include input devices, for example but not limited to a printer, a scanner, microphone, and the like. Finally, the I/O devices 340, 345 may further include devices that communicate both inputs and outputs, for instance but not limited to, a network interface card (NIC) or modulator/demodulator (for accessing other files, devices, systems, or a network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, and the like. The system 300 can further include a display controller 325 coupled to a display 330. In an exemplary embodiment, the system 300 can further include a network interface 360 for coupling to a network 365. The network 365 can be an IP-based network for communication between the computer 301 and any external server, client and the like via a broadband connection. The network 365 transmits and receives data between the computer 301 and external systems. In an exemplary embodiment, network 365 can be a managed IP network administered by a service provider. The network 365 may be implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as Wi-Fi, WiMax, etc. The network 365 can also be a packet-switched network such as a local area network, wide area network, metropolitan area network, Internet network, or other similar type of network environment. The network 365 may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN) a personal area network (PAN), a virtual private network (VPN), intranet or other suitable network system and includes equipment for receiving and transmitting signals.


If the computer 301 is a PC, workstation, intelligent device or the like, the instructions in the memory 310 may further include a basic input output system (BIOS) (omitted for simplicity). The BIOS is a set of essential routines that initialize and test hardware at startup, start the OS 311, and support the transfer of data among the storage devices. The BIOS is stored in ROM so that the BIOS can be executed when the computer 301 is activated.


When the computer 301 is in operation, the processor 305 is configured to execute instructions stored within the memory 310, to communicate data to and from the memory 310, and to generally control operations of the computer 301 pursuant to the instructions. In exemplary embodiments, the computer system 300 includes one or more accelerators 380 that are configured to communicate with the processor 305. The accelerator 380 may be a field programmable gate array (FPGA) or other suitable device that is configured to perform specific processing tasks. In exemplary embodiments, the computer system 300 may be configured to offload certain processing tasks to an accelerator 380 because the accelerator 380 can perform the processing tasks more efficiently than the processor 305.


The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.


The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The embodiments were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.


Further, as will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.


Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.


A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.


Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.


Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).


Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.


These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.


The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.


The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims
  • 1. A method of conducting regression analysis, comprising: obtaining data related to a statistical process including a plurality of points in a plurality of dimensions;organizing the plurality of points and the plurality of dimensions in a matrix;calculating a vector of a particular measurement such that the particular measurement equals the number of the plurality of points;calculating a least absolute deviation by determining the number of non-zero entries provided in the matrix; andperforming regression analysis and providing analysis estimates based on said calculated absolute value deviation.
  • 2. The method of claim 1, wherein the least absolute deviation is calculated by sketching using an exponential random variable.
  • 3. The method of claim 2, wherein the matrix is defined by a first variable A and the vector is defined by second variable b wherein A is smaller than b.
  • 4. The method of claim 3, wherein the least absolute deviation is calculated based on a regression formula minxΠAx−bΠ1 regression for an L1 regression function.
  • 5. The method of claim 4, wherein the non-zero entries are defined as nnz(M) where the number of plurality of points are defined as n and their dimensions are defined as d.
  • 6. The method of claim 5, wherein the least absolute deviation is calculated by taking an arbitrary constant (p) having a value of greater than or equal to 1 but less than 2 such that a relationship can be established between the statistical data points and the dimensions.
  • 7. The method of claim 6, wherein minxΠAx−bΠ1 is calculated by obtaining a (1+ε)-approximation to lp-regression in O(nnz(M)log n)+poly(d/ε) time.
  • 8. The method of claim 7, wherein nnz(M) is greater than or equal to less d*ω+γ where ω<3 is the exponent of matrix multiplication and γ is an arbitrarily small constant.
  • 9. The method of claim 8, further comprising obtaining a (1+ε)-approximation to an Lp-regression used in calculation of the 6 regression.
  • 10. The method of claim 9, wherein the lp-regression is calculated based on the function O(nnz(M)log n)+poly(d/ε) wherein the best known poly(d/ε) factors for every pε[1,2)
  • 11. A computer program product for calculating a regression of a function, the computer program product comprising: a computer readable storage medium having program code embodied therewith, the program code executable by a processor to: obtain data related to a statistical process including a plurality of points in a plurality of dimensions;organize the plurality of points and the plurality of dimensions in a matrix;calculate a vector of a particular measurement such that the particular measurement equals the number of the plurality of points; andcalculate a least absolute deviation by determining the number of non-zero entries provided in the matrix; andperforming regression analysis and providing analysis estimates based on said calculated absolute value deviation.
  • 12. The computer program product of claim 11, wherein the least absolute deviation is calculated by sketching using an exponential random variable.
  • 13. The computer program product of claim 12, wherein the non-zero entries are defined as nnz(M) where the number of plurality of points are defined as n and their dimensions are defined as d.
  • 14. The computer program product of claim 13, wherein an arbitrary constant (p) is provided in the calculation of the least absolute deviation such that the constant (p) has a value greater than 1 and less than 2 (pε[1,2).
  • 15. The computer program product of claim 14, wherein nnz(M) is greater than or equal to less d*ω+γ where ω<3 is the exponent of matrix multiplication and γ is an arbitrarily small constant.
  • 16. The computer program product of claim 15, wherein the Lp-regression is calculated based on the function O(nnz(M)log n)+poly(d/ε) wherein the best known poly(d/ε) factors for every pε[1,2).
  • 17. The computer program product of claim 16, wherein the distortion of linear maps is defined by S:n→t for which for any fixed d-dimensional subspace of Rn, represented as the column space of an n×d matrix M, with constant probability ∥Mx∥p≦∥SMx∥p≦κ∥Mx∥p simultaneously for all vectors xεd.
  • 18. A method for solving a regression problem comprising: creating a first matrix by sampling data gathered and to be used in the regression solution;organizing the sampled data in the first matrix such that the first matrix has a plurality of rows representing a plurality of points (n) and a plurality of columns associated with dimensions of each plurality of points;calculating a first vector associated with the dimensions in response related to the number of points (n);calculating a least absolute deviation by computing an exponential random variable and a decomposition associated with the exponential random variable;calculating a distortion embedding matrix based on sampled data for the subspace spanned by the matrix;calculating a well conditioned matrix based on the distortion embedding of the subspace spanned by the first matrix;computing an optimal solution based on a sub-sampling of the distortion embedding subspace; andcalculating a least absolute deviation as a solution for the regression by sketching the exponential random variable and applying the optimal based sub-sampling of the distorted embedding to it.