1. Technical Field
This disclosure relates to iterative estimates of an unknown parameter of a model or state of a system.
2. Description of Related Art
The expectation-maximization (EM) algorithm is an iterative statistical algorithm that estimates maximum-likelihood parameters from incomplete or corrupted data. See A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),”Journal of the Royal Statistical Society, Series B 39 (1977) 1-38; G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007); M. R. Gupta and Y. Chen, “Theory and Use of the EM Algorithm,” Foundations and Trends in Signal Processing 4 (2010) 223-296. This algorithm has a wide array of applications that include data clustering, see G. Celeux and G. Govaert, “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis 14 (1992) 315-332; C. Ambroise, M. Dang and G. Govaert, “Clustering of spatial data by the em algorithm,” Quantitative Geology and Geostatistics 9 (1997) 493-504, automated speech recognition, see L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE 77 (1989) 257-286; B. H. Juang and L. R. Rabiner, “Hidden Markov models for speech recognition,”Technometrics 33 (1991) 251-272, medical imaging, see L. A. Shepp and Y. Vardi, “Maximum likelihood reconstruction for emission tomography,” IEEE Transactions on Medical Imaging 1 (1982) 113-122; Y. Zhang, M. Brady and S. Smith, “Segmentation of Brain MR Images through a Hidden Markov Random Field Model and the Expectation-Maximization Algorithm,” IEEE Transactions on Medical Imaging 20 (2001) 45-57, genome-sequencing, see C. E. Lawrence and A. A. Reilly, “An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences,” Proteins: Structure, Function, and Bioinformatics 7 (1990) 41-51; T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine learning 21 (1995) 51-80 , radar denoising, see J. Wang, A. Dogandzic and A. Nehorai, “Maximum likelihood estimation of compound-gaussian clutter and target parameters,” IEEE Transactions on Signal Processing 54 (2006) 3884-3898, and infectious-disease tracking, see M. Reilly and E. Lawlor, “A likelihood-based method of identifying contaminated lots of blood product,” International Journal of Epidemiology 28 (1999) 787-792; P. Bacchetti, “Estimating the incubation period of AIDS by comparing population infection and diagnosis patterns,” Journal of the American Statistical Association 85 (1990) 1002-1008. A prominent mathematical modeler has even said that the EM algorithm is “as close as data analysis algorithms come to a free lunch” , see N. A. Gershenfeld, The Nature of Mathematical Modeling (Cambridge University Press, 1999). But the EM algorithm can converge slowly for high-dimensional parameter spaces or when the algorithm needs to estimate large amounts of missing information, see G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007); M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics (Springer, 1996).
An estimating system may iteratively estimate an unknown parameter of a model or state of a system. An input module may receive numerical data about the system. A noise module may generate random, chaotic, or other type of numerical perturbations of the received numerical data and/or may generate pseudo-random noise. An estimation module may iteratively estimate the unknown parameter of the model or state of the system based on the received numerical data. The estimation module may use the numerical perturbations and/or the pseudo-random noise and the input numerical data during at least one of the iterative estimates of the unknown parameter. A signaling module may determine whether successive parameter estimates or information derived from successive parameter estimates differ by less than a predetermined signaling threshold and, if so, signal when this occurs.
The estimation module may estimate the unknown parameter of the model or state of the system using maximum likelihood, expectation-maximization, minorization-maximization, or another statistical optimization or sub-optimization method.
The noise module may generate random, chaotic, or other type of numerical perturbations of the input numerical data that fully or partially satisfy a noisy expectation maximization (NEM) condition. The estimation module may estimate the unknown parameter of the model or state of the system by adding, multiplying, or otherwise combining the received numerical data with these numerical perturbations.
The estimation module may cause the magnitude of the generated numerical perturbations to eventually decay during successive parameter estimates.
The noise module may generate numerical perturbations that do not depend on the received numerical data. The estimation module may estimate the unknown parameter of the model or state of the system using the numerical perturbations that do not depend on the received numerical data.
The system may be a model that is a probabilistically weighted mixture of probability curves, including scalar or vector Gaussian and Cauchy curves. The noise module may cause the generated numerical perturbations and/or pseudo-random noise to fully or partially satisfy a mixture-based NEM condition, including a component-wise quadratic NEM condition.
Non-transitory, tangible, computer-readable storage media may contain a program of instructions that may cause a computer system running the program of instructions to function as any of the estimating computer systems that are described herein or any of their components.
These, as well as other components, steps, features, objects, benefits, and advantages, will now become clear from a review of the following detailed description of illustrative embodiments, the accompanying drawings, and the claims.
The drawings are of illustrative embodiments. They do not illustrate all embodiments. Other embodiments may be used in addition or instead. Details that may be apparent or unnecessary may be omitted to save space or for more effective illustration. Some embodiments may be practiced with additional components or steps and/or without all of the components or steps that are illustrated. When the same numeral appears in different drawings, it refers to the same or like components or steps.
Illustrative embodiments are now described. Other embodiments may be used in addition or instead. Details that may be apparent or unnecessary may be omitted to save space or for a more effective presentation. Some embodiments may be practiced with additional components or steps and/or without all of the components or steps that are described.
A noise-injected version of the Expectation-Maximization (EM) algorithm is presented: the Noisy Expectation Maximization (NEM) algorithm. The NEM algorithm may use noise to speed up the convergence of the EM algorithm. The NEM theorem shows that additive noise can speed up the average convergence of the EM algorithm to a local maximum of the likelihood surface if a positivity condition holds. Corollary results give special cases when noise improves the EM algorithm such as in the case of the Gaussian mixture model (GMM) and the Cauchy mixture model (CMM). The NEM positivity condition may simplify to a quadratic inequality in the GMM and CMM cases. A final theorem shows that the noise benefit for independent identically distributed additive noise may decrease with sample size in mixture models. This theorem implies that the noise benefit may be most pronounced if the data is sparse.
Careful noise injection can increase the average convergence speed of the EM algorithm. A general sufficient condition for this EM noise benefit is also derived. Simulations show this EM noise benefit in data models that include the ubiquitous Gaussian mixture model (GMM).
The EM noise benefit may be an example of stochastic resonance in statistical signal processing. Stochastic resonance may occur when noise improves a signal system's performance, see A. R. Bulsara and L. Gammaitoni, “Tuning in to Noise,” Physics Today (1996) 39-45; L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, “Stochastic Resonance,” Reviews of Modern Physics 70 (1998) 223-287; B. Kosko, Noise (Viking, 2006): small amounts of noise may improve the performance while too much noise may degrade it. Much early work on noise benefits involved natural systems in physics, see J. J. Brey and A. Prados, “Stochastic Resonance in a One-Dimension Ising Model,” Physics Letters A 216 (1996) 240-246, chemistry, see H. A. Kramers, “Brownian Motion in a Field of Force and the Diffusion Model of Chemical Reactions,” Physica VII (1940) 284-304; A. Förster, M. Merget and F. W. Schneider, “Stochastic Resonance in Chemistry. 2. The Peroxidase-Oxidase Reaction,” Journal of Physical Chemistry 100 (1996) 4442-4447, and biology, see F. Moss, A. Bulsara and M. Shlesinger, eds., Journal of Statistical Physics, Special Issue on Stochastic Resonance in Physics and Biology (Proceedings of the NATO Advanced Research Workshop), volume 70, no. 1/2 (Plenum Press, 1993); P. Cordo, J. T. Inglis, S. Vershueren, J. J. Collins, D. M. Merfeld, S. Rosenblum, S. Buckley and F. Moss, “Noise in Human Muscle Spindles,”, Nature 383 (1996) 769-770; R. K. Adair, R. D. Astumian and J. C. Weaver, “Detection of Weak Electric Fields by Sharks, Rays and Skates,” Chaos: Focus Issue on the Constructive Role of Noise in Fluctuation Driven Transport and Stochastic Resonance 8 (1998) 576-587; P. Hänggi, “Stochastic resonance in biology,” ChemPhysChem 3 (2002) 285-290. This work inspired the search for noise benefits in nonlinear signal processing and statistical estimation. See A. R. Bulsara and A. Zador, “Threshold Detection of Wideband Signals: A Noise-Induced Maximum in the Mutual Information,” Physical Review E 54 (1996) R2185R2188; F. Chapeau-Blondeau and D. Rousseau, “Noise-Enhanced Performance for an Optimal Bayesian Estimator,” IEEE Transactions on Signal Processing 52 (2004) 1327-1334; M. McDonnell, N. Stocks, C. Pearce and D. Abbott, Stochastic resonance: from suprathreshold stochastic resonance to stochastic signal quantization (Cambridge University Press, 2008); H. Chen, P. Varshney, S. Kay and J. Michels, “Noise Enhanced Nonparametric Detection,” IEEE Transactions on Information Theory 55 (2009) 499-506; A. Patel and B. Kosko, “Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding,” IEEE Transactions on Signal Processing 59 (2011) 488-505; B. Franzke and B. Kosko, “Noise can speed convergence in Markov chains,” Physical Review E 84 (2011) 041112. The EM noise benefit may not involve a signal threshold unlike almost all SR noise benefits, see L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, “Stochastic Resonance,” Reviews of Modern Physics 70 (1998) 223-287.
The next sections develop theorems and algorithms for Noisy Expectation-Maximization (NEM). Section 2 summarizes the key facts of the Expectation-Maximization algorithm. Section 3 introduces the theorem and corollaries that underpin the NEM algorithm. Section 4 presents the NEM algorithm and some of its variants. Section 5 presents a theorem that describes how sample size may affect the NEM algorithm for mixture models when the noise is independent and identically distributed (i.i.d.).
The EM algorithm is an iterative maximum-likelihood estimation (MLE) method for estimating pdf parameters from incomplete observed data. See A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),” Journal of the Royal Statistical Society, Series B 39 (1977) 1-38; G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007); M. R. Gupta and Y. Chen, “Theory and Use of the EM Algorithm,” Foundations and Trends in Signal Processing 4 (2010) 223-296. EM may compensate for missing information by taking expectations over all missing information conditioned on the observed incomplete information and on current parameter estimates. A goal of the EM algorithm is to find the maximum-likelihood estimate {circumflex over (θ)} for the pdf parameter θ when the data Y has a known parametric pdf f(y|θ). The maximum-likelihood estimate {circumflex over (θ)} is
where l(θ|y)=ln f(y|θ) is the log-likelihood (the log of the likelihood function).
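The display for equation (1) does not survive in this text. A standard rendering that is consistent with the surrounding definitions (the exact typesetting is an assumption) is

$$\hat{\theta} \;=\; \operatorname*{arg\,max}_{\theta}\; \ell(\theta \mid y). \tag{1}$$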
The EM scheme may apply when an incomplete data random variable Y=r(X) is observed instead of the complete data random variable X. The function r: X→Y may model data corruption or information loss. X=(Y,Z) can denote the complete data X, where Z is a latent or missing random variable. Z may represent any statistical information lost during the observation mapping r(X). This corruption may make the observed data log-likelihood l(θ|y) complicated and difficult to optimize directly in (1).
The EM algorithm may address this difficulty by using the simpler complete log-likelihood l(θ|y,z) to derive a surrogate function Q(θ|θk) for l(θ|y). Q(θ|θk) is the average of l(θ|y, z) over all possible values of the latent variable Z, given the observation Y=y and the current parameter estimate θk:
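The Q-function display that this colon introduces is also missing from the text. A standard rendering that matches the verbal definition above and the integral form used later in the proof (equation (69)) is

$$Q(\theta \mid \theta_k) \;=\; \mathbb{E}_{Z \mid y,\theta_k}\!\left[\ell(\theta \mid y, Z)\right] \;=\; \int_{\mathcal{Z}} \ln f(y, z \mid \theta)\, f(z \mid y, \theta_k)\, dz.$$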
A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),” Journal of the Royal Statistical Society, Series B 39 (1977) 1-38, first showed that any θ that increases Q(θ|θk) cannot reduce the log-likelihood difference l(θ|y)−l(θk|y). This “ascent property” led to an iterative method that performs gradient ascent on the log-likelihood l(θ|y). This result underpins the EM algorithm and its many variants, see G. Celeux and J. Diebolt, “The SEM algorithm: A Probabilistic Teacher Algorithm Derived from the EM Algorithm for the Mixture Problem,” Computational Statistics Quarterly 2 (1985) 73-82; G. Celeux and G. Govaert, “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis 14 (1992) 315-332; X. L. Meng and D. B. Rubin, “Maximum Likelihood Estimation via the ECM algorithm: A general framework,” Biometrika 80 (1993) 267; C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence,” Biometrika 81 (1994) 633; J. A. Fessler and A. O. Hero, “Space-Alternating Generalized Expectation-Maximization Algorithm,” IEEE Transactions on Signal Processing 42 (1994) 2664-2677; H. M. Hudson and R. S. Larkin, “Accelerated Image Reconstruction using Ordered Subsets of Projection Data,” IEEE Transactions on Medical Imaging 13 (1994) 601-609.
The following notation for expectations is used to avoid cumbersome equations:
where S and T are random variables, φ and θ are deterministic parameters, and g is integrable with respect to the conditional pdf fS|T.
A standard EM algorithm may perform the following two steps iteratively on a vector y=(y1, . . . , yM) of observed random samples of Y:
The algorithm may stop when successive estimates differ by less than a given tolerance ∥θk−θk-1∥<10^−tol or when ∥l(θk|y)−l(θk-1|y)∥<ε. The EM algorithm may converge (θk→θ*) to a local maximum θ*, see C. F. J. Wu, “On the Convergence Properties of the EM Algorithm,” The Annals of Statistics 11 (1983) 95-103; R. A. Boyles, “On the convergence of the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological) 45 (1983) 47-50.
The EM algorithm may be a family of MLE methods for working with incomplete data models. Such incomplete data models may include mixture models, see R. A. Redner and H. F. Walker, “Mixture Densities, Maximum Likelihood and the EM algorithm,” SIAM Review 26 (1984) 195-239; L. Xu and M. I. Jordan, “On convergence properties of the EM algorithm for gaussian mixtures,” Neural computation 8 (1996) 129-151, censored exponential family models, see R. Sundberg, “Maximum likelihood theory for incomplete data from an exponential family,” Scandinavian Journal of Statistics (1974) 49-58, and mixtures of censored models, see D. Chauveau, “A stochastic EM algorithm for mixtures with censored data,” Journal of Statistical Planning and Inference 46 (1995) 1-25. The next subsection describes examples of such incomplete data models. Users may have a good deal of freedom when they specify the EM complete random variables X and latent random variables Z for probabilistic models on the observed data Y. This freedom in model selection may allow users to recast many disparate algorithms as EM algorithms, see R. J. Hathaway, “Another interpretation of the EM algorithm for mixture distributions,” Statistics & Probability Letters 4 (1986) 53-56; J. P. Delmas, “An equivalence of the EM and ICE algorithm for exponential family,” IEEE Transactions on Signal Processing 45 (1997) 2613-2615; M. Á. Carreira-Perpiñán, “Gaussian mean shift is an EM algorithm,” IEEE Trans. on Pattern Analysis and Machine Intelligence 29 (2005) 2007; G. Celeux and G. Govaert, “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis 14 (1992) 315-332. Changes to the E and M steps give another degree of freedom for the EM scheme, see A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),” Journal of the Royal Statistical Society, Series B 39 (1977) 1-38; J. A. Fessler and A. O. Hero, “Space-Alternating Generalized Expectation-Maximization Algorithm,” IEEE Transactions on Signal Processing 42 (1994) 2664-2677; H. M. Hudson and R. S. Larkin, “Accelerated Image Reconstruction using Ordered Subsets of Projection Data,” IEEE Transactions on Medical Imaging 13 (1994) 601-609; X. L. Meng and D. B. Rubin, “Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm,” Journal of the American Statistical Association 86 (1991) 899-909; G. Celeux, S. Chrétien, F. Forbes and A. Mkhadri, “A component-wise EM algorithm for mixtures,” Journal of Computational and Graphical Statistics 10 (2001) 697-712.
A. Incomplete Data Models for EM: Mixture and Censored Gamma Models
Now described is the finite mixture model, an example of an incomplete data model that may be used to compare the EM and the NEM algorithms.
A finite mixture model, see R. A. Redner and H. F. Walker, “Mixture Densities, Maximum Likelihood and the EM algorithm,” SIAM Review 26 (1984) 195-239; G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley series in probability and statistics: Applied probability and statistics (Wiley, 2004), may be a convex combination of a finite set of sub-populations. The sub-population pdfs may come from the same parametric family. Mixture models may be useful for modeling mixed populations for statistical applications such as clustering, pattern recognition, and acceptance testing. The following notation for mixture models is used. Y is the observed mixed random variable. K is the number of sub-populations. Zε{1, . . . , K} is the hidden sub-population index random variable. The convex population mixing proportions α1, . . . , αK form a discrete pdf for Z: P(Z=j)=αj. The pdf f(y|Z=j,θj) is the pdf of the jth sub-population where θ1, . . . , θK are the pdf parameters for each sub-population. Θ is the vector of all model parameters Θ={α1, . . . , αK, θ1, . . . , θK}. The joint pdf f(y,z|Θ) is
The marginal pdf for Y and the conditional pdf for Z given y are
by Bayes' theorem. The joint pdf is rewritten in exponential form for ease of analysis:
EM algorithms for finite mixture models may estimate Θ using the sub-population index Z as the latent variable. An EM algorithm on a finite mixture model may use (5) to derive the Q-function
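The mixture-model Q-function display referenced above is not reproduced in this text. As a concrete reference point for the noiseless baseline, the sketch below implements EM for a 1-D Gaussian mixture model with the stopping rule described earlier. It is a minimal illustrative sketch rather than the disclosed implementation; the function name em_gmm, the initialization, and the tolerance are assumptions made here for illustration.

```python
import numpy as np

def em_gmm(y, K, n_iters=200, tol=1e-6, seed=0):
    """Noiseless EM baseline for a 1-D Gaussian mixture model.

    E-step: responsibilities f(z = j | y_i, theta_k) via Bayes' theorem.
    M-step: closed-form updates of the mixing weights, means, and variances.
    Stops when successive parameter estimates differ by less than tol.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    M = y.size
    alpha = np.full(K, 1.0 / K)                # mixing weights
    mu = rng.choice(y, size=K, replace=False)  # crude initialization of the means
    var = np.full(K, y.var() + 1e-6)           # pooled-variance initialization

    for _ in range(n_iters):
        old = np.concatenate([alpha, mu, var])
        # E-step: r[i, j] proportional to alpha_j * N(y_i; mu_j, var_j).
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = alpha * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the surrogate Q(theta | theta_k) in closed form.
        Nj = r.sum(axis=0)
        alpha = Nj / M
        mu = (r * y[:, None]).sum(axis=0) / Nj
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nj
        if np.linalg.norm(np.concatenate([alpha, mu, var]) - old) < tol:
            break
    return alpha, mu, var
```

A call such as em_gmm(data, K=2) returns the fitted mixture parameters; the NEM modification described below perturbs the data before each E-step.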
Theorem 1 below states a general sufficient condition for a noise benefit in the average convergence time of the EM algorithm.
The possible EM noise benefit may differ from almost all SR noise benefits because it may not involve the use of a signal threshold, see L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, “Stochastic Resonance,” Reviews of Modern Physics 70 (1998) 223-287. The possible EM noise benefit may also differ from most SR noise benefits because the additive noise can depend on the signal. Independent noise can lead to weaker noise benefits than dependent noise in EM algorithms. This may also happen with enhanced convergence in noise-injected Markov chains, see B. Franzke and B. Kosko, “Noise can speed convergence in Markov chains,”, Physical Review E 84 (2011) 041112.
The idea behind the EM noise benefit is that sometimes noise can make the signal data more probable. This occurs at the local level when
f(y+n|θ)>f(y|θ) (14)
for probability density function (pdf) f, realization y of random variable Y, realization n of random noise N, and parameter θ. This condition holds if and only if the logarithm of the pdf ratio is positive:
The logarithmic condition (15) in turn occurs much more generally if it holds only on average with respect to all the pdfs involved in the EM algorithm:
where random variable Z represents missing data in the EM algorithm and where θ* is the limit of the EM estimates θk: θk→θ*. The positivity condition (16) may be precisely the sufficient condition for a noise benefit in Theorem 1 below, called the NEM or Noisy EM Theorem.
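A plausible explicit form of the positivity condition (16), which Theorem 1 below restates as condition (27), is given next. The displayed equation itself did not survive in this text, so the exact form is an assumption based on the NEM references already cited in this disclosure:

$$\mathbb{E}_{Y,Z,N \mid \theta^{*}}\!\left[\ln \frac{f(Y+N,\,Z \mid \theta_k)}{f(Y,\,Z \mid \theta_k)}\right] \;\ge\; 0.$$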
The EM noise benefit may be defined by first defining a modified surrogate log-likelihood function
QN(θ|θk)=EZ|y,θk[ln f(y+N,Z|θ)] (17)
and its maximizer
The modified surrogate log-likelihood QN(θ|θk) equals the regular surrogate log-likelihood Q(θ|θk) when N=0. Q(θ|θ*) is the final surrogate log-likelihood given the optimal EM estimate θ*. So θ* may maximize Q(θ|θ*). Thus
Q(θ*|θ*)≧Q(θ|θ*) for all θ. (18)
An EM noise benefit occurs when the noisy surrogate log-likelihood QN(θk|θ*) is closer to the optimal value Q(θ*|θ*) than the regular surrogate log-likelihood Q(θk|θ*) is. This holds when
QN(θk|θ*)≧Q(θk|θ*) (19)
or
(Q(θ*|θ*)−Q(θk|θ*))≧(Q(θ*|θ*)−QN(θk|θ*)). (20)
So the noisy perturbation QN(θ|θk) of the current surrogate log-likelihood Q(θ|θk) may be a better log-likelihood function for the data than Q is itself. An average noise benefit results when the expectations of both sides of inequality (20) are taken:
EN[Q(θ*|θ*)−Q(θk|θ*)]≧EN[Q(θ*|θ*)−QN(θk|θ*)]. (21)
The average noise benefit (21) occurs when the final EM pdf f(y,z|θ*) is closer in relative entropy to the noisy pdf f(y+N,z|θk) than it is to the noiseless pdf f(y,z|θk). Define the relative-entropy pseudo-distances
ck(N)=D(f(y,z|θ*)∥f(y+N,z|θk)) (22)
ck=ck(0)=D(f(y,z|θ*)∥f(y,z|θk)). (23)
Then noise benefits the EM algorithm when
ck≧ck(N) (24)
holds for the relative-entropy pseudo-distances. The relative entropy itself has the form, see T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley & Sons, New York, 1991), 1st edition,
D(h∥g)=∫∫ h(u,v) ln [h(u,v)/g(u,v)] du dv (25)
for positive pdfs h and g over the same support. Convergent sums can replace the integrals as needed.
The Noisy Expectation Maximization (NEM) Theorem below uses the following notation. The noise random variable N has pdf f(n|y). So the noise N can depend on the data Y. Independence implies that the noise pdf becomes f(n|y)=fN(n). {θk} is a sequence of EM estimates for θ. θ*=limk→∞θk is the converged EM estimate for θ. Assume that the differential entropy of all random variables is finite. Assume also that the additive noise keeps the data in the likelihood function's support. The Appendix below gives proof of the NEM Theorem and its three corollaries.
Theorem 1: Noisy Expectation Maximization (NEM). The EM estimation iteration noise benefit
(Q(θ*|θ*)−Q(θk|θ*))≧(Q(θ*|θ*)−QN(θk|θ*)) (26)
may occur on average if
The NEM theorem also applies to EM algorithms that use the complete data as their latent random variable. The proof for these cases follows from the proof in the appendix. The NEM positivity condition in these models may change to
The theorem also holds for more general methods of noise injection like using noise multiplication y.N instead of noise addition y+N. The NEM condition for generalized noise injection is
where φ (Y,N) is some generalized function for combining data with noise.
The NEM Theorem may imply that each iteration of a suitably noisy EM algorithm moves closer on average towards the EM estimate θ* than does the corresponding noiseless EM algorithm, see O. Osoba, S. Mitaim and B. Kosko, “Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models,” in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2011), pp. 3178-3183. This may hold because the positivity condition (27) implies that EN[ck(N)]≦ck at each step k since ck does not depend on N from (23). The NEM algorithm may use larger overall steps on average than does the noiseless EM algorithm for any number k of steps
The NEM theorem's stepwise possible noise benefit may lead to a noise benefit at any point in the sequence of NEM estimates. This is because the following inequalities may be had when the expected value of inequality (19) takes the form
Q(θk|θ*)≦EN[QN(θk|θ*)] for any k. (29)
Thus
Q(θ*|θ*)−Q(θk|θ*)≧Q(θ*|θ*)−EN[QN(θk|θ*)] for any k. (30)
The EM (NEM) sequence may converge when the left (right) side of inequality (30) equals zero. Inequality (30) implies that the difference on the right side is closer to zero at any step k.
NEM sequence convergence may be even stronger if the noise Nk decays to zero as the iteration count k grows to infinity. This noise annealing implies Nk→0 with probability one. Continuity of Q as a function of Y implies that QNk(θ|θk)→Q(θ|θk) as Nk→0 with probability one.
The evolution of EM algorithms may guarantee that limkQ(θk|θ*)=Q(θ*|θ*). This may give the probability-one limit
So for any ε>0 there may exist a k0 such that for all k>k0:
|Q(θk|θ*)−Q(θ*|θ*)|<ε and |QNk(θk|θ*)−Q(θ*|θ*)|<ε. (33)
Inequalities (29) and (33) may imply that Q(θk|θ*) is ε-close to its upper limit Q(θ*|θ*) and that EN[QNk(θk|θ*)] is also ε-close to this upper limit (34).
So the NEM and EM algorithms may converge to the same fixed-point by (32). And the inequalities (34) may imply that NEM estimates are closer on average to optimal than EM estimates are at any step k.
The first corollary of Theorem 1 gives a dominated-density condition that satisfies the positivity condition (27) in the NEM Theorem. This strong pointwise condition is a direct extension of the pdf inequality in (14) to the case of an included latent random variable Z .
for almost all y, z, and n.
Corollary 1 may be used to derive conditions on the noise N that produce NEM noise benefits for mixture models. NEM mixture models may use two special cases of Corollary 1. These special cases are stated below as Corollaries 2 and 3. The corollaries use the finite mixture model notation in Section 2.1. Recall that the joint pdf of Y and Z is
f(y,z|θ)=Σjαjf(y|j,θ)δ[z−j]. (36)
Define the population-wise noise likelihood difference as
Δfj(y,n)=f(y+n|j,θ)−f(y|j,θ). (37)
Corollary 1 implies that noise benefits the mixture model estimation if the dominated-density condition holds:
f(y+n,z|θ)≧f(y,z|θ). (38)
This may occur if
Δfj(y,n)≧0 for all j. (39)
The Gaussian mixture model (GMM) may use normal pdfs for the sub-population pdfs, see V. Hasselblad, “Estimation of Parameters for a Mixture of Normal Distributions,” Technometrics 8 (1966) 431-444; R. A. Redner and H. F. Walker, “Mixture Densities, Maximum Likelihood and the EM algorithm,” SIAM Review 26 (1984) 195-239. Corollary 2 states a simple quadratic condition that may ensure that the noisy sub-population pdf f(y+n|Z=j,θ) dominates the noiseless sub-population pdf f(y|Z=j,θ) for GMMs. The additive noise samples n may depend on the data samples y.
Corollary 2: Suppose Y|Z=j˜N(μj,σj2) and thus f(y|j,θ) is a normal pdf. Then
Δfj(y,n)≧0 (40)
holds if
n2≦2n(μj−y). (41)
Now apply the quadratic condition (41) to (39). Then f(y+n,z|θ)≧f(y,z|θ) may hold when
n2≦2n(μj−y) for all j. (42)
The inequality (42) gives the GMM-NEM noise benefit condition (misstated in O. Osoba and B. Kosko, “Noise-Enhanced Clustering and Competitive Learning Algorithms,” Neural Networks 37 (2013) 132-140, but corrected in O. Osoba and B. Kosko, “Corrigendum to ‘Noise enhanced clustering and competitive learning algorithms’ [Neural Netw. 37 (2013) 132-140],” Neural Networks (2013)) when the NEM system more quickly estimates the standard deviations σj than does noiseless EM. This can also benefit expectation-conditional-maximization (ECM) methods, see X. L. Meng and D. B. Rubin, “Maximum Likelihood Estimation via the ECM algorithm: A general framework,” Biometrika 80 (1993) 267.
Corollary 3 gives a similar quadratic condition for the Cauchy mixture model.
Corollary 3: Suppose Y|Z=j˜C(mj,dj) and thus f(y|j,θ) is a Cauchy pdf. Then
Δfj(y,n)≧0 (43)
holds if
n2≦2n(mj−y). (44)
Again apply the quadratic condition (44) to (39). Then f(y+n,z|θ)≧f(y,z|θ) may hold when
n2≦2n(mj−y) for all j. (45)
Both quadratic NEM inequality conditions in (42) and (45) may reduce to the following inequality (replace μ with m for the CMM case):
n[n−2(μj−y)]≦0 for all j. (46)
So the noise n may fall in the set where the parabola n2−2n(μj−y) is negative for all j. There are two possible solution sets for (46) depending on the values of μj and y. These solution sets are
N−j(y)=[2(μj−y),0] (47)
N+j(y)=[0,2(μj−y)]. (48)
A goal may be to find the set N(y) of n values that satisfy the inequality in (42) for all j:
N(y)=∩jNj(y) (49)
where Nj(y)=N+j(y) or Nj(y)=N−j(y). N(y)≠{0} may hold only when the sample y lies on one side of all subpopulation means (or location parameters) μj. This may hold for
y<μj for all j or y>μj for all j. (50)
The NEM noise N may take values in ∩jN−j if the data sample y falls to the right of all sub-population means (y>μj for all j). The NEM noise N may take values in ∩jN+j if the data sample y falls to the left of all sub-population means (y<μj for all j). And N=0 may be the only valid value for N when y falls between sub-population means. Thus, the noise N may tend to pull the data sample y away from the tails and towards the cluster of sub-population means (or locations).
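The quadratic condition (42) and the interval intersection (49) translate directly into a few lines of code. The sketch below is illustrative only; the helper name nem_noise_interval and the endpoint bookkeeping are assumptions, not part of the disclosure.

```python
import numpy as np

def nem_noise_interval(y, means):
    """Return (lo, hi) bounds of N(y) = ∩_j N_j(y) for the GMM/CMM NEM condition.

    Each N_j(y) is [2*(mu_j - y), 0] when y > mu_j and [0, 2*(mu_j - y)] when
    y < mu_j, so the intersection collapses to {0} unless the sample y lies on
    one side of every sub-population mean.
    """
    d = 2.0 * (np.asarray(means, dtype=float) - y)  # interval endpoints 2*(mu_j - y)
    if np.all(d < 0):          # y lies to the right of every mean: noise pulls left
        return d.max(), 0.0    # ∩_j [2(mu_j - y), 0] = [max_j 2(mu_j - y), 0]
    if np.all(d > 0):          # y lies to the left of every mean: noise pulls right
        return 0.0, d.min()    # ∩_j [0, 2(mu_j - y)] = [0, min_j 2(mu_j - y)]
    return 0.0, 0.0            # y falls between means: only n = 0 satisfies (42)

# Example: a sample to the right of means 0 and 1 admits only negative noise.
print(nem_noise_interval(3.0, [0.0, 1.0]))  # prints (-4.0, 0.0)
```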
The NEM Theorem and its corollaries give a general method for modifying the noiseless EM algorithm. The NEM Theorem also may imply that, on average, these NEM variants outperform the noiseless EM algorithm.
Algorithm 2 gives the Noisy Expectation-Maximization algorithm schema. The operation NEMNoiseSample(y) generates noise samples that satisfy the NEM condition for the current data model. The noise sampling distribution may depend on the vector of random samples y in the Gaussian and Cauchy mixture models. The noise can have any value in the NEM algorithm for censored gamma models. The E-Step may take a conditional expectation of a function of the noisy data samples y† given the noiseless data samples y.
A deterministic decay factor k^−τ scales the noise on the kth iteration. τ is the noise decay rate. The decay factor k^−τ reduces the noise at each new iteration. This factor drives the noise Nk to zero as the iteration step k increases. The simulations in this presentation use τ=2 for demonstration. Values between τ=1 and τ=3 also work. Nk still needs to satisfy the NEM condition for the data model. The cooling factor k^−τ must not cause the noise samples to violate the NEM condition. This may mean that 0<k^−τ≦1 and that the NEM condition solution set is closed with respect to contractions.
The decay factor may reduce the NEM estimator's jitter around its final value. This may be important because the EM algorithm converges to fixed-points. So excessive estimator jitter may prolong convergence time even when the jitter occurs near the final solution. The simulations in this presentation use polynomial decay factors instead of logarithmic cooling schedules found in annealing applications, see S. Kirkpatrick, C. Gelatt Jr and M. Vecchi, “Optimization by simulated annealing,” Science 220 (1983) 671-680; V. Cerny, “Thermodynamical approach to the Traveling Salesman Problem: An efficient simulation algorithm,” Journal of Optimization Theory and Applications 45 (1985) 41-51; S. Geman and C. Hwang, “Diffusions for global optimization,” SIAM Journal on Control and Optimization 24 (1986) 1031-1043; B. Hajek, “Cooling schedules for optimal annealing,” Mathematics of operations research (1988) 311-329; B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence (Prentice Hall, 1991).
Deterministic and/or chaotic samples can achieve effects similar to random noise in the NEM algorithm. NEM variants that use deterministic or chaotic perturbations instead of random noise may be called Deterministic Interference EM or Chaotic EM respectively.
The next algorithm is an example of the full NEM algorithm for 1-D GMMs using an inverse square cooling rate on the additive noise. The N-step combines both NS and NA steps in the NEM algorithm.
The noise samples ni in the N-step are drawn such that ni[ni − 2(μj − yi)] ≦ 0 for all i and j.
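Putting the schema together, the sketch below is one possible realization of the 1-D GMM-NEM algorithm just described: it draws NEM-compliant noise for each sample, cools it by the factor k^−τ, adds it to the data, and then runs an ordinary EM update on the noisy samples. The function name nem_gmm, the uniform draw inside N(yi), and the initialization are assumptions for illustration rather than the disclosed implementation.

```python
import numpy as np

def nem_gmm(y, K, n_iters=200, tau=2.0, tol=1e-6, seed=0):
    """Illustrative 1-D GMM-NEM sketch: EM on noise-perturbed data y + n,
    where each n_i satisfies n_i*(n_i - 2*(mu_j - y_i)) <= 0 for every j and
    is scaled by the cooling factor k**(-tau) so the noise decays to zero."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    M = y.size
    alpha = np.full(K, 1.0 / K)
    mu = rng.choice(y, size=K, replace=False)
    var = np.full(K, y.var() + 1e-6)

    for k in range(1, n_iters + 1):
        old = np.concatenate([alpha, mu, var])
        # N-step: draw noise inside N(y_i) = ∩_j N_j(y_i), then cool it.
        d = 2.0 * (mu[None, :] - y[:, None])           # 2*(mu_j - y_i), shape (M, K)
        lo = np.where(np.all(d < 0, axis=1), d.max(axis=1), 0.0)
        hi = np.where(np.all(d > 0, axis=1), d.min(axis=1), 0.0)
        n = rng.uniform(lo, hi) * k ** (-tau)          # scaling keeps n inside N(y_i)
        y_noisy = y + n
        # E-step on the noisy data.
        dens = (np.exp(-0.5 * (y_noisy[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        r = alpha * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form GMM updates.
        Nj = r.sum(axis=0)
        alpha = Nj / M
        mu = (r * y_noisy[:, None]).sum(axis=0) / Nj
        var = (r * (y_noisy[:, None] - mu) ** 2).sum(axis=0) / Nj
        if np.linalg.norm(np.concatenate([alpha, mu, var]) - old) < tol:
            break
    return alpha, mu, var
```

Setting the noise to zero in the N-step recovers the noiseless EM baseline sketched earlier.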
The NEM algorithm may inherit variants from the classical EM algorithm schema. A NEM adaptation to the Generalized Expectation Maximization (GEM) algorithm may be one of the simpler variations. The GEM algorithm replaces the EM maximization step with a gradient ascent step. The Noisy Generalized Expectation Maximization (NGEM) algorithm (Algorithm 3) may use the same M-step:
The NEM algorithm schema may also allow for some variations outside the scope of the EM algorithm. These involve modifications to the noise sampling step NS-Step or to the noise addition step NA-Step. One such modification may not require an additive noise term ni for each yi. This may be useful when the NEM condition is stringent because then noise sampling can be time intensive. This variant changes the NS-Step by picking a random or deterministic sub-selection of y to modify. Then, it samples the noise subject to the NEM condition for those sub-selected samples. This is the Partial Noise Addition NEM (PNA-NEM).
(PNA-NEM pseudocode excerpt: the index set is initialized to {1 . . . M}, and SubSelection( ) then picks the subset of sample indices that receive noise.)
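A minimal sketch of the partial-noise-addition idea follows. The disclosure leaves the sub-selection rule open, so the random fraction used here, the function name pna_noise_step, and the rejection-style NEM check are assumptions for illustration.

```python
import numpy as np

def pna_noise_step(y, mu, sigma_n=0.5, frac=0.25, rng=None):
    """Partial Noise Addition sketch: perturb only a random subset of samples,
    keeping a candidate noise draw only if it meets the GMM-NEM condition
    n*(n - 2*(mu_j - y_i)) <= 0 for every component j (otherwise use zero)."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    n = np.zeros_like(y)
    subset = rng.choice(y.size, size=max(1, int(frac * y.size)), replace=False)
    for i in subset:
        cand = rng.normal(0.0, sigma_n)
        if np.all(cand * (cand - 2.0 * (mu - y[i])) <= 0.0):
            n[i] = cand
    return y + n
```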
The NEM noise generating procedure NEMNoiseSample(y) may return a NEM-compliant noise sample n at a given noise level σN for each data sample y. This procedure may change with the EM data model. The noise generating procedure for the GMMs and CMMs comes from Corollaries 2 and 3. The following 1-D noise generating procedure may be used for the GMM simulations:
where TN(0,σN|N(y)) is the normal distribution N(0,σN) truncated to the support set N(y). The set N(y) is the interval intersection from (49). Multi-dimensional versions of the generator can apply the procedure component-wise.
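One possible NEMNoiseSample procedure for the GMM case is sketched below, using SciPy's truncated-normal sampler for TN(0,σN|N(y)). The helper name and the inline recomputation of N(y) are assumptions so the snippet stands alone.

```python
import numpy as np
from scipy.stats import truncnorm

def nem_noise_sample(y_i, means, sigma_n):
    """Draw one noise value from N(0, sigma_n^2) truncated to N(y_i) = ∩_j N_j(y_i)."""
    d = 2.0 * (np.asarray(means, dtype=float) - y_i)
    if np.all(d < 0):
        lo, hi = d.max(), 0.0        # y_i lies to the right of all means
    elif np.all(d > 0):
        lo, hi = 0.0, d.min()        # y_i lies to the left of all means
    else:
        return 0.0                   # N(y_i) = {0}: only zero noise is admissible
    a, b = lo / sigma_n, hi / sigma_n    # truncnorm expects standardized bounds
    return truncnorm.rvs(a, b, loc=0.0, scale=sigma_n)
```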
The noise-benefit effect may depend on the size of the GMM data set. Analysis of this effect may depend on the probabilistic event that the noise satisfies the GMM-NEM condition for the entire sample set. This analysis also applies to the Cauchy mixture model because its NEM condition is the same as the GMM's. Define Ak as the event that the noise N satisfies the GMM-NEM condition for the kth data sample:
Ak={N2≦2N(μj−yk)|∀j}. (52)
Then define the event AM that noise random variable N satisfies the GMM-NEM condition for each data sample as
This construction may be useful for analyzing NEM when independent and identically distributed (i.i.d.) noise is used for all yk while still enforcing the NEM condition.
The next theorem shows that the set AM shrinks to the singleton set {0} as the number M of samples in the data set grows. So the probability of satisfying the NEM condition with i.i.d. noise samples goes to zero as M→∞ with probability one.
Theorem 2: Assume that the noise random variables are i.i.d. Then the set of noise values
AM={N2≦2N(μj−yk)|∀j and ∀k} (55)
that satisfy the Gaussian NEM condition for all data samples yk decreases with probability one to the set {0} as M→∞:
The proof shows that larger sample sizes M may place tighter bounds on the size of AM with probability one. The bounds shrink AM all the way down to the singleton set {0} as M→∞. AM is the set of values that identically distributed noise N can take to satisfy the NEM condition for all yk. AM={0} means that Nk must be zero for all k because the Nk are identically distributed. This corresponds to cases where the NEM Theorem cannot guarantee improvement over the regular EM using just i.i.d. noise. So identically distributed noise may have limited use in the GMM- and CMM-NEM framework.
Theorem 2 is a “probability-one” result. But it also implies the following convergence-in-probability result. Suppose Ñ is an arbitrary continuous random variable. Then the probability P(ÑεAM) that Ñ satisfies the NEM condition for all samples may fall to P(Ñε{0})=0 as M→∞.
Using non-identically distributed noise Nk may avoid the reduction in the probability of satisfying the NEM-condition for large M. The NEM condition may still hold when NkεAk for each k even if Nk∉AM=∩kAk. This noise sampling model may adapt the kth noise random variable Nk to the kth data sample yk . This is the general NEM noise model.
The i.i.d. noise model in Theorem 2 has an important corollary effect for sparse data sets. The size of AM decreases monotonically with M because AM=∩kMAk. Then for M0<M1:
P(NεAM1)≦P(NεAM0)
since M0<M1 implies that AM1⊆AM0.
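A quick Monte Carlo sketch of this sample-size effect follows. The mixture parameters, noise level, and trial count are illustrative assumptions; the estimate simply counts how often a single i.i.d. noise draw satisfies the GMM-NEM condition for every sample as M grows.

```python
import numpy as np

def prob_iid_noise_satisfies_nem(M, means=(0.0, 4.0), sigma=1.0,
                                 sigma_n=0.5, trials=5000, seed=0):
    """Estimate P(N ∈ A_M): one draw N ~ N(0, sigma_n^2) must satisfy
    N*(N - 2*(mu_j - y_k)) <= 0 for every component j and every sample y_k."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    hits = 0
    for _ in range(trials):
        z = rng.integers(means.size, size=M)      # latent sub-population labels
        y = rng.normal(means[z], sigma)           # M samples from the GMM
        n = rng.normal(0.0, sigma_n)              # one i.i.d. noise value for all y_k
        if np.all(n * (n - 2.0 * (means[None, :] - y[:, None])) <= 0.0):
            hits += 1
    return hits / trials

for M in (1, 5, 20, 100):
    print(M, prob_iid_noise_satisfies_nem(M))     # the estimate shrinks toward 0 as M grows
```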
The input module 403 may have a configuration that receives numerical data about a model or state of the system. The input module 403 may consist of or include a network interface card, a data storage system interface, any other type of device that receives data, and/or any combination of these.
The noise module 405 may have a configuration that generates random, chaotic, or other type of numerical perturbations of the received numerical data and/or that generates pseudo-random noise.
The noise module 405 may have a configuration that generates random, chaotic, or other type of numerical perturbations of the input numerical data that fully or partially satisfy a noisy expectation maximization (NEM) condition.
The noise module 405 may have a configuration that generates numerical perturbations that do not depend on the received numerical data.
The estimation module 407 may have a configuration that iteratively estimates the unknown parameter of the model or state of the system based on the received numerical data and then uses the numerical perturbations in the input numerical data and/or the pseudo-random noise and the input numerical data during at least one of the iterative estimates of the unknown parameter.
The estimation module 407 may have a configuration that estimates the unknown parameter of the model or state of the system using maximum likelihood, expectation-maximization, minorization-maximization, or another statistical optimization or sub-optimization method.
The estimation module 407 may have a configuration that estimates the unknown parameter of the model or state of the system by adding, multiplying, or otherwise combining the input data with the numerical perturbations.
The estimation module 407 may have a configuration that estimates the unknown parameter of the model or state of the system using the numerical perturbations that do not depend on the received numerical data.
The estimation module 407 may have a configuration that causes the magnitude of the generated numerical perturbations to eventually decay during successive parameter estimates.
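One way the modules described above might be wired together in software is sketched below. The class and function names, the callable interfaces, and the threshold default are invented here for illustration and are not part of the disclosure.

```python
import numpy as np

class SignalingModule:
    """Signals when successive parameter estimates differ by less than a threshold."""
    def __init__(self, threshold=1e-6):
        self.threshold, self.previous = threshold, None

    def check(self, estimate):
        estimate = np.asarray(estimate, dtype=float)
        done = (self.previous is not None and
                np.linalg.norm(estimate - self.previous) < self.threshold)
        self.previous = estimate
        return done

def run_estimator(data, estimate, update_step, noise_step, max_iters=500):
    """Wire the modules together: the noise module perturbs the received data,
    the estimation module updates the parameter estimate, and the signaling
    module stops the loop when successive estimates are close enough."""
    signal = SignalingModule()
    for k in range(1, max_iters + 1):
        noisy_data = noise_step(data, estimate, k)    # e.g., NEM noise that decays with k
        estimate = update_step(noisy_data, estimate)  # e.g., one EM (E-step + M-step) update
        if signal.check(estimate):
            break
    return estimate
```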
Other documents that disclose details about the technology that has been described herein include:
Careful noise injection can speed up the average convergence time of the EM algorithm. The various sufficient conditions for such a noise benefit may involve a direct or average effect where the noise makes the signal data more probable. Special cases may include mixture density models and log-convex probability density models. Noise injection for the Gaussian and Cauchy mixture models may improve the average EM convergence speed when the noise satisfies a simple quadratic condition. Even blind noise injection can sometimes benefit these systems when the data set is sparse. But NEM noise injection still outperforms blind noise injection in all data models tested.
An EM estimation iteration noise benefit
(Q(θ*|θ*)−Q(θk|θ*))≧(Q(θ*|θ*)−QN(θk|θ*)) (67)
occurs on average if
Proof: Each expectation of Q-function differences in (21) is a distance pseudo-metric. Rewrite Q as an integral:
∫Zln[f(y,z|θ)]f(z|y,θk)dz. (69)
ck=D(f(y,z|θ*)∥f(y,z|θk)) is the expectation over Y because
ck(N) is likewise the expectation over Y:
Take the noise expectation of ck and ck(N):
EN[ck]=ck (76)
EN[ck(N)]=EN[ck(N)]. (77)
So the distance inequality
ck≧EN[ck(N)] (78)
guarantees that noise benefits occur on average:
EN,Y|θ*[Q(θ*|θ*)−Q(θk|θ*)]≧EN,Y|θ*[Q(θ*|θ*)−QN(θk|θ*)]. (79)
The inequality condition (78) may be used to derive a more useful sufficient condition for a noise benefit. Expand the difference of relative entropy terms ck−ck(N):
Take the expectation with respect to the noise term N:
The assumption of finite differential entropy for Y and Z may ensure that ln f(y,z|θ)f(y,z|θ*) is integrable. Thus the integrand may be integrable. So Fubini's theorem, see G. B. Folland, Real Analysis: Modern Techniques and Their Applications (Wiley-Interscience, 1999), 2nd edition, permits the change in the order of integration in (87):
Then an EM noise benefit may occur on average if
for almost all y, z, and n.
Proof: The following inequalities need hold only for almost all y, z, and n:
Thus
Corollary 2: Suppose Y|Z=j˜N(μj,σj2) and thus f(y|j,θ) is a normal pdf. Then
Δfj(y,n)≧0 (96)
holds if
n2≦2n(μj−y) (97)
Proof: The proof compares the noisy and noiseless normal pdfs. The normal pdf is
So f(y+n|θ)≧f(y|θ)
Inequality (101) may hold because σj is strictly positive. Expand the left-hand side to get (97):
(y−μj)2+n2+2n(y−μj)≦(y−μj)2 (102)
iff n2+2n(y−μj)≦0 (103)
iff n2≦−2n(y−μj) (104)
iff n2≦2n(μj−y) (105)
Corollary 3: Suppose Y|Z=j˜C(mj,dj) and thus f(y|j,θ) is a Cauchy pdf. Then
Δfj(y,n)≧0 (106)
holds if
n2≦2n(mj−y). (107)
Proof: The proof compares the noisy and noiseless Cauchy pdfs. The Cauchy pdf is
Then f(y+n|θ)≧f(y|θ)
Proceed as in the last part of the Gaussian case:
The estimating computer system 401 that has been described herein, including each of its modules (except for the input module 403), is implemented with a computer system configured to perform the functions that have been described herein for the component. The computer system includes one or more processors, tangible memories (e.g., random access memories (RAMs), read-only memories (ROMs), and/or programmable read only memories (PROMS)), tangible storage devices (e.g., hard disk drives, CD/DVD drives, and/or flash memories), system buses, video processing components, network communication components, input/output ports, and/or user interface devices (e.g., keyboards, pointing devices, displays, microphones, sound reproduction systems, and/or touch screens). Each module may have its own computer system or some or all of the modules may share a single computer system.
Each computer system may be a desktop computer or a portable computer, or part of a larger system, such as a system that runs clustering algorithms on Big Data; trains hidden Markov models for speech, natural language, and other kinds of sequential data (including DNA); trains neural networks for speech and computer vision; identifies sequences for genomics and proteomics; reconstructs medical images in positron emission tomography; segments images for medical imaging and robotics; or estimates risks for portfolio management.
Each computer system may include one or more computers at the same or different locations. When at different locations, the computers may be configured to communicate with one another through a wired and/or wireless network communication system.
Each computer system may include software (e.g., one or more operating systems, device drivers, application programs, and/or communication programs). When software is included, the software includes programming instructions and may include associated data and libraries. When included, the programming instructions are configured to implement one or more algorithms that implement one or more of the functions of the computer system, as recited herein. The description of each function that is performed by each computer system also constitutes a description of the algorithm(s) that performs that function.
The software may be stored on or in one or more non-transitory, tangible storage devices, such as one or more hard disk drives, CDs, DVDs, and/or flash memories. The software may be in source code and/or object code format. Associated data may be stored in any type of volatile and/or non-volatile memory. The software may be loaded into a non-transitory memory and executed by one or more processors.
The components, steps, features, objects, benefits, and advantages that have been discussed are merely illustrative. None of them, nor the discussions relating to them, are intended to limit the scope of protection in any way. Numerous other embodiments are also contemplated. These include embodiments that have fewer, additional, and/or different components, steps, features, objects, benefits, and advantages. These also include embodiments in which the components and/or steps are arranged and/or ordered differently.
For example, these include the use of Bayesian priors and penalized likelihood functions in Maximum A Posteriori and Penalized EM algorithms, other variants of the EM algorithm, and the more general class of minorization-maximization algorithms.
Unless otherwise stated, all measurements, values, ratings, positions, magnitudes, sizes, and other specifications that are set forth in this specification, including in the claims that follow, are approximate, not exact. They are intended to have a reasonable range that is consistent with the functions to which they relate and with what is customary in the art to which they pertain.
All articles, patents, patent applications, and other publications that have been cited in this disclosure are incorporated herein by reference.
The phrase “means for” when used in a claim is intended to and should be interpreted to embrace the corresponding structures and materials that have been described and their equivalents. Similarly, the phrase “step for” when used in a claim is intended to and should be interpreted to embrace the corresponding acts that have been described and their equivalents. The absence of these phrases from a claim means that the claim is not intended to and should not be interpreted to be limited to these corresponding structures, materials, or acts, or to their equivalents.
The scope of protection is limited solely by the claims that now follow. That scope is intended and should be interpreted to be as broad as is consistent with the ordinary meaning of the language that is used in the claims when interpreted in light of this specification and the prosecution history that follows, except where specific meanings have been set forth, and to encompass all structural and functional equivalents.
Relational terms such as “first” and “second” and the like may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between them. The terms “comprises,” “comprising,” and any other variation thereof when used in connection with a list of elements in the specification or claims are intended to indicate that the list is not exclusive and that other elements may be included. Similarly, an element preceded by an “a” or an “an” does not, without further constraints, preclude the existence of additional elements of the identical type.
None of the claims are intended to embrace subject matter that fails to satisfy the requirement of Sections 101, 102, or 103 of the Patent Act, nor should they be interpreted in such a way. Any unintended coverage of such subject matter is hereby disclaimed. Except as just stated in this paragraph, nothing that has been stated or illustrated is intended or should be interpreted to cause a dedication of any component, step, feature, object, benefit, advantage, or equivalent to the public, regardless of whether it is or is not recited in the claims.
The abstract is provided to help the reader quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, various features in the foregoing detailed description are grouped together in various embodiments to streamline the disclosure. This method of disclosure should not be interpreted as requiring claimed embodiments to require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description, with each claim standing on its own as separately claimed subject matter.
This application is based upon and claims priority to U.S. provisional patent application 61/674,615, entitled “NOISE-ENHANCED EXPECTATION-MAXIMIZATION ALGORITHM,” filed Jul. 23, 2012, attorney docket number 028080-0769. The entire content of this application is incorporated herein by reference.