ITERATIVE ESTIMATION OF SYSTEM PARAMETERS USING NOISE-LIKE PERTURBATIONS

BACKGROUND

1. Technical Field

This disclosure relates to iterative estimates of an unknown parameter of a model or state of a system.

2. Description of Related Art

The expectation-maximization (EM) algorithm is an iterative statistical algorithm that estimates maximum-likelihood parameters from incomplete or corrupted data. See A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),”Journal of the Royal Statistical Society, Series B 39 (1977) 1-38; G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007); M. R. Gupta and Y. Chen, “Theory and Use of the EM Algorithm,” Foundations and Trends in Signal Processing 4 (2010) 223-296. This algorithm has a wide array of applications that include data clustering, see G. Celeux and G. Govaert, “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis 14 (1992) 315-332; C. Ambroise, M. Dang and G. Govaert, “Clustering of spatial data by the em algorithm,” Quantitative Geology and Geostatistics 9 (1997) 493-504, automated speech recognition, see L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE 77 (1989) 257-286; B. H. Juang and L. R. Rabiner, “Hidden Markov models for speech recognition,”Technometrics 33 (1991) 251-272, medical imaging, see L. A. Shepp and Y. Vardi, “Maximum likelihood reconstruction for emission tomography,” IEEE Transactions on Medical Imaging 1 (1982) 113-122; Y. Zhang, M. Brady and S. Smith, “Segmentation of Brain MR Images through a Hidden Markov Random Field Model and the Expectation-Maximization Algorithm,” IEEE Transactions on Medical Imaging 20 (2001) 45-57, genome-sequencing, see C. E. Lawrence and A. A. Reilly, “An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences,” Proteins: Structure, Function, and Bioinformatics 7 (1990) 41-51; T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine learning 21 (1995) 51-80 , radar denoising, see J. Wang, A. Dogandzic and A. Nehorai, “Maximum likelihood estimation of compound-gaussian clutter and target parameters,” IEEE Transactions on Signal Processing 54 (2006) 3884-3898, and infectious-disease tracking, see M. Reilly and E. Lawlor, “A likelihood-based method of identifying contaminated lots of blood product,” International Journal of Epidemiology 28 (1999) 787-792; P. Bacchetti, “Estimating the incubation period of AIDS by comparing population infection and diagnosis patterns,” Journal of the American Statistical Association 85 (1990) 1002-1008. A prominent mathematical modeler has even said that the EM algorithm is “as close as data analysis algorithms come to a free lunch” , see N. A. Gershenfeld, The Nature of Mathematical Modeling (Cambridge University Press, 1999). But the EM algorithm can converge slowly for high-dimensional parameter spaces or when the algorithm needs to estimate large amounts of missing information, see G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007); M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics (Springer, 1996).

SUMMARY

An estimating system may iteratively estimate an unknown parameter of a model or state of a system. An input module may receive numerical data about the system. A noise module may generate random, chaotic, or other type of numerical perturbations of the received numerical data and/or may generate pseudo-random noise. An estimation module may iterativel estimate the unknown parameter of the model or state of the system based on the received numerical data. The estimation module may use the numerical perturbations and/or the pseudo-random noise and the input numerical data during at least one of the iterative estimates of the unknown parameter. A signaling module may determine whether successive parameter estimates or information derived from successive parameter estimates differ by less than a predetermined signaling threshold and, if so, signal when this occurs.

The estimation module may estimate the unknown parameter of the model or state of the system using maximum likelihood, expectation-maximization, minorization-maximization, or another statistical optimization or sub-optimization method.

The noise module may generate random, chaotic, or other type of numerical perturbations of the input numerical data that fully or partially satisfy a noisy expectation maximization (NEM) condition. The estimation module may estimate the unknown parameter of the model or state of the system by adding, multiplying, or otherwise combining the received numerical data with these numerical perturbations.

The estimation module may cause the magnitude of the generated numerical perturbations to eventually decay during successive parameter estimates.

The noise module may generate numerical perturbations that do not depend on the received numerical data. The estimation module may estimate the unknown parameter of the model or state of the system using the numerical perturbations that do not depend on the received numerical data.

The system may be a model that is a probabilistically weighted mixture of probability curves, including a scalar or vector Gaussian and Cauchy curves. The noise module may cause the generated numerical perturbations and/or pseudo-random noise to fully or partially satisfy a mixture-based NEM condition, including a component-wise quadratic NEM condition.

Non-transitory, tangible, computer-readable storage media may contain a program of instructions that may cause a computer system running the program of instruction to function as any of the estimating computer systems that are described herein or any of their components.

These, as well as other components, steps, features, objects, benefits, and advantages, will now become clear from a review of the following detailed description of illustrative embodiments, the accompanying drawings, and the claims.

BRIEF DESCRIPTION OF DRAWINGS

The drawings are of illustrative embodiments. They do not illustrate all embodiments. Other embodiments may be used in addition or instead. Details that may be apparent or unnecessary may be omitted to save space or for more effective illustration. Some embodiments may be practiced with additional components or steps and/or without all of the components or steps that are illustrated. When the same numeral appears in different drawings, it refers to the same or like components or steps.

FIG. 1 illustrates an example of a possible EM noise benefit for a Gaussian mixture model.

FIG. 2 illustrates an example of a possible EM noise benefit for a log-convex censored gamma model.

FIG. 3 illustrates an example of comparing the possible effects of noise injection with and without the NEM sufficient condition.

FIG. 4 illustrates an example of a computer estimation system for iteratively estimating an unknown parameter of a model or state of a system.

FIG. 5 illustrates an example of computer-readable storage media that may contain a program of instructions that causes a computer system running the program of instructions to function as any of the types of estimating computer system described herein.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

Illustrative embodiments are now described. Other embodiments may be used in addition or instead. Details that may be apparent or unnecessary may be omitted to save space or for a more effective presentation. Some embodiments may be practiced with additional components or steps and/or without all of the components or steps that are described.

Overview

A noise-injected version of the Expectation-Maximization (EM) algorithm is presented: the Noisy Expectation Maximization (NEM) algorithm. The NEM algorithm may use noise to speed up the convergence of the EM algorithm. The NEM theorem shows that additive noise can speed up the average convergence of the EM algorithm to a local maximum of the likelihood surface if a positivity condition holds. Corollary results give special cases when noise improves the EM algorithm such as in the case of the Gaussian mixture model (GMM) and the Cauchy mixture model (CMM). The NEM positivity condition may simplify to a quadratic inequality in the GMM and CMM cases. A final theorem shows that the noise benefit for independent identically distributed additive noise may decrease with sample size in mixture models. This theorem implies that the noise benefit may be most pronounced if the data is sparse.

I. Introduction

Careful noise injection can increase the average convergence speed of the EM algorithm. It may also derive a general sufficient condition for this EM noise benefit. Simulations show this EM noise benefit include the ubiquitous Gaussian mixture model (FIG. 1), the Cauchy mixture model, and the censored gamma model (FIG. 2). The simulations in FIG. 3 also show that the noise benefit may be faint or absent if the system simply injects blind noise that ignores the sufficient condition. This suggests that the noise benefit sufficient condition may also be necessary for some data models. The discussion concludes with results that show that the noise benefit tends to occur most sharply in sparse data sets.

The EM noise benefit may be an example of stochastic resonance in statistical signal processing. Stochastic resonance may occur when noise improves a signal system's performance, see A. R. Bulsara and L. Gammaitoni, “Tuning in to Noise,” Physics Today (1996) 39-45; L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, “Stochastic Resonance,” Reviews of Modern Physics 70 (1998) 223-287; B. Kosko, Noise (Viking, 2006): small amounts of noise may improve the performance while too much noise may degrade it. Much early work on noise benefits involved natural systems in physics, see J. J. Brey and A. Prados, “Stochastic Resonance in a One-Dimension Ising Model,” Physics Letters A 216 (1996) 240-246, chemistry, see H. A. Kramers, “Brownian Motion in a Field of Force and the Diffusion Model of Chemical Reactions,” Physica VII (1940) 284-304; A. Förster, M. Merget and F. W. Schneider, “Stochastic Resonance in Chemistry. 2. The Peroxidase-Oxidase Reaction,” Journal of Physical Chemistry 100 (1996) 4442-4447, and biology, see F. Moss, A. Bulsara and M. Shlesinger, eds., Journal of Statistical Physics, Special Issue on Stochastic Resonance in Physics and Biology (Proceedings of the NATO Advanced Research Workshop), volume 70, no. 1/2 (Plenum Press, 1993); P. Cordo, J. T. Inglis, S. Vershueren, J. J. Collins, D. M. Merfeld, S. Rosenblum, S. Buckley and F. Moss, “Noise in Human Muscle Spindles,”, Nature 383 (1996) 769-770; R. K. Adair, R. D. Astumian and J. C. Weaver, “Detection of Weak Electric Fields by Sharks, Rays and Skates,” Chaos: Focus Issue on the Constructive Role of Noise in Fluctuation Driven Transport and Stochastic Resonance 8 (1998) 576-587; P. Hänggi, “Stochastic resonance in biology,” ChemPhysChem 3 (2002) 285-290. This work inspired the search for noise benefits in nonlinear signal processing and statistical estimation. See A. R. Bulsara and A. Zador, “Threshold Detection of Wideband Signals: A Noise-Induced Maximum in the Mutual Information,” Physical Review E 54 (1996) R2185R2188; F. Chapeau-Blondeau and D. Rousseau, “Noise-Enhanced Performance for an Optimal Bayesian Estimator,” IEEE Transactions on Signal Processing 52 (2004) 1327-1334; M. McDonnell, N. Stocks, C. Pearce and D. Abbott, Stochastic resonance: from suprathreshold stochastic resonance to stochastic signal quantization (Cambridge University Press, 2008); H. Chen, P. Varshney, S. Kay and J. Michels, “Noise Enhanced Nonparametric Detection,” IEEE Transactions on Information Theory 55 (2009) 499-506; A. Patel and B. Kosko, “Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding,” IEEE Transactions on Signal Processing 59 (2011) 488-505; B. Franzke and B. Kosko, “Noise can speed convergence in Markov chains,” Physical Review E 84 (2011) 041112. The EM noise benefit may not involve a signal threshold unlike almost all SR noise benefits, see L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, “Stochastic Resonance,” Reviews of Modern Physics 70 (1998) 223-287.

The next sections develop theorems and algorithms for Noisy Expectation-Maximization (NEM). Section 2 summarizes the key facts of the Expectation-Maximization algorithm. Section 3 introduces the theorem and corollaries that underpin the NEM algorithm. Section 4 presents the NEM algorithm and some of its variants. Section 5 presents a theorem that describes how sample size may affect the NEM algorithm for mixture models when the noise is independent and identically distributed (i.i.d.).

II. The EM Algorithm

The EM algorithm is an iterative maximum-likelihood estimation (MLE) method for estimating pdf parameters from incomplete observed data. See A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),”Journal of the Royal Statistical Society, Series B 39 (1977) 1-38; G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007); M. R. Gupta and Y. Chen, “Theory and Use of the EM Algorithm,” Foundations and Trends in Signal Processing 4 (2010) 223-296, EM may compensate for missing information by taking expectations over all missing information conditioned on the observed incomplete information and on current parameter estimates. A goal of the EM algorithm is to find the maximum-likelihood estimate {circumflex over (θ)} for the pdf parameter θ when the data Y has a known parametric pdf f(y|θ). The maximumlikelihood estimate {umlaut over (θ)} is

$\begin{matrix} \hat{θ} = \begin{matrix} \arg \max \\ θ \end{matrix}  (θ | y) & (1) \end{matrix}$

where l(θ|y)=ln f(y|θ) is the log-likelihood (the log of the likelihood function).

The EM scheme may apply when an incomplete data random variable Y=r(X) is observered instead of the complete data random variable X. The function r: X→Y may model data corruption or information loss. X=(Y,Z) can denote the complete data X, where Z is a latent or missing random variable. Z may represent any statistical information lost during the observation mapping r(X). This corruption may make the observed data log-likelihood l(θ|y) complicated and difficult to optimize directly in (1).

The EM algorithm may address this difficulty by using the simpler complete log-likelihood l(θ|y,z) to derive a surrogate function Q(θ|θ_k) for l(θ|y). Q(θ|θ_k) is the average of l(θ|y, z) over all possible values of the latent variable Z, given the observation Y=y and the current parameter estimate θ_k:

$\begin{matrix} \begin{matrix} Q (θ | θ_{k}) = _{Z} [ (θ | y, Z) | Y = y, θ_{k}] \\ = \int_{}^{}  (θ | y, z) f (z | y, θ_{k}) \partial z . \end{matrix} & (2) \end{matrix}$

A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion),”Journal of the Royal Statistical Society, Series B 39 (1977) 1-38, first showed that any θ that increases Q(θ|θ_k) cannot reduce the log-likelihood difference l(θ|y)−l(θ_k|y). This “ascent property” led to an iterative method that performs gradient ascent on the log-likelihood l(θ|y). This result underpin the EM algorithm and its many variants, see G. Celeux and J. Diebolt, “The SEM algorithm: A Probabilistic Teacher Algorithm Derived from the EM Algorithm for the Mixture Problem,” Computational Statistics Quarterly 2 (1985) 73-82; G. Celeux and G. Govaert, “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis 14 (1992) 315-332; X. L. Meng and D. B. Rubin, “Maximum Likelihood Estimation via the ECM algorithm: A general framework,” Biometrika 80 (1993) 267; C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence” Biometrika 81 (1994) 633; J. A. Fessler and A. O. Hero, “Space-Alternating Generalized Expectation-Maximization Algorithm,” IEEE Transactions on Signal Processing 42 (1994) 2664-2677; H. M. Hudson and R. S. Larkin, “Accelerated Image Reconstruction using Ordered Subsets of Projection Data,” IEEE Transactions on Medical Imaging 13 (1994) 601-609.

The following notation for expectations to avoid cumbersome equations are used:

$\begin{matrix} _{S | t, φ} [g (S, t, θ)] \equiv _{S} [g (S, T, θ) | T = t, φ] \\ = \int g (s, t, θ) f_{S | T} (s | t, φ) \partial s \end{matrix}$

where S and T are random variables, φ and θ are deterministic parameters, and g is integrable with respect to the conditional pdf f_S|T.

A standard EM algorithm may perform the following two steps iteratively on a vector y=(y₁, . . . , y_M) of observed random samples of Y:

Algorithm 1 {acute over (θ)}_EM= EM-Estimate(y)

1: E-Step: Q (θ|θ_k) ← custom-character

_Z|y,θk[ln f(y,Z|θ)]

2: M-Step: θ_k+1 ← argmax_θ{Q (θ|θ_k)}

The algorithm may stop when successive estimates differ by less than a given tolerance ∥θ_k−θ_k-1∥<10^−tolor when ∥l(θ_k|y)−l(θ_k-1|y)∥<ε. The EM algorithm may converge (θ_k→θ_*) to a local maximum θ_*, see C. F. J. Wu, “On the Convergence Properties of the EM Algorithm,” The Annals of Statistics 11 (1983) 95-103; R. A. Boyles, “On the convergence of the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological) 45 (1983) 47-50: θ_k→θ_*.

The EM algorithm may be a family of MLE methods for working with incomplete data models. Such incomplete data models may include mixture models, see R. A. Redner and H. F. Walker, “Mixture Densities, Maximum Likelihood and the EM algorithm,” SIAM Review 26 (1984) 195-239; L. Xu and M. I. Jordan, “On convergence properties of the EM algorithm for gaussian mixtures,” Neural computation 8 (1996) 129-151, censored exponential family models, see R. Sundberg, “Maximum likelihood theory for incomplete data from an exponential family,” Scandinavian Journal of Statistics (1974) 49-58, and mixtures of censored models, see D. Chauveau, “A stochastic EM algorithm for mixtures with censored data,” Journal of Statistical Planning and Inference 46 (1995) 1-25. The next subsection describes examples of such incomplete data models. Users may have a good deal of freedom when they specify the EM complete random variables X and latent random variables Z for probabilistic models on the observed data Y. This freedom in model selection may allow users to recast many disparate algorithms as EM algorithms, see R. J. Hathaway, “Another interpretation of the EM algorithm for mixture distributions,” Statistics & Probability Letters 4 (1986) 53-56; J. P. Delmas, “An equivalence of the EM and ICE algorithm for exponential family,” IEEE Transactions on Signal Processing 45 (1997) 2613-2615; M. Á. Carreira-Perpiñán, “Gaussian mean shift is an EM algorithm,” IEEE Trans. on Pattern Analysis and Machine Intelligence 29 (2005) 2007; G. Celeux and G. Govaert, “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis 14 (1992) 315-332. Changes to the E and M steps give another degree of freedom for the EM scheme, see A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion,” Journal of the Royal Statistical Society, Series B 39 (1977) 1-38; J. A. Fessler and A. O. Hero, “Space-Alternating Generalized Expectation-Maximization Algorithm,” IEEE Transactions on Signal Processing 42 (1994) 2664-2677; H. M. Hudson and R. S. Larkin, “Accelerated Image Reconstruction using Ordered Subsets of Projection Data,” IEEE Transactions on Medical Imaging 13 (1994) 601-609; X. L. Meng and D. B. Rubin, “Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm,”, Journal of the American Statistical Association 86 (1991) 899-909; G. Celeux, S. Chrétien, F. Forbes and A. Mkhadri, “A component-wise EM algorithm for mixtures,” Journal of Computational and Graphical Statistics 10 (2001) 697-712. A. Incomplete Data Models for EM: Mixture and Censored Gamma Models

Now described is the finite mixture model, an example of an incomplete data model which may be used to compare the EM and the NEM algorithms:.

A finite mixture model, see R. A. Redner and H. F. Walker, “Mixture Densities, Maximum Likelihood and the EM algorithm,” SIAM Review 26 (1984) 195-239; G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley series in probability and statistics: Applied probability and statistics (Wiley, 2004), may be a convex combination of a finite set of sub-populations. The sub-population pdfs may come from the same parametric family. Mixture models may be useful for modeling mixed populations for statistical applications such as clustering, pattern recognition, and acceptance testing. The following notation for mixture models are used. Y is the observed mixed random variable. K is the number of sub-populations. Zε{1, . . . , K} is the hidden sub-population index random variable. The convex population mixing proportions α₁, . . . , α_Kform a discrete pdf for Z: P(Z=j)=α_j. The pdf f(y|Z=j,θ_j) is the pdf of the j^thsub-population where θ₁, . . . , θ_Kare the pdf parameters for each sub-population. Θ is the vector of all model parameters Θ={α₁, . . . , α_K, θ₁, . . . , θ_K}. The joint pdf f(y,z|Θ) is

$\begin{matrix} f (y, z | Θ) = \sum_{j = 1}^{K} α_{j} f (y | j, θ_{j}) δ [z - j] . & (3) \end{matrix}$

The marginal pdf for Y and the conditional pdf for Z given y are

$\begin{matrix} f (y  Θ) = \sum_{j} α_{j} f (y  j, θ_{j}) and & (4) \\ p_{Z} (j  y, Θ) = \frac{α_{j} f (y  Z = j, θ_{j})}{f (y  Θ)} & (5) \end{matrix}$

by Bayes theorem. The joint pdf in exponential form for ease of analysis are rewritten

$\begin{matrix} f (y, z  Θ) = \exp [\sum_{j} [\ln (α_{j}) + \ln f (y  j, θ_{j})] δ [z - j]] . Thus & (6) \\ \ln f (y, z  Θ) = \sum_{j} δ [z - j] \ln [α_{j} f (y  j, θ_{j})] . & (7) \end{matrix}$

EM algorithms for finite mixture models may estimate Θ using the sub-population index Z as the latent variable. An EM algorithm on a finite mixture model may use (5) to derive the Q -function

$\begin{matrix} Q (Θ  Θ_{k}) {(= )}_{Z  y, Θ_{k}} [\ln f (y, Z  Θ)] (8) \\ = \sum_{z} (\sum_{j} δ [z - j] \ln [α_{j} f (y  j, θ_{j})]) \times p_{z} (z  y, Θ_{k}) (9) \\ = \sum_{j} \ln [α_{j} f (y  j, θ_{j})] p_{Z} (j  y, Θ_{k}) . (10) \end{matrix}$

B Noise Benefits in the EM Algorithm

FIG. 1 illustrates an example of a possible EM noise benefit for a Gaussian mixture model. The plot used the noise-annealed NEM algorithm. Low intensity initial noise decreased convergence time while higher intensity starting noise increased it. The optimal initial noise level had standard deviation σ_N* =2.5. The average optimal NEM speed-up over the noiseless EM algorithm is 27.2%. This NEM procedure added noise with a cooling schedule. The noise cools at an inverse-square rate. The Gaussian mixture density was a convex combination of two normal sub-populations N₁and N₂. The simulation used 200 samples of the mixture normal distribution to estimate the standard deviations of the two sub-populations. The additive noise used samples of zero-mean normal noise with standard deviation σ_Nscreened through the GMMNEM condition in (42). Each sampled point on the curve is the average of 100 trials. The vertical bars are 95% bootstrap confidence intervals for the mean convergence time at each noise level.

Theorem 1 below states a general sufficient condition for a noise benefit in the average convergence time of the EM algorithm. FIG. 1 shows a simulation instance of this theorem for the important EM case of Gaussian mixture densities. Small values of the noise variance may reduce convergence time while larger values may increase it. This possible U-shaped noise benefit may be the non-monotonic signature of stochastic resonance. The optimal noise speeds may converge by 27.2%. Other simulations on multidimensional GMMs have shown speed increases of up to 40%.

The possible EM noise benefit may differ from almost all SR noise benefits because it may not involve the use of a signal threshold, see L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, “Stochastic Resonance,” Reviews of Modern Physics 70 (1998) 223-287. The possible EM noise benefit may also differ from most SR noise benefits because the additive noise can depend on the signal. Independent noise can lead to weaker noise benefits than dependent noise in EM algorithms. This may also happen with enhanced convergence in noise-injected Markov chains, see B. Franzke and B. Kosko, “Noise can speed convergence in Markov chains,”, Physical Review E 84 (2011) 041112.

The idea behind the EM noise benefit is that sometimes noise can make the signal data more probable. This occurs at the local level when

f(y+n|θ)>f(y|θ) (14)

for probability density function (pdf) f, realization y of random variable Y, realization n of random noise N, and parameter θ. This condition holds if and only if the logarithm of the pdf ratio is positive:

$\begin{matrix} \ln (\frac{f (y + n  θ)}{f (y  θ)}) > 0. & (15) \end{matrix}$

The logarithmic condition (15) in turn occurs much more generally if it holds only on average with respect to all the pdfs involved in the EM algorithm:

$\begin{matrix} _{Y, Z, N  θ_{*}} [\ln \frac{f (Y + N, Z  θ_{k})}{f (Y, Z  θ_{k})}] \geq 0 & (16) \end{matrix}$

where random variable Z represents missing data in the EM algorithm and where θ. is the limit of the EM estimates θ_k: θ_k→θ_*. The positivity condition (16) may be precisely the sufficient condition for a noise benefit in Theorem 1 below, called the NEM or Noisy EM Theorem.

III. Noisy Expectation Maximization Theorems

The EM noise benefit may be defined by first defining a modified surrogate log-likelihood function

Q
_N(θ|θ_k)= custom-character _Z|y,θ_k[lnf(y+N,Z|θ)] (17)

and its maximizer

$θ_{k + 1, N} = \begin{matrix} \arg \max \\ θ \end{matrix} {Q_{N} (θ | θ_{k})} .$

The modified surrogate log-likelihood Q_N(θ|θ_k) equals the regular surrogate log-likelihood Q(θ|θ_k) when N=0. Q(θ|θ_*) is the final surrogate log-likelihood given the optimal EM estimate θ_*. So θ_*may maximize Q(θ|θ_*). Thus

Q(θ_*|θ_*)≧Q(θ|θ_*) for all θ. (18)

An EM noise benefit occurs when the noisy surrogate log-likelihood Q_N(θ_k|θ_*) is closer to the optimal value Q(θ_*|θ_*) than the regular surrogate log-likelihood Q(θ_k|θ_*) is. This holds when

Q
_N(θ_k|θ_*)≧Q(θ_k|θ_*) (19)

(Q(θ_*|θ_*)−Q(θ_k|θ_*))≧(Q(θ_*|θ_*)−Q_N(θ_k|θ_*)). (20)

So the noisy perturbation Q_N(θ|θ_k) of the current surrogate log-likelihood Q(θ|θ_k) may be a better log-likelihood function for the data than Q is itself. An average noise benefit results when the expectations of both sides of inequality (20):

custom-character
_N
[Q(θ_*|θ_*)−Q(θ_k|θ_*)]≧_N[Q(θ_*|θ_*)−Q_N(θ_k|θ_*)]. (21)

are taken.

The average noise benefit (21) occurs when the final EM pdf f(y,z|θ_*) is closer in relative-entropy to the noisy pdf f(y+N,z|θ_k) than it is to the noiseless pdf f(y,z|θ_k) . Define the relative-entropy pseudo-distances

c
_k(N)=D(f(y,z|θ_*)∥f(y+N,z|θ_k)) (22)

c
_k
=c
_k(0)=D(f(y,z|θ_*)∥f(y,z|θ_k)). (23)

Then noise benefits the EM algorithm when

c
_k
≧c
_k(N) (24)

holds for the relative-entropy pseudo-distances. The relative entropy itself has the form, see T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley & Sons, New York, 1991), 1 edition,

$\begin{matrix} D (h (u, v) \langle \rangle g (u, v)) = \int_{U, V} \ln [\frac{h (u, v)}{g (u, v)}] h (u, v) \partial u \partial v & (25) \end{matrix}$

for positive pdfs h and g over the same support. Convergent sums can replace the integrals as needed.

A. NEM Theorem

The Noisy Expectation Maximization (NEM) Theorem below uses the following notation. The noise random variable N has pdf f(n|y). So the noise N can depend on the data Y. Independence implies that the noise pdf becomes f(n|y)=f_N(n). {θ_k} is a sequence of EM estimates for θ. θ_*=lim_k→∞θ_kis the converged EM estimate for θ. Assume that the differential entropy of all random variables is finite. Assume also that the additive noise keeps the data in the likelihood function's support. The Appendix below gives proof of the NEM Theorem and its three corollaries.

Theorem 1: Noisy Expectation Maximization (NEM). The EM estimation iteration noise benefit

(Q(θ_*|θ_*)−Q(θ_k|θ_*))≧(Q(θ_*|θ_*)−Q_N(θ_k|θ_*)) (26)

may occur on average if

$\begin{matrix} _{Y, Z, N } θ_{*} [\ln (\frac{f (Y + N, Z  θ_{k})}{f (Y, Z  θ_{k})})] \geq 0. & (27) \end{matrix}$

The NEM theorem also applies to EM algorithms that use the complete data as their latent random variable. The proof for these cases follows from the proof in the appendix. The NEM positivity condition in these models may changes to

$\begin{matrix} _{X, Y, N  θ_{*}} [\ln (\frac{f (X + N  θ_{k})}{f (X  θ_{k})})] \geq 0. & (28) \end{matrix}$

The theorem also holds for more general methods of noise injection like using noise multiplication y.N instead of noise addition y+N. The NEM condition for generalized noise injection is

$_{Y, Z, N  θ_{*}} [\ln (\frac{f (φ (Y, N), Z  θ_{k})}{f (Y, Z  θ_{k})})] \geq 0$

where φ (Y,N) is some generalized function for combining data with noise.

The NEM Theorem may imply that each iteration of a suitably noisy EM algorithm moves closer on average towards the EM estimate θ_*than does the corresponding noiseless EM algorithm, see O. Osoba, S. Mitaim and B. Kosko, “Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models,” in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2011), pp. 3178-3183. This may hold because the positivity condition (27) implies that E_N[c_k(N)]≦c_kat each step k since c_kdoes not depend on N from (23). The NEM algorithm may use larger overall steps on average than does the noiseless EM algorithm for any number k of steps

The NEM theorem's stepwise possible noise benefit may lead to a noise benefit at any point in the sequence of NEM estimates. This is because the following inequalities may be had when the expected value of inequality (19) takes the form

Q(θ_k|θ_*)≦ custom-character _N[Q_N(θ_k|θ_*)] for any k. (29)

Thus

Q(θ_*|θ_*)−Q(θ_k|θ_*)≧Q(θ_*|θ_*)− custom-character _N[Q_N(θ_k|θ_*)] for any k. (30)

The EM (NEM) sequence may converge when the left (right) side of inequality (30) equals zero. Inequality (30) implies that the difference on the right side is closer to zero at any step k.

NEM sequence convergence may be even stronger if the noise N_kdecays to zero as the iteration count k grows to infinity. This noise annealing implies N_k→0 with probability one. Continuity of Q as a function of Y implies that Q_N_k(θ|θ_k)→Q(θ|θ_k) as N_k43 0. This may hold because Q(θ|θ_k)= custom-character _Z|y,θ_k[lnf(y,Z|θ)] and because the continuity of Q implies that

$\begin{matrix} \begin{matrix} \lim_{N \to 0} Q_{N} (θ  θ_{k}) = _{Z  y, θ_{k}} [\ln f (\lim_{N \to 0} (y + N), Z  θ)] \\ = _{Z  y, θ_{k}} [\ln f (y, Z  θ)] \\ = Q (θ  θ_{k}) . \end{matrix} & (31) \end{matrix}$

The evolution of EM algorithms may guarantee that lim_kQ(θ_k|θ_*)=Q(θ_*|θ_*). This may give the probability-one limit

$\begin{matrix} \lim_{k \to \infty} Q_{N_{k}} (θ_{k}  θ_{*}) = Q (θ_{*}  θ_{*}) . & (32) \end{matrix}$

So for any ε>0 there may exist a k₀such that for all k>k₀:

|Q(θ_k|θ_*)−Q(θ_*|θ_*)|<ε and |Q_N_k(θ_k|θ_*)−Q(θ_*|θ_*)|<ε with probability one. (33)

Inequalities (29) and (33) may imply that Q(θ_k|θ_*) is ε-close to its upper limit Q(θ_*|θ_*) and

custom-character [Q_N_k(θ_k|θ_*)]≧Q(θ_k|θ_*) and Q(θ_*|θ_*)≧Q(θ_k|θ_*) (34)

So the NEM and EM algorithms may converge to the same fixed-point by (32). And the inequalities (34) may imply that NEM estimates are closer on average to optimal than EM estimates are at any step k.

B. NEM: Dominated Densities and Mixture Densities

The first corollary of Theorem 1 gives a dominated-density condition that satisfies the positivity condition (27) in the NEM Theorem. This strong pointwise condition is a direct extension of the pdf inequality in (14) to the case of an included latent random variable Z .

Corollary 1:

$\begin{matrix} _{Y, Z, N  θ_{*}} [\ln \frac{f (Y + N, Z  θ)}{f (Y, Z  θ)}] \geq 0 if f (y + n, z  θ) \geq f (y, z  θ) & (35) \end{matrix}$

for almost all y, z, and n.

The Corollary 1 may be used to derive conditions on the noise N that produce NEM noise benefits for mixture models. NEM mixture models may use two special cases of Corollary 1. These special cases as Corollaries 2 and 3 are stated below. The corollaries use the finite mixture model notation in Section 2.1. Recall that the joint pdf of Y and Z is

f(y,z|θ)=Σ_jα_jf(y|j,θ)δ[z−j]. (36)

Define the population-wise noise likelihood difference as

Δf_j(y,n)=f(y+n|j,θ)−f(y|j,θ). (37)

Corollary 1 implies that noise benefits the mixture model estimation if the dominated-density condition holds:

f(y+n,z|θ)≧f(y,z|θ). (38)

This may occur if

Δf_j(y,n)≧0 for all j. (39)

The Gaussian mixture model (GMM) may use normal pdfs for the sub-population pdfs, see V. Hasselblad, “Estimation of Parameters for a Mixture of Normal Distributions,” Technometrics 8 (1966) 431-444; R. A. Redner and H. F. Walker, “Mixture Densities, Maximum Likelihood and the EM algorithm,” SIAM Review 26 (1984) 195-239. Corollary 2 states a simple quadratic condition that may ensure that the noisy sub-population pdf f(y+n|Z=j,θ) dominates the noiseless sub-population pdf f(y|Z=j,θ) for GMMs. The additive noise samples n may depend on the data samples y.

Corollary 2: Suppose Y|_Z=j˜N(μ_j,σ_j²) and thus f(y|j,θ) is a normal pdf. Then

Δf_j(y,n)≧0 (40)

holds if

n
²≦2n(μ_j−y). (41)

Now apply the quadratic condition (41) to (39). Then f(y+n,z|θ)≧f(y,z|θ) may hold when

n
²≦2n(μ_j−y) for all j. (42)

The inequality (42) gives the GMM-NEM noise benefit condition (misstated in O. Osoba and B. Kosko, “Noise-Enhanced Clustering and Competitive Learning Algorithms,” Neural Networks 37 (2013) 132-140, but corrected in O. Osoba and B. Kosko, “Corrigendum to ‘Noise enhanced clustering and competitive learning algorithms’ [Neural Netw. 37 (2013) 132-140],” Neural Networks (2013)) when the NEM system more quickly estimates the standard deviations σ_jthan does noiseless EM. This can also benefit expectation-conditional-maximization (ECM), see X. L. Meng and D. B. Rubin, “Maximum Likelihood Estimation via the ECM algorithm: A general framework,” Biometrika 80 (1993) 267, methods.

FIG. 1 shows an example of a simulation instance of possible noise benefits for GMM parameter estimation based on the GMM-NEM condition (42). The simulation estimates the sub-population standard deviations σ₁and σ₂from 200 samples of a Gaussian mixture of two 1-D sub-populations with known means μ₁=−2 and μ₂=2 and mixing proportions α₁=0.5 and α₂=0.5. The true standard deviations may be σ₁*=2 and σ₂*=2. Each EM and NEM procedure may start at the same initial point with σ₁(0)=4.5 and σ₂(0)=5. The simulation runs NEM on 100 GMM data sets for each noise level σ_Nand counts the number of iterations before convergence for each instance. The average of these iteration counts is the average convergence time at that noise level σ_N. The EM and NEM simulations use the NArgMax numerical maximization routine in Mathematica for the M-step. Simulations (not shown) also confirm that both the Cauchy mixture model (CMM) and non-Gaussian noise may show a similar pronounced noise benefit.

Corollary 3 gives a similar quadratic condition for the Cauchy mixture model.

Corollary 3: Suppose Y|_Z=j˜C(m_j,d_j) and thus f(y|j,θ) is a Cauchy pdf. Then

Δf_j(y,n)≧0 (43)

holds if

n
²≦2n(m_j−y). (44)

Again apply the quadratic condition (44) to (39). Then f(y+n,z|θ)≧f(y,z|θ) may hold when

n
²≦2n(m_j−y) for all j. (45)

Both quadratic NEM inequality conditions in (42) and (45) may reduce to the following inequality (replace μ with m for the CMM case):

n[n−2(μ_j−y)]≦0 for all j. (46)

So the noise n may fall in the set where the parabola n²−2n(μ_j−y) is negative for all j. There are two possible solution sets for (46) depending on the values of μ_jand y. These solution sets are

N
⁻
_j(y)=[2(μj−y),0] (47)

N
⁺
_j(y)=[0,2(μ_j−y)]. (48)

A goal may be to find the set N(y) of n values that satisfy the inequality in (42) for all j:

N(y)=∩_jN_j(y) (49)

where N_j(y)=N⁺_j(y) or N_j(y)=N⁻_j(y). N(y)≠{0} may hold only when the sample y lies on one side of all subpopulation means (or location parameters) μ_j. This may hold for

y<μ_jfor all j or y<μ_jfor all j . (50)

The NEM noise N may take values in ∩_jN⁻_jif the data sample y falls to the right of all sub-population means (y>μ_jfor all j). The NEM noise N may take values in ∩_jN⁺_jif the data sample y falls to the left of all subpopulation means (y<μ_jfor all j). And N=0 may only be valid value for N when y falls between sub-populations means. Thus, the noise N may tend to pull the data sample y away from the tails and towards the cluster of sub-population means (or locations).

IV. The Noisy Expectation-Maximization Algorithm

The NEM Theorem and its corollaries give a general method for modifying the noiseless EM algorithm. The NEM Theorem also may imply that, on average, these NEM variants outperform the noiseless EM algorithm.

Algorithm 2 gives the Noisy Expectation-Maximization algorithm schema. The operation NEMNoiseSample(y) generates noise samples that satisfy the NEM condition for the current data model. The noise sampling distribution may depend on the vector of random samples y in the Gaussian and Cauchy mixture models. The noise can have any value in the NEM algorithm for censored gamma models. The E-Step may take a conditional expectation of a function of the noisy data samples y_† given the noiseless data samples y.

Algorithm 2 {circumflex over (θ)}_NEM= NEM-Estimate(y)

Require: y = (y₁, . . ., y_M): vector of observed incomplete data

Ensure: {circumflex over (θ)}_NEM: NEM estimate of parameter θ

1:
while (||θ_k− θ_k−1|| ≧ 10^−tol) do

2:

N_S-Step: n ← k^−τ × NEMNoiseSample(y)

3:

N_A-Step: y_† ← y + n

4:

E-Step: Q(θ|θ_k) ← E_Z|y,θ_k [ln ∫(y_†, Z|θ)]

5:

M - Step : θ_{k + 1} \leftarrow \underset{θ}{argmax} {Q (θ | θ_{k})}

6:

k ← k + 1

7:
end while

8:
{circumflex over (θ)}_NEM← θ_k

A deterministic decay factor k^−τ scales the noise on the k^thiteration. τ is the noise decay rate. The decay factor k^−τ reduces the noise at each new iteration. This factor drives the noise N_kto zero as the iteration step k increases. The simulations in this presentation use τ=2 for demonstration. Values between τ=1 and τ=3 also work. N_kstill needs to satisfy the NEM condition for the data model. The cooling factor k^−τ must not cause the noise samples to violate the NEM condition. This may means that 0<k^−τ≦1 and that the NEM condition solution set is closed with respect to contractions.

The decay factor may reduce the NEM estimator's jitter around its final value. This may be important because the EM algorithm converges to fixed-points. So excessive estimator jitter may prolong convergence time even when the jitter occurs near the final solution. The simulations in this presentation use polynomial decay factors instead of logarithmic cooling schedules found in annealing applications, see S. Kirkpatrick, C. Gelatt Jr and M. Vecchi, “Optimization by simulated annealing,” Science 220 (1983) 671-680; V. Cerny, “Thermodynamical approach to the Traveling Salesman Problem: An efficient simulation algorithm,” Journal of Optimization Theory and Applications 45 (1985) 41-51; S. Geman and C. Hwang, “Diffusions for global optimization,” SIAM Journal on Control and Optimization 24 (1986) 1031-1043; B. Hajek, “Cooling schedules for optimal annealing,” Mathematics of operations research (1988) 311-329; B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence (Prentice Hall, 1991).

Deterministic and/or chaotic samples can achieve effects similar to random noise in the NEM algorithm.NEM variants that use deterministic or chaotic perturbations instead of random noise may be called Deterministic Interference EM or Chaotic EM respectively.

The next algorithm is an example of the full NEM algorithm for 1-D GMMs using an inverse square cooling rate on the additive noise. The N-step combines both N_Sand N_Asteps in the NEM algorithm.

Algorithm 3 GMM-NEM Algorithm (1-D)

Require: y = (y₁, . . ., y_M): vector of observed incomplete data

Ensure: {circumflex over (θ)}_NEM: NEM estimate of parameter θ

1:
while (||θ_k− θ_k−1|| ≧ 10^−tol) do

2:

N - Step : y_{†, i} = y_{i} + n_{i} where n_{i} is a sample of the truncated Gaussian \sim N (0, \frac{σ N}{k^{2}})

such that n_i[n_i− 2(μ_j− y_i)] ≦ 0 for all i, j

3:

E-Step: Q(Θ|Θ(t)) = Σ_i=1^MΣ_j=1^Kln[α_j∫(z_i|j, θ_j)]p_Z(j|y, Θ(t))

4:

M - Step : θ_{k + 1} = \underset{θ}{argmax} {Q (θ | θ_{k})}

5:

k = k + 1

6:
end while

7:
{circumflex over (θ)}_NEM= θ_k.

The NEM algorithm may inherit variants from the classical EM algorithm schema. A NEM adaptation to the Generalized Expectation Maximization (GEM) algorithm may be one of the simpler variations. The GEM algorithm replaces the EM maximization step with a gradient ascent step. The Noisy Generalized Expectation Maximization (NGEM) algorithm (Algorithm 3) may use the same M-step:

Algorithm 3 Modified M-Step for NGEM:

1: M-Step: θ_k+1 ← {tilde over (θ)} such that Q ({tilde over (θ)}|θ_k) ≧ Q (θ_k|θ_k)

The NEM algorithm schema may also allow for some variations outside the scope of the EM algorithm. These involve modifications to the noise sampling step N_S-Step or to the noise addition step N_A-Step. One such modification may not require an additive noise term n_ifor each y_i. This may be useful when the NEM condition is stringent because then noise sampling can be time intensive. This variant changes the N_S-Step by picking a random or deterministic sub-selection of y to modify. Then, it samples the noise subject to the NEM condition for those sub-selected samples. This is the Partial Noise Addition NEM (PNA-NEM).

Algorithm 4 Modified N_s-Step or PNA-NEM

custom-character

← {1 . . . M}

custom-character

← SubSelection( custom-character

)

for all i ε custom-character

do

n_i← k^−T × NEMNoiseSample(y_i)

end for

The NEM noise generating procedure NEMNoiseSample(y) may return a NEMcompliant noise sample n at a given noise level σ_Nfor each data sample y. This procedure may change with the EM data model. The noise generating procedure for the GMMs and CMMs comes from Corollaries 2 and 3. The following 1-D noise generating procedure may be used for the GMM simulations:

NEMNoiseSample for GMM- and CMM-NEM

Require: y and σ_N: current data sample and noise level

Ensure: n: noise sample satisfying NEM condition

N (y) ← ∩_jN_j(y)

n is a sample from the distribution T N (0, σ_N|N(y))

where TN(0,σ_N|N(y)) is the normal distribution N(0,σ_N) truncated to the support set N(y). The set N(y) is the interval intersection from (49). Multi-dimensional versions of the generator can apply the procedure component-wise.

V. NEM Sample Size Effects: Gaussian and Cauchy Mixture Models

The noise-benefit effect may depend on the size of the GMM data set. Analysis of this effect may depend on the probabilistic event that the noise satisfies the GMM-NEM condition for the entire sample set. This analysis also applies to the Cauchy mixture model because its NEM condition is the same as the GMM's. Define A_kas the event that the noise N satisfies the GMM-NEM condition for the k^thdata sample:

A
_k
={N
²≦2N(μ_j−y_k)|∀_j}. (52)

Then define the event A_Mthat noise random variable N satisfies the GMM-NEM condition for each data sample as

$\begin{matrix} \begin{matrix} A_{M} = \underset{k}{⋂^{M}} A_{k} \\ = {N^{2} \leq 2 N (μ_{j} - y_{k})  \forall j and \forall k} . (54) \end{matrix} & (53) \end{matrix}$

This construction may be useful for analyzing NEM when the independent and identically distributed (i.i.d.) noise

$N_{k} \overset{d}{=} N$

for all y_kis used while still enforcing the NEM condition.

A Large Sample Size Effects

The next theorem shows that the set A_Mshrinks to the singleton set {0} as the number M of samples in the data set grows. So the probability of satisfying the NEM condition with i.i.d. noise samples goes to zero as M→∞ with probability one.

Theorem 2: Large Sample GMM and CMM-NEM

Assume that the noise random variables are i.i.d. Then the set of noise values

A
_M
={N
²≦2N(μ_j−y_k)|∀j and ∀k} (55)

that satisfy the Gaussian NEM condition for all data samples y_kdecreases with probability one to the set {0} as M→∞:

$\begin{matrix} P (\lim_{M \to \infty} A_{M} = {0}) = 1. & (56) \end{matrix}$

The proof shows that larger sample sizes M may place tighter bounds on the size of A_Mwith probability one. The bounds shrink A_Mall the way down to the singleton set {0} as M→∞. A_Mis the set of values that identically distributed noise N can take to satisfy the NEM condition for all y_k. A_M={0} means that N_kmust be zero for all k because the N_kare identically distributed. This corresponds to cases where the NEM Theorem cannot guarantee improvement over the regular EM using just i.i.d. noise. So identically distributed noise may have limited use in the GMM- and CMM-NEM framework.

Theorem 2 is a “probability-one” result. But it also implies the following convergence-in-probability result. Suppose Ñ is an arbitrary continuous random variable. Then the probability P(ÑεA_M) that Ñ satisfies the NEM condition for all samples may fall to P(Ñε{0})=0 as M→∞.

Using non-identically distributed noise N_kmay avoid the reduction in the probability of satisfying the NEM-condition for large M. The NEM condition may still hold when N_kεA_kfor each k even if N_k∉A_M=∩_kA_k. This noise sampling model may adapt the k^thnoise random variable N_kto the k^thdata sample y_k. This is the general NEM noise model. FIG. 1 and FIG. 2 use the NEM noise model. This model may be equivalent to defining the global NEM event Ã_Mas a Cartesian product of sub-events Ã_M=Π_k^MA_kinstead of the intersection of sub-events A_M=∩_kA_k. Thus, the bounds of Ã_Mand its coordinate projections may no longer depend on sample size M.

FIG. 3 illustrates an example of comparing of the possible effects of noise injection with and without the NEM sufficient condition. The data model is a GMM with sample size M=225. The blind noise model added annealed noise without checking the NEM condition. The plot shows that NEM noise injection outperformed the blind noise injection. NEM converged up to about 20% faster than the blind noise injection for this model. And blind noise injection produced no reduction in average convergence time. The Gaussian mixture density had mean μ=[0,1], standard deviations σ=[1,1], and weights α=[0.5,0.5] with M=225 samples.

FIG. 3 compares the performance of the NEM algorithm with a simulated annealing version of the EM algorithm. This version of EM adds annealed i.i.d. noise to data samples y without screening the noise through the NEM condition, called blind noise injection. FIG. 3 shows that NEM may outperform blind noise injection at a single sample size M=225. But it also shows that blind noise injection may fail to give any benefit even when NEM achieves faster average EM convergence for the same set of samples. Thus blind noise injection (or simple simulated annealing) may perform worse than NEM and may sometimes performs worse than EM itself.

B. Small Sample Size: Sparsity Effect

The i.i.d noise model in Theorem 2 has an important corollary effect for sparse data sets. The size of A_Mdecreases monotonically with M because A_M=∩_k^MA_k. Then for M₀<M₁:

P(NεA_M₀)≧P(NεA_M₁) 57)

since M₀<M₁implies that A_M₁⊂A_M₀. Thus arbitrary noise N (i.i.d and independent of Y_k) is more likely to satisfy the NEM condition and produce a noise benefit for smaller samples sizes M₀than for larger samples sizes M₁. The probability that NεA_Mfalls to zero as M→∞. So the strength of the i.i.d. noise benefit falls as M→∞.

Possible Hardware

FIG. 4 illustrates an example of a computer estimation system 401 for iteratively estimating an unknown parameter of a model or state of a system. The estimating computer system 401 may include an input module 403, a noise module 405, and estimation module 407, and a signaling module 409. The computer estimated system 401 may include additional modules and/or not all of these modules. Collectively, the various modules may be configured to implement any or all of the algorithms that have been discussed herein. Now set forth are examples of these implementations.

The input module 403 may have a configuration that receives numerical data about a model or state of the system. The input module 403 may consist of or include a network interface card, a data storage system interface, any other type of device that receives data, and/or any combination of these.

The noise module 405 may have a configuration that generates random, chaotic, or other type of numerical perturbations of the received numerical data and/or that generates pseudo-random noise.

The noise module 405 may have a configuration that generates random, chaotic, or other type of numerical perturbations of the input numerical data that fully or partially satisfy a noisy expectation maximization (NEM) condition.

The noise module 405 may have a configuration that generates numerical perturbations that do not depend on the received numerical data.

The estimation module 407 may have a configuration that iteratively estimates the unknown parameter of the model or state of the system based on the received numerical data and then uses the numerical perturbations in the input numerical data and/or the pseudo-random noise and the input numerical data during at least one of the iterative estimates of the unknown parameter.

The estimation module 407 may have a configuration that estimates the unknown parameter of the model or state of the system using maximum likelihood, expectation-maximization, minorization-maximization, or another statistical optimization or sub-optimization method.

The estimation module 407 may have a configuration that estimates the unknown parameter of the model or state of the system by adding, multiplying, or otherwise combining the input data with the numerical perturbations.

The estimation module 407 may have a configuration that estimates the unknown parameter of the model or state of the system using the numerical perturbations that do not depend on the received numerical data.

The estimation module 407 may have a configuration that causes the magnitude of the generated numerical perturbations to eventually decay during successive parameter estimates.

FIG. 5 illustrates an example of computer-readable storage media that may contain a program of instructions that cause a computer system running the program of instructions to function as any of the types of estimating computer system described herein.

Other documents that disclose details about the technology that has been described herein include:

- O. Osoba, S. Mitaim, B. Kosko, “The Noisy Expectation Maximization Algorithm,” Fluctuation and Noise Letters, June 2013
- K. Audhkhasi, O. Osoba, and B. Kosko, “Noise Benefits in Back-Propagation and Deep Bidirectional Pre-Training,” International Joint Conference on Neural Networks (IJCNN), 2013
- K. Audhkhasi, O. Osoba, and B. Kosko, “Noisy Hidden Markov Models for Speech Recognition,” International Joint Conference on Neural Networks (IJCNN), 2013
- O. Osoba, B. Kosko, “Noise-enhanced Clustering and Competitive Learning Algorithms,” Neural Networks, vol. 37, pp.132-140, January 2013
- O. Osoba, S. Mitaim, B. Kosko, “Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models,” International Joint Conference on Neural Networks (IJCNN), pp. 3178-3183, August 2011
- Osoba, Osonde Adekorede. Noise Benefits in Expectation-Maximization Algorithms. Dissertation, University of Southern California, August 2013

VI. Conclusion

Careful noise injection can speed up the average convergence time of the EM algorithm. The various sufficient conditions for such a noise benefit may involve a direct or average effect where the noise makes the signal data more probable. Special cases may include mixture density models and log-convex probability density models. Noise injection for the Gaussian and Cauchy mixture models may improve the average EM convergence speed when the noise satisfies a simple quadratic condition. Even blind noise injection can sometimes benefit these systems when the data set is sparse. But NEM noise injection still outperforms blind noise injection in all data models tested.

APPENDIX
Proof of Theorems
Theorem 1: Noisy Expectation Maximization (NEM)

An EM estimation iteration noise benefit

(Q(θ_*|θ_*)−Q(θ_k|θ_*))≧(Q(θ_*|θ_*)−Q_N(θ_k|θ_*) (67)

occurs on average if

$\begin{matrix} _{Y, Z, N  θ_{*}} [\ln (\frac{f (Y + N, Z  θ_{k})}{f (Y, Z  θ_{k})})] \geq 0. & (68) \end{matrix}$

Proof: Each expectation of Q-function differences in (21) is a distance pseudo-metric. Rewrite Q as an integral:

∫_Zln[f(y,z|θ)]f(z|y,θ_k)dz. (69)

c_k=D(f(y,z|θ_*)∥f(y,z|θ_k)) is the expectation over Y because

$\begin{matrix} \begin{matrix} c_{k} = \int \int [\ln (f (y, z  θ_{*})) - \ln f (y, z  θ_{k})] \\ f (y, z  θ_{*}) \partial z \partial y \\ = \int \int [\ln (f (y, z  θ_{*})) - \ln f (y, z  θ_{k})] (71) \\ f (z  y, θ_{*}) f (y  θ_{*}) \partial z \partial y \\ = _{Y  θ_{k}} [Q (θ_{*}  θ_{*}) - Q (θ_{k}  θ_{*})] . (72) \end{matrix} & (70) \end{matrix}$

c_k(N) is likewise the expectation over Y:

$\begin{matrix} \begin{matrix} c_{k} (N) = \int \int [\ln (f (y, z  θ_{*})) - \ln f (y + N, z  θ_{k})] \\ f (y, z  θ_{*}) \partial z \partial y \\ = \int \int [\ln (f (y, z  θ_{*})) - \ln f (y + N, z  θ_{k})] (74) \\ f (z  y, θ_{*}) f (y  θ_{*}) \partial z \partial y \\ = _{Y  θ_{k}} [Q (θ_{*}  θ_{*}) - Q_{N} (θ_{k}  θ_{*})] . (75) \end{matrix} & (73) \end{matrix}$

Take the noise expectation of c_kand c_k(N):

custom-character
_N[c_k]=c_k (76)

custom-character
_N
[c
_k(N)]=_N[c_k(N)]. (77)

So the distance inequality

c
_k≧ custom-character _N[c_k(N)] (78)

guarantees that noise benefits occur on average:

The inequality condition (78) may be used to derive a more useful sufficient condition for a noise benefit. Expand the difference of relative entropy terms c_k−c_k(N):

$\begin{matrix} \begin{matrix} c_{k} - c_{k} (N) = \int \int_{Y, Z}^{} (\ln [\frac{f (y, z  θ_{*})}{f (y, z  θ_{k})}] - \\ \ln [\frac{f (y, z  θ_{*})}{f (y + N, z  θ_{k})}]) f (y, z  θ_{*}) \partial y \partial z \\ = \int \int_{Y, Z}^{} (\ln [\frac{f (y, z  θ_{*})}{f (y, z  θ_{k})}] + (81) \\ \ln [\frac{f (y + N, z  θ_{k})}{f (y, z  θ_{*})}]) f (y, z  θ_{*}) \partial y \partial z \\ = \int \int_{Y, Z}^{} \ln [\frac{f (y, z  θ_{*}) f (y + N, z  θ_{k})}{f (y, z  θ_{k}) f (y, z  θ_{*})}] (82) \\ f (y, z  θ_{*}) \partial y \partial z \\ = \int \int_{Y, Z}^{} \ln [\frac{f (y + N, z  θ_{k})}{f (y, z  θ_{k})}] f (y, z  θ_{*}) \partial y \partial z . (83) \end{matrix} & (80) \end{matrix}$

Take the expectation with respect to the noise term N:

$\begin{matrix} \begin{matrix} _{N} [c_{k} - c_{k} (N)] = c_{k} - _{N} [c_{k} (N)] \\ = \int_{N}^{} \int \int_{Y, Z}^{} \ln [\frac{f (y + n, z  θ_{k})}{f (y, z  θ_{k})}] (85) \\ f (y, z  θ_{*}) f (n  y) \partial y \partial z \partial n \\ = \int \int_{Y, Z}^{} \int_{N}^{} \ln [\frac{f (y + n, z  θ_{k})}{f (y, z  θ_{k})}] (86) \\ f (n  y) f (y, z  θ_{*}) \partial n \partial y \partial z \\ = _{Y, Z, N  θ_{*}} [\ln \frac{f (Y + N, Z  θ_{k})}{f (Y, Z  θ_{k})}] . (87) \end{matrix} & (84) \end{matrix}$

The assumption of finite differential entropy for Y and Z may ensure that In f(y,z|θ)f(y,z|θ_*) is integrable. Thus the integrand may be integrable. So Fubini's theorem, see G. B. Folland, Real Analysis: Modern Techniques and Their Applications (Wiley-Interscience, 1999), 2nd edition, permits the change in the order of integration in (87):

$\begin{matrix} c_{k} \geq _{N} [c_{k} (N)] iff _{Y, Z, N  θ_{*}} [\ln (\frac{f (Y + N, Z  θ_{k})}{f (Y, Z  θ_{k})})] \geq 0. & (88) \end{matrix}$

Then an EM noise benefit may occur on average if

$\begin{matrix} _{Y, Z, N  θ_{*}} [\ln (\frac{f (Y + N, Z  θ_{k})}{f (Y, Z  θ_{k})})] \geq 0. & (89) \end{matrix}$

Corollary 1:

$\begin{matrix} _{Y, Z, N  θ_{*}} [\ln \frac{f (y + N, z  θ_{k})}{f (y, z  θ_{k})}] \geq 0 if f (y + n, z  θ) \geq f (y, z  θ) & (90) \end{matrix}$

for almost all y, z, and n.

Proof: The following inequalities need hold only for almost all y, z, and n:

$\begin{matrix} f (y + n, z  θ) \geq f (y, z  θ) & (91) \\ iff \ln [f (y + n, z  θ)] \geq \ln [f (y, z  θ)] & (92) \\ iff \ln [f (y + n, z  θ)] - \ln [f (y, z  θ)] \geq 0 & (93) \\ iff \ln [\frac{f (y + n, z  θ)}{f (y, z  θ)}] \geq 0. & (94) \end{matrix}$

Thus

$\begin{matrix} _{Y, Z, N | θ_{*}} [\ln \frac{f (Y + N, Z | θ)}{f (Y, Z | θ)}] \geq 0. & (95) \end{matrix}$

Corollary 2: Suppose Y|_Z=j˜N(μ_j,σ_j²) and thus f(y|j,θ) is a normal pdf. Then

Δf_j(y,n)≧0 (96)

holds if

n
²≦2n(μ_j−y) (97)

Proof: The proof compares the noisy and noiseless normal pdfs. The normal pdf is

$\begin{matrix} f (y  θ) = \frac{1}{σ_{j} \sqrt{2 π}} \exp [- \frac{{(y - μ_{j})}^{2}}{2 σ_{j}^{2}}] . & (98) \end{matrix}$

So f(y+n|θ)≧f(y|θ)

$\begin{matrix} iff \exp [- \frac{{(y + n - μ_{j})}^{2}}{2 σ_{j}^{2}}] \geq \exp [- \frac{{(y - μ_{j})}^{2}}{2 σ_{j}^{2}}] & (99) \\ iff - {(\frac{y + n - μ_{j}}{σ_{j}})}^{2} \geq - {(\frac{y - μ_{j}}{σ_{j}})}^{2} & (100) \\ iff - {(y - μ_{j} + n)}^{2} \geq - {(y - μ_{j})}^{2} . & (101) \end{matrix}$

Inequality (101) may hold because σ_jis strictly positive. Expand the left-hand side to get (97):

(y−μ_j)²+n²+2n(y−μ_j)≦(y−μ_j)² (102)

iff n
²+2n(y−μ_j)≦0 (103)

iff n
²≦−2n(y−μ_j) (104)

iff n
²≦2n(μ_j−y) (105)

Corollary 3: Suppose Y|_Z=j˜C(m_j,d_j) and thus f(y|j,θ) is a Cauchy pdf. Then

Δf_j(y,n)≧0 (106)

holds if

n
²≦2n(m_j−y). (107)

Proof:The proof compares the noisy and noiseless Cauchy pdfs. The Cauchy pdf is

$\begin{matrix} f (y  θ) = \frac{1}{π d_{j} [1 + {(\frac{y - m_{j}}{d_{j}})}^{2}]} . & (108) \end{matrix}$

Then f(y+n|θ)≧f(y|θ)

$\begin{matrix} iff \frac{\frac{1}{π d_{j}}}{[1 + {(\frac{y + n - m_{j}}{d_{j}})}^{2}]} \geq \frac{\frac{1}{π d_{j}}}{[1 + {(\frac{y - m_{j}}{d_{j}})}^{2}]} & (109) \\ iff [1 + {(\frac{y - m_{j}}{d_{j}})}^{2}] \geq [1 + {(\frac{y + n - m_{j}}{d_{j}})}^{2}] & (110) \\ iff {(\frac{y - m_{j}}{d_{j}})}^{2} \geq {(\frac{y + n - m_{j}}{d_{j}})}^{2} . & (111) \end{matrix}$

Proceed as in the last part of the Gaussian case:

$\begin{matrix} {(\frac{y - m_{j}}{d_{j}})}^{2} \geq {(\frac{y - m_{j} + n}{d_{j}})}^{2} & (112) \\ iff {(y - m_{j})}^{2} \geq {(y - m_{j} + n)}^{2} & (113) \\ iff {(y - m_{j})}^{2} \geq {(y - m_{j})}^{2} + n^{2} + 2 n (y - m_{j}) & (114) \\ iff 0 \geq n^{2} + 2 n (y - m_{j}) & (115) \\ iff n^{2} \leq 2 n (m_{j} - y) . & (116) \end{matrix}$

The estimating computer system 401 that has been described herein, including each of its modules (except for the input module 403), is implemented with a computer system configured to perform the functions that have been described herein for the component. The computer system includes one or more processors, tangible memories (e.g., random access memories (RAMs), read-only memories (ROMs), and/or programmable read only memories (PROMS)), tangible storage devices (e.g., hard disk drives, CD/DVD drives, and/or flash memories), system buses, video processing components, network communication components, input/output ports, and/or user interface devices (e.g., keyboards, pointing devices, displays, microphones, sound reproduction systems, and/or touch screens). Each module may have its own computer system or some or all of the modules may share a single computer system.

Each computer system may be a desktop computer or a portable computer, or part of a larger system, such a system that clusters algorithms for Big Data; Trains hidden Markov models for speech, natural language, and other kinds of sequential data (including DNA); that trains neural networks for speech and computer vision; identifies sequences for genomics and proteomics; reconstructs medical image in positron emission tomography; segments images for medical imaging and robotics; or estimates risks for portfolio management.

Each computer system may include one or more computers at the same or different locations. When at different locations, the computers may be configured to communicate with one another through a wired and/or wireless network communication system.

Each computer system may include software (e.g., one or more operating systems, device drivers, application programs, and/or communication programs). When software is included, the software includes programming instructions and may include associated data and libraries. When included, the programming instructions are configured to implement one or more algorithms that implement one or more of the functions of the computer system, as recited herein. The description of each function that is performed by each computer system also constitutes a description of the algorithm(s) that performs that function.

The software may be stored on or in one or more non-transitory, tangible storage devices, such as one or more hard disk drives, CDs, DVDs, and/or flash memories. The software may be in source code and/or object code format. Associated data may be stored in any type of volatile and/or non-volatile memory. The software may be loaded into a non-transitory memory and executed by one or more processors.

The components, steps, features, objects, benefits, and advantages that have been discussed are merely illustrative. None of them, nor the discussions relating to them, are intended to limit the scope of protection in any way. Numerous other embodiments are also contemplated. These include embodiments that have fewer, additional, and/or different components, steps, features, objects, benefits, and advantages. These also include embodiments in which the components and/or steps are arranged and/or ordered differently.

For example, the use of Bayesian priors and penalized likelhood functions in Maximum A Posteriori and Penalized EM algorithms, other variants of the EM algorithm, and the more general class of minorization-maximization algorithms.

Unless otherwise stated, all measurements, values, ratings, positions, magnitudes, sizes, and other specifications that are set forth in this specification, including in the claims that follow, are approximate, not exact. They are intended to have a reasonable range that is consistent with the functions to which they relate and with what is customary in the art to which they pertain.

All articles, patents, patent applications, and other publications that have been cited in this disclosure are incorporated herein by reference.

The phrase “means for” when used in a claim is intended to and should be interpreted to embrace the corresponding structures and materials that have been described and their equivalents. Similarly, the phrase “step for” when used in a claim is intended to and should be interpreted to embrace the corresponding acts that have been described and their equivalents. The absence of these phrases from a claim means that the claim is not intended to and should not be interpreted to be limited to these corresponding structures, materials, or acts, or to their equivalents.

The scope of protection is limited solely by the claims that now follow. That scope is intended and should be interpreted to be as broad as is consistent with the ordinary meaning of the language that is used in the claims when interpreted in light of this specification and the prosecution history that follows, except where specific meanings have been set forth, and to encompass all structural and functional equivalents.

Relational terms such as “first” and “second” and the like may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between them. The terms “comprises,” “comprising,” and any other variation thereof when used in connection with a list of elements in the specification or claims are intended to indicate that the list is not exclusive and that other elements may be included. Similarly, an element preceded by an “a” or an “an” does not, without further constraints, preclude the existence of additional elements of the identical type.

None of the claims are intended to embrace subject matter that fails to satisfy the requirement of Sections 101, 102, or 103 of the Patent Act, nor should they be interpreted in such a way. Any unintended coverage of such subject matter is hereby disclaimed. Except as just stated in this paragraph, nothing that has been stated or illustrated is intended or should be interpreted to cause a dedication of any component, step, feature, object, benefit, advantage, or equivalent to the public, regardless of whether it is or is not recited in the claims.

The abstract is provided to help the reader quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, various features in the foregoing detailed description are grouped together in various embodiments to streamline the disclosure. This method of disclosure should not be interpreted as requiring claimed embodiments to require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description, with each claim standing on its own as separately claimed subject matter.

ITERATIVE ESTIMATION OF SYSTEM PARAMETERS USING NOISE-LIKE PERTURBATIONS

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

US Classifications

International Classifications

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATION

Provisional Applications (1)