The present disclosure relates to systems and methods for change point detection, and more particularly, for Bayesian online change point detection.
Change point analysis (CPA), also known as change point detection (CPD), is the characterization of changes in the parameters of a stochastic process observed through sequential data. CPA is often employed in the segmentation of signals, i.e., to facilitate the process of tracking, identification, or recognition.
Online formulations of Bayesian change point analysis were recently introduced. Bayesian Online Change Point Detection (BOCPD) is an algorithm that allows online inference with the causal, predictive filtering necessary in real-time systems that interact with dynamic physical environments. A key challenge, and a common critique, of Bayesian approaches is their high computational requirement, which often necessitates high-precision floating-point computation.
A reconfigurable computing architecture for Bayesian Online ChangePoint Detection (BOCPD) is provided. In an exemplary embodiment, the architecture may be employed for video processing, and more specifically for computing whether a pixel in a video sequence belongs to the background or to an object (foreground). Each pixel may be processed with only information from its intensity and its time history. The computing architecture employs unary fixed-point representation for numbers in time, using pulse density or random pulse density modulation—i.e., a stream of zeros and ones, where the mean of that stream represents the encoded value. Each processor incorporates on-line machine learning that enables a continuous update of processing parameters in the statistical model.
For a fuller understanding of the nature and objects of the disclosure, reference should be made to the following detailed description taken in conjunction with the accompanying drawings, in which:
In embodiments, the present disclosure provides a reconfigurable computing architecture for a change-point detector. The architecture may be configured to compute a change point according to the Adams/MacKay Bayesian Online ChangePoint Detection technique. In a non-limiting embodiment, the present architecture is suitable for video processing, and more specifically, for computing whether a pixel in a video sequence belongs to the background or to an object (foreground). Each pixel may be processed with only information from its intensity and its time history. The computing architecture employs unary fixed-point representation for numbers in time, using pulse density or random pulse density modulation—i.e., a stream of zeros and ones, where the mean of that stream represents the encoded value. Each processor may incorporate on-line machine learning to enable a continuous update of processing parameters in the statistical model.
The present disclosure describes data representation, hardware architecture, and processing units that employ probabilistic event representation and computational structures that natively operate on probabilities. A fully programmable multicore processor for exact Bayesian inference is synthesized in a hardware description language and as a custom application-specific integrated circuit (ASIC). The architecture is programmable and allows computation with precision on demand, permitting power dissipation to be traded off against performance. Computational elements in the architecture enable on-line machine learning of the hyper-parameters of the predictive probability distribution. As an example application, the Adams/MacKay change point detection algorithm is used to perform image segmentation.
The architecture can seamlessly, natively, and optimally perform computationally disparate tasks: traditional computing tasks such as image processing and radio frequency (RF) signal processing execute on the same fabric that also natively executes cognitive and machine learning tasks such as scene understanding and inferential prediction.
The extremely low power dissipation of the custom ASIC architecture (10 nJ per exact Bayesian inference) makes the architecture useful in Internet of Things (IoT) and cyber-physical systems (CPS) with cognition: systems with cyber technologies, both hardware and software, that interact with physical components and operate in environments populated by humans. As these technologies have the potential to create competitive advantage for the U.S. economy in the 21st century, research on the topic is thriving. Such systems are currently being investigated in a variety of sectors, including smart manufacturing (robots that work safely with people in shared spaces), smart grid and utilities (systems for efficient and effective transmission and distribution of electric power), smart buildings and infrastructure (active monitoring and control of buildings), smart transportation and mobility (vehicle-to-vehicle communications, autonomous vehicles), and smart healthcare (cyber-physical medical devices).
Bayesian On-line Change Point Detection (BOCPD)
Consider a sequence of independent random variables (RVs) $X_1, X_2, \ldots, X_t$. The parameters of the distribution of these RVs can change over time. If that change of parameters could be characterized, then a point of change could be identified. As an example of such a signal, consider a simple stream of zeros and ones on a serial line, where the received voltage values at the endpoint fluctuate due to channel noise. Assuming the noise is Gaussian with zero mean, the sampled signal will be distributed $\sim N(\mu, \sigma^2)$, where $\sigma^2$ is the fixed variance of the channel noise and $\mu$ is the actual value to transmit, for instance 0 V or 5 V. In this case $\mu$ is the only parameter that changes over time, but a model in which $\sigma^2$ changes can be considered as well. Samples are distributed normally with parameters $\mu$ and $\sigma^2$, and those parameters can in turn be considered drawn from another distribution $P(\mu, \sigma^2)$. For the example mentioned before, we can assume that $\sigma^2$ is fixed but $\mu$ is drawn from a Bernoulli distribution, with probability $p$ of sending 5 V and probability $(1-p)$ of sending 0 V.
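This serial-line example can be reproduced in a few lines of software. The sketch below is illustrative only: the probability p, noise level, and segment lengths are assumed values, not parameters taken from the disclosure.

```python
# A minimal sketch of the serial-line example: the transmitted level (0 V or
# 5 V) is the changing parameter mu, drawn per symbol with probability p, and
# each received sample is corrupted by Gaussian channel noise of variance
# sigma^2. All names and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                   # probability of transmitting 5 V
sigma = 0.5               # standard deviation of the channel noise
n_symbols = 4             # number of transmitted levels
samples_per_symbol = 50

mu = rng.choice([0.0, 5.0], size=n_symbols, p=[1 - p, p])     # Bernoulli-drawn means
signal = np.repeat(mu, samples_per_symbol)                    # piecewise-constant mean
received = signal + rng.normal(0.0, sigma, size=signal.size)  # samples ~ N(mu, sigma^2)

print("true levels:", mu)
print("first received samples:", np.round(received[:5], 2))
```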
We now introduce the run-length concept: the number of consecutive samples that have been drawn from the same distribution (the parameters' values did not change for any of the samples in that run). At time $t$, $r_t$ is the number of past samples corresponding to that run. If $r_t = k$, then the samples considered part of that run are $x_{t-k}, x_{t-(k-1)}, \ldots, x_{t-1}$. A distinction has to be made between the RV $X$ and a sample $x$ drawn from that random variable. We denote the samples from a run $r_t = k$ by $x_t^{(r=k)}$.
The objective of this algorithm is to predict, based on history, the probability distribution $P(r_t \mid X_{1:t} = x_{1:t})$. Note that for this algorithm, at every time step there is a distribution $P(r_t \mid X_{1:t} = x_{1:t})$, where $r_t$ can take values from 0 to $t$.
For finding changes in the underlying distribution from which samples are obtained over time, the change in the run-length distribution is analyzed with every new sample received. The expression for the run-length distribution is:

$$P(r_t \mid X_{1:t}) \propto \sum_{r_{t-1}} P(r_t \mid r_{t-1})\, P(X_t \mid r_{t-1}, x_{t-1}^{(r)})\, P(r_{t-1} \mid X_{1:t-1}) \tag{1}$$

where $P(r_t \mid r_{t-1})$ equals the hazard $H(r_{t-1}+1)$ when $r_t = 0$ (a changepoint), equals $1 - H(r_{t-1}+1)$ when $r_t = r_{t-1}+1$ (the run grows), and is 0 otherwise. We see in Equation (1) that, at time $t$, $P(r_t \mid X_{1:t})$ depends on the same distribution for the previous sample at time $t-1$. This dependency allows this algorithm to be run online. In our implementation $H(r_{t-1}+1) = 1/\lambda$. For the second term in Equation (1), for every $r_{t-1} = k$, we can write the predictive probability as an integral over the model parameters:

$$P(X_t \mid r_{t-1} = k, x_{t-1}^{(r)}) = \int P(X_t \mid \vec{\theta})\, P(\vec{\theta} \mid x_{t-k:t-1})\, d\vec{\theta} \tag{2}$$

We assume the dependence on the past samples $x_{t-1}^{(r)}$ is summarized entirely by the hyperparameters $\vec{\phi}_k$ of the prior, so that

$$P(X_t \mid r_{t-1} = k, x_{t-1}^{(r)}) = \int P(X_t \mid \vec{\theta})\, P(\vec{\theta} \mid \vec{\phi}_k)\, d\vec{\theta} \tag{3}$$
$$P(\vec{\theta} \mid X_t, \vec{\phi}_k) \propto P(X_t \mid \vec{\theta})\, P(\vec{\theta} \mid \vec{\phi}_k) \tag{4}$$
If the right-hand term in Equation (4) can be formulated, then an expression for (3) can be found, and the expression for the distribution in (1) is finally unveiled. An assumption is made here that $P(X_t \mid \vec{\theta})$ in (4) is normal, $\sim N(\mu, \sigma^2)$, with $\vec{\theta} = \mu$ ($\sigma^2$ fixed), and that $P(\vec{\theta} \mid \vec{\phi}_k)$ is a Gaussian prior $\sim N(\mu_{0k}, \sigma_{0k}^2)$, where $\vec{\phi}_k = (\mu_{0k}(x_{t-k:t-1}), \sigma_{0k}^2(x_{t-k:t-1}))$. The Gaussian distribution chosen for $P(X_t \mid \vec{\theta})$ is the conjugate distribution of the prior and posterior in Equation (4), and hence the posterior $P(\vec{\theta} \mid X_t, \vec{\phi}_k)$ is Gaussian. These are the expressions of the updated parameters $\vec{\phi}_k' = (\sigma_{0k}^{2\prime}, \mu_{0k}')$ due to a new received sample:

$$\mu_{0k}' = \frac{\sigma^2 \mu_{0k} + \sigma_{0k}^2 x_t}{\sigma^2 + \sigma_{0k}^2}, \qquad \sigma_{0k}^{2\prime} = \frac{\sigma^2 \sigma_{0k}^2}{\sigma^2 + \sigma_{0k}^2} \tag{5}$$
When obtaining $P(X_t \mid r_{t-1} = k, x_{t-1}^{(r)})$ (the last piece missing in Equation (1)) through integration over the $\vec{\theta}$ variables in (3), this distribution turns out to be a Gaussian with parameters $\mu_k = \mu_{0k}$ and $\sigma_k^2 = \sigma^2 + \sigma_{0k}^2$. We finally have all the required pieces for turning CPD into its online version, BOCPD, performing updates with each new received sample.
BOCPD Algorithm Description
Having developed the mathematical framework, we now outline the implemented process step by step; a compact software sketch follows the listed steps.
1. Initialize. $r_{t=0}$ can take only one value, 0, so $P(r_{t=0} = 0) = 1$. If for some reason we know something about the process before time $t = 0$, we can start with another distribution $P(r_{t=0}) = S(r)$. We will use here the exemplary case in which we know nothing about the time before $t = 0$.
At each time step, $r$ can take different values, and since the preceding subsection was developed for a particular value $r = k$, for every value of $r$ we will have a different set of values for $\mu_0$, $\nu$, $a$, and $b$ in the case of the inverse-gamma prior, and $\mu_0$ and $\sigma_0$ in the case of the normal prior. We will consider $\vec{\theta}$ to be $\mu_0$, $\nu$, $a$, and $b$ for the inverse-gamma prior, and $\mu_0$ and $\sigma_0$ for the normal prior. We set $\vec{\theta}$ to the initial values $\vec{\theta}_0$.
2. Observe New Datum Xt.
3. Evaluate the predictive probability.
$$P(X_t \mid r_{t-1}, x_{t-1}^{(r)}) = \pi_t^{(r)} \tag{6}$$
4. Calculate Growth Probabilities.
$$P(r_t = r_{t-1}+1, X_{1:t}) = P(r_{t-1}, X_{1:t-1})\, \pi_t^{(r)}\, \left(1 - H(r_{t-1})\right) \tag{7}$$
5. Calculate Changepoint Probabilities. In this step, if we find the probability to be high enough, we can say that we are in the presence of a changepoint.

$$P(r_t = 0, X_{1:t}) = \sum_{r_{t-1}} P(r_{t-1}, X_{1:t-1})\, \pi_t^{(r)}\, H(r_{t-1}) \tag{8}$$
6. Calculate Evidence.

$$P(X_{1:t}) = \sum_{r_t} P(r_t, X_{1:t}) \tag{9}$$
7. Determine Run Length Distribution.
$$P(r_t \mid X_{1:t}) = \frac{P(r_t, X_{1:t})}{P(X_{1:t})} \tag{10}$$
8. Update the Parameters. For a concise explanation, we provide an example (intended to be non-limiting). At time $t = 1$, $r$ can be either 0 or 1, so we now have two sets of parameters; the second set, for which $r = 1$, is updated using the parameters corresponding to the previous time step for $r = 0$. This happens for all values of $r$ except $r = 0$, for which we start with the parameters from step 1.
$$\vec{\theta}_t^{\,r+1} = f\!\left(X_t, \vec{\theta}_{t-1}^{\,r}\right) \tag{11}$$
9. Return to Step 2.
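The steps above condense into a short software sketch. The following Python model is a minimal illustration only, assuming the Gaussian known-variance observation model of Equation (5) and the constant hazard $H = 1/\lambda$; the function name, parameter values, and test signal are assumptions for demonstration and do not reflect the fixed-point hardware implementation.

```python
# Minimal software sketch of steps 1-9 for the Gaussian known-variance model
# (Equations (1)-(11)). A constant hazard H = 1/lambda_ is assumed; parameter
# values are illustrative, not those of the hardware implementation.
import numpy as np

def bocpd(xs, sigma2=0.25, mu0=0.0, sigma0_2=1.0, lambda_=100.0):
    H = 1.0 / lambda_
    # Step 1: initialize. P(r_0 = 0) = 1; one set of prior parameters.
    R = np.array([1.0])            # run-length weights, normalized each step
    mus = np.array([mu0])          # mu_0k for each run length k
    vars_ = np.array([sigma0_2])   # sigma_0k^2 for each run length k
    run_length_dists = []
    for x in xs:                   # Step 2: observe new datum x_t
        # Step 3: predictive probability pi_t^(r) ~ N(mu_0k, sigma^2 + sigma_0k^2)
        pred_var = sigma2 + vars_
        pi = np.exp(-0.5 * (x - mus) ** 2 / pred_var) / np.sqrt(2 * np.pi * pred_var)
        # Step 4: growth probabilities (Equation (7))
        growth = R * pi * (1.0 - H)
        # Step 5: changepoint probability (Equation (8))
        cp = np.sum(R * pi * H)
        R = np.concatenate(([cp], growth))
        # Steps 6-7: evidence and run-length distribution (Equations (9)-(10));
        # normalizing the joint weights yields P(r_t | x_{1:t}).
        R /= np.sum(R)
        run_length_dists.append(R)
        # Step 8: update parameters (Equation (5)); r = 0 restarts from the prior.
        new_mus = (sigma2 * mus + vars_ * x) / (sigma2 + vars_)
        new_vars = sigma2 * vars_ / (sigma2 + vars_)
        mus = np.concatenate(([mu0], new_mus))
        vars_ = np.concatenate(([sigma0_2], new_vars))
        # Step 9: return to step 2 (loop).
    return run_length_dists

# Usage: a mean shift at t = 50 shows up as mass collapsing to short run lengths.
rng = np.random.default_rng(1)
xs = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(3, 0.5, 50)])
dists = bocpd(xs)
print("most probable run length at t=60:", np.argmax(dists[59]))
```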
Probabilistic Computation
An architecture that operates natively on probability data structures is advantageous for effective and efficient processing of data using probabilistic techniques. Stochastic computation employs unary representation, in contrast to the binary representation employed in modern computer systems. Primarily known for mapping complex computations onto simple bit-wise logic operations and for an error-tolerant encoding, stochastic computation failed to gain much popularity during its earlier years because of relatively lower accuracy and longer computation time compared to conventional methodologies. Capitalizing on its strengths, stochastic computation has since been successfully implemented for neural computation, image processing, and low-density parity-check (LDPC) decoding. More recently, probabilistic computation and address-event representation (AER) have also been employed.
Computation on unary stochastic representations of data maps to simpler logic functions. For example, with uncorrelated probabilistic unary streams, multiplication can be implemented with an AND gate in a unipolar coding format, where a 0 in the stream represents the value 0 and a 1 represents the value 1, so that the stream mean encodes a number in [0,1]. An example of this computation follows.
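The sketch below is a minimal software illustration of this AND-gate multiplication, assuming ideal uncorrelated streams; the stream length and input probabilities are illustrative values.

```python
# Sketch of unipolar stochastic multiplication: two uncorrelated Bernoulli
# streams encoding p1 and p2 are ANDed bit-wise; the mean of the output
# stream estimates the product p1 * p2.
import numpy as np

rng = np.random.default_rng(0)
N = 8192                          # stream length (precision vs. latency)
p1, p2 = 0.5, 0.25

s1 = rng.random(N) < p1           # Bernoulli(p1) stream
s2 = rng.random(N) < p2           # Bernoulli(p2) stream, uncorrelated with s1
product = np.logical_and(s1, s2)  # the "AND gate"

print("estimated product:", product.mean())   # approximately 0.125
```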
These streams can be decoded back to their initial encoding by integrating the streams and normalizing to their range. For a digital design with binary inputs, this can be done with a counter. The sum of the Bernoulli trials in the stream, normalized to the probability space [0,1], gives rise to a binomially distributed random variable,

$$\hat{p} = \frac{1}{N} \sum_{i=1}^{N} I(x_i = 1)$$

where $I(\cdot)$ is the indicator function, $N$ is the sample size, and $x_i$ is a Bernoulli-distributed sample in the stream.
Analyzing this distribution reveals an inversely proportional relationship between the variance $\sigma^2$ and the sample size $N$. Mathematically,

$$\sigma^2 = \frac{p(1-p)}{N}$$

where $p$ is the probability of observing a 1 in each sample, that is, the mean of the Bernoulli trial.
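A quick empirical check of this scaling is sketched below; the values of p, N, and the number of trials are illustrative.

```python
# Empirical check of sigma^2 = p(1-p)/N: decoding a Bernoulli stream with a
# counter over N samples gives an estimator whose variance shrinks
# inversely with the stream length N.
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
for N in (256, 1024, 4096):
    estimates = (rng.random((1000, N)) < p).mean(axis=1)  # 1000 decoded streams
    print(f"N={N:5d}  empirical var={estimates.var():.2e}  "
          f"p(1-p)/N={p * (1 - p) / N:.2e}")
```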
With variance $\sigma^2$ directly related to the precision of the computation, there exists a trade-off between computation precision and latency: computing over longer stochastic streams yields lower variance in the outputs, and thus more precise results. In the computing architecture presented in embodiments of this disclosure, state values are stored in registers in binary base due to the compactness of that representation. The distinction of this architecture with respect to others is that, when performing any kind of computation, these state values are translated into a probabilistic representation through the use of uniform random number generators (URNGs). Computations in this domain offer robustness against bit flips caused by electromagnetic (EM) noise, a reduction in silicon area in comparison to conventional computational units such as adders, multipliers, and dividers, and the ability to control power consumption through management of accuracy. The domain translation of a state variable is performed using a digital comparator, the state variable's value encoded in binary base, and a URNG.
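A minimal software model of this comparator-based translation follows; the bit width, encoded value, and cycle count are illustrative, and the software random generator merely stands in for the hardware URNG.

```python
# Sketch of binary-to-stochastic domain translation: a state value stored in
# binary is compared each clock cycle against a uniform random number; the
# comparator output is a Bernoulli stream whose mean encodes value / 2**n_bits.
import numpy as np

def encode_stream(value, n_bits, n_cycles, rng):
    """Emit a unary stream: 1 whenever the URNG sample is below the stored value."""
    urng = rng.integers(0, 2 ** n_bits, size=n_cycles)  # uniform on [0, 2^n)
    return urng < value                                 # the digital comparator

rng = np.random.default_rng(0)
stream = encode_stream(value=64, n_bits=8, n_cycles=8192, rng=rng)
print("encoded probability:", stream.mean())   # approximately 64/256 = 0.25
```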
A second constraint that probabilistic representation exhibits is that all encoded values must represent probabilities, which are bounded to [0,1]. This requires careful reformulation of problems so that they can be solved stochastically. The same numeric value can be represented using different probabilities, depending on the maximum number of bits $n$ used to represent binary numbers. For instance, 64 can be the Bernoulli probability $p = 64/256$ for $n = 8$, or $p = 64/512$ for $n = 9$, and so on.
Stochastic Architecture for BOCPD Image Segmentation
The change point detection algorithm is employed in a brain-inspired system architecture for real-time, high-velocity BIGDATA processing that originates in large-format tiled imaging arrays used in wide-area motion imagery for ubiquitous surveillance. High performance and high throughput are achieved through approximate computing and fixed-point arithmetic in a variable-precision (6-bit to 18-bit) architecture. The architecture implements a variety of processing algorithm classes ranging from convolutional networks (ConvNets) to linear and non-linear morphological processing, probabilistic inference using exact and approximate Bayesian methods, and ConvNet-based classification. The processing pipeline is implemented entirely using event-based neuromorphic and stochastic computational primitives.
In a particular non-limiting embodiment (used to illustrate the present disclosure), the BOCPD architecture may be useful for video processing, and more particularly for computing whether a pixel in a video sequence belongs to the background or to a moving object (image segmentation). Each pixel may be processed with only information from its intensity and its time history.
Core Architecture
In real online implementations of the algorithm, the runlength $r_t$ cannot be allowed to diverge to infinity, and a bound may be set. In the present CPD implementation, $r_t$ can be assigned a maximum value. We will call $N_{win}$ the maximum time window into the past that we can observe. At time $t = N_{win}-1$, a set of $N_{win}-1$ parameters $\vec{\phi}_k$ has already been generated and is available, in addition to the default parameters $\vec{\phi}_0$ (bias parameters). Since up to $N_{win}$ values are stored for the runlength distribution $P(r_t \mid X_{1:t} = x_{1:t})$, with the next sample $x_{t=N_{win}}$ the parameters $\vec{\phi}_k = (\sigma_{0k}^2, \mu_{0k})$ for $k = 1, \ldots, N_{win}-1$ are updated, and the ones for $k = N_{win}$ are not generated. The probability value that would have been assigned to $r_t = N_{win}$ is instead added to the probability calculated for $r_t = N_{win}-1$; that is, $P(r_t = N_{win}-1 \mid \vec{\phi}_{(N_{win}-1)}) + P(r_t = N_{win} \mid \vec{\phi}_{(N_{win})})$ is assigned to $P(r_t = N_{win}-1 \mid \vec{\phi}_{(N_{win}-1)})$. As such, $r_t$ is always limited to a constant number of $N_{win}$ values over time, and $P(r_t = N_{win}-1 \mid \vec{\phi}_{(N_{win}-1)})$ can be interpreted as the probability not only for $r_t = N_{win}-1$ but for all $r_t \geq N_{win}-1$.
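The folding of the overflow bin can be sketched in a few lines; the function name and window size below are illustrative, mirroring (in software) the role of the register bank described next.

```python
# Sketch of the run-length truncation described above: the distribution is
# kept to Nwin bins, and probability mass that would grow past the last bin
# is folded into it, so that bin represents P(r_t >= Nwin - 1).
import numpy as np

def truncate_growth(joint_new, Nwin):
    """joint_new has Nwin + 1 entries (r = 0 .. Nwin); fold the overflow bin."""
    if joint_new.size > Nwin:
        joint_new[Nwin - 1] += joint_new[Nwin]   # P(r = Nwin-1) += P(r = Nwin)
        joint_new = joint_new[:Nwin]             # keep a constant Nwin values
    return joint_new

# Usage: a distribution that has just grown to Nwin + 1 = 9 entries.
R = np.full(9, 1.0 / 9.0)
print(truncate_growth(R, Nwin=8))   # 8 entries; the last holds the folded mass
```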
We now present the architecture 10 designed for this algorithm, with reference to the accompanying drawings.
With the arrival of a new sample at a sample input 50, all of the values in the Ro registers 20 need to be probabilistically encoded. As such, the architecture includes an encoder 30. As described above, a set of comparators 32 and URNGs 34 is used to generate the stochastic streams. Even though there are Nwin+1 register values for Ro and Wo, only two URNGs 34 may be needed by the depicted encoder 30, because computations do not intersect between any two streams Ro(i) and Ro(j) with i≠j. The total number of URNGs used in the depicted architecture 10 is (2n+1).
To the right of the Ro registers 20 are Ri registers 80, Ri(0), . . . , Ri(Nwin−1), and a Wi register 82, which are configured to store the updated probability distribution $P(r_t \mid X_{1:t} = x_{1:t})$ by the end of the chosen computational time. Every time a new sample $x_t$ arrives, the Ro registers 20 and Wo register 22 are loaded with the distribution $P(r_{t-1} \mid X_{1:t-1} = x_{1:t-1})$ stored by the Ri registers 80 and the Wi register 82, and the Ri registers 80 and Wi register 82 are set to zero. The Ri registers 80 and Wi register 82 are configured to count the ones arriving on their respective input stochastic streams. The incoming ones and zeros for these counting registers 80 are Bernoulli samples with probabilities $p(i)$, $i \in [0, \ldots, N_{win}-1]$, equal to the distribution values $P(r_t = i \mid X_{1:t} = x_{1:t})$. After the chosen computation time, the resulting counts in registers $\vec{R}_i$ 80 represent a scaled estimate of the $P(r_t \mid X_{1:t} = x_{1:t})$ distribution. This resulting distribution will not necessarily be normalized, because of the nature of this type of stochastic computation: each value allocated in registers $\vec{R}_i$ 80 is the result of an estimator that possesses a mean and a variance, and it is because of that variance that the counts might not sum to the number of clock cycles chosen for the computation. Regarding the mean, it can be shown that the true means of the streams incoming to registers Ri 80 are indeed the distribution $P(r_t \mid X_{1:t} = x_{1:t})$.
With a similar structure, the parameters from Equation (5) are updated. In fact, notice that the expression for $\sigma_{0k}^{2\prime}$ does not depend on the value of the current sample $x_t$ for any $k$; as a result, these values are constants, and no memory need be allocated for them.
Returning to the architecture, the Gaussian predictive probability is evaluated with a Bernstein polynomial approximation, whose basis probabilities are generated directly from the stochastic stream encoding the input. By computing a weighted sum of these probabilities, we can approximate the desired function:

$$f(x) \approx \sum_{i=0}^{n} w_i B_{i,n}(x), \qquad B_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i}$$
In order to make this approximation stochastically feasible, the weights $w_i$ are values in [0, 1]. An exemplary architecture for this Bernstein polynomial function approximation is depicted in the accompanying drawings.
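As a software illustration of this approximation (not the hardware generator), the sketch below fits half of a Gaussian bell with degree n = 5 Bernstein polynomials, the coefficient count used in the experimental setup; the target width and the least-squares-then-clip fitting procedure are assumptions for demonstration.

```python
# Sketch of the Bernstein approximation f(x) ~ sum_i w_i B_{i,n}(x), with the
# weights w_i clipped to [0, 1] so they remain stochastically representable.
import numpy as np
from math import comb

def bernstein_basis(x, n):
    """Columns are B_{i,n}(x) = C(n,i) x^i (1-x)^(n-i) for i = 0..n."""
    x = np.asarray(x)[:, None]
    i = np.arange(n + 1)[None, :]
    c = np.array([comb(n, k) for k in range(n + 1)])[None, :]
    return c * x ** i * (1 - x) ** (n - i)

n = 5
x = np.linspace(0.0, 1.0, 200)
target = np.exp(-0.5 * (x / 0.35) ** 2)   # half of a Gaussian bell on [0, 1]

# Least-squares fit, then clip the weights into [0, 1] for feasibility.
B = bernstein_basis(x, n)
w, *_ = np.linalg.lstsq(B, target, rcond=None)
w = np.clip(w, 0.0, 1.0)

approx = B @ w
print("max abs error:", np.abs(approx - target).max())
```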
Preprocessing Unit
As mentioned above, an objective of the change point detection algorithm is to find changes in the parameters of the underlying distribution from which the samples $x_t$ are drawn, by interpreting the run-length probability distribution $P(r_t \mid X_{1:t} = x_{1:t})$. Due to the choice of a Gaussian distribution for the right-hand expression in Equation (4), we concluded that the predictive term in Equation (1) is Gaussian. Now, if an incoming sample $x_t$ is very different from the range of values received in the past, this could mean that there is an outlier or that a change in the parameters has actually taken place. In either case, when sample $x_t$ is very far from the mean values $\mu_{0k}$ and falls where the Gaussian distribution has a very low probability value, the precision of this stochastic implementation is impaired.
A possible way of addressing this problem is to normalize the samples received over time by tracking a slowly varying mean for these samples and amplifying the difference between this mean and the current sample value $x_t$. This can be thought of as a high-pass filter with a cut-off frequency very close to 0 Hz and a gain greater than one. Because of the nature of stochastic processing, the stochastic stream sent to the Bernstein block approximator can never encode Bernoulli probabilities greater than 1. Consequently, when approximating half of the Gaussian bell with the Bernstein polynomials and applying the filtering and amplification to $x_t$, if the approximation of the Gaussian bell has a minimum value in the [0,1] domain greater than $1/N_{comp}$, where $N_{comp}$ is the number of clock cycles chosen for processing each sample, then the registers Ri will not contain all zeros at the end of the computational process time. By not containing all zeros, the architecture can recover easily from this situation. An exemplary embodiment of a stochastic pre-processing unit according to the present disclosure is shown in the accompanying drawings.
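The pre-processing idea can be modeled in software as follows; the filter coefficient, gain, and output mapping are illustrative assumptions, not the hardware values.

```python
# Sketch of the pre-processing unit described above: track a slowly varying
# mean with a leaky integrator (a high-pass filter with cut-off near 0 Hz),
# amplify each sample's deviation from that mean, and clamp the result to
# [0, 1] so it stays encodable as a Bernoulli probability.
import numpy as np

def preprocess(xs, alpha=0.01, gain=4.0):
    """Return normalized samples in [0, 1]; 0.5 corresponds to 'at the mean'."""
    out = np.empty(len(xs), dtype=float)
    mean = xs[0]
    for t, x in enumerate(xs):
        mean += alpha * (x - mean)            # slow tracking of the sample mean
        out[t] = np.clip(0.5 + gain * (x - mean), 0.0, 1.0)
    return out

# Usage: a step change in the input level saturates the normalized output
# until the tracked mean catches up, making the change easy to detect.
rng = np.random.default_rng(2)
xs = np.concatenate([rng.normal(0.2, 0.02, 100), rng.normal(0.8, 0.02, 100)])
ys = preprocess(xs)
print("around the change:", np.round(ys[98:103], 2))
```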
An experimental setup using an embodiment of the presently disclosed architecture was implemented in a Xilinx Kintex-7 FPGA communicating with a host system through a USB3 interface and Matlab, featuring 14 CPD cores, each with its own respective pre-signal adjustment, allowing the processing of 120×160-pixel images at 10 fps. All of the parameters were fixed, including, without limitation, the number of Bernstein coefficients n=5, the processing time of 8192 clock cycles, and the time window Nwin=8. The design was written completely parametrically in VHDL, allowing the parameters to be changed and the architecture to be resynthesized. The disclosed stochastic BOCPD was implemented as part of an image processing pipeline performing background-foreground segmentation. Points of change were identified by analyzing the probability distribution of Equation (1) obtained for each new sample $x_t$. This system architecture proved capable of processing 160×120 raw pixel data in real time, running on a reconfigurable computing platform of five Xilinx Kintex-7 FPGAs. The image data was generated by standard cameras, and hence the value of the pixels was encoded using pulse density modulation (PDM) so that it was compatible with the computational infrastructure. A screenshot of results from the image processing pipeline, up to the segmentation and feature extraction stage, is shown in the accompanying drawings.
Although the present disclosure has been described with respect to one or more particular embodiments, it will be understood that other embodiments of the present disclosure may be made without departing from the spirit and scope of the present disclosure. Hence, the present disclosure is deemed limited only by the appended claims and the reasonable interpretation thereof.
This application is a 35 U.S.C. § 371 U.S. national entry of International Application PCT/US2018/040658, having an international filing date of Jul. 2, 2018, which claims the benefit of U.S. Provisional Application No. 62/528,023, filed Jun. 30, 2017; the contents of each of the aforementioned applications are herein incorporated by reference in their entirety.
This invention was made with government support under contract no. N000141010278 awarded by the Office of Naval Research, contract no. HR0011-13-C-0051 awarded by DARPA, and contract no. SCH-INT 1344772 awarded by the National Science Foundation. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2018/040658 | 7/2/2018 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2019/006474 | 1/3/2019 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20040015458 | Takeuchi et al. | Jan 2004 | A1 |
20080294970 | Gross et al. | Nov 2008 | A1 |
20090228238 | Mansinghka et al. | Sep 2009 | A1 |
20110229032 | Ranganathan | Sep 2011 | A1 |
20140081899 | Jonas et al. | Mar 2014 | A1 |
20140222653 | Takayasu et al. | Aug 2014 | A1 |
Number | Date | Country |
---|---|---|
2016102999 | Jun 2016 | WO |
Entry |
---|
A. Andreou et al., Bio-inspired System Architecture for Energy Efficient, BIGDATA Computing with Application to Wide Area Motion Imagery, VII Latin American Symposium on Circuits and Systems (LASCAS) Apr. 14, 2016 (Year: 2016). |
K. Sanni et al., FPGA Implementation of a Deep Belief Network Architecture for Character Recognition Using Stochastic Computation, 2015 49th Annual Conference on Information Sciences and Systems (CISS), 2015 (Year: 2015). |
R.P. Adams et al., Bayesian Online Changepoint Detection, arXiv:0710.3742v1 [stat.ML], 2007 (Year: 2007). |
J.L. Molin et al., FPGA Emulation of a Spike-Based Stochastic System for Real-time Image Dewarping, 2015 IEEE 58th International Midwest Symposium on Circuits and Systems (MWSCAS), 2015 (Year: 2015). |
T. Figlioia, A. A.G. Andreou, An FPGA multiprocessor architecture for Bayesian online change point detection using stochastic computation, Microprocessors and Microsystems 74, 2020 (Year: 2020). |
T. Figliolia, 2.5D Chiplet Architecture for Embedded Processing of High Velocity Streaming Data, Phd dissertation, Jan. 16, 2018 at https://jscholarship.library.jhu.edu/handle/1774.2/60968 (Year: 2018). |
Badarna, M., et al., “Change point detection in streams” University of Haifa 2009. |
Adams, R., et al., “Bayesian online changepoint detection” Cavendish Laboratory, Cambridge 2007. |
Number | Date | Country | |
---|---|---|---|
20200125979 A1 | Apr 2020 | US |
Number | Date | Country | |
---|---|---|---|
62528023 | Jun 2017 | US |