The present invention generally relates to data coding and more particularly relates to systems, devices and methods for sparse coding of data using a neural network processor.
Many types of signals can be well-approximated by a small subset of elements from an overcomplete dictionary. The process of choosing a good subset of dictionary elements from an overcomplete dictionary, along with the corresponding coefficients, to represent a signal is known as sparse approximation, sparse representation, or sparse coding. Sparse coding is a difficult non-convex optimization problem that is at the center of much research in mathematics and signal processing. Neurophysiological data obtained from the brain cortex has shown that the human brain in effect performs sparse coding of stimuli in a parallel manner using a large number of interconnected neurons. In this context, a sparse code refers to a representation in which a relatively small number of neurons are active while the majority of neurons in a population are inactive or show low activity.
Sparse coding has been used in recent years as a powerful mathematical tool for the processing of images, video, and sound, see, e.g., [1], [2]. In particular, it allows the generation of shift-invariant representations of a given input signal with good preservation of transients and other non-stationary elements. Most of the proposed approaches to generating sparse representations use a greedy method, such as the so-called matching pursuit (MP), or one of its derivatives. However, greedy approaches, which are mathematical abstractions of the brain function, are very difficult to implement in parallel. More recently, sparse code generators based on neural circuitry have been disclosed, see for example, article [3] and U.S. Pat. No. 7,783,459 issued to Rozell et al., which is referred to hereinafter as the '459 patent, both of which are incorporated herein by reference, and also [4], [5], and [6]. These neural-based architectures have the potential to better correspond to sparse coding in the brain, are much easier to implement, and are less computationally expensive than the MP algorithm or other greedy methods.
More specifically, the '459 patent, which is incorporated herein by reference, teaches a neural network type system that implements a Local Competitive Algorithm (LCA) approach to image and video processing using Gabor kernels as dictionary elements. The LCA aims to encode a given signal with the least number of active neurons possible. In this approach, an input signal representing an image is decomposed into a plurality of signals, each matched to a specific Gabor kernel, and is then passed to a plurality of interconnected nodes. Each node has a thresholding element at its output and is cross-coupled to other nodes to dampen their excitation levels in proportion to its own output. After a settling time, the LCA-implementing network settles to a state where only a relatively small number of nodes are active, i.e. generate non-zero outputs that provide the desired coefficients in the sparse representation of the input data.
The inventors of the present invention have recognized that the LCA-based coder of Rozell, which is designed primarily for image and video processing, has deficiencies related to its flexibility, particularly when other types of signals are to be coded. For example, in the LCA-based coder of Rozell each sparse representation corresponds to one static image or one frame of a video signal, so that the LCA in the disclosed form is not directly applicable to adaptive coding of time-dependent signals such as audio signals, wherein the signal varies with time within each frame of the coder. Another deficiency of the LCA-based coder disclosed by Rozell relates to its rather inflexible optimization criterion. The sparse representation generated by the LCA minimizes the Mean Squared Error (MSE) between the reconstructed and original signals. In some cases, however, the minimization of the MSE is not the optimal approach. For example, audio coding often benefits from perceptual optimization, wherein perceptual differences between coded signals and original signals are of greater importance than the MSE. The same may be true in image processing as well.
Thus, it is an object of the present invention to address at least some of the aforementioned deficiencies of the prior art by providing an adaptive coder that utilizes parallel data processing and is applicable for sparsely coding time-dependent data with flexibly defined optimization criteria.
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. Multiple references will be identified by a pair of brackets containing more than one designator, for example, [2, 3]. A listing of references including the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention provides a method and apparatus for sparsely representing a signal using a network of interconnected competing nodes, wherein one or more parameters of the network are adapted based on a desired shaping of the signal or a representation error thereof.
One aspect of the present invention provides an apparatus for representing an input signal in terms of one or more dictionary elements from a plurality of dictionary elements. The apparatus comprises a plurality of interconnected nodes individually associated with the plurality of dictionary elements, wherein each node has a receptive field that is based upon one of the dictionary elements and defines node sensitivity to the input signal, and wherein each node comprises a thresholding element and an internal signal source for producing an internal node signal responsive to a node excitation signal and weighted outputs of at least some of the other nodes. The apparatus further comprises a projection unit for producing the node excitation signals representing projections of the input signal upon the receptive fields of the nodes. The thresholding elements of the nodes are provided with node-dependent threshold values that differ from each other for at least some of the nodes in accordance with a pre-determined signal shaping characteristic.
One aspect of the present invention provides a system for representing an input signal in terms of one or more dictionary elements from a plurality of dictionary elements, comprising: a) a plurality of interconnected nodes associated with the plurality of dictionary elements, wherein each node is characterized by a receptive field that corresponds to one of the dictionary elements and comprises a thresholding element and an internal signal source for producing an internal node signal responsive to a node excitation signal and weighted outputs of at least some of the other nodes; and, b) a processor comprising a projection unit for computing the node excitation signals based on the input signal and receptive fields of the nodes, a weighting unit for applying weights to outputs of the nodes to generate the weighted outputs for providing to other nodes, and a shaping unit for applying perceptual weighting to at least one of: the receptive fields of the nodes, the weighting coefficients, and thresholds of the thresholding elements.
One aspect of the present invention provides a method for sparsely encoding a signal using an apparatus implementing a locally competitive algorithm, wherein a plurality of interconnected nodes receive projections of the input signal and wherein each of the nodes generates an output once an internal potential thereof reaches a threshold, the method comprising: a) obtaining a node-dependent threshold value for each of the nodes based upon a pre-determined shaping characteristic, and b) setting different thresholds for different nodes for at least some of the plurality of nodes in accordance with the node-dependent threshold values obtained in step (a).
One aspect of the present invention provides a method for sparsely encoding a signal wherein a plurality of interconnected nodes receive projections of the input signal and wherein each of the nodes generates an output once an internal potential thereof reaches a threshold, the method comprising: generating the projections of the input signal using each of a plurality of dictionary elements, said plurality of dictionary elements comprising P time shifted copies of K time-dependent kernels that are spread in time over one frame of the input signal, each such kernel corresponding to a different frequency fk, wherein integers K and P are each greater than 1.
One aspect of the present invention provides a Perceptual Local Competitive Algorithm (PLCA) that takes into account perceptual differences between signals, which in application to audio signals accounts for, for example, the absolute threshold of hearing and/or auditory masking. When perceptual difference measures are used, the PLCA disclosed herein is shown to have a faster convergence than the LCA for audio signals, and is robust with respect to quantization of the encoded signal. In a more general sense, the PLCA provides a generic framework whose applications are not limited to audio and include other types of signals, such as video and images, with correspondingly chosen perceptual, or more generally, signal shaping measures. The invention is not limited to any specific type of overcomplete dictionary and may be practiced using various types of kernel functions as suitable for particular applications and signal types. It enables selective emphasis to be given to parts of the signal as specified in any desired domain, including but not limited to the frequency domain, time domain, perceptual domain, and any combination thereof. The invention is not restricted to any specific implementation of the nodes representing neurons.
The invention will be described in greater detail with reference to the accompanying drawings which represent preferred embodiments thereof, in which like elements are indicated with like reference numerals, and wherein:
a is a block diagram of a prior art LCA system including a plurality of interconnected nodes;
b is a diagram representing schematics of one node of the prior art LCA system;
a is a schematic block diagram of a PLCA coder in accordance with an embodiment of the present invention;
b is a schematic diagram of a node of the PLCA coder of
In the following description of the exemplary embodiments of the present invention, reference is made to the accompanying drawings which form a part hereof, and which show by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. Reference herein to any embodiment means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
In the context of this specification, the term “computing” is used generally to mean generating an output based on one or more inputs using digital hardware, analog hardware, or a combination thereof, and is not limited to operations performed by a digital computer. Similarly, the term ‘processor’ when used with reference to hardware, may encompass digital and analog hardware or a combination thereof. The term processor may also refer to a functional unit or module implemented in software or firmware using a shared hardware processor. The terms ‘output’ and ‘input’ encompass analog and digital electromagnetic signals that may represent data sequences and single values. The terms ‘data’ and ‘signal’ are used herein interchangeably. The terms ‘coupled’ and ‘connected’ are used interchangeably; these terms and their derivatives encompass direct connections and indirect connections using intervening elements, unless clearly stated otherwise.
Before providing a description of the preferred embodiments of the present invention, the prior art LCA-based neural network coder will be first briefly described, and terms and definitions introduced that will also be used further in the description of the exemplary embodiments of the present invention.
The LCA associates each node with an element φm of an overcomplete dictionary D, which is formed by a plurality {φm} of dictionary elements. The dictionary elements φm, which partially overlap and are also referred to herein as kernels, define in the prior art LCA the receptive fields of the associated nodes; the receptive fields act as input filters for the nodes, allowing only components of the input signal that match the respective receptive field to affect the node's state. When the LCA system is presented with an input image s(t), the collection of nodes evolves according to fixed dynamics and settles on a collective output {am(t)}, corresponding to the short-term average firing rate of the nodes. The goal of the LCA is to generate a sparse code for a signal, with preferably only a few non-zero elements am, so as to minimize the MSE, as defined mathematically by the following equation:
where the LCA-generated sparse representation of the input signal s is given by equation (1a),
ŝ=Σm am(t)φm (1a)
This sparse representation ŝ of the input signal is also referred to herein as the coded signal. Bold letters in equation (1) represent vectors. The elements am of the vector ‘a’, which contains the resulting sparse representation {am(t)}, are values read from the outputs of the nodes after the nodes in the network reach a steady state; they are also referred to as coding coefficients or simply coefficients. Furthermore, C(.) in equation (1) is the sparsity-inducing cost penalty, which is a function of the outputs ‘a’. The cost function C(.) can for example be represented by the L1-norm of neuron outputs; λ is a Lagrange multiplier.
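For clarity, and consistent with the definitions of ŝ, C(.), and λ given above, the cost function referred to as equation (1) can be read in the standard LCA form (a reconstruction based on the surrounding description, not a verbatim reproduction):

```latex
E(t) \;=\; \frac{1}{2}\,\bigl\lVert \mathbf{s}(t) - \hat{\mathbf{s}}(t) \bigr\rVert_2^{2} \;+\; \lambda \sum_{m} C\bigl(a_m(t)\bigr)
```

where the first term is the MSE between the original and reconstructed signals and the second term is the sparsity-inducing cost penalty.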
With reference to
When the system of
Dynamics of the LCA nodes, or neurons, 100 are expressed by a linear differential equation (2):
This differential equation is of the same form as that of the well-known continuous Hopfield network. Here um(t) is the internal potential of the mth node, which is also referred to herein as the internal node signal, and τ is the integration time constant. The node coupling coefficients Gm,n, which are also referred to herein as node coupling weights, and the excitation signal bm(t) for the mth node are given by equations (2a) and (2b):
Gm,n=(φm,φn), (2a)
bm(t)=(φm,s(t)). (2b)
The excitation signal bm(t) is defined by a projection of the input signal s(t) upon the node's receptive field φm. In matrix representation, the input signal s(t) is projected onto the kernels φm by computing ΦTs(t), where the matrix Φ is defined so that its columns are the kernels φm. The projections of s(t) onto φm are then applied as inputs to the nodes 100, inducing the internal node potentials um(t). Contributions from other nodes have a damping effect upon the internal node potentials.
The output am(t) of each node/neuron 100 is defined by a nonlinearity am(t)=T(um(t)), where T(.) is a thresholding function. Equations (3) and (4) define relations that exist between neuron outputs, internal potentials, and sparsity factor C(a):
Here, δ is the threshold value and controls the sparsity, i.e. the number of active neurons. When the internal potential um(t) of a given neuron 100 crosses the threshold defined in Eq. (4), the neuron becomes active, i.e. it produces a non-zero output |am(t)|>0. Neurons whose internal potentials are below the threshold are inactive and do not produce any output.
The thresholding function T(.) can be sigmoidal or can be a hard thresholding function, among others. Hereinafter, embodiments utilizing a hard thresholding function of the type defined in Eq. (4) will be described by way of example, and also because we found that the network converges better with hard thresholding when applied to audio, although other suitable thresholding functions, including those described in the '459 patent, could also be used within the scope of the present invention.
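The node dynamics and hard-thresholding output rule described above can be sketched in a few lines of code. The following is a minimal, illustrative simulation of equations (2), (2a), and (2b) with hard thresholding; the matrix convention (kernels as columns of Φ), the step size, and the parameter values are assumptions of this sketch, not part of the disclosed apparatus:

```python
import numpy as np

def hard_threshold(u, delta):
    """Hard thresholding: a node is active (non-zero output) only
    when the magnitude of its internal potential exceeds delta."""
    return np.where(np.abs(u) > delta, u, 0.0)

def lca(s, Phi, delta=0.1, tau=10.0, n_steps=200, dt=1.0):
    """Minimal LCA sketch.

    Phi : (N, M) matrix whose columns are the kernels phi_m.
    s   : length-N input frame.
    Returns the sparse coefficient vector a of length M.
    """
    G = Phi.T @ Phi                 # node coupling weights, eq. (2a)
    np.fill_diagonal(G, 0.0)        # a node does not inhibit itself
    b = Phi.T @ s                   # node excitation signals, eq. (2b)
    u = np.zeros_like(b)            # internal node potentials u_m(t)
    for _ in range(n_steps):
        a = hard_threshold(u, delta)
        # leaky integration toward b, damped by the weighted outputs
        # of the active competing nodes
        u += (dt / tau) * (b - u - G @ a)
    return hard_threshold(u, delta)
```

With an orthonormal dictionary and an input equal to one kernel, only the matching node remains active after the network settles, illustrating the competitive sparsification.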
The LCA based system described in the '459 patent utilizes static Gabor kernels that do not evolve in time. One aspect of the present invention adapts the LCA to process time-dependent signals such as audio.
In one embodiment, a time-dependent input signal 11 is represented in terms of one or more dictionary elements that are selected from an overcomplete dictionary DPK composed of time-dependent elementary signals, or dictionary elements, wherein each of the dictionary elements is represented as a time-dependent signal or data φm(t). In one embodiment, the plurality of dictionary elements that forms the dictionary set DPK is composed of P time-shifted copies of K base dictionary elements gk(t), each gk(t) corresponding to a different center frequency fk, k=1, . . . , K, where K denotes the number of frequency channels in the representation. In the case of audio signals, these base dictionary elements gk(t) may be, for example, gammatone filter functions or gammachirp functions. The impulse responses of the gammatone filters approximate the actual responses observed in the human hearing system, and are given, for example, in our earlier U.S. Patent Application 2008/0219466 that is assigned to the assignee of the present application, and in an article [9], both of which are incorporated herein by reference for all purposes.
The dictionary elements φm(t) can be realized both in the analog and digital domains, for example as digital or analog filters or correlators, or in software. Considering digital implementations by way of example, the input signal s(t) is digitized and is in the form of a sequence of frames of length N each, with N being the number of signal samples in one frame. In one embodiment, the input signal s(t) is a sampled audio signal. Each dictionary element φm(t) may be viewed as the impulse response of a finite impulse response (FIR) filter and mathematically represented as a vector of length N. In the dictionary DPK, each base element gk has a length Ngk<N and is present in P time-shifted copies that are spread over the frame length N, preferably uniformly. In one embodiment, each consecutive copy of a base element gk is shifted by q samples from the previous copy, thereby sampling each frame of the input signal s(t) with a sampling period q=N/P, which is referred to herein as the hop size.
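The assembly of the dictionary DPK from time-shifted kernels can be illustrated as follows. This is an illustrative sketch only (kernels are stored as matrix columns, and copies running past the frame end are simply truncated); it is not a normative implementation:

```python
import numpy as np

def build_dictionary(kernels, N, P):
    """Assemble Phi for the dictionary D_PK: P time-shifted copies of
    each base kernel g_k, spread uniformly over a frame of N samples
    with hop size q = N // P. `kernels` is a list of K 1-D arrays,
    each of length N_gk < N. Returns an (N, K*P) matrix whose columns
    are the dictionary elements phi_m."""
    q = N // P                        # hop size
    cols = []
    for g in kernels:
        for p in range(P):
            col = np.zeros(N)
            start = p * q
            seg = g[: N - start]      # truncate at the frame boundary
            col[start:start + len(seg)] = seg
            cols.append(col)
    return np.stack(cols, axis=1)
```

For K=2 base kernels and P=4 shifts over an N=8 frame, this yields an 8x8 matrix whose columns are the M=K·P=8 dictionary elements.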
With reference to
In one embodiment of the present invention, the LCA system of a general architecture of
Each of these M projections bm(t) is passed as a node excitation signal to a respective node 100, with the total number of nodes receiving the excitation signals being M=K·P. After a network settling time, steady-state outputs am of those nodes 100 that remain active form a sparse representation of the input signal frame. Such a representation is illustrated in
The projection system 200 may be implemented in an analog domain, for example using a suitable bank of time-shifted gammatone filters as described hereinabove or other suitable time-shifted kernel functions gk(t). The projection system 200 may also be implemented digitally for example by storing elements of the projection matrix Φ in memory, and using a digital processor implementing a suitable matrix-vector multiplication algorithm. Mixed digital-analog implementations are also possible.
Computer simulation results demonstrating convergence of the afore-described LCA technique in dependence upon the hop size q, which represents temporal quantization, are described in [10], which is incorporated herein by reference. We found that the modified LCA technique is more robust than the MP to temporal quantization. The better performance of the modified LCA can be attributed to its self-organizing capacity (through lateral inhibitions) and global optimization behavior. A further advantage of the modified LCA over the MP is its low computational complexity and its amenability to VLSI implementation.
Another aspect of the present invention enables flexible shaping of the accuracy with which different components of the input signal s(t) are represented in the encoded signal ŝ(t). Although this shaping can take different forms within the scope of the present invention, the general approach of the present invention to such shaping will be described hereinbelow with reference to perceptual shaping of coded audio signals. However, the approach that will now be described with reference to exemplary embodiments can also be applied to other types of shaping, such as shaping of coded images, either perceptual or otherwise, in LCA-type image and video processing, as well as error shaping in LCA coding of other types of signals.
An aspect of the present invention provides a method for sparsely encoding a signal using an apparatus implementing a locally competitive algorithm, wherein a plurality of interconnected nodes receive projections of the input signal and wherein each of the nodes generates an output once an internal potential thereof reaches a threshold. The method comprises the steps of a) obtaining a node-dependent threshold value for each of the nodes based upon a pre-determined shaping characteristic, and b) setting different thresholds for different nodes for at least some of the plurality of nodes in accordance with the node-dependent threshold values obtained in step (a).
In one embodiment of the method, the pre-determined shaping characteristic comprises perceptual sensitivity data related to perceptual significance of various components of the signal, and wherein step (a) comprises computing the node-dependent threshold values using the perceptual sensitivity data.
In one embodiment of the method, the pre-determined shaping characteristic comprises perceptual masking data, and wherein step (a) includes computing the threshold values in dependence upon the signal so as to account for perceptual masking of signal components by adjacent signal components.
In one embodiment of the method, the receptive field of each of the nodes comprises the dictionary element associated therewith that is modified based on the shaping characteristic.
In one embodiment of the method wherein the pre-determined shaping characteristic comprises perceptual masking data, the method comprises a step (c) of modifying each of the dictionary elements based on the pre-determined shaping characteristic to determine the receptive fields of the nodes. In one embodiment, step (c) comprises modifying each of the dictionary elements in dependence upon the signal. In one embodiment, step (c) comprises using perceptual masking data to modify each of the dictionary elements in dependence upon the signal. One embodiment of the method comprises using the receptive fields obtained in step (c) for computing the projections of the signal for receiving by the nodes, and for computing coupling coefficients characterizing competitive coupling between the nodes.
The prior art LCA, as disclosed in the '459 patent, provides a signal approximation that is optimal in a mathematical sense, i.e. it minimizes the MSE between the original and the coded signals. However, in audio coding, as well as in image and video coding, a coder that minimizes a reconstruction error as perceived by a human is preferable over a coder that minimizes the mean-square error. In the case of audio signals, the human ear perceives sounds differently at different frequencies, which is reflected in the frequency dependence of the so-called absolute threshold of hearing. Furthermore, the human ear may not perceive an artifact in the audio signal when a strong sound component is present in its vicinity in the time-frequency plane, a phenomenon known as auditory masking. Therefore, a modified LCA that uses a perceptual metric in generating the sparse signal representation may provide a better reconstruction quality of the audio signal at a lower bitrate.
Embodiments utilizing a perceptual local competitive algorithm (PLCA) in accordance with aspects of the present invention are described hereinbelow with reference to block diagrams shown in
Furthermore, the term ‘PLCA’ is not limited to perceptual coding, but is used herein to refer to any modification of the prior art LCA that incorporates shaping of the coded signal in dependence on a pre-determined shaping characteristic or criterion.
Referring first to
First, we describe the mathematical foundations of a PLCA-based coder that generates a sparse signal representation for a given time-invariant shaping filter, which shapes the signal coding error e=(s−ŝ) in a desired way. Denoting the impulse response of the desired error-shaping filter w(n), one embodiment of the PLCA coder 10 is constructed in such a way that it minimizes the error function defined by equation (5):
In one embodiment, by convolving the error e between the input signal s and the reconstructed signal ŝ with the shaping filter w(n), we perceptually reshape the spectrum of the error.
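Consistent with this convolution interpretation, the shaped error function of equation (5) can be read as the energy of the filtered residual together with the sparsity penalty; the following is a reconstruction under that reading, not a verbatim reproduction of the original equation:

```latex
E_p(t) \;=\; \frac{1}{2}\,\bigl\lVert \mathbf{W}\bigl(\mathbf{s}(t) - \hat{\mathbf{s}}(t)\bigr) \bigr\rVert_2^{2} \;+\; \lambda \sum_{m} C\bigl(a_m(t)\bigr)
```

where W is the Toeplitz matrix of the shaping filter w(n) introduced in equation (9).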
Equations (6) describe the dynamics of a desired neural network minimizing the perceptually shaped error given by equation (5):
Details of the derivation of these equations can be found in [10], which is incorporated herein by reference. The new node excitation signal βm and node synaptic weights Γm,n are given by the following equations:
Γm,n=(λm,φn), (7a)
βm(t)=(λm,s(t)). (7b)
Here, λm represents the new receptive fields of the nodes 400, which are modified in accordance with the desired shaping filter w(n). The new projection matrix Λ, which has the new receptive fields λm as its columns, is defined by the following equation (8):
Λ=(W·WT)·Φ, (8)
where the superscript ‘T’ denotes matrix transpose, and the shaping matrix W is a Toeplitz filter matrix that is given by equation (9):
Columns of the shaping matrix W are time-shifted copies of the impulse response (IR) of the shaping filter w(n), so that Wi,j=w(i−j).
Matrix Φ is formed of the dictionary elements φn, for example as represented in
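Equations (8) and (9) can be illustrated with a short sketch. The dense Toeplitz construction below is for clarity only, and it assumes the kernels are stored as columns of Φ; function names are illustrative:

```python
import numpy as np

def shaping_matrix(w, N):
    """Toeplitz filter matrix of eq. (9): W[i, j] = w(i - j), i.e.
    each column is a time-shifted copy of the impulse response w(n)."""
    W = np.zeros((N, N))
    for j in range(N):
        for k in range(len(w)):
            if j + k < N:
                W[j + k, j] = w[k]
    return W

def modified_receptive_fields(w, Phi):
    """Eq. (8): Lambda = (W @ W.T) @ Phi. With the kernels phi_m as
    columns of Phi, each column of Lambda is a shaped receptive
    field lambda_m."""
    W = shaping_matrix(np.asarray(w), Phi.shape[0])
    return (W @ W.T) @ Phi
```

For the trivial shaping filter w(n) equal to a unit impulse, W is the identity matrix and Λ reduces to Φ, recovering the unshaped LCA.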
Contrary to the conventional LCA, which utilizes substantially the same output threshold value δ for all nodes in the relationship (4) between the internal node signal um(t) and the node's output am, the output thresholds of the nodes 400 in the PLCA 10 are node-dependent. In one embodiment, these node-dependent threshold values vm are weighted in proportion to the frequency response W(f) of the shaping filter w(n), so that the threshold value for the mth node may be computed using the following equation (10):
vm=δ0·W(fk) (10)
wherein fk is the channel frequency of the dictionary element φm that is associated with the mth node 400, and δ0 is a proportionality constant whose value defines the sparsity of the resulting signal representation, i.e. the number of dictionary elements used in the representation, which is given by the number of active nodes.
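The node-dependent thresholds of equation (10) amount to scaling a single sparsity constant by the shaping filter's frequency response at each node's channel frequency. A small sketch (function and argument names are illustrative assumptions):

```python
import numpy as np

def node_thresholds(delta0, W_of_f, channel_freqs, P):
    """Eq. (10): v_m = delta0 * W(f_k). Each of the K channel
    frequencies contributes P time-shifted nodes, all of which
    share the same threshold value."""
    per_channel = delta0 * np.array([W_of_f(f) for f in channel_freqs])
    return np.repeat(per_channel, P)   # length K*P, one value per node
```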
When it is desirable to have the same number of active neurons when using signal shaping with the PLCA as with the conventional LCA without shaping for the same input signal s, the threshold of a given neuron m in the PLCA may be elevated or reduced based on how much the spectral characteristic W(f) of the shaping filter, which is defined by the Fourier transform of the shaping filter IR w(n), amplifies the energy of the signal s at the frequency fk that is associated with the mth neuron.
Time-dependent accuracy and signal shaping can be implemented within the aforedescribed framework. In one embodiment, this includes using frame-dependent shaping filters w(n) that are allowed to vary from one frame of the input signal to another. It may also be convenient to divide each coding frame of the input signal s(t) of length N into L smaller blocks of length Nl, so that N=L·Nl, and define a shaping filter wl(n) separately, but not necessarily independently, for each such block. Here, subscript l=1, . . . , L denotes successive blocks within a coding frame. In this case, the shaping matrix W for one length-N coding frame of the input signal s(t) may take the quasi-diagonal form,
wherein all the elements are zeros except for a diagonal band that is formed of block shaping matrices Wl of equation (12), which are of the same form as the shaping matrix of Eq. 9, but defined individually over windows of length Nl.
By way of example, L=10, Nl=2048, and N=20480.
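The quasi-diagonal structure of equations (11) and (12) can be assembled as follows; again an illustrative sketch, reusing a per-block Toeplitz construction:

```python
import numpy as np

def block_toeplitz(w, Nl):
    """Block shaping matrix W_l of eq. (12): W_l[i, j] = w_l(i - j),
    defined over a window of length Nl."""
    Wl = np.zeros((Nl, Nl))
    for j in range(Nl):
        for k in range(len(w)):
            if j + k < Nl:
                Wl[j + k, j] = w[k]
    return Wl

def quasi_diagonal_shaping_matrix(block_irs, Nl):
    """Quasi-diagonal matrix of eq. (11): zeros everywhere except a
    diagonal band of L block matrices W_l, one per length-Nl block,
    so that each block of the coding frame gets its own filter."""
    L = len(block_irs)
    W = np.zeros((L * Nl, L * Nl))
    for l, w in enumerate(block_irs):
        W[l * Nl:(l + 1) * Nl, l * Nl:(l + 1) * Nl] = block_toeplitz(w, Nl)
    return W
```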
It can be shown that a neural network defined by equations 6-8, 11, 12 minimizes a weighted error function Ep given by equation (13).
Referring again to
With reference to
Referring back to
This digital signal is first received by a projection unit 310, whose function is similar to that of the projection system 200 of the LCA system of
The CP 300 further includes a weighting unit 320 for applying weights, also referred to herein as the node coupling coefficients, to outputs am(t) 111 of the nodes 400, so as to generate the weighted outputs for providing to other nodes 400, as indicated by arrows 321. A shaping unit 340 stores a pre-determined signal shaping characteristic, and provides threshold values vm, or values indicative thereof, to the thresholding elements 430 of the nodes 400 as indicated with arrows 331, and optionally provides signal shaping data based thereupon to at least one of the units 310, 320, and 330, as indicated in
In one embodiment, the signal shaping characteristic that is stored by the shaping unit 340 relates to the absolute threshold of hearing of a human ear. The absolute threshold of hearing characterizes the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment [7]. The absolute threshold of hearing, Θ(f) in dB, is well approximated by the following formula:
Θ(f)=3.64·(f/1000)^−0.8−6.5·exp[−0.6·(f/1000−3.3)^2]+10^−3·(f/1000)^4. (14)
The absolute threshold of hearing could be interpreted as the maximum allowable energy level for coding distortions introduced in the frequency domain and is depicted in
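Equation (14) translates directly into code; the function below evaluates the threshold in dB SPL for a pure-tone frequency given in Hz:

```python
import math

def absolute_threshold_of_hearing(f):
    """Eq. (14): absolute threshold of hearing, in dB SPL, for a
    pure tone of frequency f (in Hz)."""
    x = f / 1000.0  # frequency expressed in kHz
    return (3.64 * x ** -0.8
            - 6.5 * math.exp(-0.6 * (x - 3.3) ** 2)
            + 1e-3 * x ** 4)
```

The resulting curve has its minimum in the region of greatest ear sensitivity, roughly 3 to 4 kHz, and rises steeply toward both ends of the audible range.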
In one embodiment, the spectrum Θ(f) of the absolute threshold of hearing may be used to design the signal shaping FIR filter with the impulse response w(n) yielding the filter spectral profile W(fk)=Θ(fk), for example using the frequency sampling method as known in the art. The values w(n) can then be used to compute the projection matrix Λ based on the dictionary matrix Φ in accordance with equation (8), wherein columns of Λ define the modified receptive fields λm of the nodes 400. This matrix Λ may be provided to the projection unit 310 for storing therein, and used in the generation of the node excitation signals βm in accordance with equation 7(b) as described hereinabove.
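The frequency-sampling design step mentioned above can be sketched with SciPy's `firwin2`; the dB-to-linear conversion, normalization, and tap count below are illustrative assumptions, not values prescribed by the embodiment:

```python
import numpy as np
from scipy.signal import firwin2

def design_shaping_filter(profile_db, freqs_hz, fs, numtaps=65):
    """Design an FIR impulse response w(n) whose magnitude response
    approximates a desired spectral profile W(f) given in dB, e.g.
    the absolute threshold of hearing sampled at the channel
    frequencies. `freqs_hz` must start at 0 and end at fs/2."""
    gains = 10.0 ** (np.asarray(profile_db, dtype=float) / 20.0)  # dB -> linear
    gains = gains / gains.max()                 # normalize peak gain to 1
    return firwin2(numtaps, np.asarray(freqs_hz) / (fs / 2.0), gains)
```

A flat (0 dB) profile yields a near-ideal all-pass impulse response, which is a convenient sanity check for the design routine.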
In one embodiment, the projection matrix Λ is further used to compute the weighting coefficients Γm,n, m≠n, in accordance with equation 7(a), which can be stored in the weighting unit 320 and applied to the node outputs an as they are fed back to the inputs of other nodes 400.
The aforedescribed embodiment of the PLCA coder 10 utilizes a constant signal or accuracy shaping characteristic, which could be stored in on-board memory of the coder 10, for example in the form of the corresponding spectral characteristic W(fk) of the shaping filter, and which does not change with time and is independent of the input audio signal s(t).
In other embodiments, the coder 10 may utilize shaping characteristics that change with time and/or adapt to the input signal. One exemplary embodiment of this type relates to a PLCA implementation of auditory masking of the coded signal ŝ.
It has been shown in psychoacoustics that strong frequency components of a sound can mask adjacent weaker frequency components by making them inaudible to the human ear. It is therefore possible in audio coding to reconstruct those masked regions coarsely without loss of perceived quality. By way of example, the embodiments of the coder 10 that will now be described employ a variant of the MPEG Psychoacoustic Model 1 [7] to determine the simultaneous masking pattern in the frequency domain.
With reference to
where Θ[z(j),z(f)] is the masking threshold at frequency f (or equivalently, z(f) in the Bark frequency scale [7]) due to a masker component at frequency j (or equivalently, z(j) in the Bark domain). The scaled-inversed masking threshold Θi(f) at frequency i is found as follows:
Θi(f)=10^(6−Θ)
The memory 345 stores shaping characteristics that define the masking model used. By way of example, it may store, in digitized form, the Bark scale z(f) and the absolute threshold in quiet curve Θq.
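The stored curves can be generated from standard psychoacoustic approximations. The Zwicker Bark mapping and Terhardt's threshold-in-quiet formula below are common choices for z(f) and Θq, offered here as illustrative examples rather than the exact curves used by the coder 10.

```python
import numpy as np

def bark(f):
    """Zwicker's Bark-scale mapping of frequency f (Hz)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def threshold_in_quiet(f):
    """Terhardt's approximation of the absolute threshold of hearing
    (dB SPL) -- one common choice for the stored curve Theta_q."""
    khz = f / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * np.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)
```

Both curves would be sampled at the channel centre frequencies and stored in digitized form in the memory 345.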
Note that this scaled-inversed masking threshold Θi(f), which is also referred to herein as the spectral auditory mask, depends on the spectral profile and intensity of the input signal 11, also accounting for the absolute threshold of hearing. By way of example,
From this scaled-inversed masking threshold Θi(f), which is also referred to herein as the spectral auditory mask, a shaping filter generator 344 generates shaping FIR filters using, for example, the frequency sampling method. More specifically, for each audio block of length Nl, the shaping filter generator 344 generates the impulse response of a block shaping filter wl(n) that has a spectrum approximating Θi(f), with l being the audio block index. These perceptual block shaping filters wl(n) adaptively define the shaping filter matrix W, see equations (11) and (12), and are used by the shaping filter generator 344 to generate the projection matrix Λ and the weighting coefficients Γm,n as described hereinabove. In one embodiment, the shaping filter generator 344 also generates the threshold value scale factors for the nodes 400 using the scaled-inversed masking thresholds Θi(fk) for each block. Note that, for each frequency channel k, the l-th audio block may be sampled by a group of nodes 400 that are associated with gammatones gk(t) that fall in the respective time window of the l-th audio block. Accordingly, the shaping filter generator 344 provides the scaled-inversed masking thresholds Θi(fk) for each block as the threshold scaling factors to the nodes 400 of the respective group.
Note that the splitting of the coding frames 12 of the input signal s(t) into the smaller blocks as described hereinabove is helpful in at least some embodiments of the coder 10, as it enables the use of suitably long coding frames while limiting the size of the FFT processing. This splitting is, however, optional, and the splitter 341 may be omitted in some embodiments.
In the aforedescribed embodiment, the coder 10 implements auditory masking of off-frequency channels by adaptively varying the threshold values vm of the nodes 400, the receptive fields λm of the neurons 400, and the weighting factors Γm,n for the node cross-coupling, in dependence upon the input signal 11. In other embodiments, adaptive shaping of the coded signal ŝ can be accomplished by varying one or two of these sets of parameters. Furthermore, the signal-adaptive shaping of the coded signal ŝ may be implemented based on the outputs 111 of the coder 10 instead of the input signal 11, as illustrated schematically by a dotted arrow 112 in
Referring now to
The perceptive shaping unit 340a implements a signal-adaptive threshold update process that will now be described.
The process is based on a modification of a masking model described in an article [9], which is incorporated herein by reference. In this masking model, a masker provides both temporal masking and off-channel frequency masking. In the following description, a masker is a component of an audio signal that is strong enough that its presence 'masks', in the perception of a listener, other audio components in its vicinity in time or frequency. The nearby components, whose perception by a listener is affected by the masker, are referred to as maskees. Furthermore, the following description is provided with reference to gammatone kernels, although other suitable types of kernels, including but not limited to gammachirp kernels, may also be used in other embodiments. A description of relevant properties of gammatone kernels is provided in [9].
With reference to
In this exemplary model, the backward masking length BL, i.e. the length of the trailing tail of the curves of
FLh=round(100Fs arctan(dh))  (18)
The magnitude of the temporal masking curve zh(n), which is also referred to as the sensation level, depends on the amplitude a of the masking Gammatone, for example as defined by equation (19):
Here, Gh represents the maximum value of the frequency response of a normalized Gammatone kernel in channel h, and QTh represents the threshold in quiet for channel h. The threshold in quiet is based on the absolute threshold of hearing but is elevated in certain channels due to the short time duration of Gammatone kernels in these same channels. Elevating the threshold for these channels means that the amplitude of corresponding Gammatones must be louder than that of kernels in other channels to be perceived, since they do not last as long as the other kernels. Further details on the computation of the threshold in quiet are given in [9], which is incorporated herein by reference.
The sensation level SL in equation (19) is expressed in decibels; a corresponding equation for its amplitude value can be easily obtained from eq. (19).
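Purely by way of illustration, since eq. (19) itself is not reproduced above, the following decibel form is an assumption consistent with the definitions of Gh and QTh given above; the amplitude-domain counterpart mentioned in the text then follows by direct inversion:

```latex
% Hypothetical form of eq. (19): sensation level (dB) of a masker of amplitude a
SL(a,h) \;=\; 20\log_{10}\!\left(\frac{a\,G_h}{QT_h}\right)
% Inverting for the corresponding amplitude value:
a \;=\; \frac{QT_h}{G_h}\;10^{\,SL(a,h)/20}
```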
In a next step, the actual amount SLeff(a, h, p) by which a temporal masking curve is amplified is computed by subtracting an offset CTM(a, h, p) from the sensation level of the masker SL(a, h):
SLeff(a,h,p)=SL(a,h)−CTM(a,h,p) (20)
In one embodiment, the offset CTM(a, h, p) may be selected in dependence on the properties of the signal to be decomposed in different frequency channels and at different time positions. The offset may be set relatively higher for portions of the signal which exhibit a lot of structure, i.e. many tonal sections, and thus are more likely to be perceptually important, resulting in less masking for these portions. In contrast, signal portions which contain mostly noise may be given a smaller offset, allowing for more masking in these portions. The reader is referred to [9] for further details on the computation of the offset CTM(a, h, p). In one embodiment, the offset CTM(a, h, p) is set to a constant value that may be chosen empirically.
Equations (17) to (20) define temporal masking effects due to a masker corresponding to a particular gammatone kernel, i.e. due to the presence of a strong output of a particular neuron 400 that is associated with the particular kernel.
The exemplary model used in this implementation makes it possible to take into account the masking effect on Gammatones not only in the same frequency channel as the masking Gammatone, but also in the channels just above and just below. The masking effects imparted on Gammatones which lie in a channel just below that of the masker are assumed to be equal to the temporal masking effects described in the previous section, minus an offset due to a downward channel decay parameter SLdown. In one implementation, an empirically obtained value of 27 dB is used for this decay, i.e. SLdown=27 [9]. Likewise, the masking effects imparted on Gammatones which lie in a channel just above that of the masker are equal to the temporal masking effects described in the previous section, minus an offset representing an upward channel decay SLup. In one implementation, the upward decay depends also on the sensation level of the masker and its frequency channel, for example as follows:
SLup(a,h)=24+230/fh−0.2SL(a,h) (21)
When combined with the original in-channel temporal masking effects, the overall masking effects of a masker can be represented by a surface in a shape of a tent in the time-frequency plane, as illustrated in
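The tent of masking effects can be assembled, for example, as follows. This is a sketch: the in-channel temporal curve z(n) is taken as given, eq. (21) is used for the upward decay, and the default SLdown=27 dB is the empirical value cited above.

```python
import numpy as np

def sl_up(sl, fh):
    """Upward channel decay of eq. (21); sl is the masker sensation
    level SL(a,h) in dB, fh the masker's channel centre frequency in Hz."""
    return 24.0 + 230.0 / fh - 0.2 * sl

def masking_tent(z, sl, fh, sl_down=27.0):
    """Stack the in-channel temporal masking curve z(n) (dB) with its
    attenuated replicas in the channels just below and just above the
    masker, giving the 'tent' over the time-frequency plane."""
    below = z - sl_down        # channel below: minus downward decay
    above = z - sl_up(sl, fh)  # channel above: minus upward decay (eq. 21)
    return np.vstack([below, z, above])
```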
The masking model described hereinabove can be conveniently implemented within the PLCA framework using a masking matrix Ω, which is shown in
In the exemplary masking model wherein the maskers in one channel can only affect maskees in the same channel, or in channels just above and below, only the diagonal blocks of the masking matrix Ω and those just above and below the diagonal contain temporal masking matrices Γ(h). The rest of the matrix contains zeros. Note that elements of the matrices Γ(h) are not directly related to the weights Γm,n used hereinabove with reference to
Each temporal masking matrix Γ(h) represents all nodes 400 corresponding to a same frequency channel h and is of size p×p; it contains masking curves for the frequency channel which it represents. Since the columns of the masking matrix represent the maskers, the temporal curves zh(n) are placed in Γ(h) in a column-wise fashion facing downwards. This is analogous to each masker having its own curve in a non-matrix context. Since all kernels within a frequency channel occur at different time positions spaced by the hop size p, the masking curves zh(n) in successive columns of the temporal masking matrix Γ(h) are accordingly shifted downwards.
The temporal masking matrix Γ(h) shown in
The zero-th element of each masking curve, zh(0), i.e. the diagonal elements of the matrix, is set to zero to prevent a masker from imparting masking effects on itself. The first curve zh(n) in the matrix, i.e. the first column, begins at n=0. This is because the kernel (i.e. masker) corresponding to this curve is positioned at the first time position in the spikegram and therefore cannot exhibit any backward masking effects. Likewise, the last curve in the matrix (i.e. last column) ends at n=0. This is because the kernel (i.e. masker) corresponding to this curve is positioned at the last time position in the spikegram and therefore cannot exhibit any simultaneous and forward masking effects beyond its own time position. Lastly, as the temporal masking matrix has a number of rows and columns equal to the number of time positions, the masking curves in the matrix are downsampled according to the hop size p by taking every pth sample when going outwards from the masker position n=0.
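The construction of Γ(h) described above can be sketched as follows. The helper is hypothetical: the masking curve z(n) is assumed given at full resolution for signed lags from −BL to FL, and the hop size is passed explicitly.

```python
import numpy as np

def temporal_masking_matrix(z, bl, p, hop):
    """Build the p-by-p temporal masking matrix Gamma(h).
    z gives the masking curve value at signed lag n (samples), defined
    for n in [-bl, len(z)-1-bl]; column j is the masker at time position
    j, with its curve placed downwards and the diagonal zeroed."""
    fl = len(z) - bl - 1                 # forward masking length
    G = np.zeros((p, p))
    for j in range(p):                   # one column per masker
        for i in range(p):
            n = (i - j) * hop            # signed, downsampled lag
            if i != j and -bl <= n <= fl:  # zero diagonal; clip curve tails
                G[i, j] = z[n + bl]
    return G
```

Note how the first column automatically contains no backward masking (no rows above the masker) and the last column no forward masking, matching the boundary behaviour described above.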
The off-channel masking effects of the masking model can be taken into account by an off-channel decay matrix Ψ(a) that is illustrated in the bottom of
Each downward and upward decay matrix is a square matrix of the same dimension as the temporal masking matrix Γ(h). Each downward decay matrix X is composed of replicas of a scalar downward decay value SLdown, which may be an empirically set parameter: X = SLdown·1p×p, i.e. a p×p matrix with all elements equal to SLdown.
The upward decay of the masking model is a function of the amplitude and channel of the masker. As in the case of the temporal masking matrix, each column of the upward decay matrix corresponds to a masker. The upward decay matrix Y(a,h) is built by copying replicas of the upward decay of each masker for each column based on the frequency channel and amplitude of the masker, see
The next step in the process of adapting the masking model to the PLCA is the conversion of the neuron outputs, as the masker amplitudes, into their respective effective sensation levels SLeff(a,h,l). This conversion is shown by a second equation in
The masking effect felt by a 'maskee' node 'm' from all 'masker' nodes 400 can be obtained by multiplying, element by element, the mth row of the masking matrix Ω corresponding to the maskee, denoted Ω(m,*), by the vector ä(a) of the converted masker amplitudes, subtracting the corresponding row of the off-channel decay matrix Ψ(a), and taking the maximum:
v′m=max{[Ω(m,*)ä(a)]−Ψ(a)(m,*)}  (22)
Here, the multiplication of a row Ω(m,*) of the masking matrix by the vector ä of the converted amplitudes is an element by element multiplication representing simply a weighting of the masking matrix elements, rather than a dot product. The values v′m are in decibels, and are converted to the amplitude values vm using equation (23):
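A minimal numeric sketch of eq. (22) follows; the function names are illustrative, and the row-wise weighting is element by element, as stated above, not a dot product.

```python
import numpy as np

def maskee_threshold_db(omega_row, a_conv, psi_row):
    """Eq. (22) for one maskee m: weight the maskee's row of the
    masking matrix by the converted masker amplitudes, element by
    element, subtract the off-channel decay row, take the strongest
    resulting masking effect."""
    return np.max(omega_row * a_conv - psi_row)

def all_maskee_thresholds_db(Omega, a_conv, Psi):
    """Vectorised form of eq. (22) over all maskee rows m."""
    return np.max(Omega * a_conv[np.newaxis, :] - Psi, axis=1)
```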
is the sensation level given by eq. (19) converted to the linear domain from decibels. In equation (23), the use of the sign function ensures that a masking effect which would be null (i.e. zero) in the converted domain remains zero in the amplitude domain. Note that the masking effect vm felt by a maskee cannot be negative, since the elements of the masking matrix and of the off-channel decay matrix outside the masking zones are zero, so that some of the elements resulting from the subtraction in eq. (22) are guaranteed to be zero.
In one embodiment of the coder 20, the node masking values vm are used as input sensitivity thresholds of the nodes 400. In mathematical terms, the dynamics of the nodes 400 in this embodiment of the coder 20 can be described by the following equations:
where αm is the algebraic sum of all inputs into the mth neuron:
and γm is a binary weight, or a binary thresholding function, which sets the inputs into the mth neuron 400 to zero, i.e. blocks them, when these inputs in total are smaller than the computed node masking value vm, due to the combined auditory masking effect from other active nodes:
In one embodiment, this input thresholding is accomplished by providing each neuron 400 with an input thresholding element 440, as illustrated in
In one embodiment of the coder 20, the shaping unit 340a incorporates memory 345 that stores pre-determined signal shaping characteristics, and a masking processor 349 for implementing the adaptive perceptual shaping of the coded signal ŝ. The pre-determined signal shaping characteristics stored in memory 345 may include for example elements of the masking matrix Ω and the off-channel decay matrix Ψ, which together represent frequency and temporal auditory masking curves. The masking processor 349 receives outputs am from each of the nodes 400, as represented by the arrow 112, and, based on these outputs 112 and the signal shaping characteristics stored in 345, generates sensitivity thresholds vm for the neurons 400, for example in accordance with equations (23) and (22), as described hereinabove. These sensitivity thresholds vm are then provided as thresholding values to corresponding neurons 400.
Referring to
In one embodiment, the input thresholding element 440 coexists with the output thresholding element 430, which may have its threshold set to a node-independent value δ, as in the prior art LCA.
In one embodiment, the output thresholding element 430 may be omitted, and all thresholding functions are performed by the input thresholding element 440. In another embodiment wherein the node 400 includes only the output thresholding element 430 and the input thresholding element 440 is absent, the sensitivity thresholds vm are provided to the thresholding elements 430 for setting the thresholds thereof. In these embodiments, the thresholding elements 440 or 430 of each of the neurons 400 may in addition verify whether the neuron sensitivity value vm falls below a minimum threshold value δ and, if it does, set its threshold to δ, so as to ensure a desired sparsity of the resulting representation when the masking effects are weak. In other embodiments, the responsibility to ensure that the node input or output thresholds do not fall below a desired lower limit in the case of a single thresholding element may lie with the masking processor 349.
The performance of the PLCA coder 20 implementing the aforedescribed adaptive perceptual masking of the coded signal ŝ through input thresholding of the neurons has been tested using computer simulations for three input audio files, namely a castanet file, a speech file, and a percussion file. The audio quality of reconstructed signals was evaluated using the PEAQ model, which is an International Telecommunication Union (ITU) standard for evaluating audio quality. Unlike the SNR and SSNR measures, the PEAQ model takes into account not only the waveform samples but also mimics the human auditory processing system. Given a reconstructed signal and its original version, the model first pre-processes the signals based on the psychoacoustic properties of the human ear. The model then sends the resulting signals through a neural network which has been trained a priori on auditory tests with humans to mimic the cognitive aspects of the human auditory processing system. Lastly, the model outputs a set of variables which map to a score ranging between 0 and −5. Scores above −1 are said to be of broadcast quality. Based on the above evaluation metric, the performance of the PLCA with input masking, labeled LCAM in the following, was evaluated against that of the LCA using the following procedure for each sound file. The threshold of the hard-thresholding function is first set for the sound file in question such that the reconstructed signal corresponding to the sparse representation produced by the LCA yields a PEAQ score above −1 (i.e. broadcast quality). The LCAM is then executed for the sound file using the threshold which was established for the file in question when using the LCA. For all three files, the LCAM yielded higher PEAQ scores than the LCA, while also exhibiting lower SNRs.
Although the invention has been described hereinabove with reference to specific exemplary embodiments, it is not limited thereto, but is defined by the spirit and scope of the appended claims. Various improvements and modifications of the aforedescribed embodiments will be apparent to those skilled in the art from the present specification. For example, although the invention has been described hereinabove with reference to coding of audio signals, the invention may be equally applied to sparse adaptive coding of other signal types, including video and images. Furthermore, various features described hereinabove with reference to particular embodiments could be used in other described embodiments and their modifications, and various embodiments may be combined. For example, the encoder 20 of
Other embodiments and modifications of the embodiments described herein are also possible.
The present invention claims priority from U.S. Provisional Patent Application No. 61/366,613 filed Jul. 22, 2010, which is incorporated herein by reference.