The present invention generally relates to methods and apparatuses for signal processing, and more particularly relates to encoding of audio signals and other types of media signals.
Many types of signals can be well-approximated by a small subset of elements from an overcomplete dictionary. The process of choosing a good subset of dictionary elements, along with their weights, to represent a signal is known as sparse approximation, sparse representation, or sparse coding.
In over-complete shift-invariant representations, the number of dictionary elements in the dictionary set, which are also referred to as basis vectors or kernels, is greater than the real dimensionality—the number of non-zero eigenvalues in the covariance matrix—of the input signal. This technique matches the best kernels to different acoustic cues using convergence criteria such as residual energy. However, minimizing the energy of the residual, or error, signal is not by itself sufficient to obtain a unique over-complete representation of the input signal; additional constraints such as sparseness are imposed in order to obtain a unique solution. In order to find the “best matching kernels”, a matching pursuit (MP) technique is typically employed. A description of the MP technique is provided, for example, in R. Gribonval, “Fast matching pursuit with a multiscale dictionary of Gaussian chirps”, IEEE Trans. Signal Processing, 49(5):994-1001, 2001, and in U.S. Patent Application 2008/0219466. Advantageously, over-complete representations are more robust in the presence of noise than conventional coding techniques that represent a signal in an orthogonal basis, such as the Discrete Cosine Transform (DCT), the Modified Discrete Cosine Transform (MDCT) and the Discrete Fourier Transform (DFT).
A sparse representation of audio signals is disclosed in an article by R. Pichevar et al., entitled “Auditory-Inspired Sparse Representation of Multimedia Signals with Applications to Audio Coding”, Speech Communication, 2010, which is referred to hereinafter as Ref [1], and in U.S. Patent Applications 2008/0219466 and 2012/0023051, all three of which are incorporated herein by reference. These publications describe a biologically-inspired approach, in which an audio signal is projected onto a set of gammatone/gammachirp kernels, generating a sparse two-dimensional time-frequency representation dubbed a spikegram. A masking model is applied to the spikegrams to remove inaudible spikes and to increase the coding efficiency. In U.S. Patent Application 2008/0219466, a method to obtain the spikegram using the MP technique is disclosed. U.S. Patent Application 2012/0023051 teaches generating spikegrams using neural networks.
However, addressing each spike in a spikegram individually, as in the previously proposed approaches, may be costly in terms of bits when audio coding applications are considered. It is therefore desirable to provide a method and system yielding a more compact representation of audio signals.
Accordingly, the present invention relates to efficient encoding of media signals using sparse encoding of the media signal to generate a two-dimensional spikegram, applying a two-dimensional matrix factor deconvolution to the spikegram to obtain three-dimensional weight and component matrices, and/or adaptive quantization using integer programming to determine an optimal quantization scheme. Media signals as defined herein include audio signals, video signals, and signals representing images.
One aspect of the present invention relates to a method for encoding an audio signal by an audio encoding apparatus comprising data processing hardware, the method comprising: a) receiving a sequence of electrical signal samples representing a selected duration of the audio signal; b) from the received sequence of electrical signal samples, obtaining a two-dimensional spikegram sparsely representing the selected duration of the audio signal in time and frequency domains in terms of an overcomplete signal library; c) generating a set of weight matrices W and a set of component matrices H by performing a two-dimensional non-negative matrix factorization (NMF2D) of the spikegram, or of a non-negative matrix V obtained therefrom, under a sparsity constraint; d) quantizing non-zero values of the weight matrices W and the component matrices H; e) encoding W and H to obtain encoded audio data; and, f) outputting the encoded audio data for transmission to a complementary audio decoder or storing in a computer-readable medium.
Another aspect of the present invention relates to an audio signal processing apparatus, which comprises a spikegram generation logic for receiving a sequence of electrical signal samples representing a selected duration of an audio signal, and for generating a spikegram based thereon, wherein the spikegram represents the selected duration of the input audio signal in time and frequency domains in terms of an overcomplete signal library, and a matrix factorization logic for generating a set of weight matrices W and a set of component matrices H by performing the two-dimensional non-negative matrix factorization (NMF2D) of the spikegram, or of a non-negative matrix V obtained therefrom, under a sparsity constraint. The audio signal processing apparatus further comprises a quantizer for quantizing non-zero values of the weight matrices W and the component matrices H, and an encoder for encoding the weight matrices W and the component matrices H to obtain encoded audio data for transmission to a complementary audio decoder or storing in a computer-readable medium.
Another feature of the present invention provides an article of manufacture comprising at least one of: a hardware device having hardware logic for performing operations for encoding an audio signal, and a computer readable storage medium including a computer program code embodied therein that is executable by a computer, said computer program code comprising instructions for performing the operations for encoding the audio signal, said computer program code further comprising distinct software modules, the distinct software modules comprising a spikegram generating module for generating a spikegram and a matrix factorization module for performing a two-dimensional non-negative matrix factorization (NMF2D) of the spikegram. The operations comprise: a) receiving a sequence of electrical signal samples representing a selected duration of the audio signal; b) from the received sequence of electrical signal samples, obtaining a two-dimensional spikegram sparsely representing the selected duration of the audio signal in time and frequency domains in terms of an overcomplete signal library; c) generating a set of weight matrices W and a set of component matrices H by performing NMF2D of the spikegram, or of a non-negative matrix V obtained therefrom, under a sparsity constraint; d) quantizing non-zero values of the weight matrices W and the component matrices H; e) encoding W and H to obtain encoded audio data; and, f) outputting the encoded audio data for transmission to a complementary audio decoder or storing in a computer-readable medium.
The invention will be described in greater detail with reference to the accompanying drawings which represent preferred embodiments thereof, in which like elements are indicated with like reference numerals, and wherein:
In the following description of the exemplary embodiments of the present invention, reference is made to the accompanying drawings which form a part thereof, and which show by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made within the scope of the present invention. Reference herein to any embodiment means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
In the context of this specification, the term “computing” is used generally to mean generating an output based on one or more inputs using digital hardware, analog hardware, or a combination thereof, and is not limited to operations performed by a digital computer. Similarly, the term ‘processor’ when used with reference to hardware, may encompass digital and analog hardware or a combination thereof. The term processor may also refer to a functional unit or module implemented in software or firmware using a shared hardware processor. The terms ‘output’ and ‘input’ encompass analog and digital electromagnetic signals that may represent data sequences and single values. The terms ‘data’ and ‘signal’ are used herein interchangeably. The terms ‘coupled’ and ‘connected’ are used interchangeably; these terms and their derivatives encompass direct connections and indirect connections using intervening elements, unless clearly stated otherwise. Note that as used herein, the terms “first”, “second” and so forth are not intended to imply sequential ordering, but rather are intended to distinguish one element from another unless explicitly stated.
In the following description, reference is made to the accompanying drawings which form a part thereof and which illustrate several embodiments of the present invention. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of the present invention. The drawings include flowcharts and block diagrams. The functions of the various elements shown in the drawings may be provided through the use of data processing hardware such as but not limited to dedicated logical circuits within a data processing device, as well as data processing hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. The term “processor” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include without limitation, logical hardware circuits dedicated for performing specified functions, digital signal processor (“DSP”) hardware, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), read-only memory (“ROM”) for storing software, random access memory (“RAM”), and non-volatile storage. The term ‘memory’ as used herein refers to computer-readable non-transient data storage and encompasses optical disks, magnetic memory devices such as hard drives, solid-state memory, and other types of hardware memory devices.
Embodiments of the present invention use two-dimensional non-negative matrix factorization (NMF2D) to obtain a more compact representation of a spikegram in terms of a collection of its constituent ‘parts’, or ‘components’, and their associated weights. The term ‘more compact’ is used herein to mean a representation that requires fewer bits than the original, while maintaining the perceptual quality of the signal after decoding at an acceptable level.
Part-based representations are used in signal processing and artificial intelligence, since they enable extracting constituent objects of a scene by extracting localized features. One way of achieving part-based analysis is the use of non-negative kernels in a linear model. In this scheme, since each signal is generated by adding up positive, or more generally non-negative kernels, no part of the kernels can be cancelled out by addition. Therefore the kernels, or basis vectors representing them, must be parts of the underlying data. In addition, combining sparseness and non-negativity gives a suitable representation for signals, as has been shown for example in P. Hoyer, Non-negative Matrix Factorization with sparseness constraints, Journal of Machine Learning Research 5: 1457-1469, 2004, and R. Pichevar and J. Rouat, An Improved Sparse Non-Negative Part-Based Image Coder via Simulated Annealing and Matrix Pseudo-Inverse, ICASSP 2008.
Mathematically, a conventional non-negative matrix decomposition is a technique that optimally solves a matrix equation
V≈W·H, (1)
where V, W and H are matrices having only non-negative elements, where H contains as its rows K basis vectors, or components, of the decomposition, where K is an integer that is greater than 1. The K columns of the weight, or mixing, matrix W contain the corresponding weights that give the contribution of each basis vector in the input matrix V; each element wi,j of W is a projection of the input signal onto the jth vector of H−1. The goal of the NMF is to find both W and H that approximate V by minimizing a cost function E. A variety of cost functions can be used. By way of example, the cost function E may be defined by the following equation (2):
E=∥V−WH∥2+λ∥H∥,  (2)
where the notation ∥X∥ represents a norm of matrix X, wherein the subscript “2” refers to the norm L2. The first term in equation (2) represents the error of the approximation of equation (1), and the second term represents a sparseness penalty, with λ being a selectable positive parameter.
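For illustration only, the sketch below (in Python, with hypothetical function and variable names) shows how a cost of the form of equation (2) may be minimized using the well-known multiplicative NMF update rules with an L1-type sparseness penalty on H. It is a minimal sketch of conventional sparse NMF, not of the NMF2D procedure described further below.

```python
import numpy as np

def sparse_nmf(V, K, n_iter=200, lam=1e-3, eps=1e-12, seed=0):
    """Minimal sketch: factorize a non-negative (M, N) matrix V as V ~ W @ H
    with K components, penalizing ||H|| with the sparseness parameter lam
    (cf. equation (2)). Uses multiplicative (Lee-Seung type) updates."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, K))
    H = rng.random((K, N))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)   # sparseness penalty enters the denominator
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```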
The conventional NMF as described above may not be suitable for some applications, such as when applied to audio or video signals. Some audio components may be repetitive in time and/or frequency. Using the standard NMF would mean that the size of W and H should be increased to take into account all the repetitions. An increase in the size of the matrices W and H would mean an increase in the bitrate when audio coding applications are considered. An article by P. Smaragdis, Convolutive Speech Bases and their Application to Speech Separation, IEEE Trans. on Speech and Audio Processing, and U.S. Pat. No. 7,415,392 issued to Smaragdis, both of which are incorporated herein by reference, disclose the use of a convolutive NMF, also referred to as non-negative matrix factor deconvolution (NMFD), for separation of a mixture of sounds received from a single-channel audio signal by associating different columns of the component matrix with different sound sources. In this approach, the input matrix V is formed of a sequence of spectrograms of short single frames, and the learning of the NMFD is performed off-line, using a large database of sound files. This approach may not be suitable for the purpose of audio encoding, when the encoding must be performed in real time.
Therefore, embodiments of the present invention utilize a version of the non-negative matrix factor 2-D (two-dimensional) deconvolution (NMF2D), which has been disclosed in M. Morup and M. Schmidt, Sparse Non-Negative Matrix Factor 2-D Deconvolution, Technical University of Denmark, 2008, which is incorporated herein by reference. We discovered that the NMF2D approach provides superior performance for audio encoding compared to the NMFD approach, in particular in terms of the achievable bit rate reduction for the same perceptual quality of the decoded audio at the decoder.
In the 2-D convolution version of the NMF, i.e. the NMF2D, the decomposition of the initial matrix V is done in the following form:
where ↓φ denotes the downward shift operator, which moves each element in the matrix φ rows down while adding φ all-zero rows at the top, and →τ denotes the right shift operator, which moves each element in the matrix τ columns to the right while adding τ all-zero columns at the left of the matrix. Here, instead of a single base matrix H and a single weight matrix W, we have a set of L base matrices Hφ and a set of D weight matrices Wτ, where L is the total number of shifts φ for the weight matrices Wτ, and D is the total number of shifts τ for the component matrices Hφ. The base matrices Hφ are also referred to herein as component matrices, as their columns contain ‘hidden’ components of the original ‘signal’ V that are being extracted by the NMF2D process. Accordingly, the process of obtaining the component matrices Hφ may also be referred to as the component, or object, extraction. In the following, the indices ‘τ’ and ‘φ’ may be omitted where this does not lead to confusion, and the sets of matrices Wτ and Hφ may be referred to as the set of weight matrices W and the set of component matrices H, respectively. It will be appreciated that the set of D weight matrices Wτ can be viewed as a 3D (three-dimensional) weight matrix, and the set of L base matrices Hφ can be viewed as a 3D component matrix; accordingly, these sets of matrices that are generated by the NMF2D may also be referred to herein as the 3D weight and component matrices.
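As a minimal, non-limiting sketch (assuming the decomposition has the form V≈Σ over τ and φ of ↓φ(Wτ)·→τ(Hφ), consistent with the shift operators just defined), the NMF2D reconstruction may be written in Python as follows; the iterative update rules of the aforecited article are not reproduced here.

```python
import numpy as np

def shift_down(X, phi):
    """Move each element phi rows down, adding phi all-zero rows at the top."""
    out = np.zeros_like(X)
    out[phi:, :] = X[:X.shape[0] - phi, :]
    return out

def shift_right(X, tau):
    """Move each element tau columns to the right, adding tau all-zero columns at the left."""
    out = np.zeros_like(X)
    out[:, tau:] = X[:, :X.shape[1] - tau]
    return out

def nmf2d_reconstruct(W_list, H_list):
    """Sum, over all shifts, of shift_down(W[tau], phi) @ shift_right(H[phi], tau).

    W_list : D weight matrices W[tau], each of shape (M, K);
    H_list : L component matrices H[phi], each of shape (K, N).
    Returns the (M, N) approximation of V used in the update formulae."""
    M, _ = W_list[0].shape
    _, N = H_list[0].shape
    approx = np.zeros((M, N))
    for tau, Wt in enumerate(W_list):
        for phi, Hp in enumerate(H_list):
            approx += shift_down(Wt, phi) @ shift_right(Hp, tau)
    return approx
```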
The combined ‘weight’ ∥H∥ of the matrices Hφ, which enters into the sparseness penalty term of the cost function E, may be defined as follows using the norm L½:
Based on gradient descent, the following recursive updates may be used to compute the sets of matrices Wτ and Hφ:
where W̃τ is a matrix whose elements are defined by the following equation:
and Λ̃ is calculated as follows after each update of Hφ and Wτ:
Update formulae for other types of sparsity penalties can be found in the aforecited article by M. Morup and M. Schmidt. Note that the set of matrices Hφ may be viewed as a 3D matrix, or more particularly a 3-D tensor. In embodiments when the NMF2D is applied to a spikegram, this tensor contains the chosen number K of hidden components, which are also referred to herein as objects, on one dimension, the length of the signal N on the second dimension, and the number of shifts L on the third dimension. Similarly, W is a 3-D tensor with the number of channels M of the spikegram as the first dimension, the number K of hidden components as the second dimension, and the number of shifts D as the third dimension.
Referring now to
i) receiving a sequence of signal samples representing a selected duration of the audio signal x(t) at step 110;
ii) from the received sequence of signal samples, obtaining at step 115 a two-dimensional sparse matrix S 116 sparsely representing the selected duration of the audio signal in time and frequency domains in terms of an overcomplete signal library; if the matrix S 116 includes negative elements, in step 120 it is mapped to a non-negative matrix V 121;
iii) generating in step 125 a set of weight matrices W and a set of component matrices H by performing the NMF2D of the sparse matrix S 116, or of a non-negative matrix V 121 obtained therefrom, under a sparsity constraint; and,
iv) in step 130, outputting the sets 126 of weight and component matrices W, H for further processing or for storing in a computer readable memory; in accordance with the terminology stated hereinabove, the sets 126 of weight and component matrices W, H are also referred to herein as 3D weight (W) and component (H) matrices.
Referring now to
In one embodiment, the 2-D sparse matrix S 116 is a spikegram that is computed by representing the input audio signal in terms of an over-complete library G of kernels, which is composed of nm time-shifted copies of each of M base dictionary elements gm(t), each gm(t) corresponding to a different center frequency fm, m=1, . . . , M, where M denotes the number of frequency channels in the representation. In the case of audio signals, these base dictionary elements gm(t) may be, for example, gammatone filter functions or gammachirp functions. The impulse responses of the gammatone filters approach those of actual responses observed in the human hearing system, and are given, for example, in our earlier U.S. Patent Application 2008/0219466 that is assigned to the assignee of the present application, and in an article by H. Najaf-Zadeh, R. Pichevar, H. Landili, and L. Thibault, “Perceptual matching pursuit for audio coding,” in Audio Engineering Society Convention 124, May 2008, both of which are incorporated herein by reference for all purposes.
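Purely as an illustration of what such a kernel may look like, the sketch below gives one common parameterization of a gammatone impulse response; the order, bandwidth formula and kernel length are assumptions made here for illustration and may differ from the kernels used in the cited references.

```python
import numpy as np

def gammatone_kernel(fc, fs, order=4, duration=0.02):
    """One common gammatone parameterization (illustrative assumption):
    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), normalized to unit norm.

    fc : center frequency (Hz); fs : sampling rate (Hz); duration : kernel length (s)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)        # Glasberg-Moore equivalent rectangular bandwidth
    b = 1.019 * erb                                # conventional gammatone bandwidth scaling
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)                   # unit-norm dictionary element
```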
In mathematical notations, a signal x(t) can be decomposed into the overcomplete kernels as follows:
where τim and aim are the temporal position and amplitude of the ith instance of the kernel gm, respectively. The kernels are not generally restricted in form or length.
The dictionary elements gm(t) can be realized both in the analog and digital domains, for example as digital or analog filters or correlators, or in software. Considering digital implementations by way of example, the input signal x(t) is digitized and is in the form of a sequence of frames of length N each, with N being the number of signal samples in one frame. In one embodiment, the input signal x(t) is a sampled audio signal; in other embodiments it can be a sampled video or image signal. Each dictionary element gm(t) may be viewed as an impulse response of a finite impulse response (FIR) filter and mathematically represented as a vector of length N. In the overcomplete dictionary G, each base element gm has a length Ngm<N and is present in nm time-shifted copies that are spread over the frame length N, preferably uniformly. In one embodiment, each consecutive copy of a base element gm is shifted by q samples from the previous copy, thereby sampling each frame of the input signal x(t) with a sampling period q=N/nm, which may be referred to as the hop size.
Different techniques could be used to find an optimal subset of kernels gm, and the corresponding τim and aim. In one embodiment, MP can be used in step 115 to generate the spikegram 116, wherein the signal x(t) is decomposed over a set of kernels so as to capture the structure of the signal. The MP approach, which is well known in the art and will not be described here in detail, involves iteratively approximating the input signal x(t) with successive orthogonal projections onto the basis set of kernels, which may be described mathematically using the following equation (10):
x(t)=<x(t), gm(ti)>·gm+Rx(t) (10)
where <x(t), gm(ti)> is the inner product between the signal and the kernel and is equivalent to aim in equation (9), which in the context of this specification is referred to as “spike amplitude”. Rx(t) is the residual signal representing an error of the approximation. In one embodiment wherein the input signal x(t) is an audio signal, the over-complete dictionary of kernels gm(t) may be gammatones or gammachirps. The process of generating a spikegram from a selected duration of an input audio signal using the MP technique is described, for example in U.S. Patent application No 2008/0219466, which is incorporated herein by reference.
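A simplified, non-limiting sketch of this greedy MP loop over a dictionary of time-shifted kernels is given below; it omits the perceptual masking and stopping criteria of the cited references, and the function and variable names are illustrative only.

```python
import numpy as np

def matching_pursuit(x, kernels, n_spikes=100):
    """Greedy MP sketch: repeatedly project the residual onto every time shift of
    every unit-norm kernel, keep the largest projection as a 'spike' (channel m,
    position tau, amplitude a), and subtract its contribution (cf. equation (10))."""
    residual = np.asarray(x, dtype=float).copy()
    spikes = []
    for _ in range(n_spikes):
        best_a, best_m, best_tau = 0.0, 0, 0
        for m, g in enumerate(kernels):
            corr = np.correlate(residual, g, mode='valid')   # inner products <residual, g_m(.-tau)>
            tau = int(np.argmax(np.abs(corr)))
            if abs(corr[tau]) > abs(best_a):
                best_a, best_m, best_tau = corr[tau], m, tau
        spikes.append((best_m, best_tau, best_a))
        g = kernels[best_m]
        residual[best_tau:best_tau + len(g)] -= best_a * g   # remove the matched component
    return spikes
```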
In one embodiment, the spikegram 116 may be generated using an adaptive neural network, as described in U.S. Patent application No 2012/0023051, which is incorporated herein by reference.
In both techniques, the spikegram is generated by minimizing a cost function including an error function and a sparseness term. In some embodiments, the error function may be perceptually shaped according to an auditory masking pattern as described in Ref [1] and U.S. Patent application No 2012/0023051.
Using notations introduced hereinabove, in one embodiment the spikegram S 116 is an M×N matrix, with M being the number of channels and N being the length of the signal, i.e. the number of consecutive signal samples in the selected duration of the input audio signal, hereinafter referred to as ‘frame’. The matrix S has nonzero elements, or ‘spikes’, aim at cells (m,τim) of the matrix, with all the other elements being zero. In one embodiment, only channels that have more than a pre-determined minimal number of spikes per unit time, for example more than 10 spikes/second, are kept to increase ‘sparseness’ of the spikegram matrix.
Note that the spikegram matrix S can have either positive or negative elements aim, since the projections <x(t), gm(ti)> may be either positive or negative. Accordingly, in one embodiment the spikegram S 116 is mapped to a non-negative matrix V 121. A variety of mapping procedures may be envisioned within the context of the present invention, resulting in a non-negative matrix V that is referred to herein as a modified spikegram 121. By way of example, this mapping may be performed in accordance with the following equations (11), yielding the non-negative matrix V of size 2M×N:
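One plausible instance of such a mapping, given here only as an assumption since several mappings of the form of equations (11) may be envisioned, places the positive spikes in the first M rows of V and the magnitudes of the negative spikes in the remaining M rows:

```python
import numpy as np

def spikegram_to_nonnegative(S):
    """Map an (M, N) spikegram with signed spikes to a (2M, N) non-negative matrix V:
    rows 0..M-1 hold the positive spikes, rows M..2M-1 hold |negative spikes|.
    This is one plausible form of the mapping of equations (11) (an assumption)."""
    return np.vstack([np.maximum(S, 0.0), np.maximum(-S, 0.0)])
```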
By way of example,
Step 125 then involves applying the NMF2D technique, which is described hereinabove with reference to equations (3)-(8), to the modified spikegram matrix V 121 to obtain the sets 126 of the component, or base, matrices H and the weight matrices W. In one embodiment, this technique involves iteratively updating the sets 126 of weight and component matrices to reduce a sparsity-dependent cost function below a threshold level. The technique may further include tuning one or more parameters of the technique so that a good reconstruction of the audio signal is achieved. The parameters that can be tuned in step 125 include one or more of the following: the number of hidden signal components K to look for, which defines the number of columns of the component matrices, the range of summation for φ, which defines the number L of the component matrices Hφ, the range of summation for τ, which defines the number D of the weight matrices Wτ, as well as the sparseness parameter λ.
With reference to
Advantageously, we found that for audio signal encoding the optimal ranges in which the NMF2D parameters vary are typically small. By way of example, the range for L and D may be from 5 to 7; the initial value of λ may be selected to be 5×10−4, and may typically decrease down to 5×10−5.
With reference to
Referring now to
In one embodiment, the DAP 5 is further configured for implementing the NMF2D processing with a perceptual weighting of the deconvolution error term comprised in the cost function E minimized by the NMF2D engine 35. In the conventional NMF2D algorithm that is described hereinabove with reference to equations (1) to (8), the cost function E includes the least-squares NMF error ∥V−WH∥, see for example equation (2). However, in the case of audio signals, the least-squares error is not very relevant perceptually. Therefore in at least one embodiment the least-squares error term in the NMF2D cost function E is perceptually weighted with a perceptual 2D mask PM={mb,k}, which may be generated for example as disclosed in Ref. [1]. In one embodiment, the perceptually weighted cost function that is minimized by the NMF2D engine 35 takes the form given by the following equation (12):
where mb,k are elements of the perceptual mask matrix PM, νb,k are elements of the modified spikegram V 121, or of a frequency segment Vθ thereof. By defining PM as the masking matrix with elements mb,k and replacing in equations (5), (6) ‘V’ with ‘PM·V’ and ‘HφWτ’ with ‘PM·HφWτ’, where symbol ‘·’ denotes element by element multiplication, we obtain the perceptually flavored update formulae for the component and weight matrices Hφ and Wτ. The advantage of the perceptually weighted update formulae is that the errors are concentrated below the mask instead of being spread evenly throughout the spectrum.
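For plain NMF (equation (1)) rather than NMF2D, this substitution yields the standard weighted multiplicative updates, sketched below as a non-limiting illustration with hypothetical names; the NMF2D engine 35 applies the same PM-weighting inside its own update formulae (5), (6).

```python
import numpy as np

def weighted_nmf_step(V, W, H, PM, eps=1e-12):
    """One multiplicative update of H and W for the perceptually weighted cost
    ||PM * (V - W @ H)||^2, where '*' denotes element-by-element multiplication.
    The fit is tightened where PM is large, so errors concentrate where PM is small
    (i.e. below the mask)."""
    WH = W @ H
    H *= (W.T @ (PM * V)) / (W.T @ (PM * WH) + eps)
    WH = W @ H
    W *= ((PM * V) @ H.T) / ((PM * WH) @ H.T + eps)
    return W, H
```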
Accordingly, in one embodiment the DAP 5 includes perceptual masking logic (PML) 70. The PML 70 is coupled to the SGL 25, and in operation dynamically generates the perceptual mask PM 80 in dependence upon the spikegram 30. Note that the perceptual mask 80 accounts for dynamic masking of weak components of the audio signal by adjacent stronger components thereof, and thus generally varies from one audio frame 20 to another. Once generated, the PM 80 is saved in a perceptual mask memory, from which it is provided to the NMF2D engine 35 for implementing the perceptually weighted NMF2D algorithm as described hereinabove. Details of computing the perceptual mask 80 in embodiments wherein the SGL 25 implements the MP technique can be found in Ref. [1], which is incorporated herein by reference. Details for embodiments wherein the SGL 25 implements the neural network technique can be found in U.S. Patent Application 2012/0023051, which is incorporated herein by reference.
Continuing to refer to
The quantized matrices Hq 50 and Wq 55 are then forwarded to the coder 60, which performs an entropy or arithmetic encoding of the quantized matrix values as known in the art; the resulting encoded data are then assembled in one or more frames by the output framer 65 and output as the encoded data 88. By way of example, in one embodiment the framer 65 concatenates the rows, columns and depth of the quantized and encoded 3D matrices W, H in a pre-defined order that is known to a complementary audio decoding device 6 shown in
With reference to
Turning now back to
One drawback of conventional quantization of the W and H matrices, either vector or scalar, is that it makes it difficult or even impossible to control the amount of quantization error in the reconstructed spikegram 96, and ultimately in the reconstructed audio signal 98. Indeed, in embodiments employing a standard quantization scheme, e.g., scalar or vector quantization, the quantization precision for each element hij of the component matrix H is set independently of the weights wik, which are elements of the weight matrix W. Considering the standard NMF rather than the NMF2D for illustration purposes and by way of example, each element νi,j of the reconstructed spikegram is computed in accordance with the following equation (13):
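In element-wise form, the matrix product of equation (1) gives each reconstructed spike as a weighted sum over the K components, νi,j=Σk wi,k·hk,j; equation (13) is assumed here to be of this form, and the observations below refer to it.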
We note at this point that only one row of W, which we will denote Wi, and one column of H, which we will denote Hj, are involved in defining a specific νij. Therefore, other rows of W and columns of H are not relevant to defining νij, which enables reducing the search in the optimization procedure described hereinbelow by limiting the search space to one row and one column.
Furthermore, we make the following observations.
1) The sensitivity of the reconstructed signal 98 to the quantization error in any particular hkj depends on the weight wik associated with it. A small wik will lead to a relatively smaller contribution from the product wikhkj to the overall sum in the RHS of equation (13), and vice versa. For instance, in the extreme case when the weight wi1=0, any error can be tolerated on h1j since the result of the multiplication wi1h1j is always zero. Therefore, one can in that case set h1j=0, no matter its real value, and use the minimum number of bits to address that element. Accordingly, the NMF quantization scheme that is provided in accordance with an embodiment of the present invention employs an adaptive quantization approach wherein the quantization level of a particular element hkj of a component matrix H is determined in dependence upon the magnitude of the associated weight.
2) For a given bound εij on the error of the spike amplitude vij and a given set wik in Wi, there might be different combinations of hkj in Hj that will satisfy the quantization error bounds [vij−εij; vij+εij] on the magnitude of vij. However, not all of the solutions will result in a minimum bitrate for transmitting the output data 88. When an entropy coding is used, the bitrate depends on the number of non-zero elements hkj in the quantized matrix Hq and their distribution in magnitude P(hkj), which defines what fraction of all matrix elements hkj has a magnitude within a particular range. One solution to this problem would be to choose a combination of hkj that leads to the sparsest solution, i.e. has the greatest possible number of hkj=0.
3) There is no one-to-one correspondence between the tolerated error ε in vij and the actual error in each element hkj when the standard quantization is used. Without a trial and error based approach, it is not possible to know a priori how many quantization levels q should be used in the scalar quantizer so that the error in the reconstructed spikes does not cross the target maximum allowable quantization error ε.
Accordingly, an embodiment of the present invention utilizes a weight-adaptive quantization approach wherein the number of quantization levels for each non-zero spike magnitude hkj is chosen adaptively based on the magnitude of the corresponding weight wik and the maximum allowable quantization error εi,j. The approach is different from non-adaptive scalar and vector quantization where the number of quantization levels is fixed for all values to be quantized. The technique differs from prior-art adaptive quantizers, wherein the number of levels of quantization varies in time/space as a function of the signal statistics in a given frame. Contrary to that, in the adaptive quantization that is used in embodiments of the present invention, the number of levels of quantization for each hkj varies in dependence on the magnitude of the respective weight coefficient wik.
With reference to
At step 501, provide an error matrix ε 43, elements of which define the maximum allowable error for each element of the spikegram matrix V 30; in one embodiment, elements of the error matrix ε are quantized to a maximum desired precision corresponding to a pre-defined maximum number of quantization levels Q. The quantization error matrix ε 43 defines the upper and lower error bounds for the ‘quantized’ matrix Vq=WqHq:
xmax=└V+ε┘, xmin=└V−ε┘  (14)
Equations (14) are written in a matrix form and hold for each element of the respective matrices at the same positions. The notation ‘└X┘’ denotes the largest integer equal to or less than X.
At step 502, execute NMF or NMF2D to generate the weight and component matrices W, H.
At step 503, upscale the weight matrix W to the maximum desired precision, i.e. using Q levels of quantization, to obtain a quantized weight matrix Wq. Note that in this scheme the quantization is adaptive and the number of quantization levels for each element in H can range from 0 levels up to the maximum number of Q levels. After this up-scaling operation, a maximum value in W is equal to Q.
At step 504, execute an optimization procedure to determine an optimal Hq for the given fixed Wq that yield Vq which lies within the error bounds defined by equations (14), i.e.
xmin≦Vq≦xmax,  (15)
and save resulting Hq in memory;
At step 505, execute an optimization procedure to determine an optimal Wq for the Hq saved in the previous step that yields Vq within the error bounds (15), and save resulting Wq in memory; this step is optional given the small size of W, and can be omitted in some embodiments.
Optionally, steps 504 and 505 may be repeated if required.
A variety of suitable optimization procedures could be used in steps 504 and 505 within the scope of the present invention. In one embodiment, these steps are conveniently performed using integer programming.
Integer programming, as known in the art, is a method for solving, by a computer, a linear optimization problem over integer-valued variables, which can be mathematically formulated as follows:
subject to the following conditions:
Different algorithms for solving the problem of the type given by equations (16), (17) are known in the art and could be used in embodiments of the present invention. One suitable algorithm for integer programming is the branch-and-bound algorithm, which is disclosed, for example, in A. H. Land and A. G. Doig, “An automatic method of solving discrete programming problems”, Econometrica 28 (3): 497-520, 1960. The integer linear program is a linear program further constrained by the integrality restrictions. Thus, in a maximization problem, the value of the objective function at the linear-program optimum is always an upper bound on the optimal integer-programming objective, and any integer feasible point provides a lower bound on it; for a minimization problem the roles of the bounds are reversed. The idea of branch-and-bound is to utilize these observations to systematically subdivide the linear-programming feasible region and make assessments of the integer-programming problem based upon these subdivisions.
In one embodiment, an integer programming algorithm is used in step 504 to determine an optimal quantization scheme for elements of the component matrix by solving the following linear integer problem:
min|Hqj| (18)
under the condition
xminij<WiHqj<xmaxij  (19)
where Hqj denotes the jth column of the optimized matrix Hq, Wi denotes the ith row of W, and xminij and xmaxij denote the (i, j)th elements of the matrices xmin and xmax, respectively, which define the lower and upper error bounds for the reconstructed spikegram matrix Vq 96.
Since only one row of W and one column of H contribute to the quantization error in any vij, the optimization may be performed for each row i of W and each column j of H separately in order to save computational cost.
In this embodiment, the quantizer logic 53 is configured to find, using an integer programming algorithm, a minimum-weight vector Hqj whose non-zero elements are integers and satisfy condition (19), i.e. result, for a given weight matrix W, in elements of the reconstructed spikegram that lie within the predetermined error bounds. In another embodiment, the ‘lowest-weight’ condition (18) is not used, and the integer programming algorithm is executed with the constraint (19) but without any objective function to minimize.
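As one concrete, non-limiting way to carry out the optimization of step 504, the per-column integer program of equations (18), (19) may be handed to an off-the-shelf mixed-integer linear programming solver. The sketch below uses SciPy's milp routine; the function name, the use of non-strict bounds, and the per-column calling convention are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def quantize_column(Wq, x_min_col, x_max_col, q_max):
    """Find an integer column Hq_j minimizing sum(Hq_j) subject to
    x_min_col <= Wq @ Hq_j <= x_max_col (cf. equations (18), (19)).

    Wq        : (M, K) up-scaled weight matrix, held fixed;
    x_min_col : length-M lower error bounds for column j of the spikegram;
    x_max_col : length-M upper error bounds for column j of the spikegram;
    q_max     : maximum number of quantization levels Q."""
    K = Wq.shape[1]
    c = np.ones(K)                                     # objective: the 'weight' |Hq_j| of the column
    res = milp(c=c,
               constraints=LinearConstraint(Wq, x_min_col, x_max_col),
               integrality=np.ones(K),                 # all K elements are integers
               bounds=Bounds(0, q_max))                # each element in [0, Q]
    return np.rint(res.x).astype(int) if res.success else None

# Hypothetical use for one column j:
# Hq_j = quantize_column(Wq, x_min[:, j], x_max[:, j], q_max=1024)
```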
In step 505, in one embodiment, an integer programming algorithm is used to solve the following linear integer problem:
find min|Wqi|  (20)
under the condition
xminij<WqiHqj<xmaxij  (21)
where Hqj denotes the jth column of the optimized matrix Hq saved in step 504, and Wqi denotes the ith row of the matrix Wq. Again, in this step the optimization is done on one row and one column of W and H. In this embodiment, the quantizer logic 53 is configured to find, using an integer programming algorithm, a minimum-weight vector Wqi whose non-zero elements are integers and satisfy condition (21), i.e. result, for the given component matrix Hq, in elements of the reconstructed spikegram that lie within the predetermined error bounds. In another embodiment, the ‘lowest-weight’ condition (20) is not used, and the integer programming algorithm is executed with the constraint (21) but without any objective function to minimize.
The minimum-weight conditions (18), (20), which are imposed on the columns of the quantized component matrix and/or the rows of the quantized weight matrix, have the advantage that they increase the sparseness of the representation, thereby enabling lower bitrates. The technique is extended to NMF2D by replacing the matrix products in equations (19), (21) with the RHS of equation (8).
In one embodiment, the quantizer logic 53 implements in step 504 a piecewise-linear quantization by integer programming (QIP) of the NMF2D matrix H as described hereinbelow. The QIP process enables quantizing elements of H with different precisions in different regions, which may be preferable when the distribution of values in H is not uniform. In this case, it is preferable to use a higher precision for those ranges of values for which the concentration is high in the distribution, and a coarser precision for those regions with sparser density.
The process may be explained as follows. We denote an element of the component matrix, up-scaled to Q, that is to be quantized as H, temporarily omitting the indices i, j for clarity, and further denote the quantization levels that are to be used for ranges of values with upper bounds a1, a2, a3, . . . aN as q1, q2, q3, . . . , qN, respectively, wherein N here is the desired number of distinct quantization regions and can range, for example, from 2 to (Q−1). These sets of quantization ranges ai and the corresponding quantization levels qi are pre-defined prior to the processing and stored in memory. By way of example, for audio signals smaller values, which correspond to quieter sound, could be quantized with finer precision than larger values, which correspond to louder sound. The task of the following processing is to determine optimal distributions of the quantization levels in each of the pre-defined ranges of values [ai, ai+1]. In that case, if a quantized value of H, i.e. Hq, falls in the region [ai, ai+1], it can be written as:
where l is an integer between ai and ai+1. We further introduce variables δi, i=1, . . . , N, such that they satisfy the following conditions:
Note that the variable δ1 corresponds to the amount by which H exceeds 0, but is less than or equal to a1; δ2 corresponds to the amount by which H exceeds a1, but is less than or equal to a2; δN corresponds to the amount by which H exceeds aN−1, but is less than or equal to aN. Note that δi in
Therefore Hq, which is the quantized value for H, may be computed by summing suitably scaled variables δi:
Equation (24) is applied to every non-zero element of the 3D component matrix H. In order for equation (24) to be valid, we should further require that δ1=a1 whenever δ2>0, δ2=a2 whenever δ3>0, and so on. These conditional constraints can be modeled by introducing binary variables
By integrating the binary variable bi into the equations (23), we obtain:
With these notations, the minimization problem to be solved may be formulated as follows: find all δi that minimize the objective function given by the absolute value of their sum,
under the following conditions:
xmini,j<WiHqi<xmaxi,j  (29)
and
Li·bi<δi<Li·bi−1  (30)
where each element of Hqi is computed in accordance with Equation (24); xmini,j and xmaxi,j are defined in Equation (19).
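By way of a purely hypothetical numerical illustration, under one reading in which the ai are cumulative upper bounds so that region i has length Li=ai−ai−1 and each δi is capped at Li: for a1=16, a2=64, a3=256 and a value H=100, one obtains δ1=16 (the full first region, since δ2>0), δ2=48 (the full second region, since δ3>0) and δ3=36, so that the δi sum to H=100, with the corresponding binary variables b1=1, b2=1 and b3=0.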
This technique can be applied to audio coding, in particular to quantize elements of the component matrices H in step 504 of the process of
Accordingly, in one embodiment of the invention the data processing in step 504 includes the following steps for each element H of the 3D component matrix: i) providing the pre-defined sets of quantization ranges ai and their corresponding quantization levels qi, thereby defining a piecewise-linear function in accordance with equation (22); ii) generating an initial set of N variables δi in accordance with equations (23); iii) generating binary variables bi in accordance with equation (25); iv) using an integer programming algorithm to solve the optimization problem as stated in equations (28)-(30) to determine optimal δi and optimal binary variables bi; and v) computing the quantized component matrix value Hq using equation (24).
The integer streams 51 and 52 are then passed to the encoder 60, such as an arithmetic encoder known in the art for additional encoding to reduce size. The arithmetic encoding step is optional and may be omitted in some embodiments. The framer 65 then combines the encoded stream of quantized elements of the component matrices H with an encoded stream of quantized elements of the weight matrices, for example by concatenating, to form one frame of the output encoded audio signal 88. The encoded stream 88 may also contain side information when needed to decode the stream.
By way of example, Table 1 shows the reconstruction error at the decoder 6 for different values of the maximum number Q of quantization levels for an exemplary audio signal when using the QIP technique. The table shows that the error stays within the error bounds for Q=64, 128, 256, 512, and 1024 quantization levels. Note that when the maximum number of quantization levels Q is said to be 1024, it means that the values hk are between 0 and 1024. In the case of 32 levels, the optimization was not achieved. Note that with a scalar quantization the error bounds were not respected even with 1024 quantization levels. The term ‘Relative Minimal Difference from Upper Bound’ as used in the table means the minimum of the difference between the upper bound and the actual values, over all elements in the matrix H and/or W, divided by the value of the element itself. The term ‘Relative Minimal Difference from Lower Bound’ as used in the table means the maximum of the difference between the lower bound and the actual values, over all elements in the matrix H and/or W, divided by the value of the element itself.
Advantageously, we found that for a typical matrix H generated by the NMF2D engine, the required number of bits is reduced by 65% on average when the aforedescribed QIP technique is used compared to standard scalar quantization with 1024 quantization levels. In order to perform a fair comparison between the two approaches, we generated two bitstreams: one for the output of the scalar quantizer and one for the output of the QIP quantizer. We then arithmetic-coded both streams. If arithmetic coding, which is not a mandatory step, were not used for the standard scalar quantizer, its bitrate would be even higher. In fact, in the case where a fixed number of bits, i.e., 10 bits by way of example, is used for the scalar quantizer, the gain of the QIP approach is even higher and is around 70% on average. Furthermore, when the additional optimization cost function |H| is used in the QIP technique as disclosed hereinabove, and arithmetic coding is used in both cases, the gain in ‘bit-compactness’ is increased to 75% compared to the scalar quantization.
The audio encoding logic and operations described hereinabove may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof.
In one embodiment, DAP 5 may be implemented using one or more digital processors, and may be incorporated into a portable device such as a cellular phone including a smartphone, or generally in any digital audio recording device. Each of the functional blocks 25, 35, 53, 60 and 65 of DAP 5 of
One implementation of the invention provides an article of manufacture that comprises at least one of a hardware device having hardware logic and a computer readable storage medium including a computer program code embodied therein that is executable by a computer, said computer program code comprising instructions for performing some or all of the operations described hereinabove for encoding a digitized audio signal, said computer program code further comprising distinct software modules, the distinct software modules comprising a spikegram generating module, an NMF2D module, and a quantizer module. In one implementation, one or more of these modules are implemented with hardware logic using an ASIC and/or an FPGA.
The term “article of manufacture” as used herein refers to code or logic implemented in hardware logic, for example an integrated circuit chip, FPGA, ASIC, etc., or a computer readable medium, such as but not limited to magnetic storage medium, for example hard disk drives, floppy disks, tape, etc., optical storage, for example CD-ROMs, optical disks, etc., volatile and non-volatile memory devices, for example EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, firmware, programmable logic, etc. Code in the computer readable medium is accessed and executed by a processor. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the present invention, and that the article of manufacture may comprise any information bearing non-transient tangible medium known in the art.
Although embodiments of the invention have been described hereinabove with reference to audio signals, the aforedescribed sparse encoding with object extraction using NMF2D, and the adaptive quantization using the QIP technique, are also applicable to other types of signals such as video signals and signals representing images, and the application of the aforedescribed techniques to such signals is also intended to be within the scope of the present invention. For example, in the case of video or image signals, the overcomplete set G of kernels, such as Gabor kernels for images, may be defined for spatial coordinates (x, y) in the image, resulting in a 2D spikegram representing spatial coordinates (x, y). It should also be understood that each of the preceding embodiments of the present invention may utilize a portion of another embodiment.
Of course numerous other embodiments may be envisioned without departing from the spirit and scope of the invention.
The present invention claims priority from U.S. Provisional Patent Application No. 61/494,460 filed Jun. 8, 2011, which is incorporated herein by reference.