The present disclosure relates to sound coding, more specifically to speech/music classification and core encoder selection in, in particular but not exclusively, a multi-channel sound codec capable of producing a good sound quality for example in a complex audio scene at low bit-rate and low delay.
In the present disclosure and the appended claims:
Historically, conversational telephony has been implemented with handsets having only one transducer to output sound only to one of the user's ears. In the last decade, users have started to use their portable handset in conjunction with a headphone to receive the sound over their two ears mainly to listen to music but also, sometimes, to listen to speech. Nevertheless, when a portable handset is used to transmit and receive conversational speech, the content is still mono but presented to the user's two ears when a headphone is used.
With the newest 3GPP speech coding standard, EVS (Enhanced Voice Services) as described in Reference [1] of which the full content is incorporated herein by reference, the quality of the coded sound, for example speech and/or audio, that is transmitted and received through a portable handset has been significantly improved. The next natural step is to transmit stereo information such that the receiver gets as close as possible to a real life audio scene that is captured at the other end of the communication link.
In audio codecs, for example as described in Reference [2] of which the full content is incorporated herein by reference, transmission of stereo information is normally used.
For conversational speech codecs, mono signal is the norm. When a stereo sound signal is transmitted, the bit-rate is often doubled since both the left and right channels of the stereo sound signal are coded using a mono codec. This works well in most scenarios, but presents the drawbacks of doubling the bit-rate and failing to exploit any potential redundancy between the two channels (left and right channels of the stereo sound signal). Furthermore, to keep the overall bit-rate at a reasonable level, a very low bit-rate for each of the left and right channels is used, thus affecting the overall sound quality. To reduce the bit-rate, efficient stereo coding techniques have been developed and used. As non-limitative examples, two stereo coding techniques that can be efficiently used at low bit-rates are discussed in the following paragraphs.
A first stereo coding technique is called parametric stereo. Parametric stereo encodes two inputs (left and right channels) as mono signals using a common mono codec plus a certain amount of stereo side information (corresponding to stereo parameters) which represents a stereo image. The two inputs are down-mixed into a mono signal and the stereo parameters are then computed. This is usually performed in frequency domain (FD), for example in the Discrete Fourier Transform (DFT) domain. The stereo parameters are related to so-called binaural or inter-channel cues. The binaural cues (see for example Reference [3], of which the full content is incorporated herein by reference) comprise Interaural Level Difference (ILD), Interaural Time Difference (ITD) and Interaural Correlation (IC). Depending on the sound signal characteristics, stereo scene configuration, etc., some or all binaural cues are coded and transmitted to the decoder. Information about what binaural cues are coded and transmitted is sent as signaling information, which is usually part of the stereo side information. Also, a given binaural cue can be quantized using different coding techniques which results in a variable number of bits being used. Then, in addition to the quantized binaural cues, the stereo side information may contain, usually at medium and higher bit-rates, a quantized residual signal that results from the down-mixing. The residual signal can be coded using an entropy coding technique, e.g. an arithmetic encoder.
Another stereo coding technique is a technique operating in time-domain. This stereo coding technique mixes the two inputs (left and right channels) into so-called primary and secondary channels. For example, following the method as described in Reference [4], of which the full content is incorporated herein by reference, time-domain mixing can be based on a mixing ratio, which determines respective contributions of the two inputs (left and right channels) upon production of the primary and secondary channels. The mixing ratio is derived from several metrics, for example normalized correlations of the two inputs (left and right channels) with respect to a mono signal or a long-term correlation difference between the two inputs (left and right channels). The primary channel can be coded by a common mono codec while the secondary channel can be coded by a lower bit-rate codec. Coding of the secondary channel may exploit coherence between the primary and secondary channels and might re-use some parameters from the primary channel.
Further, in recent years, the generation, recording, representation, coding, transmission, and reproduction of audio have been moving towards an enhanced, interactive and immersive experience for the listener. The immersive experience can be described, for example, as a state of being deeply engaged or involved in a sound scene while sounds are coming from all directions. In immersive audio (also called 3D (Three-Dimensional) audio), the sound image is reproduced in all three dimensions around the listener, taking into consideration a wide range of sound characteristics like timbre, directivity, reverberation, transparency and accuracy of (auditory) spaciousness. Immersive audio is produced for a particular sound playback or reproduction system such as a loudspeaker-based system, an integrated reproduction system (sound bar) or headphones. Then, interactivity of a sound reproduction system may include, for example, an ability to adjust sound levels, change positions of sounds, or select different languages for the reproduction.
There exist three fundamental approaches to achieve an immersive experience.
A first approach to achieve an immersive experience is a channel-based audio approach using multiple spaced microphones to capture sounds from different directions, wherein one microphone corresponds to one audio channel in a specific loudspeaker layout. Each recorded channel is then supplied to a loudspeaker in a given location. Examples of channel-based audio approaches are stereo, 5.1 surround, 5.1+4, etc.
A second approach to achieve an immersive experience is a scene-based audio approach which represents a desired sound field over a localized space as a function of time by a combination of dimensional components. The sound signals representing the scene-based audio are independent of the positions of the audio sources while the sound field is transformed to a chosen layout of loudspeakers at the renderer. An example of scene-based audio is ambisonics.
The third approach to achieve an immersive experience is an object-based audio approach which represents an auditory scene as a set of individual audio elements (for example singer, drums, guitar, etc.) accompanied by information such as their position, so they can be rendered by a sound reproduction system at their intended locations. This gives the object-based audio approach a great flexibility and interactivity because each object is kept discrete and can be individually manipulated.
Each of the above described audio approaches to achieve an immersive experience presents pros and cons. It is thus common that, instead of only one audio approach, several audio approaches are combined in a complex audio system to create an immersive auditory scene. An example can be an audio system that combines scene-based or channel-based audio with object-based audio, for example ambisonics with a few discrete audio objects.
According to a first aspect, the present disclosure provides a two-stage speech/music classification device for classifying an input sound signal and for selecting a core encoder for encoding the sound signal, comprising: a first stage for classifying the input sound signal into one of a number of final classes; and a second stage for extracting high-level features of the input sound signal and for selecting the core encoder for encoding the input sound signal in response to the extracted high-level features and the final class selected in the first stage.
According to a second aspect, there is provided a two-stage speech/music classification method for classifying an input sound signal and for selecting a core encoder for encoding the sound signal, comprising: in a first stage, classifying the input sound signal into one of a number of final classes; and in a second stage, extracting high-level features of the input sound signal and selecting the core encoder for encoding the input sound signal in response to the extracted high-level features and the final class selected in the first stage.
The foregoing and other objects, advantages and features of a sound codec, including the two-stage speech/music classification device and method will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.
In the appended drawings:
In recent years, 3GPP (3rd Generation Partnership Project) started working on developing a 3D (Three-Dimensional) sound codec for immersive services called IVAS (Immersive Voice and Audio Services), based on the EVS codec (See Reference [5] of which the full content is incorporated herein by reference).
The present disclosure describes a speech/music classification technique and a core encoder selection technique in an IVAS coding framework. Both techniques are part of a two-stage speech/music classification method of which the result is core encoder selection.
Although the speech/music classification method and device are based on that in EVS (See Reference [6] and Reference [1], Section 5.1.13.6, of which the full content is incorporated herein by reference), several improvements and developments have been implemented. Also, the two-stage speech/music classification method and device are described in the present disclosure, by way of example only, with reference to an IVAS coding framework referred to throughout this disclosure as IVAS codec (or IVAS sound codec). However, it is within the scope of the present disclosure to incorporate such a two-stage speech/music classification method and device in any other sound codec.
The stereo sound processing and communication system 100 of
Still referring to
The left 103 and right 123 channels of the original analog stereo sound signal are supplied to an analog-to-digital (A/D) converter 104 for converting them into left 105 and right 125 channels of an original digital stereo sound signal. The left 105 and right 125 channels of the original digital stereo sound signal may also be recorded and supplied from a storage device (not shown).
A stereo sound encoder 106 codes the left 105 and right 125 channels of the original digital stereo sound signal thereby producing a set of coding parameters that are multiplexed under the form of a bit-stream 107 delivered to an optional error-correcting encoder 108. The optional error-correcting encoder 108, when present, adds redundancy to the binary representation of the coding parameters in the bit-stream 107 before transmitting the resulting bit-stream 111 over the communication link 101.
On the receiver side, an optional error-correcting decoder 109 utilizes the above mentioned redundant information in the received bit-stream 111 to detect and correct errors that may have occurred during transmission over the communication link 101, producing a bit-stream 112 with received coding parameters. A stereo sound decoder 110 converts the received coding parameters in the bit-stream 112 for creating synthesized left 113 and right 133 channels of the digital stereo sound signal. The left 113 and right 133 channels of the digital stereo sound signal reconstructed in the stereo sound decoder 110 are converted to synthesized left 114 and right 134 channels of the analog stereo sound signal in a digital-to-analog (D/A) converter 115.
The synthesized left 114 and right 134 channels of the analog stereo sound signal are respectively played back in a pair of loudspeaker units, or binaural headphones, 116 and 136. Alternatively, the left 113 and right 133 channels of the digital stereo sound signal from the stereo sound decoder 110 may also be supplied to and recorded in a storage device (not shown).
For example, the stereo sound encoder 106 of
1. Two-Stage Speech/Music Classification
As indicated in the foregoing description, the present disclosure describes a speech/music classification technique and a core encoder selection technique in an IVAS coding framework. Both techniques are part of the two-stage speech/music classification method (and corresponding device) the result of which is the selection of a core encoder for coding a primary (dominant) channel (in case of Time Domain (TD) stereo coding) or a down-mixed mono channel (in case of Frequency Domain (FD) stereo coding). The basis for the development of the present technology is the speech/music classification in the EVS codec (Reference [1]). The present disclosure describes modifications and improvements that were implemented therein and that are part of a baseline IVAS codec framework.
The first stage of the speech/music classification method and device in the IVAS codec is based on a Gaussian Mixture Model (GMM). The initial model, taken from the EVS codec, has been extended, improved and optimized for the processing of stereo signals.
In summary:
Referring to
The core encoder selection technique (second stage of the two-stage speech/music classification device and method) in the IVAS codec is built on top of the first stage of the two-stage speech/music classification device and method and delivers a final output to perform selection of the core encoder from ACELP (Algebraic Code-Excited Linear Prediction), TCX (Transform-Coded eXcitation) and GSC (Generic audio Signal Coder), as described in Reference [7], of which the full content is incorporated herein by reference. Other suitable core encoders can also be implemented within the scope of the present disclosure.
In summary:
Referring to
First, it should be mentioned that the GMM model is trained using an Expectation-Maximization (EM) algorithm on a large, manually labeled database of training samples. The database contains the mono items used in the EVS codec and some additional stereo items. The total size of the mono training database is approximately 650 MB. The original mono files are converted to corresponding dual mono variants before being used as inputs to the IVAS codec. The total size of the additional stereo training database is approximately 700 MB. The additional stereo database contains real recordings of speech signals from simulated conversations, samples of music downloaded from open sources on the internet and some artificially created items. The artificially created stereo items are obtained by convolving mono speech samples with pairs of real Binaural Room Impulse Responses (BRIRs). These impulse responses correspond to some typical room configurations, e.g. small office, seminar room, auditorium, etc. The labels for the training items are created semi-automatically using the Voice Activity Detection (VAD) information extracted from the IVAS codec; this is not optimal but frame-wise manual labeling is impossible given the size of the database.
2.1 State Machine for Signal Partitioning
Referring to
The concept of state machine in the first stage is taken from the EVS codec. No major modifications have been made to the IVAS codec. The purpose of the state machine 201 is to partition the incoming sound signal into one of four states, INACTIVE, ENTRY, ACTIVE and UNSTABLE.
The schematic diagram of
The INACTIVE state 401, indicative of background noise, is selected as the initial state.
The state machine 201 switches from the INACTIVE state 401 to the ENTRY state 402 when a VAD flag 403 (See Reference [1]) changes from “0” to “1”. In order to produce the VAD flag used by the first stage of the two-stage speech/music classification method and device, any VAD detector or SAD (Sound Activity Detection) detector may be utilized. The ENTRY state 402 marks the first onset or attack in the input sound signal after a prolonged period of silence.
After, for example eight frames 405 in the ENTRY state 402, the state machine 201 enters the ACTIVE state 404 which marks the beginning of a stable sound signal with sufficient energy (a given level of energy). If the energy 409 of the signal suddenly decreases while the state machine 201 is in the ENTRY state 402, the state machine 201 changes from the ENTRY state to the UNSTABLE state 407, corresponding to an input sound signal with a level of energy close to background noise. Also, if the VAD flag 403 changes from “1” to “0” while the state machine 201 is in the ENTRY state 402, the state machine 201 returns to the INACTIVE state 401. This ensures continuity of classification during short pauses.
If the energy 406 of the stable signal (ACTIVE state 404) suddenly drops closer to the level of background noise or the VAD flag 403 changes from “1” to “0”, the state machine 201 switches from the ACTIVE state 404 to the UNSTABLE state 407.
After a period of, for example, 12 frames 410 in the UNSTABLE state 407, the state machine 201 reverts to the INACTIVE state 401. If the energy 408 of the unstable signal suddenly increases or the VAD flag 403 changes from “0” to “1” while the state machine 201 is in the UNSTABLE state 407, the state machine 201 returns to the ACTIVE state 404. This ensures continuity of classification during short pauses.
In the following description, the current state of the state machine 201 is denoted ƒSM. The constants assigned to the individual states may be defined as follows:
INACTIVE ƒSM=−8
UNSTABLE ƒSM∈[−7, −1]
ENTRY ƒSM∈[0, 7]
ACTIVE ƒSM=+8
In the INACTIVE and ACTIVE states, ƒSM corresponds to a single constant whereas in the UNSTABLE and ENTRY states, ƒSM takes on multiple values depending on the progression of the state machine 201. Thus, in the UNSTABLE and ENTRY states, ƒSM may be used as a short-term counter.
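As a non-limitative illustration, the transitions described above can be sketched in Python as follows. The class name, the method names and the boolean inputs energy_drop and energy_rise are hypothetical: the text does not specify how a sudden change of energy is detected, so the sketch takes these decisions as inputs. The sketch uses a plain frame counter and does not attempt to reproduce the exact mapping of ƒSM values to the progression within the ENTRY and UNSTABLE states.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    ENTRY = "entry"
    ACTIVE = "active"
    UNSTABLE = "unstable"


ENTRY_FRAMES = 8      # example duration of the ENTRY state, in frames
UNSTABLE_FRAMES = 12  # example duration of the UNSTABLE state, in frames


class SignalStateMachine:
    """Hypothetical sketch of the four-state signal partitioning."""

    def __init__(self):
        self.state = State.INACTIVE  # INACTIVE is the initial state
        self.frames_in_state = 0
        self.prev_vad = 0

    def _goto(self, state):
        self.state = state
        self.frames_in_state = 0

    def step(self, vad, energy_drop=False, energy_rise=False):
        """Advance by one frame and return the state assigned to that frame."""
        vad_up = self.prev_vad == 0 and vad == 1
        vad_down = self.prev_vad == 1 and vad == 0
        self.prev_vad = vad
        self.frames_in_state += 1

        if self.state is State.INACTIVE:
            if vad_up:
                self._goto(State.ENTRY)
        elif self.state is State.ENTRY:
            if vad_down:
                self._goto(State.INACTIVE)
            elif energy_drop:
                self._goto(State.UNSTABLE)
            elif self.frames_in_state >= ENTRY_FRAMES:
                self._goto(State.ACTIVE)
        elif self.state is State.ACTIVE:
            if energy_drop or vad_down:
                self._goto(State.UNSTABLE)
        elif self.state is State.UNSTABLE:
            if energy_rise or vad_up:
                self._goto(State.ACTIVE)
            elif self.frames_in_state >= UNSTABLE_FRAMES:
                self._goto(State.INACTIVE)
        return self.state
```

With a constant VAD flag of 1, this sketch reports the ENTRY state for eight frames and then the ACTIVE state; a subsequent VAD flag of 0 leads to twelve frames in the UNSTABLE state before the return to the INACTIVE state.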
2.2 Onset/Attack Detector
Referring to
The onset/attack detector 202 and the corresponding onset/attack detection operation 252 are adapted to the purposes and functions of the speech/music classification of the IVAS codec. The objective comprises, in particular but not exclusively, localization of both the beginnings of speech utterances (attacks) and the onsets of musical clips. These events are usually associated with abrupt changes in the characteristics of the input sound signal. Successful detection of signal onsets and attacks after a period of signal inactivity allows a reduction of the impact of past information in the process of score smoothing (described herein below). The onset/attack detection logic plays a similar role as the ENTRY state 402 of
The relative frame energy Er may be computed as the difference between the frame energy in dB and the long-term average energy. The frame energy in dB may be computed using the following relation:
where ECB(i) are the average energies per critical band (See Reference [1]). The long-term average frame energy may be computed using the following relation:
Ēƒ=0.99Ēƒ+0.01Et
with initial value Ēƒ=45 dB. The relative frame energy may be calculated as
Er=Et−Ēƒ
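A minimal sketch of the long-term average and relative energy computation is given below. It assumes that the frame energy Et in dB is already available and that the long-term average is updated before the difference is taken; the order of these two steps is not stated above. The function name is hypothetical.

```python
def relative_energy_db(frame_energies_db, initial_long_term_db=45.0):
    """Return the relative frame energy Er for each frame energy Et (in dB)."""
    long_term = initial_long_term_db  # initial value of the long-term average
    relative = []
    for e_t in frame_energies_db:
        # Long-term average update: 0.99 * previous average + 0.01 * Et
        long_term = 0.99 * long_term + 0.01 * e_t
        # Relative energy: Er = Et - long-term average (assumed to be the updated one)
        relative.append(e_t - long_term)
    return relative
```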
The parameter used by the onset/attack detector 202 is a cumulative sum of differences between the relative energy of the input sound signal in a current frame and the relative energy of the input sound signal in a previous frame, updated in every frame. This parameter is initialized to 0 and updated only when the relative energy in the current frame, Er(n), is greater than the relative energy in the previous frame, Er(n−1). The onset/attack detector 202 updates the cumulative sum vrun(n) using, for example, the following relation:
vrun(n)=vrun(n−1)+(Er(n)−Er(n−1))
where n is the index of the current frame. The onset/attack detector 202 uses the cumulative sum vrun(n) to update a counter of onset/attack frames, vcnt. The counter of the onset/attack detector 202 is initialized to 0 and incremented by 1 in every frame in the ENTRY state 402 where vrun>5. Otherwise, it is reset to 0.
The output of the onset/attack detector 202 is a binary flag, ƒatt, which is set to 1 for example when 0<vcnt<3 to indicate detection of an onset/attack. Otherwise, this binary flag is set to 0 to indicate no detection of onset/attack. This can be expressed as follows:
ƒatt=1 if 0<vcnt<3, and ƒatt=0 otherwise
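The detection logic can be sketched as follows, under several assumptions that are not stated in the text: the flag ƒatt is derived from the counter of onset/attack frames vcnt (flag raised while 0<vcnt<3), the cumulative sum is reset to 0 in frames where the relative energy does not increase, and the first frame has no predecessor and therefore contributes no difference. The function name and its arguments are hypothetical.

```python
def onset_attack_flags(relative_energy_db, in_entry_state, run_threshold=5.0):
    """Return one binary onset/attack flag per frame.

    relative_energy_db: relative frame energies Er(n), one per frame
    in_entry_state:     booleans, True when the state machine is in ENTRY
    """
    v_run = 0.0     # cumulative sum of positive energy differences
    v_cnt = 0       # counter of onset/attack frames
    prev_er = None  # relative energy of the previous frame
    flags = []
    for er, entry in zip(relative_energy_db, in_entry_state):
        if prev_er is not None and er > prev_er:
            v_run += er - prev_er
        else:
            v_run = 0.0  # assumption: reset when the energy does not rise
        prev_er = er

        if entry and v_run > run_threshold:
            v_cnt += 1
        else:
            v_cnt = 0

        flags.append(1 if 0 < v_cnt < 3 else 0)
    return flags
```

Under these assumptions the flag is raised only in the first two frames of a sustained energy rise within the ENTRY state, which limits the detection to the very beginning of the onset.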
The operation of the onset/attack detector 202 is demonstrated, as a non-limitative example, by the graph of
2.3 Feature Extractor
Referring to
In the training stage of the GMM model, the training samples are resampled to 16 kHz, normalized to −26 dBov (dBov is a dB level relative to the overload point of the system) and concatenated. Then, the resampled and concatenated training samples are fed to the encoder of the IVAS codec to collect features using the feature extractor 203. For the purpose of feature extraction, the IVAS codec may be run in a FD stereo coding mode, TD stereo coding mode or any other stereo coding mode and at any bit-rate. As a non-limitative example, the feature extractor 203 is run in a TD stereo coding mode at 16.4 kbps. The feature extractor 203 extracts the following features used in the GMM model for speech/music/noise classification:
With the exception of the MFCC feature, all of the above features are already present in the EVS codec (See Reference [1]).
The feature extractor 203 uses the open-loop pitch TOL and the voicing measure
The MFCC feature is a vector of Nmel values corresponding to mel-frequency cepstral coefficients, which are the results of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale (See Reference [8], of which the full content is incorporated herein by reference).
The calculation of the last two features Pdiff and Psta uses, for example, the normalized per-bin power spectrum, P̃k, defined as
where Pk is the per-bin power spectrum in the current frame calculated in the IVAS spectral analysis routine (See Reference [1]). The normalization is performed in the range [klow, khigh]=[3, 70] corresponding to the frequency range of 150-3500 Hz.
The power spectrum difference, Pdiff, may be defined as
where the index (n) has been added to denote the frame index explicitly.
The spectral stationarity feature, Psta, may be calculated from the sum of ratios of the normalized per-bin power spectrum and the power differential spectrum, using the following relation:
The spectral stationarity is generally higher in frames containing frequency bins with higher amplitude and smaller spectrum difference at the same time.
2.4 Outlier Detector Based on Individual Feature Histograms
Referring to
The GMM model is trained on vectors of the features collected from the IVAS codec on the large training database. The accuracy of the GMM model is affected to a large extent by the statistical distribution of the individual features. Best results are achieved when features are distributed normally, for example when X∼N(μ, σ), where N represents a normal distribution having a mean μ and a variance σ.
The GMM model can represent to some extent features with non-normal distribution. If the value of one or more features is significantly different from its mean value, the vector of features is determined as an outlier. Outliers usually lead to incorrect probability estimates. Instead of discarding the vector of features, it is possible to replace the outlier features, for example with feature values from the previous frame, an average feature value across a number of previous frames, or by a global mean value over a significant number of previous frames.
The detector 204 detects outliers in the first stage 200 of the two-stage speech/music classification device on the basis of the analysis of individual feature histograms (See for example
where H(i) is the feature histogram normalized such that max(H(i))=1, i is a frequency bin index ranging from 0 to I=500 bins, and imax is the bin containing a maximum value of the histogram for this feature. The threshold thrH is set to 1e−4. This specific value for the threshold thrH has the following explanation. If the true statistical distribution of the feature was normal with zero mean μ and variance σ, it could be re-scaled such that its maximum value was equal to 1. In that case the probability density function (PDF) could be expressed as
By substituting ƒxs(x|0,σ²) with the threshold thrH and rearranging the variables, the following relation is obtained:
x=σ√(−2 log(thrH))
For thrH=1e−4 the following is obtained:
x≅2.83σ
Thus, applying the threshold of 1e−4 leads to trimming the probability density function to the range of ±2.83σ around the mean value, provided that the distribution is normal and scaled such that the probability density function satisfies ƒxs(0|0,σ²)=1. The probability that a feature value lies outside the trimmed range is given by, for example, the following relation:
where erf(.) is the Gauss error function known from the theory of statistics.
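For a normally distributed variable, the probability of falling outside a range of ±c standard deviations around the mean equals 1−erf(c/√2). The short sketch below evaluates this standard expression for c=2.83; the function name is hypothetical.

```python
import math


def probability_outside(c):
    """Probability that a normal variable lies more than c standard
    deviations away from its mean (two-sided tail)."""
    return 1.0 - math.erf(c / math.sqrt(2.0))


p_outlier = probability_outside(2.83)
print(f"{100.0 * p_outlier:.2f} %")
```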
If the variance of the feature values was σ=1, then the percentage of the detected outliers would be approximately 0.47%. The above calculations are only approximate since the true distribution of feature values is not normal. This is illustrated by the histogram of the non-stationarity feature nsta in
The lower Hlow and the upper Hhigh bounds are calculated for each feature used by the first stage 250/200 of the two-stage speech/music classification method and device and stored in the memory of the IVAS codec. When running the encoder of the IVAS codec, the outlier detector 204 compares the value Xj(n) of each feature j in the current frame n against the bounds Hlow and Hhigh of that feature, and marks each feature j having a value lying outside of the range defined between the lower and upper bounds as an outlier feature. This can be expressed as
ƒodv(j)=1 if Xj(n)<Hlow or Xj(n)>Hhigh, and ƒodv(j)=0 otherwise, for j=1, . . . ,F
where F is the number of features. The outlier detector 204 comprises a counter (not shown) of outlier features, codv, representing the number of detected outliers, using, for example, the following relation:
codv=ƒodv(1)+ƒodv(2)+ . . . +ƒodv(F)
If the number of outlier features is equal to or higher than, for example, 2, then the outlier detector 204 sets a binary flag, ƒout, to 1. This can be expressed as follows:
ƒout=1 if codv≥2, and ƒout=0 otherwise
The flag ƒout is used for signaling that the vector of the features is an outlier. If the flag ƒout is equal to one, then the outlier features Xj(n) are replaced, for example, with the values from the previous frame, as follows:
Xj(n)=Xj(n−1) for j=1, . . . ,F if ƒodv(j)=1
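The outlier handling described above can be sketched as follows. The function name and argument names are hypothetical; the per-feature bounds are assumed to have been computed beforehand and are passed in as lists.

```python
def replace_outliers(x, x_prev, h_low, h_high, min_outliers=2):
    """Detect outlier features in the current frame and, if the feature
    vector as a whole is flagged, replace them with previous-frame values.

    Returns the (possibly modified) feature vector and the flag f_out.
    """
    # Per-feature outlier marks: 1 when the value lies outside its bounds
    f_odv = [1 if (v < lo or v > hi) else 0
             for v, lo, hi in zip(x, h_low, h_high)]
    c_odv = sum(f_odv)                        # number of outlier features
    f_out = 1 if c_odv >= min_outliers else 0

    if f_out:
        # Only the features marked as outliers are replaced
        x = [prev if mark else v for v, prev, mark in zip(x, x_prev, f_odv)]
    return list(x), f_out
```

Note that, with the example threshold of 2, a single out-of-range feature leaves the feature vector untouched.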
2.5 Short-Term Feature Vector Filter
Referring to
The speech/music classification accuracy is improved with feature vector smoothing. This can be performed by applying the following short-term Infinite Impulse Response (IIR) filter used as the short-term feature vector filter 205:
X̃j(n)=αmX̃j(n−1)+(1−αm)Xj(n) for j=1, . . . ,F
where X̃j(n) represents the short-term filtered features in frame n and αm=0.5 is a so-called forgetting factor.
Feature vector smoothing (operation 255 of filtering a short-term feature vector) is not performed in frames in the ENTRY state 402 of
X̃j(n)=Xj(n) for j=1, . . . ,F
In the following description, the original symbol for feature values Xj(n) is used instead of X̃j(n), i.e. it is assumed that
Xj(n)←X̃j(n) for j=1, . . . ,F
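A minimal sketch of this smoothing step is given below, assuming a forgetting factor of 0.5 and a bypass of the filter in frames in the ENTRY state. The function name and argument names are hypothetical.

```python
def smooth_features(x, x_smoothed_prev, in_entry_state, alpha=0.5):
    """First-order IIR smoothing of a feature vector.

    x:               feature values of the current frame
    x_smoothed_prev: filtered feature values of the previous frame
    in_entry_state:  True when the state machine is in the ENTRY state
    """
    if in_entry_state:
        # No smoothing at onsets: the filter memory is overwritten
        return list(x)
    return [alpha * prev + (1.0 - alpha) * cur
            for cur, prev in zip(x, x_smoothed_prev)]
```

Bypassing the filter in the ENTRY state keeps features from the preceding inactive segment from leaking into the first frames of a new active segment.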
2.6 Non-Linear Feature Vector Transformation (Box-Cox)
Referring to
As shown by the histograms of
where λ is the exponent of the power transform which varies from −5 to +5 (See
where N is the number of samples of the feature in the training database.
During the training process, the non-linear feature vector transformer 206 considers and tests all values of the exponent λ to select an optimal value of the exponent λ based on a normality test. The normality test is based on D'Agostino and Pearson's method as described in Reference [10], of which the full content is incorporated herein by reference, combining skew and kurtosis of the probability distribution function. The normality test produces the following skew and kurtosis measure rsk (S-K measure):
rsk=s²+k²
where s is the z-score returned by the skew test and k is the z-score returned by the kurtosis test. See Reference [11], of which the full content is incorporated herein by reference, for details about the skew test and the kurtosis test.
The normality test also returns a two-sided chi-squared probability for null hypothesis, i.e. that the feature values were drawn from a normal distribution. The optimal value of the exponent λ minimizes the S-K measure. This can be expressed by the following relation:
where the subscript j means that the above minimization process is done for each individual feature j=1, . . . , F.
In the encoder, the non-linear feature vector transformer 206 applies the Box-Cox transformation only to selected features satisfying the following condition related to the S-K measure:
where rsk(j) is the S-K measure calculated on the jth feature before the Box-Cox transformation and rsk′(j) is the S-K measure after Box-Cox transformation with optimal value of exponent λj. The optimal exponent values, λj, and the associated biases, Δj, of the selected features are stored in the memory of the IVAS codec.
In the following description, the original symbol for feature values Xj(n) will be used instead of Xbox,j(n), i.e. it is assumed that
Xj(n)←Xbox,j(n) for the selected features
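For illustration, the standard Box-Cox power transform can be sketched as follows. The sketch assumes the textbook form of the transform, (y^λ−1)/λ for λ≠0 and log(y) for λ=0, applied to the biased value y=x+Δ, and assumes that the bias Δ is chosen so that y is strictly positive. The function name is hypothetical, and the selection of λ by means of the normality test is not shown.

```python
import math


def box_cox(x, lam, delta=0.0):
    """Box-Cox power transform of a single feature value.

    x:     feature value
    lam:   exponent of the power transform (lambda)
    delta: bias added to x so that the transformed input is positive
    """
    y = x + delta
    if y <= 0.0:
        raise ValueError("Box-Cox requires a strictly positive input")
    if lam == 0.0:
        return math.log(y)
    return (y ** lam - 1.0) / lam
```

For example, with λ=0.5 and Δ=5, a feature value of −1 is first shifted to 4 and then mapped to (√4−1)/0.5=2.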
2.7 Principal Component Analyzer
Referring to
After the operation 255 of short-term feature vector filtering and the operation 256 of non-linear feature vector transformation, the principal component analyzer 207 standardizes the feature vector by removing the mean of the features and scaling them to unit variance. For that purpose, the following relation can be used:
X́j(n)=(Xj(n)−μj)/sj
where X́j(n) represents the standardized feature, μj is the mean and sj the standard deviation of feature Xj across the training database and, as mentioned above, n represents the current frame.
The mean μj and the standard deviation sj of feature Xj may be calculated as follows:
with N representing the total number of frames in the training database.
In the following description, the original symbol for feature values Xj(n) will be used instead of X́j(n), i.e. it is assumed that:
Xj(n)←X́j(n) for n=1, . . . ,N
The principal component analyzer 207 then processes the feature vector using Principal Component Analysis (PCA) where the dimensionality is reduced, for example, from F=15 to FPCA=12. PCA is an orthogonal transformation to convert a set of possibly correlated features into a set of linearly uncorrelated variables called principal components (See Reference [12], of which the full content is incorporated herein by reference). In the speech/music classification method, the analyzer 207 transforms the feature vectors using, for example, the following relation:
Y(n)=WᵀX(n)
where X(n) is an F-dimensional column feature vector and W is an F×FPCA matrix of PCA loadings whose columns are the eigenvectors of Xᵀ(n)X(n), where the superscript T indicates vector transpose. The loadings are found by means of Singular Value Decomposition (SVD) of the feature samples in the training database. The loadings are calculated in the training phase only for active frames, for example in frames where the VAD flag is 1. The calculated loadings are stored in the memory of the IVAS codec.
In the following description, the original symbol for the vector of features X(n) will be used instead of Y(n), i.e. it is assumed that:
X(n)←Y(n)
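As a non-limitative illustration, the run-time part of the PCA step, i.e. the projection Y(n)=W^T X(n) with stored loadings, can be sketched as follows. The function name and the small 3×2 loading matrix are hypothetical stand-ins for the stored F×FPCA matrix; the training-phase SVD is not shown.

```python
def pca_project(x, loadings):
    """Compute Y = W^T X for an F-dimensional vector X and an F x F_PCA matrix W."""
    n_features = len(loadings)
    n_components = len(loadings[0])
    if len(x) != n_features:
        raise ValueError("feature vector length must match the number of rows of W")
    # Component k of Y is the dot product of column k of W with X.
    return [sum(loadings[i][k] * x[i] for i in range(n_features))
            for k in range(n_components)]


# Hypothetical loadings with orthonormal columns (F = 3, F_PCA = 2).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
y = pca_project([0.5, -1.5, 2.0], W)
```

With these toy loadings the projection simply keeps the first two features, which makes the reduction from F to FPCA dimensions easy to see.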
2.8 Gaussian Mixture Model (GMM)
Referring to
A multivariate GMM is parameterized by mixture component weights, component means and covariance matrices. The speech/music classification method uses three GMMs, each trained on its own training database, i.e. a “speech” GMM, a “music” GMM and a “noise” GMM. In a GMM with K components, each component has its own mean, μk, and its own covariance matrix, Σk. In the speech/music classification method the three (3) GMMs are fixed with K=6 components. The component weights are denoted ϕk, with the constraint that Σ_{k=1}^{K} ϕk=1 so that the probability distribution is normalized. The probability p(X) that a given feature vector X is generated by the GMM may be calculated using the following relation:
In the above relation, calculation of the exponential function exp( . . . ) is a complex operation. The parameters of the GMMs are calculated using an Expectation-Maximization (EM) algorithm. It is well known that an Expectation-Maximization algorithm can be used for latent variables (variables that are not directly observable and are actually inferred from the values of the other observed variables) in order to predict their values with the condition that the general form of probability distribution governing those latent variables is known.
To reduce the complexity of probability calculations, the above relation may be simplified by taking the logarithm of the inner term inside the summation term Σ, as follows:
The output of the above simplified formula is called the “score”. The score is an unbounded variable proportional to the log-likelihood. The higher the score, the higher the probability that a given feature vector was generated by the GMM. The score is calculated by the GMM calculator 208 for each of the three GMMs. The score scoreS(X) on the “speech” GMM and the score scoreM(X) on the “music” GMM are combined into a single differential score Δs(X) by calculating their difference, using, for example, the following relation:
Δs(X)=scoreM(X)−scoreS(X)
Negative values of the differential score are indicative that the input sound signal is a speech signal whereas positive values are indicative that the input sound signal is a music signal. It is possible to introduce a decision bias bs in the calculation of the differential score dlp(X, bs) by adding a non-negative value to the differential score, using the following relation:
dlp(X,bs)=scoreM(X)−scoreS(X)+bs
The value of the decision bias, bs, is found based on the ensemble of differential scores calculated on the training database. The process of finding the value of the decision bias bs can be described as follows.
Let Xt represent a matrix of the feature vectors from the training database. Let yt be a corresponding label vector. Let the values of ground-truth SPEECH frames in this vector be denoted as +1.0 and the values in the other frames as 0. The total number of ACTIVE frames in the training database is denoted as Nact.
The differential scores dlp(X, bs) may be calculated in the active frames in the training database after EM training, i.e. when the parameters of the GMM are known. It is then possible to predict labels ypred(n) in the active frames of the training database using, for example, the following relation:
ypred(n)=0.5·[sign[−1.0·dlp(X(n),bs=0)]+1.0]
where sign[.] is the signum function and dlp(X(n), bs=0) represents the differential scores calculated under the assumption of bs=0. The resulting values of the labels ypred(n) are equal to either +1.0, indicating SPEECH, or 0, indicating MUSIC or NOISE.
The accuracy of this binary predictor can be summarized with the following four statistical measures:
where Er is the relative frame energy, which is used as a sample weighting factor. The statistical measures have the following meaning: ctp is the number of true positives, i.e. the number of hits in the SPEECH class, cfp is the number of false positives, i.e. the number of incorrectly classified frames in the MUSIC class, ctn is the number of true negatives, i.e. the number of hits in the MUSIC/NOISE class, and cfn is the number of false negatives, i.e. the number of incorrectly classified frames in the SPEECH class.
The above-defined statistics may be used to calculate the true positive rate, commonly referred to as the recall:
TPR=ctp/(ctp+cfn)
and the true negative rate, commonly referred to as the specificity:
TNR=ctn/(ctn+cfp)
The recall TPR and the specificity TNR may be combined into a single number by taking the harmonic mean of TPR and TNR using the following relation:
The result is called the harmonic balanced accuracy.
A value of the decision bias bs may be found by maximizing the above defined harmonic balanced accuracy achieved with the labels/predictors ypred(n), where bs is selected from the interval (−2, 2) in successive steps. The spacing of candidate values for the decision bias is approximately logarithmic with higher concentration of values around 0.
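As a non-limitative illustration, the search for the decision bias can be sketched as follows. The weighted counts use the relative frame energy as a per-frame weight, which is an assumption about the exact form of the four statistics. The candidate grid (powers of two mirrored around 0) is a hypothetical example of an approximately logarithmic spacing inside the interval (−2, 2), and the toy scores, labels and weights are invented for the example.

```python
def sign(value):
    """Signum function returning -1, 0 or +1."""
    return (value > 0) - (value < 0)


def predict_label(dlp):
    """ypred = 0.5 * (sign(-dlp) + 1): 1.0 for SPEECH, 0.0 for MUSIC or NOISE."""
    return 0.5 * (sign(-1.0 * dlp) + 1.0)


def harmonic_balanced_accuracy(scores, labels, weights, bias):
    """Harmonic mean of recall (TPR) and specificity (TNR) for one candidate bias."""
    ctp = cfp = ctn = cfn = 0.0
    for score, label, weight in zip(scores, labels, weights):
        predicted_speech = predict_label(score + bias) == 1.0
        is_speech = label == 1.0
        if predicted_speech and is_speech:
            ctp += weight      # hit in the SPEECH class
        elif predicted_speech:
            cfp += weight      # MUSIC/NOISE frame classified as SPEECH
        elif is_speech:
            cfn += weight      # SPEECH frame classified as MUSIC/NOISE
        else:
            ctn += weight      # hit in the MUSIC/NOISE class
    tpr = ctp / (ctp + cfn) if ctp + cfn > 0.0 else 0.0
    tnr = ctn / (ctn + cfp) if ctn + cfp > 0.0 else 0.0
    return 2.0 * tpr * tnr / (tpr + tnr) if tpr + tnr > 0.0 else 0.0


def find_decision_bias(scores, labels, weights):
    """Pick the candidate bias that maximizes the harmonic balanced accuracy."""
    # Hypothetical grid: denser around 0, all values inside (-2, 2).
    magnitudes = [2.0 * 0.5 ** i for i in range(1, 8)]
    candidates = sorted([-m for m in magnitudes] + [0.0] + magnitudes)
    return max(candidates,
               key=lambda b: harmonic_balanced_accuracy(scores, labels, weights, b))


# Toy differential scores computed with bs = 0, ground-truth labels, energy weights.
scores = [-3.0, -1.0, -0.8, 2.0, 1.5, -0.3]
labels = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
weights = [1.0, 0.5, 1.0, 1.0, 1.0, 0.5]
best_bias = find_decision_bias(scores, labels, weights)
```

In this toy data one music frame has a slightly negative score, so a small positive bias separates the two classes perfectly while a zero bias does not.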
The differential score dlp(X, bs), calculated with the found value of the decision bias bs, is limited to the range of, for example, (−30.0, +30.0). The differential score dlp(X, bs) is reset to 0 when the VAD flag is 0, or when the total frame energy, Etot, is lower than 10 dB, or when the speech/music classification method is in the ENTRY state 402 and either ƒatt or ƒout is 1.
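As a non-limitative illustration, the limiting and reset rules for the differential score can be sketched as follows. The function and argument names are hypothetical; the thresholds are the example values given above.

```python
def finalize_differential_score(dlp, vad_flag, e_tot_db, in_entry_state,
                                f_att, f_out, limit=30.0):
    """Apply the reset conditions, then limit the score to (-limit, +limit)."""
    reset = (vad_flag == 0
             or e_tot_db < 10.0
             or (in_entry_state and (f_att == 1 or f_out == 1)))
    if reset:
        return 0.0
    return max(-limit, min(limit, dlp))


limited = finalize_differential_score(45.0, 1, 40.0, False, 0, 0)
silent = finalize_differential_score(-5.0, 1, 4.0, False, 0, 0)
entry_attack = finalize_differential_score(12.0, 1, 40.0, True, 1, 0)
```

Because 0 lies inside the limiting range, the order in which the reset and the limiting are applied does not change the result.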
2.9 Adaptive Smoother
Referring to
The adaptive smoother 209 comprises, for example, an adaptive IIR filter to smooth the differential score dlp(X, bs) for frame n, identified as dlp(n), from the GMM calculator 208. The adaptive smoothing, filtering operation 259 can be described using the following operation:
wdlp(n)=wght(n)·wdlp(n−1)+(1−wght(n))·dlp(n)
where wdlp(n) is the resulting smoothed differential score, wght(n) is a so-called forgetting factor of the adaptive IIR filter, and n represents the frame index.
The forgetting factor is a product of three individual parameters as shown in the following relation:
wght(n)=wrelE(n)·wdrop(n)·wrise(n)
The parameter wrelE(n) is linearly proportional to the relative energy of the current frame, Er(n), and may be calculated using the following relation:
The parameter wrelE(n) is limited, for example, to the interval (0.9, 0.99). The constants used in the relation above have the following interpretation. The parameter wrelE(n) reaches the upper threshold of 0.99 when the relative energy is higher than 15 dB. Similarly, the parameter wrelE(n) reaches the lower threshold of 0.9 when the relative energy is lower than −15 dB. The value of the parameter wrelE(n) influences the forgetting factor wght(n) of the adaptive IIR filter of smoother 209. Smoothing is stronger in energetically weak segments where it is expected that the features carry less relevant information about the input signal.
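As a non-limitative illustration, a linear mapping consistent with the interpretation above can be sketched as follows. The slope and offset are not taken from the relation itself; they are derived here from the two stated end points (0.9 at −15 dB and 0.99 at +15 dB), which is an assumption, and the function name is hypothetical.

```python
def w_rel_e(rel_energy_db):
    """Map the relative frame energy (dB) linearly onto the interval (0.9, 0.99)."""
    # Line through (-15 dB, 0.9) and (+15 dB, 0.99): slope 0.09 / 30 = 0.003.
    value = 0.945 + 0.003 * rel_energy_db
    return max(0.9, min(0.99, value))


low = w_rel_e(-20.0)
mid = w_rel_e(0.0)
high = w_rel_e(20.0)
```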
The parameter wdrop(n) is proportional to a derivative of the differential score dlp(n). First, a short-term mean dlpST(n) of the differential score dlp(n) is calculated using, for example, the following relation:
dlpST(n)=0.8·dlpST(n−1)+0.2·dlp(n)
The parameter wdrop(n) is set to 0 and is modified only in frames where the following two conditions are met:
dlp(n)<0
dlp(n)<dlpST(n)
Thus, the adaptive smoother 209 updates the parameter wdrop(n) only when the differential score dlp(n) has decreasing tendency and when it indicates that the current frame belongs to the SPEECH class. In the first frame, when the two conditions are met, and if dlpST(n)>0, the parameter wdrop(n) is set to
wdrop(n)=−dlp(n)
Otherwise, the adaptive smoother 209 steadily increases the parameter wdrop(n) using, for example, the following relation:
wdrop(n)=wdrop(n−1)+(dlpST(n−1)−dlp(n))
If the above defined two conditions are not true, the parameter wdrop(n) is reset to 0. Thus, the parameter wdrop(n) reacts to sudden drops of the differential score dlp(n) below the zero-level indicating potential speech onset. The final value of the parameter wdrop(n) is linearly mapped to the interval of, for example, (0.7, 1.0), as shown in the following relation:
Note that the value of wdrop(n) is “overwritten” in the formula above to simplify notation.
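As a non-limitative illustration, the update of the short-term mean and of the parameter wdrop(n), before the final linear mapping, can be sketched as follows. The function names are hypothetical. The sketch assumes that the "first frame" case is identified by the test dlpST(n)>0, that dlpST starts at 0, and it does not implement the final mapping to (0.7, 1.0), whose constants are not reproduced here.

```python
def update_short_term_mean(dlp_st_prev, dlp):
    """dlpST(n) = 0.8 * dlpST(n-1) + 0.2 * dlp(n)."""
    return 0.8 * dlp_st_prev + 0.2 * dlp


def update_w_drop(w_drop_prev, dlp, dlp_st, dlp_st_prev):
    """Track sudden drops of the differential score below zero (potential speech onset)."""
    if dlp < 0.0 and dlp < dlp_st:
        if dlp_st > 0.0:
            # Start of a drop: the short-term mean is still positive.
            return -dlp
        # Drop continues: accumulate the distance to the previous short-term mean.
        return w_drop_prev + (dlp_st_prev - dlp)
    # Conditions not met: reset.
    return 0.0


dlp_st = 0.0   # assumed initial state
w_drop = 0.0
trace = []
for dlp in [4.0, 4.0, -2.0, -4.0, 1.0]:
    dlp_st_prev = dlp_st
    dlp_st = update_short_term_mean(dlp_st_prev, dlp)
    w_drop = update_w_drop(w_drop, dlp, dlp_st, dlp_st_prev)
    trace.append(w_drop)
```

In this toy sequence the score drops from music-like to speech-like values in the third frame, keeps falling in the fourth, and recovers in the fifth, where the parameter is reset.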
The adaptive smoother 209 calculates the parameter wrise(n) in the same way as the parameter wdrop(n), with the difference that it reacts to sudden rises of the differential score dlp(n) indicating potential music onsets. The parameter wrise(n) is set to 0 and is modified only in frames where the following conditions are met:
ƒSM(n)=8 (ACTIVE)
dlpST(n)>0
dlpST(n)>dlpST(n−1)
Thus, the adaptive smoother 209 updates the parameter wrise(n) only in the ACTIVE state 404 of the input sound signal (See
In the first frame, when the above three (3) specified conditions are met, and if the short-term mean dlpST(n−1)<0, the third parameter wrise(n) is set to:
wrise(n)=−dlpST(n)
Otherwise, the adaptive smoother 209 steadily increases the parameter wrise(n) according to, for example, the following relation:
wrise(n)=wrise(n−1)+(dlpST(n)−dlpST(n−1))
If the above three (3) conditions are not true, the parameter wrise(n) is reset to 0. Thus, the third parameter wrise(n) reacts to sudden rises of the differential score dlp(n) above the zero-level indicating potential music onset. The final value of the parameter wrise(n) is linearly mapped to the interval of, for example, (0.95, 1.0), as follows:
Note that the value of the parameter wrise(n) is “overwritten” in the formula above to simplify notation.
The forgetting factor wght(n) of the adaptive IIR filter of the adaptive smoother 209 is decreased in response to strong SPEECH signal content or strong MUSIC signal content. For that purpose, the adaptive smoother 209 analyzes a long-term mean
In the ENTRY state 402 (
The expression rm2v(n) corresponds to a long-term standard deviation of the differential score. The forgetting factor wght(n) of the adaptive IIR filter of the adaptive smoother 209 is decreased in frames where rm2v(n)>15 using, for example, the following relation:
wght(n)←0.9·wght(n)
The final value of the forgetting factor wght(n) of the adaptive IIR filter of the adaptive smoother 209 is limited to the range of, for example, (0.01, 1.0). In frames where the total frame energy, Etot(n), is below 10 dB, the forgetting factor wght(n) is set to, for example, 0.92. This ensures proper smoothing of the differential score dlp(n) during silence.
The filtered, smoothed differential score, wdlp(n), is a parameter for categorical decisions of the speech/music classification method, as described below.
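As a non-limitative illustration, the final steps of the adaptive smoothing can be sketched as follows. The function names are hypothetical. The sketch takes the three already-mapped parameters as inputs, applies the rm2v(n) reduction, the limiting and the low-energy override, and then runs one step of the IIR filter. The reduction for strong SPEECH or MUSIC content based on the long-term mean is not implemented.

```python
def forgetting_factor(w_rel_e, w_drop, w_rise, rm2v, e_tot_db):
    """Combine the three parameters into the forgetting factor wght(n)."""
    wght = w_rel_e * w_drop * w_rise
    if rm2v > 15.0:
        wght = 0.9 * wght                # high long-term deviation of the score
    wght = max(0.01, min(1.0, wght))     # limit to (0.01, 1.0)
    if e_tot_db < 10.0:
        wght = 0.92                      # fixed value during silence
    return wght


def smooth_step(wdlp_prev, dlp, wght):
    """wdlp(n) = wght(n) * wdlp(n-1) + (1 - wght(n)) * dlp(n)."""
    return wght * wdlp_prev + (1.0 - wght) * dlp


wght_active = forgetting_factor(0.99, 1.0, 1.0, rm2v=20.0, e_tot_db=40.0)
wght_silence = forgetting_factor(0.99, 1.0, 1.0, rm2v=0.0, e_tot_db=5.0)
wdlp = smooth_step(0.0, 10.0, 0.9)
```

A forgetting factor close to 1 keeps most of the previous smoothed score, whereas a small factor lets the current frame dominate.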
2.10 State-Dependent Categorical Classifier
Referring to
The operation 260 is the final operation of the first stage 250 of the two-stage speech/music classification method and comprises a categorization of the input sound signal into the following three final classes:
In the above, numbers in parentheses are the numeric constants associated with the three final classes. The above set of classes is slightly different from the classes that have been discussed so far in relation to the differential score. The first difference is that the SPEECH class and the NOISE class are combined. This is to facilitate the core encoder selection mechanism (described in the following description) in which an ACELP encoder core is usually selected for coding both speech signals and background noise. The second difference is that a new class has been added to the set, namely the UNCLEAR final class. Frames falling into this category are usually found in speech segments with a high level of additive background music. The smoothed differential scores wdlp(n) of frames in class UNCLEAR are mostly close to 0.
Let dSMC(n) denote the final class selected by the state-dependent categorical classifier 210.
When the input sound signal is, in the current frame, in the ENTRY state 402 (See
where nENTRY marks the beginning (frame) of the ENTRY state 402 and ak(n−nENTRY) are the weights corresponding to the samples of dlp(n) in the ENTRY state. Thus, the number of samples used in the weighted average wdlpENTRY(n) ranges from 0 to 7 depending on the position of the current frame with respect to the beginning (frame) of the ENTRY state. This is illustrated in
If the absolute frame energy, Etot, is, in the current frame, lower than, for example, 10 dB, the state-dependent categorical classifier 210 sets the final class dSMC(n) to SPEECH/NOISE regardless of the differential score dlp(n). This is to avoid misclassifications during silence.
If the weighted average of differential scores in the ENTRY state wdlpENTRY(n) is less than, for example, 2.0, the state-dependent categorical classifier 210 sets the final class dSMC(n) to SPEECH/NOISE.
If the weighted average of differential scores in the ENTRY state wdlpENTRY(n) is higher than, for example, 2.0, the state-dependent categorical classifier 210 sets the final class dSMC(n) based on the non-smoothed differential score dlp(n) in the current frame. If dlp(n) is higher than, for example, 2.0, the final class is MUSIC. Otherwise, it is UNCLEAR.
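As a non-limitative illustration, the three ENTRY-state rules above can be sketched as follows. The function name is hypothetical and the classes are represented by strings rather than by their numeric constants. The behavior when wdlpENTRY(n) is exactly 2.0 is not specified above; the sketch treats that case like the "higher than" branch, which is an assumption.

```python
SPEECH_NOISE = "SPEECH/NOISE"
UNCLEAR = "UNCLEAR"
MUSIC = "MUSIC"


def classify_entry_frame(e_tot_db, wdlp_entry, dlp):
    """Select the final class for a frame in the ENTRY state."""
    if e_tot_db < 10.0:
        return SPEECH_NOISE      # silence: ignore the differential score
    if wdlp_entry < 2.0:
        return SPEECH_NOISE      # weighted average does not indicate music
    if dlp > 2.0:
        return MUSIC             # current non-smoothed score confirms music
    return UNCLEAR


quiet_frame = classify_entry_frame(5.0, 8.0, 8.0)
speech_frame = classify_entry_frame(40.0, -3.0, 4.0)
music_frame = classify_entry_frame(40.0, 5.0, 6.0)
mixed_frame = classify_entry_frame(40.0, 5.0, 0.5)
```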
In the other states (See
dSMC(n)=dSMC(n−1)
The decision can be changed by the state-dependent categorical classifier 210 if the smoothed differential score wdlp(n) crosses a threshold of class (See Table 3) that is different from the class selected in the previous frame. These transitions between classes are illustrated in
As mentioned herein above, the transitions between classes are driven not only by the value of the smoothed differential score wdlp(n) but also by the final classes selected in the previous frames. A complete set of rules for transitions between the classes is shown in the class transition diagram of
The arrows in
In
where Cnorm[k](n) is the normalized autocorrelation function in the current frame and the upper index [k] refers to the position of the half-frame window. The normalized autocorrelation function is computed as part of the open-loop pitch analysis module of the IVAS codec (See Reference [1], Section 5.1.11.3.2).
The short pitch flag ƒsp may be set in the pre-selected frames as follows
where
and
In
The parameter vrun(n) is defined in Section 2.2 (Onset/attack detection) of the present disclosure.
3. Core Encoder Selection
In the second stage 350/300 of the two-stage speech/music classification method and device, the final class dSMC(n) selected by the state-dependent categorical classifier 210 is “mapped” into one of the three core encoder technologies of the IVAS codec, i.e. ACELP (Algebraic Code-Excited Linear Prediction), GSC (Generic audio Signal Coding) or TCX (Transform-Coded eXcitation). This is referred to as the three-way classification. This does not guarantee that the selected technology will be used as core encoder since there exist other factors affecting the decision such as bit-rate or bandwidth limitations. However, for common types of input sound signals the initial selection of core encoder technology is used.
Besides the class dSMC(n) selected by the state-dependent categorical classifier 210 in the first stage, the core encoder selection mechanism takes into consideration some additional high-level features.
3.1 Additional High-Level Features Extractor
Referring to
In the first stage 200/250 of the two-stage speech/music classification device and method, most features are usually calculated on short segments (frames) of the input sound signal not exceeding 80 ms. This allows for a quick reaction to events such as speech onsets or offsets in the presence of background music. However, it also leads to a relatively high rate of misclassifications. The misclassifications are mitigated to some extent by means of the adaptive smoothing described in Section 2.9 above, but for certain types of signal this is not sufficient. Therefore, as part of the second stage 300/350 of the two-stage speech/music classification device and method, the class dSMC(n) can be altered in order to select the most appropriate core encoder technology for certain types of signal. To detect such types of signal, the additional high-level features extractor 301 calculates additional high-level features and/or flags, usually on longer segments of the input signal.
3.1.1 Long-Term Signal Stability
Long-term signal stability is a feature of the input sound signal that can be used for successful discrimination of vocal music from opera. In the context of core encoder selection, signal stability is understood as long-term stationarity of segments with high autocorrelation. The additional high-level features extractor 301 estimates the long-term signal stability feature based on the “voicing” measure,
In the equation above,
For higher robustness, the voicing parameter
corLT(n)=0.9·corLT(n−1)+0.1·
If the smoothed voicing parameter corLT(n) is sufficiently high and the variance corvar(n) of the voicing parameter is sufficiently low, then the input signal is considered as “stable” for the purposes of core encoder selection. This is measured by comparing the values, corLT(n) and corvar(n), to predefined thresholds and setting a binary flag using, for example, the following rules:
The binary flag, ƒSTAB(n), is an indicator of long-term signal stability and it is used in the core encoder selection discussed later in the present disclosure.
3.1.2 Segmental Attack Detection
The extractor 301 extracts the segmental attack feature from a number, for example 32, of short segments of the current frame n as illustrated in
In each segment, the additional high-level features extractor 301 calculates the energy Eata(k) using, for example, the following relation:
where s(n) is the input sound signal in the current frame n, k is the index of the segment, and i is the index of the sample in the segment. Attack position is then calculated as the index of the segment with the maximum energy, as follows:
The additional high-level features extractor 301 estimates the strength strata of the attack by comparing the mean (numerator of the below relation) of the energy Eata(k) of the input sound signal s(n) from the attack (segment k=kata) to the end (segment 31) of the current frame n against the mean (denominator of the below relation) of the energy Eata(k) of the input signal s(n) from the beginning (segment 0) to 3/4 (Segment 24) of the current frame n. This estimation of the strength strata is made using, for example, the following relation:
If the value strata is higher than, for example, 8, then the attack is considered strong enough and the segment kata is used as an indicator for signaling the position of the attack inside the current frame n. Otherwise, the indicator kata is set to 0, indicating that no attack was identified. Attacks are detected in this way only in GENERIC frame types, as signaled by the IVAS frame type selection logic (See Reference [1]). To reduce false attack detections, the energy Eata(kata) of the segment k=kata where the attack was identified is compared (str3_4(k)) to the energies Eata(k) of segments 2 to 21 in the earlier part of the current frame n, using, for example, the following relation:
If any of the comparison values str3_4(k) for segments k=2, . . . , 21, k≠kata, is less than, for example, 2, then kata is set to 0, indicating that no attack was identified. In other words, the energy of the segment containing the attack must be at least twice as high as the energy of each of the other segments 2 to 21 of the current frame.
The mechanism described above ensures that attacks are detected mainly toward the end of the current frame, which makes them suitable for encoding either with the ACELP technology or the GSC technology.
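As a non-limitative illustration, the attack detection for GENERIC frames can be sketched as follows. The names are hypothetical. The sketch assumes that the segmental energy is the sum of squared samples, that the reference window "from the beginning to 3/4" covers segments 0 to 23, and that the frame length is a multiple of 32; a small constant guards the division.

```python
NUM_SEGMENTS = 32
POS_3_4 = 24  # assumption: segments 0..23 form the reference window


def segment_energies(frame):
    """Energy (sum of squares) of each of the 32 equal-length segments."""
    seg_len = len(frame) // NUM_SEGMENTS
    return [sum(x * x for x in frame[k * seg_len:(k + 1) * seg_len])
            for k in range(NUM_SEGMENTS)]


def detect_attack_generic(energies, strength_thr=8.0, ratio_thr=2.0):
    """Return the attack segment index k_ata, or 0 when no attack is identified."""
    k_ata = max(range(NUM_SEGMENTS), key=lambda k: energies[k])
    tail = energies[k_ata:]            # from the attack to the end of the frame
    head = energies[:POS_3_4]          # reference window at the frame start
    mean_tail = sum(tail) / len(tail)
    mean_head = sum(head) / len(head)
    strength = mean_tail / (mean_head + 1e-12)
    if strength <= strength_thr:
        return 0
    # Reject the attack if any of segments 2..21 is more than half as energetic.
    for k in range(2, 22):
        if k != k_ata and energies[k_ata] < ratio_thr * energies[k]:
            return 0
    return k_ata


# Toy frame: 28 quiet segments followed by 4 loud ones (20 samples per segment).
frame = [0.01] * (28 * 20) + [1.0] * (4 * 20)
k_attack = detect_attack_generic(segment_energies(frame))
k_steady = detect_attack_generic(segment_energies([1.0] * 640))
```

Note that a return value of 0 means "no attack", so an attack located in segment 0 cannot be signaled with this convention.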
For unvoiced frames, classified as UNVOICED_CLAS, UNVOICED_TRANSITION or ONSET by the IVAS FEC classification module (See Reference [1]), the additional high-level features extractor 301 estimates the strength strata of the attack by comparing the energy Eata(kata) of the attack segment k=kata (numerator of the below relation) to the mean (denominator of the below relation) of the energy Eata(k) in the 32 segments preceding the attack, using, for example, the relation:
In the above relation, negative indices in the denominator refer to the values of segmental energies Eata(k) in the previous frame. If the strength strata, calculated with the formula above, is higher than, for example, 16 the attack is sufficiently strong and kata is used for signaling the position of the attack inside the current frame. Otherwise, kata is set to 0 indicating that no attack was identified. In case the last frame was classified as UNVOICED_CLAS by the IVAS FEC classification module, then the threshold is set to, for example, 12 instead of 16.
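As a non-limitative illustration, the strength test for unvoiced frames can be sketched as follows. The function name is hypothetical. The segmental energies of the previous frame are concatenated with those of the current frame so that the 32 segments preceding the attack can be addressed directly; the additional long-term energy condition is not implemented.

```python
def detect_attack_unvoiced(prev_energies, cur_energies, last_frame_unvoiced=False):
    """Return the attack segment index, or 0 when the attack is too weak."""
    k_ata = max(range(len(cur_energies)), key=lambda k: cur_energies[k])
    # Negative segment indices refer to the previous frame, hence the concatenation.
    history = list(prev_energies) + list(cur_energies)
    pos = len(prev_energies) + k_ata
    preceding = history[pos - 32:pos]
    mean_preceding = sum(preceding) / len(preceding)
    strength = cur_energies[k_ata] / (mean_preceding + 1e-12)
    # Lower threshold when the last frame was classified as UNVOICED_CLAS.
    threshold = 12.0 if last_frame_unvoiced else 16.0
    return k_ata if strength > threshold else 0


prev = [1.0] * 32
strong = [1.0] * 32
strong[10] = 20.0
moderate = [1.0] * 32
moderate[10] = 14.0

k_strong = detect_attack_unvoiced(prev, strong)
k_moderate = detect_attack_unvoiced(prev, moderate)
k_moderate_unvoiced = detect_attack_unvoiced(prev, moderate, last_frame_unvoiced=True)
```

The sketch requires at least 32 segmental energies from the previous frame.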
For unvoiced frames, classified as UNVOICED_CLAS, UNVOICED_TRANSITION or ONSET by the IVAS FEC classification module (See Reference [1]), there is another condition to be satisfied to consider the detected attack as sufficiently strong. The energy Eata(k) of the attack must be sufficiently high when compared to a long-term mean energy
with
is higher than 20. Otherwise, kata is set to 0 indicating that no attack was identified.
In the case an attack has been already detected in the previous frame, kata is reset to 0 in the current frame n preventing attack smearing effects.
For the other frame types (excluding UNVOICED and GENERIC as described above), the additional high-level features extractor 301 compares the energy Eata(kata) of the segment k=kata containing an attack against energies Eata(k) in the other segments in accordance with, for example, the following ratio:
and if any of the comparison values strother(k) for k=2, . . . , 21, k≠kata is lower than, for example, 1.3, then the attack is considered weak and kata is set to 0. Otherwise, segment kata is used for signaling the position of the attack inside the current frame.
Thus, the final output of the additional high-level features extractor 301 regarding segmental attack detection is the index k=kata of the segment containing the attack or kata=0. If the index is positive, an attack is detected. Otherwise, no attack is identified.
3.1.3 Signal Tonality Estimation
Tonality of the input sound signal in the second stage of the two-stage speech/music classification device and method is expressed as a tonality binary flag reflecting both spectral stability and harmonicity in the lower frequency range of the input signal up to 4 kHz. The additional high-level features extractor 301 calculates this tonality binary flag from a correlation map, Smap(n, k), which is a by-product of the tonal stability analysis in the IVAS encoder (See Reference [1]).
The correlation map is a measure of both signal stability and harmonicity. The correlation map is calculated from the first, for example, 80 bins of the residual energy spectrum in the logarithmic domain, EdB,res(k),k=0, . . . , 79 (See Reference [1]). The correlation map is calculated in segments of the residual energy spectrum where peaks are present. These segments are defined by the parameter imin (p) where p=1, . . . , Nmin is the segment index and Nmin is the total number of segments.
Let us define the set of indices belonging to a particular segment p as
PK(p)={i|i≥imin(p), and i<imin(p+1), and i<80}
Then, the correlation map may be calculated as follows
The correlation map Mcor(PK(p)) is smoothed with an IIR filter and summed across the bins in the frequency range k=0, . . . , 79 to yield a single number, using, for example, the following two relations:
where n denotes the current frame and k denotes the frequency bin. The weight β(n) used in the equation above is called the soft VAD parameter. It is initialized to 0 and may be updated in each frame as
β(n)=0.95·β(n−1)+0.05·ƒVAD(n)
where ƒVAD(n) is the binary VAD flag from the IVAS encoder (See Reference [1]). The weight β(n) is limited to the range of, for example, (0.05, 0.95). The extractor 301 sets the tonality flag ƒton by comparing Smass with an adaptive threshold, thrmass. The threshold thrmass is initialized to, for example, 0.65 and incremented or decremented in steps of, for example, 0.01 in each frame. If Smass is higher than 0.65, then the threshold thrmass is increased by 0.01, otherwise it is decreased by 0.01. The threshold thrmass is upper limited to, for example, 0.75 and lower limited to, for example, 0.55. This adds a small hysteresis to the tonality flag ƒton.
The tonality flag, ƒton, is set to 1 if Smass is higher than thrmass. Otherwise, it is set to 0.
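As a non-limitative illustration, the soft VAD update and the tonality flag with its adaptive threshold can be sketched as follows. The class and function names are hypothetical. The sketch assumes that the threshold is adapted before the comparison in each frame; the order is not specified above.

```python
def update_soft_vad(beta_prev, vad_flag):
    """beta(n) = 0.95 * beta(n-1) + 0.05 * fVAD(n), limited to (0.05, 0.95)."""
    beta = 0.95 * beta_prev + 0.05 * vad_flag
    return max(0.05, min(0.95, beta))


class TonalityDetector:
    """Tonality flag with an adaptive threshold that moves in steps of 0.01."""

    def __init__(self):
        self.thr_mass = 0.65

    def update(self, s_mass):
        # Move the threshold up or down depending on the fixed level 0.65.
        if s_mass > 0.65:
            self.thr_mass += 0.01
        else:
            self.thr_mass -= 0.01
        self.thr_mass = max(0.55, min(0.75, self.thr_mass))
        return 1 if s_mass > self.thr_mass else 0


detector = TonalityDetector()
flags_tonal = [detector.update(0.9) for _ in range(20)]
thr_after_tonal = detector.thr_mass
flags_noisy = [detector.update(0.3) for _ in range(40)]
thr_after_noisy = detector.thr_mass
beta = update_soft_vad(0.0, 0)
```

After a long tonal stretch the threshold saturates at its upper limit, and after a long non-tonal stretch at its lower limit.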
3.1.4 Spectral Peak-to-Average Ratio
Another high-level feature used in the core encoder selection mechanism is the spectral peak-to-average ratio. This feature is a measure of spectral sharpness of the input sound signal s(n). The extractor 301 calculates this high-level feature from the power spectrum of the input signal s(n) in logarithmic domain, SLT(n, k), k=0, . . . , 79, for example in the range from 0 to 4 kHz. The power spectrum SLT(n, k) is first smoothed with an IIR filter using, for example, the following relation:
where n denotes the current frame and k denotes the frequency bin. The spectral peak-to-average ratio is calculated using, for example, the following relation:
3.2 Core Encoder Initial Selector
Referring to
The initial selection of the core encoder by the selector 302 is based on (a) the relative frame energy Er, (b) the final class dSMC(n) selected in the first stage of the two-stage speech/music classification device and method and (c) the additional high-level features rp2a(n), Smass, and thrmass as described herein above. The selection mechanism used by the core encoder initial selector 302 is depicted in the schematic diagram of
Let dcore ∈{0, 1, 2} denote the core encoder technology selected by the mechanism in
3.3 Core Encoder Selection Refiner
Referring to
The core encoder selection refiner 303 may change the core encoder technology when dcore=1, i.e. when the GSC core encoder is initially selected for core coding. This situation can happen for example for musical items classified as MUSIC with low energy below 400 Hz. The affected segments of the input signal may be identified by analyzing the following energy ratio:
where Ebin(k), k=0, . . . , 127 is the power spectrum per frequency bin k of the input signal in linear domain and Etot is the total energy of the signal segment (frame).
The summation in the numerator extends over the first 8 frequency bins of the energy spectrum corresponding to a frequency range of 0 to 400 Hz. The core encoder selection refiner 303 calculates and analyzes the energy ratio ratLF in frames previously classified as MUSIC with a reasonably high accuracy. The core encoder technology is changed from GSC to ACELP under, for example, the following condition:
For signals with very short and stable pitch period, GSC is not the optimal core coder technology. Therefore, as a non-limitative example, when ƒsp=1, the core encoder selection refiner 303 changes the core encoder technology from GSC to ACELP or TCX as follows:
Highly correlated signals with low energy variation are another type of signals for which the GSC core encoder technology is not suitable. For these signals, the core encoder selection refiner 303 switches the core encoder technology from GSC to TCX. As a non-limitative example, this change of core encoder is made when the following conditions are met:
where TOL[0] is the absolute pitch value from the first half-frame of the open-loop pitch analysis (See Reference [1]) in current frame n.
Finally, in a non-limitative example, the core encoder selection refiner 303 may change the initial core encoder selection from GSC to ACELP in frames where an attack is detected, provided the following condition is fulfilled:
The flag ƒno_GSC is an indicator that the change of the core encoder technology is enabled.
The condition above ensures that this change of core encoder from GSC to ACELP happens only in segments with rising energy. If the condition above is fulfilled and, at the same time, a transition frame counter TCcnt has been set to 1 in the IVAS codec (Reference [1]), then the core encoder selection refiner 303 changes the core encoder to ACELP. That is:
Additionally, when the core encoder technology is changed to ACELP the frame type is set to TRANSITION. This means that the attack will be encoded with the TRANSITION mode of the ACELP core encoder.
If an attack is detected by the segmental attack detection procedure of the additional high-level features detection operation 351, as described in section 3.1.2 above, then the index (position) of this attack, kata, is further examined. If the position of the detected attack is in the last sub-frame of frame n, then the core encoder selection refiner 303 changes the core encoder technology to ACELP, for example when the following conditions are fulfilled:
Additionally, when the core encoder technology is changed to ACELP the frame type is set to TRANSITION and a new attack “flag” ƒata is set as follows
ƒata=kata+1
This means that the attack will be encoded with the TRANSITION mode of the ACELP core encoder.
If the position of the detected attack is not located in the last sub-frame but at least beyond the first quarter of the first sub-frame, then the core encoder selection is not changed and the attack will be encoded with the GSC core encoder. Similarly to the previous case, a new attack “flag” ƒata may be set as follows:
ƒata=kata+1 if ƒno_GSC=1 AND TCcnt≠1 AND kata>4
The parameter kata is intended to reflect the position of the detected attack, so the attack flag ƒata is somewhat redundant. However, it is used in the present disclosure for consistency with other documents and with the source code of the IVAS codec.
Finally, the core encoder selection refiner 303 changes the frame type from GENERIC to TRANSITION in speech frames for which the ACELP core coder technology has been selected during the initial selection. This situation happens only in active frames where the local VAD flag has been set to 1 and in which an attack has been detected by the segmental attack detection procedure of the additional high-level features detection operation 351, described in section 3.1.2, i.e. where kata>0.
The attack flag is then similar as in the previous situation. That is:
ƒata=kata+1
The IVAS codec, including the two-stage speech/music classification device may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The IVAS codec, including the two-stage speech/music classification device (identified as 1500 in
The input 1502 is configured to receive the input sound signal s(n), for example the left and right channels of an input stereo sound signal in digital or analog form in the case of the encoder of the IVAS codec. The output 1504 is configured to supply an encoded multiplexed bit-stream in the case of the encoder of the IVAS codec. The input 1502 and the output 1504 may be implemented in a common module, for example a serial input/output device.
The processor 1506 is operatively connected to the input 1502, to the output 1504, and to the memory 1508. The processor 1506 is realized as one or more processors for executing code instructions in support of the functions of the various elements and operations of the above described IVAS codec, including the two-stage speech/music classification device and method as shown in the accompanying figures and/or as described in the present disclosure.
The memory 1508 may comprise a non-transient memory for storing code instructions executable by the processor 1506, specifically, a processor-readable memory storing non-transitory instructions that, when executed, cause a processor to implement the elements and operations of the IVAS codec, including the two-stage speech/music classification device and method. The memory 1508 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor 1506.
Those of ordinary skill in the art will realize that the description of the IVAS codec, including the two-stage speech/music classification device and method, is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed IVAS codec, including the two-stage speech/music classification device and method, may be customized to offer valuable solutions to existing needs and problems of encoding and decoding sound, for example stereo sound.
In the interest of clarity, not all of the routine features of the implementations of the IVAS codec, including the two-stage speech/music classification device and method are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the IVAS codec, including the two-stage speech/music classification device and method, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.
In accordance with the present disclosure, the elements, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or machine, and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.
Elements and processing operations of the IVAS codec, including the two-stage speech/music classification device and method as described herein, may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.
In the IVAS codec, including the two-stage speech/music classification device and method, the various processing operations and sub-operations may be performed in various orders and some of the processing operations and sub-operations may be optional.
Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.
The present disclosure mentions the following references, of which the full content is incorporated herein by reference:
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2021/050465 | 4/8/2021 | WO | |

Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2021/207825 | 10/21/2021 | WO | A |

Number | Name | Date | Kind |
---|---|---|---|
9015038 | Vaillancourt et al. | Apr 2015 | B2 |
9196249 | Konchitsky | Nov 2015 | B1 |
20030004720 | Garudadri | Jan 2003 | A1 |
20040196913 | Chakravarthy | Oct 2004 | A1 |
20100004926 | Neoran | Jan 2010 | A1 |
20110016077 | Vasilache | Jan 2011 | A1 |
20110202337 | Fuchs | Aug 2011 | A1 |
20130339035 | Chordia | Dec 2013 | A1 |
20140108020 | Sharma | Apr 2014 | A1 |
20140192987 | Van Dongen | Jul 2014 | A1 |
20150332667 | Mason | Nov 2015 | A1 |
20160155456 | Wang | Jun 2016 | A1 |
20160293175 | Atti | Oct 2016 | A1 |
20160379627 | Yassa | Dec 2016 | A1 |
20190392852 | Hijazi | Dec 2019 | A1 |
20200075033 | Hijazi | Mar 2020 | A1 |

Number | Date | Country |
---|---|---|
2017049397 | Mar 2017 | WO |

Entry |
---|
Panagiotakis, Costas, and Georgios Tziritas. “A speech/music discriminator based on RMS and zero-crossings.” IEEE Transactions on multimedia 7.1 (2005): 155-166. (Year: 2005). |
Bicego, Manuele, and Sisto Baldo. “Properties of the Box-Cox transformation for pattern classification.” Neurocomputing 218 (2016): 390-400. (Year: 2016). |
3GPP TS 26.445 V12 “Universal Mobile Telecommunications System (UMTS); LTE; EVS Codec Detailed Algorithmic Description”, v.12.0.0, Nov. 5, 2014, pp. 1-89. |
I.T. Jolliffe “Principal Component Analysis, Second Edition”, 2002, pp. 1-25. |
3GPP SA4 contribution S4-170749 “New WID on EVS Codec Extension for Immersive Voice and Audio Services”, SA4 meeting #94, Jun. 26-30, 2017, 4 sheets http://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_94/Docs/S4-170749.zip. |
Baumgarte et al., “Binaural cue coding—Part I: Psychoacoustic fundamentals and design principles,” IEEE Trans. Speech Audio Processing, vol. 11, No. 6, Nov. 2003, pp. 509-519. |
Box et al., “An analysis of transformations”, Journal of the Royal Statistical Society, Series B, 26, 1964, pp. 211-252. |
D'Agostino et al., “Tests for departure from normality”, Biometrika, vol. 60, No. 3, 1973, pp. 613-622. |
D'Agostino et al., “A suggestion for using powerful and informative tests of normality”, The American Statistician, vol. 44, No. 4, Nov. 1990, pp. 316-321. |
Malenovsky et al., “Two-stage speech/music classifier with decision smoothing and sharpening in the EVS codec,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, 2015, pp. 5718-5722. |
Neuendorf et al., “The ISO/MPEG Unified Speech and Audio Coding Standard—Consistent High Quality for All Content Types and at All Bit Rates”, J. Audio Eng. Soc., vol. 61, No. 12, Dec. 2013, pp. 956-977. |
Rao et al., “Speech Processing in Mobile Environments, Appendix A: MFCC features”, Springer International Publishing, 2014, pp. 103-121. |
Dietz et al., “Overview of the EVS codec architecture”, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 19, 2015, pp. 5698-5702. |

Number | Date | Country |
---|---|---|
20230215448 A1 | Jul 2023 | US |

Number | Date | Country |
---|---|---|
63010798 | Apr 2020 | US |