Devices such as notebooks, desktop computers, mobile telephones, tablets, and other such devices may include speakers or utilize headphones to reproduce sound. The sound emitted from such devices may be subject to various processes that modify the sound quality.
Features of the present disclosure are illustrated by way of example and not limitation in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent, however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but is not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
Loudness enhancement based on multiband range compression apparatuses, methods for loudness enhancement based on multiband range compression, and non-transitory computer readable media having stored thereon machine readable instructions to provide loudness enhancement based on multiband range compression are disclosed herein. The apparatuses, methods, and non-transitory computer readable media disclosed herein provide for intelligent multiband range compression using perfect reconstruction filterbanks, implementation of standards, such as the International Telecommunication Union (ITU)-R BS.1770 standard for loudness estimation, and neural-network trained parameters to control loudness and thus audio quality (e.g., timbre). Further, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for enhancement of spatial performance of an immersive audio rendering chain, while overcoming masking by high amplitude front-channel signals over surround channel signals.
With respect to spatial rendering of audio signals, devices such as notebooks, desktop computers, mobile telephones, tablets, and other such devices may include speakers or utilize headphones to reproduce sound. Such devices may utilize a high-quality audio reproduction to create an immersive experience for cinematic and music content. The cinematic content may be multichannel (e.g., 5.1, 7.1, etc., where 5.1 represents “five point one” and includes a six channel surround sound audio system, 7.1 represents “seven point one” and includes an eight channel surround sound audio system, etc.). Elements that contribute towards a high-quality audio experience may include the frequency response (e.g., bass extension) of speakers or drivers, and proper equalization to attain a desired spectral balance. Other elements that contribute towards a high-quality audio experience may include artifact-free loudness processing to accentuate masked signals and improve loudness, and spatial quality that reflects artistic intent for stereo music and multichannel cinematic content.
With respect to spatial rendering with speakers, dynamic range compression may be categorized under either single-frequency band compression or multi-frequency band compression. The single-frequency band compression may be simpler to implement but may include technical challenges with respect to limited control of spectral balance and audio quality. The multi-frequency band compression may present technical challenges with respect to implementation complexity, as overlapping band-pass filters may cause artifacts around band-edges during audio resynthesis, but may allow for increased control of audio quality through the use of multiple bands.
In order to address at least these technical challenges associated with implementation of single-frequency band compression and multi-frequency band compression, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide loudness enhancement based on multiband range compression. In this regard, the loudness enhancement based on multiband range compression may include implementation of a nested perfect reconstruction filterbank, with multirate processing of an audio signal, to enable sub-band compression to control audio quality, while ensuring that there are no artifacts around band-edges. The parameters with respect to multiband compression processing may be designed by implementing standards, such as the ITU-R BS.1770 technique with a reference pink noise stimulus. Further, interpolation and generalization using machine learning (e.g., a neural network) may be implemented to provide extensibility to arbitrary parameter settings.
For the apparatuses, methods, and non-transitory computer readable media disclosed herein, modules, as described herein, may be any combination of hardware and programming to implement the functionalities of the respective modules. In some examples described herein, the combinations of hardware and programming may be implemented in a number of different ways. For example, the programming for the modules may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the modules may include a processing resource to execute those instructions. In these examples, a computing device implementing such modules may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separately stored and accessible by the computing device and the processing resource. In some examples, some modules may be implemented in circuitry.
In some examples, the apparatus 100 may include or be provided as a component of a device such as a notebook, a desktop computer, a mobile telephone, a tablet, and other such devices. For the example of
Referring to
As disclosed herein with respect to
A specified loudness level determination module 112 is to ascertain a specified loudness level 114 for a device, such as the device 150.
A compression parameters determination module 116 is to determine, based on the trained learning model 110 (which may also be designated a machine learning model, and which may include, for example, a neural network with an input layer, a hidden layer, and an output layer, each layer comprising neurons), the compression parameters 104 for the specified loudness level 114. In this regard, the trained learning model 110 may be used to determine, for example, the compression parameters 104 for the specified loudness level 114. Alternatively, the trained learning model 110 may be used to determine, for example, one of the compression parameters (e.g., the compression threshold or the compression ratio) based on an input of the specified loudness level 114 and the other one of the compression parameters (e.g., the compression ratio or the compression threshold, respectively).
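By way of a non-limiting sketch, such a single-hidden-layer regressor may be trained and queried as follows. The training pairs, network size, and library choice below are illustrative assumptions, not part of the disclosure; in practice the pairs would come from the pink-noise/ITU measurements described herein.

```python
# Sketch: a single-hidden-layer network mapping (compression ratio,
# desired ITU loudness) to a compression threshold. The numeric
# training pairs are hypothetical placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical measurements: rows of (CR, ITU loudness in dB LKFS)
X = np.array([[1.0, 4.0], [2.0, 6.0], [4.0, 8.0], [8.0, 10.0], [20.0, 12.0]])
# Hypothetical thresholds (dB) observed to yield those loudness levels
y = np.array([-6.0, -10.0, -14.0, -18.0, -22.0])

model = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                     solver="lbfgs", max_iter=5000, random_state=0)
model.fit(X, y)

# Query: a threshold for CR = 4 and a desired loudness of 7 dB LKFS
theta = model.predict(np.array([[4.0, 7.0]]))[0]
print(f"predicted threshold: {theta:.2f} dB")
```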
A sub-band compression module 118 is to perform, based on the determined compression parameters 104, sub-band compression of an input audio signal 120 by processing the input audio signal using a perfect reconstruction filterbank 122. The sub-band compression module 118 is to generate an output audio signal 124 based on the performance of the sub-band compression of the input audio signal 120. The output audio signal may be applied to another device, or to the device 150, to generate the specified loudness level 114. The compression parameters may be determined in each of the sub-bands using the learning model and the ITU loudness standard, as disclosed herein.
According to an example, as disclosed herein with respect to
According to an example, as disclosed herein with respect to
According to an example, as disclosed herein with respect to Equations (5) to (7) discussed below, the sub-band compression module 118 is to perform the sub-band compression of the input audio signal 120 by applying a first predetermined gain value to a corresponding band of the input audio signal 120 as a function of a compression threshold and a knee-width of the determined compression parameters 104 for a first specified duration of the input audio signal 120. Further, the sub-band compression module 118 is to perform the sub-band compression of the input audio signal 120 by applying a second predetermined gain value to the corresponding band of the input audio signal 120 as a function of the compression threshold, a compression ratio of the determined compression parameters 104, and the knee-width for a second specified duration of the input audio signal 120.
Referring to
The immersive audio renderer 200 may be extended to accommodate next-generation audio formats (including channel/objects or pure object-based signals and metadata) as input to the immersive audio renderer 200.
The immersive audio renderer 200 may include a low-frequency extension 202 that performs a synthesis of non-linear terms of the low-pass audio signal in the side chain. Specifically, auditory-motivated filterbanks may filter the audio signal, the peak of the signal may be tracked in each filterbank, and the maximum peak over all peaks (or each of the peaks) may be selected for nonlinear term generation. The nonlinear terms for each filterbank output may then be band-pass filtered and summed into each of the channels to create the perception of low frequencies, as sketched below.
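By way of a non-limiting sketch (the filter orders, cutoff frequencies, and quadratic nonlinearity below are illustrative assumptions, not the exact design described herein), the nonlinear-term synthesis may be outlined as:

```python
# Sketch of the low-frequency extension idea: isolate the low band, track
# its peak, generate nonlinear (harmonic) terms, band-pass them back into
# the reproducible range, and mix them into the channel.
import numpy as np
from scipy.signal import butter, lfilter

fs = 48000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 40 * t)        # 40 Hz tone a small speaker cannot reproduce

b_lo, a_lo = butter(2, 120, btype="lowpass", fs=fs)
low = lfilter(b_lo, a_lo, x)                # isolate the low band

peak = np.max(np.abs(low))                  # peak tracking (block-wise here)
nl = np.sign(low) * low ** 2 / max(peak, 1e-9)  # simple quadratic harmonic generator

b_bp, a_bp = butter(2, [80, 400], btype="bandpass", fs=fs)
harmonics = lfilter(b_bp, a_bp, nl)         # keep only reproducible harmonics

y = x + harmonics                           # mix into the channel
```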
The immersive audio renderer 200 may include spatial synthesis and binaural downmix 204 where reflections and specified direction sounds may be mixed in prior to crosstalk cancellation. For example, the spatial synthesis and binaural downmix 204 may apply head-related transfer functions (HRTFs) to render virtual sources at desired angles (and distances). According to an example, the HRTFs may be perceptually-smoothed, and the perceptually-smoothed HRTFs may be for angles ±40° for the front left and front right sources (channels), 0° for the center, and ±110° for the left and right surround sources (channels), as sketched below.
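By way of a non-limiting sketch (the hrirs lookup and channel dictionary are hypothetical helpers; real HRIRs would be perceptually smoothed as described herein), the binaural downmix may be outlined as:

```python
# Sketch: binaural downmix by convolving each channel with the head-related
# impulse response (HRIR) pair for its virtual angle and summing.
import numpy as np

def binaural_downmix(channels, hrirs):
    """channels: dict angle -> mono signal; hrirs: dict angle -> (left, right) HRIR arrays."""
    n = max(len(sig) for sig in channels.values())
    m = max(max(len(h[0]), len(h[1])) for h in hrirs.values())
    left = np.zeros(n + m - 1)
    right = np.zeros(n + m - 1)
    for angle, sig in channels.items():
        hl, hr = hrirs[angle]
        yl, yr = np.convolve(sig, hl), np.convolve(sig, hr)
        left[: len(yl)] += yl
        right[: len(yr)] += yr
    return left, right

# Example usage with fronts at ±40°, center at 0°, surrounds at ±110°:
# channels = {-40: fl, 40: fr, 0: c, -110: ls, 110: rs}
# left, right = binaural_downmix(channels, hrirs)
```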
The immersive audio renderer 200 may include a crosstalk canceller 206 to perform crosstalk cancellation on a spatial synthesized audio signal, as described in further detail with respect to
The immersive audio renderer 200 may include multiband range compression 208 that performs multiband compression, for example, by using perfect reconstruction (PR) filterbanks, an ITU loudness model, and a neural network to generalize to arbitrary multiband dynamic range compression (DRC) parameter settings, as described in further detail herein.
With respect to the crosstalk canceller 206,
The crosstalk canceller 206 may be used to perform equalization of the ipsilateral signals (loudspeaker to same-side ear) and to cancel out contralateral crosstalk (loudspeaker to opposite-side ear).
Referring to
For the example layout of the crosstalk canceller and the binaural acoustic transfer function of
Crosstalk cancellation with respect to the crosstalk canceller 206 may be based on perceptual smoothing of HRTFs, insertion of an inter-aural time difference, and time-domain inversion of a regularized matrix determined from the perceptually smoothed HRTFs. For example, a constrained inversion of the perceptually smoothed HRTFs may be performed through the use of regularization, and validation of a condition number of a regularized matrix before inversion. In this regard, a tradeoff may be achieved, for example, by analyzing the condition number with respect to an objective cancellation performance, a subjective audio quality, and robustness to head-movements.
For the crosstalk canceller 206, a perceptual smoothing may be performed on HRTFs corresponding to ipsilateral and contralateral transfer paths of sound emitted from first and second speakers (e.g., see
An inter-aural time difference may be inserted in the perceptually smoothed HRTFs corresponding to the contralateral transfer paths. According to an example, the inter-aural time difference may be determined as a function of a head radius of the user, and an angle of one of the speakers (e.g., see
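One common formulation of such an inter-aural time difference, offered here for illustration as an assumption (the disclosure does not fix the exact formula), is the Woodworth approximation

$$\mathrm{ITD}(\theta) = \frac{r}{c}\,(\theta + \sin\theta)$$

where $r$ is the head radius, $\theta$ is the speaker angle in radians, and $c$ is the speed of sound.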
The crosstalk canceller 206 may be generated by inverting the perceptually smoothed HRTFs corresponding to the ipsilateral transfer paths and the perceptually smoothed HRTFs corresponding to the contralateral transfer paths including the inserted inter-aural time difference. According to an example, the crosstalk canceller 206 may be generated by performing a time-domain inversion of a regularized matrix determined from the perceptually smoothed HRTFs corresponding to the ipsilateral transfer paths and the perceptually smoothed HRTFs corresponding to the contralateral transfer paths including the inserted inter-aural time difference. In this regard, a time-domain matrix may be determined from the perceptually smoothed HRTFs corresponding to the ipsilateral transfer paths and the perceptually smoothed HRTFs corresponding to the contralateral transfer paths including the inserted inter-aural time difference. Further, a regularization term (e.g., β) may be determined to control inversion of the time-domain matrix, and the time-domain matrix may be inverted based on the regularization term to generate the regularized matrix. Further, the regularization term to control the inversion of the time-domain matrix may be determined by comparing a condition number associated with a transpose of the time-domain matrix to a threshold (e.g., 100), and in response to a determination that the condition number is below the threshold, the time-domain matrix may be inverted based on the regularization term to generate the regularized matrix. Thus, the condition number of the regularized matrix may be validated prior to the performing of the time-domain inversion of the regularized matrix.
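By way of a non-limiting sketch (the matrix H below is a random stand-in for the actual time-domain matrix formed from the smoothed HRTFs; the β and threshold values reflect the examples above):

```python
# Sketch of regularized time-domain inversion for a crosstalk canceller:
# validate the condition number, then invert (H^T H + beta*I)^(-1) H^T.
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((128, 128))       # placeholder time-domain matrix
beta = 1e-3                               # regularization term
threshold = 100.0                         # condition-number threshold (e.g., 100)

regularized = H.T @ H + beta * np.eye(H.shape[1])
if np.linalg.cond(regularized) < threshold:
    C = np.linalg.inv(regularized) @ H.T  # canceller filters (matrix form)
else:
    # Ill-conditioned: raise beta (or re-derive H) before attempting inversion.
    C = np.linalg.inv(H.T @ H + 100 * beta * np.eye(H.shape[1])) @ H.T
```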
Referring again to
Referring to
With continued reference to
Referring to
For Equation (1), $F_0(z)$ represents a synthesis filter, $z$ represents the complex frequency variable ($z = e^{j\omega}$, with $\omega = 2\pi f/f_s$, where $f_s$ is the sampling frequency and $f$ is the frequency in Hz), and $H_1$ represents an analysis filter. For Equation (2), $F_1(z)$ represents another synthesis filter, and $H_0$ represents another analysis filter.
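While Equations (1) and (2) are set forth with reference to the figures, a standard two-channel perfect reconstruction filterbank commonly chooses the synthesis filters from the analysis filters so as to cancel aliasing; as a sketch of the conventional conditions (an assumption, consistent with the filter pairings just described):

$$F_0(z) = H_1(-z), \qquad F_1(z) = -H_0(-z)$$

with the distortion-elimination condition $F_0(z)H_0(z) + F_1(z)H_1(z) = 2z^{-d}$ for some integer delay $d$.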
Referring to
Referring to
For Equations (5) to (7), $y_{dB}(n)$ represents the gain-computer output level in dB, $x_{dB}(n)$ represents the dB value of the input signal (i.e., $20\log_{10}(x(n))$), $T$ represents the compression threshold, $W$ represents the knee-width, and $CR$ represents the compression ratio. Per Equations (5) to (7), the static compression characteristic may be expressed as:

$$y_{dB}(n) = x_{dB}(n), \quad \text{for } x_{dB}(n) < T - \tfrac{W}{2} \tag{5}$$

$$y_{dB}(n) = x_{dB}(n) + \frac{\left(\tfrac{1}{CR} - 1\right)\left(x_{dB}(n) - T + \tfrac{W}{2}\right)^{2}}{2W}, \quad \text{for } T - \tfrac{W}{2} \le x_{dB}(n) \le T + \tfrac{W}{2} \tag{6}$$

$$y_{dB}(n) = T + \frac{x_{dB}(n) - T}{CR}, \quad \text{for } x_{dB}(n) > T + \tfrac{W}{2} \tag{7}$$

Thus, per Equations (5) to (7), the value of $y_{dB}(n)$ varies with respect to the associated value of $x_{dB}(n)$.
Additionally, a make-up gain may be employed to bring the signal back up to 0 dB relative to full scale after compression, as sketched below.
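A minimal sketch of the static characteristic of Equations (5) to (7), followed by such a make-up gain (the threshold, ratio, and knee values are arbitrary example settings):

```python
# Sketch: static soft-knee gain computer per Equations (5)-(7), plus a
# make-up gain that restores a full-scale input to 0 dB after compression.
import numpy as np

def compress_db(x_db, T=-20.0, CR=4.0, W=10.0):
    """Map input level x_db (dB) to output level y_db per Equations (5)-(7)."""
    x_db = np.asarray(x_db, dtype=float)
    return np.where(
        x_db < T - W / 2,
        x_db,                                                         # Equation (5)
        np.where(
            x_db <= T + W / 2,
            x_db + (1 / CR - 1) * (x_db - T + W / 2) ** 2 / (2 * W),  # Equation (6)
            T + (x_db - T) / CR,                                      # Equation (7)
        ),
    )

x = np.array([-40.0, -22.0, -10.0, 0.0])   # example input levels in dB
makeup = -compress_db(0.0)                 # gain restoring 0 dBFS after compression
print(compress_db(x) + makeup)
```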
Referring to
According to an example, the smoothing may be performed with a low-pass filter with fast-attack and slow-release time constants on the gain signal. The time constants may be tuned on various signals to ensure that artifacts are inaudible. Alternatively, a machine learning technique may be utilized to adapt the time constants to ensure that there are no artifacts on arbitrary signals. According to an example, compression processing may be eliminated in the 18-24 kHz band, as this band is typically outside the hearing domain (viz., 20 Hz-20 kHz), and generally there are no signals reproduced at reasonably acceptable intensity for compression to be of benefit. Furthermore, transducers/loudspeakers may be highly directional and rolled off in this region. In this region, a delay of 218 samples may be applied. Furthermore, even the 12-18 kHz band may be constrained to include this delay to reduce computational needs. A sketch of such attack/release smoothing follows.
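A minimal sketch of such smoothing (the attack and release time constants below are illustrative, not the tuned values):

```python
# Sketch: one-pole low-pass smoothing of a gain signal with a fast attack
# (gain decreasing, i.e., more compression) and a slow release.
import numpy as np

def smooth_gain(g, fs=48000, attack_ms=5.0, release_ms=100.0):
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    out = np.empty_like(g)
    state = g[0]
    for n, gn in enumerate(g):
        a = a_att if gn < state else a_rel   # fast when gain drops, slow when it recovers
        state = a * state + (1.0 - a) * gn
        out[n] = state
    return out

# e.g., smooth a step in a per-band gain (dB) signal
g = np.concatenate([np.zeros(100), -12.0 * np.ones(400)])
g_smoothed = smooth_gain(g)
```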
With respect to tuning the model of the multiband range compression 208, pink noise at a reference level of −18 dBFS may be employed to determine the parameter set $\{\theta_{i,j}, CR_{i,j}\}$, where the band edges take the sequential values $i \in \{\mathrm{DC}, 6, 12, 18\}$ and $j \in \{6, 12, 18, 24\}$ (in kHz), i.e., the DC-6 kHz, 6-12 kHz, 12-18 kHz, and 18-24 kHz bands. Pink noise is the stimulus specified in the SOCIETY OF MOTION PICTURE AND TELEVISION ENGINEERS™ (SMPTE) standard for setting sound levels, as well as for setting and measuring levels when calibrating home theater receivers. Furthermore, pink noise (which includes a 3 dB/octave roll-off) reasonably matches the spectrum of general music and speech signals over a sufficiently long-duration window. Hence, pink noise may represent a suitable stimulus for loudness/level measurement and calibration, and may be used for measuring audio output levels. A sketch of generating such a reference follows.
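By way of a non-limiting sketch (1/√f spectral shaping is one of several standard pink-noise constructions; the disclosure does not fix the generator):

```python
# Sketch: generate pink noise (approximately -3 dB/octave) by shaping a
# white spectrum with 1/sqrt(f), then scale to a -18 dBFS RMS reference.
import numpy as np

def pink_noise(n, fs=48000, rms_dbfs=-18.0, seed=0):
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, 1.0 / fs)
    f[0] = f[1]                                # avoid dividing by zero at DC
    x = np.fft.irfft(spectrum / np.sqrt(f), n)
    target = 10.0 ** (rms_dbfs / 20.0)
    return x * target / np.sqrt(np.mean(x ** 2))

x = pink_noise(10 * 48000)                     # ten seconds of -18 dBFS pink noise
```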
According to an example, level measurements may be performed with a sound pressure level (SPL) meter set at C-weighting and slow integration time. This technique may be applied to compute levels with a simulated SPL-meter. Alternatively, the ITU-R BS.1770-4 technique may be employed to estimate loudness.
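A simplified, ungated sketch of such an estimate for a mono 48 kHz signal, using the K-weighting biquad coefficients published in the BS.1770 standard (BS.1770-4 additionally applies block gating and channel weighting, omitted here for brevity):

```python
# Sketch: ungated ITU-R BS.1770-style loudness - K-weighting (high-shelf
# pre-filter followed by an RLB high-pass) and mean-square integration.
import numpy as np
from scipy.signal import lfilter

def bs1770_loudness(x):
    # Stage 1: high-shelf pre-filter (48 kHz coefficients from the standard)
    b1 = [1.53512485958697, -2.69169618940638, 1.19839281085285]
    a1 = [1.0, -1.69065929318241, 0.73248077421585]
    # Stage 2: RLB high-pass (48 kHz coefficients from the standard)
    b2 = [1.0, -2.0, 1.0]
    a2 = [1.0, -1.99004745483398, 0.99007225036621]
    y = lfilter(b2, a2, lfilter(b1, a1, x))
    return -0.691 + 10.0 * np.log10(np.mean(y ** 2))

# e.g., measure the -18 dBFS pink-noise reference from the previous sketch:
# print(bs1770_loudness(pink_noise(10 * 48000)), "LKFS")
```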
Referring to
Referring to
Referring to
With continued reference to
For the Vandermonde matrix of the fit, $x$ represents the data points, $P$ represents the number of data points, and $k$ represents the order of the polynomial.
The resulting ITU values for each CR may be embedded in the vector $\mathbf{y} = (y_j)$, $j = 1, \ldots, 5$, where $y_1 = \mathrm{ITU}_{CR=1}$, $y_2 = \mathrm{ITU}_{CR=2}$, $y_3 = \mathrm{ITU}_{CR=4}$, etc., and $x_1 = 1$, $x_2 = 2$, $x_3 = 4$, etc. The least-squares inversion for obtaining $\mathbf{p} = (p_0, p_1, \ldots, p_4)$ is

$$\mathbf{p} = \left(V_{(1:5)}^{H} V_{(1:5)}\right)^{-1} V_{(1:5)}^{H} \mathbf{y} \tag{8}$$

For Equation (8), $\mathbf{p}$ represents the coefficients for the polynomial of order $P$, $H$ represents the Hermitian operator (complex-conjugate transpose), and $\mathbf{y}$ represents the vector of values mentioned above.
The results from the technique are shown in
$$\mathrm{ITU}_x = \sum_{k=0}^{4} p_k x^k \tag{9}$$

For Equation (9), $\mathrm{ITU}_x$ represents the ITU level interpolated for compression ratio $x$. In this regard, the polyfit function in MATLAB™ may be used to generate the $p_k$ for Equation (9).
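An equivalent sketch in NumPy (the (CR, ITU) pairs below are hypothetical placeholders for the measured values):

```python
# Sketch: polynomial fit and interpolation per Equations (8) and (9),
# the NumPy equivalent of MATLAB's polyfit/polyval.
import numpy as np

cr = np.array([1.0, 2.0, 4.0, 8.0, 20.0])    # compression ratios x_j
itu = np.array([4.1, 6.0, 7.9, 9.2, 10.1])   # hypothetical measured ITU levels y_j

p = np.polyfit(cr, itu, 4)                   # least-squares fit of order 4
itu_at_cr3 = np.polyval(p, 3.0)              # ITU level interpolated at CR = 3
```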
Thus, according to
According to an example, with respect to determining the loudness for specific devices, the loudspeaker (or headphone) response may be accounted for in the ITU-R loudness computing stage that returns the various levels as a function of CR and θ, as the ITU standard assumes a flat magnitude response.
Referring to
Referring to
With continued reference to
The single hidden layer neural network model may then be presented with data on which it was not trained. According to an example, the output reported by the single hidden layer neural network model may be θ=−11.73 dB as the threshold, with the input being a compression ratio CR=4 and a desired ITU loudness of 7 dB LKFS.
According to another example, the output reported by the single hidden layer neural network model may be θ=−17.4 dB as the threshold, with input being a compression ratio CR=3 and desired ITU loudness of 9 dB LKFS.
Both of the aforementioned results of the single hidden layer neural network model are acceptable when viewed with respect to
The processor 1702 of
Referring to
The processor 1702 may fetch, decode, and execute the instructions 1708 to train a learning model 110 based on the variations in the compression parameters 104 and the corresponding variations in the loudness levels.
The processor 1702 may fetch, decode, and execute the instructions 1710 to ascertain a specified loudness level 114 for a device.
The processor 1702 may fetch, decode, and execute the instructions 1712 to determine, based on the trained learning model 110, the compression parameters 104 for the specified loudness level.
The processor 1702 may fetch, decode, and execute the instructions 1714 to perform, based on the determined compression parameters 104, sub-band compression of an input audio signal 120 by processing the input audio signal 120 using a perfect reconstruction filterbank 122.
Referring to
At block 1804, the method may include determining, based on a trained learning model 110, compression parameters 104 for the specified loudness level 114. According to an example, the determined compression parameters 104 may include a compression threshold and a compression ratio, and the learning model 110 may be trained based on variations in a plurality of the compression parameters 104 and corresponding variations in loudness levels.
At block 1806, the method may include performing, based on the determined compression parameters 104, sub-band compression of an input audio signal 120 by processing the input audio signal 120 using a perfect reconstruction filterbank 122.
Referring to
The processor 1904 may fetch, decode, and execute the instructions 1908 to determine, based on a trained learning model 110, compression parameters 104 for the specified loudness level. According to an example, the determined compression parameters 104 may include a compression threshold and a compression ratio, and the learning model 110 may be trained based on variations in a plurality of the compression parameters 104 and corresponding variations in loudness levels.
The processor 1904 may fetch, decode, and execute the instructions 1910 to perform, based on the determined compression parameters 104, sub-band compression of an input audio signal 120 by processing the input audio signal 120 using a perfect reconstruction filterbank 122.
The processor 1904 may fetch, decode, and execute the instructions 1912 to generate an output audio signal 124 based on the performance of the sub-band compression of the input audio signal 120.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2017/030181 | 4/28/2017 | WO | 00 |