The following disclosure(s) are submitted under 35 U.S.C. 102(b)(1)(A):
DISCLOSURE(S): Deep Learning Based Voice Extraction And Primary-Ambience Decomposition For Stereo To Surround Upmixing, Ricardo Thaddeus Piez-Amaro, Carlos Tejeda-Ocampo, Ema Souza-Blanes, Sunil Bharitkar, and Luis Madrid-Herrera, 154th Convention, May 13-15, 2023, Espoo, Helsinki, Finland, pp 1-8.
A portion of the disclosure of this patent document may contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
One or more embodiments relate generally to multimedia content upmixing, and in particular, to a deep learning based upmixing using a strategy combining voice extraction and primary-ambience decomposition.
Surround systems have gained popularity in home entertainment even though most cinematic content is delivered in a two-channel stereo format. Although there are several upmixing options, it has proven challenging to deliver an upmixed signal that approximates the original directionality and timbre intended by the mixing artist.
One embodiment provides a computer-implemented method that includes determining directional sounds from a content mix using a machine learning unmixing model. The directional sounds are panned in an upmixed signal. Signal-dependent upmixing gains for specific frequency bins are computed on a frame-basis using a machine learning model for the upmixed signal. Dedicated voice clarity gains are computed using a hearing impairment model for multiple hearing-impaired profiles for achieving dialog enhancement. The signal dependent upmixing gains and voice clarity gains are transmitted as metadata with a downmixed signal representing the content mix.
Another embodiment includes a non-transitory processor-readable medium that includes a program that when executed by a processor performs dialog enhancement of extracted sources of an upmixed signal, including determining, by the processor, directional sounds from a content mix using a machine learning unmixing model. The processor pans the directional sounds in an upmixed signal. The processor further computes signal-dependent upmixing gains for specific frequency bins on a frame-basis using a machine learning model for the upmixed signal. The processor still further computes dedicated voice clarity gains using a hearing impairment model for multiple hearing-impaired profiles for achieving dialog enhancement. The signal dependent upmixing gains and voice clarity gains are transmitted as metadata with a downmixed signal representing the content mix.
Still another embodiment provides an apparatus that includes a memory storing instructions, and at least one processor executes the instructions including a process configured to determine directional sounds from a content mix using a machine learning unmixing model. The directional sounds are panned in an upmixed signal. Signal-dependent upmixing gains are computed for specific frequency bins on a frame-basis using a machine learning model for the upmixed signal. Dedicated voice clarity gains are computed using a hearing impairment model for multiple hearing-impaired profiles for achieving dialog enhancement. The signal dependent upmixing gains and voice clarity gains are transmitted as metadata with a downmixed signal representing the content mix.
These and other features, aspects and advantages of the one or more embodiments will become understood with reference to the following description, appended claims and accompanying figures.
For a fuller understanding of the nature and advantages of the embodiments, as well as a preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings, in which:
The following description is made for the purpose of illustrating the general principles of one or more embodiments and is not meant to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations. Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
A description of example embodiments is provided on the following pages. The text and figures are provided solely as examples to aid the reader in understanding the disclosed technology. They are not intended and are not to be construed as limiting the scope of this disclosed technology in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of this disclosed technology.
One or more embodiments relate generally to multimedia content upmixing, and in particular, to a deep learning based upmixing using a strategy combining voice extraction and primary-ambience decomposition. One embodiment provides a computer-implemented method that includes determining directional sounds from a content mix using a machine learning unmixing model. The directional sounds are panned in an upmixed signal. Signal-dependent upmixing gains for specific frequency bins are computed on a frame-basis using a machine learning model for the upmixed signal. Dedicated voice clarity gains are computed using a hearing impairment model for multiple hearing-impaired profiles for achieving dialog enhancement. The signal dependent upmixing gains and voice clarity gains are transmitted as metadata with a downmixed signal representing the content mix.
Conventional upmixing techniques typically exhibit phasiness when the listener is outside the sweet spot, and speech is degraded due to improper voice extraction from a complex mixture of sources. Conventional techniques do not support speech enhancement processing, do not perform well when the input channels are already uncorrelated, do not sound natural, are designed for a particular type of content (e.g., music), and are not suited to hearing-impaired listeners, whose impairment is typically due to age-related hearing loss. Additionally, high-frequency energy has traditionally been neglected in speech perception research and enhancement. One or more embodiments address this overlooked component of human perception to provide greater accessibility.
Multichannel surround home theatres have become more accessible to consumers. Most audiovisual content, however, remains in stereo format. Since playing stereo content on surround systems does not offer the best possible listening experience, upmixing techniques have been used to derive signals in surround formats (e.g., 5.1, 7.1, 7.1.4) from an original 2-channel mix. Upmixing is the process whereby audio content of m channels is mapped into n channels, where n>m. These n channels should be playable on a surround speaker setup and should provide a more immersive experience to the listener than plain stereo. Some embodiments include the Voice-Primary-Ambience Extraction Upmixing (VPA) methodology. In one or more embodiments, VPA focuses on upmixing from two to five channels. VPA can comprise three main blocks: vocal extraction, primary-ambience decomposition, and upmix rendering, together with a hearing model that generates frequency-dependent gains for one or several Hearing Impairment (HI) models.
Some embodiments employ: extraction of speech from a stereo signal; application of dialog enhancement; rendering of the speech to a center channel; time-frequency analysis of the voice-extracted signals; synthesis of frequency-dependent gains based on hearing loss profile(s); coding of the frequency-dependent gains as metadata to be sent with downmixed signals and voice/ambience upmixing parameters (e.g., along with, alongside, in conjunction with, in a same transmission, etc.); decoding and extraction of the metadata parameters based on a Hearing Impairment (HI) profile; application of the voice/speech frequency-dependent gains (viz., the metadata parameters) using a hearing loss profile; and identification of the hearing loss profile by the consumer (e.g., with a television (TV)/soundbar remote, a TV interface, etc.).
The output from the XML format process 115 and the stereo downmix 100 are processed to result in encoded metadata 120 and audio encoded 125, which results in a streaming low bitrate output 130. The streaming low bitrate output 130 is processed into decoded metadata 121 and audio decoded 126. A metadata extractor 135 extracts the decoded metadata 121 from the decoded audio stream 131 (resulting from the streaming low bitrate output 130) while the audio signals ({circumflex over (x)}1(n) and {circumflex over (x)}2(n)) from the audio decoded 126 and the gains (gvoice(1)(n, f), gvoice(2)(n, f), and g1(n, f) through gN(n, f)) are processed by upmixer 140. The output from the upmixer 140 is upmixed audio 145 (y1(n) to y5(n)). In some embodiments, dedicated frequency-dependent gains are derived for dialog based on different HI profiles. In some embodiments, the HI profiles may be tailored to specific languages.
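For illustration only, the following is a minimal sketch of how per-frame gains might be packed into XML-style metadata ahead of the metadata encoding 120; the element names, the per-channel and per-profile layout, and the pack_frame_metadata() helper are hypothetical and are not the specific format produced by the XML format process 115.

```python
import xml.etree.ElementTree as ET
import numpy as np

def pack_frame_metadata(frame_idx, upmix_gains, voice_gains, precision=4):
    """Pack one frame of gains into an XML element (illustrative layout only).

    upmix_gains: dict mapping channel index -> array g_i(n, f) over frequency bins
    voice_gains: dict mapping HI profile id -> array g_voice(n, f) over frequency bins
    """
    frame = ET.Element("frame", index=str(frame_idx))
    for ch, gains in upmix_gains.items():
        el = ET.SubElement(frame, "upmix_gain", channel=str(ch))
        el.text = " ".join(f"{g:.{precision}f}" for g in gains)
    for profile, gains in voice_gains.items():
        el = ET.SubElement(frame, "voice_clarity_gain", hi_profile=str(profile))
        el.text = " ".join(f"{g:.{precision}f}" for g in gains)
    return ET.tostring(frame, encoding="unicode")

# Example: two upmix channels and two HI profiles, 8 frequency bins each.
xml_blob = pack_frame_metadata(
    0,
    {1: np.full(8, 0.7), 2: np.full(8, 0.5)},
    {"HI-1": np.linspace(1.0, 2.0, 8), "HI-2": np.linspace(1.0, 3.0, 8)},
)
```

In practice such metadata would be compressed (e.g., via the LPC-based representation described further below) before being multiplexed with the encoded audio.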
Unmixing refers to the process of separating the different sources that make up a signal. In some embodiments, directional sounds (e.g., x1(n) and x2(n)) are determined from a content mix using an ML unmixing model to separate the channels of the stereo downmix 100. In one or more embodiments, determining directional sounds may be performed by isolating, identifying, detecting, extracting, etc. The nature of the audio sources present varies depending on the type of audio signal being upmixed. In music, the common sources are predictable to a certain extent: vocals, guitar, keyboard, bass, drums, among others. In cinematic content, however, there may be an unpredictable number of sources of different kinds, which makes it infeasible to implement a broader sound separation approach for cinematic content upmixing. The most common approach to unmixing is to find source patterns in the mix spectrogram and extract them through a mask. There are different methods to achieve this, such as harmonic-percussive separation (HPS), non-negative matrix factorization (NMF), or neural networks. For example, OpenUnmix (UMX) is a deep learning model trained for a source separation task in a musical context. In some embodiments, a vocals model (separation model) with pre-trained weights may be implemented. In one or more embodiments, although the vocals model is trained to extract singing voices, it also performs well at extracting speech from cinematic content. The vocal reverberation, however, is not included in the extracted speech signal but is found in the residual signal in both the cinematic and musical content cases. The core of the vocals model architecture may include a multi-layer bidirectional long short-term memory (BiLSTM) neural network (NN). The vocals model architecture may take as input the short-time Fourier transform (STFT) spectrogram of the mix, crop it in frequency (e.g., to 16 kHz), pass it through a fully connected layer, then through the BiLSTM, and then through two more fully connected layers, with a skip connection from just before to just after the BiLSTM. Finally, the vocals model reshapes the output to match the original STFT shape and outputs a mask, which is applied to the original spectrogram to perform the actual source extraction.
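For illustration only, the following sketch shows the mask-based extraction pattern described above (STFT, model-estimated mask, masked inverse STFT); here mask_model is a hypothetical stand-in for the pretrained vocals network (e.g., a UMX-style BiLSTM), and the STFT parameters are placeholder values rather than the model's actual configuration.

```python
import numpy as np
from scipy.signal import stft, istft

def extract_vocals(mix, sr, mask_model, n_fft=4096, hop=1024):
    """Mask-based source extraction: STFT -> model mask -> apply -> inverse STFT.

    mix: (2, n_samples) stereo signal; mask_model maps a magnitude
    spectrogram of shape (2, n_bins, n_frames) to a mask of the same shape in [0, 1].
    """
    _, _, spec = stft(mix, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)   # (2, n_bins, n_frames)
    mask = mask_model(np.abs(spec))          # soft mask estimated from magnitudes only
    vocal_spec = mask * spec                 # apply mask to the complex spectrogram
    _, vocals = istft(vocal_spec, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    length = min(mix.shape[-1], vocals.shape[-1])
    vocals = vocals[:, :length]
    residual = mix[:, :length] - vocals      # U = s - V (reverberation stays in the residual)
    return vocals, residual
```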
VPA uses an Equal-Levels Ambience Extraction (ELAE) algorithm. ELAE is based on the following assumptions: (i) an input signal is the result of adding up a primary (directional) component and ambience; (ii) in a stereo signal, the primary components are uncorrelated with their ambience, and the ambience signals are uncorrelated with each other; (iii) the correlation coefficient of the primary components is 1; (iv) the ambience levels in both channels are equal; and (v) it is possible to extract the ambience through a mask. Using the above assumptions, together with the physical constraint that the total ambience energy has to be lower than or equal to the total energy, it is possible to find the masks as a function of the channels' cross-correlation and auto-correlations.
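For illustration only, the following sketch shows one way the stated assumptions lead to such masks: per time-frequency bin, it solves for the equal ambience power implied by assumptions (i)-(iv) and the energy constraint, using recursively smoothed auto- and cross-correlations. The smoothing constant and the exact mask form are assumptions and may differ from the ELAE formulation used in the embodiments.

```python
import numpy as np

def ambience_masks(spec_l, spec_r, alpha=0.8, eps=1e-12):
    """Estimate per-bin ambience masks for a stereo STFT of shape (n_bins, n_frames).

    Under the stated assumptions: phi_LL = P_L + P_A, phi_RR = P_R + P_A, and
    |phi_LR|^2 = P_L * P_R (fully correlated primaries, equal ambience power P_A).
    Solving the resulting quadratic and taking the root satisfying
    P_A <= min(phi_LL, phi_RR) gives the ambience power, hence the masks.
    """
    phi_ll = np.zeros(spec_l.shape, dtype=float)
    phi_rr = np.zeros(spec_r.shape, dtype=float)
    phi_lr = np.zeros(spec_l.shape, dtype=complex)
    m_l = np.zeros(spec_l.shape)
    m_r = np.zeros(spec_r.shape)
    for n in range(spec_l.shape[1]):
        prev = n - 1 if n > 0 else 0
        # recursive (leaky) estimates of the auto- and cross-correlations
        phi_ll[:, n] = alpha * phi_ll[:, prev] + (1 - alpha) * np.abs(spec_l[:, n]) ** 2
        phi_rr[:, n] = alpha * phi_rr[:, prev] + (1 - alpha) * np.abs(spec_r[:, n]) ** 2
        phi_lr[:, n] = alpha * phi_lr[:, prev] + (1 - alpha) * spec_l[:, n] * np.conj(spec_r[:, n])
        s = phi_ll[:, n] + phi_rr[:, n]
        d = np.sqrt((phi_ll[:, n] - phi_rr[:, n]) ** 2 + 4 * np.abs(phi_lr[:, n]) ** 2)
        p_a = np.maximum(0.5 * (s - d), 0.0)          # ambience power (smaller quadratic root)
        m_l[:, n] = np.sqrt(p_a / (phi_ll[:, n] + eps))
        m_r[:, n] = np.sqrt(p_a / (phi_rr[:, n] + eps))
    return np.clip(m_l, 0.0, 1.0), np.clip(m_r, 0.0, 1.0)

# The ambience and primary components of a spectrogram U_spec then follow as
# A = mask * U_spec and P = U_spec - A in the STFT domain.
```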
In some embodiments, the ML model 105 employs VPA processing. VPA can comprise three main blocks: voice extraction, ambience extraction, and upmix rendering. The first block includes the pretrained vocals model as a source extractor. The first block receives the stereo downmix and produces 4-channel audio, i.e., the concatenation of the extracted voice in stereo ([VL;VR]) with the residual, also in stereo ([UL;UR]). For the first block, s denotes the stereo input signal, with sL and sR being its left and right channels, respectively, such that s=V+U, where V is the extracted voice and U is the residual of s after removing V. The second block is the primary-ambience decomposition block, which is performed just over the residual U using ELAE, such that U=P+A, where P contains the primary component of U and A contains the ambience of the residual U. The third block is the upmix rendering block, in which the pre-upmixed channels are derived before obtaining the upmixed signal ŝ, as sketched below.
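For illustration only, a plausible form of these relations and of the pre-upmixed channels, consistent with rendering the extracted speech to the center channel, the primary components to the front pair, and decorrelated ambience to the surrounds, is sketched below; the gains g_c, g_f, g_s and the decorrelator D are illustrative assumptions rather than the specific expressions used in the embodiments.

```latex
\[
s = V + U, \qquad U = P + A,
\]
\[
C = g_c\,(V_L + V_R), \qquad
L = g_f\,P_L, \qquad R = g_f\,P_R, \qquad
L_S = g_s\,\mathcal{D}\{A_L\}, \qquad R_S = g_s\,\mathcal{D}\{A_R\},
\]
\[
\hat{s} = [\,L,\; R,\; C,\; L_S,\; R_S\,].
\]
```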
In order for VPA to be implemented in a consumer application, it needs to run in real time. To achieve this, some embodiments employ a windowed approach, in which small chunks of the audio are processed as overlapping slices. In one example embodiment, a window size of W=4096 samples with an overlap of O=512 samples may be employed (other window sizes may also be employed as desired). In some embodiments, a deep learning model is trained using STFT windows of 4096 samples with an overlap of 3072 samples; that configuration is maintained in the internal vocals model block, and for the ELAE's internal STFT some embodiments use a 128-sample window with 96 overlapping samples. To address the border artifacts, which are inherent to the STFT process and due to the rears' decorrelation, the last cE=96 samples of each window and the first cS=416 samples of the next window are taken out before concatenating them. The pseudocode for this approach is as follows:
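(For illustration only, the sketch below renders this procedure in Python; process_window is a hypothetical stand-in for the per-chunk VPA chain of vocal extraction, ELAE, and upmix rendering, and the variable names mirror those referenced immediately below.)

```python
import numpy as np

def vpa_streaming_upmix(stereo, process_window, W=4096, O=512, cS=416, cE=96):
    """Overlapping-window VPA processing with border-sample trimming.

    stereo: (2, n_samples) input; process_window maps a (2, W) chunk
    to a (5, W) upmixed chunk (hypothetical per-chunk VPA chain).
    """
    hop = W - O
    N = (stereo.shape[1] - O) // hop              # total number of processed windows
    pieces = []
    for i in range(N):
        chunk = stereo[:, i * hop : i * hop + W]
        s = process_window(chunk)                 # upmixed signal for the current window
        start = cS if i > 0 else 0                # drop the first cS samples of every window but the first
        stop = s.shape[1] - cE if i < N - 1 else s.shape[1]  # drop the last cE samples of every window but the last
        pieces.append(s[:, start:stop])
    upmix = np.concatenate(pieces, axis=1)        # complete upmixed signal
    return upmix
```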
where N is the total number of processed windows, s is the upmixed signal corresponding to the current window, and upmix is the final output with the complete upmixed signal.
In some embodiments, the baseline gain gi(n,f) computations are moved upstream of the upmixer 140 and these baseline gains are transmitted as metadata. One or more embodiments employ ML processing (e.g., a regression ML model, etc.) for determining the baseline gains from the content. Some embodiments include various hearing loss profiles for computing time-varying hearing-loss gains gvoice(n,f), which are applied to the center channel. Listening tests on an HI population sample may provide or inform the values of these hearing-loss gains. Different individuals are likely to have different hearing loss profiles (e.g., some exhibit loss starting at, say, 4 kHz, others at 8 kHz). For HI people, these hearing-loss gains are applied in conjunction with, or in place of, the baseline gains. The hearing-loss gains may be constant values, or gvoice(n,f)=EQ(n,f), where EQ(n,f) is an equalization filter over [20, 20000] Hz for a given frame index n. Optionally, frame-independent equalization may be applied for each HI model such that gvoice(n,f)=EQ(f). Another way to improve listening ability for hearing-impaired profiles is to apply dynamic range compression (DRC) and send the DRC parameters (compression ratio, threshold, and release-time constants) as parameters that enable dialog to be better heard by HI people. In some embodiments, the presets for this gain may be exposed to the end consumer, with the gains tied specifically to enhancing the center-channel voice signal. An example of enhancing dialog for HI people is ducking (attenuating other content relative to voice). In one or more embodiments, background noise (signal-to-noise ratio (SNR)) may be used as a modality for developing these gain presets. In some embodiments, noise profiles may be substituted for the HI profiles before encoding. If monitoring reveals a background noise response, the preset gain gvoice(n,f) corresponding to the closest noise profile may be used.
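For illustration only, the following sketch applies decoded voice clarity gains gvoice(n,f) to the center channel in the STFT domain; the STFT parameters, the nearest-neighbor alignment of the decoded gain grid onto the STFT grid, and the apply_voice_clarity() helper are assumptions for this sketch.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_voice_clarity(center, sr, g_voice, n_fft=1024, hop=256):
    """Apply per-frame, per-bin voice clarity gains to the center channel.

    center: 1-D center-channel signal; g_voice: (n_frames, n_bins) gains
    g_voice(n, f) decoded from metadata for the selected HI profile.
    """
    _, _, spec = stft(center, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    n_bins, n_frames = spec.shape
    # Align the decoded gain grid with the STFT grid (nearest frame and bin here).
    frames = np.minimum(np.arange(n_frames), g_voice.shape[0] - 1)
    bins_ = np.minimum(np.arange(n_bins), g_voice.shape[1] - 1)
    gains = g_voice[np.ix_(frames, bins_)].T          # -> (n_bins, n_frames)
    _, enhanced = istft(spec * gains, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return enhanced[: center.shape[-1]]
```

For a frame-independent profile, gvoice(n,f)=EQ(f), the same routine applies with the gain row repeated across frames.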
In the above gain representations, fmax is the number of frequency bins. In one or more embodiments, the metadata compression/decompression models represent the time-frequency gains with linear prediction coefficients (LPC) ak(n). Thus, for a given time frame, a few parameters (ak) may be used to represent the gain function that extends from 20-20,000 Hz. This reduction enables a smaller metadata packet size for transmission, in turn reducing the bit rate of the overall encoded content. At the decoder, the LPC parameters are extracted and used to approximately reconstruct the frequency-dependent gain over that frame.
Similarly, fmax is the number of frequency bins for the HI model output gains. In one or more embodiments, the metadata decompression models represent the time-frequency gains for the HI model output gains with linear prediction coefficients ak(n) (note: these ak(n) are different from those used for the upmixing coefficients). Thus, for a given time frame, a few parameters (ak) may be used to represent the gain function that extends from 20-20,000 Hz. This reduction enables a smaller metadata packet size for transmission, in turn reducing the bit rate of the overall encoded content. At the decoder, the LPC parameters are extracted and used to approximately reconstruct the frequency-dependent gain over that frame.
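For illustration only, the following sketch shows one way such an LPC representation of a frame's gain curve could be formed and approximately inverted: the squared gain curve is treated as a power spectrum, its pseudo-autocorrelation is fit with a low-order all-pole model via Levinson-Durbin recursion, and the decoder reconstructs the envelope from the coefficients. The order, the normalization, and the helper names are assumptions; the actual metadata compression scheme may differ.

```python
import numpy as np
from scipy.signal import freqz

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the prediction-error filter A(z)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]   # update coefficients in place
        err *= (1.0 - k * k)
    return a, err

def compress_gain_curve(gain, order=12):
    """Encoder side: model g(n, f), f = 1..fmax, with a few LPC parameters a_k(n)."""
    power = np.asarray(gain, dtype=float) ** 2
    r = np.fft.irfft(power)[: order + 1]      # pseudo-autocorrelation of the gain curve
    a, err = levinson_durbin(r, order)
    return a, np.sqrt(max(err, 1e-12))        # (coefficients, gain scale)

def reconstruct_gain_curve(a, scale, fmax):
    """Decoder side: approximate all-pole envelope scale / |A(e^{jw})| over fmax bins."""
    _, h = freqz(b=[scale], a=a, worN=fmax)
    return np.abs(h)

# Example: a smooth gain curve over fmax bins represented by order+1 parameters.
fmax = 512
g = 1.0 + 1.5 / (1.0 + np.exp(-(np.arange(fmax) - 300) / 40.0))
a, scale = compress_gain_curve(g, order=12)       # ~13 parameters instead of 512
g_hat = reconstruct_gain_curve(a, scale, fmax)    # approximate reconstruction at the decoder
```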
In some embodiments, process 700 further includes performing, by a computing device, a primary-ambience decomposition process for the upmixed signal.
In one or more embodiments, process 700 further includes applying the signal-dependent upmixing gains to downmixed signal components.
In one or more embodiments, process 700 further provides that the content mix comprises a voice content mix.
In some embodiments, process 700 additionally provides that during upmixing, the signal-dependent upmixing gains are applied to primary and ambient signals to generate a final output.
In one or more embodiments, process 700 further provides that the signal-dependent upmixing gains are embedded as audio-codec metadata.
In some embodiments, process 700 further includes the feature that the audio-codec metadata is transmitted with (e.g., along with, alongside, in conjunction with, in a same transmission, etc.) encoded downmixed stereo signals.
In some embodiments, the disclosed technology may be used for cinematic content that is delivered in stereo format, for speech and intelligibility enhancement of dialogue-based content, for live music content, etc.
One or more embodiments may create a high dynamic range (HDR) 10+ ecosystem-driven upmixer that ties the edge-device (e.g., TV) upmixer parameters to gains for controlling dialog intelligibility. The gain values are computed before encoding and sent as metadata. Because the time-varying gain is computed before encoding, the edge device does not need to perform compute-intensive processing on a frame-by-frame basis. The upmixer may be integrated with the HDR10+ video solution using an open source codec, such as Opus, and provides for playback on TVs, soundbars, smartphones, etc.
Information transferred via communications interface 807 may be in the form of signals such as electronic, electromagnetic, optical, or other signals capable of being received by communications interface 807, via a communication link that carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, and/or other communication channels. Computer program instructions representing the block diagram and/or flowcharts herein may be loaded onto a computer, programmable data processing apparatus, or processing devices to cause a series of operations performed thereon to produce a computer implemented process.
In some embodiments, processing instructions for process 700 (
Embodiments have been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. Each block of such illustrations/diagrams, or combinations thereof, can be implemented by computer program instructions. The computer program instructions when provided to a processor produce a machine, such that the instructions, which execute via the processor create means for implementing the functions/operations specified in the flowchart and/or block diagram. Each block in the flowchart/block diagrams may represent a hardware and/or software module or logic. In alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures, concurrently, etc.
The terms “computer program medium,” “computer usable medium,” “computer readable medium”, and “computer program product,” are used to generally refer to media such as main memory, secondary memory, removable storage drive, a hard disk installed in hard disk drive, and signals. These computer program products are means for providing software to the computer system. The computer readable medium allows the computer system to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium, for example, may include non-volatile memory, such as a floppy disk, ROM, flash memory, disk drive memory, a CD-ROM, and other permanent storage. It is useful, for example, for transporting information, such as data and computer instructions, between computer systems. Computer program instructions may be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
As will be appreciated by one skilled in the art, aspects of the embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of the embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the embodiments may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of one or more embodiments may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of one or more embodiments are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
References in the claims to an element in the singular are not intended to mean "one and only" unless explicitly so stated, but rather "one or more." All structural and functional equivalents to the elements of the above-described exemplary embodiment that are currently known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the present claims. No claim element herein is to be construed under the provisions of 35 U.S.C. section 112, sixth paragraph, unless the element is expressly recited using the phrase "means for" or "step for."
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosed technology. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the embodiments has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosed technology.
Though the embodiments have been described with reference to certain versions thereof, other versions are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the preferred versions contained herein.
This application claims the priority benefit of U.S. Provisional Patent Application Ser. No. 63/443,769, filed Feb. 7, 2023, which is incorporated herein by reference in its entirety.