Aspects of the present invention relate to digital signal processing of audio, particularly to separating audio content recorded in stereo into content classes and remixing the separated content.
Psycho-acoustics relates to human perception of sound. A sound generated in a live performance interacts acoustically with the environment, e.g. the walls and seats of a concert hall. After propagating through the air and before arriving at the eardrum, a sound wave undergoes filtering and delays due to the size and shape of the head and ears. The left and right ears receive signals differing slightly in level, phase, and time delay. The human brain simultaneously processes the signals received from both auditory nerves and derives spatial information related to the location, distance, speed and environment of the source of the sound.
In a live performance recorded in stereo with two microphones, each microphone receives audio signals with time delays relating to the distances between the audio sources and the microphones. When the recorded stereo is played using a stereo sound reproduction system with two loudspeakers, the original time delays and levels of the various sources, as recorded at the microphones, are reproduced. The time delays and levels provide the brain with a spatial sense of the original sound sources. Moreover, both the left and right ears receive audio from both the left and right loudspeakers, a phenomenon known as channel cross-talk. However, if the same content is reproduced on a headset, the left channel plays only to the left ear and the right channel plays only to the right ear, without reproducing channel cross-talk.
In a virtual binaural reproduction system using a headset with left and right channels, direction-dependent head-related transfer functions (HRTF) may be used to simulate the filtering and delay effects due to the size and shape of the listener's head and ears. Static and dynamic cues may be included to simulate acoustic effects and motion of audio sources within the concert hall. Channel cross-talk may be restored. Taken together, these techniques may be used to virtually localize the original audio sources in two- or three-dimensional space and to provide a spatial acoustic experience to the user.
Various computerized systems and methods are described herein including a trained machine configured to input a stereo sound track and separate the stereo sound track into multiple N separated stereo audio signals respectively characterized by multiple N audio content classes. Essentially all stereo audio input in the stereo sound track is included in the N separated stereo audio signals. A mixing module is configured to spatially localize, symmetrically and without cross-talk between left and right, the N separated stereo audio signals into multiple output channels. The output channels include respective mixtures of one or more of the N separated stereo audio signals. Gains of the output channels are adjusted into left and right binaural outputs to conserve the summed levels of the N separated stereo audio signals distributed over the output channels. The N audio content classes may include: (i) dialogue, (ii) music, and (iii) sound effects.
A binaural reproduction system may be configured to binaurally render the output channels. The gains may be summed in phase, within a previously determined threshold, to suppress distortion arising during the separation of the stereo sound track into the N separated stereo audio signals. The binaural reproduction system may be further configured to spatially relocalize one or more of the N separated stereo audio signals by linear panning. The sum of audio amplitudes of the N separated stereo audio signals, as distributed over the output channels, may be conserved.
The trained machine may be configured to transform the input stereo sound track into an input time-frequency representation, to process the time-frequency representation and to output therefrom multiple time-frequency representations corresponding to the respective N separated stereo audio signals. For a time-frequency bin, a sum of magnitudes of the output time-frequency representations is within a previously determined threshold of a magnitude of the input time-frequency representation. The trained machine may be configured to output multiple N−1 of the time-frequency representations, and to compute the Nth time-frequency representation as a residual time-frequency representation by subtracting, for a time-frequency bin, a sum of magnitudes of the N−1 time-frequency representations from a magnitude of the input time-frequency representation. The trained machine may be configured to prioritize at least one of the N audio content classes as a prior audio content class, and to serially process the prior audio content class by separating the stereo sound track into the separated stereo audio signal of the prior audio content class prior to the other N−1 audio content classes. The prior audio content class may be dialogue. The trained machine may be configured to process the output time-frequency representations by extracting information from the input time-frequency representation for phase restoration.
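By way of a non-limiting illustration only, the per-bin magnitude relation described above may be sketched as follows in Python with numpy; the function name, array shapes and threshold value are assumptions for illustration and not part of the invention.

```python
import numpy as np

def magnitudes_conserved(mix_mag, stem_mags, threshold=1e-3):
    """Check, for each time-frequency bin, that the sum of magnitudes of
    the N output time-frequency representations is within a previously
    determined threshold of the magnitude of the input representation.

    mix_mag   -- |STFT| of the input stereo sound track, shape (freq, frames)
    stem_mags -- sequence of N |STFT| arrays, one per audio content class
    """
    total = np.sum(stem_mags, axis=0)  # per-bin sum of stem magnitudes
    return bool(np.all(np.abs(total - mix_mag) <= threshold))
```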
Computer readable media are disclosed herein storing instructions for executing computerized methods as disclosed herein.
These, additional, and/or other aspects and/or advantages of the present invention are set forth in the detailed description which follows; possibly inferable from the detailed description; and/or learnable by practice of the present invention.
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
The foregoing and/or other aspects will become apparent from the following detailed description when considered in conjunction with the accompanying drawing figures.
Reference will now be made in detail to features of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The features are described below to explain the present invention by referring to the figures.
In sound mixing for motion pictures, audio content may be recorded as separate audio content classes, e.g. dialogue, music and sound effects, also referred to herein as “stems”. Recording as stems facilitates replacing dialogue with foreign language versions and also adapting the sound track to different reproduction systems, e.g. monaural, binaural and surround sound systems.
However, legacy films have a sound track in which the audio content classes, e.g. dialogue, music and sound effects, were previously recorded together, e.g. in stereo with two microphones.
Separation of the original audio content into stems may be performed using one or more previously trained machines, e.g. neural networks. Representative references which describe separation of the original audio content into audio content classes using neural networks include:
Original audio content may not be perfectly separable, and audible artifacts or distortion in the separated content may result from the separation process. The separated audio content classes or stems may be virtually localized in two-dimensional or three-dimensional space and remixed into multiple output channels. The multiple output channels may be input to an audio reproduction system to create a spatial sound experience. Features of the present invention are directed to remixing and/or virtually localizing the separated audio content classes in such a way as to reduce or cancel, at least in part, artifacts generated by an imperfect separation process.
Referring now to the drawings, reference is now made to
Reference is now made also to
Processors 20/1 to 20/N−1 may be configured as trained machines, e.g. using supervised machine learning, for outputting stems 1 . . . N−1. Alternatively or in addition, unsupervised machine learning algorithms may be used, such as principal component analysis. Block 22 may be configured to sum together stems 1 to N−1 and may subtract the sum from input stereo signal 24 to produce a residual output as stem N, so that summing the audio signals of stems 1 . . . N substantively equals input stereo 24 within a previously determined threshold.
By way of example, with N=3 stems, processor 20/1 masks input stereo 24 and outputs an audio signal, stem 1, e.g. dialogue audio content. Processor 20/2 masks input stereo 24 and outputs stem 2, e.g. musical audio content. Residual block 22 outputs stem 3, essentially all other sound, e.g. sound effects, contained in input stereo 24 and not masked out by processors 20/1 and 20/2. By using residual block 22, essentially all sound included in original input stereo 24 is included in stems 1 to 3. According to a feature of the present invention, stems 1 to N−1 may be computed in the frequency domain, and the subtraction or comparison performed in block 22 to output stem N may be performed in the time domain, thus avoiding a final inverse transform for the residual stem.
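A minimal sketch of residual block 22 follows, assuming the time-domain stem estimates are held as numpy arrays of shape (samples, 2); the function and variable names are hypothetical.

```python
import numpy as np

def residual_stem(input_stereo, stems):
    """Residual block 22: stem N is whatever remains of input stereo 24
    after subtracting stems 1..N-1, so that essentially all sound in the
    original input is accounted for across stems 1..N.

    input_stereo -- array of shape (samples, 2)
    stems        -- list of N-1 arrays, each of shape (samples, 2)
    """
    # Subtracting in the time domain yields stem N directly and avoids
    # a final inverse transform for the residual stem.
    return input_stereo - np.sum(stems, axis=0)
```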
Reference is now made also to
Reference is now also made to
Simple Wiener filtering or multi-channel Wiener filtering 47 may be used for estimating complex coefficients of the frequency data. Multi-channel Wiener filtering 47 is an iterative procedure using expectation maximization. A first estimate of the complex coefficients may be extracted from the STFT frequency bins 42 of the mixture and multiplied 46 with corresponding frequency magnitudes 44 output from post-processing block 45. Wiener filtering 47 assumes that the complex STFT coefficients are independent zero-mean Gaussian random variables, and under these assumptions a minimum mean squared error estimate is computed from the variances of the sources for each frequency. The output of Wiener filter 47, the STFT of stem 1, may be inverse transformed (block 48) to generate an estimate of stem 1 in the time domain. Trained machine 30/1 may compute, in the frequency domain, output residual 1 by subtracting real-valued spectrogram 49 of stem 1 from spectrogram 42 of the mixture as output from transform block 40. Residual 1 may be output to trained machine 30/2, which may operate similarly to trained machine 30/1; however, as residual 1 is already in the frequency domain, transform 40 is superfluous in trained machine 30/2. Residual 2 is output from trained machine 30/2 by subtracting, in the frequency domain, the STFT of stem 2 from residual 1.
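By way of a non-limiting sketch, the first-estimate step may be illustrated as follows; this shows only the single-pass initialization, in which the mixture phase is combined with the estimated magnitudes (and, when the magnitudes of all stems are available, a Wiener-style soft mask), not the iterative expectation-maximization procedure itself. The names and shapes are assumptions for illustration.

```python
import numpy as np

def first_estimate_stem_stft(mix_stft, stem_mag, all_stem_mags=None, eps=1e-12):
    """First estimate of a stem's complex STFT (cf. blocks 44, 46, 47).

    mix_stft      -- complex STFT of the mixture, shape (freq, frames)
    stem_mag      -- magnitude estimate of this stem from the trained machine
    all_stem_mags -- optional list of magnitude estimates of all stems
    """
    if all_stem_mags is None:
        # Phase restoration: borrow the phase of the mixture and combine
        # it with the magnitude estimated for the stem.
        return stem_mag * np.exp(1j * np.angle(mix_stft))
    # Wiener-style soft mask: power of this stem over total power per bin.
    total_power = np.sum(np.square(all_stem_mags), axis=0) + eps
    return (np.square(stem_mag) / total_power) * mix_stft
```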
Referring again to
Reference is now also made to
Reference is now also made to
Spatial localization (step 65) may be performed symmetrically between left and right and without cross-talk between the left and right sides of the stereo image. In other words, sound originally recorded in the left channel of input stereo 24 is spatially localized (step 65) only in one or more left output channels (or the center speaker), and similarly sound originally recorded in the right channel of input stereo 24 is spatially localized only in one or more right output channels (or the center speaker).
Gains of the output channels may be adjusted (step 67) into left and right binaural outputs to conserve the summed levels of the N separated stereo audio signals distributed over the output channels.
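A minimal sketch of the gain adjustment of step 67 follows, assuming the per-channel gains applied to one separated stem are held in a numpy array; the function name is hypothetical.

```python
import numpy as np

def conserve_summed_levels(gains, target=1.0, eps=1e-12):
    """Scale the gains applied to one separated stem, as distributed
    over the output channels, so that their sum is conserved (step 67).

    For in-phase (fully correlated) contributions, conserving the sum
    of amplitude gains preserves the level of the stem after remixing
    into the left and right binaural outputs.
    """
    gains = np.asarray(gains, dtype=float)
    return gains * (target / (np.sum(gains) + eps))

# Example: gains over three output channels are rescaled to sum to unity.
print(conserve_summed_levels([0.5, 0.25, 0.5]))  # [0.4, 0.2, 0.4]
```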
The output channels 18 may be binaurally rendered (step 69) or alternatively reproduced in a stereo loudspeaker system.
Reference is now made to
in which an example is shown of spatial relocalization of music R. Gain GC of music R is added to the center virtual speaker C, and gain GR of music R in the right virtual speaker R is reduced linearly. Graphs of gain GC of music R in center virtual speaker C and gain GR of music R in right virtual speaker R are shown in an insert, with gain (ordinate) plotted against spatial angle θ (abscissa). Gain GC of music R in center virtual speaker C and gain GR of music R in right virtual speaker R vary according to the following equations.
For a spatial angle θ = +30 degrees, GC = ⅓ and GR = ⅔.
During linear panning, the phases of the audio signal of music R from both the center virtual speaker C and the right virtual speaker R are reconstructed so that the normalized power of the two contributions to music R adds to or approaches unity for any spatial angle θ. Moreover, if separation (block 10, step 63) is not perfect and a dialogue peak in the right channel of the frequency representation was separated into the music R stem, then linear panning under the condition of preserving phase tends to restore, at least in part, the errant dialogue peak with correct phase into the center virtual speaker which is rendering the dialogue stem, tending to correct for or suppress the distortion caused by the imperfect separation.
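By way of a non-limiting sketch, a linear panning law consistent with the example above may be written as follows; the assumed position of right virtual speaker R at 45 degrees is an illustration only, chosen so that the linear law reproduces GC = ⅓ and GR = ⅔ at θ = +30 degrees.

```python
import numpy as np

def linear_pan_gains(theta_deg, speaker_angle_deg=45.0):
    """Linear panning gains between center virtual speaker C and right
    virtual speaker R for a source at spatial angle theta_deg (degrees).

    The speaker angle of 45 degrees is an assumption for illustration;
    the gains vary linearly and satisfy GC + GR = 1 at every angle, so
    in-phase (correlated) contributions sum to the original amplitude.
    """
    gr = float(np.clip(theta_deg / speaker_angle_deg, 0.0, 1.0))
    gc = 1.0 - gr
    return gc, gr

# Example: theta = +30 degrees gives GC = 1/3 and GR = 2/3.
print(linear_pan_gains(30.0))  # (0.333..., 0.666...)
```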
Reference is now made to
Spatial envelopment (step 65) is performed symmetrically between left and right and without cross-talk between the left and right sides of the stereo image. In other words, sound originally recorded in the left channel of input stereo 24 is spatially distributed (step 65) only from left output channels (or the center speaker), and similarly sound originally recorded in the right channel of input stereo 24 is spatially distributed only from one or more right output channels (or the center speaker). Phases are preserved so that the normalized gains of the spatially distributed output channels on the left sum to unity gain of the left channel of input stereo 24, and similarly the normalized gains of the spatially distributed output channels on the right sum to unity gain of the right channel of input stereo 24.
The embodiments of the present invention may comprise a general-purpose or special-purpose computer system including various computer hardware components, which are discussed in greater detail below. Embodiments within the scope of the present invention also include computer-readable media for carrying or having computer-executable instructions, computer-readable instructions, or data structures stored thereon. Such computer-readable media may be any available media, transitory and/or non-transitory, which are accessible by a general-purpose or special-purpose computer system. By way of example, and not limitation, such computer-readable media can comprise physical storage media such as RAM, ROM, EPROM, flash disk, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic or solid-state storage devices, or any other media which can be used to carry or store desired program code means in the form of computer-executable instructions, computer-readable instructions, or data structures and which may be accessed by a general-purpose or special-purpose computer system.
In this description and in the following claims, a “network” is defined as any architecture where two or more computer systems may exchange data. The term “network” may include wide area networks, the Internet, local area networks, Intranets, wireless networks such as “Wi-Fi”, virtual private networks, and mobile access networks using an access point name (APN). Exchanged data may be in the form of electrical signals that are meaningful to the two or more computer systems. When data is transferred or provided over a network or another communications connection (either hard wired, wireless, or a combination of hard wired and wireless) to a computer system or computer device, the connection is properly viewed as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of computer-readable media. Thus, computer-readable media as disclosed herein may be transitory or non-transitory. Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer system or special-purpose computer system to perform a certain function or group of functions.
The term “server” as used herein, refers to a computer system including a processor, data storage and a network adapter generally configured to provide a service over the computer network. A computer system which receives a service provided by the server may be known as a “client” computer system.
The term “sound effects” as used herein refers to artificially created or enhanced sound used to set a mood, simulate reality or create an illusion in a motion picture. The term “sound effect” as used herein includes “foleys”, which are sounds added to a production to provide a more realistic sense to the motion picture.
The term “source” or “audio source” as used herein refers to one or more sources of sound in a recording. Sources may include vocalists, actors/actresses, musical instruments and sound effects, which may be sourced from recordings or synthesized.
The term “audio content class” as used herein refers to a classification of audio sources which may depend on the type of content; by way of example, (i) dialogue, (ii) music, and (iii) sound effects are suitable audio content classes for an audio track of a motion picture. Other audio content classes may be contemplated depending on the type of content, for instance: strings, woodwinds, brass and percussion for a symphony orchestra. The terms “stem” and “audio content class” are used herein interchangeably.
The term “spatially localizing” or “localizing” refers to angular or spatial placement of one or more audio sources or stems in two or three dimensions relative to the head of a listener. The term “localizing” includes “envelopment”, in which audio sources sound to the listener as being spread out angularly and/or by distance.
The term “channels” or “output channels” as used herein refers to a mixture of audio sources as recorded or audio content classes as separated, rendered for reproduction.
The term “binaural” as used herein refers to hearing with both ears as with a headset or with two loudspeakers. The term “binaural rendering” or “binaural reproduction” refers to playing output channels, for example with localization to provide a spatial audio experience in two or three dimensions.
The term “conserved” as used herein refers to a sum of gains that equals or approaches a constant. For normalized gains, the constant equals or approaches unity gain.
The term “stereo” as used herein refers to sound recorded with two microphones, left and right, and rendered with at least two output channels, left and right.
The term “cross-talk” as used herein refers to rendering at least a portion of sound recorded in a left microphone to a right output channel or, similarly, rendering at least a portion of sound recorded in a right microphone to a left output channel.
The term “symmetrically” as used herein refers to bilateral symmetry of localization about a sagittal plane, which divides a virtual listener's head into two mirror image left and right halves.
The term “sum” or “summing” as used herein in the context of audio signals refers to combining the signals including their respective frequencies and phases. For fully incoherent and/or uncorrelated audio waves, summing may refer to summing by energy or power. For audio waves fully correlated in phase and frequency, summing may refer to summing respective amplitudes.
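By way of a short numerical sketch of the two summing conventions (for illustration only):

```python
import numpy as np

# Two equal-level signals: summed coherently (fully correlated in phase
# and frequency) they add by amplitude; summed incoherently they add by
# energy or power.
a = 0.5
coherent = a + a                    # amplitude sum -> 1.0 (+6 dB)
incoherent = np.sqrt(a**2 + a**2)   # power sum -> ~0.707 (+3 dB)
print(coherent, incoherent)
```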
The term “panning” as used herein refers to adjusting a level dependent on a spatial angle and, in stereo, simultaneously adjusting the levels of the right and left output channels.
The terms “moving picture”, “movie”, “motion picture” and “film” are used herein interchangeably and refer to a multimedia production in which a sound track is synchronized with video or moving pictures.
Unless otherwise indicated, the term “previously determined threshold” is implicit in the claims when appropriate, for instance “is conserved” means “is conserved within a previously determined threshold”; “without cross-talk” means “without cross-talk within a previously determined threshold”, by way of example. Similarly, the terms “all”, “essentially all”, “substantively all” refer to within a previously determined threshold.
The term “spectrogram” as used herein refers to a two-dimensional data structure in time-frequency space.
The indefinite articles “a” and “an” as used herein, such as in “a time-frequency bin” or “a threshold”, have the meaning of “one or more”, that is, “one or more time-frequency bins” or “one or more thresholds”.
All optional and preferred features and modifications of the described embodiments and dependent claims are usable in all aspects of the invention taught herein. Furthermore, the individual features of the dependent claims, as well as all optional and preferred features and modifications of the described embodiments are combinable and interchangeable with one another.
Although selected features of the present invention have been shown and described, it is to be understood the present invention is not limited to the described features.
Foreign application priority data: 2105556.1, Apr. 2021, GB (national).