An aspect of the disclosure here relates to spatially biased binaural audio for video recording. Other aspects are also described.
Binaural recording of audio provides a means for full 3D sound capture: in other words, the ability to reproduce the exact sound scene and give the user a sensation of ‘being there.’ This can be accomplished through spatial rendering of audio inputs using head related transfer functions (HRTFs), which modify a sound signal in order to induce the perception in a listener that the sound signal is originating from any point in space. While this approach is compelling for, for example, full virtual reality applications, in which a user can interact both visually and audibly in a virtual environment, in traditional video capture applications three-dimensional sounds can distract the viewer from the screen. In contrast, monophonic or traditional stereophonic recordings may not provide a sufficient sense of immersion.
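As a brief illustration of HRTF-based spatial rendering (a minimal sketch only, with placeholder filters; it is not the method of the disclosure, and a real system would load measured head related impulse responses):

```python
import numpy as np

fs = 48000

# Placeholder head related impulse responses (time-domain HRTFs) for a
# single source direction; a real system would load a measured pair.
hrir_left = np.random.randn(256) * np.hanning(256)
hrir_right = np.random.randn(256) * np.hanning(256)

mono = np.random.randn(fs)  # one second of a placeholder mono signal

# Convolving the mono signal with the left/right impulse responses yields
# a binaural pair perceived as arriving from the HRTFs' direction.
binaural_left = np.convolve(mono, hrir_left)
binaural_right = np.convolve(mono, hrir_right)
```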
An aspect of the disclosure is directed to a method for producing a spatially biased sound pickup beamforming function, to be applied to a multi-channel audio recording of a video recording. The method includes generating a target directivity function. The target directivity function includes a set of spatially biased head related transfer functions (HRTFs). A left ear set of beamforming coefficients and a right ear set of beamforming coefficients may be generated by determining a best fit for the target directivity function based on a device steering matrix. The left ear set of beamforming coefficients and the right ear set of beamforming coefficients may then be output and applied to the multi-channel audio recording to produce more immersive sounding, spatially biased audio for the video recording.
Another aspect is directed towards a method for producing the target directivity function, which includes a set of spatially biased HRTFs. The method includes selecting a set of left ear and right ear head related transfer functions (HRTFs). The left ear and right ear HRTFs are multiplied by an on-camera emphasis (OCE) function to produce the spatially biased HRTFs. The OCE may be designed to modify the sound profile of the HRTFs to provide emphasis in one or more desired directions, e.g., directly ahead where the camera is being aimed, as a function of the orientation of the recording device while it is recording video.
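To picture this multiplication, consider the following sketch (the direction grid, frequency-bin count, and cardioid-shaped OCE are illustrative assumptions, not the disclosed weights):

```python
import numpy as np

n_dirs, n_freqs = 360, 257  # hypothetical direction grid and FFT bins

# Hypothetical left/right HRTF sets sampled on the direction grid, as
# complex frequency responses (flat placeholders stand in for real data).
hrtf_left = np.ones((n_dirs, n_freqs), dtype=complex)
hrtf_right = np.ones((n_dirs, n_freqs), dtype=complex)

# On-camera emphasis: real-valued spatial weights per direction, largest
# toward the camera axis (angle 0 here) and tapering off elsewhere.
angles = np.linspace(-np.pi, np.pi, n_dirs, endpoint=False)
oce = 0.5 * (1.0 + np.cos(angles))  # simple cardioid-like emphasis

# Spatially biased HRTFs: per-direction scaling of the selected HRTFs.
biased_left = oce[:, None] * hrtf_left
biased_right = oce[:, None] * hrtf_right
```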
An aspect is directed towards a system for producing a sound pickup beamforming function to be applied to a multi-channel audio recording of a recorded video, during playback of the recorded video. The system includes a processor that receives a device steering matrix and a target directivity function. The processor then generates the beamforming coefficients by employing numerical optimization techniques, such as the least squares method, to find the regularized best fit of the inputted device steering matrix to the target directivity function.
Another aspect is a method for asymmetric equalization. Asymmetric equalization involves receiving a plurality of beamforming coefficients for a first ear and then calculating a diffuse field power average across the plurality of beamforming coefficients. A correction filter is applied to the beamforming coefficients such that the diffuse field power average of the plurality of beamforming coefficients equals the diffuse field power average of a single microphone, and then the first ear beamforming coefficients are output. The asymmetric equalization method reduces errors in the resulting interaural level differences that arise due to an asymmetric microphone arrangement on the device.
The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the disclosure includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the Claims section. Such combinations may have particular advantages not specifically recited in the above summary.
Several aspects of the disclosure here are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
Several aspects of the disclosure are now explained with reference to the appended drawings. Whenever the shapes, relative positions and other aspects of the parts described are not explicitly defined, the scope of the invention is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
In the description, certain terminology is used to describe the various aspects of the disclosure here. For example, in certain situations, the terms “component,” “unit,” “module,” and “logic” are representative of hardware and/or software configured to perform one or more functions. For instance, examples of “hardware” include, but are not limited or restricted to, an integrated circuit such as a processor (e.g., a digital signal processor, microprocessor, application specific integrated circuit, a micro-controller, etc.). Further, “a processor” may encompass one or more processors, such as a processor in a remote server working with a processor on a local client machine. Similarly, aspects of the disclosure that appear to be conducted by multiple processors could be accomplished by a single processor. Of course, the hardware may be alternatively implemented as a finite state machine or even combinatorial logic. An example of “software” includes executable code in the form of an application, an applet, a routine or even a series of instructions that may be part of an operating system. The software may be stored in any type of machine-readable medium.
Referring to
The target directivity function defines a desired beam width and direction, and is applied to the steering matrix to yield a set of beamforming coefficients. The latter are then applied by the spatial sound processor 140 (see
Referring now to the flow diagram of
The OCE function given in block 112 is a collection of spatial weights that are designed to modify a target or selected HRTF such that the sound field of the HRTF will be given a predetermined geometry (induced sound field geometry) by emphasizing level at one or more desired directions and reducing (e.g., minimizing) level at undesired directions. For example,
Returning to
Any suitable approach may be used in block 114 to find an optimal fit for the directionally biased HRTF. In one aspect, an iterative least-squares method may be used to find the optimal fit, in which the target directivity function (e.g., steering direction) and the device steering matrix are inputs to a least-squares beamformer design algorithm (executed by the processor 130.) The method of least-squares is an approach in regression analysis to approximate the solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. “Least-squares” means that the overall solution minimizes the sum of the squares of the residuals, where a residual is the difference between an observed value and the fitted value provided by the model. The least-squares method may determine, for each microphone, an optimal fit between i) the spatial weights of the directionally biased HRTF that best correspond with that microphone and ii) the transfer functions for that microphone represented in the device steering matrix. The least-squares beamformer design algorithm outputs a set of beamforming coefficients for each microphone, such that the output includes left beamforming coefficients for all microphones represented in the array and right beamforming coefficients for all microphones represented in the array.
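In outline, the fit at each frequency bin can be expressed as a regularized least-squares problem, minimizing ||Aw - h||^2 + lam*||w||^2, where A is the device steering matrix (directions by microphones), h is the spatially biased HRTF sampled on the same directions, and lam is a regularization parameter. The sketch below assumes this formulation and hypothetical array shapes; the actual design algorithm may differ:

```python
import numpy as np

def ls_beamformer(A, h, lam):
    """Regularized least-squares fit of beam weights to a target.

    A   : (n_dirs, n_mics) device steering matrix at one frequency bin
    h   : (n_dirs,) spatially biased HRTF (target) at the same bin
    lam : regularization parameter
    Returns the (n_mics,) complex weights minimizing
    ||A w - h||^2 + lam * ||w||^2.
    """
    n_mics = A.shape[1]
    AhA = A.conj().T @ A
    return np.linalg.solve(AhA + lam * np.eye(n_mics), A.conj().T @ h)
```

Solving this per frequency bin, once with the left ear target and once with the right ear target, yields the left and right coefficient sets for every microphone in the array.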
In one aspect, the iterative least-squares method may be subject to a determined white-noise gain constraint. White-noise gain refers to the amplification of uncorrelated noise, such as electrical self-noise, that may occur during the optimal fit process. The white-noise gain constraint is the maximum noise amplification that is allowable while finding the best fit. The iterative least-squares method produces regularizer parameters, which represent the amount of error that is allowed by the best fit when considering the white-noise gain constraint. The regularizer parameters derived for the first ear are then used when determining the best fit of the OCE-adjusted HRTFs based on the device steering matrix for the second ear, in order to generate beamforming coefficients for the second ear.
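A common way to honor such a constraint, sketched below under the assumption that the regularizer is simply increased until the constraint is met (the disclosure's iteration may differ), is to grow the regularizer until the noise amplification of the fitted weights falls below the allowed maximum, then reuse that regularizer for the second ear:

```python
import numpy as np

def fit_with_wng_constraint(A, h, a_look, max_noise_gain,
                            lam0=1e-6, growth=2.0, max_iter=60):
    """Grow the regularizer until noise amplification is acceptable.

    A              : (n_dirs, n_mics) steering matrix at one frequency bin
    h              : (n_dirs,) spatially biased HRTF target at that bin
    a_look         : (n_mics,) steering vector toward the look direction
    max_noise_gain : allowed ratio ||w||^2 / |w^H a_look|^2
    Returns (w, lam); lam may then be reused when fitting the other ear,
    as described above.
    """
    lam = lam0
    n_mics = A.shape[1]
    for _ in range(max_iter):
        w = np.linalg.solve(A.conj().T @ A + lam * np.eye(n_mics),
                            A.conj().T @ h)
        noise_gain = (w.conj() @ w).real / abs(w.conj() @ a_look) ** 2
        if noise_gain <= max_noise_gain:
            break
        lam *= growth  # more regularization -> less noise amplification
    return w, lam
```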
The determination of whether a left side or a right side constitutes a first ear may be made based on the microphone configuration of the device 100. In an aspect, the first ear may be the ear that is on the same side of a vertical center axis of the device 100 as the side of the device 100 that has a lower microphone density. For instance, the first ear is the left ear of the user who is holding the device 100 during the video recording, if the left side of the device 100 has a lower density of microphones than the right side of the device as in the orientation shown in
In one aspect, the process in
The sets of beamforming coefficients produced by the least-squares method may be spectrally biased, such that the perceived timbre of the resulting binaural signal may not match the desired timbre. Moreover, since a regularizer is chosen based on a single ear, the resulting spectrum at the left and right ears may not be consistent, particularly when the arrangement of the microphones 7 that constitute the array 133 is not left-right symmetric. Since at high frequencies the human auditory localization system relies on interaural level differences, such spectral discrepancies may result in competing auditory cues, which may cause a degradation in spatial localization. The asymmetric equalizer 138 applies a correction filter to the beamforming coefficients, such that the diffuse field power average of the resulting beamforming weights of both ears (averaging over both space and ears) equals the diffuse field power average of a reference microphone on the device 100. In one aspect, the same transfer function is applied to the sets of coefficients for both the left ear and the right ear, resulting in symmetric equalization. In an aspect considering asymmetric equalization, the diffuse field average is computed independently for the left ear and the right ear, resulting in a left filter for the left ear and a right filter for the right ear. The correction filter for the left ear is applied to the left ear beamforming coefficients, and the correction filter for the right ear is applied to the right ear beamforming coefficients, correcting for the interaural level difference errors in a device 100 that has a left-right asymmetric microphone arrangement.
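The asymmetric variant can be sketched as follows (assuming a uniform direction grid as a stand-in for a diffuse field; the grid, array shapes, and reference microphone are illustrative, not the disclosed implementation):

```python
import numpy as np

def asymmetric_eq(W, A, ref_mic=0):
    """Per-ear diffuse field correction of beamforming coefficients.

    W : (n_freqs, n_mics) beamforming weights for ONE ear
    A : (n_freqs, n_dirs, n_mics) device steering matrix
    Scales each frequency bin so that the diffuse field power average of
    the beam pattern equals that of a single reference microphone.
    """
    W_eq = np.empty_like(W)
    for f in range(W.shape[0]):
        beam = A[f] @ W[f]                   # beam response per direction
        p_beam = np.mean(np.abs(beam) ** 2)  # diffuse power of the beam
        p_ref = np.mean(np.abs(A[f, :, ref_mic]) ** 2)  # ref mic power
        W_eq[f] = W[f] * np.sqrt(p_ref / p_beam)  # correction filter gain
    return W_eq
```

Calling this once with the left ear weights and once with the right ear weights produces the independent left and right correction filters described above.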
Finally, the asymmetric equalizer 138 outputs the corrected left set of spatially weighted beamforming coefficients and the corrected right set of spatially weighted beamforming coefficients to the spatial sound processor 140 (e.g., a binaural renderer.) The latter then applies those beamforming coefficients to the multi-channel audio pickup produced by the array 133 of the device 100, as part of the recorded audio-video program.
In an aspect, the multimedia recording device 100 is capable of recording video and audio in a variety of orientations. For instance, the multimedia recording device 100 may have two or more cameras, any of which may be used to make the video recording.
Each orientation may have an associated, respective On-Camera Emphasis (OCE) function. Sets of left beamforming coefficients and right beamforming coefficients may be generated for each orientation, using the OCE that is associated with that orientation. Thus, a library of sets of beamforming coefficients is generated, wherein each set is associated with a possible multimedia recording device 100 orientation.
In an aspect, a set of left beamforming coefficients and right beamforming coefficients may be selected for the specific orientation that matches the orientation of the multimedia recording device 100 while it is recording video. This selected set of left beamforming coefficients and right beamforming coefficients is then output to the spatial sound processor 140. The spatial sound processor 140 may use the left and right beamforming coefficients to generate a binaural output signal from an audio input signal, by beamforming. The binaural output signal may be output to a speaker system, such as left and right earphones (headphones) of a headset. In an aspect, the multimedia recording device 100 may generate the set of left beamforming coefficients and right beamforming coefficients in real time, by the processor 130 executing an algorithm, instead of selecting a set of beamforming coefficients from a library that has been created “offline,” i.e., not on the multimedia recording device.
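The lookup-and-render step might look like the following sketch (the orientation keys, placeholder coefficient values, and STFT-domain rendering are illustrative assumptions):

```python
import numpy as np

n_freqs, n_mics = 257, 4  # hypothetical STFT bins and microphone count

# Hypothetical library: one (left, right) coefficient set per device
# orientation, each (n_freqs, n_mics), generated as described above.
library = {
    "landscape_rear": (np.ones((n_freqs, n_mics), complex),
                       np.ones((n_freqs, n_mics), complex)),
    "portrait_front": (np.ones((n_freqs, n_mics), complex),
                       np.ones((n_freqs, n_mics), complex)),
}

def render_binaural(X, orientation):
    """Beamform a multichannel STFT pickup into a binaural pair.

    X : (n_freqs, n_frames, n_mics) STFT of the microphone array pickup
    """
    w_left, w_right = library[orientation]
    # Per-bin, per-frame inner product of mic channels with ear weights.
    y_left = np.einsum('ftm,fm->ft', X, np.conj(w_left))
    y_right = np.einsum('ftm,fm->ft', X, np.conj(w_right))
    return y_left, y_right
```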
In an aspect, the induced sound field geometry (for sound pickup—see
Various aspects of generating sets of beamforming coefficients from spatially weighted HRTFs may be used in applications where spatially biased audio is desired, by creating an OCE that emphasizes spatial focus in a determined direction. For example, it may be desirable for a hearing aid to focus sound in the direction that a user is facing, and so an OCE could be designed for that case, shaping the sound profile by the method discussed above.
In another aspect of the disclosure here, the programmed processor automatically selects a more “aggressive” OCE that is associated with a narrower pickup beam width, or higher directivity, in response to detecting that the recording device 100 is zooming in (the lens system of the camera is being adjusted past a first threshold, such that an object that is captured in the video now appears larger.)
In some cases, equalization (spectral shaping) is applied to correct for timbre changes that appear due to the newly selected OCE (e.g., when the OCE is focusing on the voice of a person.) To reduce the likelihood of such timbre changes (when switching between different OCEs), block 114 of
When rendering spatial audio that is responsive to the camera zooming in, one of the following choices can be made when computing the new beamforming coefficients (for the zoomed-in setting.) In one choice, a constraint is placed on the beamforming algorithm that leads to the on-axis sound level becoming greater (e.g., the person at the center of the video images is being zoomed in upon and their voice will become louder) while off-axis sound levels (e.g., voices of persons and objects that are not at the center of the video images) remain unchanged. In another choice, the constraint placed on the beamforming algorithm leads to the on-axis sound level remaining unchanged while off-axis sound levels are attenuated.
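One way to realize the second choice (a sketch only; the normalization below assumes a known steering vector toward the camera axis and is not necessarily the disclosed constraint) is to rescale the zoomed-in weights so that the response toward the camera axis is unchanged, leaving on-axis level fixed while the narrower beam attenuates off-axis sound:

```python
import numpy as np

def preserve_on_axis_level(w_new, w_old, a_axis):
    """Rescale new beam weights so on-axis response matches the old.

    w_new  : (n_mics,) weights for the narrower, zoomed-in OCE
    w_old  : (n_mics,) weights in use before zooming
    a_axis : (n_mics,) steering vector toward the camera's look direction
    """
    g_old = w_old.conj() @ a_axis  # on-axis response before the zoom
    g_new = w_new.conj() @ a_axis  # on-axis response after the zoom
    return w_new * (g_old / g_new)
```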
In yet another aspect, the programmed processor omits or does not apply any OCE (that narrows the focus of the sound pickup) in response to detecting that the user of the recording device 100 is manually zooming out (adjusting the lens system of the camera past a second threshold such that the object that is being captured in the video will appear smaller in the images.)
In summary, aspects of the disclosure are directed to methods and systems for maintaining the immersion offered by binaural recordings while at the same time keeping the auditory focus on the video playback. The method involves using an On-Camera Emphasis (OCE) function, which modifies HRTFs to enhance directional bias. The output is a binaural signal that amplifies sounds in the direction of the camera and attenuates sounds in other directions, while maintaining spatialization.
While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative of and not restrictive on the broad disclosure, and that the disclosure is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art. For example,
To aid the Patent Office and any readers of any patent issued on this application in interpreting the claims appended hereto, applicants wish to note that they do not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112(f) unless the words “means for” or “step for” are explicitly used in the particular claim.
This non-provisional patent application claims the benefit of the earlier filing date of U.S. provisional patent application No. 62/752,292 filed Oct. 29, 2018.