The present invention relates generally to the field of video viewing applications such as those that may be used in video teleconferencing systems and in the viewing of videos with associated audio (e.g., movies), and more particularly to a method and apparatus for enabling an improved experience by better matching of the auditory space to the visual space thereof.
Video teleconferencing systems are becoming ubiquitous for both business and personal applications. Moreover, movies and other videos with associated audio are watched in a huge variety of environments, including at home and at work. Most such prior art video systems make use of at least two audio speakers (e.g., either loudspeakers or headphone speakers) to provide the audio (i.e., the sound) which is to be played concurrently with the associated displayed video. However, such prior art systems rarely succeed (if they attempt it at all) in accurately matching the auditory space with the corresponding visual space. That is, in general, a prior art video teleconferencing system participant or other audio-video (e.g., movie) viewer who is watching a video display while listening to the corresponding audio will often not hear the sound as if it were accurately emanating from the proper physical (e.g., directional) location (e.g., an apparent physical location of a human speaker visible in the video). Even when a stereo (i.e., two or more channel) audio signal is provided, it will typically not match the appropriate corresponding visual angle, unless it happens to do so by chance. Therefore, a method and apparatus for accurately matching auditory space to visual space in video teleconferencing applications and video (e.g., movie) viewing applications would be highly desirable. Specifically, what is desired is a spatial audio rendering method that accurately matches spatial audio to video, regardless of whether the video is presented in 2D (i.e., as a two-dimensional video image projection) or in 3D (i.e., as a three-dimensional video image). (Note that 3D video display screens are likely to become far more common as 3D display technology—particularly those technologies that do not require the viewer to wear cumbersome eyeglasses—continues to develop.)
The instant inventor has recognized that at least one reason that prior art audio-video systems often fail to provide accurate spatial audio rendering is that the viewer's physical location relative to the video display screen is not taken into account. As such, the instant inventor has derived a method and apparatus for enabling an improved experience by better matching of the auditory space to the visual space in video viewing applications such as those that may be used in video teleconferencing systems and in the viewing of videos with associated audio (e.g., movies). In particular, in accordance with certain illustrative embodiments of the present invention, a viewer's location and head position relative to a video display screen is determined, one or more desired sound locations (which may, for example, be related to a projection on the video display) are determined, and binaural stereo audio signals which accurately locate the sound sources at the desired sound locations are advantageously generated.
More specifically, in accordance with one illustrative embodiment of the present invention, a method is provided for generating a spatial rendering of an audio sound to an observer using a plurality of speakers, the audio sound related to a video being displayed to said observer on a video screen having a given physical location, the method comprising receiving a video input signal for use in displaying said video to said observer on said video screen; receiving one or more audio input signals related to said video input signal, the one or more audio input signals including said audio sound; determining a desired physical location relative to said video screen for spatially rendering said audio sound, the desired physical location being determined based on a position on the video screen at which a particular portion of said video corresponding to the audio sound is being displayed; determining a current physical location of the observer relative to said video screen; and generating a plurality of audio output signals based on said determined desired physical location for spatially rendering said audio sound and further based on said determined current physical location of the observer relative to said video screen, said plurality of audio signals being generated such that when delivered to said observer using said plurality of speakers, the observer hears said audio sound as being rendered from said determined desired physical location for spatially rendering said audio sound.
In addition, in accordance with another illustrative embodiment of the present invention, an apparatus is provided for generating a spatial rendering of an audio sound to an observer, the apparatus comprising a plurality of speakers; a video screen having a given physical location, the video screen for displaying a video to the observer, the audio sound related to the video being displayed to the observer; a video input signal receiver which receives a video input signal used to display the video to said observer on said video screen; an audio input signal receiver which receives one or more audio input signals related to said video input signal, the one or more audio input signals including said audio sound; a processor which (a) determines a desired physical location relative to said video screen for spatially rendering said audio sound, the desired physical location being determined based on a position on the video screen at which a particular portion of said video corresponding to the audio sound is being displayed, and (b) determines a current physical location of the observer relative to said video screen; and an audio output signal generator which generates a plurality of audio output signals based on said determined desired physical location for spatially rendering said audio sound and further based on said determined current physical location of the observer relative to said video screen, said plurality of audio signals being generated such that when delivered to said observer using said plurality of speakers, the observer hears said audio sound as being rendered from said determined desired physical location for spatially rendering said audio sound.
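By way of a non-limiting illustration only, the following sketch shows one way the two geometric determinations recited above (the desired rendering location derived from an on-screen position, and the rendering angle at the observer's current location) might be expressed in software. The function names, the pixel-based screen coordinates, and the convention that the screen plane lies at y = 0 with the observer in front of it are assumptions made purely for this sketch and are not part of the claimed method or apparatus.

```python
import numpy as np

def desired_location(screen_xy_px, screen_width_px, screen_width_m):
    """Map a position on the display (in pixels, with the origin at the screen
    center) to a physical location on the screen plane y = 0, in meters."""
    scale = screen_width_m / screen_width_px
    return (screen_xy_px[0] * scale, 0.0)

def rendering_angle(desired_xy, observer_xy, observer_yaw):
    """Horizontal angle (radians) at which the sound should be rendered,
    measured relative to the observer's current viewing direction; the
    observer is assumed to sit in front of the screen plane (y > 0)."""
    dx = desired_xy[0] - observer_xy[0]
    dy = observer_xy[1] - desired_xy[1]          # depth separation
    return np.arctan2(dx, dy) - observer_yaw
```

The per-source angles produced in this manner could then be handed to any conventional binaural renderer (for example, head-related impulse response convolution, optionally followed by cross-talk cancellation for loudspeaker playback).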
Moreover, if the monitor size is changed, for example, the visual and auditory spaces will also no longer match.
However, since the angle from observer 32 to the visual projection of the (same) speaker on video display screen 33 at location 36 thereof differs from the corresponding angle in the setup of
Other approaches that have been employed include (a) binaural audio rendering and (b) sound field synthesis techniques. In binaural audio rendering, which is fully familiar to those of ordinary skill in the art, two audio signals are produced, one for the left ear and one for the right ear. Binaural audio can therefore be easily and directly reproduced with headphones. When played over a pair of loudspeakers, however, the binaural signals need to be processed by a cross-talk canceller, which preprocesses each of the loudspeaker signals such that the cross-talk from the right loudspeaker to the left ear, and vice-versa, properly cancels out at each of the listener's ears. Such techniques are well known and familiar to those of ordinary skill in the art. Moreover, added realism for binaural rendering over headphones may be achieved when head-tracking is used to assist the rendering process. In particular, such a system may advantageously adjust the synthesized binaural signal such that the location of a sound source does not inappropriately turn along with the head of the listener, but rather stays fixed in space regardless of the rotational head movement of the listener. For example, one prominent application of this technique is in the rendering of “3/2 stereo” (such as Dolby 5.1®) over headphones. In such a case, the five individual loudspeaker signals are mixed down to a binaural signal accounting for the standardized positional angles of the loudspeakers. For example, the front-left speaker, positioned at 30 degrees to the left of the listener, may be advantageously convolved with the head-related impulse response corresponding to a 30-degree sound arrival angle.
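Purely as an illustrative sketch of the head-tracked downmix described above (and not as part of the present invention), the following renders a set of virtual loudspeakers at fixed nominal angles and compensates for horizontal head rotation only; the HRIR lookup is assumed to be supplied by the caller, and the channel names and angles follow the customary 3/2-stereo layout.

```python
import numpy as np

# Nominal loudspeaker azimuths (degrees) for a 3/2-stereo layout.
STANDARD_ANGLES_DEG = {"L": 30.0, "R": -30.0, "C": 0.0, "Ls": 110.0, "Rs": -110.0}

def downmix_to_binaural(channels, head_yaw_deg, hrir_for_angle):
    """Head-tracked binaural downmix of a 3/2-stereo signal set.

    channels:       dict mapping channel name -> 1-D numpy sample array
    head_yaw_deg:   current horizontal head rotation of the listener (degrees)
    hrir_for_angle: callable(angle_deg) -> (hrir_left, hrir_right)
    """
    n = len(next(iter(channels.values())))
    left, right = np.zeros(n), np.zeros(n)
    for name, samples in channels.items():
        # Keep each virtual loudspeaker fixed in space: its arrival angle
        # relative to the head changes as the head rotates.
        relative_angle = STANDARD_ANGLES_DEG[name] - head_yaw_deg
        hrir_l, hrir_r = hrir_for_angle(relative_angle)
        left += np.convolve(samples, hrir_l)[:n]
        right += np.convolve(samples, hrir_r)[:n]
    return left, right
```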
Unfortunately, such systems are limited to the compensation of horizontal head-rotation—other head movements, such as forward-backward and left-right movements, are not appropriately compensated for. In PC-based teleconferencing applications, for example, where the participant's distance to the video display (e.g., the monitor) is usually much closer than it is in typical movie playback systems, a sideward head movement may, for example, be as large as the size of the monitor itself. As such, a failure to compensate for such movements (among others) significantly impairs the ability of the system to maintain the correct directional arrival of the sound. Furthermore, the generation of binaural signals is commonly based on the assumption that the listener's position is fixed (except for his or her rotational head movement), and therefore cannot, for example, allow the listener to move physically around and experience the changes of arrival directions of sound sources—for example, such systems do not allow a listener to walk around a sound source. In other words, prior-art methods of binaural audio take movements of sound sources into account, as well as rotation of a listener's head, but they do not provide a method to take a listener's body movements into account.
More specifically, generating binaural signals is commonly based on sound arrival angles, whereby distance to the sound source is typically modeled by sound level, ratio of direct sound to reflected/reverberated sound, and frequency response changes. Such processing may be sufficient as long as either (a) the listener only moves his head (pitch, yaw, roll), but does not move his entire body to another location, or (b) the sound source is significantly distant from the listener such that lateral body movements are much smaller than the distance from the listener to the sound source. For example, when binaural room impulse responses are used to reproduce with headphones the listening experience of a loudspeaker set in a room at a particular listener position, some minimal lateral body movement of the listener will be acceptable, as long as such movement is substantially smaller than the distance to the reproduced sound source (which, for stereo, is typically farther away than the loudspeakers themselves). On the other hand, for a PC-based audiovisual telecommunication setup, for example, lateral movements of the listener can no longer be neglected, since they may be of a similar magnitude to the distance between the listener and the sound source.
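A short numerical illustration of this point (with distances chosen only as plausible examples): the same 0.3 m sideways movement that is nearly negligible in a living-room playback setup produces a large change in the true arrival direction when the screen, and the sound source rendered at it, is only an arm's length away.

```python
import math

def arrival_angle_deg(lateral_offset_m, distance_m):
    """Arrival direction of a source that was straight ahead before the
    listener moved sideways by lateral_offset_m (source at distance_m)."""
    return math.degrees(math.atan2(lateral_offset_m, distance_m))

print(arrival_angle_deg(0.3, 0.6))   # PC monitor at ~0.6 m  -> about 26.6 degrees
print(arrival_angle_deg(0.3, 3.0))   # living room at ~3 m   -> about 5.7 degrees
```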
Sound field synthesis techniques, on the other hand, include “Wavefield Synthesis” and “Ambisonics,” each of which is also familiar to those skilled in the art. Wavefield synthesis (WFS) is a 3D audio rendering technique which has the desirable property that a specific source location may be defined, expressed, for example, by both its depth behind or in front of the screen and by its lateral position. When 3D video is presented with WFS, for example, the visual space and the auditory space match over a fairly wide area. However, when 2D video is presented with WFS-rendered audio, the visual space and auditory space typically match only in a small area in and around the center position.
Ambisonics is another sound field synthesis technique. A first-order Ambisonics system, for example, represents the sound field at a location in space by the sound pressure and by a three dimensional velocity vector. In particular, sound recording is performed using four coincident microphones—an omnidirectional microphone for sound pressure, and three “figure-of-eight” microphones for the corresponding velocity in each of the x, y, and z directions. Recent studies have shown that higher order Ambisonics techniques are closely related to WFS techniques.
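For concreteness, a minimal sketch of conventional first-order Ambisonics (B-format) encoding of a single mono source follows; the 1/sqrt(2) weighting of the W channel and the channel ordering are one common convention among several, and nothing here is specific to the present invention.

```python
import numpy as np

def encode_first_order_bformat(signal, azimuth, elevation):
    """Encode a mono signal as first-order B-format (W, X, Y, Z).
    W approximates the omnidirectional (pressure) pickup; X, Y, Z approximate
    the three figure-of-eight (velocity) pickups along the x, y, z axes."""
    w = signal / np.sqrt(2.0)
    x = signal * np.cos(azimuth) * np.cos(elevation)
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    return np.stack([w, x, y, z])
```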
The center of the coordinate system may be advantageously chosen to coincide with the center of true-to-life size video display 43. As is shown in the figure, sound source location 41 (S) is laterally displaced from the center of the screen by xs and the appropriate depth of the source is ys. Likewise, observer 44 (V) is laterally displaced from the center of the screen by xv and the distance of observer 44 (V) from the screen is yv.
In accordance with the principles of the present invention, we can advantageously correctly render binaural sound for observer 44 (V) by advantageously determining the sound arrival angle:
γ=α+β,
where β can be advantageously determined as follows:
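Illustratively, and assuming that the source depth ys is measured behind the screen plane while the observer distance yv is measured in front of it (consistent with the coordinate conventions above), β may be taken as
β=arctan((xs−xv)/(ys+yv)).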
Once γ has been advantageously determined, it will be obvious to those of ordinary skill in the art that, based on prior art binaural audio techniques, a proper binaural audio-visual rendering of sound source S may be performed in accordance with the first illustrative embodiment of the present invention. In a similar manner, a proper binaural audio-visual rendering of sound source location 42 (S*), which may, for example, be from a person shown on video display 43 at screen position 46 who is currently speaking, may also be performed in accordance with this illustrative embodiment of the present invention.
If the video display differs from true-to-life size, however, the use of angle γ as determined in accordance with the illustrative embodiment of
Note that only the auditory locations of sound source locations 51 (S) and 52 (S*) are shown in the figure, without their corresponding visual locations being shown. (The actual visual locations will, for example, differ for 2D and 3D displays.) Note also that the person creating sound at sound source location 52 (S*) will, in fact, visually appear to observer 54 on the screen of video display 53, even though, assuming that angle γ is used as described above with reference to
Specifically, in accordance with this illustrative embodiment of the present invention, the proper correspondence between the auditory rendering and the non-true-to-life visual rendering is addressed by advantageously scaling the spatial properties of the audio proportionally to the video. In particular, a scaling factor r is determined as follows:
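Illustratively, and assuming that W denotes the actual width of the video display screen being used, the scaling factor may be taken as
r=W/W0,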
where W0 denotes the screen width which would be required for true-to-life size visual rendering (e.g., the screen width of video display 43 shown in
Specifically,
The center of the coordinate system again may be advantageously chosen to coincide with the center of (reduced size) video display 63. As is shown in the figure, sound source location 61 (S) is laterally displaced from the center of the screen by xs and the depth of the source is ys. Likewise, observer 64 (V) is laterally displaced from the center of the screen by xv and the distance of observer 64 (V) from the screen is yv.
Therefore, in accordance with the principles of the present invention, and further in accordance with the illustrative embodiment shown in
(a) originally located sound source 71 (S), which may, for example, be a person shown on video display screen 73 who is currently speaking and whose projection point 76 (Sp) is located on the display screen at position (xsp, 0), and
(b) the “effective” location of the video camera lens—location 72 (C)—which captured (or is currently capturing) the video being displayed on video display 73—specifically, C is shown in the figure as located at position (0, yc), even though it is, in fact, probably not actually located in the same physical place as viewer 74. In particular, the value yc represents the effective distance of the camera which captured (or is currently capturing) the video being displayed on the display screen.
Specifically, then, in accordance with this third illustrative embodiment of the present invention, we advantageously relocate the sound source to scaled sound source 75 (S′), which is to be advantageously located at position (xs′, ys′). To do so, we advantageously derive the value of angle β′ as follows:
First, we note that given the coordinate xsp of the projection point 76 (Sp), and based on the similar triangles in the figure, we find that
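xsp/yc=xs/(yc+ys) (illustratively, assuming that the camera distance yc is measured in front of the screen plane and the source depth ys behind it),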
and, therefore, that
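xsp=xs·yc/(yc+ys).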
Then, we can advantageously determine the coordinates (xs′, ys′) of the scaled sound source 75. For this purpose, we advantageously introduce a scaling factor 0≦ρ≦1 to determine how the sound source is to be advantageously scaled along the line spanned by the two points S (original sound location 71) and Sp (projection point 76). For ρ=1, for example, the originally located sound source 71 (S) would not be scaled at all—that is, scaled sound source 75 (S′) would coincide with originally located sound source 71 (S). For ρ=0, on the other hand, the originally located sound source 71 (S) would be scaled maximally—that is, scaled sound source 75 (S′) would coincide with projection point 76 (Sp). Given such a definition of the scaling factor ρ, we advantageously obtain:
xS′=xSP+ρ·(xS−xSP); or
xS′=xSP·(1−ρ)+ρ·xS,
and using the above derivation of xsp, we advantageously obtain:
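xs′=xs·[ρ+(1−ρ)·yc/(yc+ys)], and, since projection point 76 (Sp) lies in the screen plane, correspondingly ys′=ρ·ys (again illustratively, under the coordinate conventions assumed above).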
Using the coordinates (xs′, ys′) of scaled sound source 75 (S′), we can then advantageously determine the value of angle β′ as follows:
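Illustratively, and consistent with the conventions assumed above,
β′=arctan((xs′−xv)/(ys′+yv)).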
Note that in response to a change in the display size, we may advantageously scale the coordinates of (xS, yS) and (xC, yC) in a similar manner to that described and shown in
Finally, taking into account the fact that the observer's head position—that is, viewing direction 75—is turned to the right by angle α, we can advantageously render accurate binaural sound for observer 74 (V) by advantageously determining the (total) sound arrival angle α+β′.
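To tie the geometry of this third illustrative embodiment together, a minimal computational sketch follows. It assumes the coordinate conventions used above (screen plane at y = 0, source depth ys behind the screen, camera distance yc and observer distance yv in front of it) and the illustrative expressions given above for xsp, (xs′, ys′), and β′; it is offered only as one possible rendering of those relations in code.

```python
import math

def total_arrival_angle(xs, ys, yc, xv, yv, alpha, rho):
    """Total sound arrival angle (alpha + beta') for the scaled sound source.

    (xs, ys): original source, lateral offset xs, depth ys behind the screen
    yc:       effective distance of the capturing camera in front of the screen
    (xv, yv): observer, lateral offset xv, distance yv in front of the screen
    alpha:    observer's horizontal head rotation (radians)
    rho:      scaling factor in [0, 1]; rho = 1 keeps S, rho = 0 moves it to Sp
    """
    xsp = xs * yc / (yc + ys)                    # projection point Sp on the screen
    xs_prime = (1.0 - rho) * xsp + rho * xs      # scaled source S' along Sp-S
    ys_prime = rho * ys
    beta_prime = math.atan2(xs_prime - xv, ys_prime + yv)
    return alpha + beta_prime

# Example: a talker displayed left of center on a reduced-size display.
angle = total_arrival_angle(xs=0.5, ys=1.0, yc=1.5, xv=0.1, yv=0.6,
                            alpha=0.0, rho=0.5)
print(math.degrees(angle))
```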
Note that an appropriate scaling factor ρ may be advantageously derived from a desired maximum tolerable visual and auditory source angle mismatch. As shown in
From this equation, or directly from
Note that from the camera's view angle (v as shown in
Conference room 801 contains motor-driven dummy head 803, a motorized device which takes the place of a human head and moves in response to commands provided thereto. Such dummy heads are fully familiar to those skilled in the art. Dummy head 803 comprises right in-ear microphone 804, left in-ear microphone 805, right in-eye camera 806, and left in-eye camera 807. Microphones 804 and 805 advantageously capture the sound which is produced in conference room 801, and cameras 806 and 807 advantageously capture the video (which may be produced in stereo vision) from conference room 801—both based on the particular orientation (view angle) of dummy head 803.
In accordance with the principles of the present invention, and in accordance with the fourth illustrative embodiment thereof, the head movements of remote participant 812 are tracked with head tracker 808, and the resultant head movement data is transmitted by link 815 from remote room 802 to conference room 801. There, this head movement data is provided to dummy head 803 which properly mimics the head movements of remote participant 812 in accordance with an appropriate angle conversion function f(Δφ) as shown on link 815. (The function “f” will depend on the location of the dummy head in conference room 801, and will be easily ascertainable by one of ordinary skill in the art. Illustratively, the function “f” may simply be the identity function, i.e., f(Δφ)=Δφ, or it may simply scale the angle, i.e., f(Δφ)=qΔφ, where q is a fraction.) Moreover, the video captured in conference room 801 by cameras 806 and 807 is transmitted by link 813 back to remote room 802 for display on video display screen 811, and the binaural (L/R) audio captured by microphones 804 and 805 is transmitted by link 814 back to remote room 802 for use by speakers 809 and 810. Video display screen 811 may display the received video in either 2D or 3D. However, in accordance with the principles of the present invention, and in accordance with the fourth illustrative embodiment thereof, the binaural audio played by speakers 809 and 810 will be advantageously generated in accordance with the principles of the present invention based, inter alia, on the location of the human speaker on video display screen 811, as well as on the physical location of remote participant 812 in remote room 802 (i.e., on the location of remote participant 812 relative to video display screen 811).
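As a small, purely illustrative sketch of the head-movement relay just described (the tracker-reading and dummy-head-driving interfaces are assumed to be supplied by the surrounding system and are not specified here):

```python
def angle_conversion(delta_phi, q=1.0):
    """The conversion f(delta_phi) applied before driving the dummy head:
    the identity mapping for q = 1.0, or a simple scaling for a fractional q."""
    return q * delta_phi

def relay_head_movement(read_tracked_rotation, drive_dummy_head, q=1.0):
    """One update of the relay from the remote room (head tracker 808) to the
    conference room (motor-driven dummy head 803)."""
    delta_phi = read_tracked_rotation()          # remote participant's rotation
    drive_dummy_head(angle_conversion(delta_phi, q))
```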
Conference room 901 contains 360 degree camera 903 (or in accordance with other illustrative embodiments of the present invention, a partial angle video camera) which advantageously captures video representing at least a portion of the activity in conference room 901, as well as a plurality of microphones 904—preferably one for each conference participant distributed around conference room table 905—which advantageously capture the sound which is produced by conference participants in conference room 901.
In accordance with the principles of the present invention, and in accordance with one illustrative embodiment thereof as shown in
In accordance with one illustrative embodiment of the present invention, camera 903 may be a full 360 degree camera and the entire 360 degree video may be advantageously transmitted via link 913 to remote room 902. In this case, the video displayed on the video screen may comprise video extracted from or based on the entire 360 degree video, as well as on the head movements of remote participant 912 (tracked with head tracker 908). In accordance with this illustrative embodiment of the present invention, transmission of the head movement data to conference room 901 across link 915 need not be performed. In accordance with another illustrative embodiment of the present invention, camera 903 may be either a full 360 degree camera or a partial view camera, and based on the head movement data received over link 915, a particular limited portion of video from conference room 901 is extracted and transmitted via link 913 to remote room 902. Note that the latter described illustrative embodiment of the present invention will advantageously enable a substantial reduction of the data rate employed in the transmission of the video across link 913.
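As one hedged illustration of the latter variant (extracting only a limited portion of the captured video based on the received head-movement data, so as to reduce the transmitted data rate), the sketch below simply cuts a horizontal field of view out of a full 360-degree equirectangular frame, centered on the tracked head yaw; it ignores vertical head movement and proper rectilinear reprojection, which a practical system might also address.

```python
import numpy as np

def extract_viewport(equirect_frame, head_yaw_deg, fov_deg=90.0):
    """Return the horizontal strip of a 360-degree (equirectangular) frame
    that is centered on the viewer's current head yaw; only this reduced
    portion would then need to be transmitted to the remote room."""
    height, width, _ = equirect_frame.shape
    center_col = int((head_yaw_deg % 360.0) / 360.0 * width)
    half = int(fov_deg / 360.0 * width / 2)
    cols = [(center_col + offset) % width for offset in range(-half, half)]
    return equirect_frame[:, cols, :]
```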
In accordance with either of these above-described illustrative embodiments of the present invention as shown in
In addition, as shown in the figure, angle computation module 1006 advantageously receives viewer location data which provides information regarding the physical location of viewer 1012 (Dx, Dy, Dz)—in particular, with respect to the known location of the video display screen being viewed (which is not shown in the figure), as well as the tilt angle (Δφ), if any, of the viewer's head. In accordance with one illustrative embodiment of the present invention, the viewer's location may be fixed (i.e., the viewer does not move in relation to the display screen), in which case this fixed location information is provided to angle computation module 1006. In accordance with another illustrative embodiment of the present invention, the viewer's location may be determined with use of (optional) head tracking module 1007, which, as shown in the figure, is provided position information for the viewer with use of position sensor 1009. As pointed out above in the discussion of
In any case, based on both the sound source location information and on the viewer location information, as well as on the knowledge of the screen size of the given video display screen being used, angle computation module 1006, using the principles of the present invention and in accordance with an illustrative embodiment thereof, advantageously generates the desired angle information (illustratively, φ1 through φn) for each one of the corresponding plurality of monaural audio signals (illustratively, s1 through sn) and provides this desired angle information to binaural mixer 1005. Binaural mixer 1005 then generates a pair of stereo binaural audio signals, in accordance with the principles of the present invention and in accordance with an illustrative embodiment thereof, which will advantageously provide improved matching of auditory space to visual space. In accordance with one illustrative embodiment of the present invention, viewer 1012 uses headphones (not shown in the figure, which represents a different illustrative embodiment of the present invention) which comprise a pair of speakers (a left ear speaker and a right ear speaker) to which these two stereo binaural audio signals are respectively and directly provided.
In accordance with the illustrative embodiment of the present invention as shown in
The preceding merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
A person of ordinary skill in the art would readily recognize that steps of various above-described methods can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions, wherein said instructions perform some or all of the steps of said above-described methods. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of the above-described methods.
The functions of any elements shown in the figures, including functional blocks labeled as “processors,” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements which performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The invention as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. Applicant thus regards any means which can provide those functionalities as equivalent to those shown herein.
The present application is related to co-pending U.S. patent application Ser. No. ______, “Method And Apparatus For Improved Matching Of Auditory Space To Visual Space In Video Teleconferencing Applications Using Window-Based Displays,” filed by W. Etter on even date herewith and commonly assigned to the assignee of the present invention.