Method and apparatus for improved matching of auditory space to visual space in video viewing applications

Abstract
A method and apparatus for enabling an improved experience by better matching of the auditory space to the visual space in video viewing applications such as those that may be used in video teleconferencing systems and in the viewing of videos with associated audio (e.g., movies). In one embodiment, a viewer's location and head position relative to a video display screen are determined, one or more desired sound source locations (which may, for example, be related to a projection on the video display) are determined, and binaural stereo audio signals which accurately locate the sound sources at the desired sound source locations are advantageously generated.
Description
FIELD OF THE INVENTION

The present invention relates generally to the field of video viewing applications such as those that may be used in video teleconferencing systems and in the viewing of videos with associated audio (e.g., movies), and more particularly to a method and apparatus for enabling an improved experience by better matching of the auditory space to the visual space thereof.


BACKGROUND OF THE INVENTION

Video teleconferencing systems are becoming ubiquitous for both business and personal applications. Moreover, everyone watches movies and other videos with associated audio in a huge variety of environments including at home and at work. And most such prior art video systems make use of at least two audio speakers (e.g., either loudspeakers or headphone speakers) to provide the audio (i.e., the sound) which is to be played concurrently with the associated displayed video. However, such prior art systems rarely succeed in (assuming that they even try) matching accurately the auditory space with the corresponding visual space. That is, in general, a prior art video teleconferencing system participant or other audio-video (e.g., movie) viewer who is watching a video display while listening to the corresponding audio will often not hear the sound as if it were accurately emanating from the proper physical (e.g., directional) location (e.g., an apparent physical location of a human speaker visible in the video). Even when a stereo (i.e., two or more channel) audio signal is provided, it will typically not match the appropriate corresponding visual angle, unless it happens to do so by chance. Therefore, a method and apparatus for accurately matching auditory space to visual space in video teleconferencing applications and video (e.g., movie) viewing applications would be highly desirable. Specifically, what is desired is a spatial audio rendering method that accurately matches spatial audio to video, regardless of whether video is presented in 2D (i.e., as a two dimensional video image projection) or in 3D (i.e., as a three-dimensional video image). (Note that 3D video display screens are likely to become far more common as 3D display technology—particularly those technologies that do not require the viewer to wear cumbersome eyeglasses—continues to develop.)


SUMMARY OF THE INVENTION

The instant inventor has recognized that at least one reason that prior art audio-video systems often fail to provide accurate spatial audio rendering is that the viewer's physical location relative to the video display screen is not taken into account. As such, the instant inventor has derived a method and apparatus for enabling an improved experience by better matching of the auditory space to the visual space in video viewing applications such as those that may be used in video teleconferencing systems and in the viewing of videos with associated audio (e.g., movies). In particular, in accordance with certain illustrative embodiments of the present invention, a viewer's location and head position relative to a video display screen are determined, one or more desired sound locations (which may, for example, be related to a projection on the video display) are determined, and binaural stereo audio signals which accurately locate the sound sources at the desired sound locations are advantageously generated.


More specifically, in accordance with one illustrative embodiment of the present invention, a method is provided for generating a spatial rendering of an audio sound to an observer using a plurality of speakers, the audio sound related to a video being displayed to said observer on a video screen having a given physical location, the method comprising receiving a video input signal for use in displaying said video to said observer on said video screen; receiving one or more audio input signals related to said video input signal, the one or more audio input signals including said audio sound; determining a desired physical location relative to said video screen for spatially rendering said audio sound, the desired physical location being determined based on a position on the video screen at which a particular portion of said video corresponding to the audio sound is being displayed; determining a current physical location of the observer relative to said video screen; and generating a plurality of audio output signals based on said determined desired physical location for spatially rendering said audio sound and further based on said determined current physical location of the observer relative to said video screen, said plurality of audio signals being generated such that when delivered to said observer using said plurality of speakers, the observer hears said audio sound as being rendered from said determined desired physical location for spatially rendering said audio sound.


In addition, in accordance with another illustrative embodiment of the present invention, an apparatus is provided for generating a spatial rendering of an audio sound to an observer, the apparatus comprising a plurality of speakers; a video screen having a given physical location, the video screen for displaying a video to the observer, the audio sound related to the video being displayed to the observer; a video input signal receiver which receives a video input signal used to display the video to said observer on said video screen; an audio input signal receiver which receives one or more audio input signals related to said video input signal, the one or more audio input signals including said audio sound; a processor which (a) determines a desired physical location relative to said video screen for spatially rendering said audio sound, the desired physical location being determined based on a position on the video screen at which a particular portion of said video corresponding to the audio sound is being displayed, and (b) determines a current physical location of the observer relative to said video screen; and an audio output signal generator which generates a plurality of audio output signals based on said determined desired physical location for spatially rendering said audio sound and further based on said determined current physical location of the observer relative to said video screen, said plurality of audio signals being generated such that when delivered to said observer using said plurality of speakers, the observer hears said audio sound as being rendered from said determined desired physical location for spatially rendering said audio sound.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows a prior art environment for providing monaural audio rendering of a sound source in a video teleconferencing application.



FIG. 2 shows a prior art environment for providing stereo audio rendering of a sound source in a video teleconferencing application.



FIG. 3 shows a prior art environment for providing stereo audio rendering of a sound source in a video teleconferencing application but which uses a smaller monitor/screen size as compared to the prior art environment of FIG. 2.



FIG. 4 shows an illustrative environment for providing true-to-life size audio-visual rendering of a sound source in a video teleconferencing application, in accordance with a first illustrative embodiment of the present invention.



FIG. 5 shows the effect on the illustrative environment for providing audio-visual rendering of a sound source in a video teleconferencing application as shown in FIG. 4, when a smaller monitor/screen size is used.



FIG. 6 shows an illustrative environment for providing audio-visual rendering of a sound source in a video teleconferencing application which provides screen-centered scaling for auditory space, in accordance with a second illustrative embodiment of the present invention.



FIG. 7 shows an illustrative environment for providing audio-visual rendering of a sound source in a video teleconferencing application which provides camera-lens-centered scaling, in accordance with a third illustrative embodiment of the present invention.



FIG. 8 shows an illustrative environment for providing binaural audio-visual rendering of a sound source in a video teleconferencing application using a video display screen and a dummy head in the subject conference room, in accordance with a fourth illustrative embodiment of the present invention.



FIG. 9 shows an illustrative environment for providing binaural audio-visual rendering of a sound source in a video teleconferencing application using a video display screen and a 360 degree or partial angle video camera in the subject conference room, in accordance with a fifth illustrative embodiment of the present invention.



FIG. 10 shows a block diagram of an illustrative system for providing binaural audio-visual rendering of a sound source in a video teleconferencing application using head tracking and adaptive crosstalk cancellation, in accordance with a sixth illustrative embodiment of the present invention.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS


FIG. 1 shows a prior art environment for providing monaural audio rendering of a sound source in a video teleconferencing application. Such an environment is probably the most common setup in today's PC-based teleconferencing systems. Two speakers are commonly used: left speaker 14, located on the left side of the monitor (i.e., video display screen 13), and right speaker 15, located on the right side of the monitor. However, the audio signal is commonly a monaural signal; that is, both left and right loudspeakers receive the same signal. As a result, the audio appears to observer 12 (shown as being located at position xv, yv) to be emanating from audio source location 11 (shown as being located at position xs, ys), which is merely a “phantom” source which happens to be located in the middle of the two speakers. Although the monitor may be showing multiple conference participants in different visual positions, or a video (e.g., a movie) comprising human speakers located at various positions on the screen, each of their auditory positions appears to be in the same location, namely, right in the middle of the monitor. Since the human ear is typically able to distinguish auditory angle differences of about 1 degree, such a setup produces a clear conflict between visual and auditory space. In addition, the monaural reproduction reduces intelligibility, particularly in a videoconferencing environment when multiple people try to speak at the same time, or when an additional noise source disturbs the audio signal.



FIG. 2 shows a prior art environment for providing stereo audio rendering of a sound source in a video teleconferencing or video (e.g., movie) viewing application. In this environment, observer 22 (shown as being located at position xv, yv) and the pair of loudspeakers—left speaker 24 located on the left side of the monitor (i.e., video display screen 23), and right speaker 25 located on the right side of the monitor (i.e., video display screen 23)—typically span a roughly equilateral triangle. That is, the angle between the two speakers and the listener (i.e., the observer) is approximately 60 degrees. Furthermore, in such a stereo rendering environment, the loudspeakers now receive different signals, which are typically generated by panning the audio sources to the desired positions within the stereo basis. Specifically, this “fixed” environment may, in fact, be specifically set up such that both visual and auditory spaces do match. Namely, if the individual loudspeaker signals are properly generated, then, when a speaker is visually projected on video display screen 23 at, for example, screen location 26 thereof, the audio source location of the speaker may, in fact, appear to observer 22 as being located at source location 21 (shown as being located at position xs, ys), which properly corresponds to the visual location thereof (i.e., visual projection screen location 26 on video display screen 23). However, the “proper” operation of this setup (wherein the visual and auditory spaces do match) necessarily requires that observer 22 is, in fact, located at the precise “sweet spot”—namely, as shown in the figure at position xv, yv, which is, as pointed out above, typically pre-calculated to be at an approximately 60 degree angle from the two speakers. If, on the other hand, the observer changes the distance “D” to the screen, or otherwise moves his or her physical location (e.g., moves sideways), the visual and auditory spaces will clearly no longer match.


Moreover, if the monitor size is changed, for example, the visual and auditory spaces will also no longer match. FIG. 3 shows the prior art environment for providing stereo audio rendering of a sound source in a video teleconferencing application which uses a smaller monitor/screen size as compared to the prior art environment of FIG. 2. Specifically, the figure shows observer 32 (shown as being located at position xv, yv) and the pair of loudspeakers (left speaker 34 located on the left side of the monitor, i.e., video display screen 33, and right speaker 35 located on the right side of the monitor) such that, as in the case of the environment of FIG. 2, they span an equilateral triangle. That is, the angle between the two speakers and the listener (i.e., the observer) remains at 60 degrees. Also, as in the environment of FIG. 2, the left and right loudspeakers receive different individual audio signals, and these are the same signals that they were assumed to receive in the case of FIG. 2, having been generated by the same panning of the audio sources to the desired positions within the stereo basis.


However, since the angle from observer 32 to the visual projection of the (same) speaker on video display screen 33 at location 36 thereof differs from the corresponding angle in the setup of FIG. 2, the audio source location of the speaker will now, in fact, appear to observer 32 as being located at source location 31 (shown as being located at position xs, ys), which no longer properly corresponds to the visual location thereof (i.e., visual projection screen location 36 on video display screen 33). That is, even when observer 32 maintains the 60 degree angle to the loudspeakers and the distance “D” from the screen, the visual and auditory spaces will no longer match. Rather, the sound sources would now need to be panned to different angles to match the auditory space to the visual space, based on the changed video display size.
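By way of illustration only, the dependence of the visual angle on the screen size may be sketched in Python as follows. The function name, the normalized picture coordinate, and the numeric values are hypothetical and are not taken from the figures; the observer is assumed to be centered in front of the screen.

```python
import math

def visual_angle_deg(relative_position, screen_width, viewing_distance):
    """Angle (degrees) from the screen normal at which an on-screen image point is seen.

    relative_position runs from -1.0 (left edge) to +1.0 (right edge) of the picture.
    The observer is assumed to sit on the screen's center line.
    """
    lateral_offset = relative_position * screen_width / 2.0
    return math.degrees(math.atan2(lateral_offset, viewing_distance))

# Hypothetical numbers: the same picture shown on a 1.0 m wide and a 0.5 m wide
# screen, both viewed from 0.8 m, with the talker halfway between center and edge.
large = visual_angle_deg(0.5, screen_width=1.0, viewing_distance=0.8)
small = visual_angle_deg(0.5, screen_width=0.5, viewing_distance=0.8)
print(f"large screen: {large:.1f} deg, small screen: {small:.1f} deg")
```

Because the panned audio signals are unchanged between the two cases, the auditory angle stays the same while the visual angle shrinks with the screen, which is the mismatch described above.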


Other approaches that have been employed include (a) binaural audio rendering and (b) sound field synthesis techniques. In binaural audio rendering, which is fully familiar to those of ordinary skill in the art, two audio signals are produced, one for the left ear and one for the right ear. Binaural audio can therefore be directly reproduced with headphones. When played over a pair of loudspeakers, however, the binaural signals need to be processed by a cross-talk canceller, which preprocesses each of the loudspeaker signals such that the cross-talk from the right loudspeaker to the left ear and vice-versa properly cancels out at the listener's individual ears. Such techniques are well known and familiar to those of ordinary skill in the art. Moreover, added realism for binaural rendering for headphones may be achieved when head-tracking is used to assist the rendering process. In particular, such a system may advantageously adjust the synthesized binaural signal such that the location of a sound source does not inappropriately turn along with the head of the listener, but rather stays fixed in space regardless of the rotational head movement of the listener. For example, one prominent application of this technique is in the rendering of “3/2 stereo” (such as Dolby 5.1®) over headphones. In such a case, the five individual loudspeaker signals are mixed down to a binaural signal accounting for the standardized positional angles of the loudspeakers. For example, the signal for the front-left speaker, positioned at 30 degrees to the left of the listener, may be advantageously convolved with the head-related impulse response corresponding to a sound arrival angle of 30 degrees.
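The following Python sketch illustrates, under simplifying assumptions, the two operations described above: convolution of a source signal with a pair of head-related impulse responses, and compensation of the source azimuth for head rotation. The two-tap impulse responses are toy placeholders rather than measured responses, and all function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def render_binaural(mono, hrir_left, hrir_right):
    """Return (left_ear, right_ear) signals for one source and one arrival angle."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

def head_relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Azimuth of a space-fixed source relative to the rotated head, in [-180, 180)."""
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# Toy placeholder responses: the far ear receives the sound one sample later
# and attenuated. A real system would select measured responses by azimuth.
left, right = render_binaural([1.0, 2.0], hrir_left=[1.0, 0.0], hrir_right=[0.0, 0.5])
print(left, right)
print(head_relative_azimuth(30.0, 30.0))
```

In a head-tracked renderer, the azimuth returned by `head_relative_azimuth` would be used to select the impulse response pair, so that the source stays fixed in space as the head turns.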


Unfortunately, such systems are limited to the compensation of horizontal head-rotation—other head movements, such as forward-backward and left-right movements, are not appropriately compensated for. In PC-based teleconferencing applications, for example, where the participant's distance to the video display (e.g., the monitor) is usually much closer than it is in typical movie playback systems, a sideward head movement may, for example, be as large as the size of the monitor itself. As such, a failure to compensate for such movements (among others) significantly impairs the ability of the system to maintain the correct directional arrival of the sound. Furthermore, the generation of binaural signals is commonly based on the assumption that the listener's position is fixed (except for his or her rotational head movement), and therefore cannot, for example, allow the listener to move physically around and experience the changes of arrival directions of sound sources—for example, such systems do not allow a listener to walk around a sound source. In other words, prior-art methods of binaural audio take movements of sound sources into account, as well as rotation of a listener's head, but they do not provide a method to take a listener's body movements into account.


More specifically, generating binaural signals is commonly based on sound arrival angles, whereby distance to the sound source is typically modeled by sound level, ratio of direct sound to reflected/reverberated sound, and frequency response changes. Such processing may be sufficient as long as either (a) the listener only moves his head (pitch, yaw, roll), but does not move his entire body to another location, or (b) the sound source is significantly distant from the listener such that lateral body movements are much smaller in size compared to the distance from the listener to the sound source. For example, when binaural room impulse responses are used to reproduce with headphones the listening experience of a loudspeaker set in a room at a particular listener position, some minimal lateral body movement of the listener will be acceptable, as long as such movement is substantially smaller than the distance to the reproduced sound source (which, for stereo, is typically farther away than the loudspeakers themselves). On the other hand, for a PC-based audiovisual telecommunication setup, for example, lateral movements of the listener can no longer be neglected, since they may be of a similar magnitude to the distance between the listener and the sound source.
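To make the magnitude of this effect concrete, the following Python sketch computes the change in arrival angle caused by a sideways movement of a listener who initially faces the source head-on. The function name and the distances are hypothetical values chosen for illustration.

```python
import math

def arrival_angle_shift_deg(lateral_move, source_distance):
    """Change in arrival angle (degrees) when a listener who initially faces
    the source head-on moves sideways by lateral_move (same unit as distance)."""
    return math.degrees(math.atan2(lateral_move, source_distance))

# Hypothetical values: a 0.3 m sideways movement in front of a desktop monitor
# (source 0.6 m away) versus in a living room (source 3.0 m away).
near = arrival_angle_shift_deg(0.3, 0.6)
far = arrival_angle_shift_deg(0.3, 3.0)
print(f"near source: {near:.1f} deg, distant source: {far:.1f} deg")
```

The same body movement produces a much larger angular error for the nearby source, which is why lateral movement cannot be neglected in a PC-based setup.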


Sound field synthesis techniques, on the other hand, include “Wavefield Synthesis” and “Ambisonics,” each of which is also familiar to those skilled in the art. Wavefield synthesis (WFS) is a 3D audio rendering technique which has the desirable property that a specific source location may be defined, expressed, for example, by both its depth behind or in front of the screen and its lateral position. When 3D video is presented with WFS, for example, the visual space and the auditory space match over a fairly wide area. However, when 2D video is presented with WFS rendered audio, the visual space and auditory space typically match only in a small area in and around the center position.


Ambisonics is another sound field synthesis technique. A first-order Ambisonics system, for example, represents the sound field at a location in space by the sound pressure and by a three dimensional velocity vector. In particular, sound recording is performed using four coincident microphones—an omnidirectional microphone for sound pressure, and three “figure-of-eight” microphones for the corresponding velocity in each of the x, y, and z directions. Recent studies have shown that higher order Ambisonics techniques are closely related to WFS techniques.
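As an illustrative sketch only, a single sample of a point source may be encoded into the four first-order signals (one omnidirectional, three figure-of-eight) as shown below in Python. The sketch assumes the traditional B-format convention in which the omnidirectional channel W is scaled by 1/√2; other normalization conventions exist, and the function name is hypothetical.

```python
import math

def encode_first_order(sample, azimuth, elevation=0.0):
    """Encode one sample into first-order signals (W, X, Y, Z).

    Angles are in radians. W is the omnidirectional (pressure) component;
    X, Y, Z are the figure-of-eight components along the three axes.
    Assumes the traditional B-format scaling of W by 1/sqrt(2).
    """
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return w, x, y, z

# A unit-amplitude sample arriving from straight ahead in the horizontal plane.
print(encode_first_order(1.0, azimuth=0.0))
```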



FIG. 4 shows an illustrative environment for providing true-to-life size audio-visual rendering of a sound source in a video teleconferencing application, in accordance with a first illustrative embodiment of the present invention. Specifically, the figure shows an illustrative scenario for true-to-life size audio-visual rendering of sound source location 41 (S), which may, for example, be from a person shown on video display 43 at screen position 45 who is currently speaking, where the sound source is to be properly located at position (xs, ys), and where observer 44 (i.e., listener V) is physically located at position (xv, yv). For simplicity, FIG. 4 only shows the horizontal plane. However, it will be obvious to those of ordinary skill in the art that the same principles as described herein may be easily applied to the vertical plane. Note also that video display 43 may be a 3D (three dimensional) display or it may be 2D (two dimensional) display.


The center of the coordinate system may be advantageously chosen to coincide with the center of true-to-life size video display 43. As is shown in the figure, sound source location 41 (S) is laterally displaced from the center of the screen by xs and the appropriate depth of the source is ys. Likewise, observer 44 (V) is laterally displaced from the center of the screen by xv and the distance of observer 44 (V) from the screen is yv. FIG. 4 further indicates that the observer's head position—that is, viewing direction 47—is turned to the right by angle α.


In accordance with the principles of the present invention, we can advantageously correctly render binaural sound for observer 44 (V) by advantageously determining the sound arrival angle:





γ=α+β,


where β can be advantageously determined as follows:






$$\beta = \arctan\frac{x_V - x_S}{y_V - y_S}.$$






Once γ has been advantageously determined, it will be obvious to those of ordinary skill in the art that, based on prior art binaural audio techniques, a proper binaural audio-visual rendering of sound source S may be performed in accordance with the first illustrative embodiment of the present invention. In a similar manner, a proper binaural audio-visual rendering of sound source location 42 (S*), which may, for example, be from a person shown on video display 43 at screen position 46 who is currently speaking, may also be performed in accordance with this illustrative embodiment of the present invention.
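The determination of the sound arrival angle γ = α + β for the true-to-life size case may be sketched in Python as follows. This is a minimal sketch: the function name and the numeric geometry are hypothetical, angles are in radians, and the sign conventions of the coordinates are an assumption (the observer is given a positive depth coordinate and a source behind the screen a negative one).

```python
import math

def arrival_angle(x_v, y_v, x_s, y_s, alpha):
    """Return gamma = alpha + beta, with beta = arctan((x_v - x_s) / (y_v - y_s)).

    (x_v, y_v) is the observer position and (x_s, y_s) the sound source position,
    both relative to the center of the screen; alpha is the head rotation.
    """
    depth_difference = y_v - y_s
    if depth_difference == 0:
        raise ValueError("observer and source share the same depth; beta is undefined")
    beta = math.atan((x_v - x_s) / depth_difference)
    return alpha + beta

# Hypothetical geometry in meters: observer 1 m in front of the screen center,
# source 1 m behind the screen and 0.5 m to one side, head turned by 10 degrees.
gamma = arrival_angle(x_v=0.0, y_v=1.0, x_s=0.5, y_s=-1.0, alpha=math.radians(10.0))
print(math.degrees(gamma))
```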


If the video display differs from true-to-life size, however, the use of angle γ as determined in accordance with the illustrative embodiment of FIG. 4 may result in inaccurate audio rendering. In particular, FIG. 5 shows the effect on the illustrative environment for providing audio-visual rendering of a sound source in a video teleconferencing application as shown in FIG. 4, when a smaller monitor/screen size is used, but where no adjustment is made to the audio rendering (as determined in accordance with the above description with reference to FIG. 4). Specifically, FIG. 5 shows sound source locations 51 (S) and 52 (S*), along with video display 53, which is illustratively smaller than a true-to-life size screen (such as, for example, the one illustratively shown in FIG. 4).


Note that only the auditory locations of sound source locations 51 (S) and 52 (S*) are shown in the figure, without their corresponding visual locations being shown. (The actual visual locations will, for example, differ for 2D and 3D displays.) Note also that the person creating sound at sound source location 52 (S*) will, in fact, visually appear to observer 54 on the screen of video display 53, even though, assuming that angle γ is used as described above with reference to FIG. 4, the sound source itself will arrive from outside the visual area. This will disadvantageously produce an apparent mismatch between the visual and auditory space. Similarly, the sound source S will also be mismatched from the corresponding visual representation of the speaker.



FIG. 6 shows an illustrative environment for providing audio-visual rendering of a sound source in a video teleconferencing application which provides screen-centered scaling for auditory space, in accordance with a second illustrative embodiment of the present invention. In particular, the illustrative embodiment of the present invention shown in FIG. 6 may advantageously be used with screen sizes that are not true-to-life size (e.g., smaller), as is typical. Again, the display may be a 3D (three dimensional) display or it may be 2D (two dimensional) display.


Specifically, in accordance with this illustrative embodiment of the present invention, the proper correspondence between the auditory rendering and the non-true-to-life visual rendering is addressed by advantageously scaling the spatial properties of the audio proportionally to the video. In particular, a scaling factor r is determined as follows:







$$r = \frac{W}{W_0},$$




where W0 denotes the screen width which would be required for true-to-life size visual rendering (e.g., the screen width of video display 43 shown in FIG. 4) and where W denotes the (actual) screen width of video display 63 as shown in FIG. 6. Given scaling factor r, the coordinates of the source location may be advantageously scaled to derive an equation for an angle γ̃ as follows:








$$\tilde{\gamma} = \alpha + \tilde{\beta},$$




where







$$\tilde{\beta} = \arctan\frac{x_V - r \cdot x_S}{y_V - r \cdot y_S}.$$






Specifically, FIG. 6 shows originally located sound source location 61 (S), which may, for example, be a person shown on video display 63 at screen position 67 who is currently speaking, where the sound source would, in accordance with the determination of angle γ as shown above in connection with FIG. 4, be improperly located at position (xs, ys). However, properly relocated sound source location 65 should (and will, in accordance with the illustrative embodiment of the invention shown in connection with this FIG. 6) be advantageously located at position (r·xs, r·ys) instead. Note that observer 64 (i.e., listener V) is physically located at position (xv, yv). Similarly, originally located sound source location 62 (S*), which may, for example, be a person shown on video display 63 at screen position 68 who is currently speaking, should (and will, in accordance with the illustrative embodiment of the invention shown in connection with this FIG. 6) be advantageously located at properly relocated sound source location 66. Again, for simplicity, FIG. 6 only shows the horizontal plane. However, it will be obvious to those of ordinary skill in the art that the same principles as described herein may be easily applied to the vertical plane. Note also that video display 63 may be a 3D (three dimensional) display or it may be a 2D (two dimensional) display.


The center of the coordinate system again may be advantageously chosen to coincide with the center of (reduced size) video display 63. As is shown in the figure, sound source location 61 (S) is laterally displaced from the center of the screen by xs and the depth of the source is ys. Likewise, observer 64 (V) is laterally displaced from the center of the screen by xv and the distance of observer 64 (V) from the screen is yv. FIG. 6 further indicates that the observer's head position (that is, viewing direction 69) is turned to the right by angle α.


Therefore, in accordance with the principles of the present invention, and further in accordance with the illustrative embodiment shown in FIG. 6, we can advantageously correctly render binaural sound for observer 64 (V) by advantageously determining the sound arrival angle γ̃ as determined above. In view of the geometrical interpretation of this illustrative scaling procedure, it has been referred to herein as screen-centered scaling. Note that as the size of a video display is changed, the video itself is always scaled in this same manner, both for 2D and 3D video display implementations.
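Screen-centered scaling may be sketched in Python as follows, applying the scaling factor r = W/W0 to the source coordinates before the arrival angle is determined. The function name and the example values are hypothetical; angles are in radians and the coordinate sign conventions are an assumption.

```python
import math

def screen_centered_angle(x_v, y_v, x_s, y_s, alpha, width, true_width):
    """Return the scaled arrival angle alpha + arctan((x_v - r*x_s) / (y_v - r*y_s)).

    width is the actual screen width W; true_width is the width W0 that a
    true-to-life size rendering would require, so that r = W / W0.
    """
    r = width / true_width
    depth_difference = y_v - r * y_s
    if depth_difference == 0:
        raise ValueError("observer and scaled source share the same depth")
    beta_scaled = math.atan((x_v - r * x_s) / depth_difference)
    return alpha + beta_scaled

# Hypothetical example: a screen half as wide as true-to-life size moves the
# source from (1.0, -1.0) to (0.5, -0.5) as heard by an observer at (0.0, 1.0).
angle = screen_centered_angle(0.0, 1.0, 1.0, -1.0, alpha=0.0, width=1.0, true_width=2.0)
print(math.degrees(angle))
```

With `width` equal to `true_width` the factor r is 1 and the result reduces to the unscaled angle of the first illustrative embodiment.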



FIG. 7 shows an illustrative environment for providing audio-visual rendering of a sound source in a video teleconferencing application which provides camera-lens-centered scaling, in accordance with a third illustrative embodiment of the present invention. In accordance with this illustrative embodiment of the present invention, which may be advantageously employed with use of a 2D video projection, we advantageously scale the sound source location such that it moves on a line between:


(a) originally located sound source 71 (S), which may, for example, be a person shown on video display screen 73 who is currently speaking and whose projection point 76 (Sp) is located on the display screen at position (xsp, 0), and


(b) the “effective” location of the video camera lens—location 72 (C)—which captured (or is currently capturing) the video being displayed on video display 73—specifically, C is shown in the figure as located at position (0, yc), even though it is, in fact, probably not actually located in the same physical place as viewer 74. In particular, the value yc represents the effective distance of the camera which captured (or is currently capturing) the video being displayed on the display screen.


Specifically, then, in accordance with this third illustrative embodiment of the present invention, we advantageously relocate the sound source to scaled sound source 75 (S′), which is to be advantageously located at position (xs′, ys′). To do so, we advantageously derive the value of angle β′ as follows:


First, we note that given the coordinate xsp of the projection point 76 (Sp), and based on the similar triangles in the figure, we find that








$$\frac{x_S}{y_S - y_C} = \frac{x_{SP}}{-y_C}$$





and, therefore, that







$$x_{SP} = \frac{y_C}{y_C - y_S} \cdot x_S.$$






Then, we can advantageously determine the coordinates (xs′, ys′) of the scaled sound source 75. For this purpose, we advantageously introduce a scaling factor 0 ≤ ρ ≤ 1 to determine how the sound source is to be advantageously scaled along the line spanned by the two points S (original sound location 71) and Sp (projection point 76). For ρ=1, for example, the originally located sound source 71 (S) would not be scaled at all; that is, scaled sound source 75 (S′) would coincide with originally located sound source 71 (S). For ρ=0, on the other hand, the originally located sound source 71 (S) would be scaled maximally; that is, scaled sound source 75 (S′) would coincide with projection point 76 (Sp). Given such a definition of the scaling factor ρ, we advantageously obtain:






$$x_S' = x_{SP} + \rho \cdot (x_S - x_{SP});$$

or

$$x_S' = x_{SP} \cdot (1 - \rho) + \rho \cdot x_S$$


and using the above derivation of xsp, we advantageously obtain:








x
S


=




y
C



y
C

-

y
S



·

x
S

·

(

1
-
ρ

)


+

ρ
·

x
S




;





or







x
s


=


(




y
C

·

(

1
-
ρ

)




y

C
.


-

y
S



+
ρ

)

·

x
S



;





or






$$x_S' = \left( \frac{y_C \cdot (1 - \rho) + \rho \, (y_C - y_S)}{y_C - y_S} \right) \cdot x_S = \left( \frac{y_C - \rho \cdot y_S}{y_C - y_S} \right) \cdot x_S$$








and






$$y_S' = \rho \cdot y_S.$$
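The projection and scaling steps derived above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `projection_x` and `scaled_source` are hypothetical, and the display plane is assumed to lie at y = 0, as implied by the similar-triangles relation.

```python
def projection_x(x_s, y_s, y_c):
    """x-coordinate of projection point Sp, where the line from the
    effective camera location C = (0, y_c) through S = (x_s, y_s)
    crosses the display plane (assumed to be y = 0)."""
    return y_c / (y_c - y_s) * x_s


def scaled_source(x_s, y_s, y_c, rho):
    """Coordinates (x_s', y_s') of scaled sound source S'.

    rho = 1 leaves the source at S; rho = 0 moves it onto Sp."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    x_sp = projection_x(x_s, y_s, y_c)
    return x_sp + rho * (x_s - x_sp), rho * y_s
```

For example, with S = (1, -2), y_c = 3 and ρ = 0.5, the projection point lies at x = 0.6 and the scaled source at (0.8, -1), which agrees with the closed form (y_C − ρ·y_S)/(y_C − y_S)·x_S.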






Using the coordinates (xs′, ys′) of scaled sound source 75 (S′), we can then advantageously determine the value of angle β′ as follows:








$$\beta' = \arctan\left( \frac{x_V - x_S'}{y_V - y_S'} \right);$$





or






$$\beta' = \arctan\left( \frac{x_V - \left( \dfrac{y_C - \rho \cdot y_S}{y_C - y_S} \right) \cdot x_S}{y_V - \rho \cdot y_S} \right)$$






Note that in response to a change in the display size, we may advantageously scale the coordinates of (xS, yS) and (xC, yC) in a similar manner to that described and shown in FIG. 6 above, thereby maintaining the location coordinates (xV, yV) of observer 74 (V). Note that, as shown in the figure, video display 73 is illustratively of true-to-life size W0 (as in FIG. 4 above). Specifically, then, using the scaling factor r (illustratively, r=1 in FIG. 7) as defined in connection with the description of FIG. 6 above,








$$\beta' = \arctan\left( \frac{x_V - \left( \dfrac{r \cdot y_C - \rho \cdot r \cdot y_S}{r \cdot y_C - r \cdot y_S} \right) \cdot r \cdot x_S}{y_V - \rho \cdot r \cdot y_S} \right);$$





or






$$\beta' = \arctan\left( \frac{x_V - \left( \dfrac{y_C - \rho \cdot y_S}{y_C - y_S} \right) \cdot r \cdot x_S}{y_V - \rho \cdot r \cdot y_S} \right)$$






Finally, taking into account the fact that the observer's head position (that is, viewing direction 75) is turned to the right by angle α, we can advantageously render accurate binaural sound for observer 74 (V) by determining the total sound arrival angle as the sum of α and β′, namely α+β′.
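As a hedged illustration, the angle computation for the scaled source, including the display scaling factor r and the head rotation α, might be coded as follows. The function names are hypothetical, angles are in radians, and `math.atan2` is used in place of a plain arctangent of the quotient; the two agree whenever the denominator y_V − ρ·r·y_S is positive, which is assumed here.

```python
import math


def beta_prime(x_v, y_v, x_s, y_s, y_c, rho, r=1.0):
    """Angle beta' from observer V to scaled sound source S'.

    Uses the closed form x_S' = (y_C - rho*y_S)/(y_C - y_S) * r * x_S
    and y_S' = rho * r * y_S (the factor r cancels inside the fraction)."""
    x_scaled = (y_c - rho * y_s) / (y_c - y_s) * r * x_s
    y_scaled = rho * r * y_s
    return math.atan2(x_v - x_scaled, y_v - y_scaled)


def arrival_angle(alpha, x_v, y_v, x_s, y_s, y_c, rho, r=1.0):
    """Total sound arrival angle alpha + beta' for a head turned by alpha."""
    return alpha + beta_prime(x_v, y_v, x_s, y_s, y_c, rho, r)
```

The resulting angle would then be passed to whatever binaural renderer is in use.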


Note that an appropriate scaling factor ρ may be advantageously derived from a desired maximum tolerable visual and auditory source angle mismatch. As shown in FIG. 7, the video shown on video display 73 has been (or is being) advantageously captured (elsewhere) with a video camera located at a relative position (0, yc) and a camera's angle of view v. The auditory and visual angles will advantageously match naturally only if viewer 74 is located (exactly) at position (0, yc). Any other location for viewer 74 will result in a mismatch of auditory and visual angle indicated by ε as shown in FIG. 7. Therefore, using the two triangles VSpVp and VSVs, we can advantageously derive the angle mismatch






$$\varepsilon = \delta_V - \delta_A = \arctan\left( \frac{x_V - x_{SP}}{y_V} \right) - \arctan\left( \frac{x_V - x_S}{y_V - y_S} \right)$$










From this equation, or directly from FIG. 7, it is apparent that the mismatch angle depends on three locations: (a) the source location, S, (b) the viewer location, V, and (c) the effective camera lens location, C (via xSP). To limit the angle mismatch ε in accordance with one embodiment of the present invention, these three locations may be constrained. However, in accordance with another illustrative embodiment of the present invention, the positions of these three locations that lead to the largest angle mismatch may be advantageously determined, and based on the determined largest angle mismatch, an appropriate scaling factor can be advantageously determined such that the resultant angle mismatch will always be within a pre-defined acceptable maximum, based on perception—illustratively, for example, 10 degrees. For example, the scaled source location may be derived as shown in FIG. 7 so as to result in an angle mismatch of ε′.
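One possible way to pick a scaling factor that respects a maximum tolerable mismatch is sketched below for a single configuration of S, V and C. The function names, the simple grid search, and the use of `math.atan2` are assumptions made for illustration; a fuller treatment would evaluate the worst-case positions described above. The residual mismatch ε′ is taken to be the visual angle δ_V minus the angle to the scaled source S′.

```python
import math


def mismatch(x_v, y_v, x_s, y_s, y_c, rho=1.0):
    """Angle mismatch (radians) between the visual direction to Sp and
    the auditory direction to the source scaled by rho (rho = 1: S)."""
    x_sp = y_c / (y_c - y_s) * x_s          # projection point on the display
    x_a = x_sp + rho * (x_s - x_sp)         # scaled auditory source, x
    y_a = rho * y_s                         # scaled auditory source, y
    delta_v = math.atan2(x_v - x_sp, y_v)
    delta_a = math.atan2(x_v - x_a, y_v - y_a)
    return delta_v - delta_a


def largest_rho_within(x_v, y_v, x_s, y_s, y_c,
                       max_mismatch_deg=10.0, steps=1000):
    """Largest rho on a uniform grid whose mismatch stays within the limit."""
    limit = math.radians(max_mismatch_deg)
    for i in range(steps, -1, -1):          # scan from rho = 1 down to 0
        rho = i / steps
        if abs(mismatch(x_v, y_v, x_s, y_s, y_c, rho)) <= limit:
            return rho
    return 0.0                              # rho = 0 always gives zero mismatch
```

When the viewer sits exactly at the effective camera location, the mismatch vanishes and the search returns ρ = 1, i.e., no scaling is needed.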


Note that from the camera's view angle (v as shown in FIG. 7) and the size of the display screen (illustratively shown in FIG. 7 to be true-to-life size, namely W0), the camera distance yc can be easily derived in accordance with one illustrative embodiment of the present invention. Source locations can be determined in a number of ways, in accordance with various illustrative embodiments of the present invention. For example, they may be advantageously derived from an analysis of the video itself, they may be advantageously derived from the audio signal data, or they may be advantageously generated spontaneously as desired. In addition, the source locations and/or the camera view angle may be advantageously transmitted to an illustrative system in accordance with various illustrative embodiments of the present invention as meta-data.
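The text does not spell out the derivation of the camera distance. Under the assumption of a simple pinhole geometry in which the horizontal angle of view v exactly spans a display of width W0, one plausible relation is yc = (W0/2)/tan(v/2). The sketch below encodes that assumption; the function name is hypothetical.

```python
import math


def camera_distance(display_width, view_angle_rad):
    """Effective camera distance y_c, assuming the camera's horizontal
    angle of view exactly spans the display width (pinhole model)."""
    if not 0.0 < view_angle_rad < math.pi:
        raise ValueError("view angle must lie strictly between 0 and pi")
    return (display_width / 2.0) / math.tan(view_angle_rad / 2.0)
```

For instance, a display 2 units wide captured with a 90 degree angle of view would place the effective camera 1 unit from the display plane.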



FIG. 8 shows an illustrative environment for providing binaural audio-visual rendering of a sound source in a video teleconferencing application using a video display screen and a dummy head in the subject conference room, in accordance with a fourth illustrative embodiment of the present invention. The figure shows two rooms—conference room 801 and remote room 802. Remote room 802 contains remote participant 812 who is viewing the activity (e.g., a set of conference participants) in conference room 801 using video display screen 811 and listening to the activity (e.g., one or more speaking conference participants) in conference room 801 using a headset comprising right speaker 809 and left speaker 810. The headset also advantageously comprises head tracker 808 for determining the positioning of the head of remote participant 812. (In accordance with alternative embodiments of the present invention, head tracker 808 may be independent of the headset, and may be connected to the person's head or alternatively may comprise an external device—i.e., one not connected to remote participant 812. Moreover, in accordance with other illustrative embodiments of the present invention, the headset containing speakers 809 and 810 may be replaced by a corresponding pair of loudspeakers positioned appropriately in remote room 802, in which case adaptive crosstalk cancellation may be advantageously employed to reduce or eliminate crosstalk between each of the loudspeakers and the non-corresponding ears of remote participant 812—see discussion of FIG. 10 below.)


Conference room 801 contains motor-driven dummy head 803, a motorized device which takes the place of a human head and moves in response to commands provided thereto. Such dummy heads are fully familiar to those skilled in the art. Dummy head 803 comprises right in-ear microphone 804, left in-ear microphone 805, right in-eye camera 806, and left in-eye camera 807. Microphones 804 and 805 advantageously capture the sound which is produced in conference room 801, and cameras 806 and 807 advantageously capture the video (which may be produced in stereo vision) from conference room 801—both based on the particular orientation (view angle) of dummy head 803.


In accordance with the principles of the present invention, and in accordance with the fourth illustrative embodiment thereof, the head movements of remote participant 812 are tracked with head tracker 808, and the resultant head movement data is transmitted by link 815 from remote room 802 to conference room 801. There, this head movement data is provided to dummy head 803 which properly mimics the head movements of remote participant 812 in accordance with an appropriate angle conversion function f(Δφ) as shown on link 815. (The function “f” will depend on the location of the dummy head in conference room 801, and will be easily ascertainable by one of ordinary skill in the art. Illustratively, the function “f” may simply be the identity function, i.e., f(Δφ)=Δφ, or it may simply scale the angle, i.e., f(Δφ)=qΔφ, where q is a fraction.) Moreover, the video captured in conference room 801 by cameras 806 and 807 is transmitted by link 813 back to remote room 802 for display on video display screen 811, and the binaural (L/R) audio captured by microphones 804 and 805 is transmitted by link 814 back to remote room 802 for use by speakers 809 and 810. Video display screen 811 may display the received video in either 2D or 3D. However, in accordance with the principles of the present invention, and in accordance with the fourth illustrative embodiment thereof, the binaural audio played by speakers 809 and 810 will be advantageously generated in accordance with the principles of the present invention based, inter alia, on the location of the human speaker on video display screen 811, as well as on the physical location of remote participant 812 in remote room 802 (i.e., on the location of remote participant 812 relative to video display screen 811).



FIG. 9 shows an illustrative environment for providing binaural audio-visual rendering of a sound source in a video teleconferencing application using a video display screen and a 360 degree or partial angle video camera in the subject conference room, in accordance with a fifth illustrative embodiment of the present invention. The figure shows two rooms: conference room 901 and remote room 902. Remote room 902 contains remote participant 912 who is viewing the activity (e.g., a set of conference participants) in conference room 901 using video display screen 911 and listening to the activity (e.g., one or more speaking conference participants) in conference room 901 using a headset comprising right speaker 909 and left speaker 910. The headset also advantageously comprises head tracker 908 for determining the positioning of the head of remote participant 912. (In accordance with alternative embodiments of the present invention, head tracker 908 may be independent of the headset, and may be connected to the person's head or alternatively may comprise an external device, i.e., one not connected to remote participant 912. Moreover, in accordance with other illustrative embodiments of the present invention, the headset containing speakers 909 and 910 may be replaced by a corresponding pair of loudspeakers positioned appropriately in remote room 902, in which case adaptive crosstalk cancellation may be advantageously employed to reduce or eliminate crosstalk between each of the loudspeakers and the non-corresponding ears of remote participant 912; see discussion of FIG. 10 below.)


Conference room 901 contains 360 degree camera 903 (or in accordance with other illustrative embodiments of the present invention, a partial angle video camera) which advantageously captures video representing at least a portion of the activity in conference room 901, as well as a plurality of microphones 904—preferably one for each conference participant distributed around conference room table 905—which advantageously capture the sound which is produced by conference participants in conference room 901.


In accordance with the principles of the present invention, and in accordance with one illustrative embodiment thereof as shown in FIG. 9, the head movements of remote participant 912 may be tracked with head tracker 908, and the resultant head movement data may be transmitted by link 915 from remote room 902 to conference room 901. There, this head movement data may be provided to camera 903 such that the captured video image (based, for example, on the angle that the camera lens is pointing) properly mimics the head movements of remote participant 912 in accordance with an appropriate angle conversion function f(Δφ) as shown on link 915. (The function “f” will depend on the physical characteristics of camera 903 and conference room table 905 in conference room 901, and will be easily ascertainable by one of ordinary skill in the art. Illustratively, the function “f” may simply be the identity function, i.e., f(Δφ)=Δφ, or it may simply scale the angle, i.e., f(Δφ)=qΔφ, where q is a fraction.)


In accordance with one illustrative embodiment of the present invention, camera 903 may be a full 360 degree camera and the entire 360 degree video may be advantageously transmitted via link 913 to remote room 902. In this case, the video displayed on the video screen may comprise video extracted from or based on the entire 360 degree video, as well as on the head movements of remote participant 912 (tracked with head tracker 908). In accordance with this illustrative embodiment of the present invention, transmission of the head movement data to conference room 901 across link 915 need not be performed. In accordance with another illustrative embodiment of the present invention, camera 903 may be either a full 360 degree camera or a partial view camera, and based on the head movement data received over link 915, a particular limited portion of video from conference room 901 is extracted and transmitted via link 913 to remote room 902. Note that the latter described illustrative embodiment of the present invention will advantageously enable a substantial reduction of the data rate employed in the transmission of the video across link 913.
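To illustrate how a limited portion of a 360 degree video might be selected from head movement data, the following sketch maps a head rotation angle to a window of pixel columns. It assumes a panoramic frame whose columns map linearly to azimuth, a fixed horizontal field of view for the extracted portion, and the simple scaling form f(Δφ)=qΔφ; none of these details, nor the function name, is prescribed by the text.

```python
def viewport_columns(head_angle_deg, frame_width, fov_deg=90.0, q=1.0):
    """Column indices of a panoramic frame to extract for a given head angle.

    The head angle is first converted by f(dphi) = q * dphi, then mapped
    linearly onto the frame width; the window wraps around the seam."""
    center_deg = (q * head_angle_deg) % 360.0
    center_col = center_deg / 360.0 * frame_width
    half = fov_deg / 360.0 * frame_width / 2.0
    start = int(round(center_col - half)) % frame_width
    count = int(round(2.0 * half))
    return [(start + i) % frame_width for i in range(count)]
```

Only the selected columns would then need to be encoded and sent across the video link, which is the source of the data rate reduction noted above.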


In accordance with either of these above-described illustrative embodiments of the present invention as shown in FIG. 9, the video captured in conference room 901 by camera 903 (or a portion thereof) is transmitted by link 913 back to remote room 902 for display on video display screen 911, and multi-channel audio captured by microphones 904 is transmitted by link 914 back to remote room 902 to be advantageously processed and rendered in accordance with the principles of the present invention for speakers 909 and 910. Video display screen 911 may display the received video in either 2D or 3D. However, in accordance with the principles of the present invention, and in accordance with the fifth illustrative embodiment thereof, the binaural audio played by speakers 909 and 910 will be advantageously generated in accordance with the principles of the present invention based, inter alia, on the location of the human speaker on video display screen 911, as well as on the physical location of remote participant 912 in remote room 902 (i.e., on the location of remote participant 912 relative to video display screen 911).



FIG. 10 shows a block diagram of an illustrative system for providing binaural audio-visual rendering of a sound source in a video teleconferencing application using head tracking and adaptive crosstalk cancellation, in accordance with a sixth illustrative embodiment of the present invention. The figure shows a plurality of audio channels being received by (optional) demultiplexer 1001, which is advantageously included in the illustrative system if the plurality of audio channels are provided as a (single) multiplexed signal, in which case demultiplexer 1001 generates a plurality of monaural audio signals (illustratively s1 through sn), which feed into binaural mixer 1005. (Otherwise, a plurality of multichannel audio signals feed directly into binaural mixer 1005.) Moreover, either a video input signal is received by (optional) sound source location detector 1002, which determines the appropriate locations in the corresponding video where given sound sources (e.g., the locations in the video of the various possible human speakers) are to be located, or, alternatively, such location information (i.e., of where in the corresponding video the given sound sources are located) is received directly (e.g., as meta-data). In either case, such sound source location information is advantageously provided to angle computation module 1006.


In addition, as shown in the figure, angle computation module 1006 advantageously receives viewer location data which provides information regarding the physical location of viewer 1012 (Dx, Dy, Dz)—in particular, with respect to the known location of the video display screen being viewed (which is not shown in the figure), as well as the tilt angle (Δφ), if any, of the viewer's head. In accordance with one illustrative embodiment of the present invention, the viewer's location may be fixed (i.e., the viewer does not move in relation to the display screen), in which case this fixed location information is provided to angle computation module 1006. In accordance with another illustrative embodiment of the present invention, the viewer's location may be determined with use of (optional) head tracking module 1007, which, as shown in the figure, is provided position information for the viewer with use of position sensor 1009. As pointed out above in the discussion of FIGS. 8 and 9, head tracking may be advantageously performed with use of a head tracker physically attached to the viewer's head (or to a set of headphones or other head-mounted device), or, it may be performed with an external device which uses any one of a number of possible techniques—many of which will be familiar to those skilled in the art—to locate the position of the viewer's head. Position sensor 1009 may be implemented in any of these possible ways, each of which will be fully familiar to those skilled in the art.


In any case, based on both the sound source location information and the viewer location information, as well as on knowledge of the screen size of the given video display screen being used, angle computation module 1006, using the principles of the present invention and in accordance with an illustrative embodiment thereof, advantageously generates the desired angle information (illustratively φ1 through φn) for each one of the corresponding plurality of monaural audio signals (illustratively, s1 through sn) and provides this desired angle information to binaural mixer 1005. Binaural mixer 1005 then generates a pair of stereo binaural audio signals, in accordance with the principles of the present invention and in accordance with an illustrative embodiment thereof, which will advantageously provide improved matching of auditory space to visual space. In accordance with one illustrative embodiment of the present invention, viewer 1012 uses headphones (not shown in the figure, since they represent a different illustrative embodiment of the present invention) which comprise a pair of speakers (a left ear speaker and a right ear speaker) to which these two stereo binaural audio signals are respectively and directly provided.


In accordance with the illustrative embodiment of the present invention as shown in FIG. 10, however, the two stereo binaural audio signals are provided to adaptive crosstalk cancellation module 1008, which generates a pair of loudspeaker audio signals for left loudspeaker 1010 and right loudspeaker 1011, respectively. These loudspeaker audio signals are advantageously generated by adaptive crosstalk cancellation module 1008 from the stereo binaural audio signals supplied by binaural mixer 1005 based upon the physical viewer location (as either known to be fixed or as determined by head tracking module 1007). Specifically, the generated loudspeaker audio signals will advantageously produce: (a) from left loudspeaker 1010, left ear direct sound 1013 (hLL), which has been advantageously modified by adaptive crosstalk cancellation module 1008 to reduce or eliminate right-speaker-to-left-ear crosstalk 1016 (hRL) generated by right loudspeaker 1011, and (b) from right loudspeaker 1011, right ear direct sound 1014 (hRR), which has been advantageously modified by adaptive crosstalk cancellation module 1008 to reduce or eliminate left-speaker-to-right-ear crosstalk 1015 (hLR) generated by left loudspeaker 1010. Such adaptive crosstalk cancellation techniques are conventional and fully familiar to those of ordinary skill in the art.
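One conventional formulation of crosstalk cancellation inverts the 2×2 matrix of acoustic transfer functions at each frequency. The sketch below handles a single frequency bin and assumes that complex values for hLL, hRL, hLR and hRR are already available (for example, looked up from the tracked viewer location); the function name is hypothetical, and the adaptive update of the transfer functions is outside the scope of this sketch.

```python
def crosstalk_cancel_bin(b_left, b_right, h_ll, h_rl, h_lr, h_rr):
    """Loudspeaker signals (y_left, y_right) for one frequency bin such that
    the signals arriving at the ears equal the binaural pair (b_left, b_right).

    Ear model:  e_left  = h_ll * y_left + h_rl * y_right
                e_right = h_lr * y_left + h_rr * y_right"""
    det = h_ll * h_rr - h_rl * h_lr
    if abs(det) < 1e-12:
        raise ValueError("transfer matrix is (nearly) singular at this bin")
    y_left = (h_rr * b_left - h_rl * b_right) / det
    y_right = (h_ll * b_right - h_lr * b_left) / det
    return y_left, y_right
```

In practice the inversion is usually regularized, since the matrix can be poorly conditioned at some frequencies.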


Addendum to the Detailed Description

The preceding merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.


Thus, for example, it will be appreciated by those skilled in the art that the block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.


A person of ordinary skill in the art would readily recognize that steps of various above-described methods can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions, wherein said instructions perform some or all of the steps of said above-described methods. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of the above-described methods.


The functions of any elements shown in the figures, including functional blocks labeled as “processors”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.


In the claims hereof any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements which performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The invention as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. Applicant thus regards any means which can provide those functionalities as equivalent to those shown herein.

Claims
  • 1. A method for generating a spatial rendering of an audio sound to an observer using a plurality of speakers, the audio sound related to a video being displayed to said observer on a video screen having a given physical location, the method comprising: receiving a video input signal for use in displaying said video to said observer on said video screen; receiving one or more audio input signals related to said video input signal, the one or more audio input signals including said audio sound; determining a desired physical location relative to said video screen for spatially rendering said audio sound, the desired physical location being determined based on a position on the video screen at which a particular portion of said video corresponding to the audio sound is being displayed; determining a current physical location of the observer relative to said video screen; and generating a plurality of audio output signals based on said determined desired physical location for spatially rendering said audio sound and further based on said determined current physical location of the observer relative to said video screen, said plurality of audio signals being generated such that when delivered to said observer using said plurality of speakers, the observer hears said audio sound as being rendered from said determined desired physical location for spatially rendering said audio sound.
  • 2. The method of claim 1 wherein said observer is a participant in a video teleconference, wherein the observer and the video screen are located in a remote room, and wherein the video input signal and the one or more audio input signals are received in said remote room from a separate conference room having one or more other video teleconference participants located therein.
  • 3. The method of claim 1 wherein said observer is a viewer of a previously recorded video with an associated audio soundtrack and wherein the video input signal comprises the video portion of said prerecorded video and wherein the one or more audio input signals comprise the audio soundtrack thereof.
  • 4. The method of claim 1 wherein the plurality of speakers comprises a headphone set worn by the observer, wherein the headphone set comprises at least a left speaker for providing sound to a left ear of the observer and a right speaker for providing sound to a right ear of the observer, and wherein said generating the plurality of audio output signals comprises generating binaural audio signals comprising at least a left audio output signal which is used to drive the left speaker and a right audio output signal which is used to drive the right speaker.
  • 5. The method of claim 1 wherein the plurality of speakers comprises a plurality of loudspeakers placed in predetermined physical locations relative to the given physical location of the video screen, wherein the plurality of loudspeakers includes at least a left loudspeaker whose predetermined physical location comprises a position left of the video screen and a right loudspeaker whose predetermined physical location comprises a position right of the video screen, and wherein said generating the plurality of audio output signals comprises generating binaural audio signals comprising at least a left audio output signal which is used to drive the left loudspeaker and a right audio output signal which is used to drive the right loudspeaker.
  • 6. The method of claim 5 wherein the left audio output signal has been adapted to reduce crosstalk from the right loudspeaker to a left ear of the observer, and wherein the right audio output signal has been adapted to reduce crosstalk from the left loudspeaker to a right ear of the observer.
  • 7. The method of claim 1 wherein said generating said plurality of audio output signals is further based on an effective location of a video camera lens relative to said video screen, wherein said effective location of a video camera lens relative to said video screen has been determined based on a location of a camera lens which has captured said video, relative to said captured video.
  • 8. The method of claim 1 wherein the determining said current physical location of the observer relative to said video screen further determines a physical orientation of the observer relative to said video screen, wherein said physical orientation comprises an angle of right-to-left orientation relative to said video screen, and wherein said generating said plurality of audio output signals is further based on said determined angle of right-to-left orientation relative to said video screen.
  • 9. The method of claim 1 wherein said current physical location of the observer relative to said video screen is determined with use of a head position tracker.
  • 10. The method of claim 9 wherein said head position tracker is physically attached to said observer.
  • 11. The method of claim 1 wherein said position on the video screen at which the particular portion of said video corresponding to the audio sound is being displayed is determined based on an analysis of said video input signal.
  • 12. The method of claim 1 further comprising receiving meta-data which specifies said position on the video screen at which the particular portion of said video corresponding to the audio sound is being displayed.
  • 13. The method of claim 1 wherein the plurality of audio output signals are generated with use of a sound field synthesis technique.
  • 14. An apparatus for generating a spatial rendering of an audio sound to an observer, the apparatus comprising: a plurality of speakers; a video screen having a given physical location, the video screen for displaying a video to the observer, the audio sound related to the video being displayed to the observer; a video input signal receiver which receives a video input signal used to display the video to said observer on said video screen; an audio input signal receiver which receives one or more audio input signals related to said video input signal, the one or more audio input signals including said audio sound; a processor which (a) determines a desired physical location relative to said video screen for spatially rendering said audio sound, the desired physical location being determined based on a position on the video screen at which a particular portion of said video corresponding to the audio sound is being displayed, and (b) determines a current physical location of the observer relative to said video screen; and an audio output signal generator which generates a plurality of audio output signals based on said determined desired physical location for spatially rendering said audio sound and further based on said determined current physical location of the observer relative to said video screen, said plurality of audio signals being generated such that when delivered to said observer using said plurality of speakers, the observer hears said audio sound as being rendered from said determined desired physical location for spatially rendering said audio sound.
  • 15. The apparatus of claim 14 wherein said apparatus comprises a portion of a video teleconferencing system and wherein said observer is a participant in a video teleconference using said video teleconferencing system, wherein the apparatus and the observer are located in a remote room, and wherein the video input signal and the one or more audio input signals are received in said remote room from a separate conference room having one or more other video teleconference participants located therein.
  • 16. The apparatus of claim 14 wherein said observer is a viewer of a previously recorded video with an associated audio soundtrack and wherein the video input signal comprises the video portion of said prerecorded video and wherein the one or more audio input signals comprise the audio soundtrack thereof.
  • 17. The apparatus of claim 14 wherein the plurality of speakers comprises a headphone set worn by the observer, wherein the headphone set comprises at least a left speaker for providing sound to a left ear of the observer and a right speaker for providing sound to a right ear of the observer, and wherein said audio output signal generator generates binaural audio signals comprising at least a left audio output signal which is used to drive the left speaker and a right audio output signal which is used to drive the right speaker.
  • 18. The apparatus of claim 14 wherein the plurality of speakers comprises a plurality of loudspeakers placed in predetermined physical locations relative to the given physical location of the video screen, wherein the plurality of loudspeakers includes at least a left loudspeaker whose predetermined physical location comprises a position left of the video screen and a right loudspeaker whose predetermined physical location comprises a position right of the video screen, and wherein said audio output signal generator generates binaural audio signals comprising at least a left audio output signal which is used to drive the left loudspeaker and a right audio output signal which is used to drive the right loudspeaker.
  • 19. The apparatus of claim 18 wherein said audio output signal generator adapts the left audio output signal to reduce crosstalk from the right loudspeaker to a left ear of the observer, and adapts the right audio output signal to reduce crosstalk from the left loudspeaker to a right ear of the observer.
  • 20. The apparatus of claim 14 wherein said audio output signal generator generates the plurality of audio output signals further based on an effective location of a video camera lens relative to said video screen, wherein said effective location of a video camera lens relative to said video screen has been determined based on a location of a camera lens which has captured said video, relative to said captured video.
  • 21. The apparatus of claim 14 wherein the processor further (c) determines a physical orientation of the observer relative to said video screen, wherein said physical orientation comprises an angle of right-to-left orientation relative to said video screen, and wherein said audio output signal generator generates said plurality of audio output signals further based on said determined angle of right-to-left orientation relative to said video screen.
  • 22. The apparatus of claim 14 further comprising a head position tracker, and wherein said processor determines the current physical location of the observer relative to said video screen with use of said head position tracker.
  • 23. The apparatus of claim 22 wherein said head position tracker is physically attached to said observer.
  • 24. The apparatus of claim 14 wherein said processor determines the position on the video screen at which the particular portion of said video corresponding to the audio sound is being displayed based on an analysis of said video input signal.
  • 25. The apparatus of claim 14 further comprising a meta-data receiver which receives meta-data specifying said position on the video screen at which the particular portion of said video corresponding to the audio sound is being displayed.
  • 26. The apparatus of claim 14 wherein said audio output signal generator generates said plurality of audio output signals with use of a sound field synthesis technique.
CROSS-REFERENCE TO RELATED APPLICATION

The present application is related to co-pending U.S. patent application Ser. No. ______, “Method And Apparatus For Improved Matching Of Auditory Space To Visual Space In Video Teleconferencing Applications Using Window-Based Displays,” filed by W. Etter on even date herewith and commonly assigned to the assignee of the present invention.