Video conferencing is an established method of simulated face-to-face collaboration between participants located at one or more remote environments and participants located at a local environment. Typically, one or more cameras, one or more microphones, one or more video displays, and one or more speakers are located at the remote environments and the local environment. This allows participants at the local environment to see, hear, and talk to the participants at the remote environments. For example, video images from the remote environments are broadcast on the one or more video displays at the local environment, and the accompanying audio signals (sometimes referred to as audio images) are broadcast over the one or more speakers (sometimes referred to as an audio display) at the local environment.
One of the objectives of video conferencing is to create a quality telepresence experience, where the participants at the local environment feel as though they are actually present at a remote environment and are interacting with participants at the remote environments. However, one of the problems in creating a quality telepresence experience is a directionality mismatch between the audio and video images. That is, the sound of a participant's voice may appear to come from a location that is different from where that participant's image appears on the video display. For example, the participant who is speaking may appear at the left of the video display, but the sound may appear to come from the right of the video display.
In the following detailed description of the present embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments that may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosed subject matter, and it is to be understood that other embodiments may be utilized and that process, electrical, or mechanical changes may be made without departing from the scope of the claimed subject matter. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the claimed subject matter is defined only by the appended claims and equivalents thereof.
Audio/video system 100 receives an encoded combined audio/video signal A/V from an audio/video source, such as an audio/video system of one or more remote video conference rooms, e.g., over a network. Encoded combined audio/video signal A/V may be received at a signal divider 105, such as a transport processor, that extracts an encoded audio signal A and an encoded video signal V from audio/video signal A/V.
Encoded video signal V and encoded audio signal A are respectively decoded at a video signal decoder 110 and an audio signal decoder 115. The decoded video signal is sent to a video processor 125 that in turn sends a processed video signal, for one embodiment, to a projector, e.g., as part of a front or rear projection system, that projects images contained in the video signal onto a video display 130, such as a passive display or an active display with electronics, either from the front or the rear. For another embodiment, video display 130 may be a projectionless display, such as a liquid crystal display or a plasma display, in which case the video signals are sent directly from video processor 125 to video display 130.
The decoded audio signal is sent to an audio processor 135 that in turn sends a processed audio signal to one or more speakers 140. A controller 145 sends signals (e.g., referred to as commands or instructions) to the audio and video decoders and the audio and video processors for controlling the audio and video decoders and the audio and video processors. For example, video processor 125 may send video signals to video display 130 in response to a command from controller 145 and audio processor 135 may send audio signals to speakers 140 in response to another command from controller 145.
For one embodiment, controller 145 includes processor 150 for processing computer/processor-readable instructions. These computer-readable instructions are stored in a memory 155, such as a computer-usable media, and may be in the form of software, firmware, or hardware. In a hardware solution, the instructions are hard coded as part of processor 150, e.g., an application-specific integrated circuit (ASIC) chip. In a software or firmware solution, the instructions are stored for retrieval by the processor 150. Some additional examples of computer-usable media include static or dynamic random access memory (SRAM or DRAM), read-only memory (ROM), electrically-erasable programmable ROM (EEPROM or flash memory), magnetic media and optical media, whether permanent or removable. Most consumer-oriented computer applications are software solutions provided to the user on some removable computer-usable media, such as a compact disc read-only memory (CD-ROM). The computer-readable instructions cause controller 145 to perform various methods, such as controlling the audio and video decoders and the audio and video processors. For example, computer-readable instructions may cause controller 145 to send commands to audio processor 135 to apply certain gains and timing (e.g., time delays) to the audio signals received at audio processor 135 so that audio processor 135 can correlate the sound from the speakers to a portion of video display 130 from which the sound appears to be originating, as discussed below.
The images displayed on video display 130 may be received from one or more remote video conference rooms, e.g., as described above in conjunction with
For one embodiment, the video configurations are predetermined for each video-conference-room configuration. For example, it may be predetermined that video contained in respective ones of video signals V1-VN be displayed on respective ones of predetermined video monitors of a display having multiple video monitors. For example, for a display 130 with three video monitors 210, as shown in
For embodiments where a single video monitor is used, it is predetermined that video contained in respective ones of video signals V1-VN be displayed on respective ones of predetermined portions of the single video monitor. For example, it may be predetermined that the video contained in decoded video signal V1 be displayed in a left portion of the single monitor, the video contained in decoded video signal V2 be displayed in a center portion of the single monitor, and the video contained in decoded video signal VN be displayed in a right portion of the single monitor.
For embodiments where video monitors are part of a projection system, decoded video signals V1, V2, and VN are received at one or more projectors from video processor 125, and the images from decoded video signals V1, V2, and VN are respectively projected onto the respective video monitors 2101, 2102, and 2103 or are respectively projected onto a left portion, a center portion, and a right portion of a single video monitor. For embodiments where video monitors 2101, 2102, and 2103 are projectionless video monitors, decoded video signals V1, V2, and VN are respectively sent directly to video monitors 2101, 2102, and 2103 from video processor 125. For a single projectionless video monitor, for example, decoded video signals V1, V2, and VN may be respectively sent directly to a left portion, a center portion, and a right portion of that monitor.
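The predetermined routing of decoded video signals to display regions described above can be sketched as follows (an illustration only; the stream identifiers and region names are hypothetical, not taken from this description):

```python
# Hypothetical, fixed routing of decoded video streams to predetermined
# portions of a single video monitor, as in the three-portion example above.
REGION_LAYOUT = {
    "V1": "left",
    "V2": "center",
    "VN": "right",
}

def route_video(decoded_stream_ids):
    """Pair each decoded video stream with its predetermined display portion."""
    return [(stream_id, REGION_LAYOUT[stream_id]) for stream_id in decoded_stream_ids]

routes = route_video(["V1", "V2", "VN"])
```

Because the layout is predetermined per room configuration, the routing reduces to a static mapping like this rather than any runtime decision.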
For one embodiment, video contained in the video signals V1-VN is adjusted so that the objects, such as a table 220 and participants 230 appear continuous across the boundaries of video monitors 210. For other embodiments, cameras at the originating remote video conference rooms may be adjusted so that the objects appear continuous across the boundaries of video monitors 210.
For one embodiment, a speaker 140 may be located on either side of video display 130. For another embodiment, a speaker may be located below one or more of the video monitors 210 in lieu of or in addition to speakers 140. Speakers may also be located on the ceiling and/or the floor of the video conferencing room. During operation, as video images are displayed on video monitors 210, audio signals (e.g., sometimes referred to as audio images) corresponding to the video images are sent to speakers 140.
For one embodiment, encoded video signals V1-VN respectively correspond to encoded audio signals A1-AN. That is, the audio contained in respective ones of audio signals A1-AN corresponds to the video contained in respective ones of video signals V1-VN. For one embodiment, encoded audio signals A1-AN (
Alternatively, encoded audio signals A1-AN may be respectively received at audio signal decoder 115 from different remote video conference rooms, and the respective corresponding encoded video signals V1-VN may be respectively received at video signal decoder 110 from those conference rooms. For example, encoded audio signal A1 may be received from one or more microphones in a first video conference room, and the corresponding encoded video signal V1 may be received from one or more cameras in the first video conference room. Similarly, encoded audio signal A2 may be received from one or more microphones in a second video conference room, and the corresponding encoded video signal V2 may be received from one or more cameras in the second video conference room. Likewise, encoded audio signal AN may be received from one or more microphones in an Nth video conference room, and the corresponding encoded video signal VN may be received from one or more cameras in the Nth video conference room.
Audio signal decoder 115 sends decoded audio signals 3101 to 310N to each of output channels 1-M of audio processor 135, as shown in
Channels 1-M respectively output audio signals 3401-340M to speakers 1401-140M. For example, at each of channels 1-M, audio processor 135 applies a gain and/or timing to the signals 310 received at that channel, e.g., in response to commands from controller 145. Then, the audio signals, with the respective gains and/or timing applied thereto at channels 1-M, are respectively output as audio signals 3401-340M. For one embodiment, the timing may involve delaying one or more of audio signals 3401-340M with respect to others.
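The per-channel mixing just described can be sketched as follows (an illustration only, with delays expressed as whole-sample offsets; the description above does not specify the signal processing at this level of detail):

```python
def process_channel(input_signals, gains, delays_samples):
    """Mix the input signals 310 received at one output channel into a single
    output signal 340, applying a per-source gain and sample delay to each
    input before summing."""
    length = max(len(sig) + d for sig, d in zip(input_signals, delays_samples))
    out = [0.0] * length
    for sig, gain, delay in zip(input_signals, gains, delays_samples):
        for i, sample in enumerate(sig):
            out[i + delay] += gain * sample
    return out
```

For example, a single two-sample input processed with gain 0.5 and a one-sample delay yields [0.0, 0.5, 0.5]; with multiple remote locations active, each channel sums its gain- and delay-adjusted inputs into one speaker feed.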
For another embodiment, when it is determined that the sound corresponding to an audio signal appears to be originating from a certain portion of video display 130, such as video monitor 2101 when participant 2301 is speaking (
For one embodiment, the portion of video display 130 from which the sound appears to be originating is predetermined in that the predetermined portion of video display 130 on which the image, such as participant 2301, that is producing the sound defines and corresponds to the portion of video display 130 from which the sound appears to be originating. The distance from each speaker 140 to different portions of the video display 130 is also predetermined, for some embodiments, so that the distance between each speaker 140 and each portion of video display 130 from which the sound appears to be originating is predetermined. Therefore, the audio signal corresponding to the video signal that contains the image producing the sound can be adjusted, as just described, based on the predetermined distances between the predetermined portion of the video display 130 from which the sound appears to be originating and the speakers 140.
For the example of
In order for the sound coming from the speakers to appear as though it is originating from participant 2301, the location 1 gain applied to audio signal 3101 at channel 1, e.g., in response to a command from controller 145, may be greater than the location 1 gain applied to audio signal 3101 at channel M, e.g., in response to a command from controller 145. That is, a higher gain is applied to the audio signal 3101 destined for speaker 1401, which is closer to the apparent sound origin on the video display, such as participant 2301, than to the audio signal 3101 destined for speaker 140M, which is further from the apparent sound origin on the video display. For example, the sound pressure level of the audio signal 3401 resulting from the gain applied to audio signal 3101 destined for speaker 1401 is greater than the sound pressure level of the audio signal 340M resulting from the gain applied to audio signal 3101 destined for speaker 140M.
For other embodiments involving additional speakers, the gain may be applied to the audio signals 310, e.g., in response to a command from controller 145, according to the distance from the apparent sound origin on the video display, such as participant 2301, to the speakers 140 for which those audio signals 310 are destined. For example, the gain may decrease as the distance from participant 2301 to a speaker increases. For example, if speaker 1402 is closer to participant 2301 than speaker 140M and further away from participant 2301 than speaker 1401, the gain applied at channel 2 to audio signal 3101 destined for speaker 1402 might be less than the gain applied to the audio signal 3101 destined for speaker 1401 and greater than the gain applied to the audio signal 3101 destined for speaker 140M, such that the sound pressure level of audio signal 3402 is greater than the sound pressure level of audio signal 340M and less than the sound pressure level of audio signal 3401.
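One plausible realization of this distance-dependent gain is sketched below. This is an assumption for illustration: the description requires only that the gain decrease monotonically with distance, and the inverse-distance law used here is one common choice, not the specified one.

```python
def gain_for_speaker(distance_m, reference_distance_m=1.0):
    """Gain that decreases as the distance from the apparent sound origin
    on the video display to the speaker increases. An inverse-distance
    roll-off is assumed; any monotonically decreasing law would satisfy
    the behavior described in the text."""
    return reference_distance_m / max(distance_m, reference_distance_m)
```

With speakers 1401, 1402, and 140M at 1 m, 2 m, and 4 m from the apparent sound origin, the gains would be 1.0, 0.5, and 0.25, ordering the sound pressure levels exactly as described above.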
Continuing with the example illustrated in
For other embodiments involving additional speakers, the delay, e.g., in response to a command from controller 145, may be applied to the audio signals 310 according to the distance from the apparent sound origin on the video display, such as participant 2301, to the speakers 140 for which those audio signals 310 are destined. For example, the delay may decrease as the distance from participant 2301 to a speaker decreases or vice versa, starting with a zero delay, for example, applied to the signal destined for the speaker closest to the apparent sound origin on the video display. For example, if speaker 1402 is closer to participant 2301 than speaker 140M and further away from participant 2301 than speaker 1401, the delay applied at channel 2 to audio signal 3101 destined for speaker 1402 might be less than the delay applied to the audio signal 3101 destined for speaker 140M and greater than the delay (e.g., a zero delay) applied to the audio signal 3101 destined for speaker 1401.
For one embodiment, the delay may be on the order of the time delay resulting from the difference in path lengths between the speakers and a certain location within the video conference room in which the speakers are located, such as the location of a table in the video conference room at which participants may be positioned. For example, the delay applied to audio signal 3101 destined for speaker 140M might be on the order of the delay due to the difference in path lengths between speakers 1401 and 140M and the certain location. For another embodiment, the delay may be, for example, substantially equal to or greater than the delay due to the difference in path lengths between the speakers and the certain location.
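The path-length-based delay can be sketched as follows (an illustration under assumed names; the delay is taken as the extra acoustic path length, relative to the closest speaker, divided by the speed of sound, which matches the "difference in path lengths" embodiment above):

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value at room temperature

def delay_for_speaker(path_length_m, closest_path_length_m):
    """Delay for a speaker's signal, proportional to the difference between
    its path length to the listening location and that of the speaker
    closest to the apparent sound origin; the closest speaker receives
    zero delay."""
    extra_path_m = max(path_length_m - closest_path_length_m, 0.0)
    return extra_path_m / SPEED_OF_SOUND_M_PER_S
```

A speaker whose path to the listening location is 0.343 m longer than the closest speaker's would thus have its signal delayed by about 1 ms.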
For the example illustrated in
For other embodiments involving additional speakers, both a delay and a gain may be applied to the audio signals 310, e.g., in response to a command from controller 145, according to the distance from the apparent sound origin on the video display, such as participant 2301, to the speakers 140 for which those audio signals 310 are destined. For example, if speaker 1402 is closer to participant 2301 than speaker 140M and further away from participant 2301 than speaker 1401, the audio signal 3402 received at speaker 1402 has a lower gain and sound pressure level than the audio signal 3401 received at speaker 1401 and is delayed with respect to the audio signal 3401 received at speaker 1401, and the audio signal 340M received at speaker 140M has a lower gain and sound pressure level than the audio signal 3402 received at speaker 1402 and is delayed with respect to the audio signal 3402 received at speaker 1402.
Although the above examples were directed to audio signals 3101 from remote location 1, it will be appreciated that similar examples may be provided for each of the remaining audio signals 310 for the remaining remote locations. For example, participant 2302 may be at remote location N. For an example where participant 2302 is speaking and participant 2301 is not, the audio signal 310N, corresponding to the video signal that produces the image of participant 2302 on video display 130, destined for speaker 1401, which is further away from participant 2302 than speaker 140M, may have a lower gain applied thereto at channel 1 than the gain applied to the audio signal 310N destined for speaker 140M at channel M, and/or the audio signal 310N destined for speaker 1401 may be delayed with respect to the audio signal 310N destined for speaker 140M. Therefore, the audio signal 3401 output from channel 1 and received at speaker 1401 will have a lower sound pressure level than the audio signal 340M output from channel M and received at speaker 140M, and/or the audio signal 3401 will be delayed with respect to audio signal 340M. As a result, the sound appears to be coming from speaker 140M, which is closest to participant 2302, who is speaking.
For one embodiment, audio signal gains and/or delays may be determined for each speaker for different types of video conferencing systems (e.g., different video displays, different speaker setups, etc.) and different types of video conference rooms (e.g., different distances between the video displays and participant seating locations, different distances between the speakers and participant seating locations, different numbers of participants, different distances between the speakers and various locations of the video display, etc.). For example, numerical values corresponding to different audio signal gains and/or time delays may be stored in memory 155 of controller 145, e.g., in a look-up table 160, as shown in
For another embodiment, a numerical value representative of the distance from each speaker to different locations on the video display may be stored in memory 155, such as in look-up table 160, for a plurality of video conference rooms. In addition, the predetermined locations on the video display at which the video from the video signals is displayed, and thus the predetermined locations of the apparent sound origins, may also be stored in memory 155, such as in look-up table 160, for a plurality of video conference room configurations. Therefore, controller 145 can enter look-up table 160 with a given room configuration and cause the video contained in each video signal to be displayed at the predetermined locations on the video display. In addition, controller 145 can enter look-up table 160 with a predetermined location of the apparent sound origin on the video display and extract the numerical value representative of the distance from each speaker to the apparent sound origin on the video display for the given room, and subsequently instruct audio processor 135 to adjust the gains and delays for each speaker according to the determined distances.
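A sketch of how such a look-up table might be organized follows. The contents, key names, and distances are hypothetical; only the structure, keyed by room configuration, speaker, and predetermined display location, tracks the description above.

```python
# Hypothetical contents of a look-up table like look-up table 160: for each
# room configuration, the distance (in meters) from each speaker to each
# predetermined apparent-sound-origin location on the video display.
LOOKUP_TABLE = {
    "room_A": {
        "speaker_1": {"left": 0.5, "center": 1.5, "right": 2.5},
        "speaker_M": {"left": 2.5, "center": 1.5, "right": 0.5},
    },
}

def distances_to_origin(room, origin_location):
    """Return each speaker's distance to the apparent sound origin, from
    which the controller would derive per-channel gains and delays."""
    return {speaker: distances[origin_location]
            for speaker, distances in LOOKUP_TABLE[room].items()}

left_distances = distances_to_origin("room_A", "left")
```

Entering the table with a room and an apparent-origin location yields one distance per speaker, which the controller can then convert to gain and delay commands for the audio processor.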
Although specific embodiments have been illustrated and described herein, it is manifestly intended that the scope of the claimed subject matter be limited only by the following claims and equivalents thereof.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US08/72982 | 8/13/2008 | WO | 00 | 2/10/2011 |