A spatial audio teleconference between two or more geographically distant sites is typically achieved by processing audio signals captured with microphones at one site to produce spatial audio data. This spatial audio data is then transmitted to the other sites and processed at each of these sites to generate a plurality of output audio signals that are played through multiple audio speakers in a manner that spatializes the sound from a sending site to a distinct location in the receiving site. This process is repeated at all the sites, so that to a participant at a receiving site the voices of participants at the other sites seem to emanate from different locations in the receiving site. This spatializing of the voices of the other participants in the receiving site is typically accomplished using only the spatial audio data received from the other sites.
This Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Audio source positioning technique embodiments described herein are generally employed in a video teleconference or telepresence session between a local site and one or more remote sites. In one embodiment, each of these sites has one participant, and a virtual scene is constructed and displayed at each site that depicts each of the participants from the other sites. However, rather than simply playing audio captured at the other site or sites in the viewing participant's site, audio source positioning technique embodiments described herein are used to make it seem to a participant viewing a rendering of the virtual scene that the voice of each depicted participant is emanating from a location on the display device where that participant is depicted.
In general, this audio source positioning is accomplished at a site (referred to as the local site for convenience) by transmitting data to the other site or sites (referred to as remote sites for convenience), which is then used at those sites to construct the aforementioned virtual scene with spatialized audio. In addition, similar data is received from the other site or sites and used to construct a virtual scene with spatialized audio at the local site.
More particularly, in one general embodiment, streams of sensor data generated from an arrangement of sensors that capture participant data are input into a computing device or devices resident at the local site. This arrangement of sensors includes a plurality of video and audio capture devices. Each video capture device captures the participant from a different geometric perspective, and each audio capture device captures the voice of the participant. Scene proxies, which geometrically describe the local site including the participant on a frame-by-frame basis, are generated from the streams of sensor data. In addition, the streams of video sensor data and a face tracking technique are employed to identify a 3D point representing the location of the participant in the local site for each frame of the scene proxies. The scene proxies representing each frame are transmitted in the order generated over a data communication network to each remote site, along with two additional items. Namely, audio data representing the local site participant's voice captured, if any, during the time period between the frame currently being transmitted and the next frame of scene proxies to be transmitted, and the 3D point coordinates representing the location of the participant in the local site for the frame currently being transmitted.
Meanwhile, the local site's computing device or devices receive scene proxies representing successive scene proxy frames from each remote site. In addition, audio data representing the remote site participant's voice captured, if any, during the time period between the currently received frame and the next frame of scene proxies to be received from the remote site, and a 3D point representing the location of the participant in the remote site, are received from each remote site that is facilitating audio source positioning at the local site. For each frame of scene proxies received from a remote site if there is only one remote site sending frames, or for each group of frames of scene proxies contemporaneously received from remote sites if there are multiple remote sites sending frames, a frame of a virtual scene that includes a depiction of each of the remote site participants is rendered from the last-received frame or frames of scene proxies. The rendered frame is then displayed to the local site participant via a display device. In addition, for each remote site participant depicted in the last-rendered frame of the virtual scene who is resident at a remote site that sent the aforementioned audio data representing the remote site participant's voice and the 3D point representing the location of the participant in the remote site, a spatial audio technique is employed to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted.
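The overall flow at a receiving site can be pictured with the following minimal sketch. It is purely illustrative: every helper it calls (receive_frame_group, render_virtual_scene, display, spatialize_voice) is a hypothetical placeholder, and it assumes each received item carries the audio chunk and 3D speaker position described above.

```python
def run_receiving_site(receive_frame_group, render_virtual_scene, display, spatialize_voice):
    """High-level receive loop at the local site: for each (group of) received
    scene-proxy frame(s), render the virtual scene, show it, and spatialize each
    remote participant's voice at the on-screen location where they are depicted."""
    while True:
        frame_group = receive_frame_group()           # one frame per remote site, roughly contemporaneous
        if frame_group is None:
            break                                     # session ended
        rendered = render_virtual_scene(frame_group)  # depicts every remote participant
        display(rendered)
        for remote in frame_group.values():
            if remote.audio_chunk is not None and remote.speaker_position is not None:
                # Make the voice seem to come from where this participant appears on screen.
                spatialize_voice(remote.audio_chunk, remote.speaker_position, rendered)
```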
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of audio source positioning technique embodiments reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.
It is also noted that for the sake of clarity specific terminology will be resorted to in describing the audio source positioning technique embodiments described herein and it is not intended for these embodiments to be limited to the specific terms so chosen. Furthermore, it is to be understood that each specific term includes all its technical equivalents that operate in a broadly similar manner to achieve a similar purpose. Reference herein to “one embodiment”, or “another embodiment”, or an “exemplary embodiment”, or an “alternate embodiment”, or “one implementation”, or “another implementation”, or an “exemplary implementation”, or an “alternate implementation” means that a particular feature, a particular structure, or particular characteristics described in connection with the embodiment or implementation can be included in at least one embodiment of the audio source positioning technique. The appearances of the phrases “in one embodiment”, “in another embodiment”, “in an exemplary embodiment”, “in an alternate embodiment”, “in one implementation”, “in another implementation”, “in an exemplary implementation”, “in an alternate implementation” in various places in the specification are not necessarily all referring to the same embodiment or implementation, nor are separate or alternative embodiments/implementations mutually exclusive of other embodiments/implementations. Yet furthermore, the order of process flow representing one or more embodiments or implementations of the audio source positioning technique does not inherently indicate any particular order nor imply any limitations of the audio source positioning technique.
The term “sensor” is used herein to refer to any one of a variety of scene-sensing devices which can be used to generate a stream of sensor data that represents a given scene. Generally speaking and as will be described in more detail hereafter, the audio source positioning technique embodiments described herein employ one or more sensors which can be configured in various arrangements to capture a scene, thus allowing one or more streams of sensor data to be generated each of which represents the scene from a different geometric perspective. Each of the sensors can be any type of video capture device (e.g., any type of video camera), or any type of audio capture device, or any combination thereof. Each of the sensors can also be either static (i.e., the sensor has a fixed spatial location and a fixed rotational orientation which do not change over time), or moving (i.e., the spatial location and/or rotational orientation of the sensor change over time). The audio source positioning technique embodiments described herein can employ a combination of different types of sensors to capture a given scene.
Audio source positioning technique embodiments described herein are generally employed in a video teleconference or telepresence session between a local site and one or more remote sites. In one embodiment, each of these sites has one participant, and a virtual scene is constructed and displayed at each site that depicts each of the participants from the other sites in the constructed scene. Thus, it appears to a participant who is viewing the virtual scene that he or she is in a space with the participant or participants from the other site or sites. The construction of such a virtual scene is accomplished using conventional methods, with an exception. Rather than simply playing audio captured at the other site(s) in the viewing participant's site, audio source positioning technique embodiments described herein are used to co-locate the voice of each of the other participant(s) with the depiction of that person on a display. In other words, audio source positioning technique embodiments described herein make it seem to a participant viewing a rendering of the virtual scene that the voice of another participant is emanating from a location on the display device where the remote participant is depicted. This audio illusion enhances the video teleconference or telepresence session experience and makes it seem more like the viewing participant is actually present with the other participant(s) in the virtual scene.
It is noted that for convenience, the participant who is viewing the rendered virtual scene will be referred to as a local or first participant, and the site that this participant is viewing from will be referred to as the local or first site. Each of the other participants involved will be referred to as a remote or other participant, and the site associated with a remote participant will be referred to as a remote or other site. Given this, it will be evident that any of the sites participating in a video teleconference or telepresence session can be considered the local site with the others being the remote sites.
Referring to
In addition to generating scene proxies, the streams of sensor data are used, along with a face tracking technique, to identify a 3D point representing the location of the participant in the local site for each frame of the scene proxies (block 104). In one embodiment, this 3D point representing the location of the participant in the local site is a 3D point representing the location of the participant's head in the local site. In another embodiment, the 3D point representing the location of the participant in the local site is a 3D point representing the location of the participant's mouth in the local site.
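The choice between a head point and a mouth point can be pictured as follows. The sketch is a hypothetical illustration that combines the two embodiments above with the visibility criterion described later in this description; FaceTrackResult, its fields, and the fallback rule are assumptions, not the output of any particular face tracker.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class FaceTrackResult:
    head_position: Point3D              # assumed always available from the face tracker
    mouth_position: Optional[Point3D]   # None when the mouth is not visible to the sensors

def participant_point(track: FaceTrackResult) -> Point3D:
    # Prefer the mouth location when the tracker can see it; otherwise
    # fall back to the head location.
    return track.mouth_position if track.mouth_position is not None else track.head_position
```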
The scene proxies representing each frame are transmitted in the order generated over a data communication network to the remote site or sites, along with audio data representing the local site participant's voice captured, if any, during the time period between the frame currently being transmitted and the next frame of scene proxies to be transmitted, and the 3D point coordinates representing the location of the participant in the local site identified for the frame of scene proxies currently being transmitted (block 106). It is noted that the “if any” caveat refers to the fact that the local participant may not speak during the frame time period alluded to above.
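One way to represent the per-frame payload just described is sketched below. The record layout, the field names (scene_proxy, audio_chunk, speaker_position), and the send_frame helper are illustrative assumptions rather than a prescribed wire format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FramePayload:
    """Data sent to each remote site for one frame of scene proxies."""
    frame_index: int                      # order in which the frame was generated
    scene_proxy: bytes                    # geometric description of the local site for this frame
    audio_chunk: Optional[bytes]          # local participant's voice captured since the previous frame, if any
    speaker_position: Tuple[float, float, float]  # 3D point locating the participant (head or mouth) in the local site

def send_frame(connection, payload: FramePayload) -> None:
    # Frames are transmitted in the order they were generated; the audio chunk
    # may be None when the local participant did not speak during the frame period.
    connection.send(payload)
```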
The foregoing process actions provide the data used at a remote site to perform audio source positioning. Thus, the foregoing actions can be said to facilitate audio source positioning at a remote site. If audio source positioning is to be implemented at the local site as well, then the same type of data is provided from the remote site or sites. Referring to
For each frame of scene proxies received from a remote site if there is only one remote site, or for each group of frames of scene proxies contemporaneously received from remote sites if there are multiple remote sites, a frame of a virtual scene is rendered (block 204). As indicated previously, the virtual scene frame includes a depiction of each of the remote site participants from the last-received frame or frames of scene proxies. The rendered virtual frame is then displayed to the local site participant via a display device (block 206). It is noted that the term contemporaneously used above is not to be taken literally. For example, in one implementation, the frames of scene proxies coming from multiple remote sites are considered contemporaneous if they arrive before the next frame from any of the sites.
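The non-literal sense of "contemporaneous" described above can be illustrated with the following sketch, which closes a group of frames as soon as any remote site delivers its next frame. The buffering scheme and names are assumptions, not a prescribed implementation.

```python
def group_contemporaneous_frames(incoming_frames):
    """Yield groups of scene-proxy frames, one per remote site, that arrived
    before any site delivered its next frame.

    `incoming_frames` is assumed to be an iterable of (site_id, frame) pairs
    in arrival order."""
    current_group = {}
    for site_id, frame in incoming_frames:
        if site_id in current_group:
            # This site sent a newer frame: the previous group is as complete
            # as it will get, so hand it off for rendering.
            yield dict(current_group)
            current_group.clear()
        current_group[site_id] = frame
    if current_group:
        yield current_group
```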
In addition, for each remote site participant depicted in the last-rendered frame of the virtual scene that is resident at a remote site that sent audio data representing the remote site participant's voice and the 3D point representing the location of the participant in the remote site, a spatial audio technique is employed to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted (block 208). This is accomplished using conventional methods given the received audio data and 3D point representing the location of the participant in the remote site.
As mentioned previously, the 3D point representing the location of the participant in the remote site can correspond to the person's head or mouth. More particularly, in one embodiment, the 3D point representing the location of a participant in a remote site is a 3D point representing the location of the participant's mouth in the remote site when the mouth of that participant is visible in the last-rendered frame of the virtual scene. In another embodiment, the 3D point representing the location of a participant in a remote site is a 3D point representing the location of the participant's head in the remote site when the mouth of that participant is not visible to the sensor used to determine that 3D point.
With regard to the foregoing action of rendering the frames of the virtual scene, it is noted that as part of this process, for each remote site, a first transform is computed that converts 3D locations in the remote site to points in the frame of the virtual scene. In addition, the action of displaying a rendered frame to the local site participant involves the use of a second transform that converts points in a frame of the virtual scene to screen coordinates on the local site's display device. These transforms are used in the aforementioned spatial audio technique. More particularly, referring to
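Treating both transforms as 4x4 homogeneous matrices, the chain from a remote-site 3D point to a location on the local display might look like the sketch below. Modeling the second transform as a 4x4 matrix, and the function and parameter names, are assumptions; the matrices themselves would come from however the virtual scene is composed and the display is calibrated.

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 so a 3D point can be multiplied by a 4x4 transform."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def apply_transform(matrix, point3d):
    """Apply a 4x4 homogeneous transform and dehomogenize the result."""
    h = matrix @ to_homogeneous(point3d)
    return h[:3] / h[3]

def audio_source_on_screen(remote_point, remote_to_scene, scene_to_screen):
    # First transform: 3D location in the remote site -> point in the virtual scene frame.
    scene_point = apply_transform(remote_to_scene, remote_point)
    # Second transform: virtual scene point -> screen coordinates on the local display.
    screen_point = apply_transform(scene_to_screen, scene_point)
    # The on-screen location (relative to the local listener) is what the spatial
    # audio technique uses to place the remote participant's voice.
    return scene_point, screen_point
```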
It is noted that the location of the local participant within the local site has an effect on how audio source positioning is accomplished. In general, a parallax effect results when the local site participant moves, and in one embodiment the spatial audio technique compensates for this effect based on the current location of the local participant. Generally, the head of the local site participant is tracked, and a 3D point representative of the location of the local site participant's head in the local site is computed periodically. This point is then used in the audio source positioning. More particularly, in one embodiment, each time a 3D point representative of the location of the local site participant's head in the local site is computed, the spatial audio technique is used to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted, taking into consideration the last-computed 3D point representative of the location of the local site participant's head.
It is noted that to provide a more realistic experience for the local participant, the rate at which 3D points representative of the location of the local site participant's head in the local site are computed should be high. In one embodiment, this rate exceeds the rate at which frames of the virtual scene are calculated. For example, a typical virtual scene frame rate is 30 frames per second (fps). In one implementation, the rate at which 3D points representative of the location of the local site participant's head are computed is four times the virtual scene frame rate, namely 120 times per second. Thus, while the content of the scene may only be updated at 30 fps, the depiction of the scene from the point of view of the local participant is updated at 120 fps. In other words, the scene is calculated at 30 fps but rendered at 120 fps.
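A minimal sketch of this rate decoupling, assuming a 30 fps scene update and a 120 Hz head-tracking/render loop, follows. The timing logic and the helper names (track_local_head, update_scene, render_view) are illustrative assumptions.

```python
import time

SCENE_FPS = 30          # rate at which new virtual scene frames are calculated
HEAD_TRACK_HZ = 120     # rate at which the local participant's head position is sampled
STEPS_PER_SCENE_FRAME = HEAD_TRACK_HZ // SCENE_FPS   # 4 view updates per scene frame

def run_render_loop(track_local_head, update_scene, render_view):
    """Update scene content at 30 fps but re-render the view (and re-spatialize the
    audio) at 120 Hz from the latest head position, compensating for parallax."""
    scene = None
    step = 0
    while True:
        if step % STEPS_PER_SCENE_FRAME == 0:
            scene = update_scene()              # new content only 30 times per second
        head_point = track_local_head()         # sampled 120 times per second
        render_view(scene, head_point)          # view and audio updated per head sample
        step += 1
        time.sleep(1.0 / HEAD_TRACK_HZ)         # simplistic pacing; a real loop would compensate for drift
```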
The audio source positioning technique embodiments described so far make it seem to a participant viewing a rendering of the virtual scene that the voice of another participant is emanating from a location on the display device where the remote participant is depicted. However, another enhancement can be made that makes the video teleconference or telepresence session seem even more as if the viewing participant is actually present with the other participant(s) in the virtual scene. This enhancement involves simulating the reverberations a participant's voice would create in the virtual scene (e.g., reverberations of the sound against the virtual walls or other virtual objects in the scene) and playing these reverberations in the participant's site.
This reverberation enhancement can be accomplished at the local site given, from each remote site, the 3D point representing the location of the remote participant in the remote site and a modified version of the audio data representing the remote site participant's voice. The modification to the audio data involves suppressing reverberations and noise in the audio captured at the remote site. While this modification can be performed at the local site given certain information about the remote site, the more efficient approach is for the reverberations and noise in the audio captured at the remote site to be suppressed in the audio data before that data is sent to the local site. In either case, conventional suppression techniques are employed to accomplish the modification.
Assuming the above-described modified audio data and the 3D point representing the location of the remote participant has been received from a remote site, one general embodiment of the audio source positioning technique that adds reverberation on a frame-by-frame basis involves, from the viewpoint of the local site, using the local site computing device to perform the following process actions. First, the previously-described first transform computed to convert 3D locations in the remote site to points in the last-rendered frame of the virtual scene is employed to convert the 3D point representing the location of the remote participant in the remote site to a point in the last-rendered frame of the virtual scene (block 400). In this embodiment, the 3D point representing the location of the remote participant in the remote site corresponds to a 3D point representing the location of the remote participant's mouth in the remote site. Next, the orientation of the remote site participant's face in the virtual scene, as depicted in the last-rendered virtual scene frame, is identified (block 402). Conventional methods are employed to accomplish this task. The direction that the remote participant's voice projects in the virtual space from the point in the last-rendered frame of the virtual scene that corresponds to the 3D point representing the location of the remote participant's mouth is then computed based on the orientation of the remote site participant's face in the virtual scene (block 404). In addition, the reverberation characteristics of the virtual scene, as depicted in the last-rendered virtual scene frame, are estimated (block 406).
Given the point representing the location of the remote participant's mouth in the virtual scene, the computed direction, and the estimated reverberation characteristics of the virtual scene, reverberation audio data is then computed that, when added to the received audio data, simulates the reverberations of the remote participant's voice in the virtual space for the current frame (block 408). This computed reverberation audio data is then added into the audio played in the local site in conjunction with the display of the current virtual scene frame (block 410).
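The per-frame reverberation steps above (blocks 400 through 410) might be organized as in the sketch below. The function signature is hypothetical, and the reverberator argument stands in for whatever conventional artificial reverberation method is actually used to synthesize reverberation from the estimated scene characteristics.

```python
import numpy as np

def add_virtual_scene_reverberation(dry_audio,           # reverberation-suppressed audio from the remote site
                                    remote_mouth_point,  # 3D mouth location in the remote site
                                    remote_to_scene,     # first transform: remote site -> virtual scene (4x4)
                                    face_orientation,    # vector the depicted face is pointing along
                                    scene_reverb,        # estimated reverberation characteristics of the scene
                                    reverberator):       # placeholder for a conventional artificial reverberator
    """Per-frame reverberation sketch: place the voice in the virtual scene, project it
    along the direction the face is pointing, synthesize matching reverberation, and
    mix it with the dry audio played for this frame."""
    # Block 400: locate the remote participant's mouth in the virtual scene.
    mouth_h = remote_to_scene @ np.append(np.asarray(remote_mouth_point, float), 1.0)
    mouth_in_scene = mouth_h[:3] / mouth_h[3]

    # Blocks 402-404: the voice projects in the direction the depicted face points.
    direction = np.asarray(face_orientation, float)
    direction = direction / np.linalg.norm(direction)

    # Blocks 406-408: synthesize reverberation for this source position and direction
    # using the estimated characteristics of the virtual scene (assumed to return an
    # array the same length as the dry audio).
    reverb_audio = reverberator(dry_audio, mouth_in_scene, direction, scene_reverb)

    # Block 410: mix the reverberation into the audio played with the current frame.
    return np.asarray(dry_audio, float) + reverb_audio
```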
The audio source positioning technique embodiments described herein can be employed in a variety of video conferencing or telepresence applications. Generally, any video conferencing or telepresence application that involves the generation and display of a virtual scene for each participant can be enhanced using the audio source positioning technique embodiments described herein.
One exemplary video conferencing or telepresence application supports the generation, storage, distribution, and presentation of a virtual scene (such as a virtual conference room). The exemplary video conferencing or telepresence application can support various types of traditional, single viewpoint virtual scene presentations in which the viewpoint of the scene is fixed when the video is recorded/captured and this viewpoint cannot be controlled or changed by a participant while they are viewing the virtual scene. In other words, in a single viewpoint virtual scene the viewpoint of the scene is fixed and cannot be modified when the scene is being rendered and displayed to a participant. However, the exemplary video conferencing or telepresence application can also support various types of free viewpoint video in which the viewpoint of the virtual scene can be interactively controlled and changed by a participant at will while they are viewing the scene. In other words, in a free viewpoint video a participant can interactively generate different viewpoints of the scene on-the-fly when the virtual scene is being rendered and displayed.
Referring again to
Referring again to
Referring again to
The rendering sub-stage 520 of the processing pipeline 500 inputs the scene proxies from the storage and distribution stage 514, and then generates successive frames of the virtual scene (one of which 524 is shown in
It is noted that in a video conferencing or telepresence application that can support various types of free viewpoint video in which the viewpoint of the virtual scene can be interactively controlled and changed by a participant at will while they are viewing the scene, in addition to the foregoing, the rendering sub-stage 520 inputs the scene proxies output from the storage and distribution stage 514 (or stages if multiple other sites are involved), and then generates a frame exhibiting a current synthetic viewpoint. The current synthetic viewpoint is either a default viewpoint, or if the participant has specified a viewpoint, is the last-specified viewpoint. The participant-specified viewpoint comes from the participant viewing experience sub-stage 522, which inputs it from the participant via a user interface.
Referring again to
In one implementation, the video capture devices 510 include a circular arrangement of eight genlocked sensors used to capture a site which includes the participant, where each of the sensors has a combination of one infrared structured-light projector, two infrared video cameras, and one color camera. Accordingly, the sensors each generate a different stream of video data which includes both a stereo pair of infrared image streams and a color image stream. The pair of infrared image streams and the color image stream generated by each sensor are used to generate different depth map image streams. The different depth map image streams are then merged into a stream of calibrated point cloud reconstructions of the scene. These point cloud reconstructions can then be used to generate a stream of mesh models of the scene. A conventional view-dependent texture mapping method which accurately represents specular textures such as skin is then used to extract texture data from the color image stream generated by each sensor and map this texture data to the stream of mesh models of the scene. The combination of the mesh models and texture data, among other information, forms the scene proxies. Finally, these sensors and their data streams are also used in a face tracking process to identify the 3D location of the participant (which as described above can be the location of the participant's head or mouth).
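A rough sketch of the depth-map merging step is given below, assuming each sensor supplies a depth image together with its calibrated intrinsics and pose; mesh generation and view-dependent texturing are left to the conventional methods referenced above, and the function names are illustrative.

```python
import numpy as np

def depth_map_to_points(depth, intrinsics, cam_to_world):
    """Back-project a depth image (in meters) into world-space 3D points.

    `intrinsics` is the 3x3 camera matrix and `cam_to_world` the 4x4 pose from
    calibration; both are assumed to come from the genlocked sensor rig."""
    h, w = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous camera-space points
    points_world = (cam_to_world @ points_cam.T).T[:, :3]       # into the shared site coordinate frame
    return points_world

def merge_point_cloud(depth_maps, intrinsics_list, poses):
    """Merge the per-sensor depth maps into one calibrated point cloud."""
    clouds = [depth_map_to_points(d, k, p)
              for d, k, p in zip(depth_maps, intrinsics_list, poses)]
    return np.concatenate(clouds, axis=0)
```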
In another implementation, the video capture devices 510 include four genlocked visible light video cameras used to capture a site which includes the participant, where the cameras are evenly placed around the site. Accordingly, the cameras each generate a different stream of video data which includes a color image stream. An existing 3D geometric model of a human body can be used in the scene proxies as follows. Conventional methods can be used to kinematically articulate the model over time in order to fit (i.e., match) the model to the streams of video data generated by the cameras. The kinematically articulated model can then be colored as follows. A conventional view-dependent texture mapping method can be used to extract texture data from the color image stream generated by each camera and map this texture data to the kinematically articulated model. The combination of the kinematically articulated model and texture data, among other information, forms the scene proxies. Here again, the cameras and their video data streams are also used in a face tracking process to identify the 3D location of the participant (which can be the location of the participant's head or mouth).
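One common ingredient of view-dependent texture mapping is to weight each camera's color sample by how well that camera's viewing direction agrees with the current rendering viewpoint. The sketch below shows such a weighting for a single surface point; it is one reasonable formulation offered as an assumption, not the specific mapping method referenced above.

```python
import numpy as np

def view_dependent_color(point, surface_colors, camera_positions, view_position):
    """Blend per-camera colors for one surface point, favoring cameras whose
    viewing direction is closest to the current rendering viewpoint.

    `surface_colors[i]` is the RGB value sampled for this point from camera i's image;
    all geometric inputs are 3-vectors in the same coordinate frame."""
    point = np.asarray(point, float)
    view_dir = np.asarray(view_position, float) - point
    view_dir = view_dir / np.linalg.norm(view_dir)
    weights = []
    for cam_pos in camera_positions:
        cam_dir = np.asarray(cam_pos, float) - point
        cam_dir = cam_dir / np.linalg.norm(cam_dir)
        # Cosine of the angle between the camera ray and the viewing ray;
        # cameras looking at the point from near the viewpoint dominate.
        weights.append(max(np.dot(cam_dir, view_dir), 0.0))
    weights = np.asarray(weights)
    if weights.sum() == 0.0:
        weights = np.ones_like(weights)          # degenerate case: fall back to a plain average
    weights = weights / weights.sum()
    return np.einsum('i,ij->j', weights, np.asarray(surface_colors, float))
```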
The audio source positioning technique embodiments described herein are operational within numerous types of general purpose or special purpose computing system environments or configurations.
For example,
To allow a device to implement the audio source positioning technique embodiments described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by
In addition, the simplified computing device of
The simplified computing device of
Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Further, software, programs, and/or computer program products embodying some or all of the various audio source positioning technique embodiments described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
Finally, the audio source positioning technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
While the audio source positioning technique embodiments described so far involve only one participant at each site, in one embodiment it is possible to have any number of participants at a site, as long as a separate audio stream and separate location information are sent for each participant. In general, the operation is the same as described for a site sending audio source positioning data to another site or sites, except that audio data representing a remote site participant's voice and the 3D point representing the location of that participant in the remote site are sent for each participant at the site. At a site receiving this data, the virtual scene is rendered so as to include all the remote site participants as before (including each participant at a site having multiple participants). If the receiving site has one participant, then the spatial audio technique employed to spatialize the audio is accomplished in the same manner as described previously. However, if the receiving site has more than one participant, the sound is separately spatialized, as described previously, for each participant. This can be easily accomplished if the participants each wear audio earphones (i.e., the plurality of audio speakers at the site are sets of headphones) and a spatial audio technique designed for earphones is employed.
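A minimal sketch of this per-participant handling follows, assuming each remote participant arrives as a separate (audio, on-screen location) pair and each local listener wears headphones. The spatialize_for_listener argument is a placeholder for whatever conventional headphone (binaural) spatial audio technique is used, and the data shapes are assumptions.

```python
def render_spatial_audio(remote_participants, local_listeners, spatialize_for_listener):
    """Spatialize each remote participant's voice separately for every local listener.

    `remote_participants` maps a participant id to (audio_chunk, screen_point), where
    screen_point is where that participant is depicted on the display.
    `local_listeners` maps a listener id to that listener's tracked head position.
    Returns, per listener, the mixed per-ear signal to play over that listener's headphones."""
    output = {}
    for listener_id, head_point in local_listeners.items():
        mixed = None
        for audio_chunk, screen_point in remote_participants.values():
            if audio_chunk is None:
                continue                      # that participant did not speak this frame
            binaural = spatialize_for_listener(audio_chunk, screen_point, head_point)
            mixed = binaural if mixed is None else mixed + binaural
        output[listener_id] = mixed
    return output
```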
It is noted that any or all of the aforementioned embodiments throughout the description may be used in any combination desired to form additional hybrid embodiments. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
This application claims the benefit of and priority to provisional U.S. patent application Ser. No. 61/653,983 filed May 31, 2012.