The present invention relates to mixing spatialized audio signals. Acoustic sources may be re-panned before being mixed.
With continued globalization, teleconferencing is becoming increasingly important for effective communications across multiple geographical locations. A conference call may include participants located in different company buildings of an industrial campus, different cities in the United States, or different countries throughout the world. Consequently, it is important that spatialized audio signals are combined in a way that facilitates communications among the participants of the teleconference.
Some prior art spatial audio re-panning solutions perform a short time Fourier transform (STFT) analysis on the stereo signal. Within the time-frequency domain, the coherence between the left and right channels is determined using a cross-correlation function. The coherence value indicates the dominance of ambience in the stereo signal. Correlation of the stereo channels also provides a similarity value indicating the stereo panning of the source within the stereo image.
However, mixing of spatialized signals may be difficult or even impractical in certain teleconferencing scenarios. For example, when two independently spatialized signals are blindly mixed, the resulting mixed signal may map sound sources to overlapping auditory locations. Consequently, the resulting mixed signal may be confusing to the participants when tracking dialog among the participants.
Consequently, there is a real market need to provide effective teleconferencing capability of spatialized audio signals that can be practically implemented by a teleconferencing system.
An aspect of the present invention provides methods, computer-readable media, and apparatuses for re-panning multiple audio signals by applying spatial cue processing. Sound sources may be re-panned before they are mixed into a combined signal. Processing, according to an aspect of the invention, may be applied, for example, in a conference bridge that receives two omni-directionally recorded audio signals. The conference bridge subsequently re-pans the given signals to the listener's left and right sides. The source image mapping and panning may further be adapted based on the content and use case. Mapping may be done by manipulating the directional parameters prior to directional decoding or before directional mixing.
With another aspect of the invention, re-panned input signals are mixed to form an output signal that is rendered to a user. The rendered output signal may be converted into an acoustic signal through a set of loudspeakers or may be recorded on a storage device.
With another aspect of the invention, directional information that is associated with an audio input signal is remapped in order to place input sources into virtual source positions. The virtual sources may be placed with respect to actual loudspeakers using spatial cue processing.
A more complete understanding of the present invention and the advantages thereof may be acquired by referring to the following description in consideration of the accompanying drawings, in which like reference numbers indicate like features and wherein:
In the following description of the various embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present invention.
As will be further discussed, embodiments of the invention may support the re-panning of multiple audio (sound) signals by applying spatial cue coding. Sound sources in each of the signals may be re-panned before the signals are mixed into a combined signal. For example, processing may be applied in a conference bridge that receives two omni-directionally recorded (or synthesized) sound field signals as will be further discussed. The conference bridge subsequently re-pans one of the signals to the listener's left side and the other signal to the right side. The source image mapping and panning may further be adapted based on the content and use case. Mapping may be done by manipulating the directional parameters prior to directional decoding or before directional mixing.
As will be further discussed, embodiments of the invention support a signal format that is agnostic to the transducer system used in reproduction. Consequently, a processed signal may be played through headphones or through different loudspeaker setups.
Architecture 100 may be applied to systems that have knowledge of the spatial characteristics of the original sound fields and that may re-synthesize the sound field from audio signal 151 and available spatial metadata (e.g., directional information 153). Spatial metadata may be available by an analysis method (performed by module 101) or may be included with audio signal 151. Spatial re-panning module 103 subsequently modifies directional information 153 to obtain modified directional information 157.
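By way of illustration only, the following numpy sketch shows one way an analysis module such as module 101 could estimate per-frame direction and diffuseness from a first-order (B-format) input. The single-band time-domain processing, the frame length, and the function name are simplifying assumptions for the example; an actual implementation would typically operate in frequency bands.

```python
import numpy as np

def analyze_direction(w, x, y, frame_len=1024):
    """Per-frame azimuth and diffuseness estimates from first-order channels.

    Simplified single-band sketch; w, x, y are B-format channels (W scaled by
    the 1/sqrt(2) convention).
    """
    n_frames = len(w) // frame_len
    azimuth = np.zeros(n_frames)
    diffuseness = np.zeros(n_frames)
    for n in range(n_frames):
        sl = slice(n * frame_len, (n + 1) * frame_len)
        # Time-averaged intensity components (omni signal times pressure gradients).
        ix, iy = np.mean(w[sl] * x[sl]), np.mean(w[sl] * y[sl])
        energy = np.mean(w[sl] ** 2 + 0.5 * (x[sl] ** 2 + y[sl] ** 2)) + 1e-12
        azimuth[n] = np.degrees(np.arctan2(iy, ix))          # direction of arrival
        diffuseness[n] = 1.0 - np.sqrt(2.0) * np.hypot(ix, iy) / energy
    return azimuth, np.clip(diffuseness, 0.0, 1.0)
```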
Directional re-synthesis module 105 forms re-panned signal 159 from audio signal 155 and modified directional information 157. The data stream (comprising audio signal 155 and modified directional information 157) typically has a directionally coded format (e.g., B-format as will be discussed) after re-panning.
Moreover, several data streams may be combined, in which each data stream includes a different audio signal with corresponding directional information. The re-panned signals may then be combined (mixed) by directional re-synthesis module 105 to form output signal 159. If the signal mixing is performed by re-synthesis module 105, the mixed output stream may have the same or similar format as the input streams (e.g., audio signal with directional information). A system performing mixing is disclosed by U.S. patent application Ser. No. 11/478,792 ("DIRECT ENCODING INTO A DIRECTIONAL AUDIO CODING FORMAT", Jarmo Hiipakka) filed Jun. 30, 2006, which is hereby incorporated by reference. For example, two audio signals, each associated with directional information, are combined by analyzing the signals to combine the spatial data, while the actual signals are mixed (added) together. Alternatively, mixing may happen after the re-synthesis, so that signals from several re-synthesis modules (e.g., module 105) are mixed. The output signal may be rendered to a listener by directing an acoustic signal through a set of loudspeakers or earphones. With embodiments of the invention, the output signal may be transmitted to the user and then rendered (e.g., when processing takes place in a conference bridge). Alternatively, the output is stored in a storage device (not shown).
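A simple sketch of combining two such data streams is given below. The per-frame, energy-based selection of directional parameters is an assumed strategy chosen for the example, not the method of the cited application; the function name and frame length are likewise illustrative.

```python
import numpy as np

def mix_directional_streams(audio_a, dir_a, audio_b, dir_b, frame_len=1024):
    """Mix two equal-length mono streams that each carry per-frame directional data.

    The audio is summed; for each frame, the directional parameter of the
    stream with more energy is kept (a simple selection rule assumed here).
    """
    mixed = np.asarray(audio_a) + np.asarray(audio_b)
    n_frames = min(len(dir_a), len(dir_b))
    mixed_dir = np.empty(n_frames)
    for n in range(n_frames):
        sl = slice(n * frame_len, (n + 1) * frame_len)
        ea = np.sum(np.asarray(audio_a[sl]) ** 2)
        eb = np.sum(np.asarray(audio_b[sl]) ** 2)
        mixed_dir[n] = dir_a[n] if ea >= eb else dir_b[n]
    return mixed, mixed_dir
```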
Modifications of spatial information (e.g., directional information 153) may include remapping any range (2D) or area (3D) of positions to a new range or area. The remapped range may include the whole original sound field or may be sufficiently small that it essentially covers only one sound source in the original sound field. The remapped range may also be defined using a weighting function, so that sound sources close to the boundary may be partially remapped. Re-panning may also consist of several individual re-panning operations performed together. Consequently, embodiments of the invention support scenarios in which the positions of two sound sources in the original sound field are swapped.
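As one illustration, a small numpy sketch of such a range remapping is given below; the linear edge weighting and the 10-degree fade width are assumptions made for the example, not requirements of the described processing.

```python
import numpy as np

def remap_azimuth(azimuth_deg, src_range, dst_range, fade_deg=10.0):
    """Remap azimuths that fall inside src_range linearly onto dst_range.

    Sources near the range boundary are only partially remapped, using a
    simple linear weighting over fade_deg degrees.
    """
    lo, hi = src_range
    new_lo, new_hi = dst_range
    a = np.asarray(azimuth_deg, dtype=float)
    # Weight is 1 well inside the source range, 0 outside, and ramps at the edges.
    weight = np.clip(np.minimum(a - lo, hi - a) / fade_deg, 0.0, 1.0)
    target = new_lo + (a - lo) * (new_hi - new_lo) / (hi - lo)
    return (1.0 - weight) * a + weight * target
```

Swapping the positions of two sound sources, as mentioned above, then amounts to applying two such remappings with the source and destination ranges exchanged.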
If directional information 153 contains information about the diffuseness of the sound field, diffuseness is typically processed by module 103 when re-panning the sound field. Consequently, it may be possible to maintain the natural character of the diffuse field. However, it is also possible to map the original diffuseness component of the sound field to a specific position or a range of positions in the modified sound field for special effects.
To record a B-format signal, the desired sound field is represented by its spherical harmonic components at a single point. The sound field is then regenerated using any suitable number of loudspeakers or a pair of headphones. With a first-order implementation, the sound field is described using the zeroth-order component (sound pressure signal W) and three first-order components (pressure gradient signals X, Y, and Z along the three Cartesian coordinate axes). Embodiments of the invention may also determine higher-order components.
The first-order signal, which consists of the four channels W, X, Y, and Z, is often referred to as the B-format signal. One typically obtains a B-format signal by recording the sound field using a special microphone setup that, directly or through a transformation, yields the desired signal.
Besides recording a signal in the B-format, it is possible to synthesize the B-format signal. For encoding a monophonic audio signal into the B-format, the following coding equations are required:

W(t) = x(t) (1/√2)
X(t) = x(t) cos θ cos φ
Y(t) = x(t) sin θ cos φ
Z(t) = x(t) sin φ
where x(t) is the monophonic input signal, θ is the azimuth angle (anti-clockwise angle from center front), φ is the elevation angle, and W(t), X(t), Y(t), and Z(t) are the individual channels of the resulting B-format signal. Note that the multiplier on the W signal is a convention that originates from the need to get a more even level distribution between the four channels. (Some references use an approximate value of 0.707 instead.) It is also worth noting that the directional angles can, naturally, be made to change with time, even though this is not explicitly shown in the equations. Multiple monophonic sources can also be encoded by applying the same equations individually to each source and mixing (adding together) the resulting B-format signals.
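As a concrete illustration, the encoding may be written, for example, as the following numpy sketch; the function name and argument conventions are illustrative only.

```python
import numpy as np

def encode_mono_to_bformat(x, azimuth_deg, elevation_deg=0.0):
    """Encode a monophonic signal x(t) into first-order B-format using the
    equations above; the angles may be scalars or per-sample arrays."""
    theta, phi = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = x / np.sqrt(2.0)                       # level convention (~0.707)
    bx = x * np.cos(theta) * np.cos(phi)
    by = x * np.sin(theta) * np.cos(phi)
    bz = x * np.sin(phi)
    return w, bx, by, bz

# Several monophonic sources: encode each source separately and add the
# resulting B-format channels together.
```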
If the format of the input signal is known beforehand, the B-format conversion can be replaced with a simplified computation. For example, if the signal can be assumed to be standard 2-channel stereo (with loudspeakers at +/−30 degree angles), the conversion equations reduce to multiplications by constants. Currently, this assumption holds for many application scenarios.
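For instance, under the assumption of loudspeakers at +/−30 degrees and zero elevation, the conversion collapses to the constant multipliers sketched below (an illustration, not a normative implementation).

```python
import numpy as np

SQRT2_INV = 1.0 / np.sqrt(2.0)
COS30, SIN30 = np.cos(np.radians(30.0)), np.sin(np.radians(30.0))

def stereo_to_bformat(left, right):
    """2-channel stereo (loudspeakers assumed at +/-30 degrees, zero elevation)
    encoded to horizontal B-format using constant multipliers only."""
    w = SQRT2_INV * (left + right)
    x = COS30 * (left + right)
    y = SIN30 * (left - right)
    return w, x, y
```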
Embodiments of the invention support parameter space re-panning for multiple sound scene signals by applying spatial cue coding. Sound sources in each of the signals are re-panned before they are mixed into a combined signal. Processing may be applied, for example, in a conference bridge that receives two omni-directionally recorded (or synthesized) sound field signals and then re-pans one of these to the listener's left side and the other to the right side. The source image mapping and panning may further be adapted based on content and use case. Mapping may be performed by manipulating the directional parameters prior to directional decoding or before directional mixing.
Embodiments of the invention support several capabilities in a teleconferencing system, as discussed below.
DirAC reproduction (re-synthesis) is based on taking the signal recorded by the omni-directional microphone, and distributing this signal according to the direction and diffuseness estimates gathered in the analysis phase.
DirAC re-synthesis may generalize a system by supporting the same representation for the sound field while using an arbitrary loudspeaker (or, in general, transducer) setup in reproduction. The sound field may be coded in parameters that are independent of the actual transducer setup used for reproduction, namely the direction of arrival angles (azimuth, elevation) and diffuseness.
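The following simplified sketch illustrates the idea of distributing the omnidirectional signal according to the direction and diffuseness estimates. Nearest-loudspeaker panning of the direct part and even distribution of the diffuse part (without decorrelation) are simplifications assumed for this example; a full implementation would use amplitude panning over loudspeaker pairs and decorrelation of the diffuse part.

```python
import numpy as np

def dirac_style_render(w, azimuth_deg, diffuseness, speaker_az_deg, frame_len=1024):
    """Distribute the omnidirectional signal w to an arbitrary loudspeaker setup
    from per-frame azimuth and diffuseness estimates (highly simplified)."""
    speakers = np.radians(np.asarray(speaker_az_deg, dtype=float))
    out = np.zeros((len(speakers), len(w)))
    for n, (az, psi) in enumerate(zip(azimuth_deg, diffuseness)):
        sl = slice(n * frame_len, (n + 1) * frame_len)
        # Loudspeaker closest to the estimated direction of arrival.
        diff = np.angle(np.exp(1j * (speakers - np.radians(az))))
        nearest = np.argmin(np.abs(diff))
        direct = np.sqrt(max(1.0 - psi, 0.0))          # non-diffuse part
        spread = np.sqrt(max(psi, 0.0) / len(speakers))  # diffuse part, spread evenly
        out[nearest, sl] += direct * w[sl]
        out[:, sl] += spread * w[sl]
    return out
```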
With 3D teleconferencing, one major concern is to mix sound field signals originating from multiple conference spaces to better represent the teleconference. A microphone array may be used to pick up the sound field from a conference space to produce an omnidirectional sound field signal or a binaural signal. (Alternatively, a 3D representation of participants may be created using binaural synthesis.) Signals 451 and 453 (from conference sites A and B, respectively) are then transmitted to the conference bridge. If the conference bridge directly combines two omnidirectional signals (corresponding to signal 455), sound source positions (401b-413b) may be mapped on top of each other (e.g., sound positions 401b and 409b). Direct mapping may be confusing for participants when some participants are essentially mapped to the same position and the physical locations of the participants are not related to the position of the sound source.
Embodiments of the invention may re-pan sound field signals before they are mixed together (corresponding to re-panned signal 457).
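As a sketch of the directional-parameter side of this re-panning, the helper below squeezes the image of site A into the listener's left half and the image of site B into the right half; the target sectors of roughly 15 to 165 degrees and −15 to −165 degrees are illustrative choices, as is the function name. The audio signals themselves are untouched, and the two streams can then be mixed, for example along the lines of the stream-mixing sketch given earlier.

```python
import numpy as np

def repan_site_azimuths(az_site_a, az_site_b):
    """Remap two full 360-degree conference-site images into opposite halves
    of the listener's image before mixing (azimuth anticlockwise from front)."""
    az_a = np.asarray(az_site_a, dtype=float) % 360.0
    az_b = np.asarray(az_site_b, dtype=float) % 360.0
    left = 15.0 + az_a * (150.0 / 360.0)     # site A occupies the left side
    right = -15.0 - az_b * (150.0 / 360.0)   # site B occupies the right side
    return left, right
```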
With embodiments of the invention, the re-panning processing may be performed at different points in the teleconferencing system.
For example, re-panning may be performed at a conference server that combines signals in a centralized system and sends combined signals to the receiving terminals. With a decentralized conference architecture, where terminals have direct connection to each other, processing may be performed at the receiving terminal. With other architectures, re-panning processing may be performed at the transmitting terminal.
Since the compressed audio images are represented with the same 5.1 loudspeaker layout, sound sources may be remapped to the new loudspeaker setup seen by each compressed image. The original 360 degree image is constructed using five loudspeakers (center loudspeaker 505a, left front loudspeaker 503a, right front loudspeaker 507a, left surround loudspeaker 501a, and right surround loudspeaker 509a), but compressed images 555a and 555b may be created with four loudspeakers. The left side image 555a uses center loudspeaker 505b, left front loudspeaker 503b, left surround loudspeaker 501b, and right surround loudspeaker 509b. The right side image 555b uses center loudspeaker 505b, right front loudspeaker 507b, right surround loudspeaker 509b, and left surround loudspeaker 501b. It should be noted that with this configuration, surround loudspeakers 501b and 509b contribute to representing both 180 degree compressed audio images.
If the spatial audio content primarily resides behind the listener (i.e., with the surround loudspeakers), it may not be feasible to split the image by selecting the cut-off point at 180 degrees. Instead, the content manager or adaptive image control may select a relatively silent area in the spatial audio image and perform the split in that area.
The image mapping from 360 to 180 degrees may further be adapted based on the audio image. The silent areas in the image may be compressed more than the active areas. For example, when there are one or more speakers in the 360 degree image, the silent area between the speakers may be compressed by adjusting the mapping curve accordingly.
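One possible realization of such an adaptive mapping curve is sketched below. The per-degree activity measure, the minimum weight that keeps silent sectors from collapsing completely, and the helper names are illustrative assumptions.

```python
import numpy as np

def build_mapping_curve(activity, out_range=(0.0, 180.0), min_weight=0.2):
    """Build a 360-to-180 degree mapping curve that compresses silent sectors
    more than active ones; activity holds per-degree signal activity."""
    weights = np.maximum(np.asarray(activity, dtype=float), 0.0) + min_weight
    cumulative = np.concatenate(([0.0], np.cumsum(weights)))
    curve = out_range[0] + (out_range[1] - out_range[0]) * cumulative / cumulative[-1]
    return curve  # curve[d] is the remapped azimuth of original azimuth d degrees

def remap(azimuth_deg, curve):
    """Look up remapped azimuths by interpolating along the mapping curve."""
    return np.interp(np.asarray(azimuth_deg) % 360.0, np.arange(len(curve)), curve)
```

Active sectors receive a proportionally larger share of the 180-degree output range, so they are compressed less than silent sectors.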
Several audio images may be combined as follows.
Virtual sound sources 611-619 may be placed in the audio image by binaural cue panning with separation angles 751-761.
Amplitude panning is the most common panning technique. The listener perceives a virtual source whose direction depends on the gain factors, i.e., the amplitude level differences (ILD) of a sound signal in adjacent loudspeakers. Another method is time panning. When a constant delay is applied to one loudspeaker in stereophonic listening, the virtual source is perceived to migrate towards the loudspeaker that radiates the earlier sound signal. The maximal effect is achieved when the delay (ITD) is approximately 1.0 ms. Time panning is typically not used to position sources in desired directions; rather, it is used to create special effects.
where g1 and g2 are the ILD values for loudspeakers 801 and 803, respectively. The amplitude panning for the virtual center channel (VC) is determined in the same manner using loudspeakers Ls and Lf.
Similar amplitude panning is needed for each virtual source.
In total, nine ILD values are needed to map five virtual channels in the given configuration. Similar mapping is done for the right-hand side as well. One may not be able to solve EQ. 3 for all sound sources. However, since the overall loudness is maintained constant according to EQ. 4, the gain values for individual loudspeakers can be determined.
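Because EQs. 3 and 4 are not reproduced above, the sketch below uses the widely known tangent panning law together with a constant-power normalization (g1^2 + g2^2 = 1) as a stand-in to obtain the gain factors for one loudspeaker pair; the helper name, angle conventions, and the example angles are illustrative.

```python
import numpy as np

def pairwise_gains(virtual_az_deg, spk1_az_deg, spk2_az_deg):
    """Gain factors (g1, g2) placing a virtual source between two loudspeakers.

    Tangent panning law with constant-power normalization, used here as a
    stand-in for EQs. 3 and 4; spk1 is assumed to be the loudspeaker at the
    larger azimuth.
    """
    center = 0.5 * (spk1_az_deg + spk2_az_deg)
    half_aperture = np.radians(abs(spk1_az_deg - spk2_az_deg) / 2.0)
    phi = np.radians(virtual_az_deg - center)
    # tan(phi) / tan(half_aperture) = (g1 - g2) / (g1 + g2)
    ratio = np.clip(np.tan(phi) / np.tan(half_aperture), -1.0, 1.0)
    g1, g2 = 1.0 + ratio, 1.0 - ratio
    norm = np.sqrt(g1 ** 2 + g2 ** 2)
    return g1 / norm, g2 / norm

# Example (angles illustrative): a virtual source rendered with the left
# surround (Ls, 110 degrees) and left front (Lf, 30 degrees) loudspeakers.
g_ls, g_lf = pairwise_gains(70.0, 110.0, 30.0)
```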
It should be noted that by using the presented combination of audio images, the surround loudspeakers (Ls) 601 and (Rs) 609 as well as center loudspeaker (C) 605 contribute to the representation of both (left and right) virtual images. Therefore, when determining the gain values for the combined image, one should verify that the surround and center loudspeaker powers do not saturate.
The determined ILD values from EQs. 3 and 4 are applied to the loudspeakers by multiplying the virtual source level with the respective ILD value. Signals from all virtual sources are added together for each loudspeaker. For example, the left front loudspeaker signal is determined using four virtual sources as follows:
s_Lf(i) = g_Lf(VLf) s_VLf(i) + g_Lf(VC) s_VC(i) + g_Lf(VRf) s_VRf(i) + g_Lf(VRs) s_VRs(i)   (EQ. 5)
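A direct transcription of EQ. 5 into code might look as follows; the gain and signal names in the commented usage are illustrative only.

```python
import numpy as np

def loudspeaker_feed(gains, virtual_signals):
    """Sum the contributing virtual sources into one loudspeaker feed (EQ. 5).

    gains maps a virtual-source name to its gain for this loudspeaker, and
    virtual_signals maps the same names to sample arrays.
    """
    return np.sum([g * np.asarray(virtual_signals[name]) for name, g in gains.items()],
                  axis=0)

# Left front loudspeaker built from its four contributing virtual sources:
# s_lf = loudspeaker_feed({"VLf": g_lf_vlf, "VC": g_lf_vc, "VRf": g_lf_vrf, "VRs": g_lf_vrs},
#                         {"VLf": s_vlf, "VC": s_vc, "VRf": s_vrf, "VRs": s_vrs})
```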
If the audio image mapping and image compression are constant, one may need to determine the ILD values in EQs. 3 and 4 only once. However, when the image is adapted, either by changing the compression, the cut-off position, or the combination of the images, new ILD mapping values must be determined again.
Apparatus 900 may assume different forms, including discrete logic circuitry, a microprocessor system, or an integrated circuit such as an application specific integrated circuit (ASIC).
As can be appreciated by one skilled in the art, a computer system with an associated computer-readable medium containing instructions for controlling the computer system can be utilized to implement the exemplary embodiments that are disclosed herein. The computer system may include at least one computer such as a microprocessor, digital signal processor, and associated peripheral electronic circuitry.
While the invention has been described with respect to specific examples including presently preferred modes of carrying out the invention, those skilled in the art will appreciate that there are numerous variations and permutations of the above described systems and techniques that fall within the spirit and scope of the invention as set forth in the appended claims.