Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
The present inventions now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, these inventions may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.
It will be appreciated from the following that many types of devices may be used with embodiments of the present invention, including, for example, audio capture and recording devices, recording studio sound systems, sound editing devices and software, audio receivers and like audio synthesized reproduction devices, audio generating devices, video gaming systems, teleconferencing phones, teleconference servers, teleconferencing software systems, speaker phones, radios, boomboxes, satellite radios, headphones, MP3 players, CD players, DVD players, televisions, personal computers, multimedia centers, laptop computers, intercom systems, and other audio products, as well as devices referenced herein as mobile stations, including, for example, mobile phones, personal data assistants (PDAs), gaming systems, and other portable handheld electronics. Further, while embodiments of the present invention are described herein generally with regard to musical and vocal sounds, embodiments of the present invention apply to all types of sound.
Embodiments of the present invention may be described, for example, as extensions of the SIRR or DirAC methods, but may also be applied in similar spatial audio recording-reproduction methods which rely upon a sound signal and spatial information. Notably, however, embodiments of the present invention involve providing at least one sound source with known spatial information for the sound source which may be used for synthesis (reproduction) of the sound source in a manner that preserves or at least partially preserves a perception of the spatial information for the sound source.
As used herein, the term “monophonic input signal” is inclusive of, but not limited to: highly directional (single channel) sound recordings, such as sharply parabolic sound recordings; sound recordings with discrete or nearly-discrete spatial direction; sound recordings where actual spatial information is constrained to a discrete or nearly-discrete spatial direction; sound recordings where actual spatial information is disregarded and replaced by artificially generated spatial information; and, as for example in a virtual gaming environment, a generated sound with a virtual source position and direction. As noted above, any sound source may be interpreted as (made to be) a monophonic input signal by disregarding any known spatial information for an actual (recorded) sound signal and mixing any separate channels, such as taking the W(t) channel from a B-format signal and treating it as a monophonic signal, which can then be associated with generated spatial information.
In one embodiment of the present invention, a monophonic input audio signal (source) is used to synthetically produce a B-format signal which is then analyzed and reproduced using the DirAC technology. A monophonic audio signal may be encoded into a synthesized B-format signal using the following (Ambisonics) coding equation:

    W(t) = (1/√2)·x(t)
    X(t) = x(t)·cos(θ)·cos(φ)
    Y(t) = x(t)·sin(θ)·cos(φ)
    Z(t) = x(t)·sin(φ)

where x(t) is the monophonic input audio signal, θ is the azimuth angle (anti-clockwise angle from center front), φ is the elevation angle, and W(t), X(t), Y(t), and Z(t) are the individual channels of the resulting B-format signal. The multiplier 1/√2 on the W signal is a convention that originates from a desire to achieve a more even level distribution between the four channels, and some references use an approximate value of 0.707 for the multiplier. In effect, the B-format signal may be used to produce a spatial audio simulation from a DirAC formatted signal, as depicted in
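By way of illustration only, the coding equation above may be realized in software, for example, as in the following minimal sketch (assuming the numpy library; the function and variable names are illustrative and not part of the invention):

    import numpy as np

    def encode_bformat(x, azimuth, elevation):
        # Encode a monophonic signal x(t) into a synthesized B-format signal
        # for a source at the given azimuth and elevation angles (in radians).
        W = x / np.sqrt(2.0)                          # ~0.707 level convention on W
        X = x * np.cos(azimuth) * np.cos(elevation)
        Y = x * np.sin(azimuth) * np.cos(elevation)
        Z = x * np.sin(elevation)
        return W, X, Y, Z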
Further, multiple monophonic sources can also be encoded for embodiments of the present invention. The above equation may be applied individually to multiple monophonic sources. The resulting B-format signals may then be individually encoded into separate DirAC signals, and the separate DirAC signals may then be directly encoded, as described further below, into a single DirAC signal. This process is depicted in
Alternatively, the multiple B-format signals resulting from encoding multiple monophonic sources may be mixed (added together, i.e., combined or summed) into a single B-format signal. Because a B-format signal is essentially a representation of the physical sound field and, as such, adheres to the basic superposition principle of linear fields, B-format signals may be mixed, for example, for a four-channel signal, as

    W = W1 + W2 + … + WN
    X = X1 + X2 + … + XN
    Y = Y1 + Y2 + … + YN
    Z = Z1 + Z2 + … + ZN

where N is the number of B-format signals being mixed.
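Because the mixing is simple channel-wise summation, it may be sketched, for example, as follows (assuming numpy arrays of equal length and the tuple layout of the encode_bformat sketch above; the names are illustrative):

    def mix_bformat(bformat_signals):
        # Sum a list of (W, X, Y, Z) B-format signals channel by channel,
        # per the superposition principle of linear fields.
        W = sum(s[0] for s in bformat_signals)
        X = sum(s[1] for s in bformat_signals)
        Y = sum(s[2] for s in bformat_signals)
        Z = sum(s[3] for s in bformat_signals)
        return W, X, Y, Z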
In
If a frequency band is present only in one of the input signals, either in its entirety or over any time segment (ideally selected to be short enough not to impact human perception, such as 10 ms), the spatial parameters for that frequency band may simply be copied from the corresponding individual source input signal into the resulting DirAC formatted signal. However, when the contents of several input signals overlap in frequency and time, the information needs to be combined using more sophisticated techniques. The combination functionality may be based on mathematical identities. For example, the direction-of-arrival angles may be determined using vector algebra to combine the individual angles. Similarly, the diffuseness may be calculated from the number of sound sources, their relative positions, their original diffuseness, and the phase relationships between the signals. Optimally, the combination function may take into account perceptual rules that determine the perceived spatial properties from the attributes of each individual DirAC stream, which makes it possible to employ different combination rules for different frequency regions in much the same manner that human hearing combines sound sources into an aggregate perception, for example, in the case of normal two-channel stereophony. Various computational models of spatial audio perception may be used for this diffuseness calculation.
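One possible combination rule, offered only as a hedged sketch of the vector-algebra approach mentioned above (the energy weighting and the fallback direction are assumptions, not requirements of the invention), may be applied per frequency band and time segment as follows:

    import numpy as np

    def combine_dirac_parameters(energies, directions, diffusenesses):
        # energies: per-source band energies, shape (N,)
        # directions: per-source unit direction-of-arrival vectors, shape (N, 3)
        # diffusenesses: per-source diffuseness values in [0, 1], shape (N,)
        e = np.asarray(energies)
        d = np.asarray(directions)
        psi = np.asarray(diffusenesses)
        # Energy-weighted sum of the non-diffuse portions of the direction vectors:
        v = ((e * (1.0 - psi))[:, None] * d).sum(axis=0)
        total_energy = e.sum()
        length = np.linalg.norm(v)
        direction = v / length if length > 0 else np.array([1.0, 0.0, 0.0])
        # Diffuseness grows as the combined vector shrinks relative to total energy:
        diffuseness = 1.0 - (length / total_energy if total_energy > 0 else 0.0)
        return direction, diffuseness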
Although the frequency analysis may be performed for all the input signals separately, the purpose of the frequency analysis is only to provide the spatial side information; the analysis results will not later be directly converted to an audio signal, except indirectly, during synthesis (reproduction), in the form of spatial cues for perception of the audio signal W(t).
Additional descriptions follow related to more specific applications for embodiments of the present invention.
1. Multichannel Encoding
Conventional multichannel audio content formats are typically horizontal-only systems, where the loudspeaker positions are explicitly defined. Such systems include, for example, all the current 5.1 and 7.1 setups. Multiple source input signals targeted for these systems may be directly encoded into the DirAC format by an embodiment of the present invention by treating the individual channels as synchronized input sound sources with the directional information generated and set according to the optimal loudspeaker positions.
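For example, a 5.1 input may be encoded by assigning each channel a fixed direction, as in the following sketch, which assumes the common ITU-R BS.775 loudspeaker azimuths and the encode_bformat and mix_bformat sketches above (handling of the LFE channel is omitted):

    import numpy as np

    SPEAKER_AZIMUTHS_5_1 = {  # degrees, anti-clockwise from center front
        "C": 0.0, "L": 30.0, "R": -30.0, "Ls": 110.0, "Rs": -110.0,
    }

    def encode_5_1(channels):
        # channels: dict mapping channel name -> synchronized signal array.
        encoded = [encode_bformat(signal, np.radians(SPEAKER_AZIMUTHS_5_1[name]), 0.0)
                   for name, signal in channels.items() if name in SPEAKER_AZIMUTHS_5_1]
        return mix_bformat(encoded)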
2. Stereo-to-Multichannel Up-Mix
Similar to multichannel encoding, in stereo-to-multichannel up-mixing, the two stereo channels are used as multiple source inputs to the encoding system. The direction-of-arrival angles may be set by an embodiment of the present invention according to the standard stereo triangle. Modified angles are also possible for implementing specific effects. A direct encoding system of an embodiment of the present invention may then produce estimates of the perceived sound source locations and the diffuseness. The resulting stream may subsequently be decoded for another loudspeaker system, such as a standard 5.1 setup. Such decoding may result in a relevant center channel signal and distribute the diffuse field to all loudspeakers, including the surround speakers.
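As a minimal sketch (again assuming the encode_bformat and mix_bformat helpers above), the up-mix front end may place the two stereo channels at the standard stereo-triangle angles of plus and minus 30 degrees:

    import numpy as np

    def encode_stereo(left, right):
        # Treat the stereo channels as two sources at +/-30 degrees azimuth.
        encoded_left = encode_bformat(left, np.radians(30.0), 0.0)
        encoded_right = encode_bformat(right, np.radians(-30.0), 0.0)
        return mix_bformat([encoded_left, encoded_right])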
3. Interactive 3-D Audio
Generating interactive audio, such as for games and other interactive applications, may include simulating sound sources in three dimensions, such that sources may be freely positioned in a virtual world with respect to the listener, such as around a virtual player in a video game environment. This may be readily implemented using an embodiment of the present invention. And the techniques of the present invention may also be beneficial for implementing a room effect, which is particularly useful for video games. A room effect normally consists of separate early reflections and diffuse late reverberation. A benefit from an embodiment of the present invention is that a room effect may be created as a monophonic signal with side information describing the spatial distribution of the effect. The early reflections may be created such that they are more diffuse than the direct sound but still may have a well-defined direction-of-arrival. The late reverberation, on the other hand, may be generated with the diffuseness factor set to one, and the decoding system may facilitate actually reproducing the reverb signal as diffuse.
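The side information for such a room effect may be generated, for example, as in the following hedged sketch; the numeric diffuseness value for the early reflections and the data structure are illustrative assumptions only:

    from dataclasses import dataclass

    @dataclass
    class SpatialSideInfo:
        azimuth: float       # radians
        elevation: float     # radians
        diffuseness: float   # 0 = fully directional, 1 = fully diffuse

    def room_effect_side_info(reflection_azimuths):
        # Early reflections: well-defined directions, more diffuse than direct sound.
        early = [SpatialSideInfo(az, 0.0, 0.3) for az in reflection_azimuths]
        # Late reverberation: diffuseness set to one; the direction is irrelevant.
        late = SpatialSideInfo(0.0, 0.0, 1.0)
        return early, late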
4. Spatial Audio Teleconferencing
Spatial audio may also be used in teleconferencing applications, for example, to make it easier to distinguish between multiple participants on a teleconference, particularly when multiple participants are talking simultaneously. The DirAC format may be used for teleconferencing applications, as teleconferencing typically requires transmitting just one actual audio signal, with the spatial information communicated as side information. As such, the DirAC format is also fully mono-compatible. For a teleconference application, the DirAC format may be employed by directly recording speech from participants on a teleconference using, for example, a SoundField microphone, when multiple persons are present in the same acoustical space.
However, for a multi-party teleconference, a resulting DirAC signal could be produced, for example, in a teleconference server system, using multiple signals from the individual conference participants as multiple sound source inputs to an embodiment of the present invention. This adaptation may easily be employed with existing conference systems because the sound signals delivered in the system could be exactly the same as those currently delivered; only the spatial information would additionally need to be generated and transmitted as spatial side information.
With regard to generating spatial information for teleconferencing applications, and similarly for applications such as Internet phoning and voice chatting, 3-way calling, chat rooms having audio capabilities such as computer generated sounds and voices for participants, Internet gaming environments such as virtual poker tables and virtual roulette tables, and like electronic environments, software applications, and scenarios conveying communication in any audio format associated with any real or virtual aspect of the system, the generation of spatial information may be used to represent sound source locations to facilitate a user distinguishing the origin of the sound. For example, if spatial information is known for a particular sound source, that spatial information may be used, in whole or in part, and either as in reality or by a relative representation, by an embodiment of the present invention in representing that sound source. For example, if telephone conference participants are located in California, New York, and Texas, spatial information may be generated to identify the participants at their geographic positions on a map with respect to each other, as where the Texas listener perceives the California participant to the left (west) and the New York participant to the front-right (northeast). An additional telephone conference participant located in Florida may be associated with spatial information such that the Texas listener perceives the Florida participant to the right (east). Other geographic, topographic, and like positional representations of reality may be similarly used. Alternatively, virtual positional representations may be implemented by embodiments of the present invention. For example, if locations are unknown or not intended to be used, a telephone conferencing system operating in accordance with the present invention may place the participants at separate locations about a closed surface or closed perimeter, such as a ring or sphere. Further, for example, if a teleconference involves four participants, each participant may be virtually located at, and their sound source associated with generated spatial information related to, four equidistant locations about the ring. If a fifth teleconference participant is involved and, for example, designated as the lead person for the teleconference, the fifth participant may be virtually located at, and his or her sound source associated with generated spatial information related to, a point in space located above the ring (i.e., orthogonal to the plane in which the ring exists). Similarly, the sound sources for participants of a virtual roulette table could be associated with spatial information related to the positions of the participants about the circumference of the virtual roulette table.
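The virtual ring placement described above may be sketched, for example, as follows (the function name and the choice of placing the lead participant directly overhead are illustrative assumptions):

    import numpy as np

    def ring_positions(num_participants, lead_participant=False):
        # Return (azimuth, elevation) pairs, in radians, usable as generated
        # spatial side information for each participant's sound source.
        positions = [(2.0 * np.pi * i / num_participants, 0.0)
                     for i in range(num_participants)]
        if lead_participant:
            # Place the lead above the ring, orthogonal to its plane.
            positions.append((0.0, np.pi / 2.0))
        return positions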
One of ordinary skill in the art will recognize that the present invention may be incorporated into hardware and software systems and subsystems, combinations of hardware systems and subsystems and software systems and subsystems, and incorporated into network systems and wired remote locations and wireless mobile stations thereof. In each of these systems and mobile stations, as well as other systems capable of using a system or performing a method of the present invention as described above, the system and mobile station generally may include a computer system including one or more processors that are capable of operating under software control to provide the techniques described above.
Computer program instructions for software control for embodiments of the present invention may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create means for implementing the functions described herein. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions described herein. It will also be understood that each element, and combinations of elements, may be implemented by hardware-based computer systems, software computer program instructions, or combinations of hardware and software which perform the specified functions or steps described herein.
Reference is now made to
As shown, the entity 40, which is capable of operating in accordance with an embodiment of the present invention to directly encode into a directional audio coding format, can generally include a processor, controller, or the like 42 connected to a memory 44. The memory 44 can include volatile and/or non-volatile memory and typically stores content, data, or the like. For example, the memory 44 typically stores computer program code such as software applications or operating systems, instructions, information, data, content, or the like for the processor 42 to perform steps associated with operation of the entity in accordance with embodiments of the present invention. Also, for example, the memory 44 typically stores content transmitted from, or received by, the entity 40. Memory 44 may be, for example, random access memory (RAM), a hard drive, or other fixed data memory or storage device. The processor 42 may receive input from an input device 50 and may display information on a display 48. The processor can also be connected to at least one interface 46 or other means for transmitting and/or receiving data, content, or the like. Where the entity 40 provides wireless communication, such as in a Bluetooth network, a wireless LAN network, or other mobile network, the processor 42 may operate with a wireless communication subsystem of the interface 46. One or more processors, memory, storage devices, and other computer elements may be used in common by a computer system and subsystems, as part of the same platform, or processors may be distributed between a computer system and subsystems, as parts of multiple platforms.
The mobile device includes an antenna 47, a transmitter 48, a receiver 50, and a controller 52 that provides signals to and receives signals from the transmitter 48 and receiver 50, respectively. These signals include signaling information in accordance with the air interface standard of the applicable cellular system and also user speech and/or user generated data. In this regard, the mobile device may be capable of operating with one or more air interface standards, communication protocols, modulation types, and access types. More particularly, the mobile device may be capable of operating in accordance with any of a number of second-generation (2G), 2.5G and/or third-generation (3G) communication protocols or the like. Further, for example, the mobile device may be capable of operating in accordance with any of a number of different wireless networking techniques, including Bluetooth, IEEE 802.11 WLAN (or Wi-Fi®), IEEE 802.16 WiMAX, ultra wideband (UWB), and the like.
It is understood that the controller 52, such as a processor or the like, includes the circuitry required for implementing the video, audio, and logic functions of the mobile device. For example, the controller may be comprised of a digital signal processor device, a microprocessor device, and various analog to digital converters, digital to analog converters, and other support circuits. The control and signal processing functions of the mobile device are allocated between these devices according to their respective capabilities. The controller 52 thus also includes the functionality to convolutionally encode and interleave messages and data prior to modulation and transmission. The controller 52 can additionally include an internal voice coder (VC) 52A, and may include an internal data modem (DM) 52B. Further, the controller 52 may include the functionality to operate one or more software applications, which may be stored in memory. For example, the controller may be capable of operating a connectivity program, such as a conventional Web browser. The connectivity program may then allow the mobile station to transmit and receive Web content, such as according to HTTP and/or the Wireless Application Protocol (WAP), for example.
The mobile device may also comprise a user interface including, for example, a conventional earphone or speaker 54, a ringer 56, a microphone 60, and a display 62, all of which are coupled to the controller 52. The user input interface, which allows the mobile device to receive data, can comprise any of a number of devices allowing the mobile device to receive data, such as a keypad 64, a touch display (not shown), a microphone 60, or other input device. In embodiments including a keypad, the keypad can include the conventional numeric (0-9) and related keys (#, *), and other keys used for operating the mobile device, and may include a full set of alphanumeric keys or a set of keys that may be activated to provide a full set of alphanumeric keys. Although not shown, the mobile device may include a battery, such as a vibrating battery pack, for powering the various circuits that are required to operate the mobile device, as well as optionally providing mechanical vibration as a detectable output.
The mobile device can also include memory, such as a subscriber identity module (SIM) 66, a removable user identity module (R-UIM) (not shown), or the like, which typically stores information elements related to a mobile subscriber. In addition to the SIM, the mobile device can include other memory. In this regard, the mobile device can include volatile memory 68, as well as other non-volatile memory 70, which may be embedded and/or may be removable. For example, the other non-volatile memory may be embedded or removable multimedia memory cards (MMCs), Memory Sticks as manufactured by Sony Corporation, EEPROM, flash memory, hard disk, or the like. The memory can store any of a number of pieces or amount of information and data used by the mobile device to implement the functions of the mobile device. For example, the memory can store an identifier, such as an international mobile equipment identification (IMEI) code, international mobile subscriber identification (IMSI) code, mobile station integrated services digital network (MSISDN) code, or the like, capable of uniquely identifying the mobile device. The memory can also store content. The memory may, for example, store computer program code for an application and may store an update for computer program code for the mobile device.
In addition, the mobile device may include one or more audio decoders 82, such as a “G-format” decoder, AC-3 decoder, DTS decoder, MPEG-2 decoder, MLP DVD-A decoder, SACD decoder, DVD-Video disc decoder, Ambisonic decoder, UHJ decoder, and like audio decoders capable of decoding a DirAC stream for such output as the 5.1 G-format, stereo format, and other multi-channel audio reproduction setups. The one or more audio decoders 82 may be capable of transmitting the resulting spatially representative sound signals to a loudspeaker system 86 having one or more loudspeakers 84 for synthesized reproduction of a natural or an artificial spatial sound environment.
Provided herein are improved systems, methods, and computer program products for direct encoding of spatial sound into a directional audio coding format. The direct encoding may also include providing spatial information for a monophonic sound source. The direct encoding of spatial information may be used, for example, in interactive audio applications such as gaming environments and in teleconferencing applications such as multi-party teleconferencing.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.