The present disclosure relates to surround audio (i.e. surround sound) and more specifically to systems and methods for remixing audio for a physical surround sound system to three-dimensional audio (3D audio) corresponding to virtual speakers in a virtual environment.
Surround sound includes presenting multiple audio channels over speakers arranged in a layout to provide a user positioned in the layout with an immersive audio experience. The audio for each channel may be mixed so that some sounds are presented (i.e., played) at certain speakers.
In at least one aspect, the present disclosure generally describes a method for remixing an audiovisual (AV) presentation. The method includes receiving the AV presentation that includes a video portion and a real surround sound soundtrack. The real surround sound soundtrack includes multiple audio channels configured for playback on real speakers in a real surround sound setup that is arranged according to a surround sound setup specification. The method further includes defining virtual locations of virtual speakers in a virtual surround sound setup based on the surround sound setup specification. The method further includes modifying a virtual location of one of the virtual speakers according to a location of a first speaking character in the video portion of the AV presentation. The method further includes remixing the multiple audio channels based on the virtual locations of the virtual speakers and based on the modified virtual location of one of the virtual speakers to generate 3D audio corresponding to the virtual speakers playing the multiple audio channels in a virtual surround sound setup. The method further includes updating the AV presentation to include a virtual surround sound soundtrack including the 3D audio.
In a possible implementation of the method above, remixing the multiple audio channels includes operations performed for each of the multiple audio channels. The operations include splitting the audio channel into a left channel and a right channel. The operations further include receiving a corresponding virtual location for the audio channel. The operations further include adjusting one or more of an adjustable filter, an adjustable delay, and an adjustable amplifier/attenuator in the left channel and the right channel according to the corresponding virtual location to create one or more of a relative filter difference, a relative delay difference, and a relative gain/attenuation difference between the left channel and the right channel so that the 3D audio sounds to a user as being from the corresponding virtual location. Further, the operations can include combining the respective left channels and the respective right channels to create 3D audio that sounds to a user as being from a virtual surround sound setup after 3D audio is created for each of the multiple audio channels.
In another possible implementation of the method above, the modifying the virtual location of one of the virtual speakers to a location of a first speaking character in the video portion of the AV presentation includes selecting a dialog channel from the multiple audio channels and analyzing the dialog channel to identify speech. The modifying a virtual location further includes analyzing the video portion of the AV presentation to recognize gestures corresponding to the speech, locating the first speaking character in the video portion of the AV presentation based on the gestures corresponding to the speech, and modifying the virtual location of a virtual center speaker to playback the dialog channel at the location of the first speaking character.
In another possible implementation of the method above, the method further includes playing the AV presentation including the virtual surround sound soundtrack to a user, sensing a position/orientation of the user, and adjusting the virtual location of the virtual speakers in relation to the user according to the sensed position/orientation.
In another aspect, the present disclosure generally describes a method for remixing an audiovisual (AV) presentation. The method includes receiving the AV presentation, which includes a video portion and a real surround sound soundtrack. The real surround sound soundtrack includes multiple audio channels configured for playback on real speakers in a real surround sound setup. The real surround sound setup is arranged according to a surround sound setup specification. The method further includes selecting a dialog channel from the multiple audio channels and analyzing the dialog channel to identify speech from multiple speaking characters. The method further includes creating a plurality of new dialog channels for each of the multiple speaking characters, where each new dialog channel includes the speech from one of the multiple speaking characters. The method further includes determining locations for each of the multiple speaking characters in the video portion of the AV presentation and defining virtual locations of virtual dialog speakers for playback of the plurality of new dialog channels, the virtual locations of the virtual dialog speakers each corresponding to a location of one of the multiple speaking characters in the video portion of the AV presentation. The method further includes determining virtual locations of other virtual speakers for playback of the multiple audio channels not selected as the dialog channel, where each virtual location of each other virtual speaker corresponds to a surround sound setup specification. The method further includes remixing the multiple audio channels, including the plurality of new dialog channels, based on the virtual locations to generate 3D audio and updating the AV presentation to include a virtual surround sound soundtrack that includes the 3D audio.
In another aspect, the present disclosure generally describes a system for presenting an audiovisual (AV) presentation. The system includes a screen configured to display a video portion of the AV presentation to a user and an audio device worn by the user. The audio device includes a left speaker and a right speaker that are configured to play a virtual surround sound soundtrack including 3D audio corresponding to a virtual surround sound setup that includes a virtual speaker having a virtual location that tracks a speaking character on the screen. The audio device further includes a sensor configured to sense a position of the user relative to the screen and a processor configured to update the 3D audio so that the virtual location of the virtual speaker is adjusted based on the position of the user relative to the screen.
The foregoing illustrative summary, as well as other exemplary objectives and/or advantages of the disclosure, and the manner in which the same are accomplished, are further explained within the following detailed description and its accompanying drawings.
The components in the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding parts throughout the several views.
For years audiovisual (AV) presentations (e.g., films, movies, videos, animations, shows, multimedia, media, etc.) have included a real surround sound soundtrack including multiple audio tracks mixed for presentation in a viewing location (e.g., theater, auditorium, hall, room, home theater, etc.) with real speakers (i.e., physical speakers) arranged according to a surround-sound specification. A user positioned at the viewing location in the real surround sound setup (i.e., physical surround sound setup) may view the AV presentation visually on a screen while the multi-channel audio may be presented at various real (i.e., physical) speakers around the user so that the user has a more immersive experience. Lately, more users may view an AV presentation using ear-worn audio devices (e.g., earbuds, headphones, etc.). An opportunity exists for improving the mix of the audio presented on the ear-worn audio devices (i.e., ear-worn devices) to not only replicate the immersive experience of previous (i.e., traditional) surround sound systems, but to further enhance the immersive experience based on information obtained from video content and/or from a user/environment. The present disclosure describes systems and methods to generate/provide this enhanced immersive audio experience, and the systems and methods may be applied to previously recorded and mixed AV presentations (e.g., movies with 5.1 surround sound) to improve an immersive quality of the audio when the audio is played using ear-worn audio devices.
The present disclosure is not limited to any particular stereo setup or surround-sound setup. For example, a surround sound setup may include five speakers surrounding a listener and a low-frequency effects (LFE) speaker (e.g., 5.1 surround sound) or seven speakers surrounding a listener and an LFE speaker (e.g., 7.1 surround sound). Additionally, the surround sound setup may use speakers (or reflectors) to create sounds from above a user (e.g., ATMOS surround sound). In the present disclosure, a 5.1 surround sound setup for a home theater is described because of its ubiquity. For example, many films (i.e., movies) include a 5.1 surround-sound soundtrack. The principles and technology disclosed, however, can be applied to other surround sound systems, other AV presentations, and other venues.
The screen 120 of the surround-sound layout 100 may be of various types and sizes. A width of the screen may correspond to an angular field of view 130 of the user that extends past the center speaker. As a result, in the surround-sound layout 100, a person speaking at a left side of the screen or a right side of the screen may appear at an angle off the user's line of sight 117. Audio signals corresponding to each speaker, however, may be transmitted along the same direction (i.e., along the user's line of sight 117) by the center speaker 101. In other words, an audio impression created by the setup may not be well aligned with a visual impression created by the setup because a speaker at a left or right side of the screen can have audio primarily transmitted by the center channel (i.e., along the user's line of sight 117).
Each speaker in the surround sound setup (i.e., surround sound system) may play a different channel of audio. In other words, an AV presentation may have audio mixed with the surround-sound setup in mind so that certain sounds play on certain speakers. Accordingly, the speakers in the surround-sound setup may have different operating characteristics (e.g., frequency response, dynamic range, etc.) to handle corresponding audio signals for each speaker. In some implementations, a speaker in the surround-sound setup may include several sub-speakers. For example, the center speaker may include one or more of a woofer sub-speaker configured for low frequency audio signals, a midrange sub-speaker configured for mid-range frequency audio signals, and a tweeter sub-speaker configured for high frequency audio signals.
The center signal 220 can be mixed, when the AV presentation is recorded, as a single channel in a surround-sound mix of audio channels (e.g., see
The 3D audio mixer 305 is configured to receive audio 301 that can be from a channel of a multi-channel audio recording. For example, the audio 301 can be one audio channel (i.e., track) of a 5.1 surround sound soundtrack. To remix a multi-channel audio recording, each channel may be applied to the 3D audio mixer 305 in series to obtain a set of 3D audio channels (i.e., tracks). Alternatively, each channel may be applied to a corresponding 3D audio mixer in parallel to obtain the set of 3D audio channels. After remixing, the left channels of the set of 3D audio channels can be combined for playing on a left speaker and the right channels of the set of 3D audio channels can be combined for playing on a right speaker. In some implementations only a subset of the multi-channel audio tracks are remixed to create the set of 3D audio channels. In these implementations, the remixed audio tracks can be combined with the original multi-channel tracks to form the 3D audio played on the L/R speakers of a user's listening device.
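By way of a non-limiting illustration, the combining step described above may be sketched as follows. The sketch assumes each remixed 3D audio channel is already available as a pair of left/right sample lists; the function name `combine_binaural_channels` and the `Stereo` type are hypothetical and are not part of the disclosure.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (left samples, right samples) for one remixed channel.
Stereo = Tuple[List[float], List[float]]


def combine_binaural_channels(channels: Sequence[Stereo]) -> Stereo:
    """Sum all left signals into one left output and all right signals into one right output."""
    # Output is as long as the longest input signal; shorter signals are treated as zero-padded.
    length = max((max(len(l), len(r)) for l, r in channels), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for ch_left, ch_right in channels:
        for i, sample in enumerate(ch_left):
            left[i] += sample
        for i, sample in enumerate(ch_right):
            right[i] += sample
    return left, right
```

In practice the summed outputs may also need normalization or limiting to avoid clipping; that step is omitted from this sketch.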
The 3D audio mixer 305 may be configured with a splitter at the input to form the left channel (L) and the right channel (R). The left channel may include one or more of a left-channel adjustable filter 310 (i.e., filter_L), a left-channel adjustable delay 312 (i.e., delay_L), and a left-channel adjustable amplifier/attenuator 314 (i.e., gain_L). Likewise, the right channel may include one or more of a right-channel adjustable filter 311 (i.e., filter_R), a right-channel adjustable delay 313 (i.e., delay_R), and a right-channel adjustable amplifier/attenuator 315 (i.e., gain_R).
The filter, delay, and/or gain/attenuation of the left channel may be adjusted differently from the filter, delay, and/or gain/attenuation of the right channel so that the left and right channels have a binaural difference that includes a relative filtering difference, a relative delay difference, and a relative gain/attenuation difference. The binaural difference may be generated to map the 3D audio to a desired 3D location in the virtual environment 350. Accordingly, the 3D audio mixer 305 may adjust the filter, delay, and/or gain/attenuation for the left and right channels based on location information. The location information may include virtual sound source location information 305 and/or AV presentation location information 304.
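As a minimal, non-limiting sketch of how a relative delay difference and a relative gain difference might be derived from a location, the following example uses the well-known Woodworth spherical-head approximation for the interaural time difference and a simple sine-scaled level difference. The head radius, speed of sound, maximum level difference, and function names are all assumptions introduced for illustration; the adjustable filters (e.g., head-related filtering) are omitted, and the approximation is only intended for frontal azimuths between -90 and 90 degrees.

```python
import math
from typing import List, Tuple

HEAD_RADIUS_M = 0.0875      # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air


def binaural_params(azimuth_deg: float, sample_rate: int = 48000,
                    max_ild_db: float = 6.0) -> Tuple[int, int, float, float]:
    """Return (delay_L, delay_R, gain_L, gain_R); positive azimuth is to the listener's right."""
    theta = math.radians(min(abs(azimuth_deg), 90.0))
    # Woodworth approximation of the interaural time difference, in seconds.
    itd_s = HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))
    delay_far = round(itd_s * sample_rate)      # delay in samples for the far ear
    gain_far = 10 ** (-(max_ild_db * math.sin(theta)) / 20)  # attenuation for the far ear
    if azimuth_deg >= 0:                        # source on the right: left ear is the far ear
        return delay_far, 0, gain_far, 1.0
    return 0, delay_far, 1.0, gain_far


def render_binaural(mono: List[float], azimuth_deg: float,
                    sample_rate: int = 48000) -> Tuple[List[float], List[float]]:
    """Split a mono channel into left/right signals with the relative delay and gain applied."""
    d_l, d_r, g_l, g_r = binaural_params(azimuth_deg, sample_rate)
    total = len(mono) + max(d_l, d_r)
    left = [0.0] * d_l + [g_l * s for s in mono]
    right = [0.0] * d_r + [g_r * s for s in mono]
    left += [0.0] * (total - len(left))         # pad both outputs to the same length
    right += [0.0] * (total - len(right))
    return left, right
```

A source at zero azimuth yields identical left and right signals, while a source to the right yields a delayed, attenuated left signal.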
The virtual sound source location information 305 may include locations that correspond to a surround sound setup specification. For example, the locations in a 5.1 surround-sound setup may be defined based on angles as described previously. Further the virtual sound source location information may include assumptions that include (but are not limited to) a relative position of a user, a relative height of a virtual speaker, a range between a user and a virtual speaker, etc. (e.g., see
The techniques and methods disclosed can be used to remix multi-channel audio to create a virtual replica of a surround sound setup with fixed position speakers (i.e., virtual surround sound setup). In other words, a virtual surround-sound setup having virtual speakers in virtual locations may be generated by remixing.
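For illustration only, fixed virtual speaker locations for a 5.1 layout might be tabulated and converted to coordinates as sketched below. The azimuth values (0, ±30, and ±110 degrees), the default range and height, and the names `LAYOUT_5_1` and `virtual_positions` are assumptions chosen to resemble common 5.1 placement guidance; they are not values taken from the present disclosure.

```python
import math
from typing import Dict, Tuple

# Assumed azimuths in degrees: 0 is straight ahead, positive is to the listener's right.
# The LFE channel is treated as non-directional and placed in front for simplicity.
LAYOUT_5_1: Dict[str, float] = {
    "C": 0.0, "L": -30.0, "R": 30.0, "Ls": -110.0, "Rs": 110.0, "LFE": 0.0,
}


def virtual_positions(layout: Dict[str, float], range_m: float = 2.0,
                      height_m: float = 0.0) -> Dict[str, Tuple[float, float, float]]:
    """Map each speaker name to (x, y, z): x to the right, y forward, z up, listener at origin."""
    positions = {}
    for name, azimuth_deg in layout.items():
        a = math.radians(azimuth_deg)
        positions[name] = (range_m * math.sin(a), range_m * math.cos(a), height_m)
    return positions
```

The resulting coordinates could then serve as the virtual sound source location information supplied to a mixer for each channel.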
AV presentations may include soundtracks for real (i.e., physical) surround sound setups that each include multiple audio channels configured for playback on a real surround sound setup arranged according to a surround sound setup specification (e.g., 5.1 surround sound). The present disclosure describes systems and methods to remix the soundtracks of AV presentations into 3D audio for playback over stereo devices, such as ear-worn devices, to create a virtual surround sound soundtrack. When the virtual surround sound soundtrack is played, a listener (i.e., user) can perceive the multiple audio channels being played back on a virtual surround sound setup that resembles the real surround sound setup. In other words, a soundtrack for a real surround sound setup (i.e., surround sound soundtrack for a physical surround sound setup) may be modified or replaced with a soundtrack for a virtual surround sound setup (i.e., virtual surround sound soundtrack).
The disclosed techniques and methods may enhance an immersive audio experience by making one or more of the virtual speakers movable and adjusting the virtual location of the one or more virtual speakers based on a video portion of the AV presentation. Returning to
In virtual surround sound with single-character-tracked audio, the virtual location of the virtual center speaker 401 may be adjusted based on content of the AV presentation. For example, the virtual location of the center virtual speaker 401 may be adjusted within a virtual area 421 corresponding to the screen 420. The virtual location may be adjusted (e.g., in real time) to follow an action on the screen 420. In a possible implementation, the action is a character of the AV presentation speaking on the screen. In this case, the virtual location in the virtual area 421 may be selected so that the virtual location of the center virtual speaker 401 corresponds to the character's location on the screen. Further, the virtual location of the center virtual speaker 401 may be adjusted to track (i.e., follow) the character as the character changes location on the screen. This form of virtual surround sound may be called enhanced virtual surround sound, with the enhancement being single-character-tracked audio, though other sound sources besides a character can be tracked as well.
The method 600 may further include analyzing the video of the AV presentation to identify 630 a speaker. The identification may be performed periodically or at different intervals (e.g., scene by scene) during the AV presentation and may include using face recognition and/or speech recognition. A character of the AV presentation may be determined as speaking by analyzing video content of the AV presentation to determine gestures corresponding to speaking (e.g., lip movement). A character may be selected for tracking based on a criterion, such as how much the character speaks, which can be determined scene by scene. For example, in a scene with one character, the analysis may select the character for tracking automatically. In a scene with multiple characters the analysis may select the character speaking the most for tracking.
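As one hedged illustration of the selection criterion described above, the character speaking the most in a scene could be chosen from a list of detected speaking segments. The segment format `(character, start_s, end_s)` and the function name `select_tracked_character` are hypothetical; how the segments are detected (e.g., lip-movement analysis) is outside this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def select_tracked_character(
        segments: Iterable[Tuple[str, float, float]]) -> Optional[str]:
    """Return the character with the largest total speaking time in a scene, or None."""
    totals: Dict[str, float] = defaultdict(float)
    for character, start_s, end_s in segments:
        totals[character] += max(0.0, end_s - start_s)  # ignore malformed segments
    if not totals:
        return None
    return max(totals, key=totals.get)
```

With a single character in the scene, the function simply returns that character, matching the automatic selection described above.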
The identified speaking character may then be located within the screen (i.e., within a frame presented on the screen). The locating may include receiving device specifications so that the location on the screen may be correlated with a location in the virtual environment.
The dialog channel may be projected 640 to its corresponding virtual location. The virtual location of the dialog channel (i.e., virtual center speaker) may be determined based on the location of the speaking character on the screen (i.e., screen location) and the virtual location of the virtual center speaker in the virtual surround sound layout (i.e., virtual sound source location information). The dialog channel may be remixed into spatial audio conveying the perception that the virtual center speaker is located at the virtual location of the dialog channel.
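A minimal sketch of correlating a screen location with a virtual location is given below. It assumes the character's horizontal position is available as a normalized coordinate (0 at the left edge of the frame, 1 at the right edge) and that the screen width and viewing distance are known from device specifications or assumed; the function name `screen_to_azimuth` and its parameters are hypothetical.

```python
import math


def screen_to_azimuth(x_norm: float, screen_width_m: float,
                      viewing_distance_m: float) -> float:
    """Convert a normalized horizontal screen position to an azimuth in degrees.

    0 degrees is the user's line of sight through the screen center;
    positive values are to the user's right.
    """
    offset_m = (x_norm - 0.5) * screen_width_m  # lateral offset from screen center
    return math.degrees(math.atan2(offset_m, viewing_distance_m))
```

For example, a character at the right edge of a 2 m wide screen viewed from 1 m away maps to an azimuth of about 45 degrees, which could then be used as the virtual location of the virtual center speaker.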
In a possible implementation, the remixing 625 may include combining the remixed audio channels into a left channel and a right channel for a stereo device, such as an ear-worn device. The AV presentation may then be updated 645 to include the 3D audio soundtrack. The updating may include replacing the multi-channel audio soundtrack of the AV presentation with the 3D audio soundtrack or can include adding the 3D audio soundtrack to the AV presentation. The updated AV presentation may then be stored on a medium for retrieval and playback on a user's device.
In some implementations it may be desirable to track multiple characters on the screen so that the virtual center speaker may be moved from speaker to speaker in a scene. In this case the dialog channel may include a plurality of speaking characters so simply moving the virtual center speaker to a location of a speaker may be impossible or may provide an experience that is not immersive. Instead, the present disclosure describes systems and methods that can be configured to divide the dialog channel into different dialog channels and then spatially project the different dialog channels to different virtual locations in the 3D audio.
In some implementations, the remixing may include adjusting a volume for each audio track based on an estimated virtual range of the audio source to a user. For example, the first character 711 may be perceived as closer to the user 701 than the second character 721 (e.g., because a first size 712 of the first character is larger than a second size 722 of a second character). Accordingly, in some implementations, the first audio track may include a higher volume than the second audio track. This form of virtual surround sound may be called enhanced virtual surround sound, with the enhancement being multi-character-tracked audio, though other sound sources besides characters can be tracked as well. In some implementations, portions of a soundtrack of an AV presentation may be remixed to include enhanced virtual surround sound with single-character-tracked audio, while other portions of the soundtrack of the AV presentation may be remixed to include enhanced virtual surround sound with multi-character-tracked audio.
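The range-based volume adjustment may be sketched, under simplifying assumptions, as follows. The sketch assumes that a character's apparent size on screen is inversely proportional to its virtual range and that volume follows an inverse-distance law; both assumptions, along with the reference values and function names, are introduced here for illustration only.

```python
def range_from_size(apparent_height_px: float, reference_height_px: float,
                    reference_range_m: float = 2.0) -> float:
    """Estimate virtual range, assuming apparent size is inversely proportional to range."""
    return reference_range_m * reference_height_px / apparent_height_px


def distance_gain(range_m: float, reference_range_m: float = 2.0,
                  min_range_m: float = 0.25) -> float:
    """Linear gain under an assumed inverse-distance law; unity gain at the reference range."""
    return reference_range_m / max(range_m, min_range_m)  # clamp to avoid extreme gains
```

Under these assumptions, a character appearing twice as tall as the reference is estimated to be at half the reference range and receives twice the linear gain, so a larger first character would be rendered louder than a smaller second character.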
The enhanced virtual surround sound with single character tracked audio or the enhanced virtual surround sound with multiple character tracked audio may be generated before a user views the AV presentation based on the assumption that the user is within (e.g., centered within) the virtual surround sound setup. In these implementations, no information regarding the user's position is required, and the perceived virtual speaker positions may move with the user's head. For example, the center virtual speaker may remain virtually in front of the user even as the user's head is turned to the side.
A more immersive virtual experience may be created based on information regarding a user by adjusting the remixed 3D audio in real time to match a user's changing position/orientation relative to a virtual surround sound setup during playback. For example, the center virtual speaker may be perceived as moving to one side as the user's head is turned to the side. In these implementations, the enhanced virtual surround sound with single character tracked audio or the enhanced virtual surround sound with multiple character tracked audio may be adjusted in real time based on sensing a user, a user's device, and/or an environment. This form of virtual surround sound may be called enhanced virtual surround sound, with the enhancement being user-tracked adjustments applied to the single-character-tracked audio or the multi-character-tracked audio.
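As a simple, non-limiting sketch of a user-tracked adjustment, the azimuth of a virtual speaker relative to the user's head may be recomputed from a sensed head yaw. The function name `relative_azimuth` and the sign convention (positive angles to the right) are assumptions; only rotation about the vertical axis is considered.

```python
def relative_azimuth(speaker_azimuth_deg: float, head_yaw_deg: float) -> float:
    """Azimuth of a virtual speaker relative to the head, wrapped to [-180, 180) degrees."""
    return (speaker_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
```

For instance, when the user's head turns 30 degrees to the right, a center virtual speaker at 0 degrees is rendered at -30 degrees (to the user's left), so the speaker is perceived as staying fixed relative to the screen.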
In the specification and/or figures, typical embodiments have been disclosed. The present disclosure is not limited to such exemplary embodiments. The use of the term “and/or” includes any and all combinations of one or more of the associated listed items. The figures are schematic representations and so are not necessarily drawn to scale. Unless otherwise noted, specific terms have been used in a generic and descriptive sense and not for purposes of limitation.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure. As used in the specification, and in the appended claims, the singular forms “a,” “an,” “the” include plural referents unless the context clearly dictates otherwise. The term “comprising” and variations thereof as used herein is used synonymously with the term “including” and variations thereof and are open, non-limiting terms. The terms “optional” or “optionally” used herein mean that the subsequently described feature, event or circumstance may or may not occur, and that the description includes instances where said feature, event or circumstance occurs and instances where it does not. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, an aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another aspect. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint.
While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the implementations. It should be understood that they have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the apparatus and/or methods described herein may be combined in any combination, except mutually exclusive combinations. The implementations described herein can include various combinations and/or sub-combinations of the functions, components and/or features of the different implementations described.
It will be understood that, in the foregoing description, when an element is referred to as being on, connected to, electrically connected to, coupled to, or electrically coupled to another element, it may be directly on, connected or coupled to the other element, or one or more intervening elements may be present. In contrast, when an element is referred to as being directly on, directly connected to or directly coupled to another element, there are no intervening elements present. Although the terms directly on, directly connected to, or directly coupled to may not be used throughout the detailed description, elements that are shown as being directly on, directly connected or directly coupled can be referred to as such. The claims of the application, if any, may be amended to recite exemplary relationships described in the specification or shown in the figures.
As used in this specification, a singular form may, unless definitely indicating a particular case in terms of the context, include a plural form. Spatially relative terms (e.g., over, above, upper, under, beneath, below, lower, and so forth) are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. In some implementations, the relative terms above and below can, respectively, include vertically above and vertically below. In some implementations, the term adjacent can include laterally adjacent to or horizontally adjacent to.
Number | Date | Country | |
---|---|---|---|
20220360933 A1 | Nov 2022 | US |