The present invention generally relates to spatial audio rendering techniques, namely systems and methods for automatically changing the rendering of spatial audio based on user input.
Loudspeakers, colloquially “speakers,” are devices that convert an electrical audio input signal, or audio signal, into a corresponding sound. Speakers are typically housed in an enclosure which may contain multiple speaker drivers. In this case, the enclosure containing multiple individual speaker drivers may itself be referred to as a speaker, and the individual speaker drivers inside can then be referred to as “drivers.” Drivers that output high frequency audio are often referred to as “tweeters.” Drivers that output mid-range frequency audio can be referred to as “mids” or “mid-range drivers.” Drivers that output low frequency audio can be referred to as “woofers.” When describing the frequency of sound, these three bands are commonly referred to as “highs,” “mids,” and “lows.” In some cases, lows are also referred to as “bass.”
Audio tracks are often mixed for a particular speaker arrangement. The most basic recordings are meant for reproduction on a single speaker, a format now called “mono.” Mono recordings have a single audio channel. Stereophonic audio, colloquially “stereo,” is a method of sound reproduction that creates an illusion of multi-directional audible perspective by using a known, two-speaker arrangement coupled with an audio signal recorded and encoded for stereo reproduction. Stereo encodings contain a left channel and a right channel, and assume that the ideal listener is at a particular point equidistant from a left speaker and a right speaker. However, stereo provides a limited spatial effect because typically only two front-firing speakers are used. Reproducing stereo on fewer or more than two loudspeakers can result in suboptimal rendering due to downmixing or upmixing artifacts, respectively.
Immersive formats now exist that require a much larger number of speakers and associated audio channels in an attempt to correct the limitations of stereo. These higher channel count formats are often referred to as “surround sound.” There are many different speaker configurations associated with these formats such as, but not limited to, 5.1, 7.1, 7.1.4, 10.2, 11.1, and 22.2. However, a problem with these formats is that they require a large number of speakers to be configured correctly and placed in prescribed locations. If the speakers are offset from their ideal locations, the audio rendering/reproduction can degrade significantly. In addition, systems that employ a large number of speakers often do not utilize all of the speakers when rendering channel-based surround sound audio encoded for fewer speakers.
Audio recording and reproduction technology has consistently striven for a higher fidelity experience. The ability to reproduce sound as if the listener were in the room with the musicians has been a key promise that the industry has attempted to fulfill. However, to date, the highest fidelity spatially accurate reproductions have come at the cost of large speaker arrays that must be arranged in a particular orientation with respect to the ideal listener location. Systems and methods described herein can ameliorate these problems and provide additional functionality by applying spatial audio reproduction principles to spatial audio rendering.
Systems and methods for spatial audio rendering using spatialization shaders in accordance with embodiments of the invention are illustrated. One embodiment includes a spatial audio system, including a plurality of loudspeakers capable of rendering spatial audio, where each loudspeaker includes at least one driver, a processor, and a memory containing a spatial audio rendering application, where the spatial audio rendering application directs the processor to obtain a plurality of audio stems, obtain a position and a rotation of each loudspeaker in the plurality of loudspeakers, obtain a relative location at which each audio stem is to be rendered, calculate a plurality of tuning parameters for each loudspeaker in the plurality of loudspeakers, provide the plurality of tuning parameters and the position and rotation of each loudspeaker to a spatialization shader, generate a driver feed for each driver in the plurality of loudspeakers using the spatialization shader, and render each audio stem at its respective location using the plurality of loudspeakers and the tuning parameters.
In another embodiment, the plurality of tuning parameters includes a source focus parameter that defines energy distribution between loudspeakers in the plurality of loudspeakers and directivity behavior for each loudspeaker in the plurality of loudspeakers.
In a further embodiment, the plurality of tuning parameters includes a delay parameter and a gain parameter.
In still another embodiment, to calculate the delay parameter and the gain parameter when the positions of three given loudspeakers in the plurality of loudspeakers form a scalene triangle, the spatial audio rendering application directs the processor to determine a distance dmax as the longest distance from a listener position to any of the three given loudspeakers, calculate the delay parameter for each of the three given loudspeakers as the distance from the given loudspeaker to the listening position minus dmax, all divided by the speed of sound in air, and calculate the gain parameter for each of the three given loudspeakers as the distance from the given loudspeaker to the listening position divided by dmax.
In a still further embodiment, the plurality of tuning parameters includes a bass-crossfeed parameter.
In yet another embodiment, the spatial audio rendering application further directs the processor to track a listener position, and move the location of each audio stem to maintain relative position to the tracked listener position.
In a yet further embodiment, the spatial audio rendering application further directs the processor to regularize the loudspeaker positions in a virtual map, calculate a minimum bounding box for the regularized loudspeaker positions in the virtual map, denote the center of the minimum bounding box as a reference position, where the reference position reflects the centroid of a polygon defined by the positions of the loudspeakers, and use the reference position to translate the virtual space of a user interface to the location of the loudspeaker positions.
In another additional embodiment, a method for spatial audio rendering includes obtaining a plurality of audio stems, obtaining a position and a rotation for each loudspeaker in a plurality of loudspeakers, where each loudspeaker has at least one driver, obtaining a location at which each audio stem is to be rendered, calculating a plurality of tuning parameters for each loudspeaker in the plurality of loudspeakers, providing the plurality of tuning parameters and the position and rotation of each loudspeaker to a spatialization shader, generating a driver feed for each driver in the plurality of loudspeakers using the spatialization shader, and rendering each audio stem at its respective location using the plurality of loudspeakers and the tuning parameters.
In a further additional embodiment, the plurality of tuning parameters includes a source focus parameter that defines energy distribution between loudspeakers in the plurality of loudspeakers and directivity behavior for each loudspeaker in the plurality of loudspeakers.
In another embodiment again, the plurality of tuning parameters includes a delay parameter and a gain parameter.
In a further embodiment again, calculating the delay parameter and the gain parameter when the positions of three given loudspeakers in the plurality of loudspeakers form a scalene triangle includes determining a distance dmax as the longest distance from a listener position to any of the three given loudspeakers, calculating the delay parameter for each of the three given loudspeakers as the distance from the given loudspeaker to the listening position minus dmax, all divided by the speed of sound in air, and calculating the gain parameter for each of the three given loudspeakers as the distance from the given loudspeaker to the listening position divided by dmax.
In still yet another embodiment, the plurality of tuning parameters includes a bass-crossfeed parameter.
In a still yet further embodiment, the method further includes tracking a listener position, and moving the location of each audio stem to maintain relative position to the tracked listener position.
In still another additional embodiment, the method further includes regularizing the loudspeaker positions in a virtual map, calculating a minimum bounding box for the regularized loudspeaker positions in the virtual map, denoting the center of the minimum bounding box as a reference position, where the reference position reflects the centroid of a polygon defined by the positions of the loudspeakers, and using the reference position to translate the virtual space of a user interface to the location of the loudspeaker positions.
In a still further additional embodiment, a loudspeaker for spatial audio rendering includes at least one driver, a processor, and a memory containing a spatial audio rendering application, where the spatial audio rendering application directs the processor to obtain a plurality of audio stems, obtain a position and a rotation of each loudspeaker in a plurality of secondary loudspeakers communicatively coupled to the loudspeaker, where each secondary loudspeaker includes at least one driver, obtain a location at which each audio stem is to be rendered, calculate a plurality of tuning parameters for each loudspeaker in the plurality of loudspeakers, provide the plurality of tuning parameters and the position and rotation of each loudspeaker to a spatialization shader, generate a driver feed for each driver in the plurality of loudspeakers using the spatialization shader, transmit each driver feed to its respective driver, and render each audio stem at its respective location using the plurality of loudspeakers and the tuning parameters.
In still another embodiment again, the plurality of tuning parameters includes a source focus parameter that defines energy distribution between loudspeakers in the plurality of loudspeakers and directivity behavior for each loudspeaker in the plurality of loudspeakers.
In a still further embodiment again, the plurality of tuning parameters includes a delay parameter and a gain parameter.
In yet another additional embodiment, to calculate the delay parameter and the gain parameter when the positions of three given loudspeakers in the plurality of secondary loudspeakers form a scalene triangle, the spatial audio rendering application directs the processor to determine a distance dmax as the longest distance from a listener position to any of the three given loudspeakers, calculate the delay parameter for each of the three given loudspeakers as the distance from the given loudspeaker to the listening position minus dmax, all divided by the speed of sound in air, and calculate the gain parameter for each of the three given loudspeakers as the distance from the given loudspeaker to the listening position divided by dmax.
In a yet further additional embodiment, the spatial audio rendering application further directs the processor to track a listener position, and move the location of each audio stem to maintain relative position to the tracked listener position.
In yet another embodiment again, the spatial audio rendering application further directs the processor to regularize the loudspeaker positions in a virtual map, calculate a minimum bounding box for the regularized loudspeaker positions in the virtual map, denote the center of the minimum bounding box as a reference position, where the reference position reflects the centroid of a polygon defined by the positions of the loudspeakers, and use the reference position to translate the virtual space of a user interface to the location of the loudspeaker positions.
Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which form a part of this disclosure.
The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
Turning now to the drawings, systems and methods for spatial audio rendering are illustrated. Spatial audio systems in accordance with many embodiments of the invention include one or more network connected speakers that can be referred to as “cells”. As described herein, cells are capable of producing directional audio in at least a horizontal plane. In several embodiments, the spatial audio system is able to receive an arbitrary audio source as an input and render spatial audio in a manner determined based upon the specific number and placement of cells in a space. In numerous embodiments, a user interface (UI) can be provided which enables a user to intuitively alter the sound field produced by the spatial audio system. For example, in many embodiments, one or more audio objects can be rendered such that sound associated with an object appears to be emanating from the location of the audio object, where the location of the audio object is not the same location as any of the cells. In several embodiments, the manner in which spatial audio is rendered is interactive. In a number of embodiments, the UI includes at least one affordance that enables movement of one or more audio objects throughout a space, e.g. by dragging them across a digital representation of the space. In certain embodiments, movement of one or more audio objects occurs automatically in response to information concerning the location of one or more listeners within the space. In various embodiments, a listener position can be tracked and used to maintain relative positioning of the user and audio objects.
In order to provide a translation between interactions with the UI and audio object placement, systems and methods described herein utilize “spatialization shaders” to parameterize audio objects for location dependent rendering of spatial audio. In numerous embodiments, an “audio source” is obtained which provides audio signals from a stream or file playback. The audio source can output one or more “stems,” where each stem describes one or more audio objects. In several embodiments, each stem can be visualized via the UI to the user as an object in a virtual space, which can be moved by the user. In numerous embodiments, the stem is visualized as a disk or “puck” which can be dragged around a virtual space in order to change the perceived location of the audio objects associated with the given stem.
In many embodiments, when the puck is moved, cells are directed to modify the location and rendering parameters of spatial audio objects that are provided to the audio rendering pipelines of the cells. Based upon the manner in which the locations and rendering parameters are changed, in many embodiments the listener perceives that the locations of the spatial audio objects have changed. In some embodiments, the audio experience can be made to be similar irrespective of the location of the user.
In various embodiments, audio objects are channels in an audio mix. Audio objects can correspond, for example, to a left channel, right channel, center channel, left surround channel, right surround channel, etc. depending on the number of channels for a given mix. In various embodiments, audio objects can represent the audio produced by a single instrument in a mix, e.g. a guitar object, a vocalist object, a percussion object, etc. Spatialization shaders can take the stem position and properties and output real-world positions for each associated audio object belonging to the stem. Depending on the audio source and/or user preference, movement of the puck can differentially modify the placement of sound objects. For example, when the audio source is a television, moving the puck may modify the listener position relative to the television in order to place the user at a “sweet spot” for the particular surround sound audio mix. In numerous embodiments, stereo content can be made to sound as if it is rendered from an arbitrary location, or alternatively from multiple locations to generate an immersive stereo experience irrespective of location. Spatial audio systems are described in further detail below before a discussion of spatialization shaders.
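By way of a concrete, non-limiting illustration, the following sketch shows a minimal hypothetical spatialization shader for a stereo stem that maps a puck position and rotation to real-world positions for left and right audio objects. The function name, the fixed stereo spread, and the coordinate conventions are assumptions made for exposition rather than the system's actual interface.

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    name: str
    x: float  # meters, room coordinates
    y: float

def stereo_spatialization_shader(puck_x: float, puck_y: float,
                                 puck_rotation_deg: float, spread_m: float = 1.5):
    """Hypothetical shader: place left/right objects about the puck position.

    puck_rotation_deg rotates the stereo pair and spread_m sets the distance
    from the puck to each channel object (both illustrative assumptions).
    """
    theta = math.radians(puck_rotation_deg)
    # Offset each channel object along the rotated left/right axis.
    dx, dy = math.cos(theta) * spread_m, math.sin(theta) * spread_m
    return [AudioObject("left", puck_x - dx, puck_y - dy),
            AudioObject("right", puck_x + dx, puck_y + dy)]

# Example: puck dragged to (1.0, 2.0) and rotated by 30 degrees.
for obj in stereo_spatialization_shader(1.0, 2.0, 30.0):
    print(obj)
```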
Spatial audio systems are systems that utilize arrangements of one or more cells to render spatial audio for a given space. Cells can be placed in any of a variety of arbitrary arrangements in any number of different spaces, including (but not limited to) indoor and outdoor spaces. While some cell arrangements are more advantageous than others, spatial audio systems described herein can function with high fidelity despite imperfect cell placement. In addition, spatial audio systems in accordance with many embodiments of the invention can render spatial audio using a particular cell arrangement despite the fact that the number and/or placement of cells may not correspond with assumptions concerning the number and placement of speakers utilized in the encoding of the original audio source. In many embodiments, cells can map their surroundings and/or determine their relative positions to each other in order to configure their playback to accommodate for imperfect placement. In numerous embodiments, cells can communicate wirelessly, and, in many embodiments, create their own ad hoc wireless networks. In various embodiments, cells can connect to external systems to acquire audio for playback. Connections to external systems can also be used for any number of alternative functions, including, but not limited to, controlling internet of things (IoT) devices, accessing digital assistants, communicating with playback control devices, and/or any other functionality as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
An example spatial audio system in accordance with an embodiment of the invention is illustrated in
Referring again to
The set of cells can obtain media data from media servers 130 via the network. In numerous embodiments, the media servers are controlled by 3rd parties that provide media streaming services such as, but not limited to: Netflix, Inc. of Los Gatos, California; Spotify Technology S.A. of Stockholm, Sweden; Apple Inc. of Cupertino, California; Hulu, LLC of Los Angeles, California; and/or any other media streaming service provider as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In numerous embodiments, cells can obtain media data from local media devices 140, including, but not limited to, cellphones, televisions, computers, tablets, network attached storage (NAS) devices and/or any other device capable of media output. Media can be obtained from media devices via the network, or, in numerous embodiments, be directly obtained by a cell via a direct connection. The direct connection can be a wired connection through an input/output (I/O) interface, and/or wirelessly using any of a number of wireless communication technologies.
The illustrated spatial audio system 100 can also (but does not necessarily need to) include a cell control server 150. In many embodiments, connections between media servers of various music services and cells within a spatial audio system are handled by individual cells. In several embodiments, cell control servers can assist with establishing connections between cells and media servers. For example, cell control servers may assist with authentication of user accounts with various 3rd party service providers. In a variety of embodiments, cells can offload processing of certain data to the cell control server. For example, mapping a room based on acoustic ranging may be sped up by providing the data to a cell control server which can in turn provide back to the cells a map of the room and/or other acoustic model information including (but not limited to) a virtual speaker layout. In numerous embodiments, cell control servers are used to remotely control cells, such as, but not limited to, directing cells to play back a particular piece of media content, changing volume, changing which cells are currently being utilized to play back a particular piece of media content, and/or changing the location of spatial audio objects in the area. However, cell control servers can perform any number of different control tasks that modify cell operation as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. The manner in which different types of user interfaces can be provided for spatial audio systems in accordance with various embodiments of the invention is discussed further below.
In many embodiments, the spatial audio system 100 further includes a cell control device 160. Cell control devices can be any device capable of directly or indirectly controlling cells, including, but not limited to, cellphones, televisions, computers, tablets, and/or any other computing device as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In numerous embodiments, cell control devices can send commands to a cell control server which in turn sends the commands to the cells. For example, a mobile phone can communicate with a cell control server by connecting to the internet via a cellular network. The cell control server can authenticate a software application executing on the mobile phone. In addition, the cell control server can establish a secure connection to a set of cells to which it can pass instructions from the mobile phone. In this way, secure remote control of cells is possible. However, in numerous embodiments, the cell control device can directly connect to the cell via either the network, the ad hoc network, or a direct peer-to-peer connection with a cell in order to provide instructions. In many embodiments, cell control devices can also operate as media devices. However, it is important to note that a control server is not a necessary component of a spatial audio system. In numerous embodiments, cells can manage their own control by directly receiving commands (e.g. through physical input on a cell, or via a networked device) and propagating those commands to other cells. However, many control devices can provide user interfaces such as those described below that utilize pucks.
Further, in numerous embodiments, network connected source input devices can be included in spatial audio systems to collect and coordinate media inputs. For example, a source input device may connect to a television, a computer, a media server, or any number of media devices. In numerous embodiments, source input devices have wired connections to these media devices to reduce lag. A spatial audio system that includes a source input device in accordance with an embodiment of the invention is illustrated in
While particular spatial audio systems are described above with respect to
Spatialization shaders can be used to generate a set of parameters that define how each cell in a spatial audio system plays back audio to create a desired sound field. In many embodiments, the desired sound field is indicated via a user interface. In numerous embodiments, pre-set sound fields can be used. For example, when playing back audio, a user may want audio to sound like it is emanating from a particular direction and/or location. The user can use a user interface to modify the rendering of the audio source. For example, the user interface can include an affordance that enables the user to indicate a particular direction and/or location, and a spatialization shader can translate the information from the user interface into a particular set of parameters for each cell.
Turning now to
A puck 220 user interface affordance, which has been dragged to a desired position, is located within the space. While the puck 220 is shown in a particular position, as can be readily appreciated the puck can be dragged to any portion of the space, where dragging the puck offsets the sound field. Furthermore, any of a variety of different affordances can be utilized, and user interfaces in accordance with various embodiments of the invention should be understood as not limited to puck affordances. In certain embodiments, pucks can be rotated to rotate the sound field. In various embodiments, the puck can be scaled, e.g. made smaller or larger, to change the envelopment and/or spread of the sound field. In numerous embodiments, a number of other pucks 230 are included in the user interface which can be dragged into the space to direct playback of associated content. Modifying tuning parameters in near real-time based on movement of the puck can be achieved using spatialization shaders.
Turning now to
In a number of embodiments, the number of stems depends on the audio source. For example, an audio source which contains separate channels for each instrument in the mix may be split into a stem representing each different instrument. However, stems can be merged as desired to group certain channels, e.g. guitar and vocal stems can be merged into a single stem. In numerous embodiments, the audio source includes metadata that contains a preferred set of stems. Each stem can be represented by a puck in the user interface.
The locations of cells in the spatial audio system are obtained (330). In numerous embodiments, the locations of cells are defined in a coordinate plane. In many embodiments, cells have multiple directional horns, and therefore the orientation (“rotation”) of each cell is associated with each location. The locations of any stems are also determined (340). In numerous embodiments, the locations of one or more stems are obtained via a user interface. As discussed herein, a puck can be used to determine the location of a stem. In some embodiments, the location of a puck determines the location of the associated stem. In various embodiments, the location of a puck determines the location of a listener at the negated coordinates of the puck, i.e. the location of the puck reflected about the origin.
Based on the location of the stems in the user interface, tuning parameters are generated (350) for each cell. While any number of different tuning parameters can be generated, several of note are the source focus parameter and the delay and gain parameters, all of which are discussed at length in subsections below. Other parameters can include (but are not limited to) the coordinates of the position of each audio object associated with the stem, volume, bass-crossfeed, and snapToHorn (when disabled, the beam can be rotated continuously; when enabled, beams are limited to directions corresponding to the cell's horns). Audio is rendered (360) by each cell in accordance with its specific tuning parameters to generate the desired sound field. In numerous embodiments, volume, source focus, and position of the audio objects are calculated and utilized by spatialization shaders. In some embodiments, bass-crossfeed parameters are used as well. However, as can be readily appreciated, any subset of parameters (or one that includes additional parameters) can be used depending on the scenario as appropriate to the requirements of specific applications of embodiments of the invention.
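For exposition, the tuning parameters enumerated above can be collected into a simple per-cell record, as in the sketch below; the field names and defaults are illustrative assumptions, not a prescribed data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CellTuningParameters:
    """Illustrative per-cell tuning record; field names are assumptions."""
    object_positions: List[Tuple[float, float]] = field(default_factory=list)  # (x, y) per audio object
    volume: float = 1.0           # overall linear gain for the cell
    source_focus: float = 0.0     # 0.0 = unfocused ... 1.0 = maximally focused
    delay_s: float = 0.0          # distance-compensation delay in seconds
    gain: float = 1.0             # distance-compensation gain
    bass_crossfeed: float = 1.0   # 0.0 = no crossfeed ... 1.0 = full crossfeed
    snap_to_horn: bool = False    # True: beams only in horn directions; False: continuous rotation

# Example: parameters for a cell rendering a single stem placed at (2.0, 1.5).
params = CellTuningParameters(object_positions=[(2.0, 1.5)], source_focus=0.53)
print(params)
```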
While a specific process is illustrated with respect to
In many scenarios, cells are arranged in regular shapes, e.g. 3 cells in an equilateral triangle, 4 cells in a square, etc. However, individual users have the freedom to place cells in arbitrary locations within their homes. While some cell placements may be superior to others, some degree of flexibility is tolerable. In many embodiments, cells can automatically determine their relative locations to each other and construct a coordinate system which includes the relative rotation and location of cells. Once the coordinate system is constructed, it can be used to map the UI space to the real world. In many embodiments, the UI presents a uniform virtual space which has an obvious center, e.g. the center point of a circle. However, the real-world placement may not have such an obvious center. Further, the centroid of the polygon formed by the cell placement may not be the most useful position to consider as the center point of the virtual space. For example, outlier cells which fall far away from the rest of the cells can drag the centroid to a position that would result in suboptimal playback. To address this, cell positions can be regularized, which is possible due to each cell's ability to produce directional audio.
Turning now to
A minimum bounding box for the regularized cell positions is computed (530) and the center of this minimum bounding box is assigned (2040) as the reference position, e.g. corresponding to the center point of the virtual space of the UI. The reference position can enable translation from a simple, uniform UI virtual space (e.g. rectangle, square, circle, etc.) to a more complex real-space layout. Various example reference positions with respect to centroids for different cell layouts in accordance with various embodiments of the invention are illustrated in
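A minimal sketch of this step, assuming the cell positions have already been regularized into 2D coordinates, is shown below; the helper names are illustrative, and the vertex centroid is included only for comparison with the reference position.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def bounding_box_reference(regularized_cells: List[Point]) -> Point:
    """Center of the axis-aligned minimum bounding box of the regularized cell positions."""
    xs = [p[0] for p in regularized_cells]
    ys = [p[1] for p in regularized_cells]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)

def centroid(cells: List[Point]) -> Point:
    """Simple vertex centroid, shown for comparison with the reference position."""
    return (sum(p[0] for p in cells) / len(cells), sum(p[1] for p in cells) / len(cells))

# Four cells roughly in an L shape: the bounding-box center serves as the
# UI origin even though the layout has no obvious geometric center.
cells = [(0.0, 0.0), (3.0, 0.0), (3.0, 2.0), (0.2, 0.4)]
print("centroid:", centroid(cells))
print("reference position:", bounding_box_reference(cells))
```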
In a number of embodiments, a source focus parameter can define the source energy distribution between cells as well as directivity behavior for each cell. In many embodiments, the source focus parameter contains a number of predetermined steps which can be transitioned between depending on the placement of audio objects, which can be determined (for example) by placement of the puck and/or a desired level of directionality from the location of the puck, where each step has a different behavior which is not required to be linear. In many embodiments, the source focus parameter ranges from 0.0 to 1.0, where lower numbers reflect a lack of focus, and higher numbers represent increasing focus. Each step of 0.1 can have a particular behavior, and values between steps of 0.1 can have an interpolated mix of behaviors from the bounding 0.1 step values. For example, a source focus parameter of 0.53 would mix the behaviors of 0.5 and 0.6, weighted slightly in favor of 0.5. In numerous embodiments, the effect of the source focus parameter can be tuned to the artistic desires of the user. A set of source focus parameter behaviors is described in the table below.
By way of further example, different source focus parameters are illustrated in the series of
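The interpolation between bounding 0.1 steps described above can be sketched as a simple linear blend. The behavior table here is a stand-in with made-up parameters; only the blending arithmetic reflects the description above.

```python
import math

def blend_focus_behaviors(focus: float, step_behaviors: dict) -> dict:
    """Blend the behaviors of the two bounding 0.1 steps for a focus value.

    step_behaviors maps step values (0.0, 0.1, ..., 1.0) to dicts of behavior
    parameters.  A focus of 0.53 mixes the 0.5 and 0.6 behaviors, weighted
    0.7 toward 0.5 and 0.3 toward 0.6.
    """
    focus = min(max(focus, 0.0), 1.0)
    lower = math.floor(focus * 10) / 10.0
    upper = min(round(lower + 0.1, 1), 1.0)
    frac = 0.0 if upper == lower else (focus - lower) / (upper - lower)
    lo, hi = step_behaviors[round(lower, 1)], step_behaviors[upper]
    return {k: (1.0 - frac) * lo[k] + frac * hi[k] for k in lo}

# Hypothetical behavior table (values are made up for illustration only).
table = {round(s / 10, 1): {"beam_width_deg": 180 - 12 * s, "neighbor_spill": 1.0 - s / 10}
         for s in range(11)}
print(blend_focus_behaviors(0.53, table))
```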
In a number of embodiments, a VBAP/DBAP hybrid approach is utilized to generate audio signals for audio objects located within particular sub-regions of the region containing a group of cells. In this approach, Distance Based Amplitude Panning (DBAP) is utilized to determine the characteristics of three or more virtual audio sources located on the convex hull of the cells. Vector Based Amplitude Panning (VBAP) and/or pairwise-based panning approaches can then be utilized by pairs of cells (or three cells when an overhead cell is present) to generate audio signals for each of the cells, enabling the rendering of audio by the cells in a manner corresponding to each of the virtual audio sources determined using DBAP. In some configurations, beamforming can be utilized to generate audio signals directed towards a centroid defined based upon the cell configuration. In other configurations, beamforming can be utilized to generate audio signals directed toward the location of the spatial audio object. Utilizing beamforming in this way can increase the perceived directivity of the spatial audio object. As can readily be appreciated, any of a variety of panning techniques can be utilized to render spatial audio objects, and the specific manner in which spatial audio objects are rendered in different regions can be determined and/or modified based upon factors including (but not limited to) the configuration of the cells, the type of audio source, artistic direction, and/or any other factor appropriate to the requirements of specific applications.
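As a rough illustration of the distance-based stage, the sketch below computes power-normalized gains that fall off with distance from an object to each virtual-source anchor on the convex hull. The rolloff exponent and spatial blur term are generic assumptions; this is not the exact DBAP/VBAP hybrid implementation described above.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def distance_based_gains(source: Point, anchors: List[Point],
                         rolloff_exp: float = 1.0, spatial_blur: float = 0.1) -> List[float]:
    """Power-normalized gains that fall off with distance to each anchor.

    Gains are proportional to 1 / (distance + spatial_blur)**rolloff_exp; the
    blur keeps a source sitting exactly on an anchor finite.  Normalization
    keeps the total radiated power constant as the source moves.
    """
    raw = [1.0 / (math.hypot(source[0] - ax, source[1] - ay) + spatial_blur) ** rolloff_exp
           for ax, ay in anchors]
    norm = math.sqrt(sum(g * g for g in raw))
    return [g / norm for g in raw]

# Virtual-source anchors on the convex hull of three cells; an audio object
# near the first anchor receives most of the energy.
anchors = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
print(distance_based_gains((0.5, 0.2), anchors))
```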
While particular source focus parameters, values, and behaviors are discussed above, as can be readily appreciated, additional behaviors can be added and described behaviors can be removed as appropriate to the requirements of specific applications of embodiments of the invention. Furthermore, the rendering of audio objects in accordance with various embodiments of the invention is not limited to the use of spatial audio parameters in the manner described above, but can instead utilize any of a variety of techniques for modifying the directionality and/or manner in which audio objects are rendered as appropriate to the requirements of specific applications. The use of a delay/gain parameter in the rendering of spatial audio in accordance with several embodiments of the invention is discussed below.
In numerous embodiments, a user may indicate their listening location to a spatial audio system, which will shift the sound field to optimize the listening experience for the indicated location. In many situations, when a listener is closer to one cell compared to the others, the spatial image will shift towards the closer cell. This is due to the precedence effect: spatial perception shifts towards the direction of the first arriving sound. Further, the sound from the nearer cell tends to be less attenuated than that from farther cells, which adds to the spatial dominance of the near cell (a phenomenon also referred to as amplitude panning). To compensate for these two phenomena, systems and methods described herein can utilize a gain parameter and a delay parameter for each cell to correct for the different travel time/distance from each cell to the listener, and perceptually put all cells at the same distance from the listener. In numerous embodiments, delay and gain are used more often when the listener is expected to be in a static listening position, e.g. sitting on a couch watching a television. In this situation, a near cell can be made to play more quietly with a slight delay to compensate for the shorter distance.
Turning now to
By way of further example, a three-cell system is illustrated in
In
In many situations, cells that are too far from the listening position may not be able to contribute to a ‘fused’ spatial sound event. In numerous embodiments, if a very far cell were included in the gain and delay computations, unacceptable latency could be introduced. Therefore, in various embodiments, a maximum latency is defined, beyond which cells are not considered for gain and delay with respect to the given listening position. In a variety of embodiments, the maximum latency is between 10 and 14 ms. This situation is illustrated in
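A simplified sketch of the per-cell delay and gain compensation, including the latency cutoff, is shown below. It assumes 2D positions, uses the common convention of delaying nearer cells so wavefronts arrive together with the farthest included cell, attenuates nearer cells by d/dmax, and excludes cells whose extra propagation time relative to the nearest cell exceeds the latency budget; the exclusion rule in particular is an assumed reading rather than the system's exact behavior.

```python
import math
from typing import Dict, Tuple

SPEED_OF_SOUND = 343.0  # metres per second in air at roughly 20 C

def delay_gain_compensation(listener: Tuple[float, float],
                            cells: Dict[str, Tuple[float, float]],
                            max_latency_s: float = 0.012) -> Dict[str, Tuple[float, float]]:
    """Per-cell (delay_s, gain) compensation for a static listening position.

    Cells whose extra propagation time relative to the nearest cell exceeds
    max_latency_s are left out (an assumed reading of the latency cutoff).
    Included cells are delayed so their wavefronts arrive together with the
    farthest included cell, and nearer cells are attenuated by d / d_max.
    """
    dist = {name: math.hypot(p[0] - listener[0], p[1] - listener[1]) for name, p in cells.items()}
    d_min = min(dist.values())
    included = {n: d for n, d in dist.items() if (d - d_min) / SPEED_OF_SOUND <= max_latency_s}
    d_max = max(included.values())
    return {n: ((d_max - d) / SPEED_OF_SOUND, d / d_max) for n, d in included.items()}

# Listener on a couch near cell "A"; cell "C" is far enough away (about 17 ms
# of extra travel time) that it is excluded from the compensation.
cells = {"A": (1.0, 0.5), "B": (3.0, 2.5), "C": (7.0, 3.0)}
print(delay_gain_compensation((1.0, 1.0), cells))
```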
In numerous embodiments, there is equal distribution of bass content to all cells. However, in many situations, depending on the layout of cells, bass frequencies can be disproportionately loud near cells that are not being used (or not as significantly used) for rendering spatial audio objects. For example, if a listening area on an open floor plan including a kitchen and a living room contains multiple cells, and a user in the kitchen wants to listen to music and therefore places audio objects in the kitchen, distributing bass equally across all cells may disturb a person in the living room by playing only bass near them. To address this scenario, a bass-crossfeed parameter can be used to tune the amount of bass fed from one cell to another. In many embodiments, the bass-crossfeed parameter is a number between 0.0, representing no crossfeed, and 1.0, representing full crossfeed. In many embodiments, the bass-crossfeed parameter can be artistically set by the system architect and/or user.
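One possible realization of the parameter is sketched below: each cell's low-frequency feed is blended between its spatially assigned bass and an equal share of the total bass, in proportion to the crossfeed value. The mixing rule is an illustrative assumption rather than the system's actual bass-management path.

```python
import numpy as np

def apply_bass_crossfeed(bass_feeds: np.ndarray, crossfeed: float) -> np.ndarray:
    """Blend per-cell bass between 'where the render placed it' and 'shared equally'.

    bass_feeds has shape (num_cells, num_samples) and holds each cell's
    low-frequency signal before crossfeed.  crossfeed = 0.0 leaves the bass
    with its originating cells; 1.0 distributes it evenly across all cells.
    """
    shared = bass_feeds.mean(axis=0, keepdims=True)  # equal distribution across cells
    return (1.0 - crossfeed) * bass_feeds + crossfeed * shared

# Two cells: the spatial render initially assigns all bass to the kitchen cell.
feeds = np.array([[1.0, 1.0, 1.0],    # kitchen cell
                  [0.0, 0.0, 0.0]])   # living-room cell
print(apply_bass_crossfeed(feeds, 0.25))
```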
Turning now to
Spatialization shaders can use tuning parameters to place and parameterize audio objects from stems, which in turn can be used to render the desired spatial audio. In many embodiments, spatialization shaders are applied dynamically and can modify audio objects in near-real time based on changes made in the UI. In many embodiments, the spatialization shaders are differentially used depending on the audio source. For example, spatialization shaders may differentially calculate tuning parameters when provided 5.1 channel audio vs. stereo audio. Example puck positions and resulting audio objects for stereo upmixed to 10 channels parameterized by spatialization shaders in accordance with an embodiment of the invention are illustrated in
Spatial audio has traditionally been rendered with a static array of speakers located in prescribed locations. While, up to a point, more speakers in the array are conventionally thought of as “better,” consumer grade systems have currently settled on 5.1 and 7.1 channel systems, which use five speakers and seven speakers, respectively, in combination with one or more subwoofers. Currently, some media is supported in up to 22.2 (e.g. in Ultra HD Television as defined by the International Telecommunication Union). In order to play higher channel count sound on fewer speakers, audio inputs are generally either downmixed to match the number of speakers present, or channels that do not match the speaker arrangement are simply dropped. An advantage of systems and methods described herein is the ability to create any number of audio objects based upon the number of channels used to encode the audio source. For example, an arrangement of three cells could generate the auditory sensation of the presence of a 5.1 speaker arrangement by placing five audio objects in the room, encoding the five audio objects into a spatial representation (e.g. an ambisonic representation such as (but not limited to) B-format), and then rendering a sound field using the three cells by decoding the spatial representation of the original 5.1 audio source in a manner appropriate to the number and placement of cells (see discussion below). In many embodiments, the bass channel can be mixed into the driver signals for each of the cells. Processes that treat channels as spatial audio objects are extensible to any arbitrary number of speakers and/or speaker arrangements. In this way, fewer physical speakers in the room can be utilized to achieve the effects of a higher number of speakers. Furthermore, cells need not be placed precisely in order to achieve this effect.
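A compact sketch of the encode step is shown below: each placed channel object is encoded into first-order B-format (W, X, Y, Z) from its azimuth and elevation relative to the chosen origin. The FuMa-style coefficient convention and the helper names are assumptions made for illustration; decoding the resulting sound field to the actual cells would follow separately.

```python
import math
import numpy as np

def encode_first_order_bformat(signal: np.ndarray, azimuth_deg: float,
                               elevation_deg: float) -> np.ndarray:
    """Encode one mono channel object into first-order B-format (W, X, Y, Z)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    w = 1.0 / math.sqrt(2.0)         # omnidirectional component (FuMa weighting)
    x = math.cos(az) * math.cos(el)  # front/back
    y = math.sin(az) * math.cos(el)  # left/right
    z = math.sin(el)                 # up/down
    return np.outer([w, x, y, z], signal)

def encode_objects(objects) -> np.ndarray:
    """Sum the B-format contributions of several placed channel objects.

    objects is a list of (signal, azimuth_deg, elevation_deg) tuples, e.g. the
    five full-range channel objects of a 5.1 source (bass handled separately).
    """
    return sum(encode_first_order_bformat(sig, az, el) for sig, az, el in objects)

# Example: left, center, and right channel objects at -30, 0, and +30 degrees.
t = np.linspace(0.0, 1.0, 8)
mix = encode_objects([(np.sin(2 * math.pi * 2 * t), -30.0, 0.0),
                      (np.zeros_like(t), 0.0, 0.0),
                      (np.cos(2 * math.pi * 2 * t), 30.0, 0.0)])
print(mix.shape)  # (4, 8): W, X, Y, Z
```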
Conventional audio systems typically have what is often referred to as a “sweet spot” at which the listener should be situated. In numerous embodiments, the spatial audio system can use information regarding room acoustics to control the perceived ratio between direct and reverberant sound in a given space such that it sounds like a listener is surrounded by sound, regardless of where they are located within the space. While most rooms are very non-diffuse, spatial rendering methods can involve mapping a room and determining an appropriate sound field manipulation for rendering diffuse audio (see discussion below). Diffuse sound fields are typically characterized by sound arriving randomly from evenly distributed directions at evenly distributed delays.
In many embodiments, the spatial audio system maps a room. Cells can use any of a variety of methods for mapping a room, including, but not limited to, acoustic ranging, applying machine vision processes, and/or any other ranging method that enables 3D space mapping. Other devices can be utilized to create or augment these maps, such as smart phones or tablet PCs. The mapping can include: the location of cells in the space; wall, floor, and/or ceiling placements; furniture locations; and/or the location of any other objects in a space. In several embodiments, these maps can be used to generate speaker placement and/or orientation recommendations that can be tailored to the particular location. In some embodiments, these maps can be continuously updated with the location of listeners traversing the space and/or a history of the location(s) of listeners. As is discussed further below, many embodiments of the invention utilize virtual speaker layouts to render spatial audio. In several embodiments, information including (but not limited to) any of cell placement and/or orientation information, room acoustic information, and user/object tracking information can be utilized to determine an origin location at which to encode a spatial representation (e.g. an ambisonic representation) of an audio source and a virtual speaker layout to use in the generation of driver inputs at individual cells. Various systems and methods for rendering spatial audio using spatial audio systems in accordance with certain embodiments of the invention are discussed further below.
In a number of embodiments, upmixing can be utilized to create a number of audio objects that differs from the number of channels. In several embodiments, a stereo source containing two channels can be upmixed to create a number of left (L), center (C), and right (R) channels. In a number of embodiments, diffuse audio channels can also be generated via upmixing. Audio objects corresponding to the upmixed channels can then be placed relative to a space defined by a number of cells to create various effects including (but not limited to) the sensation of stereo everywhere within the space as conceptually illustrated in
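A very simple passive upmix can illustrate the idea: correlated content is routed to a derived center object while the side objects retain the remainder. The gains below are illustrative assumptions rather than the upmixer actually used, and a real upmixer would typically also derive diffuse channels.

```python
import numpy as np

def passive_upmix_lcr(left: np.ndarray, right: np.ndarray, center_gain: float = 0.5):
    """Derive left, center, and right feeds from a stereo pair.

    The center object carries the correlated (L+R) content, which is partially
    removed from the side objects to avoid doubling energy in the middle.
    """
    center = center_gain * (left + right)
    new_left = left - center_gain * center
    new_right = right - center_gain * center
    return new_left, center, new_right

# A mono-ish source (identical L and R) is routed mostly to the center object.
l = np.array([1.0, 0.5, -0.25])
r = np.array([1.0, 0.5, -0.25])
print(passive_upmix_lcr(l, r))
```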
Turning now to
Cell 3200 can further include an input/output (I/O) interface 3220. In many embodiments, the I/O interface includes a variety of different ports and can communicate using a variety of different methodologies. In numerous embodiments, the I/O interface includes a wireless networking device capable of establishing an ad hoc network and/or connecting to other wireless networking access points. In a variety of embodiments, the I/O interface has physical ports for establishing wired connections. However, I/O interfaces can include any number of different types of technologies capable of transferring data between devices. Cell 3200 further includes clock circuitry 3230. In many embodiments, the clock circuitry includes a quartz oscillator.
Cell 3200 can further include driver signal circuitry 3235. Driver signal circuitry is any circuitry capable of providing an audio signal to a driver in order to make the driver produce audio. In many embodiments, each driver has its own portion of the driver circuitry.
Cell 3200 can also include a memory 3240. Memory can be volatile memory, non-volatile memory, or a combination of volatile and non-volatile memory. Memory 3240 can store an audio player application such as (but not limited to) a spatial audio rendering application 3242. In numerous embodiments, spatial audio rendering applications can direct the processing circuitry to perform various spatial audio rendering tasks such as, but not limited to, those described herein. In numerous embodiments, the memory further includes map data 3244. Map data can describe the location of various cells within a space, the location of walls, floors, ceilings, and other barriers and/or objects in the space, and/or the placement of virtual speakers. In many embodiments, multiple sets of map data may be utilized in order to compartmentalize different pieces of information. In a variety of embodiments, the memory 3240 also includes audio data 3246. Audio data can include one or more pieces of audio content that can contain any number of different audio tracks and/or channels. In a variety of embodiments, audio data can include metadata describing the audio tracks such as, but not limited to, channel information, content information, genre information, track importance information, and/or any other metadata that can describe an audio track as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In many embodiments, audio tracks are mixed in accordance with an audio format. However, audio tracks can also represent individual, unmixed channels.
Memory can further include sound object position data 3248. Sound object position data describes the desired location of a sound object in the space. In some embodiments, sound objects are located at the position of each speaker in a conventional speaker arrangement ideal for the audio data. However, sound objects can be designated for any number of different audio tracks and/or channels and can be similarly located at any desired point.
The apparatus 3300 may be used to implement a cell. The apparatus 3300 includes a set of spatial audio control and production modules 3310 that includes a system encoder 3312, a system decoder 3332, a cell encoder 3352, and a cell decoder 3372. The apparatus 3300 can also include a set of drivers 3392. The set of drivers 3392 may include one or more subsets of drivers that include one or more of different types of drivers. The drivers 3392 can be driven by driver circuitry 3390 that generates the electrical audio signals for each of the drivers. The driver circuitry 3390 may include any bandpass or crossover circuits that may divide audio signals for different types of drivers.
In various aspects of the disclosure, as illustrated by the apparatus 3300, each cell may include a system encoder and a system decoder such that system-level functionality and processing of related information may be distributed over the group of cells. This distributed architecture can also minimize the amount of data that needs to be transferred between each of the cells. In other implementations, each cell may only include a cell encoder and a cell decoder, but neither a system encoder nor a system decoder. In various embodiments, secondary cells only utilize their cell encoder and cell decoder.
The processing system 3320 can include one or more processors illustrated as a processor 3314. Examples of processors 3314 can include (but are not limited to) microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and/or other suitable hardware configured to perform the various functionality described throughout this disclosure.
The apparatus 3300 may be implemented as having a bus architecture, represented generally by a bus 3322. The bus 3322 may include any number of interconnecting buses and/or bridges depending on the specific application of the apparatus 3300 and overall design constraints. The bus 3322 can link together various circuits including the processing system 3320, which can include the one or more processors (represented generally by the processor 3314) and a memory 3318, and computer-readable media (represented generally by a computer-readable medium 3316). The bus 3322 may also link various other circuits such as timing sources, peripherals, voltage regulators, and/or power management circuits, which are well known in the art, and therefore, will not be described any further. A bus interface (not shown) can provide an interface between the bus 3322 and a network adapter 3342. The network adapter 3342 provides a means for communicating with various other apparatus over a transmission medium. Depending upon the nature of the apparatus, a user interface (e.g., keypad, display, speaker, microphone, joystick) may also be provided.
The processor 3314 is responsible for managing the bus 3322 and general processing, including execution of software that may be stored on the computer-readable medium 3316 or the memory 3318. The software, when executed by the processor 3314, can cause the apparatus 3300 to perform the various functions described herein for any particular apparatus. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
The computer-readable medium 3316 or the memory 3318 may also be used for storing data that is manipulated by the processor 3314 when executing software. The computer-readable medium 3316 may be a non-transitory computer-readable medium such as a computer-readable storage medium. A non-transitory computer-readable medium includes, by way of example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk (e.g., a compact disc (CD) or a digital versatile disc (DVD)), a smart card, a flash memory device (e.g., a card, a stick, or a key drive), a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, a removable disk, and any other suitable medium for storing software and/or instructions that may be accessed and read by a computer. The computer-readable medium may also include, by way of example, a carrier wave, a transmission line, and any other suitable medium for transmitting software and/or instructions that may be accessed and read by a computer. Although illustrated as residing in the apparatus 3300, the computer-readable medium 3316 may reside externally to the apparatus 3300, or be distributed across multiple entities including the apparatus 3300. The computer-readable medium 3316 may be embodied in a computer program product. By way of example, a computer program product may include a computer-readable medium in packaging materials. Those skilled in the art will recognize how best to implement the described functionality presented throughout this disclosure depending on the particular application and the overall design constraints imposed on the overall system.
The multimedia content 3412 and the multimedia metadata 3414 related thereto may be referred to herein as “multimedia data.” The source manager 3400 includes a source selector 3422 and a source preprocessor 3424 that may be used by the source manager 3400 to select one or more sources in the multimedia data and perform any preprocessing to provide as the content 3448. The content 3448 is provided to the multimedia rendering engine along with the rendering information 3450 generated by the other components of the source manager 3400, as described herein.
The multimedia content 3412 and the multimedia metadata 3414 may be multimedia data from such sources as High-Definition Multimedia Interface (HDMI), Universal Serial Bus (USB), analog interfaces (phono/RCA plugs, stereo/headphone/headset plugs), as well as streaming sources using the Airplay protocol developed by Apple Inc. or the Chromecast protocol developed by Google. In general, these sources may provide sound information in a variety of content and formats, including channel-based sound information (e.g., Dolby Digital, Dolby Digital Plus, and Dolby Atmos, as developed by Dolby Laboratories, Inc.), discrete sound objects, sound fields, etc. Other multimedia data can include text-to-speech (TTS) or alarm sounds generated by a connected device or another module within the spatial multimedia reproduction system (not shown).
The source manager 3400 further includes an enumeration determinator 3442, a position manager 3444, and an interaction manager 3446. Together, these components can be used to generate the rendering information 3450 that is provided to the multimedia rendering engine. As further described herein, the sensor data 3416 and the preset/history information 3418, which may be referred to generally as “control data,” may be used by these modules to affect playback of the multimedia content 3412 by providing the rendering information 3450 to the multimedia rendering engine. In one aspect of the disclosure, the rendering information 3450 contains telemetry and control information as to how the multimedia rendering engine should playback the multimedia in the content 3448. Thus, the rendering information 3450 may specifically direct how the multimedia rendering engine is to reproduce the content 3448 received from the source manager 3400. In other aspects of the disclosure, the multimedia rendering engine may make the ultimate determination as to how to render the content 3448.
The enumeration determinator module 3442 is responsible for determining the number of sources in the multimedia information included in the content 3448. This may include multiple channels from a single source, such as, for example, two channels from a stereo sound source, as well as TTS or alarm/alert sounds such as those that may be generated by the system. In one aspect of the disclosure, the number of channels in each content source is part of the determination of the number of sources to produce the enumeration information. The enumeration information may be used in determining the arrangement and mixing of the sources in the content 3448.
The position manager 3444 can manage the arrangement of reproduction of the sources in the multimedia information included in the content 3448 using a desired position of reproduction for each source. A desired position may be based on various factors, including the type of content being played, positional information of the user or an associated device, and historical/predicted position information. With reference to
The playback location may be based on the object A/R 3514, which may be information for an AR object in a particular rendering for a room. Thus, the playback position of a sound source may match the AR object. In addition, the system may determine where cells are using visual detection and, through a combination of scene detection and a view of the AR object being rendered, the playback position may be adjusted accordingly.
The playback position of a sound source may be adjusted based on a user interacting with a user interface through the UI position input 3516. For example, the user may interact with an app that includes a visual representation of the room in which a sound object is to be reproduced as well as the sound object itself. The user may then move the visual representation of the sound object to position the playback of the sound object in the room. In numerous embodiments, the position of the sound object is moved relative to the position of a listener based on tracked listener movement. In numerous embodiments, listeners can be tracked using any of a variety of tracking methods including (but not limited to) ultrawideband (UWB) tracking, radio-frequency identification (RFID) tracking, and/or any other tracking and/or triangulation modality as appropriate to the requirements of specific applications of embodiments of the invention.
The location of playback may also be based on other factors such as the last playback location of a particular sound source or type of sound source 3518. In general, the playback location may be based on a prediction based on factors including (but not limited to) type of the content, time of day, and/or other heuristic information. For example, the position manager 3544 may initiate playback of an audio book in a bedroom because the user plays back the audio book at night, which is the typical time that the user plays the audio book. As another example, a timer or reminder alarm may be played back in the kitchen if the user requests a timer be set while the user is in the kitchen.
In general, the position information sources may be classified into active or passive sources. Active sources refer to positional information sources provided by a user. These sources may include user location and object location. In contrast, passive sources are positional information sources that are not actively specified by users but are used by the position manager 3544 to predict playback position. These passive sources may include type of content, time of day, day of the week, and heuristic information. In addition, a priority level may be associated with each content source. For example, alarms and alerts may have a higher level of associated priority than other content sources, which may mean that these are played at higher volumes if they are being played in a position next to other content sources.
The desired playback location may be dynamically updated as the multimedia is reproduced by the multimedia rendering engine. For example, playback of music may “follow” a user around a room as the spatial multimedia reproduction system receives updated positional information for the user or a device being carried by the user.
An interaction manager 3446 can manage how each of the different multimedia sources is to be reproduced based on their interaction with each other. In accordance with one aspect of the disclosure, playback of a multimedia source such as a sound source may be paused, stopped, or reduced in volume (also referred to as “ducked”). For example, where an alarm needs to be rendered during playback of an existing multimedia source, such as a song, an interaction manager may pause or duck the song while the alarm is being played.
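The following Python sketch illustrates one hypothetical ducking decision an interaction manager could make based on relative priority; the gain values and rules are assumptions rather than the disclosed behavior.

```python
from typing import Dict

# Hypothetical priority scale: higher value means more important.
PRIORITIES = {"alarm": 3, "alert": 3, "tts": 2, "music": 1, "podcast": 1}


def interact(existing: str, incoming: str) -> Dict[str, float]:
    """Return per-source gains (1.0 = full volume, 0.2 = ducked)."""
    if PRIORITIES.get(incoming, 1) > PRIORITIES.get(existing, 1):
        # Duck the existing source while the higher-priority source plays.
        return {existing: 0.2, incoming: 1.0}
    return {existing: 1.0, incoming: 1.0}


print(interact("music", "alarm"))  # {'music': 0.2, 'alarm': 1.0}
```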
The components of a nested architecture of spatial encoders and spatial decoders can be implemented within individual cells within a spatial audio system in a variety of ways. Software of a cell that can be configured to act as a primary cell or a secondary cell within a spatial audio system in accordance with an embodiment of the invention is conceptually illustrated in
In the illustrated embodiment, an audio and midi application 3502 is provided to manage information passing between various software processes executing on the processing system of the cell and the hardware drivers. In several embodiments, the audio and midi application is capable of decoding audio signals for rendering on the sets of drivers of the cell. Any of the processes described herein for decoding audio for rendering on a cell can be utilized by the audio and midi application including the processes discussed in detail below.
Hardware audio source processes 3504 manage communication with external sources via the interface connector drivers. The interface connector drivers can enable audio sources to be directly connected to the cell. Audio signals can be routed between the drivers and various software processes executing on the processing system of the cell using an audio server 3506.
As noted above, audio signals captured by microphones can be utilized for a variety of applications including (but not limited to) calibration, equalization, ranging, and/or voice command control. In the illustrated embodiment, audio signals from the microphone can be routed from the audio and midi application 3502 to a microphone processor 3508 using the audio server 3506. The microphone processor can perform functions associated with the manner in which the cell generates spatial audio such as (but not limited to) calibration, equalization, and/or ranging. In several embodiments, the microphone is utilized to capture voice commands and the microphone processor can process the microphone signals and provide them to word detection and/or voice assistant clients 3510. When command words are detected, the voice assistant clients 3510 can provide audio and/or audio commands to cloud services for additional processing. The voice assistant clients 3510 can also provide responses from the voice assistant cloud services to the application software of the cell (e.g. mapping voice commands to controls of the cell). The application software of the cell can then implement the voice commands as appropriate to the specific voice command.
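As a non-limiting sketch of the microphone routing described above, the following Python code sends frames either to calibration-style analysis or through a wake-word check to a voice assistant client; the classes are simplified placeholders and are not APIs from the disclosure.

```python
import numpy as np


class WakeWordDetector:
    """Toy detector: treats any frame whose RMS exceeds a threshold as a command."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def detect(self, frame: np.ndarray) -> bool:
        return float(np.sqrt(np.mean(frame ** 2))) > self.threshold


class VoiceAssistantClient:
    def send_audio(self, frame: np.ndarray) -> None:
        print(f"forwarding {frame.size} samples to the voice assistant service")


def route_mic_frame(frame: np.ndarray, mode: str,
                    detector: WakeWordDetector,
                    assistant: VoiceAssistantClient) -> None:
    if mode == "calibration":
        # e.g. accumulate the frame for room-response / equalization analysis
        return
    if detector.detect(frame):
        assistant.send_audio(frame)


route_mic_frame(np.ones(256), "voice", WakeWordDetector(), VoiceAssistantClient())
```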
In several embodiments, the cell receives audio from a network audio source. In the illustrated embodiment, a network audio source process 3512 is provided to manage communication with one or more remote audio sources. The network audio source process can manage authentication, streaming, digital rights management, and/or any other processes that the cell is required to perform by a particular network audio source to receive and play back audio. As is discussed further below, the received audio can be forwarded to other cells using a source server process 3514 or provided to a sound server 3516.
The cell can forward a source to another cell using the source server 3514. The source can be (but is not limited to) an audio source directly connected to the cell via a connector, and/or a source obtained from a network audio source via the network audio source process 3512. Sources can be forwarded between a primary in a first group of cells and a primary in a second group of cells to synchronize playback of the source between the two groups of cells. The cell can also receive one or more sources from another cell or a network connected source input device via the source server 3514.
The sound server 3516 can coordinate audio playback on the cell. When the cell is configured as a primary, the sound server 3516 can also coordinate audio playback on secondary cells. When the cell is configured as a primary, the sound server 3516 can receive an audio source and process the audio source for rendering using the drivers on the cell. As can readily be appreciated, any of a variety of spatial audio processing techniques can be utilized to process the audio source to obtain spatial audio objects and to render audio using the cell's drivers based upon the spatial audio objects. In a number of embodiments, the cell software implements a nested architecture similar to the various nested architectures described above in which the source audio is used to obtain spatial audio objects. The sound server 3516 can generate the appropriate spatial audio objects for a particular audio source and then spatially encode the spatial audio objects. In several embodiments, the audio sources can already be spatially encoded (e.g. encoded in an ambisonic format) and so the sound server 3516 need not perform spatial encoding. The sound server 3516 can decode spatial audio to a virtual speaker layout. The audio signals for the virtual speakers can then be used by the sound server to decode audio signals specific to the location of the cell and/or locations of cells within a group. In several embodiments, the process of obtaining audio signals for each cell involves spatially encoding the audio inputs of the virtual speakers based upon the location of the cell and/or other cells within a group of cells. The spatial audio for each cell can then be decoded into separate audio signals for each set of drivers included in the cell. In a number of embodiments, the audio signal for the cell can be provided to the audio and midi application 3502, which generates the individual driver inputs. Where the cell is a primary cell within a group of cells, the sound server 3516 can transmit the audio signals for each of the secondary cells over the network. In many embodiments, the audio signals are transmitted via unicast. In several embodiments, some of the audio signals are unicast and at least one signal is multicast (e.g. a bass signal that is used for rendering by all cells within a group). In a number of embodiments, the sound server 3516 generates direct and diffuse audio signals that are utilized by the audio and midi application 3502 to generate inputs to the cell's drivers using the hardware drivers. Direct and diffuse signals can also be generated by the sound server 3516 and provided to secondary cells.
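As a non-limiting illustration of the pipeline described above, the following Python sketch encodes a sound object into a first-order ambisonic signal and decodes it to a ring of virtual speakers; the layout, panning law, and basic sampling decoder are assumptions and not the disclosed implementation.

```python
import numpy as np


def encode_fo_ambisonics(sample: float, azimuth: float) -> np.ndarray:
    """Encode a mono sample at a horizontal azimuth (radians) into first-order
    ambisonics; only the horizontal components W, Y, X are used here."""
    w = sample
    y = sample * np.sin(azimuth)
    x = sample * np.cos(azimuth)
    return np.array([w, y, x])


def decode_to_virtual_speakers(bformat: np.ndarray,
                               speaker_azimuths: np.ndarray) -> np.ndarray:
    """Basic sampling (projection) decoder to a ring of virtual speakers."""
    w, y, x = bformat
    return 0.5 * (w + y * np.sin(speaker_azimuths) + x * np.cos(speaker_azimuths))


# Four virtual speakers at 0, 90, 180, and 270 degrees.
ring = np.radians([0.0, 90.0, 180.0, 270.0])
virtual_feeds = decode_to_virtual_speakers(
    encode_fo_ambisonics(1.0, np.radians(45.0)), ring)
print(virtual_feeds)  # largest gains at the 0 and 90 degree virtual speakers
```

In a fuller system, the virtual speaker feeds would in turn be mapped to per-cell and then per-driver signals based on the measured cell positions; the sketch stops at the virtual speaker layout for brevity.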
When the cell is a secondary cell, the sound server 3516 can receive audio signals that were generated on a primary cell and provided to the cell via a network. The cell can route the received audio signals to the audio and midi application 3502, which generates the individual driver inputs in the same manner as if the audio signals had been generated by the cell itself.
Various potential implementations of sound servers can be utilized in cells similar to those described above with reference to
The source graphs 3602 can be configured in a variety of different ways depending upon the nature of the audio. In several embodiments, the cell can receive sources that are mono, stereo, any of a variety of multichannel surround sound formats, and/or audio encoded in accordance with an ambisonic format. Depending upon the encoding of the audio, the source graph can map an audio signal or an audio channel to an audio object. As discussed above, the received source can be upmixed and/or downmixed to create a number of audio objects that is different from the number of audio signals/audio channels provided by the audio source. When the audio is encoded in an ambisonic format, the source graph may be able to forward the audio source directly to the spatial encoder. In several embodiments, the ambisonic format may be incompatible with the spatial encoder and the audio source must be reencoded in an ambisonic format that is an appropriate input for the spatial encoder. As can readily be appreciated, an advantage of utilizing source graphs to process sources for input to a spatial encoder is that additional source graphs can be developed to support additional formats as appropriate to the requirements of specific applications.
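The following Python sketch shows one hypothetical source graph that maps the channels of a known format to positioned audio objects; the channel-to-azimuth assignments are illustrative defaults only and are not the layouts required by the disclosure.

```python
from typing import Dict, List, Tuple

# Nominal azimuths (degrees) for a few common formats (illustrative only).
FORMAT_LAYOUTS: Dict[str, List[float]] = {
    "mono":   [0.0],
    "stereo": [-30.0, 30.0],
    "5.1":    [-30.0, 30.0, 0.0, 0.0, -110.0, 110.0],  # L, R, C, LFE, Ls, Rs
}


def source_graph(fmt: str,
                 channels: List[List[float]]) -> List[Tuple[float, List[float]]]:
    """Return (azimuth, signal) audio objects for each channel of a known format."""
    layout = FORMAT_LAYOUTS[fmt]
    if len(channels) != len(layout):
        raise ValueError(f"expected {len(layout)} channels for {fmt}")
    return list(zip(layout, channels))


objects = source_graph("stereo", [[0.1, 0.2], [0.3, 0.4]])
print(objects)  # [(-30.0, [0.1, 0.2]), (30.0, [0.3, 0.4])]
```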
A variety of spatial encoders can be utilized in sound servers similar to the sound server shown in
A graph for generating individual driver feeds based upon three audio signals corresponding to feeds for each of the sets of drivers associated with each of the horns is illustrated in
While various nested architectures employing a variety of spatial audio encoding techniques are described above, any of a number of spatial audio reproduction processes including (but not limited to) distributed spatial audio reproduction processes and/or spatial audio reproduction processes that utilize virtual speaker layouts to determine the manner in which to render spatial audio can be utilized as appropriate to the requirements of different applications in accordance with various embodiments of the invention. Furthermore, a number of different spatial location metadata formats and components are described above. It should be readily appreciated that the spatial layout metadata generated and distributed within a spatial audio system is not in any way limited to specific pieces of data and/or specific formats. The components and/or encoding of spatial layout metadata is largely dependent upon the requirements of a given application. Accordingly, it should be appreciated that any of the above nested architectures and/or spatial encoding techniques can be utilized in combination and are not limited to specific combinations. Furthermore, specific techniques can be utilized in processes other than those specifically disclosed herein in accordance with certain embodiments of the invention.
Although specific methods for using spatialization shaders are discussed above, many different methods, such as (but not limited to) those that use different tuning parameters, can be implemented in accordance with many different embodiments of the invention. It is therefore to be understood that the present invention may be practiced in ways other than specifically described, without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.
The current application claims the benefit of and priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/264,089 entitled “Systems and Methods for Rendering Spatial Audio using Spatialization Shaders” filed Nov. 15, 2021. The disclosure of U.S. Provisional Patent Application No. 63/264,089 is hereby incorporated by reference in its entirety for all purposes.
Number | Date | Country
---|---|---
63264089 | Nov 2021 | US