The present disclosure pertains to devices, systems and methods for rendering audio data for playback on audio devices.
Audio devices, including but not limited to smart audio devices, have been widely deployed and are becoming common features of many homes. Although existing systems and methods for controlling audio devices provide benefits, improved systems and methods would be desirable.
Throughout this disclosure, including in the claims, the terms “speaker,” “loudspeaker” and “audio reproduction transducer” are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
Throughout this disclosure including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
As used herein, a “smart device” is an electronic device, generally configured for communication with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, etc., that can operate to some extent interactively and/or autonomously. Several notable types of smart devices are smartphones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains and smart audio devices. The term “smart device” may also refer to a device that exhibits some properties of ubiquitous computing, such as artificial intelligence.
Herein, we use the expression “smart audio device” to denote a smart device which is either a single-purpose audio device or a multi-purpose audio device (e.g., an audio device that implements at least some aspects of virtual assistant functionality). A single-purpose audio device is a device (e.g., a television (TV)) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera), and which is designed largely or primarily to achieve a single purpose. For example, although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run locally, including the application of watching television. In this sense, a single-purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single-purpose audio devices may be configured to group together to achieve playing of audio over a zone or user configured area.
One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured for communication. Such a multi-purpose audio device may be referred to herein as a “virtual assistant.” A virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera). In some examples, a virtual assistant may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud-enabled or otherwise not completely implemented in or on the virtual assistant itself. In other words, at least some aspects of virtual assistant functionality, e.g., speech recognition functionality, may be implemented (at least in part) by one or more servers or other devices with which a virtual assistant may communicate via a network, such as the Internet. Virtual assistants may sometimes work together, e.g., in a discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, e.g., the one which is most confident that it has heard a wakeword, responds to the wakeword. The connected virtual assistants may, in some implementations, form a sort of constellation, which may be managed by one main application which may be (or implement) a virtual assistant.
Herein, “wakeword” is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to “awake” denotes that the device enters a state in which it awaits (in other words, is listening for) a sound command. In some instances, what may be referred to herein as a “wakeword” may include more than one word, e.g., a phrase.
Herein, the expression “wakeword detector” denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold which is tuned to give a reasonable compromise between rates of false acceptance and false rejection. Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
As used herein, the terms “program stream” and “content stream” refer to a collection of one or more audio signals, and in some instances video signals, at least portions of which are meant to be heard together. Examples include a selection of music, a movie soundtrack, a movie, a television program, the audio portion of a television program, a podcast, a live voice call, a synthesized voice response from a smart assistant, etc. In some instances, the content stream may include multiple versions of at least a portion of the audio signals, e.g., the same dialogue in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at one time.
At least some aspects of the present disclosure may be implemented via one or more audio processing methods. In some instances, the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some such methods may involve receiving, by a control system and via an interface system, audio data, the audio data including one or more audio signals and associated spatial data. The spatial data may indicate an intended perceived spatial position corresponding to an audio signal of the one or more audio signals. The intended perceived spatial position may, for example, correspond to a channel of a channel-based audio format. Alternatively, or additionally, the intended perceived spatial position may correspond to positional metadata, for example, to positional metadata of an object-based audio format.
In some examples, the method may involve receiving, by the control system and via the interface system, listener position data indicating a listener position corresponding to a person in an audio environment. According to some examples, the method may involve receiving, by the control system and via the interface system, loudspeaker position data indicating a position of each loudspeaker of a plurality of loudspeakers in the audio environment. In some examples, the method may involve receiving, by the control system and via the interface system, loudspeaker orientation data. In some such examples, the loudspeaker orientation data may indicate a loudspeaker orientation angle between (a) a direction of maximum acoustic radiation for each loudspeaker of the plurality of loudspeakers in the audio environment; and (b) the listener position. In some such examples, the listener position may be taken relative to a position of the corresponding loudspeaker. According to some examples, the loudspeaker orientation angle for a particular loudspeaker may be an angle between (a) the direction of maximum acoustic radiation for the particular loudspeaker and (b) a line between a position of the particular loudspeaker and the listener position.
According to some examples, the method may involve rendering, by the control system, the audio data for reproduction via at least a subset of the plurality of loudspeakers in the audio environment, to produce rendered audio signals. In some examples, the rendering may be based, at least in part, on the spatial data, the listener position data, the loudspeaker position data and the loudspeaker orientation data. In some examples, the rendering may involve applying a loudspeaker orientation factor that tends to reduce a relative activation of a loudspeaker based, at least in part, on an increased loudspeaker orientation angle.
In some examples, the method may involve providing, via the interface system, the rendered audio signals to at least the subset of the loudspeakers of the plurality of loudspeakers in the audio environment.
According to some examples, the method may involve estimating a loudspeaker importance metric for at least the subset of the loudspeakers. For example, the method may involve estimating a loudspeaker importance metric for each loudspeaker of the subset of the loudspeakers. In some examples, the loudspeaker importance metric may correspond to a loudspeaker's importance for rendering an audio signal at the audio signal's intended perceived spatial position. According to some examples, the rendering for each loudspeaker may be based, at least in part, on the loudspeaker importance metric. In some examples, the rendering for each loudspeaker may involve modifying an effect of the loudspeaker orientation factor based, at least in part, on the loudspeaker importance metric. According to some examples, the rendering for each loudspeaker may involve reducing an effect of the loudspeaker orientation factor based, at least in part, on an increased loudspeaker importance metric.
In some examples, the method may involve determining whether a loudspeaker orientation angle equals or exceeds a threshold loudspeaker orientation angle. According to some examples, the audio processing method may involve applying the loudspeaker orientation factor only if the loudspeaker orientation angle equals or exceeds the threshold loudspeaker orientation angle. In some examples, the loudspeaker importance metric may be based, at least in part, on a distance between an eligible loudspeaker and a line between (a) a first loudspeaker having a shortest clockwise angular distance from the eligible loudspeaker and (b) a second loudspeaker having a shortest counterclockwise angular distance from the eligible loudspeaker. In some such examples, an eligible loudspeaker may be a loudspeaker having a loudspeaker orientation angle that equals or exceeds the threshold loudspeaker orientation angle. In some instances, the first loudspeaker and the second loudspeaker may be ineligible loudspeakers having loudspeaker orientation angles that are less than the threshold loudspeaker orientation angle.
According to some examples, the rendering may involve determining relative activations for at least the subset of the loudspeakers by optimizing a cost that is a function of: a model of perceived spatial position of an audio signal of the one or more audio signals when played back over the subset of loudspeakers in the audio environment; a measure of proximity of the intended perceived spatial position of the audio signal to a position of each loudspeaker of the subset of loudspeakers; and one or more additional dynamically configurable functions. In some such examples, at least one of the one or more additional dynamically configurable functions may be based, at least in part, on the loudspeaker orientation factor. According to some such examples, at least one of the one or more additional dynamically configurable functions may be based, at least in part, on the loudspeaker importance metric. In some such examples, at least one of the one or more additional dynamically configurable functions may be based, at least in part, on a measurement or estimate of acoustic transmission from each loudspeaker in the audio environment to other loudspeakers in the audio environment.
Aspects of some disclosed implementations include a control system configured (e.g., programmed) to perform one or more disclosed methods or steps thereof, and a tangible, non-transitory, computer readable medium which implements non-transitory storage of data (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more disclosed methods or steps thereof. For example, some disclosed embodiments can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including one or more disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more disclosed methods (or steps thereof) in response to data asserted thereto.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Playback of spatial audio in a consumer environment has typically been tied to a prescribed number of loudspeakers placed in prescribed positions. Some examples include Dolby 5.1 and Dolby 7.1 surround sound. More recently, immersive, object-based spatial audio formats have been introduced, such as Dolby Atmos™, which break this association between the audio content and specific loudspeaker locations. Instead, the content may be described as a collection of individual audio objects, each of which may have associated time-varying metadata, such as positional metadata for describing the desired perceived location of said audio objects in three-dimensional space. At playback time, the content is transformed into loudspeaker feeds by a renderer which adapts to the number and location of loudspeakers in the playback system. Many such renderers, however, still constrain the locations of the set of loudspeakers to be one of a set of prescribed layouts (for example Dolby 3.1.2, Dolby 5.1.2, Dolby 7.1.4, Dolby 9.1.6, etc., with Dolby Atmos).
“Flexible rendering” methods have recently been developed that allow object-based audio—as well as legacy channel-based audio—to be rendered flexibly over an arbitrary number of loudspeakers placed at arbitrary positions. These methods generally require that the renderer have knowledge of the number and physical locations of the loudspeakers in the listening space. For such a system to be practical for the average consumer, an automated method for locating the loudspeakers is desirable. Accordingly, methods for automatically locating the positions of loudspeakers within a listening space, which may also be referred to herein as an “audio environment,” have recently been developed. Detailed examples of flexible rendering and automatic audio device location are provided herein.
Simultaneous with the introduction of object-based spatial audio in the consumer space has been the rapid adoption of so-called “smart speakers”, such as the Amazon Echo™ line of products. The tremendous popularity of these devices can be attributed to the simplicity and convenience afforded by wireless connectivity and an integrated voice interface (Amazon's Alexa™, for example), but the sonic capabilities of these devices have generally been limited, particularly with respect to spatial audio. In most cases these devices are constrained to mono or stereo playback. However, combining the aforementioned flexible rendering and auto-location technologies with a plurality of orchestrated smart speakers may yield a system with very sophisticated spatial playback capabilities that nevertheless remains extremely simple for the consumer to set up. A consumer can place as many or as few of the speakers as desired, wherever is convenient, without the need to run speaker wires (owing to the wireless connectivity), and the built-in microphones can be used to automatically locate the speakers for the associated flexible renderer.
The above-described flexible rendering methods take into account the locations of loudspeakers with respect to a listening position or area, but they do not take into account the orientation of the loudspeakers with respect to the listening position or area. In general, these methods model speakers as radiating directly toward the listening position, but in reality this may not be the case. The more that a loudspeaker's orientation points away from the intended listening position, the more that several acoustic properties may change, with two being most notable. First, the overall equalization heard at the listening position may change, with high frequencies usually falling off due to most loudspeakers exhibiting higher degrees of directivity at higher frequencies. Second, the ratio of direct to reflected sound at the listening position may decrease as more acoustic energy is directed away from the listening position and interacts with the room before eventually being heard.
In view of the potential effects of loudspeaker orientation, some disclosed implementations may involve one or more of the following:
According to some alternative implementations the apparatus 150 may be, or may include, a server. In some such examples, the apparatus 150 may be, or may include, an encoder. Accordingly, in some instances the apparatus 150 may be a device that is configured for use within an audio environment, whereas in other instances the apparatus 150 may be a device that is configured for use in “the cloud,” e.g., a server.
In this example, the apparatus 150 includes an interface system 155 and a control system 160. The interface system 155 may, in some implementations, be configured for communication with one or more other devices of an audio environment. The audio environment may, in some examples, be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc. The interface system 155 may, in some implementations, be configured for exchanging control information and associated data with audio devices of the audio environment. The control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 150 is executing.
The interface system 155 may, in some implementations, be configured for receiving, for providing, or for both receiving and providing, a content stream. The content stream may include audio data. The audio data may include, but is not limited to, audio signals. In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. The metadata may, for example, have been provided by what may be referred to herein as an “encoder.” In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 155 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 155 may include one or more wireless interfaces. The interface system 155 may include one or more devices for implementing a user interface, such as one or more microphones, one or more loudspeakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 155 may include one or more interfaces between the control system 160 and a memory system, such as the optional memory system 165 shown in
The control system 160 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 160 may reside in more than one device. For example, in some implementations a portion of the control system 160 may reside in a device within one of the environments depicted herein and another portion of the control system 160 may reside in a device that is outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc. In other examples, a portion of the control system 160 may reside in a device within one of the environments depicted herein and another portion of the control system 160 may reside in one or more other devices of the environment. For example, control system functionality may be distributed across multiple smart audio devices of an environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of the control system 160 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 160 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc. The interface system 155 also may, in some examples, reside in more than one device.
In some implementations, the control system 160 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 160 may be configured to receive, via the interface system 155, audio data, listener position data, loudspeaker position data and loudspeaker orientation data. The audio data may include one or more audio signals and associated spatial data indicating an intended perceived spatial position corresponding to an audio signal. The listener position data may indicate a listener position corresponding to a person in an audio environment. The loudspeaker position data may indicate a position of each loudspeaker of a plurality of loudspeakers in the audio environment. The loudspeaker orientation data may indicate a loudspeaker orientation angle between (a) a direction of maximum acoustic radiation for each loudspeaker of the plurality of loudspeakers in the audio environment; and (b) the listener position, relative to a corresponding loudspeaker.
In some such examples, the control system 160 may be configured to render the audio data for reproduction via at least a subset of the plurality of loudspeakers in the audio environment, to produce rendered audio signals. According to some such examples, the rendering may be based, at least in part, on the spatial data, the listener position data, the loudspeaker position data and the loudspeaker orientation data. In some such examples, the rendering may involve applying a loudspeaker orientation factor that tends to reduce a relative activation of a loudspeaker based, at least in part, on an increased loudspeaker orientation angle.
In some examples, the control system 160 may be configured to estimate a loudspeaker importance metric for at least the subset of the loudspeakers. The loudspeaker importance metric may correspond to a loudspeaker's importance for rendering an audio signal at the audio signal's intended perceived spatial position. In some such examples, the rendering for each loudspeaker may be based, at least in part, on the loudspeaker importance metric.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 165 shown in
In some examples, the apparatus 150 may include the optional microphone system 170 shown in
According to some implementations, the apparatus 150 may include the optional loudspeaker system 175 shown in
In some implementations, the apparatus 150 may include the optional sensor system 180 shown in
In some implementations, the apparatus 150 may include the optional display system 185 shown in
According to some such examples the apparatus 150 may be, or may include, a smart audio device. In some such implementations the apparatus 150 may be, or may include, a wakeword detector. For example, the apparatus 150 may be, or may include, a virtual assistant.
Previously-implemented flexible rendering methods mentioned earlier take into account the locations of loudspeakers with respect to a listening position or area, but they do not take into account the orientation of the loudspeakers with respect to the listening position or area. In general, these methods model speakers as radiating directly toward the listening position, but in reality this may not be the case. Associated with most loudspeakers is a direction along which acoustic energy is maximally radiated, and ideally this direction is pointed at the listening position or area. For a simple device with a single loudspeaker driver mounted in an enclosure, the side of the enclosure in which the loudspeaker is mounted would be considered the “front” of the device, and ideally the device is oriented such that this front is facing the listening position or area. More complex devices may contain multiple individually-addressable loudspeakers pointing in different directions with respect to the device. In such cases, the orientation of each individual loudspeaker with respect to the listening position or area may be considered when the overall orientation of the device with respect to the listening position or area is set. Additionally, devices may contain speakers with nonzero elevation (for example, oriented upward from the device); the orientation of these speakers with respect to the listening position may simply be considered in three dimensions rather than two.
According to this example, the audio environment 200 includes audio devices 210A, 210B and 210C. The audio devices 210A-210C may, in some examples, be instances of the apparatus 150 of
The orientation of each loudspeaker may be represented in various ways, depending on the particular implementation. In this example, the orientation of each loudspeaker is represented by the angle between the loudspeaker's direction of maximum radiation and the line connecting its associated device to the listening position. This orientation angle may vary between −180 and 180 degrees, with 0 degrees indicating that a loudspeaker is pointed directly at the listening position and −180 or 180 degrees indicating that a loudspeaker is pointed completely away from the listening position. The orientation angle of L1, represented by the value θ1 in the figure, is close to zero, indicating that loudspeaker L1 is oriented almost directly at the listening position. On the other hand, θ2 is close to 180 degrees, meaning that loudspeaker L2 is oriented almost directly away from the listening position. In audio device 210C, θ3 and θ4 have relatively small values, with absolute values less than 90 degrees, indicating that L3 and L4 are oriented substantially toward the listening position. However, θ5 has a relatively large value, with an absolute value greater than 90 degrees, indicating that L5 is oriented substantially away from the listening position. The positions and orientations of a set of loudspeakers may be determined, or at least estimated, according to various techniques, including but not limited to those disclosed herein.
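As an illustration of how such an orientation angle might be computed from a loudspeaker's position, its direction of maximum acoustic radiation, and the listening position, the following two-dimensional sketch is offered. It is a minimal example rather than the disclosed implementation, and the function and variable names are hypothetical.

```python
import numpy as np

def orientation_angle(speaker_pos, radiation_dir, listener_pos):
    """Signed angle, in radians wrapped to [-pi, pi), between a loudspeaker's direction
    of maximum acoustic radiation and the line from the loudspeaker to the listening
    position. 0 means the loudspeaker points directly at the listening position."""
    to_listener = np.asarray(listener_pos, dtype=float) - np.asarray(speaker_pos, dtype=float)
    d = np.asarray(radiation_dir, dtype=float)
    angle = np.arctan2(d[1], d[0]) - np.arctan2(to_listener[1], to_listener[0])
    return (angle + np.pi) % (2 * np.pi) - np.pi  # wrap the difference to [-pi, pi)

# A loudspeaker at (2, 0) radiating toward -x, with the listener at the origin,
# is pointed directly at the listening position, so the angle is approximately 0 degrees.
print(np.degrees(orientation_angle((2.0, 0.0), (-1.0, 0.0), (0.0, 0.0))))  # ~0.0
```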
As noted above, the more that a loudspeaker's orientation points away from the intended listening position, the more that several acoustic properties may change, with two acoustic properties being most prominent. First, the overall equalization heard at the listening position may change, with high frequencies usually decreasing because most loudspeakers have higher degrees of directivity at higher frequencies. Second, the ratio of direct to reflected sound at the listening position may decrease, because relatively more acoustic energy is directed away from the listening position and interacts with walls, floors, objects, etc., in the audio environment before eventually being heard. The first issue can often be mitigated to a certain degree with equalization, but the second issue cannot.
When a loudspeaker that points away from the intended listening position is combined with others for the purposes of spatial reproduction, this second issue can be particularly problematic. Imaging of the elements of a spatial mix at their desired locations is generally best achieved when the loudspeakers contributing to this imaging all have a relatively high direct-to-reflected ratio at the listening position. If a particular loudspeaker does not have such a ratio, because the loudspeaker is oriented away from the listening position, then the imaging may become inaccurate or “blurry”. In some examples, it may be beneficial to exclude this loudspeaker from the rendering process to improve imaging. However, in some instances, excluding such a loudspeaker from the rendering process may cause even larger impairments to the overall spatial rendering than including the loudspeaker in the rendering process. For example, if a loudspeaker is pointing away from the listening position but it is the only loudspeaker to the left of the listening position, it may be better to keep this loudspeaker as part of the rendering rather than having the entire spatial mix collapse towards the right of the listening position due to its exclusion.
Some disclosed examples involve navigating such choices for a rendering system in which both the locations and orientations of loudspeakers are specified with respect to the listening position. For example, some disclosed examples involve rendering a set of one or more audio signals, each audio signal having an associated desired perceived spatial position, over a set of two or more loudspeakers. In some such examples, the location and orientation of each loudspeaker of a set of loudspeakers (for example, relative to a desired listening position or area) are provided to the renderer. According to some such examples, the relative activations of each loudspeaker may be computed as a function of the desired perceived spatial positions of the one or more audio signals and the locations and orientations of the loudspeakers. In some such examples, for any given location of a loudspeaker, the activation of a loudspeaker may be reduced as the orientation of the loudspeaker increases away from the listening position. According to some such examples, the degree of this reduction may itself be reduced as a function of a measure of the loudspeaker's importance for rendering any audio signal at its desired perceived spatial position.
The following paragraphs disclose an implementation that may achieve the results that are described with reference
In some aspects, this cost function may be represented by the following equation:
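Equation 1 itself is not reproduced in this text. A general form consistent with the description in the following paragraph, offered as a reconstruction rather than the exact notation of equation 1, is:

\[
C(g) = C_{\text{spatial}}\big(g, \vec{o}, \{\vec{s}_i\}\big) + C_{\text{proximity}}\big(g, \vec{o}, \{\vec{s}_i\}\big) + \sum_j C_j\big(g, \{\{\hat{o}\}, \{\hat{s}_i\}, \{\hat{e}\}\}_j\big)
\]

with the speaker activations obtained by minimizing C over g.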
The derivation of equation 1 is set forth in detail below. In this example, the set {s⃗i} represents the positions of each loudspeaker of a set of M loudspeakers, o⃗ represents the desired perceived spatial position of an audio signal, and g represents an M-dimensional vector of speaker activations. The first term of the cost function is represented by Cspatial, and the second is split into Cproximity and a sum of terms Cj(g, {{ô}, {ŝi}, {ê}}j) representing the additional costs. Each of these additional costs may be computed as a function of the general set {{ô}, {ŝi}, {ê}}j, with {ô} representing a set of one or more properties of the audio signals being rendered, {ŝi} representing a set of one or more properties of the speakers over which the audio is being rendered, and {ê} representing one or more additional external inputs. In other words, each term Cj(g, {{ô}, {ŝi}, {ê}}j) returns a cost as a function of the activations g in relation to a combination of one or more properties of the audio signals, speakers, and/or external inputs. It should be noted that the set {{ô}, {ŝi}, {ê}}j contains, at a minimum, only one element from any of {ô}, {ŝi}, or {ê}.
In some examples, one or more aspects of the present disclosure may be implemented by introducing one or more additional cost terms Cj that are a function of {ŝi}, which represents properties of the loudspeakers in the audio environment. According to some such examples, the cost may be computed as a function of both the position and orientation of each speaker with respect to the listening position.
In some such examples, the general cost function of equation 1 may be represented as a matrix quadratic, as follows:
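Equation 2 is likewise not reproduced in this text. A matrix-quadratic form consistent with the surrounding description is sketched below; the assumption that A is an M×M matrix, B a 1×M row vector and C a scalar is the author of this reconstruction's, and the exact form of equation 2 may differ:

\[
C(g) = g^{*} A\, g + B\, g + C
\]

where * denotes conjugate transpose and g is the M-dimensional vector of speaker activations defined above.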
The derivation of equation 2 is set forth in detail below. In some examples, the additional cost terms may each be parametrized by a diagonal matrix of speaker penalty terms, e.g., as follows:
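Equation 3 is also absent from this text. Based on the description above and the per-speaker penalty terms wij introduced in the next paragraph, one plausible form for each additional cost term, offered only as an illustrative assumption, is a weighted quadratic:

\[
C_j(g) = g^{*} W_j\, g, \qquad W_j = \operatorname{diag}\big(w_{1j}, \ldots, w_{Mj}\big)
\]

in which a larger penalty term wij discourages activation of loudspeaker i within cost term j.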
Some aspects of the present disclosure may be implemented by computing a set of these speaker penalty terms wij as a function of both the position and orientation of each speaker i. According to some examples, penalty terms may be computed over different subsets of loudspeakers across frequency, depending on each loudspeaker's capabilities (for example, according to each loudspeaker's ability to accurately reproduce low frequencies).
The following discussion assumes that the position and orientation of each loudspeaker i are known, in this example with respect to a listening position. Some detailed examples of determining, or at least estimating, the position and orientation of each loudspeaker i are set forth below. Some previously-disclosed flexible rendering methods already took into account the position of each loudspeaker with respect to the listening position. Some flexible rendering methods of the present disclosure further incorporate the orientation of the loudspeakers with respect to the listening position, as well as the positions of loudspeakers with respect to each other. The loudspeaker orientations have already been parameterized in this disclosure as orientation angles θi. The positions of loudspeakers with respect to each other, which may reflect the potential for impairment to the spatial rendering introduced by the speaker's penalization, are parameterized herein as αi, which also may be referred to herein simply as α. Accordingly, α may be referred to herein as a “loudspeaker importance metric.”
According to some disclosed examples, loudspeakers may be nominally divided into two categories, “eligible” and “ineligible,” meaning eligible or ineligible for penalization according to loudspeaker orientation. In some such examples, a determination of whether a loudspeaker is eligible or ineligible may be based, at least in part, on the loudspeaker's orientation angle θi. In some such examples, the determination may be based, at least in part, on whether the loudspeaker's orientation angle θi equals or exceeds an orientation angle threshold Tθ. In some such examples, if a loudspeaker meets the condition |θi| > Tθ, the loudspeaker is eligible for penalization according to loudspeaker orientation; otherwise, the loudspeaker is ineligible. In one example, the orientation angle threshold Tθ may be 11π/18 radians (110 degrees). However, in other examples, the orientation angle threshold Tθ may be greater than or less than 110 degrees, e.g., 100 degrees, 105 degrees, 115 degrees, 120 degrees, etc. According to some examples, the position of each eligible speaker may be considered in relation to the positions of the ineligible, or well-oriented, loudspeakers. In some such examples, for an eligible loudspeaker i, the loudspeakers i1 and i2 with the shortest clockwise and counterclockwise angular distances ϕ1 and ϕ2 from i may be identified in the set of ineligible loudspeakers. Angular distances between speakers may, in some such examples, be determined by casting loudspeaker positions onto a unit circle with the listening position at the center of the unit circle.
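To illustrate the eligibility test and the search for the nearest ineligible neighbors just described, the following sketch is offered. It is a minimal illustration rather than the disclosed implementation; the function and variable names are hypothetical, angles are in radians, and loudspeaker azimuths are assumed to have already been measured about the listening position.

```python
import numpy as np

def split_eligible(orientation_angles, threshold=11 * np.pi / 18):
    """Indices of loudspeakers eligible / ineligible for orientation-based penalization."""
    mags = np.abs(np.asarray(orientation_angles, dtype=float))
    eligible = np.flatnonzero(mags > threshold)
    ineligible = np.flatnonzero(mags <= threshold)
    return eligible, ineligible

def nearest_ineligible_angles(azimuths, i, ineligible):
    """Shortest clockwise and counterclockwise angular distances (phi1, phi2) from
    eligible loudspeaker i to any ineligible loudspeaker, with all positions cast
    onto a unit circle centered at the listening position."""
    az = np.asarray(azimuths, dtype=float)
    diffs = (az[ineligible] - az[i]) % (2 * np.pi)  # counterclockwise offsets in [0, 2*pi)
    phi_ccw = diffs.min()
    phi_cw = (2 * np.pi - diffs).min()
    return phi_cw, phi_ccw
```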
In order to encapsulate the potential impairment, in some examples a loudspeaker importance metric α may be devised as a function of ϕ1 and ϕ2. In some examples, the loudspeaker importance metric αi for a loudspeaker i corresponds with the unit perpendicular distance from the loudspeaker i to a line connecting loudspeakers i1 and i2, which are two loudspeakers adjacent to the loudspeaker i. Following is one such example in which the loudspeaker importance metric α is expressed as a function of ϕ1 and ϕ2.
Each of the internal triangles 505a, 505b and 505c is an isosceles triangle having center angles ϕ1, ϕ2 and ϕ3, respectively. An arbitrary internal triangle would also be isosceles and would have a center angle ϕn. The interior angles of a triangle sum to π radians. Each of the remaining congruent angles of the arbitrary internal triangle is therefore half of (π−ϕn) radians. One such angle, ζn = (π−ϕn)/2, is shown in
The law of sines defines the relationship between the interior angles a, b and c of a triangle and the lengths A, B and C of the sides opposite those interior angles, as follows: A/sin a = B/sin b = C/sin c.
In the example of triangle 605, the law of sines indicates:
Therefore, α = C1 sin a = C1 sin(ζ1 + ζ3) = 2 sin(ϕ1/2) sin(ζ1 + ζ3), where C1 = 2 sin(ϕ1/2) represents the length of the side of triangle 605 connecting loudspeakers i and i1, and a = ζ1 + ζ3 represents the interior angle of triangle 605 at loudspeaker i1. However,
Accordingly, the loudspeaker importance metric α may be expressed as follows:
In some implementations, ϕ1 or ϕ2 may be greater than π radians. In such instances, if α were computed according to equation 4, α would project outside the circle. In some such examples, equation 4 may be modified to
In some examples, if ϕ1 = ϕ2, α may be computed as
because this function fits continuously into equation 4 when ϕ1 and ϕ2 are similar.
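Because equation 4 and its modified forms do not appear in this text, the sketch below simply computes the loudspeaker importance metric from the geometry described above: the perpendicular distance from loudspeaker i to the line connecting its nearest ineligible neighbors, with all three positions cast onto the unit circle. It is an illustration consistent with that geometric definition, not the disclosed closed-form expression, and the names used are hypothetical.

```python
import numpy as np

def importance_metric(phi_cw, phi_ccw):
    """Perpendicular distance from an eligible loudspeaker (placed at azimuth 0 on
    the unit circle) to the chord joining its nearest ineligible neighbors, which
    sit at azimuths -phi_cw (clockwise) and +phi_ccw (counterclockwise)."""
    p_i = np.array([1.0, 0.0])
    p1 = np.array([np.cos(phi_cw), -np.sin(phi_cw)])   # nearest clockwise neighbor
    p2 = np.array([np.cos(phi_ccw), np.sin(phi_ccw)])  # nearest counterclockwise neighbor
    chord = p2 - p1
    normal = np.array([-chord[1], chord[0]])           # degenerate if the neighbors coincide
    normal /= np.linalg.norm(normal)
    return abs(np.dot(p_i - p1, normal))

# With both neighbors a quarter circle away, the chord is a diameter and alpha is 1.
print(importance_metric(np.pi / 2, np.pi / 2))  # 1.0
```

Algebraically, this perpendicular distance to the line equals 2 sin(ϕ1/2) sin(ϕ2/2); whether that expression matches the exact form of equation 4 and its modifications cannot be confirmed from this text.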
With the layout of loudspeakers shown in
still holds. One may see that, as compared to that of
As before, the loudspeaker that is being evaluated will be referred to as loudspeaker i, and the loudspeakers adjacent to the loudspeaker that is being evaluated will be referred to as loudspeakers i1 and i2. Accordingly, in
In some examples, the loudspeaker importance metric αi may correspond to a particular behavior of the spatial cost system above. When the target audio object locations lie outside the convex hull of loudspeakers 805, according to some examples, the solution with the least possible error places audio objects on the convex hull of speakers. In some such examples, the line connecting loudspeakers i1 and i2 would be part of the convex hull of loudspeakers 805 if loudspeaker i were penalized to the extent that it is deactivated, and therefore this line would become part of the minimum error solution. For example, referring to
According to some examples, for each loudspeaker that is eligible for penalization based on that loudspeaker's orientation angle, the loudspeaker importance metric αi may be computed. The larger the value of αi, the larger the potential for error. This is demonstrated in
Depending on the relative magnitudes of penalties in a cost function optimization, any particular penalty may be designed to elicit either absolute or gradual behavior. In the case of the renderer cost function, a large enough penalty will exclude or disable a loudspeaker altogether, while a smaller penalty may quiet a loudspeaker without muting it. The arctangent function tan⁻¹ x is an advantageous functional form for penalties because it can be manipulated to reflect either behavior: as x → ±∞, tan⁻¹ x behaves effectively as a step function or switch, while near x = 0 it behaves effectively as a linear ramp. Intermediate ranges yield intermediate behavior. Therefore, selecting a range of the arctangent about x = 0 as the functional form of a penalty enables a significant level of control over system behavior.
For example, the penalty wij of equation 3 may be constructed generally as the product of unit arctangent functions of αi and θi, respectively, together with a scaling factor η that controls the precise penalty behavior. Equation 5 provides one such example:
In some examples, both x and y ∈ [0,1]. The specific scaling factor and respective arctangent functions may be constructed to ensure precise and gradual deactivation of loudspeaker i from use as a function of both θi and αi. In some examples, the arctangent functions x and y of equation 5 may be constructed as follows, with the scale factor η = 5.0 in these examples:
In equations 6 and 7, “r” represents an arctangent function tuning factor that corresponds with half of the range of the arctan function that is being sampled. For r=1, the total output space of the arctan function that is being sampled has a length of 2.
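Equations 5 through 7 are not reproduced in this text, so their exact forms are not shown here. The following sketch merely illustrates the general construction described above, namely a penalty formed from scaled, unit-range arctangent functions of θi and αi with a scale factor η and tuning factors r. Every functional detail in it is an assumption made for illustration only, not the disclosed equations.

```python
import numpy as np

def unit_arctan(u, r=1.0):
    """Map u in [0, 1] through a normalized slice of arctan of half-width r.
    Small r approximates a linear ramp; large r approaches a step (switch)."""
    return (np.arctan(r * (2.0 * u - 1.0)) + np.arctan(r)) / (2.0 * np.arctan(r))

def orientation_penalty(theta, alpha, theta_threshold=11 * np.pi / 18,
                        eta=5.0, r_theta=1.0, r_alpha=1.0):
    """Illustrative penalty for one loudspeaker: it grows as the orientation angle
    theta increases beyond the threshold and shrinks as the importance metric
    alpha increases (an important loudspeaker is penalized less)."""
    if abs(theta) <= theta_threshold:
        return 0.0  # ineligible loudspeakers are not penalized
    x = unit_arctan((abs(theta) - theta_threshold) / (np.pi - theta_threshold), r_theta)
    y = 1.0 - unit_arctan(min(alpha, 1.0), r_alpha)
    return eta * x * y
```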
In this example, block 1405 involves receiving, by a control system and via an interface system, audio data. According to this example, the audio data includes one or more audio signals and associated spatial data. In this example, the spatial data indicates an intended perceived spatial position corresponding to an audio signal of the one or more audio signals. In some such examples, the spatial data may be, or may include, metadata. According to some examples, the metadata may correspond to an audio object. In some such examples, the audio signal may correspond to the audio object. In some instances, the audio data may be part of a content stream of audio signals, and in some cases video signals, at least portions of which are meant to be heard together. Examples include a selection of music, a movie soundtrack, a movie, a television program, the audio portion of a television program, a podcast, a live voice call, a synthesized voice response from a smart assistant, etc. In some examples, the audio data may be received from another apparatus, e.g., via wireless communications. In other instances, the audio data may be received, or retrieved, from a memory of the same apparatus that includes the control system.
According to this example, block 1410 involves receiving, by the control system and via the interface system, listener position data. In this example, the listener position data indicates a listener position corresponding to a person in an audio environment. In some instances, the listener position data may indicate a position of the listener's head. In some implementations, block 1410, or another block of method 1400, may involve receiving listener orientation data. Various methods of estimating a listener position and orientation are disclosed herein.
In this example, block 1415 involves receiving, by the control system and via the interface system, loudspeaker position data indicating a position of each loudspeaker of a plurality of loudspeakers in the audio environment. In some examples, the plurality may include all loudspeakers in the audio environment, whereas in other examples the plurality may include only a subset of the total number of loudspeakers in the audio environment.
According to this example, block 1420 involves receiving, by the control system and via the interface system, loudspeaker orientation data. The loudspeaker orientation data may vary according to the particular implementation. In this example, the loudspeaker orientation data indicates a loudspeaker orientation angle between (a) a direction of maximum acoustic radiation for each loudspeaker of the plurality of loudspeakers in the audio environment; and (b) the listener position, relative to a corresponding loudspeaker. According to some such examples, the loudspeaker orientation angle for a particular loudspeaker may be an angle between (a) the direction of maximum acoustic radiation for the particular loudspeaker and (b) a line between a position of the particular loudspeaker and the listener position. In other examples, the loudspeaker orientation data may indicate a loudspeaker orientation angle according to another frame of reference, such as an audio environment coordinate system, an audio device reference frame, etc. Alternatively, or additionally, in some examples the loudspeaker orientation angle may not be defined according to a direction of maximum acoustic radiation for each loudspeaker, but may instead be defined in another manner, e.g., by the orientation of a device that includes the loudspeaker.
In this example, block 1425 involves rendering, by the control system, the audio data for reproduction via at least a subset of the plurality of loudspeakers in the audio environment, to produce rendered audio signals. According to this example, the rendering is based, at least in part, on the spatial data, the listener position data, the loudspeaker position data and the loudspeaker orientation data. In this example, the rendering involves applying a loudspeaker orientation factor that tends to reduce a relative activation of a loudspeaker based, at least in part, on an increased loudspeaker orientation angle. In this example, block 1430 involves providing, via the interface system, the rendered audio signals to at least the subset of the loudspeakers of the plurality of loudspeakers in the audio environment.
In some examples, method 1400 may involve estimating a loudspeaker importance metric for at least the subset of the loudspeakers. According to some examples, the loudspeaker importance metric may correspond to a loudspeaker's importance for rendering an audio signal at the audio signal's intended perceived spatial position. In some examples, the rendering for each loudspeaker may be based, at least in part, on the loudspeaker importance metric.
According to some implementations, the rendering for each loudspeaker may involve modifying an effect of the loudspeaker orientation factor based, at least in part, on the loudspeaker importance metric. In some such examples, the rendering for each loudspeaker may involve reducing an effect of the loudspeaker orientation factor based, at least in part, on an increased loudspeaker importance metric.
According to some examples, method 1400 may involve determining whether a loudspeaker orientation angle equals or exceeds a threshold loudspeaker orientation angle. In some such examples, method 1400 may involve applying the loudspeaker orientation factor only if the loudspeaker orientation angle equals or exceeds the threshold loudspeaker orientation angle. In some examples, an “eligible loudspeaker” may be a loudspeaker having a loudspeaker orientation angle that equals or exceeds the threshold loudspeaker orientation angle. In this context, an “eligible loudspeaker” is a loudspeaker that is eligible for penalizing, e.g., eligible for being turned down (reducing the relative speaker activation) or turned off.
In some examples, the loudspeaker importance metric of a particular loudspeaker may be based, at least in part, on the position of that particular loudspeaker relative to the position of one or more other loudspeakers. For example, if a loudspeaker is relatively close to another loudspeaker, the perceptual change caused by penalizing either of these closely-spaced loudspeakers may be less than the perceptual change caused by penalizing another loudspeaker that is not close to other loudspeakers in the audio environment.
According to some examples, the loudspeaker importance metric may be based, at least in part, on a distance between an eligible loudspeaker and a line between (a) a first loudspeaker having a shortest clockwise angular distance from the eligible loudspeaker and (b) a second loudspeaker having a shortest counterclockwise angular distance from the eligible loudspeaker. This distance may, in some examples, correspond to the loudspeaker importance metric α that is disclosed herein. As noted above, in some examples an “eligible” loudspeaker is a loudspeaker having a loudspeaker orientation angle that equals or exceeds a threshold loudspeaker orientation angle. In some examples, the first loudspeaker and the second loudspeaker may be ineligible loudspeakers having loudspeaker orientation angles that are less than the threshold loudspeaker orientation angle. These ineligible loudspeakers may be ineligible for penalizing, e.g., ineligible for being turned down (reducing the relative speaker activation) or turned off.
In some examples, the rendering of block 1425 may involve determining relative activations for at least the subset of the loudspeakers by optimizing a cost function. In some such examples, block 1425 may involve determining relative activations for at least the subset of the loudspeakers by optimizing a cost that is a function of: a model of perceived spatial position of an audio signal of the one or more audio signals when played back over the subset of loudspeakers in the audio environment; a measure of proximity of the intended perceived spatial position of the audio signal to a position of each loudspeaker of the subset of loudspeakers; and one or more additional dynamically configurable functions.
According to some examples, at least one of the one or more additional dynamically configurable functions may be based, at least in part, on the loudspeaker orientation factor. In some examples, at least one of the one or more additional dynamically configurable functions may be based, at least in part, on the loudspeaker importance metric. According to some examples, at least one of the one or more additional dynamically configurable functions may be based, at least in part, on a measurement or estimate of acoustic transmission from each loudspeaker in the audio environment to one or more other loudspeakers in the audio environment.
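To show how an orientation-based penalty of this kind might enter such an optimization, the sketch below adds a diagonal penalty matrix to a simple quadratic cost over speaker activations and solves for the activations. The quadratic form, the absence of constraints and all names are assumptions made for illustration; the disclosed renderer's actual spatial and proximity cost terms are not reproduced here.

```python
import numpy as np

def optimize_activations(A, B, penalty_weights):
    """Minimize g^T A g + B g + g^T diag(w) g over real speaker activations g,
    assuming A (and hence A + diag(w)) is symmetric positive definite.
    Larger penalty weights w_i reduce the activation of loudspeaker i."""
    W = np.diag(np.asarray(penalty_weights, dtype=float))
    # Setting the gradient 2 (A + W) g + B^T to zero gives the unconstrained minimizer.
    return np.linalg.solve(2.0 * (A + W), -np.asarray(B, dtype=float).ravel())
```

In this toy setting, increasing the penalty weight for a poorly oriented loudspeaker smoothly reduces its activation, and a very large weight effectively removes it from the rendering.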
As noted in the description of
Some such methods may involve receiving a direct indication by the user, e.g., using a smartphone or tablet apparatus to mark or indicate the approximate locations of audio devices on a floorplan or similar diagrammatic representation of the environment. Such digital interfaces are already commonplace in managing the configuration, grouping, name, purpose and identity of smart home devices. For example, such a direct indication may be provided via the Amazon Alexa smartphone application, the Sonos S2 controller application, or a similar application.
Some examples may involve solving the basic trilateration problem using the measured signal strength (sometimes called the Received Signal Strength Indication or RSSI) of common wireless communication technologies such as Bluetooth, Wi-Fi, ZigBee, etc., to produce estimates of physical distance between the audio devices, e.g., as disclosed in J. Yang and Y. Chen, “Indoor Localization Using Improved RSS-Based Lateration Methods,” GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, Honolulu, HI, 2009, pp. 1-6, doi: 10.1109/GLOCOM.2009.5425237, and/or as disclosed in Mardeni, R. & Othman, Shaifull Nizam (2010), “Node Positioning in ZigBee Network Using Trilateration Method Based on the Received Signal Strength Indicator (RSSI)” 46, both of which are hereby incorporated by reference.
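As one concrete illustration of the RSSI-to-distance step mentioned above, the log-distance path-loss model is commonly used for this purpose. The sketch below applies that model; the reference RSSI at one meter and the path-loss exponent are assumed values chosen for illustration, not values taken from the cited papers.

```python
def rssi_to_distance(rssi_dbm, rssi_at_1m_dbm=-50.0, path_loss_exponent=2.5):
    """Estimate distance (meters) from a measured RSSI using the log-distance
    path-loss model: RSSI(d) = RSSI(1 m) - 10 * n * log10(d)."""
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10.0 * path_loss_exponent))

# Example: an RSSI of -70 dBm with these assumed parameters maps to roughly 6.3 m.
print(round(rssi_to_distance(-70.0), 1))
```

Distances estimated in this way between several device pairs can then be fed to a standard trilateration or least-squares solver to recover approximate device positions.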
In U.S. Pat. No. 10,779,084, entitled “Automatic Discovery and Localization of Speaker Locations in Surround Sound Systems,” which is hereby incorporated by reference, a system is described which can automatically locate the positions of loudspeakers and microphones in a listening environment by acoustically measuring the time-of-arrival (TOA) between each speaker and microphone.
International Application Nos. PCT/US21/61506 and PCT/US21/61533, entitled “AUTOMATIC LOCALIZATION OF AUDIO DEVICES” (“the Automatic Localization applications”), which are hereby incorporated by reference, disclose methods, devices and systems for automatically determining the locations and orientations of audio devices.
Some such methods may involve receiving, by the control system, configuration parameters. In some examples, the configuration parameters may correspond to the audio environment and/or may correspond to one or more audio devices of the audio environment. Some such methods may involve minimizing, by the control system, a cost function based at least in part on the DOA data and the configuration parameters, to estimate a position and/or an orientation of at least the first smart audio device and the second smart audio device.
According to some examples, the DOA data also may correspond to sound received by one or more passive audio receivers of the audio environment. In some examples, each of the one or more passive audio receivers may include a microphone array but, in some instances, may lack an audio emitter. In some such examples, minimizing the cost function also may provide an estimated location and orientation of each of the one or more passive audio receivers.
In some examples, the DOA data also may correspond to sound emitted by one or more audio emitters of the audio environment. In some instances, each of the one or more audio emitters may include at least one sound-emitting transducer but may, in some instances, lack a microphone array. In some such examples, minimizing the cost function also may provide an estimated location of each of the one or more audio emitters.
In some implementations, the DOA data also may correspond to sound emitted by third through Nth smart audio devices of the audio environment, N corresponding to a total number of smart audio devices of the audio environment. In some examples, the DOA data also may correspond to sound received by each of the first through Nth smart audio devices from all other smart audio devices of the audio environment. In some such examples, minimizing the cost function may involve estimating a position and/or an orientation of the third through Nth smart audio devices.
According to some examples, the configuration parameters may include a number of audio devices in the audio environment, one or more dimensions of the audio environment, and/or one or more constraints on audio device location and/or orientation. In some instances, the configuration parameters may include disambiguation data for rotation, translation and/or scaling.
Some methods may involve receiving, by the control system, a seed layout for the cost function. The seed layout may, in some examples, specify a correct number of audio transmitters and receivers in the audio environment and an arbitrary location and orientation for each of the audio transmitters and receivers in the audio environment.
Some methods may involve receiving, by the control system, a weight factor associated with one or more elements of the DOA data. The weight factor may, for example, indicate the availability and/or reliability of the one or more elements of the DOA data.
Some methods may involve obtaining, by the control system, one or more elements of the DOA data using a beamforming method, a steered power response method, a time difference of arrival method, a structured signal method, or combinations thereof.
Some methods may involve receiving, by the control system, time of arrival (TOA) data corresponding to sound emitted by at least one audio device of the audio environment and received by at least one other audio device of the audio environment. In some such examples, the cost function may be based, at least in part, on the TOA data. Some such methods may involve estimating at least one playback latency and/or estimating at least one recording latency. In some examples, the cost function may operate with a rescaled position, a rescaled latency and/or a rescaled time of arrival.
According to some examples, the cost function may include a first term depending on the DOA data only. In some such examples, the cost function may include a second term depending on the TOA data only. In some such examples, the first term may include a first weight factor and the second term may include a second weight factor. In some instances, one or more TOA elements of the second term may have a TOA element weight factor indicating the availability and/or reliability of each of the one or more TOA elements.
In some examples, the configuration parameters may include playback latency data, recording latency data, data for disambiguating latency symmetry, disambiguation data for rotation, disambiguation data for translation, disambiguation data for scaling, and/or one or more combinations thereof.
Some other aspects of the present disclosure may be implemented via methods. Some such methods may involve device location. For example, some methods may involve localizing devices in an audio environment. Some such methods may involve obtaining, by a control system, direction of arrival (DOA) data corresponding to transmissions of at least a first transceiver of a first device of the environment. The first transceiver may, in some examples, include a first transmitter and a first receiver. In some instances, the DOA data may correspond to transmissions received by at least a second transceiver of a second device of the environment. In some examples, the second transceiver may include a second transmitter and a second receiver. In some instances, the DOA data may correspond to transmissions from at least the second transceiver received by at least the first transceiver.
In some examples, the first device and the second device may be audio devices and the environment may be an audio environment. According to some such examples, the first transmitter and the second transmitter may be audio transmitters. In some such examples, the first receiver and the second receiver may be audio receivers. In some implementations, the first transceiver and the second transceiver may be configured for transmitting and receiving electromagnetic waves.
Some such methods may involve receiving, by the control system, configuration parameters. In some instances, the configuration parameters may correspond to the environment, and/or may correspond to one or more devices of the environment. Some such methods may involve minimizing, by the control system, a cost function based at least in part on the DOA data and the configuration parameters, to estimate a position and/or an orientation of at least the first device and the second device.
In some examples, the DOA data also may correspond to transmissions received by one or more passive receivers of the environment. Each of the one or more passive receivers may, for example, include a receiver array but may lack a transmitter. In some such examples, minimizing the cost function also may provide an estimated location and/or orientation of each of the one or more passive receivers.
According to some examples, the DOA data also may correspond to transmissions from one or more transmitters of the environment. In some instances, each of the one or more transmitters may lack a receiver array. In some such examples, minimizing the cost function also may provide an estimated location of each of the one or more transmitters.
In some examples, the DOA data also may correspond to transmissions emitted by third through Nth transceivers of third through Nth devices of the environment, N corresponding to a total number of transceivers of the environment. In some such examples, the DOA data also may correspond to transmissions received by each of the first through Nth transceivers from all other transceivers of the environment. In some such examples, minimizing the cost function may involve estimating a position and/or an orientation of the third through Nth transceivers.
International Publication No. WO 2021/127286 A1, entitled “Audio Device Auto-Location,” which is hereby incorporated by reference, discloses methods for estimating audio device locations, listener positions and listener orientations in an audio environment. Some disclosed methods involve estimating audio device locations in an environment via direction of arrival (DOA) data and by determining interior angles for each of a plurality of triangles based on the DOA data. In some examples, each triangle has vertices that correspond with audio device locations. Some disclosed methods involve determining a side length for each side of each of the triangles and performing a forward alignment process of aligning each of the plurality of triangles to produce a forward alignment matrix. Some disclosed methods involve performing a reverse alignment process of aligning each of the plurality of triangles in a reverse sequence to produce a reverse alignment matrix. A final estimate of each audio device location may be based, at least in part, on values of the forward alignment matrix and values of the reverse alignment matrix.
Other disclosed methods of International Publication No. WO 2021/127286 A1 involve estimating a listener location and, in some instances, a listener orientation. Some such methods involve prompting the listener (e.g., via an audio prompt from one or more loudspeakers in the environment) to make one or more utterances and estimating the listener location according to DOA data. The DOA data may correspond to microphone data obtained by a plurality of microphones in the environment. The microphone data may correspond with detections of the one or more utterances by the microphones. At least some of the microphones may be co-located with loudspeakers. According to some examples, estimating a listener location may involve a triangulation process. Some such examples involve triangulating the user's voice by finding the point of intersection between DOA vectors passing through the audio devices. Some disclosed methods of determining a listener orientation involve prompting the user to identify one or more loudspeaker locations. Some such examples involve prompting the user to identify one or more loudspeaker locations by moving next to the loudspeaker location(s) and making an utterance. Other examples involve prompting the user to identify one or more loudspeaker locations by pointing to each of the one or more loudspeaker locations with a handheld device, such as a cellular telephone that includes an inertial sensor system and a wireless interface configured for communicating with a control system that is controlling the audio devices of the audio environment (such as a control system of an orchestrating device). Some disclosed methods involve determining a listener orientation by causing loudspeakers to render an audio object such that the audio object seems to rotate around the listener, and prompting the listener to make an utterance (such as “Stop!”) when the listener perceives the audio object to be in a particular location, such as a loudspeaker location, a television location, etc. Some disclosed methods involve determining a location and/or orientation of a listener via camera data, e.g., by determining a relative location of the listener and one or more audio devices of the audio environment according to the camera data, by determining an orientation of the listener relative to one or more audio devices of the audio environment according to the camera data (e.g., according to the direction that the listener is facing), etc.
In Shi, Guangji et al., Spatial Calibration of Surround Sound Systems including Listener Position Estimation (AES 137th Convention, October 2014), which is hereby incorporated by reference, a system is described in which a single linear microphone array associated with a component of the reproduction system whose location is predictable, such as a soundbar or a front center speaker, measures the time-difference-of-arrival (TDOA) for both satellite loudspeakers and a listener to locate the positions of both the loudspeakers and the listener. In this case, the listening orientation is inherently defined as the line connecting the detected listening position and the component of the reproduction system that includes the linear microphone array, such as a sound bar that is co-located with a television (placed directly above or below the television). Because the sound bar's location is predictably placed directly above or below the video screen, the geometry of the measured distance and incident angle can be translated to an absolute position relative to any point in front of that reference sound bar location using simple trigonometric principles. The distance between a loudspeaker and a microphone of the linear microphone array can be estimated by playing a test signal and measuring the time of flight (TOF) between the emitting loudspeaker and the receiving microphone. The time delay of the direct component of a measured impulse response can be used for this purpose. The impulse response between the loudspeaker and a microphone array element can be obtained by playing a test signal through the loudspeaker under analysis. For example, either a maximum length sequence (MLS) or a chirp signal (also known as a logarithmic sine sweep) can be used as the test signal. The room impulse response can be obtained by calculating the circular cross-correlation between the captured signal and the MLS input.
The location and orientation of a person in an audio environment may be determined or estimated by various methods, including but not limited to those described in the following paragraphs.
In Hess, Wolfgang, Head-Tracking Techniques for Virtual Acoustic Applications (AES 133rd Convention, October 2012), which is hereby incorporated by reference, numerous commercially available techniques for tracking both the position and orientation of a listener's head in the context of spatial audio reproduction systems are presented. One particular example discussed is the Microsoft Kinect. With its depth-sensing and standard cameras, along with publicly available software (the Windows Software Development Kit (SDK)), the positions and orientations of the heads of several listeners in a space can be simultaneously tracked using a combination of skeletal tracking and facial recognition. Although the Kinect for Windows has been discontinued, the Azure Kinect developer kit (DK), which implements the next generation of Microsoft's depth sensor, is currently available.
As noted above, U.S. Pat. No. 10,779,084 describes a system which can automatically locate the positions of loudspeakers and microphones in a listening environment by acoustically measuring the time-of-arrival (TOA) between each speaker and microphone. A listening position may be detected by placing and locating a microphone at a desired listening position (a microphone in a mobile phone held by the listener, for example), and an associated listening orientation may be defined by placing another microphone at a point in the viewing direction of the listener, e.g., at the TV. Alternatively, the listening orientation may be defined by locating a loudspeaker in the viewing direction, e.g., the loudspeakers on the TV.
As discussed above, International Publication No. WO 2021/127286 A1 also discloses methods for estimating listener locations and listener orientations, for example according to DOA data derived from detections of a listener's utterances, by prompting the listener to identify one or more loudspeaker locations, by rendering an audio object that seems to rotate around the listener, and/or via camera data. Likewise, Shi et al. (referenced above) describe estimating both the listener position and an associated listening orientation via a single linear microphone array associated with a soundbar or a front center speaker.
As noted elsewhere herein, in various disclosed examples one or more types of audio processing changes may be based on the optimization of a cost function. Some such examples involve flexible rendering.
Flexible rendering allows spatial audio to be rendered over an arbitrary number of arbitrarily placed speakers. In view of the widespread deployment of audio devices, including but not limited to smart audio devices (e.g., smart speakers) in the home, there is a need for realizing flexible rendering technology that allows consumer products to perform flexible rendering of audio, and playback of the so-rendered audio.
Several technologies have been developed to implement flexible rendering. They cast the rendering problem as one of cost function minimization, where the cost function consists of two terms: a first term that models the desired spatial impression that the renderer is trying to achieve, and a second term that assigns a cost to activating speakers. To date this second term has focused on creating a sparse solution where only speakers in close proximity to the desired spatial position of the audio being rendered are activated.
Playback of spatial audio in a consumer environment has typically been tied to a prescribed number of loudspeakers placed in prescribed positions: for example, 5.1 and 7.1 surround sound. In these cases, content is authored specifically for the associated loudspeakers and encoded as discrete channels, one for each loudspeaker (e.g., Dolby Digital, Dolby Digital Plus, etc.). More recently, immersive, object-based spatial audio formats (e.g., Dolby Atmos) have been introduced, which break this association between the content and specific loudspeaker locations. Instead, the content may be described as a collection of individual audio objects, each with possibly time-varying metadata describing the desired perceived location of said audio objects in three-dimensional space. At playback time, the content is transformed into loudspeaker feeds by a renderer which adapts to the number and location of loudspeakers in the playback system. Many such renderers, however, still constrain the locations of the set of loudspeakers to be one of a set of prescribed layouts (for example 3.1.2, 5.1.2, 7.1.4, 9.1.6, etc. with Dolby Atmos).
Moving beyond such constrained rendering, methods have been developed which allow object-based audio to be rendered flexibly over a truly arbitrary number of loudspeakers placed at arbitrary positions. These methods require that the renderer have knowledge of the number and physical locations of the loudspeakers in the listening space. For such a system to be practical for the average consumer, an automated method for locating the loudspeakers would be desirable. One such method relies on the use of a multitude of microphones, possibly co-located with the loudspeakers. By playing audio signals through the loudspeakers and recording with the microphones, the distance between each loudspeaker and microphone is estimated. From these distances the locations of both the loudspeakers and microphones are subsequently deduced.
Simultaneous with the introduction of object-based spatial audio in the consumer space has been the rapid adoption of so-called “smart speakers”, such as the Amazon Echo line of products. The tremendous popularity of these devices can be attributed to the simplicity and convenience afforded by wireless connectivity and an integrated voice interface (Amazon's Alexa, for example), but the sonic capabilities of these devices have generally been limited, particularly with respect to spatial audio. In most cases these devices are constrained to mono or stereo playback. However, combining the aforementioned flexible rendering and auto-location technologies with a plurality of orchestrated smart speakers may yield a system with very sophisticated spatial playback capabilities that still remains extremely simple for the consumer to set up. A consumer can place as many or as few of the speakers as desired, wherever is convenient, without the need to run speaker wires due to the wireless connectivity, and the built-in microphones can be used to automatically locate the speakers for the associated flexible renderer.
Conventional flexible rendering algorithms are designed to achieve a particular desired perceived spatial impression as closely as possible. In a system of orchestrated smart speakers, at times, maintenance of this spatial impression may not be the most important or desired objective. For example, if someone is simultaneously attempting to speak to an integrated voice assistant, it may be desirable to momentarily alter the spatial rendering in a manner that reduces the relative playback levels on speakers near certain microphones in order to increase the signal to noise ratio and/or the signal to echo ratio (SER) of microphone signals that include the detected speech. Some embodiments described herein may be implemented as modifications to existing flexible rendering methods, to allow such dynamic modification to spatial rendering, e.g., for the purpose of achieving one or more additional objectives.
Existing flexible rendering techniques include Center of Mass Amplitude Panning (CMAP) and Flexible Virtualization (FV). From a high level, both these techniques render a set of one or more audio signals, each with an associated desired perceived spatial position, for playback over a set of two or more speakers, where the relative activation of speakers of the set is a function of a model of perceived spatial position of said audio signals played back over the speakers and a proximity of the desired perceived spatial position of the audio signals to the positions of the speakers. The model ensures that the audio signal is heard by the listener near its intended spatial position, and the proximity term controls which speakers are used to achieve this spatial impression. In particular, the proximity term favors the activation of speakers that are near the desired perceived spatial position of the audio signal. For both CMAP and FV, this functional relationship is conveniently derived from a cost function written as the sum of two terms, one for the spatial aspect and one for proximity:
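One way to write such a cost, using g for the vector of speaker activations (a sketch of the form described, not necessarily the exact expression used), is:

C(g) = C_{\text{spatial}}(g, \vec{o}, \{\vec{s}_i\}) + C_{\text{proximity}}(g, \vec{o}, \{\vec{s}_i\})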
Here, the set {{right arrow over (s)}i} denotes the positions of a set of M loudspeakers, {right arrow over (o)} denotes the desired perceived spatial position of the audio signal, and g denotes an M dimensional vector of speaker activations. For CMAP, each activation in the vector represents a gain per speaker, while for FV each activation represents a filter (in this second case g can equivalently be considered a vector of complex values at a particular frequency and a different g is computed across a plurality of frequencies to form the filter). The optimal vector of activations is found by minimizing the cost function across activations:
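In this notation, the minimization may be written as:

g_{\text{opt}} = \arg\min_{g} \; C(g, \vec{o}, \{\vec{s}_i\})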
With certain definitions of the cost function, it is difficult to control the absolute level of the optimal activations resulting from the above minimization, though the relative level between the components of gopt is appropriate. To deal with this problem, a subsequent normalization of gopt may be performed so that the absolute level of the activations is controlled. For example, normalization of the vector to have unit length may be desirable, which is in line with commonly used constant-power panning rules:
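For example, such a unit-length normalization may be expressed as:

\bar{g}_{\text{opt}} = \frac{g_{\text{opt}}}{\lVert g_{\text{opt}} \rVert}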
The exact behavior of the flexible rendering algorithm is dictated by the particular construction of the two terms of the cost function, Cspatial and Cproximity. For CMAP, Cspatial is derived from a model that places the perceived spatial position of an audio signal playing from a set of loudspeakers at the center of mass of those loudspeakers' positions weighted by their associated activating gains gi (elements of the vector g):
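That is, the center-of-mass model just described may be written as:

\vec{o} = \frac{\sum_{i=1}^{M} g_i \, \vec{s}_i}{\sum_{i=1}^{M} g_i}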
Equation 10 is then manipulated into a spatial cost representing the squared error between the desired audio position and that produced by the activated loudspeakers:
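Consistent with that description, one form of the resulting spatial cost (obtained by multiplying the center-of-mass relation through by the sum of the gains and taking the squared error between the two sides) is:

C_{\text{spatial}}(g, \vec{o}, \{\vec{s}_i\}) = \Big\lVert \vec{o} \sum_{i=1}^{M} g_i \;-\; \sum_{i=1}^{M} g_i \, \vec{s}_i \Big\rVert^{2}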
With FV, the spatial term of the cost function is defined differently. There the goal is to produce a binaural response b corresponding to the audio object position {right arrow over (o)} at the left and right ears of the listener. Conceptually, b is a 2×1 vector of filters (one filter for each ear) but is more conveniently treated as a 2×1 vector of complex values at a particular frequency. Proceeding with this representation at a particular frequency, the desired binaural response may be retrieved from a set of HRTFs indexed by object position:
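Using this per-frequency representation, the desired binaural response may, for example, be written as a lookup of the left- and right-ear HRTFs at the object position:

b = \begin{bmatrix} b_{l} \\ b_{r} \end{bmatrix} = \mathrm{HRTF}\{\vec{o}\}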
At the same time, the 2×1 binaural response e produced at the listener's ears by the loudspeakers is modelled as a 2×M acoustic transmission matrix H multiplied with the M×1 vector g of complex speaker activation values:
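In symbols:

e = H \, g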
The acoustic transmission matrix H is modelled based on the set of loudspeaker positions {sj} with respect to the listener position. Finally, the spatial component of the cost function is defined as the squared error between the desired binaural response (Equation 12) and that produced by the loudspeakers (Equation 13):
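Consistent with that description, the FV spatial cost may be written as:

C_{\text{spatial}}(g, \vec{o}, \{\vec{s}_i\}) = \lVert b - H g \rVert^{2}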
Conveniently, the spatial term of the cost function for CMAP and FV defined in Equations 11 and 14 can both be rearranged into a matrix quadratic as a function of speaker activations g:
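In other words, both spatial terms may be arranged into a form such as (with * denoting conjugate transpose):

C_{\text{spatial}}(g) = g^{*} A \, g + B \, g + C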
where A is an M×M square matrix, B is a 1×M vector, and C is a scalar. The matrix A is of rank 2, and therefore when M>2 there exists an infinite number of speaker activations g for which the spatial error term equals zero. Introducing the second term of the cost function, Cproximity, removes this indeterminacy and results in a particular solution with perceptually beneficial properties in comparison to the other possible solutions. For both CMAP and FV, Cproximity is constructed such that activation of speakers whose position {right arrow over (s)}i is distant from the desired audio signal position {right arrow over (o)} is penalized more than activation of speakers whose position is close to the desired position. This construction yields an optimal set of speaker activations that is sparse, where only speakers in close proximity to the desired audio signal's position are significantly activated, and practically results in a spatial reproduction of the audio signal that is perceptually more robust to listener movement around the set of speakers.
To this end, the second term of the cost function, Cproximity, may be defined as a distance-weighted sum of the absolute values squared of speaker activations. This is represented compactly in matrix form as:
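For example:

C_{\text{proximity}}(g, \vec{o}, \{\vec{s}_i\}) = g^{*} D \, g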
where D is a diagonal matrix of distance penalties between the desired audio position and each speaker:
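That is, D may be written as:

D = \operatorname{diag}\big(d(\vec{o}, \vec{s}_1), \ldots, d(\vec{o}, \vec{s}_M)\big)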
The distance penalty function can take on many forms, but the following is a useful parameterization
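One such parameterization, consistent with the description that follows (the precise functional form here is an assumption), is a power law that switches on around the distance d0:

d(\vec{o}, \vec{s}_i) = \alpha \left( \frac{\lVert \vec{o} - \vec{s}_i \rVert}{d_0} \right)^{\beta}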
where ∥{right arrow over (o)}−{right arrow over (s)}i∥ is the Euclidean distance between the desired audio position and the speaker position, and α, β, and d0 are tunable parameters. The parameter α indicates the global strength of the penalty; d0 corresponds to the spatial extent of the distance penalty (loudspeakers at a distance of around d0 or further away will be penalized), and β accounts for the abruptness of the onset of the penalty at distance d0.
Combining the two terms of the cost function defined in Equations 15 and 16a yields the overall cost function
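For example:

C(g) = g^{*} A \, g + B \, g + C + g^{*} D \, g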
Setting the derivative of this cost function with respect to g equal to zero and solving for g yields the optimal speaker activation solution:
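Up to the convention used for the (possibly complex) derivative and the row/column orientation of B, this closed-form solution takes a form such as:

g_{\text{opt}} = -\tfrac{1}{2} \, (A + D)^{-1} B^{*}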
In general, the optimal solution in Equation 18 may yield speaker activations that are negative in value. For the CMAP construction of the flexible renderer, such negative activations may not be desirable, and thus Equation 18 may be minimized subject to all activations remaining positive.
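To make the above concrete, the following is a minimal numerical sketch of a CMAP-style solve with a distance penalty. It is illustrative only: for simplicity the overall level is handled here with a sum-to-one constraint plus a final unit-norm normalization (rather than the unconstrained closed form above), the non-negativity requirement is enforced by clipping, and all parameter values are arbitrary assumptions.

import numpy as np

def cmap_gains(obj_pos, spk_pos, alpha=20.0, beta=3.0, d0=1.0, ridge=1e-6):
    # Illustrative CMAP-style solve: the spatial term ||o*sum(g) - sum(g_i*s_i)||^2
    # expands to g^T A g with A_ij = (s_i - o).(s_j - o); the proximity term is
    # g^T D g with a power-law distance penalty. The level is fixed with a
    # sum-to-one constraint, then negative gains are clipped and the result is
    # renormalized to unit power.
    obj_pos = np.asarray(obj_pos, dtype=float)
    spk_pos = np.asarray(spk_pos, dtype=float)
    diff = spk_pos - obj_pos                         # row i holds s_i - o
    A = diff @ diff.T                                # spatial quadratic term
    dist = np.linalg.norm(diff, axis=1)
    D = np.diag(alpha * (dist / d0) ** beta)         # proximity (distance) penalty
    Q = A + D + ridge * np.eye(len(dist))            # small ridge for stability
    ones = np.ones(len(dist))
    g = np.linalg.solve(Q, ones)                     # minimizes g^T Q g with sum(g)=1
    g = np.clip(g / (ones @ g), 0.0, None)           # keep gains non-negative
    return g / np.linalg.norm(g)                     # constant-power normalization

# Example: four speakers at the corners of a 4 m x 4 m room; an object panned
# toward the front-right corner mostly activates the nearby speaker.
speakers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
print(np.round(cmap_gains([3.5, 0.5], speakers), 3))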
A class of embodiments involves methods for rendering audio for playback by at least one (e.g., all or some) of a plurality of coordinated (orchestrated) smart audio devices. For example, a set of smart audio devices present (in a system) in a user's home may be orchestrated to handle a variety of simultaneous use cases, including flexible rendering (in accordance with an embodiment) of audio for playback by all or some (i.e., by speaker(s) of all or some) of the smart audio devices. Many interactions with the system are contemplated which require dynamic modifications to the rendering. Such modifications may be, but are not necessarily, focused on spatial fidelity.
Some embodiments are methods for rendering of audio for playback by at least one (e.g., all or some) of the smart audio devices of a set of smart audio devices (or for playback by at least one (e.g., all or some) of the speakers of another set of speakers). The rendering may include minimization of a cost function, where the cost function includes at least one dynamic speaker activation term. Examples of such a dynamic speaker activation term include, but are not limited to, terms based on the proximity of loudspeakers to one or more listeners or talkers, to an attracting force position or to a repelling force position, as described below.
The dynamic speaker activation term(s) may enable at least one of a variety of behaviors, including warping the spatial presentation of the audio away from a particular smart audio device so that its microphone can better hear a talker or so that a secondary audio stream may be better heard from speaker(s) of the smart audio device.
Some embodiments implement rendering for playback by speaker(s) of a plurality of smart audio devices that are coordinated (orchestrated). Other embodiments implement rendering for playback by speaker(s) of another set of speakers.
Pairing flexible rendering methods (implemented in accordance with some embodiments) with a set of wireless smart speakers (or other smart audio devices) can yield an extremely capable and easy-to-use spatial audio rendering system. In contemplating interactions with such a system it becomes evident that dynamic modifications to the spatial rendering may be desirable in order to optimize for other objectives that may arise during the system's use. To achieve this goal, a class of embodiments augment existing flexible rendering algorithms (in which speaker activation is a function of the previously disclosed spatial and proximity terms), with one or more additional dynamically configurable functions dependent on one or more properties of the audio signals being rendered, the set of speakers, and/or other external inputs. In accordance with some embodiments, the cost function of the existing flexible rendering given in Equation 1 is augmented with these one or more additional dependencies according to
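In the notation used below, the augmented cost takes a form such as (a sketch consistent with the description in the following paragraphs):

C(g) = C_{\text{spatial}}(g, \vec{o}, \{\vec{s}_i\}) + C_{\text{proximity}}(g, \vec{o}, \{\vec{s}_i\}) + \sum_{j} C_{j}\big(g, \{\{\hat{o}\}, \{\hat{s}_i\}, \{\hat{e}\}\}_{j}\big)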
Equation 19 corresponds with Equation 1, above. Accordingly, the preceding discussion explains the derivation of Equation 1 as well as that of Equation 19.
In Equation 19, the terms Cj(g, {{ô}, {ŝi}, {ê}}j) represent additional cost terms, with {ô} representing a set of one or more properties of the audio signals (e.g., of an object-based audio program) being rendered, {ŝi} representing a set of one or more properties of the speakers over which the audio is being rendered, and {ê} representing one or more additional external inputs. Each term Cj(g, {{ô}, {ŝi}, {ê}}j) returns a cost as a function of activations g in relation to a combination of one or more properties of the audio signals, speakers, and/or external inputs, represented generically by the set {{ô}, {ŝi}, {ê}}j. It should be appreciated that the set {{ô}, {ŝi}, {ê}}j contains, at a minimum, one element from any of {ô}, {ŝi}, or {ê}.
With the new cost function defined in Equation 28, an optimal set of activations may be found through minimization with respect to g and possible post-normalization as previously specified in Equations 28a and 28b.
In this implementation, block 1705 involves receiving, by a control system and via an interface system, audio data. In this example, the audio data includes one or more audio signals and associated spatial data. According to this implementation, the spatial data indicates an intended perceived spatial position corresponding to an audio signal. In some instances, the intended perceived spatial position may be explicit, e.g., as indicated by positional metadata such as Dolby Atmos positional metadata. In other instances, the intended perceived spatial position may be implicit, e.g., the intended perceived spatial position may be an assumed location associated with a channel according to Dolby 5.1, Dolby 7.1, or another channel-based audio format. In some examples, block 1705 involves a rendering module of a control system receiving, via an interface system, the audio data.
According to this example, block 1710 involves rendering, by the control system, the audio data for reproduction via a set of loudspeakers of an environment, to produce rendered audio signals. In this example, rendering each of the one or more audio signals included in the audio data involves determining relative activation of a set of loudspeakers in an environment by optimizing a cost function. According to this example, the cost is a function of a model of perceived spatial position of the audio signal when played back over the set of loudspeakers in the environment. In this example, the cost is also a function of a measure of proximity of the intended perceived spatial position of the audio signal to a position of each loudspeaker of the set of loudspeakers. In this implementation, the cost is also a function of one or more additional dynamically configurable functions. In this example, the dynamically configurable functions are based on one or more of the following: proximity of loudspeakers to one or more listeners; proximity of loudspeakers to an attracting force position, wherein an attracting force is a factor that favors relatively higher loudspeaker activation in closer proximity to the attracting force position; proximity of loudspeakers to a repelling force position, wherein a repelling force is a factor that favors relatively lower loudspeaker activation in closer proximity to the repelling force position; capabilities of each loudspeaker relative to other loudspeakers in the environment; synchronization of the loudspeakers with respect to other loudspeakers; wakeword performance; or echo canceller performance.
In this example, block 1715 involves providing, via the interface system, the rendered audio signals to at least some loudspeakers of the set of loudspeakers of the environment.
According to some examples, the model of perceived spatial position may produce a binaural response corresponding to an audio object position at the left and right ears of a listener. Alternatively, or additionally, the model of perceived spatial position may place the perceived spatial position of an audio signal playing from a set of loudspeakers at a center of mass of the set of loudspeakers' positions weighted by the loudspeakers' associated activating gains.
In some examples, the one or more additional dynamically configurable functions may be based, at least in part, on a level of the one or more audio signals. In some instances, the one or more additional dynamically configurable functions may be based, at least in part, on a spectrum of the one or more audio signals.
Some examples of the method 1700 involve receiving loudspeaker layout information. In some examples, the one or more additional dynamically configurable functions may be based, at least in part, on a location of each of the loudspeakers in the environment.
Some examples of the method 1700 involve receiving loudspeaker specification information. In some examples, the one or more additional dynamically configurable functions may be based, at least in part, on the capabilities of each loudspeaker, which may include one or more of frequency response, playback level limits or parameters of one or more loudspeaker dynamics processing algorithms.
According to some examples, the one or more additional dynamically configurable functions may be based, at least in part, on a measurement or estimate of acoustic transmission from each loudspeaker to the other loudspeakers. Alternatively, or additionally, the one or more additional dynamically configurable functions may be based, at least in part, on a listener or speaker location of one or more people in the environment. Alternatively, or additionally, the one or more additional dynamically configurable functions may be based, at least in part, on a measurement or estimate of acoustic transmission from each loudspeaker to the listener or speaker location. An estimate of acoustic transmission may, for example, be based at least in part on walls, furniture or other objects that may reside between each loudspeaker and the listener or speaker location.
Alternatively, or additionally, the one or more additional dynamically configurable functions may be based, at least in part, on an object location of one or more non-loudspeaker objects or landmarks in the environment. In some such implementations, the one or more additional dynamically configurable functions may be based, at least in part, on a measurement or estimate of acoustic transmission from each loudspeaker to the object location or landmark location.
Numerous new and useful behaviors may be achieved by employing one or more appropriately defined additional cost terms to implement flexible rendering. All example behaviors listed below are cast in terms of penalizing certain loudspeakers under certain conditions deemed undesirable. The end result is that these loudspeakers are activated less in the spatial rendering of the set of audio signals. In many of these cases, one might contemplate simply turning down the undesirable loudspeakers independently of any modification to the spatial rendering, but such a strategy may significantly degrade the overall balance of the audio content. Certain components of the mix may become completely inaudible, for example. With the disclosed embodiments, on the other hand, integration of these penalizations into the core optimization of the rendering allows the rendering to adapt and perform the best possible spatial rendering with the remaining less-penalized speakers. This is a much more elegant, adaptable, and effective solution.
We next describe additional examples of embodiments. Similar to the proximity cost defined in Equations 25a and 25b, it may also be convenient to express each of the new cost function terms Cj(g, {{ô}, {ŝi}, {ê}}j) as a weighted sum of the absolute values squared of speaker activations, e.g. as follows:
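That is, each additional term may be expressed, for example, as:

C_{j}\big(g, \{\{\hat{o}\}, \{\hat{s}_i\}, \{\hat{e}\}\}_{j}\big) = \sum_{i=1}^{M} w_{ij} \, \lvert g_i \rvert^{2} = g^{*} W_{j} \, g, \qquad W_{j} = \operatorname{diag}(w_{1j}, \ldots, w_{Mj})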
Equation 20b corresponds with Equation 3, above.
Combining Equations 20a and 20b with the matrix quadratic version of the CMAP and FV cost functions given in Equation 15 yields a potentially beneficial implementation of the general expanded cost function (of some embodiments) given in Equation 19:
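One way to write the resulting expanded cost is:

C(g) = g^{*} A \, g + B \, g + C + g^{*}\Big(D + \sum_{j} W_{j}\Big) g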
Equation 21 corresponds with Equation 2, above. Accordingly, the preceding discussion explains the derivation of Equation 2 as well as that of Equation 21.
With this definition of the new cost function terms, the overall cost function remains a matrix quadratic, and the optimal set of activations gopt can be found through differentiation of Equation 21 to yield
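Up to the same derivative conventions noted earlier, this solution takes a form such as:

g_{\text{opt}} = -\tfrac{1}{2} \, \Big(A + D + \sum_{j} W_{j}\Big)^{-1} B^{*}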
It is useful to consider each one of the weight terms wij as functions of a given continuous penalty value pij=pij({{ô}, {ŝi}, {ê}}j) for each one of the loudspeakers. In one example embodiment, this penalty value is the distance from the object (to be rendered) to the loudspeaker considered. In another example embodiment, this penalty value represents the inability of the given loudspeaker to reproduce some frequencies. Based on this penalty value, the weight terms wij can be parametrized as:
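One simple parameterization consistent with this description (treated here as an assumption rather than as the exact form used) normalizes the penalty by a threshold value τj, scales it by a global strength αj, and raises it to an abruptness exponent βj:

w_{ij} = \alpha_{j} \left( \frac{p_{ij}}{\tau_{j}} \right)^{\beta_{j}}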
In case all loudspeakers are penalized, it is often convenient to subtract the minimum penalty from all weight terms in post-processing so that at least one of the speakers is not penalized:
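In symbols, for example:

w_{ij} \leftarrow w_{ij} - \min_{i'} w_{i'j}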
As stated above, there are many possible use cases that can be realized using the new cost function terms described herein (and similar new cost function terms employed in accordance with other embodiments). Next, we describe more concrete details with three examples: moving audio towards a listener or talker, moving audio away from a listener or talker, and moving audio away from a landmark.
In the first example, what will be referred to herein as an “attracting force” is used to pull audio towards a position, which in some examples may be the position of a listener or a talker, a landmark position, a furniture position, etc. The position may be referred to herein as an “attracting force position” or an “attractor location.” As used herein, an “attracting force” is a factor that favors relatively higher loudspeaker activation in closer proximity to an attracting force position. According to this example, the weight wij takes the form of equation 17 with the continuous penalty value pij given by the distance of the ith speaker from a fixed attractor location {right arrow over (l)}j and the threshold value τj given by the maximum of these distances across all speakers:
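Following that description:

p_{ij} = \lVert \vec{l}_{j} - \vec{s}_{i} \rVert, \qquad \tau_{j} = \max_{i} \lVert \vec{l}_{j} - \vec{s}_{i} \rVert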
To illustrate the use case of “pulling” audio towards a listener or talker, we specifically set αj=20, βj=3, and {right arrow over (l)}j to a vector corresponding to a listener/talker position of 180 degrees (bottom, center of the plot). These values of αj, βj, and {right arrow over (l)}j are merely examples. In some implementations, αj may be in the range of 1 to 100 and βj may be in the range of 1 to 25.
In the second and third examples, a “repelling force” is used to “push” audio away from a position, which may be a person's position (e.g., a listener position, a talker position, etc.) or another position, such as a landmark position, a furniture position, etc. In some examples, a repelling force may be used to push audio away from an area or zone of a listening environment, such as an office area, a reading area, a bed or bedroom area (e.g., a baby's bed or bedroom), etc. According to some such examples, a particular position may be used as representative of a zone or area. For example, a position that represents a baby's bed may be an estimated position of the baby's head, an estimated sound source location corresponding to the baby, etc. The position may be referred to herein as a “repelling force position” or a “repelling location.” As used herein, a “repelling force” is a factor that favors relatively lower loudspeaker activation in closer proximity to the repelling force position. According to this example, we define pij and τj with respect to a fixed repelling location {right arrow over (l)}j similarly to the attracting force in Equations 26a and 26b:
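One way to realize this (an assumption consistent with the description, not necessarily the exact definition used) is to invert the distance ordering of the attracting-force penalty, so that the speakers closest to the repelling location receive the largest penalty:

p_{ij} = \max_{i} \lVert \vec{l}_{j} - \vec{s}_{i} \rVert - \lVert \vec{l}_{j} - \vec{s}_{i} \rVert, \qquad \tau_{j} = \max_{i} \lVert \vec{l}_{j} - \vec{s}_{i} \rVert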
To illustrate the use case of pushing audio away from a listener or talker, in one example we may specifically set αj=5, βj=2, and {right arrow over (l)}j to a vector corresponding to a listener/talker position of 180 degrees (at the bottom, center of the plot). These values of αj, βj, and {right arrow over (l)}j are merely examples. As noted above, in some examples αj may be in the range of 1 to 100 and βj may be in the range of 1 to 25.
The third example use case is “pushing” audio away from a landmark which is acoustically sensitive, such as a door to a sleeping baby's room. Similarly to the last example, we set {right arrow over (l)}j to a vector corresponding to a door position of 180 degrees (bottom, center of the plot). To achieve a stronger repelling force and skew the soundfield entirely into the front part of the primary listening space, we set αj=20, βj=5.
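As a small numerical illustration of how such attracting and repelling terms might be combined, the following sketch uses the assumed power-law weight form discussed above; the positions and the αj, βj values are arbitrary assumptions.

import numpy as np

def force_weights(spk_pos, loc, alpha, beta, repel=False):
    # Per-speaker weights w_ij for one attracting or repelling force term, using
    # the assumed power-law form w = alpha * (p / tau) ** beta sketched above.
    # For an attracting force the penalty grows with distance from loc; for a
    # repelling force the ordering is inverted so nearby speakers are penalized.
    spk_pos = np.asarray(spk_pos, dtype=float)
    dist = np.linalg.norm(spk_pos - np.asarray(loc, dtype=float), axis=1)
    tau = dist.max()
    p = tau - dist if repel else dist
    w = alpha * (p / tau) ** beta
    return w - w.min()          # leave at least one speaker unpenalized

# Example: pull audio toward a talker at the front wall while pushing it away
# from a nursery door in the back-left corner (positions are illustrative).
speakers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
w_attract = force_weights(speakers, loc=[2.0, 0.0], alpha=20.0, beta=3.0)
w_repel = force_weights(speakers, loc=[0.0, 4.0], alpha=20.0, beta=5.0, repel=True)
W_total = np.diag(w_attract + w_repel)   # added to D in the matrix quadratic cost
print(np.round(w_attract, 2), np.round(w_repel, 2))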
Aspects of some disclosed implementations include a system or device configured (e.g., programmed) to perform one or more disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more disclosed methods or steps thereof. For example, the system can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including one or more disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more disclosed methods (or steps thereof) in response to data asserted thereto.
Some disclosed embodiments are implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more disclosed methods. Alternatively, some embodiments (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more disclosed methods or steps thereof. Alternatively, elements of some disclosed embodiments are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more disclosed methods or steps thereof, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more disclosed methods or steps thereof would typically be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
Another aspect of some disclosed implementations is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., coder executable to perform) any embodiment of one or more disclosed methods or steps thereof.
While specific embodiments and applications have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the material described and claimed herein. It should be understood that while certain implementations have been shown and described, the present disclosure is not to be limited to the specific embodiments described and shown or the specific methods described.
This application claims priority to U.S. provisional application 63/277,225, filed Nov. 9, 2021, U.S. provisional application 63/364,322, filed May 6, 2022, and EP application 22172447.9, filed May 10, 2022, each application of which is incorporated herein by reference in its entirety.