This disclosure pertains to devices, systems and methods for implementing echo management.
Audio devices having acoustic echo management systems are widely deployed. An acoustic echo management system may include an acoustic echo canceller and/or an acoustic echo suppressor. Although existing devices, systems and methods for acoustic echo management provide benefits, improved devices, systems and methods would be desirable.
Throughout this disclosure, including in the claims, the terms “speaker,” “loudspeaker” and “audio reproduction transducer” are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure, including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X−M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
Throughout this disclosure, including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
As used herein, a “smart device” is an electronic device, generally configured for communication with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, etc., that can operate to some extent interactively and/or autonomously. Several notable types of smart devices are smartphones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains and smart audio devices. The term “smart device” may also refer to a device that exhibits some properties of ubiquitous computing, such as artificial intelligence.
Herein, we use the expression “smart audio device” to denote a smart device which is either a single-purpose audio device or a multi-purpose audio device (e.g., an audio device that implements at least some aspects of virtual assistant functionality). A single-purpose audio device is a device (e.g., a television (TV)) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera), and which is designed largely or primarily to achieve a single purpose. For example, although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run locally, including the application of watching television. In this sense, a single-purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single-purpose audio devices may be configured to group together to achieve playing of audio over a zone or user configured area.
One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured for communication. Such a multi-purpose audio device may be referred to herein as a “virtual assistant.” A virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera). In some examples, a virtual assistant may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud-enabled or otherwise not completely implemented in or on the virtual assistant itself. In other words, at least some aspects of virtual assistant functionality, e.g., speech recognition functionality, may be implemented (at least in part) by one or more servers or other devices with which a virtual assistant may communicate via a network, such as the Internet. Virtual assistants may sometimes work together, e.g., in a discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, e.g., the one which is most confident that it has heard a wakeword, responds to the wakeword. The connected virtual assistants may, in some implementations, form a sort of constellation, which may be managed by one main application which may be (or implement) a virtual assistant.
Herein, “wakeword” is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to “awake” denotes that the device enters a state in which it awaits (in other words, is listening for) a sound command. In some instances, what may be referred to herein as a “wakeword” may include more than one word, e.g., a phrase.
Herein, the expression “wakeword detector” denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold which is tuned to give a reasonable compromise between rates of false acceptance and false rejection. Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
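By way of illustration only, the thresholding behavior described above may be sketched as follows. The function names, the example probabilities and the threshold value of 0.8 are hypothetical and are not taken from any particular wakeword detector; the second function merely shows how a candidate threshold might be evaluated against labeled frames when tuning the compromise between false acceptance and false rejection.

```python
def wakeword_events(probabilities, threshold=0.8):
    """Return the indices of frames whose wakeword probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


def error_rates(probabilities, labels, threshold):
    """Return (false acceptance rate, false rejection rate) for a candidate threshold.

    `labels` holds True for frames that actually contain the wakeword.
    """
    false_accepts = sum(1 for p, y in zip(probabilities, labels) if p > threshold and not y)
    false_rejects = sum(1 for p, y in zip(probabilities, labels) if p <= threshold and y)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    return false_accepts / max(negatives, 1), false_rejects / max(positives, 1)


# Hypothetical per-frame detector outputs and ground-truth labels.
frame_probs = [0.05, 0.10, 0.92, 0.40, 0.85]
frame_labels = [False, False, True, True, False]

events = wakeword_events(frame_probs, threshold=0.8)
far, frr = error_rates(frame_probs, frame_labels, threshold=0.8)
```

Raising the threshold in such a sketch would generally lower the false acceptance rate at the expense of the false rejection rate, and vice versa.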
As used herein, the terms “program stream” and “content stream” refer to a collection of one or more audio signals, and in some instances video signals, at least portions of which are meant to be heard together. Examples include a selection of music, a movie soundtrack, a movie, a television program, the audio portion of a television program, a podcast, a live voice call, a synthesized voice response from a smart assistant, etc. In some instances, the content stream may include multiple versions of at least a portion of the audio signals, e.g., the same dialogue in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at one time.
At least some aspects of the present disclosure may be implemented via one or more audio processing methods. In some instances, the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some such methods may involve receiving, by a control system, location information for each of a plurality of audio devices in an audio environment. Some such methods may involve generating, by the control system and based at least in part on the location information, rendering information for a plurality of audio devices in an audio environment. Some such methods may involve determining, by the control system and based at least in part on the rendering information, a plurality of echo reference metrics. In some examples, each echo reference metric of the plurality of echo reference metrics may correspond to audio data reproduced by one or more audio devices of the plurality of audio devices.
According to some examples, the rendering information may be, or may include, a matrix of loudspeaker activations. In some examples, at least one echo reference metric may correspond to a level of a corresponding echo reference, a uniqueness of the corresponding echo reference, a temporal persistence of the corresponding echo reference, an audibility of the corresponding echo reference, or one or more combinations thereof.
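One possible way of deriving level and uniqueness metrics from a matrix of loudspeaker activations is sketched below. This is a minimal illustration under stated assumptions, not the disclosed method: each row of the hypothetical `activations` matrix is assumed to hold the gains applied to the input channels for one loudspeaker, and the input channels are assumed to be mutually uncorrelated with unit power, so that the level of a speaker feed is the sum of its squared gains and the similarity of two feeds is the cosine similarity of their rows.

```python
import math


def level_db(activation_row, floor=1e-12):
    """Level of one speaker feed in dB, assuming uncorrelated unit-power input channels."""
    power = sum(g * g for g in activation_row)
    return 10.0 * math.log10(max(power, floor))


def cosine_similarity(a, b):
    """Cosine similarity of two gain rows (0.0 if either row is all zeros)."""
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / den if den > 0.0 else 0.0


def uniqueness(index, activations):
    """1 minus the largest absolute similarity between this row and any other row."""
    others = [abs(cosine_similarity(activations[index], row))
              for j, row in enumerate(activations) if j != index]
    return 1.0 - max(others) if others else 1.0


# Hypothetical activations: three loudspeakers (rows), two input channels (columns).
activations = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 0.5],
]
metrics = [{"level_db": level_db(row), "uniqueness": uniqueness(i, activations)}
           for i, row in enumerate(activations)]
```

In this sketch the first two loudspeakers reproduce identical content and therefore receive a uniqueness of zero, whereas the third loudspeaker is quieter but fully unique.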
In some examples, the method also may involve receiving, by the control system, a content stream that includes audio data and corresponding metadata. According to some such examples, determining the at least one echo reference metric may be based, at least in part, on one or more of loudspeaker metadata, metadata corresponding to received audio data or an upmixing matrix.
In some implementations, the control system may be, or may include, an audio device control system. According to some such implementations, the method may involve making, by the control system and based at least in part on the echo reference metrics, an importance estimation for each echo reference of a plurality of echo references. In some such implementations, making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment. The at least one echo management system may be, or may include, an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES. In some such implementations, the method may involve selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references. In some such implementations, the method may involve providing, by the control system, the one or more selected echo references to the at least one echo management system.
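The selection step described above may, for example, be sketched as a simple ranking. The reference names, the importance values and the `max_references` limit below are hypothetical placeholders; an actual implementation may select echo references according to other criteria.

```python
def select_echo_references(importance, max_references):
    """Return the names of the most important echo references, highest importance first.

    `importance` maps a reference name to its estimated importance.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:max_references]


# Hypothetical importance estimates for one local and three non-local references.
importance = {
    "local": 0.9,
    "device_B": 0.2,
    "device_C": 0.6,
    "device_D": 0.1,
}
selected = select_echo_references(importance, max_references=2)
```

The selected references could then be provided to the echo management system, with the remaining references being withheld.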
In some examples, the method also may involve causing the at least one echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references. According to some examples, making the importance estimation may involve determining an importance metric for a corresponding echo reference. In some examples, determining the importance metric may be based, at least in part, on a current listening objective, a current ambient noise estimate, or both a current listening objective and a current ambient noise estimate.
According to some examples, the method also may involve making, by the control system, a cost determination. In some examples, the cost determination may involve determining a cost for at least one echo reference of the plurality of echo references. In some such examples, selecting the one or more selected echo references may be based, at least in part, on the cost determination. In some examples, the cost determination may be based on the network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the at least one echo management system, or one or more combinations thereof.
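A cost determination of this kind may, as one illustrative possibility, combine the listed requirements into a single scalar. In the sketch below the `ReferenceCost` fields, their units and the two weights are assumptions made for the example only; the disclosure does not prescribe a particular cost function.

```python
from dataclasses import dataclass


@dataclass
class ReferenceCost:
    """Hypothetical per-reference resource requirements."""
    bandwidth_kbps: float  # network bandwidth for transmitting the reference
    encode_mips: float     # computation for encoding the reference
    decode_mips: float     # computation for decoding the reference
    ems_mips: float        # computation for using the reference in the EMS


def total_cost(cost, w_bandwidth=1.0, w_compute=1.0):
    """Weighted sum of the bandwidth and computational requirements."""
    compute = cost.encode_mips + cost.decode_mips + cost.ems_mips
    return w_bandwidth * cost.bandwidth_kbps + w_compute * compute


# A non-local reference must be encoded, transmitted and decoded;
# a local reference incurs only the echo management system cost.
remote_cost = total_cost(ReferenceCost(64.0, 2.0, 1.0, 5.0))
local_cost = total_cost(ReferenceCost(0.0, 0.0, 0.0, 5.0))
```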
In some examples, the method also may involve determining a current echo management system performance level. According to some such examples, the importance estimation may be based, at least in part, on the current echo management system performance level.
According to some examples, the method also may involve receiving, by the control system, scene change metadata. In some examples, the importance estimation may be based, at least in part, on the scene change metadata.
In some examples, the method also may involve rendering the audio data, based at least in part on the rendering information, to produce rendered audio data. According to some implementations, the control system may be, or may include, an orchestrating device control system. In some such implementations, the method also may involve providing at least a portion of the rendered audio data to each audio device of the plurality of audio devices.
In some examples, the method also may involve providing at least one echo reference metric to each audio device of the plurality of audio devices.
According to some examples, the method also may involve generating, by the control system, at least one virtual echo reference corresponding to two or more audio devices of the plurality of audio devices.
In some examples, the method also may involve determining, by the control system, a weighted summation of echo references over a range of low frequencies. According to some such examples, the method may involve providing the weighted summation to at least one echo management system.
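The following sketch shows one way in which such a weighted summation might be formed in the frequency domain. It assumes, purely for illustration, that each echo reference is available as a list of complex frequency bins spaced `bin_hz` apart, and that `cutoff_hz` and the weights are supplied by the control system; none of these names or values comes from the disclosure.

```python
def low_frequency_weighted_sum(spectra, weights, bin_hz, cutoff_hz):
    """Weighted sum of the reference spectra over bins at or below cutoff_hz.

    `spectra` holds one list of complex bins per echo reference; bin k is
    assumed to be centered at k * bin_hz.
    """
    num_low_bins = sum(1 for k in range(len(spectra[0])) if k * bin_hz <= cutoff_hz)
    return [sum(w * s[k] for w, s in zip(weights, spectra))
            for k in range(num_low_bins)]


# Two hypothetical reference spectra with four bins each, 100 Hz apart.
spectra = [
    [1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j],
    [1j, 1j, 1j, 1j],
]
combined = low_frequency_weighted_sum(spectra, weights=[0.5, 0.5],
                                      bin_hz=100.0, cutoff_hz=150.0)
```

In this example only the bins at 0 Hz and 100 Hz fall within the low-frequency range, so the combined reference contains two bins.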
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus is, or includes, an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
According to some alternative implementations the apparatus 50 may be, or may include, a server. In some such examples, the apparatus 50 may be, or may include, an encoder. Accordingly, in some instances the apparatus 50 may be a device that is configured for use within an audio environment, such as a home audio environment, whereas in other instances the apparatus 50 may be a device that is configured for use in “the cloud,” e.g., a server.
In this example, the apparatus 50 includes an interface system 55 and a control system 60. The interface system 55 may, in some implementations, be configured for communication with one or more other devices of an audio environment. The audio environment may, in some examples, be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc. The interface system 55 may, in some implementations, be configured for exchanging control information and associated data with audio devices of the audio environment. The control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 50 is executing.
The interface system 55 may, in some implementations, be configured for receiving, or for providing, a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. Metadata may, for example, have been provided by what may be referred to herein as an “encoder.” In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 55 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 55 may include one or more wireless interfaces. The interface system 55 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 55 may include one or more interfaces between the control system 60 and a memory system, such as the optional memory system 65 shown in
The control system 60 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 60 may reside in more than one device. For example, in some implementations a portion of the control system 60 may reside in a device within one of the environments depicted herein and another portion of the control system 60 may reside in a device that is outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc. In other examples, a portion of the control system 60 may reside in a device within one of the environments depicted herein and another portion of the control system 60 may reside in one or more other devices of the environment. For example, control system functionality may be distributed across multiple smart audio devices of an environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of the control system 60 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 60 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc. The interface system 55 also may, in some examples, reside in more than one device.
In some implementations, the control system 60 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 60 may be configured to obtain a plurality of echo references. The plurality of echo references may include at least one echo reference for each audio device of a plurality of audio devices in an audio environment. Each echo reference may, for example, correspond to audio data being played back by one or more loudspeakers of one audio device of the plurality of audio devices.
In some implementations, the control system 60 may be configured to make an importance estimation for each echo reference of the plurality of echo references. In some examples, making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment. The echo management system(s) may include an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).
According to some examples, the control system 60 may be configured to select, based at least in part on the importance estimation, one or more selected echo references. In some examples, the control system 60 may be configured to provide the one or more selected echo references to the at least one echo management system.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 65 shown in
In some examples, the apparatus 50 may include the optional microphone system 70 shown in
According to some implementations, the apparatus 50 may include the optional loudspeaker system 75 shown in
In some implementations, the apparatus 50 may include the optional sensor system 80 shown in
In some implementations, the apparatus 50 may include the optional display system 85 shown in
According to some such examples the apparatus 50 may be, or may include, a smart audio device. In some such implementations the apparatus 50 may be, or may include, a wakeword detector. For example, the apparatus 50 may be, or may include, a virtual assistant.
Playback media that is stereo or mono has traditionally been rendered into an audio environment (e.g., a living space, automobile, office space, etc.) via a pair of speakers physically wired to an audio player (e.g., a CD/DVD player, a television (TV), etc.). As smart speakers have become popular, users often have more than two audio devices configured for wireless communication (which may include, but are not limited to, smart speakers or other smart audio devices) in their homes (or other audio environments) that are capable of playing back audio.
Smart speakers are often configured to operate according to voice commands. Accordingly, such smart speakers are generally configured to listen continuously for a wakeword, which will normally be followed by a voice command. Any continuous listening task such as waiting for a wakeword, or performing any kind of “continuous calibration,” will preferably continue to function during the playback of content (such as the playback of music, the playback of sound tracks for movies and television programs, etc.) and while device interactions take place (e.g., during telephone calls). Audio devices that need to listen during the playback of content will typically need to employ some form of echo management, e.g., echo cancellation and/or echo suppression, to remove the “echo” (content played by the devices) from microphone signals.
According to this example, the audio environment 100 includes audio devices 110A, 110B and 110C. In this example, each of the audio devices 110A-110C is an instance of the apparatus 50 of
In this example, the audio devices 110A-110C are playing back audio content while a person 130 is talking. The microphones of audio device 110B detect not only the audio content played back by its own speaker, but also the speech sounds 131 of the person 130 and the audio content played back by the audio devices 110A and 110C.
In order to utilize as many speakers as possible at the same time, a typical approach is for all of the audio devices in an audio environment to play back the same content, with some timing mechanism to keep the playback media in synchronization. This has the advantage of making distribution simple, because all the devices receive the same copy of the playback media either downloaded or streamed to each audio device, or broadcast by one device and multicast to all the audio devices.
One major disadvantage of this approach is that no spatial effect is obtained. A spatial effect may be achieved by adding more playback channels (e.g., one per speaker), e.g., through upmixing. In some examples a spatial effect may be achieved via a flexible rendering process such as Center of Mass Amplitude Panning (CMAP), Flexible Virtualization (FV), or a combination of CMAP and FV. Relevant examples of CMAP, FV and combinations thereof are described in International Patent Publication No. WO 2021/021707 A1 (e.g., on pages 25-41), which is hereby incorporated by reference.
In this example, the audio devices 110A-110D are rendering content 122A, 122B, 122C and 122D via the loudspeakers 121A-121D. The “echo” corresponding to the content 122A-122D played back by each of the audio devices 110A-110D is detected by each of the microphones 120A-120D. In this example, the audio devices 110A-110D are configured to listen for a command or wakeword in the speech 131 from the person 130 within the audio environment 100.
According to this example, the control system 60 is implementing a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A and a speech processing block 240A. The MC-EMS 203A may include an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES, depending on the particular implementation. According to this example, the speech processing block 240A is configured to detect user wakewords and commands. In some implementations, the speech processing block 240A may be configured for supporting a communications session, such as a telephone call.
In this implementation, the renderer 201A is configured to provide a local echo reference 220A to the MC-EMS 203A. The local echo reference 220A corresponds to (and in this example is identical to) the speaker feed signals provided to the loudspeaker 121A for playback by the audio device 110A. According to this example, the renderer 201A is also configured to provide non-local echo references 221A—corresponding to the content 122B, 122C and 122D played back by the other audio devices in the audio environment 100—to the MC-EMS 203A.
According to some examples, the audio device 110A receives a combined bitstream (e.g., as shown in
In some instances, the local echo reference 220A and/or the non-local echo references 221A may be full-fidelity replicas of the speaker feed signals provided to the loudspeakers 121A-121D for playback. In some alternative examples, the local echo reference 220A and/or the non-local echo references 221A may be lower-fidelity representations of the speaker feed signals provided to the loudspeakers 121A-121D for playback. In some such examples, the non-local echo references 221A may be downsampled versions of the speaker feed signals provided to the loudspeakers 121B-121D for playback. According to some examples, the non-local echo references 221A may be lossy compressions of the speaker feed signals provided to the loudspeakers 121B-121D for playback. In some examples, the non-local echo references 221A may be banded power information corresponding to the speaker feed signals provided to the loudspeakers 121B-121D for playback.
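As a simple illustration of a lower-fidelity representation, a downsampled echo reference might be produced as sketched below. The block-averaging approach and the function name are assumptions made for brevity; averaging acts only as a crude anti-aliasing filter, and a practical implementation would likely use a properly designed low-pass filter before decimation.

```python
def downsample(signal, factor):
    """Average each complete block of `factor` samples and keep one value per block.

    Trailing samples that do not fill a complete block are discarded.
    """
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


# Hypothetical speaker feed samples reduced to half the original rate.
speaker_feed = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
low_rate_reference = downsample(speaker_feed, factor=2)
```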
According to this implementation, the MC-EMS 203A is configured to use the local echo reference 220A and the non-local echo references 221A to predict and cancel and/or suppress the echo from microphone signals 223A, thereby producing the residual signal 224A in which the speech to echo ratio (SER) may have been improved with respect to that in the microphone signals 223A. This residual signal 224A may enable the speech processing block 240A to detect user wakewords and commands. In some implementations, the speech processing block 240A may be configured for supporting a communications session, such as a telephone call.
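The prediction and cancellation of echo from multiple echo references may be illustrated with a normalized least-mean-squares (NLMS) adaptive filter, which is one well-known technique for acoustic echo cancellation. The sketch below is not asserted to be how the MC-EMS 203A is implemented: the function name, tap count, step size and synthetic signals are hypothetical, and the simulated microphone signal contains echo only (no speech or noise) so that convergence is easy to observe.

```python
import random


def multichannel_nlms(mic, references, num_taps=4, step=0.5, eps=1e-8):
    """Cancel echo from `mic` using one adaptive FIR filter per echo reference.

    Returns the residual signal and the final filter weights.
    """
    weights = [[0.0] * num_taps for _ in references]
    residual = []
    for n in range(len(mic)):
        # Most recent num_taps samples of each reference, zero-padded at the start.
        history = [[ref[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
                   for ref in references]
        # Predicted echo is the sum of all per-reference filter outputs.
        estimate = sum(w * x for wr, hr in zip(weights, history)
                       for w, x in zip(wr, hr))
        error = mic[n] - estimate
        # Normalize the update by the total reference energy in the filter windows.
        energy = sum(x * x for hr in history for x in hr) + eps
        for wr, hr in zip(weights, history):
            for k in range(num_taps):
                wr[k] += step * error * hr[k] / energy
        residual.append(error)
    return residual, weights


# Synthetic example: echo from a local reference and a delayed non-local reference.
rng = random.Random(0)
ref_local = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ref_remote = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.6 * ref_local[n] + (0.3 * ref_remote[n - 1] if n > 0 else 0.0)
       for n in range(2000)]
residual, weights = multichannel_nlms(mic, [ref_local, ref_remote])
```

If the non-local reference were omitted from `references` in this example, its echo could not be predicted and would remain in the residual signal, which is one way of seeing why the set of available echo references may affect echo management performance.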
Some aspects of this disclosure involve making an importance estimation for each echo reference of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment (e.g., the mitigation of echo by the MC-EMS 203A of audio device 110A). Various examples are provided below.
In the context of distributed and orchestrated devices, for the purposes of echo management, according to some examples each audio device may obtain the echo references corresponding to what is played back by one or more other audio devices in an audio environment, in addition to its own echo reference. The impact of including a particular echo reference in a local echo management system or “EMS” (such as the MC-EMS 203A of audio device 110A) may vary according to a multitude of parameters, such as the diversity of the audio content being played out, the network bandwidth required for transmitting the echo reference, the encoding computational requirement for encoding an echo reference if an encoded echo reference is transmitted, the decoding computational requirement for decoding the echo reference, the echo management system computational requirement for using the echo reference by the echo management system, the relative audibility of the audio devices, etc.
For example, if each audio device is rendering the same content (in other words, if monophonic audio is being played back), then there is little, albeit non-zero, benefit to be gained by making additional references available to the EMS. Moreover, due to practical limitations (such as bandlimited networks) it may not be desirable for all devices to share a replica of their local echo reference. Therefore, some implementations may provide a distributed and orchestrated EMS (DOEMS), wherein echo references are prioritized and transmitted (or not) accordingly. Some such examples may implement a tradeoff between the cost (e.g., network bandwidth required and/or computational overhead required) and the benefit (e.g., the expected echo mitigation improvement, which may be measured according to the signal-to-echo ratio (SER) and/or echo return loss enhancement (ERLE)) of each additional echo reference.
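One way of expressing such a tradeoff is sketched below. The ERLE computation follows the conventional definition as the ratio, in decibels, of microphone signal power to residual signal power. The greedy benefit-per-cost selection, the budget and the candidate values are hypothetical choices made for illustration; other prioritization schemes may equally be used.

```python
import math


def erle_db(mic, residual, floor=1e-12):
    """Echo return loss enhancement: microphone power over residual power, in dB."""
    p_mic = sum(x * x for x in mic) / len(mic)
    p_res = sum(x * x for x in residual) / len(residual)
    return 10.0 * math.log10(max(p_mic, floor) / max(p_res, floor))


def select_under_budget(candidates, budget):
    """Greedily pick references with the best expected benefit per unit cost.

    `candidates` is a list of (name, expected_benefit_db, cost) tuples with cost > 0.
    Returns the selected names and the total cost spent.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    selected, spent = [], 0.0
    for name, _benefit, cost in ranked:
        if spent + cost <= budget:
            selected.append(name)
            spent += cost
    return selected, spent


# Hypothetical expected ERLE improvements (dB) and costs (arbitrary units).
candidates = [
    ("local", 12.0, 1.0),
    ("device_B", 6.0, 4.0),
    ("device_C", 1.0, 4.0),
    ("device_D", 3.0, 1.5),
]
selected, spent = select_under_budget(candidates, budget=7.0)
example_erle = erle_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```

In this example the reference from the hypothetical device C offers little expected benefit relative to its cost and is therefore not transmitted.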
In
In other examples, the echo reference 220A′ may be different from the channel 0 audio because the echo reference 220A′ may not be a full-fidelity replica of the audio data being played back on the audio device 110A. In some such examples, the echo reference 220A′ may correspond to the audio data being played back on the audio device 110A, but may require relatively less data than the complete replica and therefore may consume relatively less bandwidth of the local network when the echo reference 220A′ is transmitted.
According to some such examples, the audio device 110A may be configured for making a downsampled version of the local echo reference 220A that is described above with reference to
In some examples, the audio device 110A may be configured for making a lossy compression of the local echo reference 220A. In such instances, the echo reference 220A′ may be a result of the control system 60a applying a lossy compression algorithm to the local echo reference 220A.
According to some examples, audio device 110A may be configured for providing banded power information to the audio devices 110B and 110C corresponding to the local echo reference 220A. In some such examples, instead of transmitting a full-fidelity replica of the audio data being played back on the audio device 110A, the control system 60a may be configured to determine a power level of the audio data being played back on the audio device 110A in each of a plurality of frequency bands and to transmit the corresponding banded power information to the audio devices 110B and 110C. In some such examples, the echo reference 220A′ may be, or may include, the banded power information.
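By way of illustration only, the banded power representation described above may be sketched as follows. The function name `banded_power`, the naive DFT, the frame length and the band edges are assumptions made for this sketch and are not taken from the disclosure; a practical implementation would likely use an FFT or a filterbank.

```python
import cmath
import math

def banded_power(frame, band_edges):
    """Return the power of `frame` in each band of DFT bins [lo, hi)."""
    n = len(frame)
    # Naive DFT over the non-negative frequency bins (adequate for short frames).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]
    power = [abs(c) ** 2 / n for c in spectrum]
    # Sum the bin powers that fall inside each band.
    return [sum(power[lo:hi]) for lo, hi in zip(band_edges[:-1], band_edges[1:])]

# A 64-sample frame holding a sinusoid centred on bin 4 (hypothetical input).
frame = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
bands = banded_power(frame, [0, 2, 8, 33])
```

In this sketch only one number per band would need to be transmitted per frame, rather than every sample, which is the source of the bandwidth saving.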
In this example, the audio device 110A is an instance of the audio device 110A of
In this example, the elements of
The echo reference orchestrator 302A may function in various ways, depending on the particular implementation. Many examples are disclosed herein. In some examples, the echo reference orchestrator 302A may be configured for making an importance estimation for each echo reference of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment (e.g., the mitigation of echo by the MC-EMS 203A of audio device 110A).
Some examples of making the importance estimation may involve determining an importance metric. In some such examples, the importance metric may be based, at least in part, on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof. In some examples, the importance metric may be based, at least in part, on metadata (e.g., the metadata 312A), such as metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a matrix of loudspeaker activations, or one or more combinations thereof. In some examples, the importance metric may be based, at least in part, on a current listening objective, a current ambient noise estimate, an estimate of a current performance of at least one echo management system, or one or more combinations thereof.
According to some examples, the echo reference orchestrator 302A may be configured for selecting a set of one or more echo references based, at least in part, on a cost determination. In some examples, the echo reference orchestrator 302A may be configured to make the cost determination, whereas in other examples another block of the control system 60a may be configured to make the cost determination. In some instances, the cost determination may involve determining a cost for at least one echo reference of a plurality of echo references, or in some cases for each of the plurality of echo references. In some examples, the cost determination may be based on network bandwidth required for transmitting the echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, a downsampling cost of making a downsampled version of the echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or one or more combinations thereof.
According to some examples, the cost determination may be based on a replica of the at least one echo reference in a time domain or a frequency domain, on a downsampled version of the at least one echo reference, on a lossy compression of the at least one echo reference, on banded power information for the at least one echo reference, or one or more combinations thereof. In some instances, the cost determination may be based on a method of compressing a relatively more important echo reference less than a relatively less important echo reference. In some implementations, the echo reference orchestrator 302A (or another block of the control system 60a) may be configured for determining a current echo management system performance level (e.g., based at least in part on the metric(s) 350A). In some such examples, selecting the one or more selected echo references may be based, at least in part, on the current echo management system performance level.
Depending on the distributed audio device system, its configuration and the type of audio session (e.g., communication vs. listening to music) and/or the nature of the content being rendered, the rate at which the importance of each echo reference is estimated and the rate at which the set of echo references is evaluated may differ. Moreover, the rate at which the importance is estimated need not be equal to the rate at which the echo reference selection process makes decisions. If the two are not synchronized, in some examples the importance calculation would be more frequent. In some instances, the echo reference selection may be a discrete process wherein binary decisions are made either to include or not include particular echo references.
The fidelity of the copies, or representations, of the echo references will generally correlate inversely to the number of bits required for each such copy or representation. Accordingly, the fidelity of the copies, or representations, of the echo references provides an indication of the tradeoff between network cost (due to the varying number of bits required for transmission) and the expected echo management performance (because the performance should improve as the fidelity increases). Note that the straight lines used to connect the points in
In this example, the echo reference orchestrator 302A is an instance of the echo reference orchestrator 302A of
The echo reference importance estimator 401A may function in various ways, depending on the particular implementation. Various examples are provided in this disclosure. In some examples, the echo reference importance estimator 401A may be configured for making an importance estimation for each echo reference of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment (e.g., the mitigation of echo by the MC-EMS 203A of audio device 110A).
In this example, making the importance estimation involves determining importance metrics 420A. The importance metrics 420A may be based, at least in part, on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof. In some examples, an importance metric may be based, at least in part, on metadata (e.g., the metadata 312A), which may include metadata corresponding to an audio device layout, loudspeaker metadata (e.g., the sound pressure level (SPL) ratings, frequency ranges, whether the loudspeaker is an upwards-firing loudspeaker, etc.), metadata corresponding to received audio data (e.g., positional metadata, metadata indicating vocals or other speech, etc.), an upmixing matrix, a matrix of loudspeaker activations, or one or more combinations thereof. In some instances, as suggested by the dashed arrow 420A, the echo reference importance estimator 401A may provide importance metrics 420A to the MC-EMS performance model 405A.
According to this example, the importance metrics 420A are based, at least in part, on a current listening objective, as indicated by the information 421A. As described in more detail below, the current listening objective may significantly change how factors such as level, uniqueness, temporal persistence, audibility, etc., are evaluated. For example, the importance analysis may be very different during a telephone call than when awaiting a wakeword.
In this example, the importance metrics 420A are based, at least in part, on the current ambient noise estimate 318A, the metric(s) 350A indicating the current performance of the MC-EMS 203A, information 423A produced by the MC-EMS performance model 405A, or one or more combinations thereof. In some implementations, the echo reference importance estimator 401A may determine that a relatively higher room noise level (as indicated by the current ambient noise estimate 318A) will make it less likely that adding an echo reference will help mitigate echo significantly. As noted above, information 423A may correspond to the type of information that is described above with reference to
According to this implementation, the echo reference selector 402 selects a set of one or more echo references based, at least in part, on one or more metrics 350A indicating the current performance of the MC-EMS 203A, the importance metrics 420A, the current listening objective 421A, information 422A indicating the cost(s) of including an echo reference in the set of echo references 313A and information 423A produced by the MC-EMS performance model 405A. Some detailed examples of how the echo reference selector 402 may select echo references are provided below.
In this example, the cost estimation module 403A is configured to determine the computational and/or network costs of including an echo reference in the set of echo references 313A. The computational cost may, for example, include the additional computational cost of use, by the MC-EMS 203A, of a particular echo reference. This computational cost may depend, in turn, on the number of bits required to represent the echo reference. In some examples, the computational cost may include the computational cost of a lossy echo reference encoding process and/or the computational cost of a corresponding echo reference decoding process. Determining the network costs may involve determining the amount of data required to send a complete replica of an echo reference or a copy or representation of the echo reference across a local data network (e.g., a local wireless data network).
In some instances, the echo reference selection block 402A may generate and transmit a request 311A for another device in the audio environment to send one or more echo references to it over the network. (Element 314A of
One may note that a request for an encoded echo reference not only introduces a network cost due to sending the request and the reference, but also adds a computational cost for the responding device(s) (e.g., the smart home hub 105 or one or more of the audio devices 110B-110D) that must encode the reference, as well as the computational cost for the audio device 110A to decode the received reference. However, this encoding cost may be a one-time cost. Accordingly, the request from one audio device to another to send an encoded reference over the network changes the potential performance/cost tradeoff being performed in other devices (e.g., in audio devices 402C and 402D).
In some implementations, one or more of the blocks of the echo reference orchestrator 302A may be performed by an orchestrating device, e.g., the smart home hub 105 or one of the audio devices 110A-110D. According to some such implementations, at least some functionality of the echo reference importance estimator 401A and/or the echo reference selection block 402A may be performed by the orchestrating device. Some such implementations may be capable of determining cost/benefit trade-offs on a systemwide basis, taking into account the performance enhancements of all instances of the MC-EMS in the audio environment, the overall computational demands for all instances of the MC-EMS, the overall demands on the local network and/or the overall computational demands for all encoders and decoders.
Simply stated, the importance metric (which may be referred to herein as “Importance” or “I”) may be a measure of the expected improvement in performance of an EMS due to the inclusion of a particular echo reference. In some embodiments, Importance may depend on the present state of the EMS, particularly on the set of echo references already in use and at what level of fidelity they are being received. Importance may be available at different timescales, depending on the particular implementation. On one extreme, Importance may be implemented on a frame-by-frame basis (e.g., according to an Importance signal for each frame). In other examples, Importance may be implemented as a constant value for the duration of a content segment, or as a constant value for the time during which a particular configuration of audio devices is in use. The configuration of audio devices may correspond to audio device positions and/or audio device orientations.
Accordingly, the Importance metric may be calculated on a variety of timescales depending on the particular implementation, e.g.:
Decisions regarding which echo references are to be selected for the purposes of echo management can be made on a similar (or slower) time scale than that at which the importance metric is evaluated. For example, a device or system might estimate importance every 30 seconds and make a decision about changing the selected echo references every few minutes.
According to some examples, a control system may be configured to determine an Importance matrix, which may include all the importance information for a present system of audio devices. In some such examples, the Importance matrix may have dimension N×M, including an entry for each audio device and an entry for each potential echo reference channel. In some such examples, N represents the number of audio devices and M represents the number of potential echo references. Because some audio devices may play back more than one channel, this type of Importance matrix will not always be square.
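A minimal sketch of such an Importance matrix, and of ranking the echo references for one audio device, is given below. The numerical values, the variable name `importance` and the helper `ranked_references` are hypothetical and are used only for illustration.

```python
# Hypothetical importance values: rows are audio devices (N = 3),
# columns are potential echo reference channels (M = 4).
importance = [
    [0.9, 0.4, 0.1, 0.3],
    [0.2, 0.8, 0.5, 0.1],
    [0.1, 0.3, 0.7, 0.6],
]

def ranked_references(importance_matrix, device):
    """Return reference indices for `device`, most important first."""
    row = importance_matrix[device]
    return sorted(range(len(row)), key=lambda m: row[m], reverse=True)

order_for_device_0 = ranked_references(importance, 0)
```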
In some implementations, the importance metric I may be based on one or more of the following:
The level aspect (L) describes the level or loudness of the echo reference. All other things being equal, it is well known that louder playback signals have an increased impact on EMS performance. As used herein, the term “level” refers to the level within the digital representation of an audio signal, and not necessarily to the actual sound pressure level of the audio signal after being reproduced via a loudspeaker. In some examples, the loudness of a single channel of echo reference may be based on a root mean square (RMS) metric or an LKFS (loudness, k-weighted, relative to full scale) metric. Such metrics are easily computed on the echo references in real-time, or may be present as metadata in a bitstream. According to some implementations, L may be determined according to a volume setting, such as an audio system volume setting or a volume setting within a media application.
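As one illustration of an RMS-based level metric, the following sketch computes the RMS level of a block of samples in dB relative to full scale. The function name `rms_level_dbfs` and the −120 dB floor are assumptions; the convention used here places a full-scale square wave at 0 dB. An LKFS measurement would additionally require K-weighting and gating, which are not shown.

```python
import math

def rms_level_dbfs(samples, floor_db=-120.0):
    """RMS level of samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        return floor_db
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0.0:
        return floor_db
    return max(floor_db, 10.0 * math.log10(mean_square))

# Hypothetical test signals: square waves at full and half scale.
full_scale_square = [1.0, -1.0] * 50
half_scale_square = [0.5, -0.5] * 50
```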
The uniqueness aspect (U) is intended to capture the amount of new information that a particular echo reference provides about an overall audio presentation. From a statistical point of view, multichannel audio presentations often contain redundancy across channels. This redundancy may, for example, occur because instruments and other sound sources are replicated across channels on the left and right sides of a room, or as signals are panned and thus further replicated in multiple active loudspeakers at the same time. Even though such scenarios result in an over-specified problem for an EMS to solve (where echo filters may infer observations from multiple echo paths), some benefits and higher performance can nonetheless be observed in practice.
U may be computed or estimated in various ways. In some examples U may be based, at least in part, on the correlation coefficient between each echo reference. In one such example, U may be estimated as follows:
$U_r = 1 - \max_r\left(\sum_{m=0}^{M}\sum_{n=0}^{N} x_r[n]\,x_m[n]\right)$, wherein the subscript “r” corresponds with a particular echo reference being evaluated, N represents the total number of audio devices in an audio environment, n represents an individual audio device, M represents the total number of potential echo references in the audio environment and m represents an individual echo reference.
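A correlation-based uniqueness estimate may be sketched as follows. This sketch is a simplified variant rather than a direct transcription of the foregoing expression: it assumes a zero-lag, mean-removed, normalized correlation coefficient, takes its absolute value, and excludes the reference being evaluated from the comparison. The names `correlation` and `uniqueness` and the test signals are hypothetical.

```python
import math

def correlation(x, y):
    """Zero-lag normalized correlation coefficient of two equal-length signals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den > 0.0 else 0.0

def uniqueness(refs, r):
    """1 minus the largest |correlation| between reference r and any other."""
    others = [abs(correlation(refs[r], refs[m])) for m in range(len(refs)) if m != r]
    return 1.0 - max(others) if others else 1.0

left = [math.sin(0.2 * n) for n in range(200)]
right = [0.5 * s for s in left]                      # scaled copy of left: redundant
surround = [math.cos(0.31 * n + 1.0) for n in range(200)]
refs = [left, right, surround]
```

In this sketch, `left` and `right` receive a uniqueness near zero because each is fully predictable from the other, whereas `surround` scores much higher.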
Alternatively, or additionally, in some examples U may be based, at least in part, on decomposition of audio signals to find redundancies. Some such examples may involve instantaneous frequency estimation, fundamental frequency (F0) estimation, spectrogram inversion and/or nonnegative matrix factorization (NMF).
According to some examples U may be based, at least in part, on data used for matrix decoding. Matrix decoding is an audio technology in which a small number of discrete audio channels (e.g., 2) are decoded into a larger number of channels on play back (e.g., 4 or 5). The channels are generally arranged for transmission or recording by an encoder, and decoded for playback by a decoder. Matrix decoding allows multichannel audio, such as surround sound, to be encoded in a stereo signal, to be played back as stereo on stereo equipment, and to be played back as surround on surround equipment. In one such example, if a stream of stereo audio data were being received by a Dolby 5.1 system, a static upmixing matrix could be applied to the stereo audio data in order to provide properly rendered audio for each of the loudspeakers in the Dolby 5.1 system. According to some examples U may be based, at least in part, on the coefficients of an up-mixing or down-mixing matrix used to address each of the loudspeakers of an audio environment (e.g., each of the audio devices 110A-110D) with audio.
In some examples U may be based, at least in part, on a standard canonical loudspeaker layout used in the audio environment (e.g., Dolby 5.1, Dolby 7.1, etc.). Some such examples may involve leveraging the way media content is traditionally mixed and presented in such a canonical loudspeaker layout. For example, in a Dolby 5.1 or a Dolby 7.1 system, artists typically put vocals in the center channel, but not surround channels. As noted above, audio corresponding to musical instruments and other sound sources is typically replicated across channels on the left and right sides of a room. In some instances, vocals, dialogue, instrumental music, etc., may be identified via metadata received with the corresponding audio data.
The persistence metric (P) is intended to capture the aspect that different types of played-back media may have a wide range of temporal persistence, with different types of content having varying degrees of silence and loudspeaker activation. A continuous stream of spectrally dense content (such as music or the audio output of a video game console) may have a high level of temporal persistence, whereas podcasts may have a lower level of temporal persistence. Infrequent system notifications will have a very low level of temporal persistence. Echo references corresponding to media with a low degree of persistence may be less important for an EMS, depending on the specific listening task at hand. For instance, an occasional system notification is less likely to collide with a wake-word or barge-in request, and thus the relative importance of managing this echo is low.
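One simple way to quantify temporal persistence, offered here only as an assumed sketch, is the fraction of frames in an observation window whose level exceeds an activity threshold. The function name `persistence`, the −60 dB threshold and the example level sequences are hypothetical.

```python
def persistence(frame_levels_db, active_threshold_db=-60.0):
    """Fraction of frames whose level exceeds the activity threshold."""
    if not frame_levels_db:
        return 0.0
    active = sum(1 for level in frame_levels_db if level > active_threshold_db)
    return active / len(frame_levels_db)

music = [-20.0] * 100                       # continuously active content
notification = [-20.0] * 2 + [-90.0] * 98   # brief, infrequent burst
```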
Following are examples of metrics that may be used to measure or estimate persistence:
According to some examples, the audio content type may affect estimates of L, U and/or P. For example, knowing that the audio content is stereo music would allow the ranking of all of the echo references using just the channel assignment mentioned above. Alternatively, knowing that the audio content is Atmos could alter default L, U and/or P assumptions if the control system were not to analyze the audio content but instead to rely on the channel assignment.
The audibility metric (A) is directed to the facts that audio devices have different playback characteristics and may be located at varying distances from one another in any given audio environment. Following are examples of metrics that may be used to measure or estimate audio device audibility:
Other factors may be evaluated to estimate importance and, in some instances, to determine an importance metric.
The listening objective may define the context and desired performance characteristics of the EMS. In some examples, the listening objective may modify the parameters and/or the domain over which LUPA is evaluated. The following discussion will consider 3 potential contexts in which the listening objective changes. In these different contexts, we will see how Probability and Criticality can affect LUPA.
When waiting for barge-in, there is no immediate urgency: all time intervals in the future are normally considered to have an equal probability of a wakeword being spoken by the user. Furthermore, the wakeword detector is likely to be the most robust element of the voice assistant and the effect of echo leakage is less critical.
Immediately after a wakeword is spoken by a person, the likelihood of the person speaking a command is extremely high. Therefore, there is a high probability of collision with echo in the immediate future. Furthermore, because the command recognition module may be relatively less robust than the wakeword detector, the criticality of echo leakage will generally be high.
During a voice call, it is essentially certain that the participants (both the person(s) in the audio environment and the person(s) at the far end) will speak to one another. In other words, the probability of a collision of echo with the user's voice is essentially 1. However, because the person or persons at the far end are human and can deal extremely well with background noise, the criticality is low: they are unlikely to be bothered by echo leakage.
During these different listening objective contexts, in some examples the way LUPA is evaluated may change.
There may be no temporal discrimination because all future time intervals are considered to have equal probability of a wakeword being spoken. Thus, the temporal range over which a control system evaluates LUPA may be quite long in order to obtain better estimates of those parameters. In some such examples, the time interval over which a control system evaluates LUPA may be set to look relatively far into the future (e.g., over a time frame of minutes).
The time intervals immediately following a wakeword being spoken are very likely to have a command being spoken. Therefore, after the wakeword is detected, in some implementations LUPA may be evaluated over much shorter timescales than in the barge-in context, e.g., on the order of seconds. In some examples, references that are temporally sparse and which have content playing within the next few seconds after wakeword detection will be considered much more important during this time interval, now that the likelihood of a collision is high.
In this example, method 500 is an echo reference selection method. The blocks of method 500 may, for example, be performed by a control system, such as the control system 60a of
The reference selection method of
In this example, block 501 involves determining whether or not a current performance level of an EMS is greater than or equal to a desired performance level. If so, the process terminates (block 510). However, if it is determined that the current performance level is less than a desired performance level, in this example the process continues to block 502. According to this example, the determination of block 501 is based, at least in part, on one or more metrics indicating the current performance of the EMS, such as adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, etc. In some examples wherein the determination of block 501 is made by the echo reference orchestrator 302A, this determination may be based, at least in part, on the one or more metrics 350A from the MC-EMS 203A. As noted above, some implementations may not include block 501.
According to this example, block 502 involves ranking the remaining unselected echo references by importance and estimating the potential EMS performance increase to be gained by including the most important echo reference that is not yet being used by the EMS. In some examples wherein the process of block 502 is performed by the echo reference orchestrator 302A, this process may be based, at least in part, on information 423A produced by the MC-EMS performance model 405A, which may in some examples be, or include, data such as shown in
In this example, block 503 involves comparing the performance and cost of adding the echo reference selected in block 502. In some examples wherein the process of block 503 is performed by the echo reference orchestrator 302A, block 503 may be based, at least in part, on information 422A from the cost estimation module 403A indicating the cost(s) of including an echo reference in the set of echo references 313A.
Because performance and cost may be variables having different ranges and/or domains, it may be challenging to compare these variables directly. Therefore, in some implementations the evaluation of block 503 may be facilitated by mapping the performance and cost variables to a similar scale, such as a range between predefined minimum and maximum values.
In some implementations, the cost of adding the echo reference being evaluated may simply be set to zero if adding the echo reference would not cause a predetermined network bandwidth and/or computational cost budget to be exceeded. In some such examples, the cost of adding the echo reference being evaluated may be set to be infinite if adding the echo reference would cause a predetermined network bandwidth and/or computational cost budget to be exceeded. Such examples have the benefits of simplicity and efficiency. In this manner, the control system may simply add the maximum number of echo references that the predetermined network bandwidth and/or computational cost budget will allow.
According to some examples, the estimated performance increase corresponding with adding an echo reference may be set to zero if the estimated performance increase is not above a predetermined threshold (e.g., 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, etc.). Such methods can prevent the consumption of network bandwidth and/or computational overhead by including echo references that only add insignificant performance increases. Some detailed alternative examples of cost determination are described below.
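The budget-based cost and the thresholded performance increase described above may be sketched as follows. The function names, the dictionary keys, the units and the 5% default threshold are assumptions chosen for illustration.

```python
import math

def budget_cost(added_bandwidth, added_compute, used, budget):
    """Return 0 if the reference fits both budgets, infinity otherwise."""
    fits = (used["bandwidth"] + added_bandwidth <= budget["bandwidth"]
            and used["compute"] + added_compute <= budget["compute"])
    return 0.0 if fits else math.inf

def thresholded_gain(estimated_gain, min_gain=0.05):
    """Treat performance increases that do not exceed min_gain as zero."""
    return estimated_gain if estimated_gain > min_gain else 0.0

budget = {"bandwidth": 1000.0, "compute": 10.0}   # hypothetical units
used = {"bandwidth": 800.0, "compute": 4.0}       # resources already committed
```

With a cost that is either zero or infinite, a comparison of cost against estimated gain reduces to admitting references until a budget would be exceeded.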
In this example, block 504 involves determining whether or not the new echo reference will be added, given the performance/cost evaluation of block 503. In some examples, blocks 503 and 504 may be combined into a single block. According to this example, block 504 involves determining whether the cost of adding the echo reference being evaluated would be less than the EMS performance increase that is estimated to be caused by adding the echo reference. In this example, if the estimated cost would not be less than the estimated performance increase, the process continues to block 511 and method 500 terminates. However, in this implementation, if the estimated cost would be less than the estimated performance increase, the process continues to block 505.
According to this example, block 505 involves adding the new echo reference to the set of selected echo references. In some instances, block 505 may include informing the renderer 201 to output the relevant echo reference. According to some examples, block 505 may involve sending the echo reference over the local network or sending a command 311 to another device to send the echo reference over the local network.
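The flow of blocks 501 through 505 may be sketched as a greedy loop, as shown below. This is a simplified illustration under stated assumptions: each candidate carries a fixed importance, estimated gain and cost that have already been mapped to a common scale, and the gains are treated as additive, whereas an implementation may re-estimate them after each addition. The name `select_references` and the candidate values are hypothetical.

```python
def select_references(candidates, current_perf, target_perf):
    """Greedy selection loosely following blocks 501-505.

    `candidates` maps a reference name to (importance, gain, cost), with
    gain and cost already mapped to a common scale.
    """
    selected = []
    remaining = dict(candidates)
    while remaining:
        if current_perf >= target_perf:                       # block 501
            break
        best = max(remaining, key=lambda k: remaining[k][0])  # block 502
        _, gain, cost = remaining.pop(best)
        if cost >= gain:                                      # blocks 503-504
            break
        selected.append(best)                                 # block 505
        current_perf += gain
    return selected, current_perf

candidates = {
    "local": (0.9, 0.4, 0.1),
    "device_B": (0.6, 0.3, 0.2),
    "device_C": (0.3, 0.05, 0.2),
}
chosen, perf = select_references(candidates, current_perf=0.0, target_perf=0.9)
```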
The echo references evaluated in method 500 may be either local or non-local echo references, the latter of which may be determined locally (e.g., by a local renderer as described above) or received over a local network. Accordingly, the cost estimation for some echo references may involve evaluating both computational and network costs.
According to some examples, to evaluate the next echo reference after block 505, the control system may simply reset the selected and unselected echo references and revert to a previous block of
An echo reference may be transmitted (or used locally within a device, such as a device that produces all of the echo references) in a number of forms or variants, which may alter the cost/benefit ratio of that particular echo reference. For example, it is possible to reduce the cost of sending an echo reference across the local network if we transform the echo reference into a banded power form (in other words, determining the power in each of a plurality of frequency bands and transmitting banded power information about the power in each frequency band). However, the potential improvement that could be obtained by an EMS using a lower-fidelity variant of an echo reference will generally also be lower. The choice to make any particular variant of the echo reference available can be accounted for by making it a potential candidate for selection.
In some implementations, an echo reference may be in one of the following forms, which are listed below (the first four of which are in an estimated order of decreasing performance):
The blocks of method 550 may, for example, be performed by a control system, such as the control system 60a of
Method 550 takes into account the fact that echo references may not necessarily be transmitted or used in a full-fidelity form, but instead may be in one of the above-described alternative partial-fidelity forms. Therefore, in method 550 the evaluation of performance and cost does not involve a binary decision as to whether an echo reference in a full-fidelity form will or will not be used. Instead, method 550 involves determining whether to include one or more lower-fidelity versions of an echo reference, which may potentially involve less of an increase in EMS performance, but at a lower cost. Methods such as method 550 provide additional flexibility in the potential set of echo references to be used by the echo management system.
In this example, method 550 is an extension of the echo reference selection method 500 that is described above with reference to
Accordingly, method 550 involves evaluating lower-fidelity versions of an echo reference, if any are available. Such lower-fidelity versions may include a downsampled version of the echo reference, an encoded version of the echo reference produced via a lossy encoding process and/or banded power information corresponding to the echo reference.
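The evaluation of lower-fidelity versions may be sketched as follows. The variant names, their assumed fidelity ordering, the helper `choose_variant` and the example gain and cost values are all hypothetical; the sketch simply falls back to the next lower-fidelity variant whenever the cost of the current one is not less than its estimated gain.

```python
# Variant names in decreasing fidelity; the ordering is an assumption here.
FIDELITY_ORDER = ["full", "downsampled", "lossy", "banded_power"]

def choose_variant(variants):
    """Return the highest-fidelity variant whose cost is below its gain.

    `variants` maps a variant name to (gain, cost) on a common scale.
    Returns None when no available variant is worthwhile.
    """
    for name in FIDELITY_ORDER:
        if name in variants:
            gain, cost = variants[name]
            if cost < gain:
                return name
    return None

# Hypothetical estimates for one echo reference; no lossy variant is offered.
variants = {
    "full": (0.4, 0.6),
    "downsampled": (0.3, 0.35),
    "banded_power": (0.15, 0.05),
}
choice = choose_variant(variants)
```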
The “cost” of an echo reference refers to the resources required to utilize the reference for the purposes of echo management, whether that be with an AEC or an AES. Some disclosed implementations may involve estimating one or more of the following types of costs:
The total cost of a particular set of echo references may be determined as the sum of the cost of each echo reference in the set. Some disclosed examples involve combining both the network and computational costs. According to some examples, the total cost Ctotal may be determined as follows:
In the foregoing equation, Rcomp represents the total amount of computational resources available for the purposes of echo management, Rnetwork represents the total amount of network resources available for the purposes of echo management; Cmcomp represents the computational cost associated with using the mth reference and Cmnetwork represents the network cost associated with using the mth reference (where there are a total of M references used in the EMS). One may note that this definition implies that
and that Ctotal includes only the cost components that are closest to becoming bounded by the resources available to the system.
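By way of illustration only, the cost combination described above may be sketched as follows. This sketch assumes one form that is consistent with the description: the summed cost for each resource type is normalized by the amount of that resource available, and the larger ratio is taken as the total cost. The function name, the argument layout and the example values are hypothetical.

```python
def total_cost(ref_costs, r_comp, r_network):
    """Assumed form: max(sum(C_m_comp) / R_comp, sum(C_m_network) / R_network).

    ref_costs: list of (computational_cost, network_cost), one per echo reference.
    r_comp, r_network: resources available for echo management.
    """
    comp_ratio = sum(c for c, _ in ref_costs) / r_comp
    network_ratio = sum(n for _, n in ref_costs) / r_network
    # Only the resource closest to its limit determines the total cost.
    return max(comp_ratio, network_ratio)


# Hypothetical example: two references, network is the tighter resource.
example_costs = [(10.0, 64.0), (5.0, 16.0)]
print(total_cost(example_costs, r_comp=100.0, r_network=100.0))
```

Under this assumed form, a total cost of 1 would indicate that at least one resource type is fully consumed.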
The “performance” of an echo management system (EMS) may refer to the following:
Some examples may involve determining a single performance metric P. Some such examples use the ERLE and the robustness estimated from adaptive filter coefficient data or other AEC statistics obtained from the EMS. According to some such examples, a performance robustness metric PRob may be determined using the “microphone probability” extracted from an AEC, e.g., as follows:
In the foregoing equation, 0≤PRob≤1, 0≤M_prob≤1 and M_prob represents the microphone probability, which is the proportion of the number of subband adaptive filters in the AEC that produce poor echo predictions that do not provide substantial (or any) echo cancellation in their respective subband.
The performance of a wakeword (WW) detector is strongly dependent on the speech to echo ratio (SER), which is proportionately improved by the ERLE of the EMS. When the SER is too low, the WW detector is more likely to both trigger falsely (a false alarm) and miss keywords uttered by the user (a missed detection) due to the echo corrupting the microphone signal and decreasing the accuracy of the system. The SER of the residual signal (e.g., the residual signal 224A of
Accordingly, some disclosed examples involve mapping a desired WW performance level to a nominal SER level which in turn, in conjunction with knowledge of the typical playback levels of the devices in a system, allows a control system to map this desired WW performance level to a nominal ERLE directly. In some examples, this method may be extended to map the WW performance of a system at various SER levels to the ERLE. In some such implementations, the receiver operating characteristic (ROC) curve of a particular WW detector can be produced using input data with a range of SER values. Some examples involve choosing a particular False alarm rate (FAR) of interest and taking the accuracy of the WW detector as a function of the SER for this particular FAR as our application basis. In some such examples,
Acc(SERres)=ROC(SERres,FARI)
In the foregoing equation, Acc(SERres) represents the accuracy of the WW detector as a function of the SERres which represents the SER of the residual signal output by the EMS. ROC( ) represents a collection of ROC curves for multiple SERs and FARI represents the False alarm rate of interest, of which typical values may be 3 per 24 hours and 1 per 10 hours. The accuracy Acc(SERres) may be represented as a percentage or normalized such that it is in the range from 0 to 1, which may be expressed as follows:
With knowledge of the playback capability of the audio devices in the audio environment, LUPA components (e.g., for the actual echo level) and speech levels typical of the target audio environments can be combined to determine typical SER values in the microphone signal (e.g., microphone signal 223A of
In the foregoing equation, Speech_pwr and Echo_pwr represent the expected baseline speech power level and the echo power level of the targeted audio environment, respectively. By way of the EMS, the SERmic can be improved to SERres proportionately to the ERLE, e.g., as follows:
SERresdB=SERmicdB+ERLEdB
In the foregoing equation, the superscript dB indicates that the variables are represented in decibels in this example. For completeness, some implementations may define the ERLE of the EMS as follows:
Using the foregoing equations, some implementations may define a WW application based EMS performance metric as follows:
$$P_{WW} = \mathrm{Acc}\left(\mathrm{SER}_{mic}^{dB} + \mathrm{ERLE}^{dB}\right),$$
where SERmicdB is representative of the SER in the target environment. In some examples SERmicdB may be a static default number, whereas in other examples SERmicdB may be estimated, e.g., as a function of one or more LUPA components. Some implementations may involve defining a net performance metric P as a vector containing each element, e.g., as follows:
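By way of illustration only, the WW application based metric may be sketched as follows. The accuracy function is assumed here to be available as a small table of (SER in dB, accuracy) samples taken at one false alarm rate of interest and linearly interpolated. The table values are placeholders for illustration and are not measured data, and the function names are hypothetical.

```python
def accuracy_from_ser(ser_db, curve):
    """Linearly interpolate accuracy (0 to 1) from (SER_dB, accuracy) samples.

    curve: samples sorted by increasing SER, all at one false alarm rate.
    """
    if ser_db <= curve[0][0]:
        return curve[0][1]
    if ser_db >= curve[-1][0]:
        return curve[-1][1]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= ser_db <= x1:
            return y0 + (y1 - y0) * (ser_db - x0) / (x1 - x0)


def p_ww(ser_mic_db, erle_db, curve):
    # The residual SER in dB is the microphone SER in dB plus the ERLE in dB.
    return accuracy_from_ser(ser_mic_db + erle_db, curve)


# Placeholder curve for illustration only (not measured data).
CURVE = [(-20.0, 0.2), (0.0, 0.6), (20.0, 1.0)]
print(p_ww(-10.0, 20.0, CURVE))
```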
$$P = [P_{WW}, P_{Rob}]$$
In some examples, one or more additional performance components may be added by increasing the size of the net performance vector. In some alternative examples, one or more additional performance components may be combined into a single scalar metric by weighting them, e.g., as follows:
In the foregoing equation, K represents a weighting factor, chosen by the system designer, which is used to determine how much of each component contributes to the net performance. Some alternative examples may use another method, e.g., simply averaging individual performance metrics. However, it may be advantageous to combine the individual performance metrics into a single scalar one.
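By way of illustration only, a weighted scalar combination of the two performance components may be sketched as follows. Two forms are assumed here and should be treated as assumptions rather than as the disclosed equations: the robustness metric is taken to be one minus the microphone probability, and the net performance is taken to be a convex combination of the WW metric and the robustness metric controlled by K.

```python
def p_rob(mic_prob):
    # Assumed form: robustness falls as more subband filters predict poorly.
    return 1.0 - mic_prob


def net_performance(p_ww_value, p_rob_value, k):
    # Assumed form: convex combination with weighting factor K in [0, 1].
    return k * p_ww_value + (1.0 - k) * p_rob_value


print(net_performance(0.8, p_rob(0.25), k=0.5))
```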
When comparing the estimated cost and the estimated EMS performance enhancement for an echo reference, a method needs to somehow compare these two parameters which will not normally be in the same domain. One such method involves evaluating the cost and performance estimates individually and taking the lowest-cost solution that meets a predefined minimum performance criterion, Pmin. This predefined EMS performance criterion may, for example, be determined according to the requirements of a specific downstream application (e.g., providing a telephone call, music playback, awaiting a WW, etc.).
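By way of illustration only, taking the lowest-cost solution that meets a minimum performance criterion may be sketched as an exhaustive search over candidate sets. The candidate names, the per-reference costs and the additive performance model are hypothetical, and an exhaustive search is only practical for small candidate sets.

```python
from itertools import combinations


def cheapest_meeting_pmin(candidates, estimate_perf, p_min):
    """Return (cost, subset) for the lowest-cost subset with performance >= p_min.

    candidates: dict mapping a candidate reference name to its cost.
    estimate_perf: callable(subset) -> estimated EMS performance.
    Returns None if no subset meets p_min.
    """
    best = None
    names = sorted(candidates)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if estimate_perf(subset) < p_min:
                continue
            cost = sum(candidates[name] for name in subset)
            if best is None or cost < best[0]:
                best = (cost, subset)
    return best


# Hypothetical costs and performance gains, including a lower-fidelity version.
COSTS = {"local": 1.0, "B_full": 4.0, "B_banded": 1.0}
GAINS = {"local": 0.5, "B_full": 0.375, "B_banded": 0.25}
print(cheapest_meeting_pmin(COSTS, lambda s: sum(GAINS[n] for n in s), 0.75))
```

In this hypothetical example the banded version of device B's reference is preferred over the full-fidelity version, because it meets the criterion at a lower cost.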
For example, in an implementation in which the application is WW detection, the performance may relate to a WW performance metric PWW. In some such examples, there may be some minimum level of WW detector accuracy that is deemed sufficient (e.g., an 80% level of WW detector accuracy, an 85% level of WW detector accuracy, a 90% level of WW detector accuracy, a 95% level of WW detector accuracy, etc.), which would have a corresponding ERLEdB as per the previous section. In some such examples, the ERLE of the EMS may be estimated using the EMS performance model (e.g., the MC-EMS performance model 405 of
As an alternative to meeting some minimum performance metric, some implementations may involve using both a performance metric P and a cost metric C. Some such examples may involve using a tradeoff parameter λ (e.g., a Lagrange multiplier) and formulating the cost/performance evaluation process as an optimization problem that seeks to maximize a quantity, such as the variable F in the following expression:
$$F = P - \lambda C_{total}$$
One may observe that in the foregoing equation, a relatively larger value of F corresponds with a relatively larger difference between the performance metric P and the product of λ and the total cost Ctotal. The tradeoff parameter λ may be chosen (e.g., by the system designer) in order to directly trade off cost and performance. The solution for the set of echo references used by the EMS may then be found using an optimization algorithm wherein a set of echo references (which may include all available echo reference fidelity levels) determines the search space.
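By way of illustration only, the optimization over the set of echo references may be sketched as an exhaustive search that maximizes F, the performance metric minus λ times the total cost. The performance and cost estimators are passed in as hypothetical callables, and a practical system with many candidates would likely need a more efficient search.

```python
from itertools import combinations


def best_reference_set(candidates, estimate_perf, estimate_cost, lam):
    """Exhaustively maximize F = P - lam * C_total over all candidate subsets."""
    best_f, best_set = float("-inf"), ()
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            f = estimate_perf(subset) - lam * estimate_cost(subset)
            if f > best_f:
                best_f, best_set = f, subset
    return best_f, best_set


# Hypothetical per-reference performance gains and costs.
GAIN = {"A": 0.5, "B": 0.25}
COST = {"A": 0.25, "B": 0.5}
perf = lambda s: sum(GAIN[x] for x in s)
cost = lambda s: sum(COST[x] for x in s)
print(best_reference_set(("A", "B"), perf, cost, lam=1.0))
print(best_reference_set(("A", "B"), perf, cost, lam=0.25))
```

In this hypothetical example, a larger λ penalizes cost more heavily and yields a smaller selected set.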
The method 600 may be performed by an apparatus or system, such as the apparatus 50 that is shown in
In this implementation, block 605 involves obtaining, by a control system, a plurality of echo references. In this example, the plurality of echo references includes at least one echo reference for each audio device of a plurality of audio devices in an audio environment. Here, each echo reference corresponds to audio data being played back by one or more loudspeakers of one audio device of the plurality of audio devices.
In this example, block 610 involves making, by the control system, an importance estimation for each echo reference of the plurality of echo references. According to this example, making the importance estimation involves determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment. In this example, the at least one echo management system includes an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).
In this implementation, block 615 involves selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references. In this example, block 620 involves providing, by the control system, the one or more selected echo references to the at least one echo management system. In some implementations, method 600 may involve causing the at least one echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
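By way of illustration only, blocks 610 and 615 may be sketched as ranking the obtained echo references by an importance estimate and keeping the highest-ranked ones. The importance estimator shown here uses signal level only and is a hypothetical stand-in. The function names and the limit on the number of selected references are likewise assumptions.

```python
def mean_power(samples):
    # Stand-in importance estimate based on signal level only.
    return sum(s * s for s in samples) / len(samples)


def select_echo_references(references, importance_fn, max_refs):
    """Rank echo references by estimated importance and keep the top max_refs.

    references: dict mapping a device identifier to a sequence of samples.
    importance_fn: callable(samples) -> float.
    """
    ranked = sorted(
        references, key=lambda dev: importance_fn(references[dev]), reverse=True
    )
    return ranked[:max_refs]


refs = {"A": [0.1, -0.1], "B": [0.5, -0.5], "C": [0.0, 0.0]}
print(select_echo_references(refs, mean_power, max_refs=2))
```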
In some examples, obtaining the plurality of echo references may involve receiving a content stream that includes audio data and determining one or more echo references of the plurality of echo references based on the audio data. Some examples are described above with reference to the renderer 201A of
In some implementations, the control system may include an audio device control system of an audio device of the plurality of audio devices in the audio environment. In some such examples, the method may involve rendering, by the audio device control system, the audio data for reproduction on the audio device to produce local speaker feed signals. In some such examples, the method may involve determining a local echo reference that corresponds with the local speaker feed signals.
In some examples, obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data. Each of the non-local echo references may, for example, correspond to non-local speaker feed signals for playback on another audio device of the audio environment.
According to some examples, obtaining the plurality of echo references may involve receiving one or more non-local echo references. Each of the non-local echo references may, for example, correspond to non-local speaker feed signals for playback on another audio device of the audio environment. In some examples, receiving the one or more non-local echo references may involve receiving the one or more non-local echo references from one or more other audio devices of the audio environment. In some examples, receiving the one or more non-local echo references may involve receiving each of the one or more non-local echo references from a single other device of the audio environment.
In some examples, the method may involve a cost determination. According to some such examples, the cost determination may involve determining a cost for at least one echo reference of the plurality of echo references. In some such examples selecting the one or more selected echo references may be based, at least in part, on the cost determination. According to some such examples, the cost determination may be based, at least in part, on the network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or one or more combinations thereof. In some examples, the cost determination may be based, at least in part, on a full-fidelity replica of the at least one echo reference in a time domain or a frequency domain, on a downsampled version of the at least one echo reference, on a lossy compression of the at least one echo reference, on banded power information for the at least one echo reference, or one or more combinations thereof. In some examples, the cost determination may be based, at least in part, on a method of compressing a relatively more important echo reference less than a relatively less important echo reference.
According to some examples, the method may involve determining a current echo management system performance level. In some such examples, selecting the one or more selected echo references may be based, at least in part, on the current echo management system performance level.
In some examples, making the importance estimation may involve determining an importance metric for a corresponding echo reference. In some examples, determining the importance metric may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof. According to some examples, determining the importance metric may be based, at least in part, on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a matrix of loudspeaker activations, or one or more combinations thereof. In some examples, determining the importance metric may be based, at least in part, on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or one or more combinations thereof.
Some disclosed implementations involve the challenge of requiring the other (“non-local”) devices' playback references for each local echo management system (EMS). The bandwidth required for transmitting echo references to all the participating audio devices in an audio environment can be significant. Such bandwidth requirements may be prohibitive if the number of audio devices is large and if the transmitted echo references are full-fidelity replicas of the speaker feed signals provided to the loudspeakers. The computational resources required to implement such methods and systems, including but not limited to computational resources for implementing the non-local devices' postprocessing, may also be significant.
However, transmitting all the playback streams to all the participating audio devices in an audio environment may not be necessary or even desirable for some implementations. This is true in part because the amount of echo in audio devices heavily depends on the content, the listening objective(s) and the audio device configurations.
It was noted above that one way to assess the importance of each “non-local” reference is via importance metrics 420, which may be computed using the rendered audio streams that are used as echo references in the EMS. It was also noted above that in some disclosed examples, the importance metric may be based, at least in part, on metadata (e.g., one or more components of the metadata 312 described above), such as metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data (such as a spatiality index), an upmixing matrix, a matrix of loudspeaker activations (which may also be referred to herein as a “rendering matrix”), or one or more combinations thereof. Moreover, it was noted above that in some examples U (the “uniqueness” aspect of the “LUPA” echo reference characteristics from which the importance metric may be determined) may be based at least in part on data used for matrix decoding, such as a static upmixing matrix.
Many of the elements shown in
Moreover, the cost estimation modules 403 themselves are optional in the implementations of
The new, or newly-explicit, elements of
In the example shown in
If the non-local audio device post-processing signal chain parameters are available to the local audio device (for example, via a setup calibration step), the non-local device references may be locally generated and added to the rendered audio stream 720. In this case, the non-local device references may take the form of one or more virtual audio device references, or a set of device-specific, non-local device streams for the devices selected by the reference selection block. If device computational capabilities become a bottleneck, each audio device may render its local reference only, and use a local network to exchange echo references (e.g., as described above with reference to
In other implementations, the blocks that are shown in
In the example shown in
In this hub and spoke example, both the renderer 201 and the echo reference generator 710 reside in the hub device 805. In this example, each audio device 110 receives rendered audio data for playback from the hub device 805. In this example, rendered audio data for playback includes the local echo reference 220. The non-local audio device references may be rendered at the hub device 805 as one or more single virtual non-local device references, as device-specific echo references (e.g., as described above with reference to
Various examples of computing echo reference metrics according to rendering information, such as rendering metadata, are disclosed in the following paragraphs.
A main component of the rendering metadata set is the rendering matrix (722) for the given audio device configuration. The rendering matrix defines the audio device configuration's spatial-frequency response to any audio object in the encoded audio-stream. In some rendering matrix examples, the audio environment (e.g., a room within which the audio devices reside) is first discretized to [nx, ny, nz] points, and a rendering filter is designed for each spatial point, for each device. The rendering filter may be defined in a frequency bin domain, with all filters having an nbin number of taps. Thus, for an N device system, the rendering matrix is an N×nx×ny×nz set of nbin length filters.
Let us assume that a sound source should be located at point xa, ya, za. In some implementations, this information (702) is available for each audio object in an audio object metadata file provided to the renderer (for example, as an Atmos .prm file). An ideal rendering system would achieve this with high accuracy. However, in some examples, given the limitations of the audio device configuration this level of accuracy may not be guaranteed, but only a best effort result of the renderer is realized. For example, the audio object vector may be approximated by a weighted average of values corresponding to the closest grid points of a rendering matrix, and the subset of rendering filters activated for these points may be used to render the sound source at that location.
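By way of illustration only, the weighted average over the closest grid points may be sketched using trilinear interpolation weights. Trilinear weighting is an assumption made for this sketch. Other weightings of the closest grid points could equally be used. Positions are assumed to be expressed in grid units, and the function name is hypothetical.

```python
from itertools import product
from math import floor


def nearest_grid_weights(point, grid_shape):
    """Trilinear weights of the (up to) 8 grid points surrounding a position.

    point: (x, y, z) in grid units; grid_shape: (nx, ny, nz).
    Returns a dict {(ix, iy, iz): weight} whose weights sum to 1.
    """
    base, frac = [], []
    for p, n in zip(point, grid_shape):
        p = min(max(p, 0.0), n - 1.0)  # clamp the position into the grid
        b = min(int(floor(p)), n - 2) if n > 1 else 0
        base.append(b)
        frac.append(p - b if n > 1 else 0.0)
    weights = {}
    for corner in product((0, 1), repeat=3):
        w = 1.0
        idx = []
        for c, b, f, n in zip(corner, base, frac, grid_shape):
            w *= f if c else 1.0 - f
            idx.append(min(b + c, n - 1))
        if w > 0.0:
            key = tuple(idx)
            weights[key] = weights.get(key, 0.0) + w
    return weights


print(nearest_grid_weights((0.5, 0.0, 0.0), (2, 2, 2)))
```

The rendering filters for the returned grid points could then be combined using these weights to render the sound source at the requested location.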
Accordingly, the rendering matrix may act as a spatial transfer function, defined on each device and each point on a spatial grid. Thus, the rendering matrix includes information regarding how audible each audio device is to each other audio device (which may be referred to herein as “mutual audibility”). Although the rendering matrix 722 contains this information, it is desirable to compute an audibility metric that can be readily consumed by the echo reference importance estimator 401. The embodiments described below provide relevant examples. In some implementations, the metadata-based metric computation module 705 is configured to perform this computation and the echo reference characteristics 733 include these metrics.
The rendering matrix for each device contains all information needed to estimate the spatial realization of an audio object (such as an Atmos audio object) for the audio device configuration shown in
One simple implementation involves computing a device-wise rendering matrix covariance matrix and using the covariance matrix as a proxy for covariance of the resultant speaker feeds. We refer to this herein as an “uninformed rendering covariance matrix” or an “uninformed rendering correlation matrix.”
We can see how the rendering matrix itself contains spatial information from which the inter-device audibility can be estimated. Even in its simplest form, one can use the uninformed rendering correlation matrix to obtain audibility rankings of each device as heard from every other device. Moreover, a complete uninformed rendering correlation matrix will also contain information about how this audibility varies in frequency.
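By way of illustration only, an uninformed rendering covariance matrix and the corresponding correlation matrix may be sketched as follows. The sketch assumes that, for one frequency band, each device's rendering filters have been reduced to a gain magnitude per flattened spatial grid point. That reduction, the function names and the example gains are hypothetical.

```python
from math import sqrt


def rendering_correlation(gains):
    """Device-wise covariance and correlation of rendering gains over the grid.

    gains: list of N lists, one per device, each holding the device's gain
    magnitude at every flattened grid point for one band. Each device's
    gains must vary over the grid, otherwise its variance is zero.
    """
    n = len(gains)
    means = [sum(g) / len(g) for g in gains]
    centered = [[v - m for v in g] for g, m in zip(gains, means)]
    cov = [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / len(gains[i])
         for j in range(n)]
        for i in range(n)
    ]
    corr = [
        [cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(n)]
        for i in range(n)
    ]
    return cov, corr


def audibility_ranking(corr, device):
    # Rank the other devices by correlation with `device`, highest first.
    others = [j for j in range(len(corr)) if j != device]
    return sorted(others, key=lambda j: corr[device][j], reverse=True)


GAINS = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
cov, corr = rendering_correlation(GAINS)
print(audibility_ranking(corr, device=0))
```

Repeating the computation per frequency band would give an indication of how the ranking varies in frequency.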
Similarly, some implementations involve transforming audio object spatial metadata (which may be a component of the audio object metadata 702) into a metric that may be readily consumed by the echo reference importance estimator 401. In some examples, the metadata-based metric computation module 705 may be configured to make such transformations.
It was noted in the discussion above that in some implementations, the importance metric I may be based on one or more of the following:
As used herein, the acronym “LUPA” refers generally to echo reference characteristics from which the importance metric may be determined, including but not limited to one or more of L, U, P and/or A. As noted above, the rendering matrix includes audibility information, which is the “A” component of LUPA. Other LUPA parameters may be estimated based on the rendering matrix and spatial data. Some implementations estimate LUPA parameters by determining a statistic based on aggregate spatial data that is highly correlated with one or more LUPA parameters.
In some examples, the audio object spatial metadata indicates the spatio-temporal distribution of each audio source in the received audio data bit stream. Some implementations involve computing the amount of time an audio object is present in each spatial grid tile. Some such implementations involve producing 3-D heatmaps of “counts” for each audio object channel.
In each graph, the coordinates x, y and z denote the length, width and height of Atmos bins for an acoustic space, which is an example of a cubic audio environment. In each graph, a sphere in a particular location indicates a “count,” an instance of time during which a corresponding audio object was in the corresponding (x, y, z) location during the song. In some implementations, audio object counts may be used as a basis for estimating P, the temporal persistence of an echo reference.
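By way of illustration only, the per-tile audio object counts may be sketched as follows. The sketch assumes that the spatial metadata provides one (x, y, z) position per time interval and that the room extends from 0 to a known size along each axis. The function name and the default room size are hypothetical.

```python
from collections import Counter


def object_counts(positions, grid_shape, room_size=(1.0, 1.0, 1.0)):
    """Count the time intervals an audio object spends in each spatial grid tile.

    positions: iterable of (x, y, z), one per time interval, within the room.
    grid_shape: (nx, ny, nz) number of tiles along each axis.
    """
    counts = Counter()
    for pos in positions:
        tile = tuple(
            min(int(p / size * n), n - 1)  # the far wall maps to the last tile
            for p, size, n in zip(pos, room_size, grid_shape)
        )
        counts[tile] += 1
    return counts


track = [(0.1, 0.1, 0.1), (0.15, 0.1, 0.1), (0.9, 0.9, 1.0)]
print(object_counts(track, grid_shape=(2, 2, 2)))
```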
Some implementations use audio object counts as the basis for a spatial importance weighting. The spatial importance weighting may, in some examples, be used along with various other types of importance metrics, such as audibility metrics. For example, if a spatial importance weighting is used in conjunction with an “uninformed rendering correlation matrix” such as those described with reference to
As noted above, in some implementations U may be based, at least in part, on a metric of correlation between each echo reference. In some examples, the spatially informed correlation matrix may be used as a proxy for an audio data-based correlation metric (for example, a correlation matrix based on PCM data for each echo reference) to produce an importance metric for input to the echo reference importance estimator 401.
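By way of illustration only, one entry of a spatially informed correlation matrix may be sketched as a weighted correlation between two devices' rendering gains, in which each grid point is weighted by a spatial importance weight such as an audio object count. This weighting scheme is an assumption made for this sketch, and the function name and example values are hypothetical.

```python
from math import sqrt


def weighted_correlation(g_a, g_b, weights):
    """Correlation of two devices' rendering gains, weighted per grid point.

    g_a, g_b: gains of two devices at each flattened grid point.
    weights: non-negative spatial importance weights (e.g., object counts).
    """
    total = sum(weights)
    mean_a = sum(w * a for w, a in zip(weights, g_a)) / total
    mean_b = sum(w * b for w, b in zip(weights, g_b)) / total
    cov = sum(w * (a - mean_a) * (b - mean_b) for w, a, b in zip(weights, g_a, g_b))
    var_a = sum(w * (a - mean_a) ** 2 for w, a in zip(weights, g_a))
    var_b = sum(w * (b - mean_b) ** 2 for w, b in zip(weights, g_b))
    return cov / sqrt(var_a * var_b)


# The third grid point never contains an audio object, so it is ignored.
print(weighted_correlation([0.0, 1.0, 5.0], [0.0, 1.0, -5.0], [1, 1, 0]))
```

Grid points in which audio objects are rarely present contribute little to the result, so the rankings may differ from those of an unweighted computation.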
One may observe that the rankings of the spatially informed correlation matrix differ from those of the uninformed rendering correlation matrix. Except for the ranking corresponding to audio device 4, the highest-ranked non-local echo references according to the uninformed rendering correlation matrix differ from those of the spatially informed rendering correlation matrix. Referring first to
One way to compare the utility of the approximation of a PCM-based correlation matrix via a spatially informed correlation matrix would be to evaluate the resulting non-local reference management schemes implemented by a local device based on each of these metrics. A simple indicator of how close the approximation is would be a comparison of the echo reference ranks produced by the echo reference importance estimator 401 based on each type of metric.
In this example, the echo reference importance rankings shown in
In some implementations, the LUPA estimates are based on an assumption that the spatial scene is stationary within an estimation time window. The LUPA estimates will eventually reflect any notable change in the spatially rendered scene after some (variable) time delay. This means that during significant audio scene changes, the echo references selected using these estimates, as well as the virtual echo references generated, may be incorrect. In some examples, the echo return loss enhancement (ERLE) may decrease beyond operating limits, which could lead to echo management system instabilities. Such conditions can also trigger fast reference switching that might not be actually needed, but which is an artifact of the changing scene dynamics. To guard against these potential negative outcomes, we disclose herein two additions to the upstream data processing:
These additions have various potential advantages. For example, the audio scene change messages 715 may enable the echo reference importance estimator 401 and the metadata-based metric computation module 705 to dump their histories and reset their memory buffers, thereby enabling a fast response to an audio scene change. A “fast” response may be on the order of hundreds of milliseconds (such as 300 to 500 milliseconds). Such fast responses may, for example, avoid the risk of AEC divergence.
In some examples, audio object metadata files contain spatial coordinates for each time interval. This spatial metadata may be used as input to, for example, the scene change analyzer 755 of
A key point, however, is to have a look-ahead buffer of values pertaining to audio scene changes during a look-ahead time window (e.g., 5 seconds, 8 seconds, 10 seconds, 12 seconds, etc.). Input from this look-ahead buffer can enable the scene change analyzer 755 to estimate the similarity of the current rendered audio scene when compared with the near-future rendered audio scene. This information can then be used by the metadata-based metric computation module 705 and the echo reference importance estimator 401 to temper the rate of adaptation of their metrics. In some implementations, the audio scene change messages 715 are device-specific, because an audio device only needs information regarding audio scene changes within that audio device's audible spatial grid subset (the grid subset that significantly affects the operation of MC-EMS 203).
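By way of illustration only, the comparison of the current rendered audio scene with the near-future rendered audio scene may be sketched as a cosine similarity between two per-tile audio object count maps, one for the current window and one for the look-ahead window. The similarity measure, the threshold and the function names are assumptions made for this sketch.

```python
from math import sqrt


def scene_similarity(current_counts, future_counts):
    """Cosine similarity between two per-tile object-count maps (dicts)."""
    tiles = set(current_counts) | set(future_counts)
    dot = sum(current_counts.get(t, 0) * future_counts.get(t, 0) for t in tiles)
    norm_a = sqrt(sum(v * v for v in current_counts.values()))
    norm_b = sqrt(sum(v * v for v in future_counts.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scene_changed(current_counts, future_counts, threshold=0.5):
    # Hypothetical rule: flag a scene change when the similarity is low.
    return scene_similarity(current_counts, future_counts) < threshold


now = {(0, 0, 0): 2}
soon = {(1, 1, 1): 3}
print(scene_changed(now, soon))
```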
According to some examples, an example importance metric (I(t)) at time t could be expressed as follows:
In the foregoing equation, i represents a spatial grid index, n represents a look-ahead window, Ci(t+k) represents the audio object count at look-ahead time k, and αik and βik represent predefined coefficients per spatial grid point depending on the audio device configuration. In most cases αik and βik are less than 1. Such an importance metric can be designed to approximate a weighted object density, or the cumulative object persistence within the spatial and temporal region of interest.
With previously-deployed metadata schemes, the renderer would only have access to spatial data up to the current time t. In contrast, some disclosed implementations augment the audio data stream with audio object spatial coordinates within the look-ahead window n in the above example.
An integral part of the metadata-based scene analysis for the purpose of echo management is the EMS health data 423 determined by the MC-EMS performance model 405. The EMS health data 423 is highly sensitive to significant audio scene changes and may, for example, indicate EMS divergence caused by such audio scene changes.
Because information regarding such audio scene changes is, in some disclosed examples, now conveyed ahead in time (e.g., via EMS look-ahead statistics 732 from the metadata-based metric computation module 705 and/or audio scene change messages 715 from the scene change analyzer 755), some implementations of the echo reference orchestrator 302 may be configured to use such audio scene change information to predict the EMS health data 423 (e.g., via the MC-EMS performance model 405). If the MC-EMS performance model 405 predicts, for example, a possible EMS filter divergence based on one or more EMS look-ahead statistics 732 and/or audio scene change messages 715, according to some disclosed examples the MC-EMS performance model 405 may be configured to provide corresponding EMS health data 423 to the echo reference importance estimator 401 and the echo reference selector 402, which can reset their algorithms accordingly.
In some examples, the MC-EMS performance model 405 may be configured to implement an embodiment of EMS health prediction based on a regression model based on scene change importance look-ahead data, e.g., as follows:
$$A(t+k) = f\left(\{I_{ik}\}\right)$$
In the foregoing equation, A represents EMS health data, f represents a regression function (which may be linear or non-linear) and the set {Iik} represents the set of importance values for the total look-ahead window and spatial grid set.
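By way of illustration only, a linear instance of the regression function f may be sketched as a weighted sum over the set of importance values plus a bias term. The coefficients would need to be fitted to observed EMS health data. The values shown are placeholders, and the function name is hypothetical.

```python
def predict_ems_health(importance, weights, bias=0.0):
    """Linear regression instance: bias plus a weighted sum over {I_ik}.

    importance, weights: dicts keyed by (grid index i, look-ahead step k).
    Keys without a fitted weight contribute nothing to the prediction.
    """
    return bias + sum(
        weights.get(key, 0.0) * value for key, value in importance.items()
    )


# Placeholder importance values and coefficients, for illustration only.
I_SET = {(0, 1): 2.0, (1, 1): 4.0}
W_SET = {(0, 1): 0.5, (1, 1): 0.25}
print(predict_ems_health(I_SET, W_SET, bias=1.0))
```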
We can reduce the complexity of the echo-generation process by approximating the salient characteristics of the non-local echo references and producing a minimal set of echo references needed to achieve a satisfactory ERLE performance at each device. The resulting virtual echo references (also referred to herein as virtual sound sources) can even lead to an improved ERLE at the device EMS output, compared to a device-wise far echo reference set. The use of virtual echo references can provide important benefits, particularly when the number of audio devices in the audio environment is large (e.g., >10). In such cases, the non-uniqueness of echo references can lead to prohibitively high ERLEs and EMS failure, if device-wise PCM based algorithms are used.
The position O may, for example, be obtained using room mapping data (such as audio device location data) available to the renderer 201 or the echo reference generator 710 via an initial and/or periodic calibration step. For example, if all speakers have no occlusions, one may determine the position O according to the centroid position of the cumulative far device heatmap, which may be generated by adding the rendering matrix slices (e.g., as shown in
The virtual sound source D corresponds to the playback of audio devices B and C from the perspective of audio device A. Virtual sound source D is the equivalent sound source at position O that creates the same non-local audio device playback sound field that the separate played-back audio from audio devices B and C would create at the location of audio device A. One should note that the virtual source D need not approximate the full sound field that the separate played-back audio from audio devices B and C would create in all parts of the audio environment.
A lower-dimensional approximation of D for far device echo references can be realized using different approaches, a few of which are described herein. In general, these methods may involve finding a weight matrix $\vec{W}$ and an input subspace matrix $\vec{X}$ (e.g., a PCM matrix or a principal component analysis (PCA) matrix), such that the device echo reference frame $\vec{d}[t]$ (e.g., the PCM frame) for time t can be found as
$$\vec{d}[t] = \vec{W}[t]\,\vec{X}[t] \quad \text{(Equation 0)}$$
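By way of illustration only, Equation 0 may be sketched for a single virtual echo reference as follows. The weight matrix is assumed here to reduce to one row of M weights, and the input subspace matrix is assumed to hold M rows of equal length (for example, PCM frames of the non-local devices). These shapes and the function name are assumptions made for this sketch.

```python
def virtual_reference_frame(weights, sources):
    """Compute one frame of d[t] = W[t] X[t] for a single virtual reference.

    weights: M weights, i.e., a 1 x M weight matrix for the current frame.
    sources: M rows of equal length (e.g., PCM frames or principal components).
    """
    frame_len = len(sources[0])
    return [
        sum(w * row[n] for w, row in zip(weights, sources))
        for n in range(frame_len)
    ]


# Two non-local device frames mixed with equal weights.
print(virtual_reference_frame([0.5, 0.5], [[1.0, 2.0], [3.0, 4.0]]))
```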
In the examples described below, we use frequency domain separations (low frequency and high frequency), audio object-based methods, and statistical independence-based methods as example approaches to create virtual sound sources.
At low frequencies, the renderer produces references differently due to the differing capabilities of loudspeakers regarding playback of content at these frequencies. The particular low frequency range may depend on details of the particular implementation, such as loudspeaker capabilities. In some implementations in which the capability of one or more loudspeakers in the audio environment for reproducing sound in the bass range is minimal, the range of low frequencies may be 400 Hz or less, whereas for other implementations the range of low frequencies may be 350 Hz or less, 300 Hz or less, 250 Hz or less, 200 Hz or less, etc. In the context of a multichannel echo canceller, the reference signals used for cancellation in the low frequencies can be determined using the renderer configuration. Rather than passing all or a subset of the echo references available, a weighted summation of the echo references over a proportion of low frequencies can be used. The amount of crossover with cancellation of higher frequencies may also be considered. Given a weighting w, for any echo reference r, at frequency k, the chosen echo references can be represented as:
$$R_k = \sum_{i=1}^{n} w_{i,k}\, r_{i,k} \quad \text{(Equation A)}$$
In Equation A, the superscript n represents the total number of echo references. The weighting, the range of low frequencies to use this summation over and the proportion of crossover with higher frequency cancellation may be extracted from rendering information in some examples. Examples of weighting and low frequency ranges are described below. The weighting and low frequency ranges may be based, at least in part, on individual loudspeaker capability and limitations, and how content may be rendered for each device. One motivation for implementing low-frequency management methods is to avoid the non-uniqueness problem and high cross-correlation between echo references at low frequencies.
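The weighted summation of Equation A can be illustrated with a minimal sketch. It assumes that each echo reference is available as a list of frequency-bin values and that the low-frequency range is the first `num_low_bins` bins; the function name, the parameter names and the example values are hypothetical and are not taken from this disclosure.

```python
from typing import List, Sequence


def weighted_low_frequency_reference(
    references: Sequence[Sequence[complex]],
    weights: Sequence[Sequence[float]],
    num_low_bins: int,
) -> List[complex]:
    """Return R_k = sum_i w_{i,k} * r_{i,k} for each bin k < num_low_bins."""
    if len(references) != len(weights):
        raise ValueError("one weight vector is required per echo reference")
    summed = []
    for k in range(num_low_bins):
        # Sum over the n echo references at frequency bin k.
        summed.append(sum(w[k] * r[k] for w, r in zip(weights, references)))
    return summed


# Two hypothetical echo references with three low-frequency bins each.
refs = [[1 + 0j, 2 + 0j, 4 + 0j], [3 + 0j, 2 + 0j, 0 + 0j]]
wts = [[0.5, 0.5, 1.0], [0.5, 0.5, 0.0]]
low_ref = weighted_low_frequency_reference(refs, wts, num_low_bins=3)
```

In this sketch the third bin uses only the first reference (weight 1.0), which mirrors the case in which a single reference dominates part of the low-frequency range.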
In some implementations, the low-frequency management module 1410 may be configured to select frequencies and/or generate weights based, at least in part, on rendering information, such as information regarding the rendering matrix 722 (or the rendering matrix 722 itself).
The frequencies to perform low frequency management over could be based on a hard cut-off frequency or on a range of frequencies in a crossover frequency range. A crossover frequency range may be desirable to account for differing loudspeaker capabilities in an overlapping frequency region where certain audio device echo references have lower frequency content than the summed reference. For example, a crossover frequency range may be desirable when a subwoofer is present and may be considered the dominant or only reference at the majority of lower frequencies. The cut-off frequency or range of frequencies in a crossover frequency range may be included in rendering information, which may take into account the loudspeaker capabilities of audio devices in the audio environment. In some examples, the cut-off frequency may have a value of a few hundred Hz, such as 200 Hz, 250 Hz, 300 Hz, 350 Hz, 400 Hz, etc. According to some examples, the crossover frequency range may have a low end of 100 Hz, 150 Hz, 200 Hz, etc., and may include frequencies up to 200 Hz, 250 Hz, 300 Hz, 350 Hz, 400 Hz, etc.
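One possible way to express the difference between a hard cut-off and a crossover frequency range is sketched below. The linear ramp across the crossover range is an assumption made purely for illustration; this disclosure does not specify the shape of the crossover, and the function and parameter names are hypothetical.

```python
def summed_reference_weight(freq_hz: float, low_hz: float, high_hz: float) -> float:
    """Fraction of the summed low-frequency reference used at freq_hz.

    Returns 1.0 at or below low_hz, 0.0 at or above high_hz, and an
    assumed linear ramp in between. Setting low_hz == high_hz yields a
    hard cut-off frequency.
    """
    if high_hz < low_hz:
        raise ValueError("high_hz must not be below low_hz")
    if freq_hz <= low_hz:
        return 1.0
    if freq_hz >= high_hz:
        return 0.0
    return (high_hz - freq_hz) / (high_hz - low_hz)


# Crossover range of 200 Hz to 400 Hz, values taken from the examples above.
mid_crossover = summed_reference_weight(300.0, 200.0, 400.0)
# Hard cut-off at 300 Hz.
above_cutoff = summed_reference_weight(350.0, 300.0, 300.0)
```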
In some implementations, weights may be applied according to audio device configuration and capabilities, such as only using a local or subwoofer reference for low-frequency playback if a subwoofer is present. If a subwoofer is present, it will generally be desirable for most low-frequency audio content to be played back by the subwoofer rather than for the low-frequency audio content to be provided to audio devices that may be unable to play back audible lower frequencies without distortion. According to some implementations that do not include a subwoofer and in which the audio devices have the same (or similar) capabilities for low-frequency audio reproduction, the low frequencies reproduced by all the audio devices may be the same, in order to maximize power and thereby obtain audible lower-frequency performance. In some such implementations, the reproduced low-frequency audio may be monophonic/non-directional. In some such instances, the weighting may be 1.0 for a single reference. This is equivalent to monophonic-only echo cancellation below a certain frequency, which may be referred to herein as “max_mono_hz.”
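The configuration-dependent weighting described above might be sketched as follows. The sketch assumes that, when no subwoofer is present, the single reference that receives a weight of 1.0 is the local device's reference; that choice, the function name and the device identifiers are assumptions for illustration only.

```python
from typing import Dict, Optional, Sequence


def low_frequency_weights(
    device_ids: Sequence[str],
    local_id: str,
    subwoofer_id: Optional[str] = None,
) -> Dict[str, float]:
    """Weight per echo reference for frequencies below max_mono_hz."""
    # Prefer the subwoofer reference when a subwoofer is present; otherwise
    # fall back to a single (here: local) reference, i.e. monophonic-only
    # echo cancellation below max_mono_hz.
    chosen = subwoofer_id if subwoofer_id is not None else local_id
    if chosen not in device_ids:
        raise ValueError("chosen reference is not among the known devices")
    return {dev: (1.0 if dev == chosen else 0.0) for dev in device_ids}


no_sub = low_frequency_weights(["A", "B", "C"], local_id="A")
with_sub = low_frequency_weights(["A", "B", "SUB"], local_id="A", subwoofer_id="SUB")
```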
As used herein, a “higher frequency” may refer to any audible frequency above one of the low-frequency ranges described with reference to
Therefore, rendering of echo references that have substantial high-frequency components may be relatively more complicated as compared to rendering of virtual references that have mainly lower-frequency components. For example, some high-frequency management implementations may involve multiple instances of Equation A, each for a different portion of a high-frequency range and each with potentially different weighting factors. Non-uniqueness and the associated AEC divergence are lower risks in higher-frequency bands.
However, some examples exploit frequency sparsity of some audio content to manage multi-band reference generation. Creating a mix that is based at least in part on frequency-dependent audibility differences can eliminate the need for having multiple echo references without degrading AEC health. In some such examples, the rendering implementation may be similar to the frequency management implementations for echo references. In some such examples, only the weight generator and frequency selector parameters may be different.
According to this example, each of the frequency management modules 1410A-1410K is configured to function generally as described above with reference to the low-frequency management module 1410 of
Some disclosed subspace-based examples involve defining lower-dimensional embedding via statistical properties. Some subspace-based examples involve using methods such as independent component analysis or principal component analysis. By implementing such methods, a control system may be configured to find the K statistically independent audio streams that approximate the non-local references.
In this example, the echo reference generator 710 works in tandem with the echo reference importance estimator 401 and the echo reference selector 402: in this example, blocks 1620, 1655 and 1660 are implemented by the echo reference importance estimator 401 and/or the echo reference selector 402, and blocks 1625-1650 are implemented by the echo reference generator 710.
In the example, method 1600 starts with block 1601, after which an initial local audio device and an initial non-local (“far”) audio device are selected in block 1605. In block 1610, it is determined whether all local audio devices have been processed. If so, the process stops (block 1615). However, if it is determined in block 1610 that the current local audio device has not been processed, the process continues to block 1620.
According to this example, block 1620 involves determining whether each far device has been evaluated for the current local audio device. If not, the process continues to block 1655, in which it is determined whether the echo reference characteristics (e.g., LUPA values) for the current far device's audio stream exceed a threshold value. In some examples, the threshold may be a long-term function of the audio device configuration (such as the audio device layout and audio device capabilities), characteristics of the audio environment and playback content. According to some such examples, this threshold may be approximated as the long-term mean of the echo reference characteristics for the current audio device configuration, audio environment and content type. In this context, “long-term” may be hours or days. In some examples, playback may not be continuous during the “long-term” time interval. Accordingly, this example involves selecting a subset of far devices based on the echo reference characteristics 733 that are output by the metadata-based metric computation module 705 (for example, LUPA scores). The current playback frames of the selected far devices form the PCM matrix P for the local device currently being evaluated. Accordingly, if it is determined in block 1655 that the echo reference characteristics for the current far device's audio stream exceed a threshold value, in block 1660 the far device's audio frame is added as a column of the PCM matrix P. In this example, the next far device (if any) is selected in block 1662 and then the process continues to block 1620.
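The far-device loop of blocks 1620, 1655 and 1660 can be sketched as follows. The sketch assumes that each far device has a single scalar score (for example, a LUPA score), that the threshold is the mean of a stored history of such scores, and that all frames have equal length; the function name, the data layout and the example values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def build_pcm_matrix(
    frames: Dict[str, Sequence[float]],
    scores: Dict[str, float],
    score_history: Sequence[float],
) -> Tuple[List[str], List[List[float]]]:
    """Select far devices whose score exceeds the long-term mean and stack
    their current frames as the columns of the PCM matrix P."""
    threshold = mean(score_history)  # assumed approximation of the threshold
    selected = [dev for dev in frames if scores[dev] > threshold]
    frame_len = min((len(frames[dev]) for dev in selected), default=0)
    # Rows are samples; each column is one selected far device's frame.
    p = [[frames[dev][row] for dev in selected] for row in range(frame_len)]
    return selected, p


far_frames = {"B": [0.1, 0.2], "C": [0.3, 0.4], "E": [0.5, 0.6]}
far_scores = {"B": 0.9, "C": 0.2, "E": 0.7}
selected_devices, pcm_matrix = build_pcm_matrix(far_frames, far_scores, [0.4, 0.6, 0.5])
```

With these example values, far device C falls below the threshold and contributes no column to P.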
After it is determined in block 1620 that all far devices have been evaluated for the current local audio device, the process continues to blocks that are implemented by the echo reference generator 710. In this example the process continues to block 1625, which involves obtaining the PCM matrix P (e.g., from a memory). According to this example, a dimension reduction is done to reduce any feature redundancy. The dimension reduction may, for example, be achieved by a method such as Principal Component Analysis (PCA). Other examples may implement other methods of dimension reduction. In the PCA example shown in
$$P_c = P - \operatorname{mean}(P)$$
In this example, the Covariance matrix C is computed in block 1635 as
In the foregoing equation, n represents the number of rows in the PCM matrix. According to this example, block 1640 involves performing eigen decomposition to determine the eigen value matrix D and the eigen vector matrix V such that
In this example, only eigen values that are greater than a threshold T are retained and the redundant features are discarded. An example realization of such a threshold could be constructed using energy based approximation. Given D is a diagonal matrix with values decreasing along the left diagonal, we can define DT by retaining the most significant number of eigenvalues that contain a percentage (in this example, 90%) of the signal energy.
In other examples, a different percentage may be used (such as 75%, 80%, 85%, 95%, etc.). Accordingly, in this example block 1645 involves determining a truncated eigen value matrix DT and a truncated eigen vector matrix VT. The truncated eigen value matrix DT is one example of the weight matrix in Equation 0 and the corresponding eigen vectors in the truncated matrix VT are, collectively, an example of the input matrix X in Equation 0. Therefore, in this example block 1650 involves determining an echo reference by multiplying DT by VT.
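The energy-based truncation of block 1645 can be illustrated with a minimal sketch. It assumes that the eigen decomposition has already been performed, that the eigenvalues are non-negative and sorted in decreasing order, and that the eigenvectors are stored as the columns of a row-major list of lists; the function names and example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def count_retained(eigenvalues: Sequence[float], energy_fraction: float = 0.9) -> int:
    """Smallest K such that the K largest eigenvalues hold at least
    energy_fraction of the total energy (sum of eigenvalues)."""
    total = sum(eigenvalues)
    if total <= 0:
        return 0
    running = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        running += value
        if running >= energy_fraction * total:
            return k
    return len(eigenvalues)


def truncate(
    eigenvalues: Sequence[float],
    eigenvectors: Sequence[Sequence[float]],
    energy_fraction: float = 0.9,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (D_T, V_T): the K-by-K diagonal matrix of retained eigenvalues
    and the first K eigenvector columns."""
    k = count_retained(eigenvalues, energy_fraction)
    d_t = [[eigenvalues[i] if i == j else 0.0 for j in range(k)] for i in range(k)]
    v_t = [list(row[:k]) for row in eigenvectors]
    return d_t, v_t


example_eigenvalues = [7.0, 2.5, 0.3, 0.2]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
d_t, v_t = truncate(example_eigenvalues, identity)
```

In this example the first two eigenvalues carry 95% of the total energy, so two components are retained at the 90% setting.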
In the example shown in
The method 1700 may be performed by an apparatus or system, such as the apparatus 50 that is shown in
In this implementation, block 1705 involves receiving, by a control system, location information for each of a plurality of audio devices in an audio environment. In some examples, the location information may be included in the metadata 312 that is disclosed herein, which may include information corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, etc. According to some examples, block 1705 may involve receipt of the location information by a renderer, such as the renderer 201 described herein (see, for example,
In this example, block 1710 involves generating, by the control system and based at least in part on the location information, rendering information for a plurality of audio devices in an audio environment. In some examples, the rendering information may be, or may include, a matrix of loudspeaker activations. According to some examples, method 1700 may involve rendering the audio data, based at least in part on the rendering information, to produce rendered audio data. In some such examples, the control system may be an orchestrating device control system. In some such implementations, method 1700 may involve providing at least a portion of the rendered audio data to each audio device of the plurality of audio devices in the audio environment.
In this implementation, block 1715 involves determining, by the control system and based at least in part on the rendering information, a plurality of echo reference metrics. In this example, each echo reference metric of the plurality of echo reference metrics corresponds to audio data reproduced by one or more audio devices of the plurality of audio devices. In some such examples, the control system may be an orchestrating device control system. In some such implementations, method 1700 may involve providing at least one echo reference metric to each audio device of the plurality of audio devices.
In some examples, method 1700 may involve receiving, by the control system, a content stream that includes audio data and corresponding metadata. In some such examples, determining the at least one echo reference metric may be based, at least in part, on loudspeaker metadata, metadata corresponding to received audio data and/or an upmixing matrix.
According to some examples, block 1715 may be performed, at least in part, by the metadata-based metric computation module 705 of
In some examples, method 1700 may involve making, by the control system and based at least in part on the echo reference metrics, an importance estimation for each echo reference of a plurality of echo references. In some such implementations, the control system may be an audio device control system. According to some implementations, the echo reference importance estimator 401 may make the importance estimation. According to some examples, making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by an echo management system of an audio device of the audio environment. The echo management system may include an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES. The echo management system may be, or may include, an instance of the MC-EMS 203 disclosed herein.
According to some examples, making the importance estimation may involve determining an importance metric for a corresponding echo reference. In some examples, determining the importance metric may be based at least in part on one or more of a current listening objective or a current ambient noise estimate.
Some such examples may involve selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references. According to some examples, the echo references may be selected by an instance of the echo reference selector 402 disclosed herein. Some examples may involve providing, by the control system, the one or more selected echo references to the at least one echo management system.
Some examples may involve making, by the control system, a cost determination. In some such examples, the cost estimation module 403 may be configured to make the cost determination. The cost determination may, for example, involve determining a cost for at least one echo reference of the plurality of echo references. In some such examples, selecting the one or more selected echo references may be based, at least in part, on the cost determination. According to some examples, the cost determination may be based on the network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference and/or an echo management system computational requirement for use of the at least one echo reference by the at least one echo management system.
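A selection that combines the importance estimation with the cost determination might be sketched as follows. The additive cost model, the greedy selection strategy, the single cost budget and all names and values are assumptions made for illustration; this disclosure does not prescribe a particular way of combining importance and cost.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EchoReference:
    name: str
    importance: float      # output of the importance estimation
    bandwidth_cost: float  # network bandwidth for transmission
    encode_cost: float     # encoding computational requirement
    decode_cost: float     # decoding computational requirement
    ems_cost: float        # echo management system computational requirement

    @property
    def total_cost(self) -> float:
        # Assumed additive cost model with a common, arbitrary unit.
        return self.bandwidth_cost + self.encode_cost + self.decode_cost + self.ems_cost


def select_references(candidates: Sequence[EchoReference], budget: float) -> List[str]:
    """Greedily pick the most important references that fit within budget."""
    selected: List[str] = []
    spent = 0.0
    for ref in sorted(candidates, key=lambda r: r.importance, reverse=True):
        if spent + ref.total_cost <= budget:
            selected.append(ref.name)
            spent += ref.total_cost
    return selected


candidates = [
    EchoReference("far_C", 0.3, 1.0, 1.0, 1.0, 2.0),
    EchoReference("local", 1.0, 0.0, 0.0, 0.0, 2.0),
    EchoReference("far_B", 0.8, 1.0, 1.0, 1.0, 2.0),
]
chosen = select_references(candidates, budget=8.0)
```

Here the local reference incurs no transmission, encoding or decoding cost, so it and the more important far reference fit within the budget while the least important far reference is dropped.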
Some implementations may involve determining, by the control system, a current echo management system performance level. In some such examples, the MC-EMS performance model 405 may be configured to determine the current echo management system performance level. According to some examples, the importance estimation may be based, at least in part, on the current echo management system performance level.
Some examples may involve receiving, by the control system, scene change metadata. In some examples, the importance estimation may be based, at least in part, on the scene change metadata. In some implementations, the scene change analyzer 755 may receive the scene change metadata and may generate one or more scene change messages 715. In some such examples, the importance estimation may be based, at least in part, on one or more scene change messages 715.
In some examples, method 1700 may involve generating, by the control system, at least one echo reference. In some instances, at least one echo reference may be generated by the echo reference generator 710. According to some examples, the echo reference generator 710 may generate at least one echo reference based, at least in part, on one or more components of the metadata 312, such as a matrix of loudspeaker activations (e.g., the rendering matrix 722). In some examples, method 1700 may involve generating, by the control system, at least one virtual echo reference. A virtual echo reference may, for example, correspond to two or more audio devices of the plurality of audio devices.
In some examples, method 1700 may involve generating (e.g., by the echo reference generator 710) one or more subspace-based non-local device echo references. In some examples, the subspace-based non-local device echo references may include low-frequency non-local device echo references. Some such examples may involve determining, by the control system, a weighted summation of echo references over a range of low frequencies. Some such examples may involve providing the weighted summation to an echo management system. Some implementations may involve causing the echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
According to this example, the environment 1800 includes a living room 1810 at the upper left, a kitchen 1815 at the lower center, and a bedroom 1822 at the lower right. Boxes and circles distributed across the living space represent a set of loudspeakers 1805a-1805h, at least some of which may be smart speakers in some implementations, placed in locations convenient to the space, but not adhering to any standard prescribed layout (arbitrarily placed). In some examples, the television 1830 may be configured to implement one or more disclosed embodiments, at least in part. In this example, the environment 1800 includes cameras 1811a-1811e, which are distributed throughout the environment. In some implementations, one or more smart audio devices in the environment 1800 also may include one or more cameras. The one or more smart audio devices may be single purpose audio devices or virtual assistants. In some such examples, one or more cameras of the optional sensor system 130 may reside in or on the television 1830, in a mobile phone or in a smart speaker, such as one or more of the loudspeakers 1805b, 1805d, 1805e or 1805h. Although cameras 1811a-1811e are not shown in every depiction of the audio environments presented in this disclosure, each of the audio environments may nonetheless include one or more cameras in some implementations.
Various features and aspects will be appreciated from the following enumerated exemplary embodiments (“EEEs”):
EEE1. An audio processing method, comprising:
EEE2. The audio processing method of EEE1, further comprising causing the at least one echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
EEE3. The audio processing method of EEE1 or EEE2, wherein obtaining the plurality of echo references involves:
EEE4. The audio processing method of EEE3, wherein the control system comprises an audio device control system of an audio device of the plurality of audio devices in the audio environment, further comprising:
EEE5. The audio processing method of EEE4, wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to non-local speaker feed signals for playback on another audio device of the audio environment.
EEE6. The audio processing method of EEE4, wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to non-local speaker feed signals for playback on another audio device of the audio environment.
EEE7. The audio processing method of EEE6, wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.
EEE8. The audio processing method of EEE6, wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.
EEE9. The audio processing method of any one of EEEs 1-8, further comprising a cost determination, the cost determination involving determining a cost for at least one echo reference of the plurality of echo references, wherein selecting the one or more selected echo references is based, at least in part, on the cost determination.
EEE10. The audio processing method of EEE9, wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or combinations thereof.
EEE11. The audio processing method of EEE9 or EEE10, wherein the cost determination is based on a replica of the at least one echo reference in a time domain or a frequency domain, on a downsampled version of the at least one echo reference, on a lossy compression of the at least one echo reference, on banded power information for the at least one echo reference, or combinations thereof.
EEE12. The audio processing method of any one of EEEs 9-11, wherein the cost determination is based on a method of compressing a relatively more important echo reference less than a relatively less important echo reference.
EEE13. The audio processing method of any one of EEEs 1-12, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based, at least in part, on the current echo management system performance level.
EEE14. The audio processing method of any one of EEEs 1-13, wherein making the importance estimation involves determining an importance metric for a corresponding echo reference.
EEE15. The audio processing method of EEE14, wherein determining the importance metric involves determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or combinations thereof.
EEE16. The audio processing method of EEE14 or EEE15, wherein determining the importance metric is based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a matrix of loudspeaker activations, or combinations thereof.
EEE17. The audio processing method of any one of EEEs 14-16, wherein determining the importance metric is based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or combinations thereof.
EEE18. An apparatus configured to perform the method of any one of EEEs 1-17.
EEE19. A system configured to perform the method of any one of EEEs 1-17.
EEE20. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform the method of any one of EEEs 1-17.
Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
While specific embodiments of the present disclosure and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the disclosure described and claimed herein. It should be understood that while certain forms of the disclosure have been shown and described, the disclosure is not to be limited to the specific embodiments described and shown or the specific methods described.
Number | Date | Country | Kind |
---|---|---|---|
21177382.5 | Jun 2021 | EP | regional |
This application claims the benefit of priority from EP Patent Application No. 21177382.5, filed on Jun. 2, 2021, U.S. Provisional Patent Application No. 63/201,939, filed on May 19, 2021, and U.S. Provisional Patent Application No. 63/147,573, filed on Feb. 9, 2021, which are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/015436 | 2/7/2022 | WO |
Number | Date | Country |
---|---|---|
63147573 | Feb 2021 | US |
63201939 | May 2021 | US |