This disclosure pertains to systems and methods for selecting and implementing audio filters, including but not limited to audio filters for acoustic echo cancellation (AEC), beam steering, active noise cancellation (ANC) or dereverberation.
Methods, devices and systems for selecting and implementing audio filters are widely deployed. Although existing devices, systems and methods for selecting and implementing audio filters provide benefits, improved systems and methods would be desirable.
Throughout this disclosure, including in the claims, the terms “speaker,” “loudspeaker” and “audio reproduction transducer” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or multiple speaker feeds. In some examples, the speaker signal(s) may undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
Throughout this disclosure including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
As used herein, a “smart device” is an electronic device, generally configured for communication with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, etc., that can operate to some extent interactively and/or autonomously. Several notable types of smart devices are smartphones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains and smart audio devices. The term “smart device” may also refer to a device that exhibits some properties of ubiquitous computing, such as artificial intelligence. Herein, we use the expression “smart audio device” to denote a smart device which is either a single-purpose audio device or a multi-purpose audio device (e.g., an audio device that implements at least some aspects of virtual assistant functionality). A single-purpose audio device is a device (e.g., a television (TV)) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera), and which is designed largely or primarily to achieve a single purpose. For example, although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run locally, including the application of watching television. In this sense, a single-purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single-purpose audio devices may be configured to group together to achieve playing of audio over a zone or user configured area.
One common type of multi-purpose audio device is a smart audio device, such as a “smart speaker,” that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured for communication. Such a multi-purpose audio device may be referred to herein as a “virtual assistant.” A virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera). In some examples, a virtual assistant may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud-enabled or otherwise not completely implemented in or on the virtual assistant itself. In other words, at least some aspects of virtual assistant functionality, e.g., speech recognition functionality, may be implemented (at least in part) by one or more servers or other devices with which a virtual assistant may communicate via a network, such as the Internet. Virtual assistants may sometimes work together, e.g., in a discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, e.g., the one which is most confident that it has heard a wakeword, responds to the wakeword. The connected virtual assistants may, in some implementations, form a sort of constellation, which may be managed by one main application which may be (or implement) a virtual assistant.
Herein, “wakeword” is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to “awake” denotes that the device enters a state in which it awaits (in other words, is listening for) a sound command. In some instances, what may be referred to herein as a “wakeword” may include more than one word, e.g., a phrase.
Herein, the expression “wakeword detector” denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold which is tuned to give a reasonable compromise between rates of false acceptance and false rejection. Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
As used herein, the terms “program stream” and “content stream” refer to a collection of one or more audio signals, and in some instances video signals, at least portions of which are meant to be heard together. Examples include a selection of music, a movie soundtrack, a movie, a television program, the audio portion of a television program, a podcast, a live voice call, a synthesized voice response from a smart assistant, etc. In some instances, the content stream may include multiple versions of at least a portion of the audio signals, e.g., the same dialogue in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at one time.
At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices (e.g., a system that includes one or more devices) may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may be, or may include, an audio device having a microphone system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof. The control system may be configured for implementing some or all of the methods disclosed herein.
In some examples, the control system may be configured to receive microphone signals from the microphone system. The microphone signals may include signals corresponding to one or more sounds detected by the microphone system. According to some examples, the control system may be configured to determine, via a trained neural network, a filtering scheme for the microphone signals. In some examples, the filtering scheme may include one or more filtering processes. In some examples, the trained neural network may be configured to implement one or more subband-domain adaptive filter management modules. In some examples, the control system may be configured to apply the filtering scheme to the microphone signals, to produce enhanced microphone signals.
According to some examples, the control system may be configured to implement one or more multichannel, multi-hypothesis adaptive filter blocks. In some such examples, the one or more subband-domain adaptive filter management modules may be configured to control the one or more multichannel, multi-hypothesis adaptive filter blocks.
In some examples, the control system may be configured to implement a subband-domain acoustic echo canceller (AEC). In some such examples, the filtering scheme may involve an echo cancellation process.
In some implementations, the audio device also may include a loudspeaker system. According to some such implementations, the control system may be further configured to implement a renderer for producing rendered local audio signals and for providing the rendered local audio signals to the loudspeaker system and to the subband-domain AEC.
In some examples, the control system may be configured for providing reference non-local audio signals to the subband-domain AEC. In some such examples, the reference non-local audio signals may correspond to audio signals being played back by one or more other audio devices.
According to some examples, the control system may be configured to implement a noise compensation module. In some such examples, the filtering scheme may include, or may involve, a noise compensation process.
In some examples, the control system may be configured to implement a dereverberation module. In some such examples, the filtering scheme may include, or may involve, a dereverberation process.
According to some examples, the control system may be configured to implement a beam steering module. In some such examples, the filtering scheme may include, or may involve, a beam steering process.
In some examples, the control system may be configured to implement an automatic speech recognition module. In some such examples, the control system may be configured to provide the enhanced microphone signals to the automatic speech recognition module.
According to some examples, the control system may be configured to implement a telecommunications module. In some such examples, the control system may be configured to provide the enhanced microphone signals to the telecommunications module.
In some implementations, the trained neural network may be, or may include, a recurrent neural network. According to some such implementations, the recurrent neural network may be, or may include, a gated adaptive filter unit. In some such implementations, the gated adaptive filter unit may include a reset gate, an update gate and a keep gate. In some such implementations, the gated adaptive filter unit may include an adaptation gate.
According to some implementations, the audio device may include a square law module configured to generate a plurality of residual power signals based, at least in part, on the microphone signals. The square law module may, in some examples, be implemented by the control system. In some implementations, the square law module may be configured to generate the plurality of residual power signals based, at least in part, on reference signals that correspond to audio being played back by the audio device and one or more other audio devices. According to some implementations, the audio device may include a selection block configured to select the enhanced microphone signals based, at least in part, on a minimum residual power signal of the plurality of residual power signals. The selection block may, in some examples, be implemented by the control system.
In some implementations, the control system may be configured to implement post-deployment training of the trained neural network. The post-deployment training may, for example, occur after the audio device has been deployed and activated in an audio environment.
At least some aspects of the present disclosure may be implemented via one or more audio processing methods. In some instances, the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some methods may involve receiving microphone signals from a microphone system. The microphone signals may include signals corresponding to one or more sounds detected by the microphone system. Some such methods may involve determining (for example, via a trained neural network), a filtering scheme for the microphone signals. The filtering scheme may include one or more filtering processes. In some examples, the trained neural network may be configured to implement one or more subband-domain adaptive filter management modules. Some such methods may involve applying the filtering scheme to the microphone signals, to produce enhanced microphone signals.
Some methods may involve implementing (via a control system, for example, via a trained neural network implemented by the control system) the one or more multichannel, multi-hypothesis adaptive filter blocks. In some such examples, the method may involve controlling, by the one or more subband-domain adaptive filter management modules, the one or more multichannel, multi-hypothesis adaptive filter blocks.
Some methods may involve implementing (e.g., via the control system) a subband-domain acoustic echo canceller (AEC). In some such examples, the filtering scheme may involve an echo cancellation process. Some methods may involve implementing (e.g., via the control system) a renderer for producing rendered local audio signals. In some such examples, the method may involve providing the rendered local audio signals to a loudspeaker system and to the subband-domain AEC. Some methods may involve providing reference non-local audio signals to the subband-domain AEC. The reference non-local audio signals may correspond to audio signals being played back by one or more other audio devices.
According to some examples, the method may involve implementing (e.g., by the control system) a noise compensation module. In some such examples, the filtering scheme may include, or may involve, a noise compensation process.
In some examples, the method may involve implementing (e.g., by the control system) a dereverberation module. In some such examples, the filtering scheme may include, or may involve, a dereverberation process.
According to some examples, the method may involve implementing (e.g., by the control system) a beam steering module. In some such examples, the filtering scheme may include, or may involve, a beam steering process.
In some examples, the method may involve implementing (e.g., by the control system) an automatic speech recognition module. In some such examples, the method may involve providing the enhanced microphone signals to the automatic speech recognition module.
According to some examples, the method may involve implementing (e.g., by the control system) a telecommunications module. In some such examples, the method may involve providing the enhanced microphone signals to the telecommunications module.
According to some implementations, the method may involve generating (e.g., by a square law module implemented by the control system) a plurality of residual power signals based, at least in part, on the microphone signals. Some implementations may involve generating the plurality of residual power signals based, at least in part, on reference signals that correspond to audio being played back by the audio device and one or more other audio devices. According to some implementations, the method may involve selecting (e.g., by a selection block implemented by the control system) the enhanced microphone signals based, at least in part, on a minimum residual power signal of the plurality of residual power signals.
In some implementations, the method may involve implementing (e.g., by the control system) post-deployment training of the trained neural network. The post-deployment training may, for example, occur after an audio device configured to implement at least some disclosed methods has been deployed and activated in an audio environment.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
Full duplex audio devices, such as smart speakers, perform both audio playback and audio capture tasks in order to provide functionality and features to one or more listeners in an audio environment. Many listening tasks employ so-called “optimal filters,” which also may be referred to herein as “filters” or “acoustic filters,” to perform tasks such as acoustic echo cancellation (AEC), beam steering, active noise cancellation (ANC), dereverberation, etc. To build practical subsystems that realize these tasks, the optimal filters are typically supported by control and aiding mechanisms based on heuristics that are manually designed and tuned by the system or product designer.
Some implementations disclosed in this document involve the use of machine learning to build neural networks that can automatically realize the structure and tunings of heuristics to control and aid the filters. Some novel implementations of these machine learning algorithms in the form of recurrent neural networks called Gated Adaptive Filters (GAFs) are disclosed herein. In some disclosed examples, GAFs are tailored for learning filter-related heuristics by controlling the flow of information over long sequences of data for each of a plurality of hypothesis filtering systems. In this context, a “long sequence of data” may be a sequence of data during a time interval of multiple seconds (such as 5 seconds, 10 seconds, 15 seconds, 20 seconds, 25 seconds, 30 seconds, 35 seconds, 40 seconds, 45 seconds, 50 seconds, 55 seconds, 60 seconds, 65 seconds, 70 seconds, 75 seconds, 80 seconds, 85 seconds, 90 seconds, 95 seconds, 100 seconds, 105 seconds, 110 seconds, 115 seconds, etc.) or during a time interval of multiple minutes (such as 2 minutes, 3 minutes, 4 minutes, 5 minutes, 6 minutes, 7 minutes, 8 minutes, etc.). According to some such examples, GAFs can improve the learning of such heuristics because their gated retention, adaptation and resetting of information is fully compatible with the control mechanism of filter coefficients in a multi-hypothesis adaptive filter scheme. In some such examples, a GAF-based multi-hypothesis adaptive filter scheme may be designed to optimize audio device performance responsive to disturbances in the audio environment.
Devices that are configured to listen during the playback of content will typically employ some form of echo management, such as echo cancellation and/or echo suppression, to remove the “echo” (the content played back by audio devices in the audio environment) from microphone signals. Any continuous listening task, such as waiting for a wakeword or performing any kind of “continuous calibration,” should continue to function when audio devices are playing back content corresponding to music, movies, etc., and when audio device interactions (such as interactions between a person and an audio device that is implementing a voice assistant, at least in part) take place. In addition to this, active noise cancellation and/or suppression along with beamforming and dereverberation technologies can further enhance the quality of the microphone signal for downstream applications.
According to this example, the audio environment 100 includes audio devices 110A, 110B, 110C and 110D. In this example, each of the audio devices 110A-110D includes a respective one of the microphones 120A, 120B, 120C and 120D, as well as a respective one of the loudspeakers 121A, 121B, 121C and 121D. According to some examples, each of the audio devices 110A-110D may be a smart audio device, such as a smart speaker. In this example, the audio devices 110A-110D are configured to listen for a command or wakeword within the audio environment 100.
According to this example, one acoustic event is caused by the talking person 130, who is talking in the vicinity of the audio device 110A. Element 131 is intended to represent speech of the talking person 130.
According to some alternative implementations the apparatus 150 may be, or may include, a server. In some such examples, the apparatus 150 may be, or may include, an encoder. Accordingly, in some instances the apparatus 150 may be a device that is configured for use within an audio environment, such as a home audio environment, whereas in other instances the apparatus 150 may be a device that is configured for use in “the cloud,” e.g., a server.
In this example, the apparatus 150 includes an interface system 155 and a control system 160. The interface system 155 may, in some implementations, be configured for communication with one or more other devices of an audio environment. The audio environment may, in some examples, be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc. The interface system 155 may, in some implementations, be configured for exchanging control information and associated data with audio devices of the audio environment. The control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 150 is executing.
The interface system 155 may, in some implementations, be configured for receiving, or for providing, a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals.
In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. Metadata may, for example, have been provided by what may be referred to herein as an “encoder.” In some examples, the content stream may include video data and audio data corresponding to the video data. The interface system 155 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 155 may include one or more wireless interfaces. The interface system 155 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. Accordingly, while some such devices are represented separately in
In some examples, the interface system 155 may include one or more interfaces between the control system 160 and a memory system, such as the optional memory system 165 shown in
The control system 160 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 160 may reside in more than one device. For example, in some implementations a portion of the control system 160 may reside in a device within one of the environments depicted herein and another portion of the control system 160 may reside in a device that is outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc. In other examples, a portion of the control system 160 may reside in a device within one of the environments depicted herein and another portion of the control system 160 may reside in one or more other devices of the environment. For example, control system functionality may be distributed across multiple smart audio devices of an environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of the control system 160 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 160 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc. The interface system 155 also may, in some examples, reside in more than one device.
In some implementations, the control system 160 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 160 may be configured to receive microphone signals from a microphone system. The microphone signals may include signals corresponding to one or more sounds detected by the microphone system. In some such examples, the control system 160 may be configured to determine, via a trained neural network, a filtering scheme for the microphone signals, the filtering scheme including one or more filtering processes. In some such examples, the trained neural network may be configured to implement one or more subband-domain adaptive filter management modules. In some examples, the control system 160 may be configured to implement the trained neural network. According to some examples, the control system 160 may be configured to implement one or more multichannel, multi-hypothesis adaptive filter blocks. In some such examples, the one or more subband-domain adaptive filter management modules may be configured to control the one or more multichannel, multi-hypothesis adaptive filter blocks.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 165 shown in
In some examples, the apparatus 150 may include the optional microphone system 170 shown in
According to some implementations, the apparatus 150 may include the optional loudspeaker system 175 shown in
In some implementations, the apparatus 150 may include the optional sensor system 180 shown in
In some implementations, the apparatus 150 may include the optional display system 185 shown in
According to some such examples the apparatus 150 may be, or may include, a smart audio device, such as a smart speaker. In some such implementations the apparatus 150 may be, or may include, a wakeword detector. For example, the apparatus 150 may be configured to implement (at least in part) a virtual assistant.
In this example, the audio device 110A includes a renderer 110A, a filter optimizing module 150A and a speech processor/communications block 151A. According to this example, the renderer 110A, the filter optimizing module 150A and the speech processor/communications block 151A are all implemented by the control system 160A, which is an instance of the control system 160 of
According to this example, the renderer 110A is configured to render audio data received by the audio device 110A or stored on the audio device 110A for reproduction on loudspeaker 121A. In this example, the renderer output 101A is provided to the loudspeaker 121A for playback.
In some implementations, the speech processor/communications block 151A may be configured for speech recognition functionality. In some examples, the speech processor/communications block 151A may be configured to provide telecommunications services, such as telephone calls, video conferencing, etc. Although not shown in
In this example, the filter optimizing module 150A is configured to select and implement filters for enhancing the microphone signal(s) 123A, to produce the enhanced microphone signal(s) 124A. In some examples, the filter optimizing module 150A may be configured to perform acoustic echo cancellation (AEC), beam steering, active noise cancellation (ANC), dereverberation, or combinations thereof.
Acoustic echo cancellers (AECs) are often implemented in the subband domain for both performance and cost reasons. A subband domain AEC (which also may be referred to herein as a multi-channel AEC or an MC-AEC) normally includes a subband AEC for each of a plurality of subbands. Furthermore, also for practical reasons, each subband AEC normally runs multiple adaptive filters, each of which is optimal in different acoustic conditions. The multiple adaptive filters are controlled by adaptive filter management modules, so that overall the subband AEC may have the best characteristics of each filter.
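As a concrete illustration of this structure, the following Python sketch shows one subband with two adaptive-filter hypotheses, an aggressive filter and a conservative one, whose residuals are compared by a simple management rule. The class names, step sizes, complex-valued subband samples and the NLMS update law are assumptions chosen for illustration, not details of any particular product.

```python
import numpy as np

class SubbandHypothesis:
    """One adaptive-filter hypothesis (here a complex NLMS filter) for a single subband."""
    def __init__(self, taps, mu):
        self.w = np.zeros(taps, dtype=complex)   # filter coefficients
        self.mu = mu                             # adaptation rate ("aggressiveness")

    def step(self, ref_hist, mic):
        """ref_hist: the most recent `taps` reference (playback) samples for this subband."""
        echo_estimate = np.vdot(self.w, ref_hist)                  # predicted echo
        residual = mic - echo_estimate                             # echo-cancelled output
        norm = np.vdot(ref_hist, ref_hist).real + 1e-8
        self.w += self.mu * np.conj(residual) * ref_hist / norm    # NLMS coefficient update
        return residual

class SubbandAEC:
    """A bank of hypotheses for one subband, managed by a simple selection rule."""
    def __init__(self, taps=8):
        # e.g., an aggressive "main" filter and a conservative "shadow" filter
        self.hypotheses = [SubbandHypothesis(taps, mu) for mu in (0.5, 0.05)]

    def step(self, ref_hist, mic):
        residuals = [h.step(ref_hist, mic) for h in self.hypotheses]
        # simplest possible management heuristic: pass on the lowest-power residual
        return min(residuals, key=lambda r: abs(r) ** 2)
```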
The renderer 110A and the speech processor/communications block 151A may be configured substantially as described with reference to
In this example, the MC-AEC 203A is an implementation of the filter optimizing module 150A that is described elsewhere herein, e.g., with reference to
In some examples, the choice of these adaptation laws and heuristics 410A may be made by a person (such as a designer) such that out of all the hypotheses of the MC-MH AFB 411A there will be at least one good hypothesis in both favorable conditions and unfavorable conditions. That is, there will be at least one predicted echo signal in the set of predicted echo reference signals 402A that produces a good residual signal (one of the 403A signals) that is then in turn output by the heuristic block 410A as the enhanced microphone signal 124A. An example of a “good” residual signal is a lowest-power residual signal. Other examples of “good” residual signals may be produced by minimizing a cost of one of the cost functions disclosed herein. In some examples, the quality of a residual signal may be estimated according to how much of the echo remains in the residual signal, for example by correlating the residual signal with the echo references. Alternatively, or additionally, the quality of a residual signal may be estimated according to the stability of the residual signal over a time interval. An example of a time interval corresponding to an “unfavorable condition” is a time interval during a disturbance (e.g., during an acoustical change in the audio environment, such as an acoustical change caused by a noise source), after a disturbance, or both during and after a disturbance.
Although not illustrated as such in
According to some examples, each type of adaptive filter of the MC-MH AFB 411A may perform better in different acoustic conditions. For example, one type of adaptive filter may be better at tracking echo path changes whereas another type of adaptive filter may be better at avoiding misadaptation during instances of doubletalk. The adaptive filters of the MC-MH AFB 411A may, in some examples, include a continuum of adaptive filters. The adaptive filters of the MC-MH AFB 411A may, for example, range from a highly adaptive or aggressive adaptive filter (which may sometimes be referred to as a “main” adaptive filter) that determines filter coefficients responsive to current audio conditions (e.g., responsive to a current error signal) to a highly conservative adaptive filter (which may sometimes be referred to as a “shadow” adaptive filter) that provides little or no change in filter coefficients responsive to current audio conditions.
In some implementations, the adaptive filters of the MC-MH AFB 411A may include adaptive filters having a variety of adaptation rates, filter lengths and/or adaptation algorithms (e.g., adaptation algorithms that include one or more of least mean square (LMS), normalized least mean square (NLMS), proportionate normalized least mean square (PNLMS) and/or recursive least square (RLS)), etc. In some implementations, the adaptive filters of the MC-MH AFB 411A may include linear and/or non-linear adaptive filters, adaptive filters having different reference and microphone signal time alignments, etc. According to some implementations, the adaptive filters of the MC-MH AFB 411A may include an adaptive filter that only adapts when the output is very loud or very quiet. For example, a “party” adaptive filter might only adapt to the loud parts of output audio.
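As one small illustration of how such diversity may be realized, the following sketch gates adaptation on playback loudness, in the spirit of the “party” filter mentioned above. The threshold value, the function names and the use of an NLMS update are assumptions made for the example.

```python
import numpy as np

def nlms_step(w, ref_hist, mic, mu, eps=1e-8):
    """One complex NLMS update; returns (residual, updated coefficients)."""
    residual = mic - np.vdot(w, ref_hist)
    w = w + mu * np.conj(residual) * ref_hist / (np.vdot(ref_hist, ref_hist).real + eps)
    return residual, w

def party_filter_step(w, ref_hist, mic, mu, loud_threshold=1e-2):
    """A "party" hypothesis: adapts only while the playback reference is loud."""
    ref_power = np.vdot(ref_hist, ref_hist).real / len(ref_hist)
    if ref_power < loud_threshold:
        return mic - np.vdot(w, ref_hist), w   # freeze the coefficients, still cancel
    return nlms_step(w, ref_hist, mic, mu)
```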
According to this example, the heuristic block 410A is configured to select a subband domain residual signal from the plurality of residual signals 403 according to a set of heuristic rules. For example, the heuristic block 410A may be configured to monitor the state of the system and to manage the MC-MH AFB 411A through mechanisms such as copying filter coefficients from one adaptive filter into the other if certain conditions are met (e.g., one is outperforming the other). For example, if adaptive filter A is clearly outperforming adaptive filter B, the subband domain adaptive filter management module 411 may be configured to copy the filter coefficients for adaptive filter A to adaptive filter B. In some instances, the subband domain adaptive filter management module 411 may also issue reset commands to one or more adaptive filters of the plurality of subband domain adaptive filters 410 if the subband domain adaptive filter management module 411 detects divergence.
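A minimal sketch of two such management rules, coefficient copying and reset on suspected divergence, is given below. The margins, the divergence test and the function name are assumptions for illustration rather than the heuristic rules of any particular system.

```python
import numpy as np

def manage_hypotheses(filters, residual_powers, mic_power,
                      copy_margin=4.0, divergence_margin=10.0):
    """filters: list of coefficient arrays; residual_powers, mic_power: smoothed powers."""
    best = int(np.argmin(residual_powers))
    for n, w in enumerate(filters):
        if n == best:
            continue
        if residual_powers[n] > divergence_margin * mic_power:
            # suspected divergence: the "cancelled" signal is far louder than the raw microphone
            filters[n] = np.zeros_like(w)
        elif residual_powers[n] > copy_margin * residual_powers[best]:
            # the best hypothesis is clearly outperforming this one: copy its coefficients
            filters[n] = filters[best].copy()
    return best  # index of the hypothesis whose residual is passed downstream
```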
Accordingly, the neural network block 510A is configured to control the MC-MH AFB 411A in this example. Various types of control mechanisms may be implemented, depending on the particular implementation. Such control mechanisms may include:
The neural network block 510A may be trained via offline training (e.g., prior to deployment by an end user), online training (e.g., during deployment by an end user) or by a combination of both offline training and online training. Various examples of training the neural network block 510A are disclosed herein. One or more cost functions used to optimize the neural network block 510A may be chosen by a person, such as a system designer. The cost function(s) may be chosen in such a way as to attempt to make the set of hypotheses corresponding to the plurality of sets of adaptive filters of the MC-MH AFB 411A be globally optimal. The definition of globally optimal is application dependent and chosen by the designer. In some examples, the cost function(s) may be selected to optimize for one or more of the following:
Additional detailed examples of training the neural network block 510A are described below.
In this embodiment, the square law device 601A is configured to compute the square of all input signals and to output corresponding power signals. The argmin block 611A is configured to determine which of the residual power signals 602A (the argument) has the lowest power and outputs the argument 603A to the selection block 613A. The selection block 613A is configured to select one of the residual signals 403A corresponding to the argument 603A, and to provide a selected residual signal as one of the output residual signals 124A. In this example, the neural network block 612A is configured to consume the power signals 601A and to determine the control signals 401A for controlling the adaptive filters of the MC-MH AFB 411A.
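The following sketch traces that signal path for one subband and one block of samples. The array shapes and the particular features passed to the neural network are assumptions for illustration.

```python
import numpy as np

def select_enhanced_output(residuals, mic, refs):
    """residuals: (num_hypotheses, block) residual signals for one subband;
    mic: (block,) microphone samples; refs: (num_refs, block) reference samples."""
    residual_powers = np.mean(np.abs(residuals) ** 2, axis=-1)    # square law device
    best = int(np.argmin(residual_powers))                        # argmin block
    enhanced = residuals[best]                                    # selection block
    # power features of this kind could be provided to the neural network block
    nn_features = np.concatenate([residual_powers,
                                  [np.mean(np.abs(mic) ** 2)],
                                  np.mean(np.abs(refs) ** 2, axis=-1)])
    return enhanced, nn_features
```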
In this example, the filter coefficients 701A of the adaptive filters of the MC-MH AFB 411A are being fed back into the neural network block 510A as input. According to this example, the control signals 401A are based, at least in part, on the filter coefficients 701A.
In this embodiment, the square law device 601A is configured to compute the square of all input signals and to output corresponding power signals. The argmin block 611A is configured to determine which of the residual power signals 602A (the argument) has the lowest power and to output the argument 603A to the selection block 613A. The selection block 613A is configured to select one of the residual signals 403A corresponding to the argument 603A, and to provide a selected residual signal as one of the output residual signals 124A.
In this example, the GAF unit 810A is configured to determine the control signals 401A for controlling the adaptive filters of the MC-MH AFB 411A based, at least in part, on the power signals 601A and the filter coefficients 701A. In this example, the GAF unit 810A has been trained to produce the control signals 401A for controlling adaptive filters of the MC-MH AFB 411A based, at least in part, on the filter coefficients 701A.
According to this example, the GAF unit 810A also includes a reset gate 922A, an update gate 923A, a keep gate 924A and an adaptation gate 930A, which generate reset signals 902A, update signals 903A, keep signals 904A and adaptation signals 905A, respectively. In this example, the reset signals 902A, update signals 903A and keep signals 904A correspond to indications that filter coefficients of adaptive filters should be reset, updated or kept unchanged, respectively. In some examples, the reset signals 902A may correspond to indications of divergence, e.g., that the output of adaptive filters has diverged too far and cannot be recovered. According to some examples, the update signals 903A may correspond to indications that the filter coefficients of a better-performing adaptive filter should be copied to those of a worse-performing adaptive filter. For example, if adaptive filter A is clearly outperforming adaptive filter B, the update signals 903A may indicate that the filter coefficients for adaptive filter A should be copied to adaptive filter B.
In this example, a reset signal 902A, an update signal 903A and a keep signal 904A are provided by the reset gate 922A, the update gate 923A and the keep gate 924A, respectively, to a signal selection module 914A, which is configured to select one of the input signals (the reset signal 902A, the update signal 903A or the keep signal 904A) and to output selected filter signals 906A. In this example, the signal selection module 914A is implemented via a softmax module. In other examples, the signal selection module 914A may be implemented using an argmax function or a “soft margin” softmax function. In this example, the signal selection module 914A is configured to apply the following softmax function to the reset signal 902A, the update signal 903A and the keep signal 904A:
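One illustrative form of such a softmax function, with z denoting a vector containing the three gate signals, is:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$$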
In the foregoing softmax function, the input vector z may be expressed as follows:
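In one illustrative form,

$$z = [\,r,\; u,\; k\,]$$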
In the foregoing equation, r represents the reset signal 902A, u represents the update signal 903A and k represents the keep signal 904A. In the softmax function, “i” corresponds to a particular value of r, u or k. So, one example of the softmax function in which i=1 would be e^(z1)/(e^(z1)+e^(z2)+e^(z3)). This makes the output of the signal selection module 914A bounded by zero and 1. The vector at time t may be expressed as follows:
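Correspondingly, an illustrative time-indexed form is:

$$z_t = [\,r_t,\; u_t,\; k_t\,]$$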
In some examples, the reset gate 922A may compute the reset signal 902A (rt) using a linear layer 911A, e.g., as follows:
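One illustrative, GRU-style linear form, in which xt is the input defined below and ht−1 denotes the hidden state information 991A from the previous time step (both the notation and the exact expression are illustrative), is:

$$r_t = W_r x_t + U_r h_{t-1} + b_r$$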
In the foregoing equation, Wr and Ur represent the weights of the reset layer and br represents the bias of the reset layer. W, U and b may be learned as part of a process of training the GAF unit 810A. In some examples, W, U and b may be any number and are not bounded by zero and 1.
Similarly, update gate 923A may compute the update gate signal 903A using a linear layer 911A, e.g., as follows:
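By analogy with the illustrative reset-gate expression above:

$$u_t = W_u x_t + U_u h_{t-1} + b_u$$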
In the foregoing equation, Wu and Uu represent the weights of the update layer and bu represents the bias of the update layer.
In some examples, the keep gate 924A may compute the keep gate signal 904A using a linear layer 911A, e.g., as follows:
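Again by analogy, an illustrative expression is:

$$k_t = W_k x_t + U_k h_{t-1} + b_k$$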
In the foregoing equation, Wk and Uk represent the weights of the keep layer and bk represents the bias of the keep layer.
The input to the foregoing equations for rt, ut and kt is denoted as xt, which represents the filter coefficients 701A and a plurality of microphone power signals, audio content power signals and residual power signals (601A) multiplexed together. Accordingly, xt represents essentially everything in the multiplexed signals 901A output by the multiplexer 910A, except for the hidden state information 991A.
According to this example, the adaptation gate 930A is configured to output an adaptation gate signal that is computed, at least in part, by the linear layer 911A. In this example, the adaptation gate 930A includes a sigmoid function block 912A that is configured to produce an output adaptation gate signal 905A that is in the range from 0 to 1, e.g., as follows:
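One illustrative expression, in which Wa, Ua and ba denote assumed weights and a bias for the adaptation layer and S denotes the sigmoid function defined below, is:

$$\alpha_t = S\!\left(W_a x_t + U_a h_{t-1} + b_a\right)$$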
In the foregoing equation, αt represents the output adaptation gate signal 905A and S represents a sigmoid function, which may be expressed as follows:
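In its standard form,

$$S(x) = \frac{1}{1 + e^{-x}}$$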
In this example, the multiplexer 910C generates the control signals 401A by multiplexing the output adaptation gate signal 905A and the selected filter signals 906A output by the signal selection module 914A.
According to this example, the GAF unit 810A includes an optional filter adaptation module 951A, which is configured to compute the filter adaptation step 995A based on the output adaptation gate signal 905A and the filter coefficients 701A. In some such examples, the filter adaptation step 995A may be represented as αt δWt, where δWt indicates how much the filter coefficients 701A have changed and αt represents the output adaptation gate signal 905A. In some examples, δWt may be computed using the adaptation algorithm for each of the multiple hypotheses (corresponding to sets of adaptive filters) of the MC-MH AFB 411A. Such adaptation algorithms may include normalised least-mean-squares (NLMS) and proportionate NLMS (PNLMS) algorithms, which can provide diversity in the hypotheses. However, in some examples the adaptation algorithms may not be specified. It may not be necessary to specify the adaptation algorithms so long as the neural network (in this example, the GAF unit 810A) is able to learn to produce diverse hypotheses by way of controlling each of the filters differently.
In some alternative implementations, the GAF unit 810A may not include a filter adaptation module 951A. In some such implementations, the GAF unit 810A may be configured to obtain the filter adaptation step 995A from the MC-MH AFB 411A.
According to this example, the multiplexer 910B is configured to determine the hidden state information 992A, which is being produced by the GAF unit 810A during the current time step, by concatenating the filter adaptation step 995A and the selected filter signals 906A output by the signal selection module 914A, e.g., as follows:
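An illustrative expression, with [·, ·] denoting concatenation, is:

$$h_t = \left[\,\alpha_t\,\delta W_t,\;\; f_t\,\right]$$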
In the foregoing equation, ft represents the selected filter signals 906A and αt δWt represents the filter adaptation step 995A.
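To make the data flow concrete, the following is a minimal numpy sketch of one GAF time step, assuming GRU-style linear layers for the gates and the shapes noted in the comments. The function name, the parameter names (for example, the dictionary keys Wr, Ur and br) and the shapes are illustrative rather than taken from any particular implementation.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaf_step(x_t, h_prev, p, delta_w):
    """One GAF time step for a single subband.

    x_t:     multiplexed powers and filter coefficients (901A minus the hidden state), shape (D,)
    h_prev:  hidden state information from the previous step (991A), shape (H,)
    p:       dict of learned gate parameters Wr, Ur, br, Wu, Uu, bu, Wk, Uk, bk, Wa, Ua, ba
    delta_w: raw adaptation step proposed for each of N hypotheses, shape (N, taps)
    """
    r_t = p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"]            # reset gate signal, shape (N,)
    u_t = p["Wu"] @ x_t + p["Uu"] @ h_prev + p["bu"]            # update gate signal
    k_t = p["Wk"] @ x_t + p["Uk"] @ h_prev + p["bk"]            # keep gate signal
    f_t = softmax(np.stack([r_t, u_t, k_t]), axis=0)            # per-hypothesis reset/update/keep weights
    a_t = sigmoid(p["Wa"] @ x_t + p["Ua"] @ h_prev + p["ba"])   # adaptation gate, each entry in (0, 1)
    step = a_t[:, None] * delta_w                               # gated filter adaptation step
    control = np.concatenate([f_t.ravel(), a_t])                # control signals sent to the adaptive filter bank
    h_t = np.concatenate([step.ravel(), f_t.ravel()])           # new hidden state (992A)
    return control, h_t, step
```

In a full system, one such step may be run per subband and per time step, with the returned control signals driving the reset, update, keep and adaptation decisions of the MC-MH AFB 411A.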
In some examples, the data used for training contains “clean” echo, which is microphone data including only “echo” corresponding to audio played back by one or more audio devices in an audio environment, in addition to other training vectors which contain both echo and perturbations. The nature of the perturbations included in the training dataset will define the behaviour of the learned heuristics and will thus define the performance of a system that includes the trained neural network when the device or system is deployed into the real world. In some such examples, the data used for training may contain:
This section provides examples of cost functions in the context of training neural networks for functionality relating to MC-AECs. One of ordinary skill in the art will appreciate that at least some of the disclosed examples will apply to training neural networks for other types of functionality.
The goal of many MC-AECs is to reduce the power of the residual signal. Thus, one cost function that one could consider for training a neural network for implementing an MC-AEC would be a cost function that seeks to reduce the mean-square of the residual signal over all N of the hypotheses in an adaptive filter bank, such as an adaptive filter bank of the MC-MH AFB 411A. One such cost function may be expressed as follows:
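An illustrative form, with resn denoting the residual amplitude of hypothesis n (the normalization may differ), is:

$$\mathrm{cost} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{res}_n^{\,2}$$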
In this example, resn represents the amplitude of the residual signal for the nth hypothesis. A cost function of this type may be used to derive the instantaneous adaptation of many adaptive filters (e.g., those used to compute the adaptation step of the MC-MH AFB 411A). In some examples, one may wish to reduce the power of the residual signal over the entire time T of the training vector, e.g., as follows:
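For example, extending the previous illustrative expression over time:

$$\mathrm{cost} = \sum_{t=1}^{T} \frac{1}{N}\sum_{n=1}^{N} \mathrm{res}_{n,t}^{\,2}$$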
When non-stationary perturbations are in the training data, one may wish to weight the time periods surrounding these non-stationary perturbations more heavily. This may be due to the fact that perturbation timesteps are underrepresented in the data and/or because one realizes that one or more of the heuristics that one wishes the neural network to learn concern non-stationary perturbations. In order to weight the time periods surrounding non-stationary perturbations more heavily, some examples involve altering any of the cost functions (above or below) which involve summing over T, e.g., as follows:
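An illustrative weighted form is:

$$\mathrm{cost} = \sum_{t=1}^{T} \beta_t \,\frac{1}{N}\sum_{n=1}^{N} \mathrm{res}_{n,t}^{\,2}$$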
In the foregoing equation, βt represents a vector that weights each timestep. For example, one could set βt to 0.1, in general, whereas βt during perturbations could be set to 0.5 and βt during a time interval immediately after the perturbations (e.g., for the next 2 seconds, the next 3 seconds, the next 4 seconds, etc.) could be set to 1.0. This would place emphasis on the heuristics that are being learned to provide robustness in the presence of perturbations (which may be a primary goal of the heuristics).
Alternatively, or additionally, one could consider that the target application would revolve around the MC-AEC (in this example) not only trying to completely cancel the echo, but also to improve the ability of a downstream automatic speech recognition (ASR) module to listen to the user in the room. Therefore, if we have a copy of the clean speech signal that is present in the training corpus (the audio data used for training), one could use a cost function such as the following:
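One illustrative form, which drives each residual toward the clean speech rather than toward silence (the exact expression may differ), is:

$$\mathrm{cost} = \sum_{t=1}^{T} \frac{1}{N}\sum_{n=1}^{N} \left(\mathrm{res}_{n,t} - \mathrm{speech}_t\right)^{2}$$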
In the foregoing equation, speecht represents the clean speech signal.
Alternatively, or additionally, a cost function may be based, at least in part, on any other type of intelligibility metric, on a mean opinion score (MOS), etc. Such cost functions would place an emphasis on the system enhancing a user's speech and distinguishing the user's speech from the corrupted (speech and echo) microphone signals.
In some examples, one may wish to train a neural network to replicate a hand-crafted set of heuristics, such as those produced by the heuristic block 410A discussed above.
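One illustrative cost of this kind is a mean-squared error between the two sets of control signals:

$$\mathrm{cost} = \sum_{t=1}^{T} \left\lVert \theta_{\mathrm{MAN},t} - \theta_{\mathrm{NN},t} \right\rVert_2^{2}$$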
In the foregoing equation, θMAN represents the control signals resulting from a hand-crafted set of heuristic rules and θNN represents the control signals produced by the neural network. If the nature of these signals is binary-like, a log-loss type cost function may be more suitable. The term “binary-like” implies that the signal only involves 2 numbers, or that the signal is continuous but effectively takes on only 2 numbers. For example, the signal x=[0, 1, 1, 0, 1, 0] is binary. The signal x=[0.1, 0.89, 0.9, 0.08, 0.92, 0.05] is binary-like because it is essentially just taking on two values, plus a small amount of noise. The term “log-loss cost function” refers to a cross-entropy cost function. Therefore, a “log-loss type cost function” refers to other cost functions that are similar to the cross-entropy cost function, such as the Kullback-Leibler divergence cost function.
A typical issue with machine learning in general and machine learning for AECs specifically is that of overfitting. We can penalise overfitting by using a cost function such as the following for training a neural network:
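An illustrative form is:

$$\mathrm{cost} = \lVert \zeta \rVert_2^{2} = \sum_i \zeta_i^{\,2}$$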
In the foregoing equation, ζ represents the filter coefficients of an adaptive filter bank. This cost function resembles an L2 regularization and may not be useful on its own. However, see the discussion of combining cost functions below. A cost function of this type may be useful in mitigating the non-uniqueness problem that arises in the multi-channel AEC scenario due to correlated content.
Having multiple hypotheses in an adaptive filter bank can be useful in a non-static audio environment. By specifying βt such that the time immediately after a perturbation contributes significantly to the total cost of the vector, the neural network may be able to reduce this cost by ensuring that the coefficients of each of the adaptive filters are significantly different at that time (for example, by ensuring one filter has paused adaptation during a perturbation and that another filter has not paused adaptation during the perturbation). However, in some examples this feature may not be explicitly a part of the cost function and therefore the desired behavior is not guaranteed to be learned. Thus, it can be useful to use a cost function that penalizes the similarity of any two hypotheses in an adaptive filter bank. Such a penalty may, for example, be based on a simple distance metric (such as a Euclidean distance metric, a Manhattan distance metric, a cosine distance metric, etc.) that may be temporally weighted with βt.
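For instance, one illustrative penalty, written here with a Euclidean distance and a sign convention that rewards hypotheses whose coefficients ζn,t differ, is:

$$\mathrm{cost}_{\mathrm{diversity}} = -\sum_{t=1}^{T} \beta_t \sum_{n < m} \left\lVert \zeta_{n,t} - \zeta_{m,t} \right\rVert_2^{2}$$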
Any number of the above-described cost functions can be combined together using, for example, Lagrange multipliers, to use the cost functions simultaneously. In some examples, the above-described cost functions may be used sequentially, for example in the context of transfer learning as described elsewhere herein.
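For example, a combined objective may take the illustrative form of a weighted sum, with the λ terms acting as Lagrange multipliers:

$$\mathrm{cost}_{\mathrm{total}} = \lambda_1\,\mathrm{cost}_{\mathrm{residual}} + \lambda_2\,\mathrm{cost}_{\mathrm{speech}} + \lambda_3\,\mathrm{cost}_{\mathrm{replication}} + \lambda_4\,\mathrm{cost}_{\mathrm{regularization}} + \lambda_5\,\mathrm{cost}_{\mathrm{diversity}}$$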
In this particular instance, the neural network being trained is the neural network block 612A of
The training may involve providing a corpus of suitable training data and one or more suitable cost functions, such as one or more types of training data and cost functions disclosed herein. According to some examples, the training data and/or the cost functions may be selected for a target audio environment in which the system will be deployed, for example assuming perturbations and noise levels representative of a typical home acoustic environment.
It can be useful to train a neural network to first replicate hand-crafted heuristics which have proven to be at least somewhat useful in most environments, such as those disclosed in United States Provisional Patent Application No. 63/200,590, filed on Mar. 16, 2021 and entitled, “SUBBAND DOMAIN ACOUSTIC ECHO CANCELLER BASED ACOUSTIC STATE ESTIMATOR,” particularly
In some examples, transfer learning may be performed after a device that includes a trained neural network has been deployed into the target environment and activated (a condition that also may be referred to as being “online”). Many of the cost functions defined above are suitable for unsupervised learning after deployment. Accordingly, some examples may involve updating the neural network coefficients online in order to optimize performance. Such methods may be particularly useful when the target audio environment is significantly different from the audio environment(s) which produced the training data, because the new “real world” data may include data previously unseen by the neural network.
In some examples, the online training may involve supervised training. In some such examples, automatic speech recognition modules may be used to produce labels for user speech segments. Such labels may be used as the “ground truth” for online supervised training. Some such examples may involve using a time-weighted residual in which the weight immediately after speech is higher than the weight during speech.
The method 1100 may be performed by an apparatus or system, such as the apparatus 150 that is shown in
In this example, block 1105 involves receiving, by a control system, microphone signals from a microphone system. In this example, the microphone signals include signals corresponding to one or more sounds detected by the microphone system.
According to this example, block 1110 involves determining, by a trained neural network implemented by the control system, a filtering scheme for the microphone signals. According to some examples, the trained neural network may be, or may include, the neural network block 510A and/or the GAF unit 810A of the present disclosure. In this example, the filtering scheme includes one or more filtering processes and the trained neural network is configured to implement one or more subband-domain adaptive filter management modules. In this example, block 1115 involves applying, by the control system, the filtering scheme to the microphone signals, to produce enhanced microphone signals.
In some examples, the control system may be further configured to implement one or more multichannel, multi-hypothesis adaptive filter blocks, such as the multi-channel, multi-hypothesis adaptive filter block (MC-MH AFB) 411A that is shown in
In some implementations, the control system may be further configured to implement a subband-domain acoustic echo canceller (AEC). In some such implementations, the filtering scheme may include an echo cancellation process. In some such implementations, the control system may be further configured to implement a renderer for producing rendered local audio signals and for providing the rendered local audio signals to a loudspeaker system and to the subband-domain AEC. In some such implementations, the apparatus may include the loudspeaker system. In some implementations, the control system may be configured for providing reference non-local audio signals to the subband-domain AEC. The reference non-local audio signals may correspond to audio signals being played back by one or more other devices in the audio environment.
According to some examples, the control system may be configured to implement a noise compensation module. In some such examples, the filtering scheme may include a noise compensation process. In some examples, the control system may be configured to implement a dereverberation module. In some such examples, the filtering scheme may include a dereverberation process.
In some implementations, the control system may be configured to implement a beam steering module. In some such implementations, the filtering scheme may involve, or may include, a beam steering process. In some such implementations, the beam steering process may be a receive-side beam steering process to be implemented by a microphone system.
According to some examples, the control system may be configured to provide the enhanced microphone signals to an automatic speech recognition module. In some such examples, the control system may be configured to implement the automatic speech recognition module.
In some examples, the control system may be configured to provide the enhanced microphone signals to a telecommunications module. In some such examples, the control system may be configured to implement the telecommunications module.
In some implementations, the trained neural network may be, or may include, a recurrent neural network. In some such implementations, the recurrent neural network may be, or may include, a gated adaptive filter unit. In some such implementations, the gated adaptive filter unit may include a reset gate, an update gate, a keep gate, or any combination thereof. In some implementations, the gated adaptive filter unit may include an adaptation gate.
According to some examples, the apparatus configured for implementing the method may include a square law module configured to generate a plurality of residual power signals based, at least in part, on the microphone signals. In some such examples, the square law module may be configured to generate the plurality of residual power signals based, at least in part, on reference signals corresponding to audio being played back by the apparatus and one or more other devices. According to some examples, the apparatus configured for implementing the method may include a selection block configured to select the enhanced microphone signals based, at least in part, on a minimum residual power signal of the plurality of residual power signals.
In some examples, the control system may be further configured to implement post-deployment training of the trained neural network. The post-deployment training may, in some such examples, occur after the apparatus configured for implementing the method has been deployed and activated in an audio environment.
Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
While specific embodiments of the present disclosure and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the disclosure described and claimed herein. It should be understood that while certain forms of the disclosure have been shown and described, the disclosure is not to be limited to the specific embodiments described and shown or the specific methods described.
This application claims the benefit of priority from U.S. Provisional Patent Application No. 63/277,242, filed on Nov. 9, 2021, and U.S. Provisional Patent Application No. 63/369,311, filed on Jul. 25, 2022, both of which are incorporated herein by reference in their entirety.
| Filing Document | Filing Date | Country | Kind |
| --- | --- | --- | --- |
| PCT/US2022/048607 | 11/1/2022 | WO | |
| Number | Date | Country |
| --- | --- | --- |
| 63369311 | Jul 2022 | US |
| 63277242 | Nov 2021 | US |