The present disclosure is generally related to audio signal enhancement.
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications (e.g., a web browser application), that can be used to access the Internet. As such, these devices can include significant computing capabilities.
Such computing devices often incorporate functionality to process an audio signal. For example, the audio signal may represent sounds captured by one or more microphones or correspond to decoded audio data. Such devices may perform signal enhancement, such as noise suppression, to generate an enhanced audio signal. The signal enhancement (e.g., noise suppression) can remove context from the enhanced audio signal and introduce artifacts that reduce audio quality.
According to one implementation of the present disclosure, a device includes a processor configured to perform signal enhancement of an input audio signal to generate an enhanced mono audio signal. The processor is also configured to mix a first audio signal and a second audio signal to generate a stereo audio signal. The first audio signal is based on the enhanced mono audio signal.
According to another implementation of the present disclosure, a method includes performing, at a device, signal enhancement of an input audio signal to generate an enhanced mono audio signal. The method also includes mixing, at the device, a first audio signal and a second audio signal to generate a stereo audio signal. The first audio signal is based on the enhanced mono audio signal.
According to another implementation of the present disclosure, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors, cause the one or more processors to perform signal enhancement of an input audio signal to generate an enhanced mono audio signal. The instructions, when executed by the one or more processors, also cause the one or more processors to mix a first audio signal and a second audio signal to generate a stereo audio signal. The first audio signal is based on the enhanced mono audio signal.
According to another implementation of the present disclosure, an apparatus includes means for performing signal enhancement of an input audio signal to generate an enhanced mono audio signal. The apparatus also includes means for mixing a first audio signal and a second audio signal to generate a stereo audio signal. The first audio signal is based on the enhanced mono audio signal.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
Various devices perform signal enhancement, such as noise suppression, to generate enhanced audio signals. The signal enhancement (e.g., noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization) can remove audio context from the enhanced audio signal. Audio context in the present disclosure may generally refer to one or more audio signals or signal components which provide audible spatial and/or environmental information for the enhanced audio signal. For example, signal enhancement can be performed on an audio signal captured via microphones of a device during a call to generate an enhanced audio signal without background sounds (e.g., as a result of noise suppression of the audio signal). In the absence of the background sounds, a listener of the enhanced audio signal cannot determine whether a speaker is in a busy market or an office. The signal enhancement (e.g., noise suppression) can also introduce artifacts that reduce audio quality. For example, speech in the enhanced audio signal can sound choppy.
Recently, signal enhancement using one or more generative networks has been introduced. Specifically, so-called generative adversarial networks (GANs) may be used to generate audio signals, such as speech signals, with improved signal quality, e.g., with increased signal-to-noise ratio or even without any background sounds. In a GAN, a generative network may generate candidates for data, such as elements (e.g., words, phonemes, etc.) of a speech signal, while a discriminative network evaluates the candidates. Signal enhancement in the present disclosure may process one or more input audio signals using at least one generative network (e.g., a GAN) to generate one or more enhanced mono audio signals.
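For illustration only, the following Python sketch shows how a generative network and a discriminative network of a GAN-based speech enhancer might be organized. The module structure, layer sizes, and sample rate are assumptions made for this example and do not correspond to any specific implementation of the disclosure.

```python
# Illustrative sketch (assumed layer sizes): a minimal GAN-style speech enhancer.
# The generator maps a noisy mono waveform to a candidate enhanced mono waveform;
# the discriminator scores waveforms as clean vs. generated.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=15, padding=7), nn.Tanh(),
        )

    def forward(self, noisy):            # noisy: (batch, 1, samples)
        return self.net(noisy)           # candidate enhanced mono waveform

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, kernel_size=15, stride=4), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, wave):             # wave: (batch, 1, samples)
        return self.net(wave)            # real/fake logit

gen, disc = Generator(), Discriminator()
noisy = torch.randn(2, 1, 16000)         # one second at an assumed 16 kHz rate
enhanced = gen(noisy)                    # generative network produces candidates
score = disc(enhanced)                   # discriminative network evaluates them
```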
Specifically, the one or more input signals may be audio signals captured by one or more microphones in a soundscape that includes a source of a primary (target) audio signal such as a speech signal, e.g., uttered by a person, and one or more sources of secondary (unwanted) audio signals, e.g., other speech signals, directional noise, diffuse noise, etc. Signal enhancement in the present disclosure may refer to at least partially removing the secondary audio signals from the input audio signals. As described above, in some examples, the secondary audio signals may be removed using one or more generative networks.
Alternatively or additionally, noise suppression by signal filtering, audio zoom, beamforming, dereverberation, source separation, bass adjustment, and/or equalization may be applied when performing signal enhancement. For example, signal enhancement in the present disclosure may refer to increasing the gain of the target signal (e.g., the speech signal), lowering the gain of the unwanted audio signals, or both, to perform an audio zoom operation. As described above, in some examples, the gain of the target signal may be increased based on using one or more generative networks. In addition, as described above, in some examples, the gain of the unwanted audio signals may be decreased based on using one or more generative networks.
One way to perform an audio zoom operation is to perform a beamforming operation that includes generating a virtual audio beam formed by two or more microphones in the direction of the primary (target) audio signal and/or a null beam in the direction of the secondary (unwanted) audio signals. Thus, signal enhancement in the present disclosure may also refer to at least performing an audio zoom operation. As described above, in some examples, the zoom operations to increase perceptibility of the target signal may be based on using one or more generative networks.
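For illustration only, the following Python sketch shows a delay-and-sum beamformer that steers a virtual beam formed by two microphones toward a target direction, one simple way an audio zoom could be realized. The microphone spacing, sample rate, and function names are assumptions for this example.

```python
# Illustrative sketch (assumed array geometry): delay-and-sum beamforming for a
# two-microphone array, steering a virtual beam toward the target direction.
import numpy as np

SPEED_OF_SOUND = 343.0    # m/s
SAMPLE_RATE = 16000       # Hz
MIC_SPACING = 0.08        # m, assumed distance between the two microphones

def delay_and_sum(mic_signals: np.ndarray, target_azimuth_deg: float) -> np.ndarray:
    """Steer a virtual beam toward target_azimuth_deg (0 = broadside) and sum.

    mic_signals has shape (2, num_samples), one row per microphone; the
    return value is a mono, beamformed signal of shape (num_samples,).
    """
    num_samples = mic_signals.shape[1]

    # Time difference of arrival between the two microphones for the target angle.
    tdoa = MIC_SPACING * np.sin(np.radians(target_azimuth_deg)) / SPEED_OF_SOUND

    # Apply a fractional delay to the second channel in the frequency domain.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / SAMPLE_RATE)
    spectrum = np.fft.rfft(mic_signals[1])
    aligned = np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * tdoa), n=num_samples)

    # The coherent sum reinforces the target direction; off-axis sounds add incoherently.
    return 0.5 * (mic_signals[0] + aligned)

# Example with two channels of synthetic audio.
mics = np.random.randn(2, SAMPLE_RATE)
zoomed = delay_and_sum(mics, target_azimuth_deg=30.0)
```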
Moreover, signal enhancement in the present disclosure may also refer to generating a virtual audio beam in the direction of the target signal and/or a null beam in the direction of unwanted sound signals. As described within this disclosure, in some examples, beamforming may focus on the target signal using one or more generative networks. In other examples within this disclosure, the unwanted signals may be removed based on using one or more generative networks.
In another example, a mixture of audio signals includes different types of sounds (e.g., speech signals, directional noise, diffuse noise, non-stationary noise, speech from multiple speakers, etc.). Signal enhancement in the present disclosure may also refer to source separation, where the audio signals in the mixture are separated from each other. In other examples within this disclosure, the source separation of a mixture of audio signals may be based on using one or more generative networks.
In some examples, an audio signal, such as music audio, includes various frequency components. Signal enhancement in the present disclosure may refer to equalization, where balance of the frequency components is adjusted. In other examples within this disclosure, the equalization of frequency components of an audio signal may be based on using one or more generative networks.
Furthermore, generative audio techniques may be used to generate an enhanced mono audio signal. In some examples, the enhanced mono audio signal may be a noise suppressed speech signal (e.g., a speech signal captured by one or more microphones or decoded from encoded audio data), wherein noise has been partially or completely removed from the corresponding one or more input audio signals. As mentioned above, the noise suppression may involve one or more generative networks (e.g., GANs), as further described with respect to
In some examples, the enhanced mono audio signal may be an audio zoomed signal (e.g., a target signal captured by one or more microphones or decoded from encoded audio data), wherein the gain of the unwanted audio signals has been reduced, the gain of the target signal has been increased, or both. As mentioned above, the audio zoom may involve one or more generative networks (e.g., GANs), as further described with respect to
In some examples, the enhanced mono audio signal may be a beamformed signal (e.g., a target signal captured by one or more microphones or decoded from encoded audio data), wherein a virtual audio beam is formed by two or more microphones in the direction of the primary (target) audio signal and/or a null beam is formed in the direction of the secondary (unwanted) audio signals. As mentioned above, the beamforming may involve one or more generative networks (e.g., GANs), as further described with respect to
In some examples, the enhanced mono audio signal may be a dereverberated signal (e.g., a speech signal captured by one or more microphones or decoded from encoded audio data), wherein reverberation has been partially or completely removed from the corresponding one or more input audio signals. As mentioned above, the dereverberation may involve one or more generative networks (e.g., GANs), as further described with respect to
In some examples, the enhanced mono audio signal may be a source separated signal (e.g., a target signal from a particular audio source captured by one or more microphones or decoded from encoded audio data), wherein unwanted audio signals have been partially or completely removed from the corresponding one or more input audio signals. As mentioned above, the source separation may involve one or more generative networks (e.g., GANs), as further described with respect to
In some examples, the enhanced mono audio signal may be a bass adjusted signal (e.g., a music signal captured by one or more microphones or decoded from encoded audio data), wherein bass has been increased or reduced from the corresponding one or more input audio signals. As mentioned above, the bass adjustment may involve one or more generative networks (e.g., GANs), as further described with respect to
In some examples, the enhanced mono audio signal may be an equalized signal (e.g., a music signal captured by one or more microphones or decoded from encoded audio data), wherein balance of different frequency components is adjusted from the corresponding one or more input audio signals. As mentioned above, the equalization may involve one or more generative networks (e.g., GANs), as further described with respect to
Systems and methods of audio signal enhancement are disclosed. In an illustrative example, a signal enhancer performs signal enhancement of an input audio signal to generate an enhanced mono audio signal. As described above, the enhanced mono audio signal may be an enhanced speech signal. The enhanced speech signal may be associated with a single/particular speaker. The enhanced speech signal may be a mono audio signal generated from one or more input audio signals. The one or more input audio signals may be captured using one or more microphones as described in more detail below. The enhanced mono audio signal is a single-channel audio signal.
As mentioned above, the signal enhancement can include at least one of noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization. Audio context in the present disclosure may refer to ancillary or secondary audio signals and/or signal components, wherein the enhanced mono audio signal represents the primary audio signal or signal component, such as a speech signal associated with an individual/particular speaker. As described above, the enhanced mono audio signal may be a processed input audio signal, e.g., filtered or beamformed, or a synthetic audio signal, e.g., generated using generative networks based on the input audio signal. In some examples, the primary audio signal or signal component refers to an audio signal, such as a speech signal, from a specific audio source, such as an individual/particular speaker, received on a direct path from the audio source to one or more microphones (e.g., without reverberations or environmental noise). The audio context on the other hand may refer to audio signals and/or signal components other than the directly received audio signal from the particular audio source. In some examples, the audio context may include reverberated/reflected speech signals originating from the individual/particular speaker, speech signals from speakers other than the particular speaker, diffuse background noise, locally emanating or directional noise (e.g., from a moving vehicle), or combinations thereof.
An audio mixer generates a first audio signal that is based on the enhanced mono audio signal. In some examples, the first audio signal is the same as the enhanced mono audio signal. In other examples, the audio mixer can perform additional processing, such as panning or binauralization, on the enhanced mono audio signal to generate the first audio signal. In some aspects, the first audio signal corresponds to an enhanced audio signal with reduced audio context. To add such context, the audio mixer generates a second audio signal based on at least one of a directional audio signal, a background audio signal, or a reverberation signal. The audio mixer mixes the first audio signal and the second audio signal to generate a stereo audio signal. The stereo audio signal thus balances the signal enhancement included in the first audio signal with the audio context included in the second audio signal.
Particular aspects of the present disclosure are described below with reference to the drawings. In the description, common features are designated by common reference numbers. As used herein, various terminology is used for the purpose of describing particular implementations only and is not intended to be limiting of implementations. For example, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Further, some features described herein are singular in some implementations and plural in other implementations. To illustrate,
In some drawings, multiple instances of a particular type of feature are used. Although these features are physically and/or logically distinct, the same reference number is used for each, and the different instances are distinguished by addition of a letter to the reference number. When the features as a group or a type are referred to herein, e.g., when no particular one of the features is being referenced, the reference number is used without a distinguishing letter. However, when one particular feature of multiple features of the same type is referred to herein, the reference number is used with the distinguishing letter. For example, referring to
As used herein, the terms “comprise,” “comprises,” and “comprising” may be used interchangeably with “include,” “includes,” or “including.” Additionally, the term “wherein” may be used interchangeably with “where.” As used herein, “exemplary” indicates an example, an implementation, and/or an aspect, and should not be construed as limiting or as indicating a preference or a preferred implementation. As used herein, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). As used herein, the term “set” refers to one or more of a particular element, and the term “plurality” refers to multiple (e.g., two or more) of a particular element. As used herein, A “and/or” B may mean that either “A and B”, or “A or B”, or both “A and B” and “A or B” are applicable or acceptable.
As used herein, “coupled” may include “communicatively coupled,” “electrically coupled,” or “physically coupled,” and may also (or alternatively) include any combinations thereof. Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc. Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples. In some implementations, two devices (or components) that are communicatively coupled, such as in electrical communication, may send and receive signals (e.g., digital signals or analog signals) directly or indirectly, via one or more wires, buses, networks, etc. As used herein, “directly coupled” may include two devices that are coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) without intervening components.
In the present disclosure, terms such as “determining,” “calculating,” “estimating,” “shifting,” “adjusting,” etc. may be used to describe how one or more operations are performed. It should be noted that such terms are not to be construed as limiting and other techniques may be utilized to perform similar operations. Additionally, as referred to herein, “generating,” “calculating,” “estimating,” “using,” “selecting,” “accessing,” and “determining” may be used interchangeably. For example, “generating,” “calculating,” “estimating,” or “determining” a parameter (or a signal) may refer to actively generating, estimating, calculating, or determining the parameter (or the signal) or may refer to using, selecting, or accessing the parameter (or signal) that is already generated, such as by another component or device.
Referring to
The audio analyzer 140 is configured to obtain the one or more input audio signals 125 that represent a soundscape with sounds 185 of one or more audio sources 184. In some implementations, the one or more input audio signals 125 correspond to microphone output of the one or more microphones 120, decoded audio data, an audio stream, or a combination thereof. For example, the audio analyzer 140 is configured to receive a first input audio signal 125 from a first microphone of the one or more microphones 120, and to receive a second input audio signal 125 from a second microphone of the one or more microphones 120. In another example, the first input audio signal 125 corresponds to a first audio channel of the decoded audio data (or the audio stream), and the second input audio signal 125 corresponds to a second audio channel of the decoded audio data (or the audio stream).
In a particular aspect, the audio analyzer 140 is configured to obtain image data 127 that represents a visual scene associated with (e.g., including) the one or more audio sources 184. In some examples, the image data 127 is based on camera output, decoded image data, stored image data, a graphic visual stream, or a combination thereof.
The audio analyzer 140 includes a signal enhancer 142 coupled to an audio mixer 148. The audio analyzer 140 also includes a directional analyzer 144 coupled to the signal enhancer 142, the audio mixer 148, or both. In some implementations, the audio analyzer 140 includes a context analyzer 146 coupled to the directional analyzer 144, the audio mixer 148, or both. In some implementations, the audio analyzer 140 includes a location sensor 162 coupled to the context analyzer 146.
The context analyzer 146 is configured to process the image data 127 to generate a visual context 147 of the one or more input audio signals 125. The visual context 147 may include any information associated with the one or more input audio signals 125, other than the input audio signals 125 themselves, that can be derived from the image data 127. Such information may include but is not limited to (relative) location(s) of audio source(s) in the visual scene, (relative) location(s) of microphone(s) in the visual scene, acoustic characteristics of the soundscape, such as open space, confined/closed space, room geometry, etc., or the like. In some examples, the visual context 147 indicates a location (e.g., an elevation and azimuth) of an audio source 184A (e.g., a person) of the one or more audio sources 184 in a visual scene represented by the image data 127. In some examples, the visual context 147 indicates an environment of the visual scene represented by the image data 127. For example, the visual context 147 is based on surfaces of an acoustic environment, room geometry, or both. In some implementations, the context analyzer 146 is configured to use a neural network 156 to process at least the image data 127 to generate the visual context 147. In some examples, one or more operations described herein as performed by a neural network may be performed using a machine learned network, such as an artificial neural network (ANN), other types of machine learned networks (e.g., based on fuzzy logic, evolutionary programming, and/or genetic algorithms), etc.
The context analyzer 146 is configured to obtain location data 163 indicating a location of a soundscape represented by the one or more input audio signals 125, a visual scene represented by the image data 127, or both. In some implementations, the location sensor 162 (e.g., a global positioning system (GPS) sensor) is configured to generate the location data 163. In other implementations, the location data 163 is generated by an application, received from another device, or both.
The context analyzer 146 is configured to process the location data 163 to generate a location context 137 of the one or more input audio signals 125. For example, the location context 137 indicates a location, a location type, or both, of a soundscape represented by the one or more input audio signals 125, a visual scene represented by the image data 127, or both. In some implementations, the context analyzer 146 uses the neural network 156 to process at least the location data 163 to generate the location context 137.
The directional analyzer 144 is configured to perform directional audio coding (DirAC) on the one or more input audio signals 125 to generate one or more directional audio signals 165, a background audio signal 167, or both. The one or more directional audio signals 165 correspond to directional sounds (e.g., speech or car noise) and the background audio signal 167 corresponds to diffuse noise (e.g., wind noise or background traffic) in the soundscape. In some implementations, the directional analyzer 144 is configured to use a neural network 154 to perform the DirAC on the one or more input audio signals 125.
The signal enhancer 142 is configured to perform signal enhancement to generate one or more enhanced mono audio signals 143. The signal enhancement is performed on the one or more input audio signals 125, the one or more directional audio signals 165, or a combination thereof, to generate the one or more enhanced mono audio signals 143. For example, the signal enhancer 142 performs signal enhancement on a first input audio signal 125 to generate a first enhanced mono audio signal 143, and performs signal enhancement on a second input audio signal 125 to generate a second enhanced mono audio signal 143. As another example, the signal enhancer 142 performs signal enhancement on a directional audio signal 165A to generate a first enhanced mono audio signal 143, and performs signal enhancement on a directional audio signal 165B to generate a second enhanced mono audio signal 143. The signal enhancement includes at least one of noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization. In a particular implementation, the signal enhancer 142 is configured to use a neural network 152 to perform the signal enhancement. For example, the signal enhancer 142 may be configured to use the neural network 152 (e.g., a GAN) to perform generative audio techniques to generate the one or more enhanced mono audio signals 143. To illustrate, the signal enhancer 142 may use the neural network 152 to partially or completely remove noise for noise suppression, to adjust signal gains for audio zoom, to perform beamforming for audio focus, to perform dereverberation for removing the effects of reverberation, to partially or fully separate audio for source separation, to perform bass adjustment for increasing or decreasing bass, to perform equalization for adjusting a balance of different frequency components, or a combination thereof.
The audio mixer 148 is configured to generate one or more enhanced audio signals 151 based on the one or more enhanced mono audio signals 143, and to generate one or more audio signals 155 based on the one or more directional audio signals 165, the background audio signal 167, the location context 137, the visual context 147, or a combination thereof, as further described with reference to
In some implementations, the device 102 corresponds to or is included in one of various types of devices. In an illustrative example, the one or more processors 190 are integrated in a headset device, such as described further with reference to
During operation, the audio analyzer 140 obtains one or more input audio signals 125. In an example, the one or more input audio signals 125 correspond to the sounds 185 of the one or more audio sources 184 captured by the one or more microphones 120. In another example, the one or more processors 190 are configured to decode encoded audio data to generate the one or more input audio signals 125, as further described with reference to
In an example, the one or more audio sources 184 include an audio source 184A, an audio source 184B, an audio source 184C, one or more additional audio sources, or a combination thereof. To illustrate, the sounds 185 include speech from the audio source 184A (e.g., a person), directional noise from the audio source 184B (e.g., a car driving by), diffuse noise from the audio source 184C (e.g., leaves moving in the wind), or a combination thereof. In some examples, the audio source 184C can be invisible, such as wind. In some examples, the audio source 184C can correspond to multiple audio sources, such as traffic or leaves, that together correspond to diffuse noise. In some examples, sound from the audio source 184C can be directionless or all around.
In a particular aspect, the audio analyzer 140 obtains image data 127 representing a visual scene associated with (e.g., including) the one or more audio sources 184. In an example, the audio analyzer 140 is configured to receive the image data 127 from the one or more cameras 130 concurrently with receiving the one or more input audio signals 125 from the one or more microphones 120. In another example, the one or more processors 190 are configured to receive encoded data from another device and to decode the encoded data to generate the image data 127, as further described with reference to
In some implementations, the context analyzer 146 determines, based on the image data 127, a visual context 147 of the one or more input audio signals 125. In an illustrative example, the context analyzer 146 performs audio source detection (e.g., face detection) on the image data 127 to generate the visual context 147 indicating a location (e.g., an elevation and azimuth) of the audio source 184A (e.g., a person) of the one or more audio sources 184 in a visual scene represented by the image data 127. In a particular aspect, the visual context 147 indicates the location of the audio source 184A relative to a location of a notional listener, e.g., represented by the one or more microphones 120, in the visual scene.
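For illustration only, the following Python sketch shows one way the center of a detected face bounding box might be mapped to an azimuth and elevation for the visual context 147. The camera field-of-view values, the simple linear mapping, and the function names are assumptions for this example.

```python
# Illustrative sketch (assumed camera parameters): map the center of a detected
# face bounding box to an azimuth/elevation estimate for the visual context.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x: float        # left edge, pixels
    y: float        # top edge, pixels
    width: float
    height: float

def box_to_direction(box: BoundingBox,
                     image_width: int, image_height: int,
                     horizontal_fov_deg: float = 70.0,
                     vertical_fov_deg: float = 45.0) -> tuple[float, float]:
    """Return (azimuth_deg, elevation_deg) relative to the camera axis."""
    center_x = box.x + box.width / 2.0
    center_y = box.y + box.height / 2.0
    # Normalized offsets in [-0.5, 0.5] from the image center.
    dx = center_x / image_width - 0.5
    dy = 0.5 - center_y / image_height      # image y grows downward
    azimuth = dx * horizontal_fov_deg       # simple linear mapping, assumed
    elevation = dy * vertical_fov_deg
    return azimuth, elevation

# Example: a face detected near the right edge of a 1920x1080 frame.
print(box_to_direction(BoundingBox(1500, 400, 200, 200), 1920, 1080))
```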
In a particular aspect, the visual context 147 is based on a user input 103 indicating a position (e.g., location, orientation, tilt, or a combination thereof) of the user 101 (e.g., the head of the user 101). To illustrate, the user input 103 can correspond to sensor data indicating movement of a headset, camera output representing an image of the user 101, or both. In some implementations, the visual context 147 indicates an environment of the visual scene represented by the image data 127. For example, the visual context 147 is based on surfaces of an environment, room geometry, or both.
In some implementations, the context analyzer 146 obtains location data 163 indicating a location of a soundscape represented by the one or more input audio signals 125, a visual scene represented by the image data 127, or both. In some implementations, the context analyzer 146 processes the image data 127 based on the location data 163 to determine the visual context 147. In an illustrative example, the context analyzer 146 processes the image data 127 and detects multiple faces. The context analyzer 146, in response to determining that the location data 163 indicates an art gallery, performs an analysis of the image data 127 to distinguish between a painted face and an actual person to generate the visual context 147 indicating a location of the person as the audio source 184A.
In some aspects, the audio analyzer 140 determines, based on the location data 163, a location context 137 of the one or more input audio signals 125. The location context 137 indicates a particular location indicated by the location data 163, a location type of the particular location, or both. The location indicated by the location data 163 can correspond to a geographical location, a virtual location, or both. The location type can indicate an open space or a closed or confined space. Non-limiting examples of the location type may be indoors, outdoors, office, playground, park, aircraft interior, vehicle interior, etc.
In some examples, the location data 163 corresponds to GPS data indicating a location of the device 102, the one or more microphones 120, another device used to generate the one or more input audio signals 125, or a combination thereof. The context analyzer 146 generates the location context 137 indicating the location, location type, or both, associated with the GPS data. As another example, an application (e.g., a gaming application) of the one or more processors 190 generates the location data 163 indicating a virtual location of the soundscape represented by the one or more input audio signals 125. The context analyzer 146 generates the location context 137 indicating the virtual location (e.g., “training hall”), a type of the virtual location (e.g., “large room”), or both.
In some implementations, the location data 163 corresponds to user data (e.g., calendar data, user login data, etc.) indicating that a user 101 of the device 102 is at a particular location when the one or more input audio signals 125 are generated by the one or more microphones 120 of the device 102. In a particular aspect, the location data 163 corresponds to user data (e.g., calendar data, user login data, etc.) indicating that a second user (e.g., the audio source 184A) is at a particular location when the one or more input audio signals 125 are received from a second device of the second user. The context analyzer 146 generates the location context 137 indicating the particular location (e.g., Grand Bazaar), type of the particular location (e.g., covered market), or both.
In some examples, the context analyzer 146 processes the location data 163 based on an image analysis of the image data 127, an audio analysis of the one or more input audio signals 125, or both, to generate the location context 137. As an illustrative example, the context analyzer 146, in response to determining that the location data 163 indicates a highway and the image data 127 indicates an interior of a vehicle, generates the location context 137 indicating a location type corresponding to a vehicle interior on a highway.
The directional analyzer 144 performs DirAC on the one or more input audio signals 125 to generate one or more directional audio signals 165, a background audio signal 167, or both. In some aspects, the one or more directional audio signals 165 correspond to directional sounds and the background audio signal 167 corresponds to diffuse noise. In an illustrative example, a directional audio signal 165A represents the speech from an audio source 184A (e.g., a person), a directional audio signal 165B represents directional noise from an audio source 184B (e.g., a car driving by), and the background audio signal 167 represents diffuse noise from an audio source 184C (e.g., leaves moving in the wind). In some aspects, a particular sound direction of the sounds represented by the directional audio signal 165A can change over time. In an illustrative example, the direction of the sounds represented by the directional audio signal 165A changes as the audio source 184B (e.g., the car) moves relative to a notional listener in a soundscape represented by the one or more input audio signals 125.
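For illustration only, the following Python sketch outlines a highly simplified DirAC-style analysis of one time-frequency tile of a first-order (B-format) capture: an intensity vector gives a direction estimate, a diffuseness estimate splits the energy, and the omnidirectional channel is divided into directional and diffuse parts. The scaling constants and the single-tile formulation are assumptions; practical implementations average these quantities over time and frequency.

```python
# Illustrative, highly simplified sketch of a DirAC-style analysis for one
# time-frequency tile of a first-order (B-format) capture. Scaling conventions
# vary between formulations; the constants here are assumptions.
import numpy as np

def dirac_tile(w: complex, x: complex, y: complex, z: complex):
    """Return (azimuth_deg, elevation_deg, diffuseness, directional_w, diffuse_w)."""
    # Active intensity vector (up to a constant factor) points toward the source.
    intensity = np.real(np.conj(w) * np.array([x, y, z]))
    azimuth = np.degrees(np.arctan2(intensity[1], intensity[0]))
    elevation = np.degrees(np.arctan2(intensity[2], np.linalg.norm(intensity[:2]) + 1e-12))

    # Diffuseness in [0, 1]: 0 = purely directional, 1 = fully diffuse.
    energy = 0.5 * (abs(w) ** 2 + (abs(x) ** 2 + abs(y) ** 2 + abs(z) ** 2) / 2.0)
    diffuseness = float(np.clip(1.0 - np.linalg.norm(intensity) / (2.0 * energy + 1e-12),
                                0.0, 1.0))

    # Split the omnidirectional channel into directional and diffuse streams.
    directional_w = np.sqrt(1.0 - diffuseness) * w
    diffuse_w = np.sqrt(diffuseness) * w
    return azimuth, elevation, diffuseness, directional_w, diffuse_w
```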
In some implementations, the directional analyzer 144 generates the directional audio signal 165A based on audio source detection data (e.g., face detection data) indicated by the visual context 147. For example, the visual context 147 indicates an estimated location (e.g., absolute location or relative location) of the audio source 184A in a visual scene represented by the image data 127 and the directional analyzer 144 generates the directional audio signal 165A based on sounds corresponding to the estimated location. As another example, the directional analyzer 144, in response to determining that the visual context 147 indicates a particular audio source type (e.g., a face), performs analysis corresponding to the particular audio source type (e.g., speech separation) to generate the directional audio signal 165A from the one or more input audio signals 125. As another example, the directional analyzer 144, in response to determining a (relative) location of an audio source, e.g., audio source 184A, performs spatial filtering or beamforming of a plurality of input audio signals captured by a plurality of microphones 120 to spatially isolate an audio signal from the audio source. In a particular example, the directional analyzer 144, in response to determining a (relative) location of an audio source, e.g., audio source 184A, performs gain adjustment of a plurality of input audio signals captured by a plurality of microphones 120 to perform an audio zoom of an audio signal from the audio source.
In some implementations, the directional analyzer 144 provides the one or more directional audio signals 165 to the signal enhancer 142 and the background audio signal 167 to the audio mixer 148. In these implementations, the signal enhancer 142 performs signal enhancement of the one or more directional audio signals 165 to generate the one or more enhanced mono audio signals 143. The audio mixer 148 generates the one or more stereo audio signals 149 based on the one or more enhanced mono audio signals 143 and the background audio signal 167.
In some implementations, the directional analyzer 144 provides the one or more directional audio signals 165, the background audio signal 167, or a combination thereof, to the audio mixer 148. In these implementations, the signal enhancer 142 performs signal enhancement of the one or more input audio signals 125 to generate the one or more enhanced mono audio signals 143. The audio mixer 148 generates the one or more stereo audio signals 149 based on the one or more enhanced mono audio signals 143 and further based on the one or more directional audio signals 165, the background audio signal 167, or a combination thereof.
The signal enhancer 142 performs signal enhancement to generate the one or more enhanced mono audio signals 143. The signal enhancement includes at least one of noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization. In some examples, the signal enhancer 142 selects the signal enhancement from at least one of noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization, and performs the selected signal enhancement to generate the one or more enhanced mono audio signals 143. The signal enhancer 142 can select the signal enhancement based on a user input 103 received from the user 101, a configuration setting, default data, or a combination thereof.
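For illustration only, the following Python sketch shows how the choice of signal enhancement might be resolved from a user input, a configuration setting, or a default. The dictionary keys, placeholder operations, and function names are hypothetical.

```python
# Illustrative sketch (hypothetical names): selecting which enhancement the
# signal enhancer applies, preferring user input, then a configuration
# setting, then a default.
from typing import Callable, Optional
import numpy as np

ENHANCEMENTS: dict[str, Callable[[np.ndarray], np.ndarray]] = {
    "noise_suppression": lambda x: x,   # placeholders for the real operations
    "audio_zoom": lambda x: x,
    "beamforming": lambda x: x,
    "dereverberation": lambda x: x,
    "source_separation": lambda x: x,
    "bass_adjustment": lambda x: x,
    "equalization": lambda x: x,
}

def select_enhancement(user_input: Optional[str],
                       config_setting: Optional[str],
                       default: str = "noise_suppression") -> str:
    # First valid choice wins: user input, then configuration, then default.
    for choice in (user_input, config_setting, default):
        if choice in ENHANCEMENTS:
            return choice
    return default

def enhance(signal: np.ndarray, user_input=None, config_setting=None) -> np.ndarray:
    return ENHANCEMENTS[select_enhancement(user_input, config_setting)](signal)
```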
The audio mixer 148 receives the one or more enhanced mono audio signals 143 from the signal enhancer 142. The audio mixer 148 also receives at least one of the one or more directional audio signals 165, the background audio signal 167, the location context 137, the visual context 147, or a combination thereof. The audio mixer 148 generates one or more enhanced audio signals 151 based on the one or more enhanced mono audio signals 143, and generates one or more audio signals 155 based on the one or more directional audio signals 165, the background audio signal 167, the visual context 147, or a combination thereof. In some implementations, the one or more enhanced audio signals 151 are the same as the one or more enhanced mono audio signals 143, as further described with reference to
In some implementations, the one or more audio signals 155 correspond to delay and attenuation applied to the background audio signal 167, as further described with reference to
The audio mixer 148 mixes the one or more enhanced audio signals 151 and the one or more audio signals 155 to generate one or more stereo audio signals 149. In a particular implementation, the audio mixer 148 receives an enhanced mono audio signal 143A (e.g., enhanced first microphone sound, such as noise suppressed speech), an enhanced mono audio signal 143B (e.g., enhanced second microphone sound, such as noise suppressed speech), the directional audio signal 165A (e.g., the speech), the directional audio signal 165B (e.g., the directional noise), the background audio signal 167 (e.g., the diffuse noise), or a combination thereof. In an illustrative example of this implementation, the audio mixer 148 mixes the one or more enhanced audio signals 151 corresponding to the noise suppressed speech with the one or more audio signals 155 corresponding to the directional noise or corresponding to a reverberation signal that is based on the directional noise or diffuse noise. The one or more stereo audio signals 149 include less noise (e.g., no diffuse noise) than the one or more input audio signals 125 while providing more audio context (e.g., directional noise, reverberation based on background noise, etc.) than the one or more enhanced mono audio signals 143 (e.g., noise suppressed speech).
In an alternative implementation, the signal enhancer 142 performs signal enhancement on the directional audio signal 165A and the directional audio signal 165B to generate the enhanced mono audio signal 143A (e.g., enhanced speech) and the enhanced mono audio signal 143B (e.g., signal enhanced directional noise, such as noise suppressed silence), respectively. The audio mixer 148 receives the enhanced mono audio signal 143A (e.g., enhanced speech), the enhanced mono audio signal 143B (e.g., noise suppressed silence), and the background audio signal 167 (e.g., the diffuse noise). In an illustrative example of this implementation, the audio mixer 148 receives and mixes the one or more enhanced audio signals 151 corresponding to noise suppressed speech and silence with the one or more audio signals 155 corresponding to a reverberation signal that is based on the diffuse noise. The stereo audio signals 149 include less noise than the input audio signals 125 (e.g., no background noise) while providing more audio context (e.g., reverberation based on the diffuse noise) than the one or more enhanced mono audio signals 143 (e.g., noise suppressed speech).
The system 100 thus balances signal enhancement performed by the signal enhancer 142 and audio context associated with the one or more input audio signals 125 in generating the one or more stereo audio signals 149. For example, the one or more stereo audio signals 149 can include directional noise or reverberation that provides audio context to a listener while removing at least some of the background noise (e.g., diffuse noise or all background noise).
Optionally, in some implementations, the signal enhancer 142 selects one or more of the signal enhancements (e.g., noise suppression, the audio zoom, the beamforming, the dereverberation, the source separation, the bass adjustment, or the equalization), and a second signal enhancer performs one or more remaining ones of the signal enhancements. In a particular aspect, the second signal enhancer is a component that is external to the signal enhancer 142 (as the directional analyzer 144 is external to the signal enhancer 142). In these implementations, particular signal enhancement by the signal enhancer 142 can be performed before or after other signal enhancement by the second signal enhancer. To illustrate, an input of the signal enhancer 142 can be based on an output of the second signal enhancer, an input of the second signal enhancer can be based on an output of the signal enhancer 142, or both. For example, the second signal enhancer performs particular signal enhancement on the one or more input audio signals 125 or the directional audio signals 165 to generate one or more second enhanced mono audio signals, and the signal enhancer 142 performs other signal enhancement on the one or more second enhanced mono audio signals to generate the one or more enhanced mono audio signals 143. In another example, the second signal enhancer performs additional signal enhancement on the one or more enhanced mono audio signals 143 to generate one or more second enhanced mono audio signals, and the audio mixer 148 generates the one or more enhanced audio signals 151 based on the one or more second enhanced mono audio signals.
Although the one or more microphones 120 and the one or more cameras 130 are shown as external to the device 102, in other implementations at least one of the one or more microphones 120 or the one or more cameras 130 can be integrated in the device 102. Although the one or more input audio signals 125 are illustrated as corresponding to microphone output of the one or more microphones 120, in other implementations the one or more input audio signals 125 can correspond to decoded audio data, an audio stream, stored audio data, or a combination thereof. Although the image data 127 is illustrated as corresponding to camera output of the one or more cameras 130, in other implementations the image data 127 can correspond to decoded image data, an image stream, stored image data, or a combination thereof.
Although the audio analyzer 140 is illustrated as included in a single device (e.g., the device 102), two or more components of the audio analyzer 140 can be distributed across multiple devices. For example, the signal enhancer 142, the directional analyzer 144, the context analyzer 146, the location sensor 162, or a combination thereof can be integrated in a first device (e.g., a user playback device) and the audio mixer 148 can be integrated in a second device (e.g., a headset).
Optionally, one or more operations described herein with reference to components of the audio analyzer 140 can be performed by neural networks. In one such example, the signal enhancer 142 uses the neural network 152 (e.g., a speech generative network) to perform signal enhancement to generate the enhanced mono audio signal 143A representing an enhanced version of the speech of the audio source 184A. In another example, the directional analyzer 144 uses the neural network 154 to process the one or more input audio signals 125 to generate the one or more directional audio signals 165, the background audio signal 167, or a combination thereof. In yet another example, the context analyzer 146 uses the neural network 156 to process the image data 127, the location data 163, or both, to generate the visual context 147, the location context 137, or both. In some examples, the neural network 156 includes a first neural network and a second neural network to process the image data 127, the location data 163, or both, to generate the visual context 147 and the location context 137, respectively. In an example, the audio mixer 148 uses a neural network to generate the one or more stereo audio signals 149, as further described with reference to
Referring to
The signal enhancer 142 performs the noise suppression 132 to, partially or completely, remove noise from the one or more input audio signals 115A to generate the one or more enhanced audio signals 133A (e.g., enhanced mono audio signal(s)). The one or more input audio signals 115A are based on the one or more input audio signals 125 or the one or more directional audio signals 165. For example, the one or more input audio signals 115A are the one or more input audio signals 125. As another example, the one or more input audio signals 115A are the one or more directional audio signals 165. In yet another example, the one or more input audio signals 115A include one or more enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the one or more enhanced audio signals 133A correspond to one or more noise suppressed speech signals. Optionally, in some implementations, the signal enhancer 142 uses a neural network 152A to perform the noise suppression 132. Thus, the one or more enhanced audio signals 133A may be one or more mono speech signals with noise suppressed by application of the neural network 152A. The one or more enhanced mono audio signals 143 are based on the one or more enhanced audio signals 133A.
Referring to
The signal enhancer 142 performs the audio zoom 134 to reduce the gain of an unwanted audio signal of the input audio signals 115B, increase the gain of a target audio signal of the input audio signals 115B, or both, to generate the enhanced audio signals 133B (e.g., enhanced mono audio signals). The enhanced audio signals 133B correspond to audio zoomed signals. The input audio signals 115B are based on the input audio signals 125 or the directional audio signals 165. For example, the input audio signals 115B are the input audio signals 125. As another example, the input audio signals 115B are the directional audio signals 165. In yet another example, the input audio signals 115B include enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the enhanced audio signals 133B correspond to zoomed speech signals. Optionally, in some implementations, the signal enhancer 142 uses a neural network 152B to perform the audio zoom 134. Thus, the enhanced audio signals 133B may be mono speech signals with zoomed audio generated by application of the neural network 152B. The one or more enhanced mono audio signals 143 are based on the enhanced audio signals 133B.
Referring to
The signal enhancer 142 performs the beamforming 136 to form a virtual beam in the direction of a primary (e.g., target) audio source and/or a null beam in the direction of secondary (e.g., unwanted) audio sources to generate the enhanced audio signals 133C (e.g., enhanced mono audio signals). The enhanced audio signals 133C correspond to beamformed audio signals. The input audio signals 115C are based on the input audio signals 125 or the directional audio signals 165. For example, the input audio signals 115C are the input audio signals 125. As another example, the input audio signals 115C are the directional audio signals 165. In yet another example, the input audio signals 115C include enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the enhanced audio signals 133C correspond to beamformed speech signals. Optionally, in some implementations, the signal enhancer 142 uses a neural network 152C to perform the beamforming 136. Thus, the enhanced audio signals 133C may be mono speech signals with beamformed audio generated by application of the neural network 152C. The one or more enhanced mono audio signals 143 are based on the enhanced audio signals 133C.
Referring to
The signal enhancer 142 performs the dereverberation 138 to, partially or completely, remove reverberation from the one or more input audio signals 115D to generate the one or more enhanced audio signals 133D (e.g., enhanced mono audio signal(s)). The one or more input audio signals 115D are based on the one or more input audio signals 125 or the one or more directional audio signals 165. For example, the one or more input audio signals 115D are the one or more input audio signals 125. As another example, the one or more input audio signals 115D are the one or more directional audio signals 165. In yet another example, the one or more input audio signals 115D include one or more enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the one or more enhanced audio signals 133D correspond to one or more dereverberated speech signals. Optionally, in some implementations, the signal enhancer 142 uses a neural network 152D to perform the dereverberation 138. Thus, the one or more enhanced audio signals 133D may be one or more mono speech signals with dereverberation generated by application of the neural network 152D. The one or more enhanced mono audio signals 143 are based on the one or more enhanced audio signals 133D.
Referring to
The signal enhancer 142 performs the source separation 150 to, partially or completely, remove sounds of secondary (e.g., unwanted) audio sources from the one or more input audio signals 115E to generate the one or more enhanced audio signals 133E (e.g., enhanced mono audio signal(s)). The one or more input audio signals 115E are based on the one or more input audio signals 125 or the one or more directional audio signals 165. For example, the one or more input audio signals 115E are the one or more input audio signals 125. As another example, the one or more input audio signals 115E are the one or more directional audio signals 165. In yet another example, the one or more input audio signals 115E include one or more enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the one or more enhanced audio signals 133E correspond to source separated speech signals. Optionally, in some implementations, the signal enhancer 142 uses a neural network 152E to perform the source separation 150. Thus, the one or more enhanced audio signals 133E may be one or more mono speech signals with source separated audio generated by application of the neural network 152E. The one or more enhanced mono audio signals 143 are based on the one or more enhanced audio signals 133E.
Referring to
The signal enhancer 142 performs the bass adjustment 158 to increase or decrease bass in the one or more input audio signals 115F to generate the one or more enhanced audio signals 133F (e.g., enhanced mono audio signal(s)). The one or more input audio signals 115F are based on the one or more input audio signals 125 or the one or more directional audio signals 165. For example, the one or more input audio signals 115F are the one or more input audio signals 125. As another example, the one or more input audio signals 115F are the one or more directional audio signals 165. In yet another example, the one or more input audio signals 115F include one or more enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the one or more enhanced audio signals 133F correspond to one or more bass adjusted speech signals. Optionally, in some implementations, the signal enhancer 142 uses a neural network 152F to perform the bass adjustment 158. Thus, the one or more enhanced audio signals 133F may be one or more mono speech signals with bass adjusted by application of the neural network 152F. The one or more enhanced mono audio signals 143 are based on the one or more enhanced audio signals 133F.
Referring to
The signal enhancer 142 performs the equalization 160 to adjust a balance of various frequency components of the one or more input audio signals 115G to generate the one or more enhanced audio signals 133G (e.g., an enhanced mono audio signal). The one or more input audio signals 115G are based on the one or more input audio signals 125 or the one or more directional audio signals 165. For example, the one or more input audio signals 115G are the one or more input audio signals 125. As another example, the one or more input audio signals 115G are the one or more directional audio signals 165. In yet another example, the one or more input audio signals 115G include one or more enhanced audio signals generated by the signal enhancer 142, as described with reference to
In an example, the one or more enhanced audio signals 133G correspond to one or more equalized signals (e.g., music audio). Optionally, in some implementations, the signal enhancer 142 uses a neural network 152G to perform the equalization 160. Thus, the one or more enhanced audio signals 133G may be one or more mono speech signals with equalization performed by application of the neural network 152G. The one or more enhanced mono audio signals 143 are based on the one or more enhanced audio signals 133G.
Referring to
In the example illustrated in
The audio mixer 148A generates an audio signal 155 based on the background audio signal 167. For example, the audio mixer 148A applies a delay 202 to the background audio signal 167 to generate a delayed audio signal 203. In some aspects, an amount of the delay 202 is based on the user input 103 of
Additionally or alternatively, the audio mixer 148A performs an attenuate operation 216 including applying an attenuation factor 215 to the (delayed) audio signal 203 to generate an attenuated audio signal 217. In some aspects, the attenuation factor 215 is based on the user input 103 of
The audio mixer 148A mixes the attenuated audio signal 217 with each of the one or more enhanced audio signals 151 to generate the one or more stereo audio signals 149. For example, the audio mixer 148A mixes the attenuated audio signal 217 with each of the enhanced mono audio signal 143A and the enhanced mono audio signal 143B to generate a stereo audio signal 149A and a stereo audio signal 149B, respectively. The enhanced mono audio signal 143A may correspond to an enhanced left channel of a stereo audio signal, such as a stereo speech signal, and the enhanced mono audio signal 143B may correspond to an enhanced right channel of a stereo audio signal, such as the stereo speech signal. The one or more enhanced mono audio signals 143 include signal enhanced sounds (e.g., noise suppressed speech) and the audio signal 155 is based on the background audio signal 167 (e.g., diffuse noise from the leaves) and not based on the one or more directional audio signals 165 (e.g., the car noise). In this example, the one or more stereo audio signals 149 include speech and diffuse noise (e.g., from leaves) and do not include the directional noise (e.g., the car noise).
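For illustration only, the following Python sketch reproduces the mixing flow described above: the background (diffuse) signal is delayed and attenuated, and the result is mixed with the enhanced left and right mono signals to form a stereo output. The delay and attenuation values, sample rate, and function names are assumptions for this example.

```python
# Illustrative sketch (assumed parameter values): delay and attenuate the
# background (diffuse) signal, then mix it with the enhanced left/right mono
# signals to form a stereo output.
import numpy as np

SAMPLE_RATE = 16000

def mix_stereo(enhanced_left: np.ndarray,
               enhanced_right: np.ndarray,
               background: np.ndarray,
               delay_ms: float = 20.0,
               attenuation_factor: float = 0.3) -> np.ndarray:
    """Return a (2, num_samples) stereo signal."""
    num_samples = len(enhanced_left)

    # Delay the background signal (e.g., so it arrives as a later, room-like component).
    delay_samples = int(delay_ms * SAMPLE_RATE / 1000.0)
    delayed = np.concatenate([np.zeros(delay_samples), background])[:num_samples]

    # Attenuate the delayed background so it adds context without masking the speech.
    context = attenuation_factor * delayed

    # Mix the same context signal into each enhanced channel.
    left = enhanced_left + context
    right = enhanced_right + context
    return np.stack([left, right])

# Example with synthetic one-second signals.
t = np.linspace(0, 1, SAMPLE_RATE, endpoint=False)
speech_l = 0.5 * np.sin(2 * np.pi * 220 * t)
speech_r = 0.5 * np.sin(2 * np.pi * 220 * t)
diffuse = 0.1 * np.random.randn(SAMPLE_RATE)
stereo = mix_stereo(speech_l, speech_r, diffuse)
```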
Referring to
The audio mixer 148B includes a neural network 258 configured to process one or more inputs of the audio mixer 148A of
In a particular aspect, the neural network 258 includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. In some implementations, the one or more hidden layers include fully connected layers, one or more convolution layers, or both. In some implementations, the one or more hidden layers include at least one recurrent layer, such as a long short-term memory (LSTM) layer, a gated recurrent unit (GRU) layer, or another recurrent neural network structure.
In a particular implementation, the input layer of the neural network 258 includes at least one input node for each signal input to the neural network 258. For example, the input layer of the neural network 258 may include at least one input node for feature values derived from the enhanced mono audio signal 143A, at least one input node for feature values derived from the enhanced mono audio signal 143B, at least one input node for feature values derived from the background audio signal 167, and optionally at least one input node for feature values derived from each of the input audio signal(s) 125.
In a particular implementation, the output layer of the neural network 258 includes two nodes corresponding to a right-channel stereo output node and a left-channel stereo output node. For example, the left-channel stereo output node may output the stereo audio signal 149A and the right-channel stereo output node may output the stereo audio signal 149B.
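As one illustrative sketch (not the disclosed implementation), a recurrent mixing network with the input and output structure described above could be expressed as follows; the layer sizes, the choice of a GRU layer, the optimizer, and the use of a mean-squared-error loss against reference stereo outputs are assumptions.

```python
import torch
import torch.nn as nn

class NeuralStereoMixer(nn.Module):
    """Illustrative recurrent mixing network: per-frame feature vectors for
    the enhanced left mono signal, the enhanced right mono signal, and the
    background signal go in; two output values per frame (left- and
    right-channel stereo features) come out. Sizes are assumptions."""

    def __init__(self, feat_per_signal=16, hidden=64):
        super().__init__()
        in_dim = 3 * feat_per_signal            # features for 143A, 143B, and 167
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)         # left- and right-channel output nodes

    def forward(self, features):                # features: (batch, frames, in_dim)
        hidden_states, _ = self.rnn(features)
        return self.out(hidden_states)          # (batch, frames, 2)

# Training sketch: the stereo output of a rule-based mixer serves as the
# target, and a mean-squared-error loss metric drives the weight updates.
mixer = NeuralStereoMixer()
optimizer = torch.optim.Adam(mixer.parameters(), lr=1e-3)
features = torch.randn(8, 100, 48)              # placeholder input features
target = torch.randn(8, 100, 2)                 # placeholder reference stereo features
optimizer.zero_grad()
loss = nn.functional.mse_loss(mixer(features), target)
loss.backward()
optimizer.step()
```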
During training of the neural network 258, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
The neural network 258 is configured to process one or more inputs of the audio mixer 148A of
In the example illustrated in
In
In
Each time-windowed sample may be transformed to a frequency domain signal. Each frequency domain signal may be encoded as an N-bit, e.g., 16-bit, feature vector with N being a non-zero power of two. In this example, each sample of the background audio signal 167 is represented by 16 bits, and the input layer 270 may include 16 nodes to receive the 16 bits of the features of the background audio signal 167. In other examples, feature vectors larger than or smaller than 16 bits are used to represent one or more of the audio signals 143A, 143B, 167, or 125. Further, the feature vectors used to represent each of the audio signals 143A, 143B, 167, or 125 need not be of the same size. To illustrate, the enhanced audio signals 151 may be represented with higher fidelity (and a corresponding larger number of bits) than the background audio signal 167. In other words, the neural network 258 may include a larger number of input nodes per input signal for the one or more enhanced mono audio signals as compared to the other signals, e.g., the background audio signal.
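For illustration, one plausible way to derive fixed-size per-frame feature vectors from an audio signal is sketched below; the frame length, hop size, log-magnitude pooling, and the 16-value vector size (chosen to mirror the 16-node example above) are illustrative assumptions.

```python
import numpy as np

def frame_features(signal, frame_len=512, hop=256, n_features=16):
    """Window a signal, transform each frame to the frequency domain, and
    reduce it to a fixed-size feature vector per frame.

    A minimal sketch of one plausible featurization; it is not the
    encoding used by the neural network 258.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.fft.rfft(signal[start:start + frame_len] * window)
        log_mag = np.log1p(np.abs(spectrum))
        # pool the frequency bins down to n_features values per frame
        bands = np.array_split(log_mag, n_features)
        frames.append(np.array([band.mean() for band in bands]))
    return np.stack(frames)                     # shape: (num_frames, n_features)
```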
In the example of
The output layer 276 of the neural network 258 includes at least two nodes corresponding to output nodes for features of the stereo audio signal 149A and the stereo audio signal 149B. Optionally, each of the nodes of the output layer 276 may be associated with a respective one of the biases 272C. In
During training of the neural network 258, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
In the example illustrated in
The audio mixer 148A generates the one or more audio signals 155 based on the one or more directional audio signals 165. For example, the audio mixer 148A applies a delay 302 to the one or more directional audio signals 165 to generate one or more delayed audio signals 303. In some aspects, an amount of the delay 302 is based on the user input 103 of
Additionally or alternatively, the audio mixer 148A applies one or more panning operations 316 to the one or more (delayed) audio signals 303 to generate one or more panned audio signals 317. The one or more panned audio signals 317 correspond to the one or more audio signals 155. For example, a panned audio signal 317A and a panned audio signal 317B correspond to the audio signal 155A and the audio signal 155B, respectively. In an example, a pan factor generator 326 of the audio mixer 148A determines one or more pan factors 315 (e.g., gains, delays, or both) based on the visual context 147, a source direction selection 347, or both. As an example, the visual context 147 indicates a particular location (e.g., absolute location, relative location, or both) of the audio source 184A in a visual scene represented by the image data 127 of
In some examples, the audio analyzer 140 determines the source direction selection 347 based on hand gesture detection, head tracking, eye gaze direction, a user interface input, or a combination thereof. In a particular aspect, the source direction selection 347 corresponds to a user input 103 of
In a particular aspect, the particular location indicated by the visual context 147, the source direction selection 347, or both, corresponds to an estimated location of the audio source 184A, a target (e.g., desired) location of the audio source 184A, or both. The pan factor generator 326, in response to determining that the particular location (indicated by the visual context 147, the source direction selection 347, or both) corresponds to left of center in the visual scene, generates a pan factor 315A with a lower gain value than a pan factor 315B. The pan factor 315A (e.g., the lower gain value) is used to generate a stereo audio signal 149B (e.g., a right channel signal) and the pan factor 315B (e.g., the higher gain value) is used to generate the stereo audio signal 149A (e.g., a left channel signal). The audio mixer 148A performs a panning operation 316A, based on the pan factor 315A, on a particular delayed audio signal of the one or more delayed audio signals 303 to generate the panned audio signal 317A. For example, the particular delayed audio signal is based on the directional audio signal 165A that represents speech of the audio source 184A, and the audio mixer 148A performs the panning operation 316A on the particular delayed audio signal to generate the panned audio signal 317A. The audio mixer 148A mixes the enhanced mono audio signal 143B (e.g., enhanced second microphone sound) with the panned audio signal 317A to generate the stereo audio signal 149B (e.g., a right channel signal).
Additionally, the audio mixer 148A performs a panning operation 316B, based on the pan factor 315B, on the particular delayed audio signal to generate the panned audio signal 317B. The audio mixer 148A mixes the enhanced mono audio signal 143A with the panned audio signal 317B to generate the stereo audio signal 149A. The speech from the audio source 184A is thus more perceptible in the stereo audio signal 149A (e.g., the left channel signal) than in the stereo audio signal 149B (e.g., the right channel signal). In some implementations, the pan factors 315 dynamically change over time as the audio source 184A moves from left to right.
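As an illustrative sketch of how the pan factor generator 326 and the panning operations 316 could be realized, the following code maps a normalized horizontal source location to left/right gains using a constant-power panning law and mixes the panned directional signal with the enhanced mono channels; the constant-power law and the function names are assumptions for illustration.

```python
import numpy as np

def pan_factors_from_location(x_normalized):
    """Map a horizontal source location (0.0 = far left, 1.0 = far right)
    to left/right gains using constant-power panning.

    One plausible realization of the pan factors 315; the constant-power
    law is an illustrative assumption.
    """
    angle = x_normalized * (np.pi / 2.0)
    gain_left = np.cos(angle)      # larger when the source is left of center
    gain_right = np.sin(angle)     # smaller when the source is left of center
    return gain_left, gain_right

def pan_and_mix(directional_delayed, enhanced_left, enhanced_right, x_normalized):
    gain_left, gain_right = pan_factors_from_location(x_normalized)
    panned_left = gain_left * directional_delayed    # e.g., panned audio signal 317B
    panned_right = gain_right * directional_delayed  # e.g., panned audio signal 317A
    stereo_left = enhanced_left + panned_left        # e.g., stereo audio signal 149A
    stereo_right = enhanced_right + panned_right     # e.g., stereo audio signal 149B
    return stereo_left, stereo_right
```

With x_normalized below 0.5 (a source left of center), the right-channel gain is lower than the left-channel gain, matching the behavior described above.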
In some implementations, instead of panning the one or more (delayed) audio signals 303, the audio mixer 148A pans the one or more enhanced mono audio signals 143 based on the visual context 147, the source direction selection 347, or both, as described with reference to
Referring to
The audio mixer 148B includes a neural network 358 configured to process one or more inputs of the audio mixer 148A of
In a particular aspect, the neural network 358 includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. In some implementations, the one or more hidden layers include fully connected layers, one or more convolution layers, or both. In some implementations, the one or more hidden layers include at least one recurrent layer, such as an LSTM layer, a GRU layer, or another recurrent neural network structure.
In a particular implementation, the input layer of the neural network 358 includes at least one input node for each signal input to the neural network 358. For example, the input layer of the neural network 358 may include at least one input node for feature values derived from the enhanced mono audio signal 143A, at least one input node for feature values derived from the enhanced mono audio signal 143B, at least one input node for feature values derived from each of the directional audio signal(s) 165, and optionally at least one input node for feature values derived from the visual context 147 and/or the source direction selection 347.
In a particular implementation, the output layer of the neural network 358 includes two nodes corresponding to a right-channel stereo output node and a left-channel stereo output node. For example, the left-channel stereo output node may output the stereo audio signal 149A, and the right-channel stereo output node may output the stereo audio signal 149B.
During training of the neural network 358, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
In the example illustrated in
In
In
In the example of
The output layer 276 of the neural network 358 includes at least two nodes corresponding to output nodes for features of the stereo audio signal 149A and the stereo audio signal 149B. Optionally, each of the nodes of the output layer 276 may be associated with a respective one of the biases 272C. In
During training of the neural network 358, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
The audio mixer 148A performs, based on the visual context 147, the source direction selection 347, or both, one or more binauralization operations 416 on the one or more enhanced mono audio signals 143 to generate one or more binaural audio signals 417. The one or more binaural audio signals 417 correspond to the one or more enhanced audio signals 151. For example, the audio mixer 148A performs a binauralization operation 416A on the enhanced mono audio signal 143A and a binauralization operation 416B on the enhanced mono audio signal 143B to generate a binaural audio signal 417A and a binaural audio signal 417B, respectively. The binaural audio signal 417A and the binaural audio signal 417B correspond to the enhanced audio signal 151A and the enhanced audio signal 151B, respectively.
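As a non-limiting illustration, a binauralization operation can be approximated by convolving a mono signal with left-ear and right-ear head-related impulse responses selected for the indicated source direction, consistent with the head-related transfer function (HRTF) usage described elsewhere in this disclosure; the HRIR inputs and function name below are illustrative assumptions about how the binauralization operations 416 could be realized.

```python
import numpy as np

def binauralize(mono_signal, hrir_left, hrir_right):
    """Render a mono signal at a given direction by convolving it with the
    left-ear and right-ear head-related impulse responses (HRIRs) for that
    direction.

    A minimal sketch; the HRIR arrays would come from a measured or modeled
    HRTF set chosen using the visual context 147 or the source direction
    selection 347 (an assumption for illustration).
    """
    left_ear = np.convolve(mono_signal, hrir_left)[: len(mono_signal)]
    right_ear = np.convolve(mono_signal, hrir_right)[: len(mono_signal)]
    return left_ear, right_ear
```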
As an example, the visual context 147, the source direction selection 347, or both, indicate a particular location (e.g., a relative location, an absolute location, or both) of the audio source 184A in a visual scene represented by the image data 127 of
In this example, the one or more stereo audio signals 149 are based on the one or more directional audio signals 165 (e.g., the one or more delayed audio signals 303) and not based on the background audio signal 167 (e.g., the delayed audio signal 203 of
Referring to
The audio mixer 148B includes a neural network 458 configured to process one or more inputs of the audio mixer 148A of
In a particular aspect, the neural network 458 includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. In some implementations, the one or more hidden layers include fully connected layers, one or more convolution layers, or both. In some implementations, the one or more hidden layers include at least one recurrent layer, such as an LSTM layer, a GRU layer, or another recurrent neural network structure.
In a particular implementation, the input layer of the neural network 458 includes at least one input node for each signal input to the neural network 458. For example, the input layer of the neural network 458 may include at least one input node for feature values derived from the enhanced mono audio signal 143A, at least one input node for feature values derived from the enhanced mono audio signal 143B, at least one optional input node for feature values derived from the visual context 147 and/or the source direction selection 347, and at least one input node for feature values derived from each of the directional audio signal(s) 165.
In a particular implementation, the output layer of the neural network 458 includes two nodes corresponding to a right-channel stereo output node and a left-channel stereo output node. For example, the left-channel stereo output node may output the stereo audio signal 149A and the right-channel stereo output node may output the stereo audio signal 149B.
During training of the neural network 458, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
In the example illustrated in
In
In
In the example of
The output layer 276 of the neural network 458 includes at least two nodes corresponding to output nodes for features of the stereo audio signal 149A and the stereo audio signal 149B. Optionally, each of the nodes of the output layer 276 may be associated with a respective one of the biases 272C. In
During training of the neural network 458, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
The audio mixer 148A performs, based on the visual context 147, the source direction selection 347, or both, one or more panning operations 516 on the one or more enhanced mono audio signals 143 to generate one or more panned audio signals 517. The one or more panned audio signals 517 correspond to the one or more enhanced audio signals 151. For example, the audio mixer 148A performs a panning operation 516A on the enhanced mono audio signal 143A and a panning operation 516B on the enhanced mono audio signal 143B to generate a panned audio signal 517A and a panned audio signal 517B, respectively. The panned audio signal 517A and the panned audio signal 517B correspond to the enhanced audio signal 151A and the enhanced audio signal 151B, respectively.
As an example, the visual context 147, the source direction selection 347, or both, indicate a particular location (e.g., a relative location, an absolute location, or both) of the audio source 184A in a visual scene represented by the image data 127 of
The audio mixer 148A includes a reverberation generator 544 that uses a reverberation model 554 to process the one or more directional audio signals 165, the background audio signal 167, or both, to generate a reverberation signal 545. The audio mixer 148A mixes each of the one or more panned audio signals 517 with the reverberation signal 545 to generate the one or more stereo audio signals 149. For example, the audio mixer 148A mixes each of the panned audio signal 517A and the panned audio signal 517B with the reverberation signal 545 to generate the stereo audio signal 149A and the stereo audio signal 149B, respectively.
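For illustration only, the reverberation generator 544 could be approximated by convolving a context signal (directional and/or background audio) with a synthetic decaying impulse response and mixing the result into both panned channels; the RT60 value, noise-based impulse response, and mix gain below are illustrative assumptions rather than details of the reverberation model 554.

```python
import numpy as np

def reverberation_signal(context_signal, sample_rate, rt60=0.5, mix_gain=0.3):
    """Generate a reverberation signal from a context signal (directional
    and/or background audio) by convolving it with a synthetic,
    exponentially decaying impulse response.

    A simplified stand-in for the reverberation generator 544; -6.91 is
    ln(10**-3), giving a 60 dB decay over rt60 seconds.
    """
    ir_len = int(rt60 * sample_rate)
    t = np.arange(ir_len) / sample_rate
    impulse_response = np.random.randn(ir_len) * np.exp(-6.91 * t / rt60)
    reverb = np.convolve(context_signal, impulse_response)[: len(context_signal)]
    return mix_gain * reverb

def mix_with_reverb(panned_left, panned_right, reverb):
    # Mix the same reverberation signal 545 into both panned channels.
    return panned_left + reverb, panned_right + reverb
```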
In this example, the one or more stereo audio signals 149 include reverberation that is based on the one or more directional audio signals 165, the background audio signal 167, or both, and do not include the one or more directional audio signals 165 or the background audio signal 167. As a result, the stereo audio signals 149 include reverberation and do not include background noise (e.g., car noise or wind noise).
Referring to
The audio mixer 148B includes a neural network 558 configured to process one or more inputs of the audio mixer 148A of
In a particular aspect, the neural network 558 includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. In some implementations, the one or more hidden layers include fully connected layers, one or more convolution layers, or both. In some implementations, the one or more hidden layers include at least one recurrent layer, such as an LSTM layer, a GRU layer, or another recurrent neural network structure.
In a particular implementation, the input layer of the neural network 558 includes at least one input node for each signal input to the neural network 558. For example, the input layer of the neural network 558 may include at least one input node for feature values derived from the enhanced mono audio signal 143A, at least one input node for feature values derived from the enhanced mono audio signal 143B, at least one input node for feature values derived from the visual context 147 and/or the source direction selection 347, and optionally, at least one input node for feature values derived from each of the directional audio signal(s) 165, and/or at least one input node for feature values derived from the background audio signal 167.
In a particular implementation, the output layer of the neural network 558 includes two nodes corresponding to a right-channel stereo output node and a left-channel stereo output node. For example, the left-channel stereo output node may output the stereo audio signal 149A and the right-channel stereo output node may output the stereo audio signal 149B.
During training of the neural network 558, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
In the example illustrated in
In
In
In the example of
The output layer 276 of the neural network 558 includes at least two nodes corresponding to output nodes for features of the stereo audio signal 149A and the stereo audio signal 149B. Optionally, each of the nodes of the output layer 276 may be associated with a respective one of the biases 272C. In
During training of the neural network 558, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
The audio mixer 148A includes a reverberation generator 644 that uses a reverberation model 654 to generate a synthesized reverberation signal 645 (e.g., not based on reverberation extracted from audio signals as in
The audio mixer 148A mixes each of the one or more panned audio signals 517 (e.g., generated as described with reference to
In this example, the one or more stereo audio signals 149 include reverberation that is based on the location context 137, the visual context 147, or both, and are not based on the one or more directional audio signals 165 or the background audio signal 167 of
Referring to
The audio mixer 148B includes a neural network 658 configured to process one or more inputs of the audio mixer 148A of
In a particular aspect, the neural network 658 includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. In some implementations, the one or more hidden layers include fully connected layers, one or more convolution layers, or both. In some implementations, the one or more hidden layers include at least one recurrent layer, such as an LSTM layer, a GRU layer, or another recurrent neural network structure.
In a particular implementation, the input layer of the neural network 658 includes at least one input node for each signal input to the neural network 658. For example, the input layer of the neural network 658 may include at least one input node for feature values derived from the enhanced mono audio signal 143A, at least one input node for feature values derived from the enhanced mono audio signal 143B, and at least one input node for feature values derived from the visual context 147, the source direction selection 347, and/or the location context 137.
In a particular implementation, the output layer of the neural network 658 includes two nodes corresponding to a right-channel stereo output node and a left-channel stereo output node. For example, the left-channel stereo output node may output the stereo audio signal 149A and the right-channel stereo output node may output the stereo audio signal 149B.
During training of the neural network 658, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
Referring to
In the example illustrated in
In
In
In the example of
The output layer 276 of the neural network 658 includes at least two nodes corresponding to output nodes for features of the stereo audio signal 149A and the stereo audio signal 149B. Optionally, each of the nodes of the output layer 276 may be associated with a respective one of the biases 272C. In
During training of the neural network 658, the audio mixer 148B (or another training component of a device) generates a loss metric based on a comparison of the one or more stereo audio signals 149 generated by the audio mixer 148A of
It should be understood that
Referring to
During operation, the receiver 740 receives encoded data 749 from the device 704. The encoded data 749 represents the one or more input audio signals 125, the image data 127, the location data 163, or a combination thereof. For example, the encoded data 749 represents the sounds 185 of the one or more audio sources 184, images of the one or more audio sources 184, a location of an audio scene associated with the sounds 185, or a combination thereof. In some implementations, a decoder of the device 102 decodes the encoded data 749 to generate the one or more input audio signals 125, the image data 127, the location data 163, or a combination thereof.
As described with reference to
Referring to
During operation, the audio analyzer 140 obtains the one or more input audio signals 125, the image data 127, or both. In some aspects, the one or more input audio signals 125 correspond to a microphone output of the one or more microphones 120, and the image data 127 corresponds to a camera output of the one or more cameras 130. The one or more input audio signals 125 represent the sounds 185 of the audio sources 184. The audio analyzer 140 processes the one or more input audio signals 125, the image data 127, or both, to generate the one or more stereo audio signals 149. For example, as described with reference to
The transmitter 840 transmits encoded data 849 to the device 804. The encoded data 849 is based on the one or more stereo audio signals 149. In some examples, an encoder of the device 102 encodes the one or more stereo audio signals 149 to generate the encoded data 849.
A decoder of the device 804 decodes the encoded data 849 to generate a decoded audio signal. The device 804 provides the decoded audio signal to one or more speakers 822 to output sounds 885.
In some implementations, the device 804 is the same as the device 704. For example, the device 102 receives the encoded data 749 from a second device and outputs the sounds 785 via the one or more speakers 722 while concurrently capturing the sounds 185 via the one or more microphones 120 and sending the encoded data 849 to the second device.
Referring to
The method 1800 includes performing signal enhancement of an input audio signal to generate an enhanced mono audio signal, at 1802. For example, the signal enhancer 142 of
The method 1800 also includes mixing a first audio signal and a second audio signal to generate a stereo audio signal, the first audio signal based on the enhanced mono audio signal, at 1804. For example, the audio mixer 148 mixes the one or more enhanced audio signals 151 and the one or more audio signals 155 to generate the one or more stereo audio signals 149, as described with reference to
The method 1800 balances signal enhancement of the enhanced mono audio signal with audio context associated with the second audio signal in generating the stereo audio signal. For example, the one or more stereo audio signals 149 can include directional noise or reverberation that provides audio context to a listener while removing at least some of the background noise (e.g., diffuse noise or all background noise).
The method 1800 of
Referring to
In a particular implementation, the device 1900 includes a processor 1906 (e.g., a CPU). The device 1900 may include one or more additional processors 1910 (e.g., one or more DSPs, one or more GPUs, or a combination thereof). In a particular aspect, the one or more processors 190 of
The device 1900 may include a memory 1986 and a CODEC 1934. The memory 1986 may include instructions 1956 that are executable by the one or more additional processors 1910 (or the processor 1906) to implement the functionality described with reference to the audio analyzer 140. The device 1900 may include a modem 1970 coupled, via a transceiver 1950, to an antenna 1952.
The device 1900 may include a display 1928 coupled to a display controller 1926. The one or more speakers 722, the one or more microphones 120, or a combination thereof may be coupled to the CODEC 1934. The CODEC 1934 may include a digital-to-analog converter (DAC) 1902, an analog-to-digital converter (ADC) 1904, or both. In a particular implementation, the CODEC 1934 may receive analog signals from the one or more microphones 120, convert the analog signals to digital signals using the analog-to-digital converter 1904, and provide the digital signals to the speech and music codec 1908. The speech and music codec 1908 may process the digital signals, and the digital signals may further be processed by the audio analyzer 140. In a particular implementation, the speech and music codec 1908 may provide digital signals to the CODEC 1934. The CODEC 1934 may convert the digital signals to analog signals using the digital-to-analog converter 1902 and may provide the analog signals to the one or more speakers 722.
In a particular implementation, the device 1900 may be included in a system-in-package or system-on-chip device 1922. In a particular implementation, the memory 1986, the processor 1906, the processors 1910, the display controller 1926, the CODEC 1934, and the modem 1970 are included in the system-in-package or system-on-chip device 1922. In a particular implementation, an input device 1930 and a power supply 1944 are coupled to the system-in-package or the system-on-chip device 1922. Moreover, in a particular implementation, as illustrated in
The device 1900 may include a smart speaker, a speaker bar, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a vehicle, a headset, an augmented reality headset, a mixed reality headset, a virtual reality headset, an aerial vehicle, a home automation system, a voice-activated device, a wireless speaker and voice activated device, a portable electronic device, a car, a computing device, a communication device, an internet-of-things (IoT) device, a virtual reality (VR) device, a base station, a mobile device, or any combination thereof.
In conjunction with the described implementations, an apparatus includes means for performing signal enhancement of an input audio signal to generate an enhanced mono audio signal. For example, the means for performing the signal enhancement can correspond to the neural network 152, the signal enhancer 142, the audio mixer 148, the audio analyzer 140, the one or more processors 190, the device 102, the system 100 of
The apparatus also includes means for mixing the first audio signal and the second audio signal to generate a stereo audio signal, the first audio signal based on the enhanced mono audio signal. For example, the means for mixing can correspond to the audio mixer 148, the audio analyzer 140, the one or more processors 190, the device 102, the system 100 of
In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1986) includes instructions (e.g., the instructions 1956) that, when executed by one or more processors (e.g., the one or more processors 1910 or the processor 1906), cause the one or more processors to perform signal enhancement of an input audio signal (e.g., the one or more input audio signals 125) to generate an enhanced mono audio signal (e.g., the one or more enhanced mono audio signals 143). The instructions, when executed by the one or more processors, also cause the one or more processors to mix a first audio signal (e.g., the one or more enhanced audio signals 151) and a second audio signal (e.g., the one or more audio signals 155) to generate a stereo audio signal (e.g., the one or more stereo audio signals 149). The first audio signal is based on the enhanced mono audio signal.
Particular aspects of the disclosure are described below in sets of interrelated Examples:
According to Example 1, a device includes: a processor configured to perform signal enhancement of an input audio signal to generate an enhanced mono audio signal; and mix a first audio signal and a second audio signal to generate a stereo audio signal, the first audio signal based on the enhanced mono audio signal.
Example 2 includes the device of Example 1, wherein the second audio signal is associated with a context of the input audio signal.
Example 3 includes the device of Example 1 or Example 2, wherein the processor is configured to use a neural network to perform the signal enhancement.
Example 4 includes the device of any of Example 1 to Example 3, wherein the input audio signal is based on microphone output of one or more microphones.
Example 5 includes the device of any of Example 1 to Example 4, wherein the processor is configured to decode encoded audio data to generate the input audio signal.
Example 6 includes the device of any of Example 1 to Example 5, wherein the signal enhancement includes at least one of noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization.
Example 7 includes the device of any of Example 1 to Example 6, wherein the signal enhancement is based at least in part on a configuration setting, a user input, or both.
Example 8 includes the device of any of Example 1 to Example 7, wherein the processor is configured to use a neural network to mix the first audio signal and the second audio signal to generate the stereo audio signal.
Example 9 includes the device of any of Example 1 to Example 8, wherein the processor is configured to: use a first neural network to perform the signal enhancement of the input audio signal to generate the enhanced mono audio signal; and use a second neural network to mix the first audio signal and the second audio signal.
Example 10 includes the device of any of Example 1 to Example 9, wherein the processor is configured to: perform signal enhancement of a second input audio signal to generate a second enhanced mono audio signal; and generate the stereo audio signal based on mixing the first audio signal, the second audio signal, and a third audio signal, the third audio signal based on the second enhanced mono audio signal.
Example 11 includes the device of any of Example 1 to Example 10, wherein the processor is configured to generate at least one directional audio signal based on the input audio signal; and wherein the second audio signal is based on the at least one directional audio signal.
Example 12 includes the device of any of Example 1 to Example 11, wherein the processor is configured to: generate a directional audio signal based on the input audio signal; and apply a delay to the directional audio signal to generate a delayed audio signal, wherein the second audio signal is based on the delayed audio signal.
Example 13 includes the device of Example 12, wherein the processor is configured to pan, based on a visual context, the delayed audio signal to generate the second audio signal.
Example 14 includes the device of any of Example 1 to Example 13, wherein the processor is configured to pan the enhanced mono audio signal to generate the first audio signal.
Example 15 includes the device of Example 14, wherein the processor is configured to receive a user selection of an audio source direction, and wherein the enhanced mono audio signal is panned based on the audio source direction.
Example 16 includes the device of Example 15, wherein the processor is configured to determine the user selection based on hand gesture detection, head tracking, eye gaze detection, a user interface input, or a combination thereof.
Example 17 includes the device of Example 15 or Example 16, wherein the processor is configured to apply, based on the audio source direction, a head-related transfer function (HRTF) to the enhanced mono audio signal to generate the first audio signal.
Example 18 includes the device of any of Example 1 to Example 17, wherein the processor is configured to generate a background audio signal from an input audio signal, wherein the second audio signal is based at least in part on the background audio signal.
Example 19 includes the device of Example 18, wherein the processor is configured to: apply a delay to the background audio signal to generate a delayed background audio signal; and attenuate the delayed background audio signal to generate the second audio signal.
Example 20 includes the device of Example 19, wherein the processor is configured to attenuate the delayed background audio signal based on a visual context to generate the second audio signal.
Example 21 includes the device of any of Example 18 to Example 20, wherein the processor is configured to: generate at least one directional audio signal from the input audio signal; and use a reverberation model to process the background audio signal, the at least one directional audio signal, or a combination thereof, to generate a reverberation signal, wherein the second audio signal includes the reverberation signal.
Example 22 includes the device of any of Example 1 to Example 20, wherein the processor is configured to: determine, based on image data, a visual context of the input audio signal, the image data representing a visual scene associated with an audio source of the input audio signal; and use a reverberation model to generate a synthesized reverberation signal corresponding to the visual context, wherein the second audio signal includes the synthesized reverberation signal.
Example 23 includes the device of Example 22, wherein the visual context is based on surfaces of an environment, room geometry, or both.
Example 24 includes the device of Example 22 or Example 23, wherein the image data is based on at least one of camera output, a graphic visual stream, decoded image data, or stored image data.
Example 25 includes the device of any of Example 22 to Example 24, wherein the processor is configured to determine the visual context based at least in part on performing face detection on the image data.
Example 26 includes the device of any of Example 1 to Example 20, wherein the processor is configured to: determine a location context based on location data; and use a reverberation model to generate a synthesized reverberation signal corresponding to the location context, wherein the second audio signal includes the synthesized reverberation signal.
According to Example 27, a method includes: performing, at a device, signal enhancement of an input audio signal to generate an enhanced mono audio signal; and mixing, at the device, a first audio signal and a second audio signal to generate a stereo audio signal, the first audio signal based on the enhanced mono audio signal.
Example 28 includes the method of Example 27, wherein the second audio signal is associated with a context of the input audio signal.
Example 29 includes the method of Example 27 or Example 28, further including using a neural network to perform the signal enhancement.
Example 30 includes the method of any of Example 27 to Example 29, wherein the input audio signal is based on microphone output of one or more microphones.
Example 31 includes the method of any of Example 27 to Example 30, further including decoding encoded audio data to generate the input audio signal.
Example 32 includes the method of any of Example 27 to Example 31, wherein the signal enhancement includes at least one of noise suppression, audio zoom, beamforming, dereverberation, source separation, bass adjustment, or equalization.
Example 33 includes the method of any of Example 27 to Example 32, wherein the signal enhancement is based at least in part on a configuration setting, a user input, or both.
Example 34 includes the method of any of Example 27 to Example 33, further including using a neural network to mix the first audio signal and the second audio signal to generate the stereo audio signal.
Example 35 includes the method of any of Example 27 to Example 34, further including: using a first neural network to perform the signal enhancement of the input audio signal to generate the enhanced mono audio signal; and using a second neural network to mix the first audio signal and the second audio signal.
Example 36 includes the method of any of Example 27 to Example 35, further including: performing signal enhancement of a second input audio signal to generate a second enhanced mono audio signal; and generating the stereo audio signal based on mixing the first audio signal, the second audio signal, and a third audio signal, the third audio signal based on the second enhanced mono audio signal.
Example 37 includes the method of any of Example 27 to Example 36, further including generating at least one directional audio signal based on the input audio signal; and wherein the second audio signal is based on the at least one directional audio signal.
Example 38 includes the method of any of Example 27 to Example 37, further including: generating a directional audio signal based on the input audio signal; and applying a delay to the directional audio signal to generate a delayed audio signal, wherein the second audio signal is based on the delayed audio signal.
Example 39 includes the method of Example 38, further including panning, based on a visual context, the delayed audio signal to generate the second audio signal.
Example 40 includes the method of any of Example 27 to Example 39, further including panning the enhanced mono audio signal to generate the first audio signal.
Example 41 includes the method of Example 40, further including: receiving a user selection of an audio source direction, wherein the enhanced mono audio signal is panned based on the audio source direction.
Example 42 includes the method of Example 41, further including determining the user selection based on hand gesture detection, head tracking, eye gaze detection, a user interface input, or a combination thereof.
Example 43 includes the method of Example 41 or Example 42, further including applying, based on the audio source direction, a head-related transfer function (HRTF) to the enhanced mono audio signal to generate the first audio signal.
Example 44 includes the method of any of Example 27 to Example 43, further including generating a background audio signal from an input audio signal, wherein the second audio signal is based at least in part on the background audio signal.
Example 45 includes the method of Example 44, further including: applying a delay to the background audio signal to generate a delayed background audio signal; and attenuating the delayed background audio signal to generate the second audio signal.
Example 46 includes the method of Example 45, further including attenuating the delayed background audio signal based on a visual context to generate the second audio signal.
Example 47 includes the method of any of Example 44 to Example 46, further including: generating at least one directional audio signal from the input audio signal; and using a reverberation model to process the background audio signal, the at least one directional audio signal, or a combination thereof, to generate a reverberation signal, wherein the second audio signal includes the reverberation signal.
Example 48 includes the method of any of Example 27 to Example 46, further including: determining, based on image data, a visual context of the input audio signal, the image data representing a visual scene associated with an audio source of the input audio signal; and using a reverberation model to generate a synthesized reverberation signal corresponding to the visual context, wherein the second audio signal includes the synthesized reverberation signal.
Example 49 includes the method of Example 48, wherein the visual context is based on surfaces of an environment, room geometry, or both.
Example 50 includes the method of Example 48 or Example 49, wherein the image data is based on at least one of camera output, a graphic visual stream, decoded image data, or stored image data.
Example 51 includes the method of any of Example 48 to Example 50, further including determining the visual context based at least in part on performing face detection on the image data.
Example 52 includes the method of any of Example 27 to Example 46, further including: determining a location context based on location data; and using a reverberation model to generate a synthesized reverberation signal corresponding to the location context, wherein the second audio signal includes the synthesized reverberation signal.
According to Example 53, a device includes: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method of any of Example 27 to Example 52.
According to Example 54, a non-transitory computer-readable medium stores instructions that, when executed by a processor, cause the processor to perform the method of any of Example 27 to Example 52.
According to Example 55, an apparatus includes means for carrying out the method of any of Example 27 to Example 52.
According to Example 56, a non-transitory computer-readable medium stores instructions that, when executed by one or more processors, cause the one or more processors to: perform signal enhancement of an input audio signal to generate an enhanced mono audio signal; and mix a first audio signal and a second audio signal to generate a stereo audio signal, the first audio signal based on the enhanced mono audio signal.
Example 57 includes the non-transitory computer-readable medium of Example 56, wherein the second audio signal is associated with a context of the input audio signal.
According to Example 58, an apparatus includes: means for performing signal enhancement of an input audio signal to generate an enhanced mono audio signal; and means for mixing a first audio signal and a second audio signal to generate a stereo audio signal, the first audio signal based on the enhanced mono audio signal.
Example 59 includes the apparatus of Example 58, wherein the means for performing the signal enhancement, and the means for mixing the first audio signal and the second audio signal are integrated into at least one of a smart speaker, a speaker bar, a computer, a tablet, a display device, a television, a gaming console, a music player, a radio, a digital video player, a camera, a navigation device, a vehicle, a headset, an augmented reality headset, a mixed reality headset, a virtual reality headset, an aerial vehicle, a home automation system, a voice-activated device, a wireless speaker and voice activated device, a portable electronic device, a communication device, an internet-of-things (IoT) device, a virtual reality (VR) device, a base station, or a mobile device.
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transitory storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed aspects is provided to enable a person skilled in the art to make or use the disclosed aspects. Various modifications to these aspects will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.