Aspects of the disclosure generally relate to a low complexity multi-channel smart loudspeaker with voice control.
Smart loudspeakers with voice control and Internet connectivity are becoming increasingly popular. End users expect the product to perform various functions, including understanding a user's voice from any distant point in a room even while music is playing, responding to and interacting quickly with user requests, focusing on one voice command while suppressing others, playing back stereo music with high quality, filling the room with music like a small home theater system, and automatically steering to the position of a user listening in the room.
In one or more illustrative examples, a smart loudspeaker includes an array of N speaker elements disposed in a circular configuration about an axis and configured for multi-channel audio playback and a digital signal processor. The digital signal processor is configured to extract a center channel from a stereo input, apply the center channel to the array of speaker elements using a first set of finite impulse response filters and a first rotation matrix to generate a first beam of audio content at a target angle about the axis, apply a left channel of the stereo input to the array of speaker elements using a second set of finite impulse response filters and a second rotation matrix to generate a second beam of audio content at a first offset angle from the target angle about the axis, and apply a right channel of the stereo input to the array of speaker elements using a third set of finite impulse response filters and a third rotation matrix to generate a third beam of audio content at a second offset angle from the target angle about the axis.
In one or more illustrative examples, a method for a smart loudspeaker includes extracting a center channel from a stereo input; applying the center channel to an array of speaker elements disposed in a circular configuration about an axis and configured for multi-channel audio playback, using a first set of finite impulse response filters and a first rotation matrix to generate a first beam of audio content at a target angle about the axis; applying a left channel of the stereo input to the array of speaker elements using a second set of finite impulse response filters and a second rotation matrix to generate a second beam of audio content at a first offset angle from the target angle about the axis; and applying a right channel of the stereo input to the array of speaker elements using a third set of finite impulse response filters and a third rotation matrix to generate a third beam of audio content at a second offset angle from the target angle about the axis.
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
In order to realize smart loudspeaker features, a combination of a powerful host processor with Wi-Fi connectivity, a real-time signal processor comprising steerable beam forming for both received and sent sound, and multichannel echo cancelling filter banks is required. These components place a massive demand on processing power. On the other hand, wireless portability with battery power options is often desirable. This disclosure presents a solution that fulfills the demand for audio quality and smart loudspeaker features, while minimizing processing cost.
The loudspeaker 100 may also include a loudspeaker beamformer 108. The loudspeaker beamformer 108 may have three inputs configured to receive the upmixed signals 106 (L−C), (R−C), and (C) from the upmixer 104. The loudspeaker beamformer 108 may also be connected to an array of L loudspeakers 110 (typically L = 6 to 8). Each input channel (L−C), (R−C), and (C) corresponds to an acoustic beam of defined beam width.
Referring back to
The loudspeaker 100 also includes a two-input/one-output adaptive acoustic echo canceller (AEC) filter pair 126. An AEC output signal 128 approximates the music signal that the microphones 112 receive, originating from input channels 102 (L) and (R), and reaching the microphones 112 from the loudspeakers 110 via both direct and indirect (room reflection) paths. By subtracting this signal 128 from the microphone signals 114, the music is suppressed so that only the intended speech signal remains.
The diameter of the microphone array 112 may be small, typically 10 millimeters. This allows the AEC 126 for the system to be simplified greatly. In other systems, the microphones may be placed in a circular arrangement with a diameter of typically 4-10 centimeters (cm). That approach would require separate AEC filter pairs for each microphone of the array, because acoustic responses vary significantly with increasing distance. By reducing the diameter of the microphone array 112, processing power for performing AEC can be cut by a factor of M (i.e., the number of microphones) by applying only one AEC filter pair instead of M pairs. The reference for the AEC can be either the center microphone signal, or a signal obtained by averaging over the M array microphones 112 along the circle.
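The single two-input/one-output filter pair described above can be illustrated with a minimal sketch. The disclosure does not name the adaptation algorithm, so the normalized least-mean-squares (NLMS) update used here is an assumption, as are the function names `nlms_stereo_aec` and `simulate_echo`, the tap count, and the step size.

```python
def nlms_stereo_aec(ref_l, ref_r, mic, taps=8, mu=0.5, eps=1e-8):
    """Two-input/one-output echo canceller (NLMS update is an assumption).

    ref_l, ref_r: stereo playback reference signals (L) and (R).
    mic: one microphone reference signal (e.g. center mic or array average).
    Returns the residual signal: mic minus the estimated echo.
    """
    w_l = [0.0] * taps          # adaptive filter for the (L) reference
    w_r = [0.0] * taps          # adaptive filter for the (R) reference
    buf_l = [0.0] * taps        # most recent (L) samples, newest first
    buf_r = [0.0] * taps        # most recent (R) samples, newest first
    residual = []
    for x_l, x_r, d in zip(ref_l, ref_r, mic):
        buf_l = [x_l] + buf_l[:-1]
        buf_r = [x_r] + buf_r[:-1]
        # Echo estimate: sum of both filter outputs (the "one output").
        y = sum(w * x for w, x in zip(w_l, buf_l))
        y += sum(w * x for w, x in zip(w_r, buf_r))
        e = d - y
        # Normalize the step by the joint input energy of both references.
        energy = sum(x * x for x in buf_l) + sum(x * x for x in buf_r) + eps
        g = mu * e / energy
        w_l = [w + g * x for w, x in zip(w_l, buf_l)]
        w_r = [w + g * x for w, x in zip(w_r, buf_r)]
        residual.append(e)
    return residual


def simulate_echo(ref_l, ref_r, path_l, path_r):
    """Hypothetical test room: mic = ref_l * path_l + ref_r * path_r."""
    mic = []
    for n in range(len(ref_l)):
        s = 0.0
        for k, h in enumerate(path_l):
            if n - k >= 0:
                s += h * ref_l[n - k]
        for k, h in enumerate(path_r):
            if n - k >= 0:
                s += h * ref_r[n - k]
        mic.append(s)
    return mic
```

Running one such pair on a single reference signal, rather than M pairs on M microphone signals, is what yields the factor-of-M saving noted above.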
Referring more specifically to
The outputs of each of the high pass filters 602 and the low pass decimation filters 604 are provided to Short-Time Fourier Transform (STFT) blocks 606 using the two-way time/frequency analysis scheme. The upmixer 104 performs a two-way time/frequency analysis scheme that uses very short Fourier transform lengths of typically 128 with a hop size of 48, thereby achieving much higher time resolution than methods using longer lengths. A method that applies a single Fast Fourier Transform (FFT) of length 1024 may result in a time resolution of 10 . . . 20 milliseconds (msec), depending on overlap length. By using the short transform length, the time resolution is improved by a factor of ten (e.g., to 1 . . . 2 msec), which is more closely related to human perception. Frequency resolution is not compromised but improved as well, due to sub-sampling of the lower frequency band. Also, aliasing distortion, which can occur in poly-phase filter banks with nonlinear processing, is avoided. Thus, the two-way time/frequency analysis scheme leads to exceptional fidelity and sound quality with artifacts suppressed below audibility. Further aspects of the operation of the scheme are described in U.S. Patent Publication No. 2013/0208895 titled, “Audio Surround Processing System,” which is incorporated herein by reference in its entirety.
The (L) and (R) outputs of the STFT blocks 606 of the high-frequency path are provided to a center extraction block 608. Similarly, the (L) and (R) outputs of the STFT blocks 606 of the low-frequency path are provided to another center extraction block 608.
Notably, the STFT block 606 and center extraction block 608 in the low-frequency path run at a reduced sample rate of typically fS/rS, with fS=48 kHz and rS=16. This results in an rS-fold increase in low-frequency resolution, so the same short STFT length of 128 can be used.
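The resolution figures quoted above can be checked with a short calculation. The helper name `stft_resolution` is hypothetical, and two values are assumptions not stated in the text: a 50% overlap (hop of 512) for the single length-1024 FFT, and a hop size of 48 in the low-frequency path.

```python
def stft_resolution(sample_rate_hz, fft_length, hop_size):
    """Return (hop in ms, window length in ms, bin spacing in Hz)."""
    hop_ms = 1000.0 * hop_size / sample_rate_hz
    window_ms = 1000.0 * fft_length / sample_rate_hz
    bin_hz = sample_rate_hz / fft_length
    return hop_ms, window_ms, bin_hz


FS = 48000.0   # original sample rate fS from the text
RS = 16        # decimation factor rS from the text

# High-frequency path: length 128, hop 48 at the full sample rate.
high_path = stft_resolution(FS, 128, 48)
# Low-frequency path: same length at fS/rS (hop of 48 is an assumption).
low_path = stft_resolution(FS / RS, 128, 48)
# Single long transform for comparison (50% overlap is an assumption).
single_fft = stft_resolution(FS, 1024, 512)

for name, (hop_ms, window_ms, bin_hz) in (
    ("high path", high_path),
    ("low path", low_path),
    ("single FFT 1024", single_fft),
):
    print(f"{name}: hop {hop_ms:.2f} ms, window {window_ms:.2f} ms, "
          f"bin spacing {bin_hz:.2f} Hz")
```

Under these assumptions the high-frequency path advances in 1 ms steps, while the decimated low-frequency path obtains a bin spacing sixteen times finer than the high-frequency path with the same transform length.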
Recombination after respective center extraction processing in high-frequency paths and low-frequency paths is accomplished by inverse STFTs, interpolation from the reduced sample rate fS/16 to the original sample rate fS, and delay compensation at high frequencies, in order to match the higher latency due to FIR filtering of the low-frequency path. More specifically, each of the center extraction blocks 608 feeds into an independent inverse STFT block 610. The output of the inverse STFT block 610 in the low-frequency path feeds into a FIR interpolation filter 612, which may interpolate to account for the decimation performed at block 604. The output of the inverse STFT block 610 in the high-frequency path may then feed into a delay compensation block 614. The outputs of the FIR interpolation filter 612 and the delay compensation block 614 may then be combined using an adder 616, where the output of the adder 616 is the center output (C) channel 106.
More specifically referring to the algorithm implemented by the center extraction block 608 itself, the following values may be computed:
$$P = \frac{|V_L|^2 + |V_R|^2}{2} \tag{1}$$
where $P$ is the mean signal energy, $V_L$ is a complex vector of the short-term signal spectra of the (L) input channel 102 signal, and $V_R$ is a complex vector of the short-term signal spectra of the (R) input channel 102 signal;
$$V_X = |V_L V_R^{*}| \tag{2}$$
where $V_X$ represents the absolute value of the cross spectral density; and
$$p_c = \frac{V_X}{P} \tag{3}$$
where $p_c$ is a quotient computed as the ratio of the absolute value of the cross spectral density $V_X$ to the mean signal energy $P$. This quotient may be referred to as the “Time/Frequency Mask.”
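Equations (1) through (3) can be sketched per frequency bin as follows. The function name `time_frequency_mask` and the `floor` guard against division by zero in silent bins are assumptions added for illustration.

```python
def time_frequency_mask(v_l, v_r, floor=1e-12):
    """Per-bin mask p_c = V_X / P from equations (1) to (3).

    v_l, v_r: sequences of complex short-term spectral values for the
    (L) and (R) channels of one STFT frame.
    """
    mask = []
    for a, b in zip(v_l, v_r):
        p = (abs(a) ** 2 + abs(b) ** 2) / 2.0   # equation (1)
        v_x = abs(a * b.conjugate())            # equation (2)
        # Equation (3); silent bins are mapped to zero (an assumption).
        mask.append(v_x / p if p > floor else 0.0)
    return mask


# Equal content in both channels, content in one channel only,
# and a 2:1 level difference between the channels.
example = time_frequency_mask([1 + 1j, 1 + 0j, 2 + 0j],
                              [1 + 1j, 0j, 1 + 0j])
```

Because $|V_L||V_R|$ can never exceed the mean of the squared magnitudes, the quotient stays between zero and one, reaching one when both channels carry the same level in a bin and zero when one channel is silent.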
Using those values, a time average
The center signal is then extracted using a nonlinear mapping function F. The desired output signal is obtained by multiplying the sum of the inputs (as a mono signal) with a nonlinear function F of the mask
Yc=(VL+VR)·F(
The beam can be rotated to any desired angle ϕ by re-assigning the tweeters. For example, rotation of ϕ=60° may be accomplished by connecting filter h1 to tweeter T2, h26 to tweeter pairs T1 and T3, and so on. Additionally, any angle in-between can be realized by linear interpolation of the respective tweeter signals. The rotation is realized as a 4×6 gain matrix, because there are four beam forming filters and six tweeters in this example. However, different numbers of filters and tweeters would affect the dimensions of the rotation matrix. Besides linear interpolation, other interpolation laws such as cosine or cosine squared may additionally or alternately be used.
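The rotation by tweeter re-assignment with linear interpolation can be sketched as a gain matrix. This is a minimal illustration assuming six tweeters equally spaced at 60° and zero-based indices (row 0 corresponds to h1, column 0 to T1); the function name `rotation_matrix` is hypothetical.

```python
import math


def rotation_matrix(phi_deg, n_tweeters=6):
    """Gain matrix (filters x tweeters) for a beam rotated to phi_deg.

    Row f holds the gains routing beam forming filter f to each tweeter.
    Angles between tweeter positions are linearly interpolated.
    """
    n_filters = n_tweeters // 2 + 1          # 4 filters for 6 tweeters
    step = 360.0 / n_tweeters                # angular tweeter spacing
    pos = (phi_deg % 360.0) / step
    k = int(math.floor(pos))                 # whole tweeter steps
    frac = pos - k                           # interpolation weight
    gains = [[0.0] * n_tweeters for _ in range(n_filters)]
    for f in range(n_filters):
        # Filter f feeds the tweeters f positions to either side of the
        # beam axis; front and rear filters feed a single tweeter.
        for offset in {f % n_tweeters, -f % n_tweeters}:
            for shift, weight in ((k, 1.0 - frac), (k + 1, frac)):
                gains[f][(offset + shift) % n_tweeters] += weight
    return gains


# 60 degree rotation: h1 moves to T2, the first pair filter to T1 and T3.
g60 = rotation_matrix(60.0)
# 30 degree rotation: each filter is split equally between neighbours.
g30 = rotation_matrix(30.0)
```

Replacing the two linear weights with cosine or cosine-squared weights would give the alternative interpolation laws mentioned above without changing the matrix dimensions.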
Referring back to
The design of beam forming filters may be based on acoustic data. In an example, impulse responses may be captured in an anechoic chamber. Each array driver may be measured at discrete angles around the speaker by rotation via a turntable. Further aspects of the design of the beamforming filters are discussed in International Application Number PCT/US17/49543, titled “Variable Acoustics Loudspeaker,” which is incorporated herein by reference in its entirety.
The acoustic data may be preconditioned by computing complex spectra using the Fourier transform. Then, complex smoothing may be performed by computing magnitude and phase, separately smoothing magnitude and phase responses, then transforming the data back into complex spectral values. Additionally, angular response may be normalized to the spectrum of the frontal transducer at 0° by multiplying each spectrum with its inverse. This inverse response may be utilized later for global equalization.
The measured, smoothed complex frequency responses can be written in matrix form as follows:
$$H_{sm}(i,j), \quad i = 1 \ldots N, \; j = 1 \ldots M, \tag{6}$$
where the frequency index is i, N is the FFT length (N=2048 in the illustrated example), and M the number of angular measurements in the interval [0 . . . 180]° (M=13 for 15° steps in the illustrated example).
An array of R drivers (here R=6) includes one frontal driver at 0°, one rear driver at 180°, and P=(R−2)/2 driver pairs located at the angles
The design of P beam forming filters Cr is such that they are connected to the driver pairs where an additional filter CP+1 is provided for the rear driver. First, as stated above, the measured frequency responses are normalized at angles greater than zero with respect to the frontal response to eliminate the driver frequency response. This normalization may be factored back in later when designing the final filter in form of driver equalization, as follows:
$$H_0(i) = H_{sm}(i,1);$$
$$H_{norm}(i,j) = \frac{H_{sm}(i,j)}{H_0(i)}, \quad i = 1 \ldots N, \; j = 1 \ldots M \tag{7}$$
The filter design iteration works for each frequency point separately. The frequency index may be eliminated for convenience, as follows:
$$H(\alpha_k) := H_{norm}(i,k) \tag{8}$$
as the measured and normalized frequency response at discrete angle $\alpha_k$.
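The normalization of equation (7) can be sketched as follows, assuming the smoothed responses are stored as a list of rows (one row per frequency index, one column per measured angle, with the frontal 0° response in the first column). The function name `normalize_to_frontal` is hypothetical.

```python
def normalize_to_frontal(h_sm):
    """Apply equation (7): divide each row by its frontal (0 degree) value.

    h_sm: list of N rows, each holding M complex responses (angles).
    Returns (h0, h_norm), where h0 is kept for later driver equalization.
    """
    h0 = [row[0] for row in h_sm]
    h_norm = [[value / ref for value in row] for row, ref in zip(h_sm, h0)]
    return h0, h_norm


# Two frequency points, two angles (0 degrees and one off-axis angle).
h0, h_norm = normalize_to_frontal([[2 + 0j, 1 + 0j],
                                   [1j, 1 + 0j]])
```

After normalization the first column is one at every frequency, so the driver's own frequency response drops out of the beam forming design and can be restored afterwards through `h0`.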
Assuming a radial-symmetric, cylindrical enclosure and identical drivers, the frequency responses U(k) of the array may be computed at angles αk by applying the same offset angle to all drivers as follows:
The spectral filter values Cr can be obtained iteratively by minimizing the quadratic error function:
where t(k) is a spatial target function, specific to the chosen beam width, as defined later.
The parameter α defines the array gain:
$$\alpha_{gain} = 20 \cdot \log_{10}(\alpha)$$
The array gain specifies how much louder the array plays compared to one single transducer. It should be higher than one, but cannot be higher than the total transducer number R. In order to allow some sound cancellation that is necessary for super-directive beam forming, the array gain will be less than R but should be much higher than one. In general, the array gain is frequency dependent and must be chosen carefully to obtain good approximation results.
Additionally, Q is the number of angular target points (for example Q=9). Also, w(k) is a weighting function that can be used if higher precision is required in a particular approximation point versus another (usually 0.1<w<1).
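As a worked example of the array gain relation above, the following sketch converts a linear array gain to decibels and evaluates the upper limit for R = 6 transducers. The function name `array_gain_db` is hypothetical.

```python
import math


def array_gain_db(linear_gain):
    """Array gain in dB: 20 * log10 of the linear gain parameter."""
    return 20.0 * math.log10(linear_gain)


R = 6                                   # number of transducers in the example
upper_limit_db = array_gain_db(R)       # all drivers adding coherently
single_driver_db = array_gain_db(1.0)   # same level as one transducer

print(f"usable range: {single_driver_db:.1f} dB to {upper_limit_db:.2f} dB")
```

For six transducers the usable range is therefore roughly 0 dB to 15.6 dB, with practical super-directive designs sitting below the upper end to leave room for cancellation.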
The variables to be optimized are the P+1 complex filter values per frequency index i, Cr (i), r=1 . . . (P+1). The optimization may be started at the first frequency point in the band of interest
(for example, f1=100 Hz, fg=24 kHz, N=2048 ⇒ i1=8), set Cr=1 ∀r as a start solution, then subsequently compute the filter values by incrementing the index each time until reaching the last point
Instead of real and imaginary parts, the magnitude |Cr(i)| and unwrapped phase arg(Cr(i))=arctan(Im{Cr(i)}/Re{Cr(i)}) can be used as variables for the nonlinear optimization routine.
This bounded, nonlinear optimization problem can be solved with standard software, for example the function “fmincon”, which is part of the Matlab optimization toolbox. The following bounds may be applied:
$$G_{max} = 20 \cdot \log_{10}\left(\max(|C_r|)\right),$$
the maximum allowed filter gain, and lower and upper limits for the magnitude values from one calculated frequency point to the next, specified by an input parameter $\delta$, as follows:
$$|C_r(i)| \cdot (1-\delta) < |C_r(i+1)| < |C_r(i)| \cdot (1+\delta) \tag{12}$$
in order to control smoothness of the resulting frequency response.
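The bounds handed to the optimizer at each frequency step can be sketched as follows. Combining the smoothness limits of equation (12) with the maximum-gain ceiling in a single function is an assumption made for illustration, as is the function name `next_point_bounds`.

```python
def next_point_bounds(prev_magnitude, delta, g_max_db):
    """Lower and upper magnitude bounds for the next frequency point.

    prev_magnitude: |C_r(i)| found at the previous frequency point.
    delta: allowed relative change from one point to the next.
    g_max_db: maximum allowed filter gain G_max in dB.
    """
    ceiling = 10.0 ** (g_max_db / 20.0)           # G_max as linear gain
    lower = prev_magnitude * (1.0 - delta)        # left side of eq. (12)
    upper = prev_magnitude * (1.0 + delta)        # right side of eq. (12)
    return lower, min(upper, ceiling)


# Start solution |C_r| = 1 with a 10 % step limit and 6 dB maximum gain.
low, high = next_point_bounds(1.0, 0.1, 6.0)
```

A smaller δ gives a smoother magnitude response at the cost of a slower approach to the target function across frequency.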
Design examples using an array diameter of 150 millimeters, with 6 mid/tweeters crossed over at 340 Hz are discussed as follows.
In a narrow beam example,
The contour plot of the medium-wide beam is shown in
The loudspeaker 100 may further be utilized in an omni-directional mode. For monaural sources, such as speech, an omni-directional mode with a dispersion pattern as uniform and angle-independent as possible is often required. First, a wide-beam design is approached with the same method:
Referring to the steerable microphone array 112, the microphone beamformer 120 may be designed in three stages: initial and in-situ calibration, closed-form start solution, and optimization to a target.
Regarding microphone auto-calibration, low-cost Electret Condenser Microphones (ECM) and Microelectromechanical system (MEMS) microphones usually exhibit a deviation of typically +/−3 dB from a mean response. This is confirmed by the example of
In order to account for microphone aging or environmental conditions such as temperature and humidity, in-situ calibration is required from time to time. This can be accomplished by estimating the response of the reference microphone over time while music or a dedicated test signal is being played, then equalizing the other microphones to that target.
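The equalization step can be sketched for magnitude only. This is a simplified illustration that assumes the frequency responses have already been estimated per bin; it does not design the correction filter itself, and the function name `equalizer_gains` and the `floor` guard are hypothetical.

```python
def equalizer_gains(reference_response, microphone_response, floor=1e-12):
    """Per-bin gains that bring one microphone to the reference magnitude.

    Both arguments are sequences of (complex or real) frequency response
    values estimated at the same frequency points.
    """
    return [abs(ref) / max(abs(mic), floor)
            for ref, mic in zip(reference_response, microphone_response)]


# A microphone that is 3 dB hot in one bin and 3 dB low in another.
reference = [1.0, 1.0]
deviating = [10.0 ** (3.0 / 20.0), 10.0 ** (-3.0 / 20.0)]
gains = equalizer_gains(reference, deviating)
corrected = [g * m for g, m in zip(gains, deviating)]
```

After applying the gains, the deviating microphone matches the reference magnitude in both bins.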
Regarding the initial beamforming solution, closed solutions exist for circular microphone arrays 112 in free air. A well-known design may be used to obtain a start solution for subsequent nonlinear optimization. The textbook by Jacob Benesty, “Design of Circular Differential Microphone Arrays,” Springer 2015, is incorporated by reference in its entirety, and describes that the microphone beam forming filter vector H=[H1 . . . HM] can be computed as follows:
where
representing a “pseudo coherence matrix” for diffuse noise;
I is the identity matrix;
ω is frequency;
c is the speed of sound;
the distances between microphones i and j are:
where d is the array diameter;
D=[D1 . . . DM] denotes the steering vector, where
ε is a regularization factor. In this example ε=1e−5.
The delay vector V=[V1 . . . VM] of an ideal, circular array of point sensors at the angle θ may be defined as:
We obtain the complex response Bm of microphone m at angle θ by cascading above delay Vm, beam filter Hm, and conjugate complex steering vector element Dm as:
$$B_m(\theta) = V_m(\theta)\, H_m\, D_m^{*}, \tag{15}$$
and finally the beam response U(θ) by complex summation over the individual responses:
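The cascade of equation (15) and the complex summation over microphones can be sketched at a single frequency. The delay model in `delay_vector` (far-field plane wave, microphones equally spaced on the circle, first microphone at 0°) is an assumption chosen for illustration, and both function names are hypothetical.

```python
import cmath
import math


def delay_vector(theta_deg, n_mics, diameter_m, freq_hz, c=343.0):
    """Phase terms of an ideal circular array of point sensors.

    Assumed model: plane wave arriving from angle theta, microphones
    equally spaced on a circle of the given diameter.
    """
    omega = 2.0 * math.pi * freq_hz
    radius = diameter_m / 2.0
    theta = math.radians(theta_deg)
    return [cmath.exp(1j * omega * radius / c
                      * math.cos(theta - 2.0 * math.pi * m / n_mics))
            for m in range(n_mics)]


def beam_response(v, h, d):
    """Sum of B_m = V_m * H_m * conj(D_m) over all microphones."""
    return sum(v_m * h_m * d_m.conjugate() for v_m, h_m, d_m in zip(v, h, d))


M = 6
steer = delay_vector(0.0, M, 0.010, 2000.0)      # steering vector D at 0 deg
weights = [1.0 / M] * M                          # plain averaging filters H
on_axis = beam_response(delay_vector(0.0, M, 0.010, 2000.0), weights, steer)
rear = beam_response(delay_vector(180.0, M, 0.010, 2000.0), weights, steer)
```

With plain averaging weights the response in the steering direction is exactly one; the closed-form or optimized filters $H_m$ would take the place of `weights` to shape the pattern at other angles.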
Regarding non-linear post optimization,
First, the data is preconditioned by complex smoothing in the frequency domain, and normalization to the frontal transducer. Hence, the frequency response of the first transducer mic1 is set to constant one during the optimization. Instead of applying a beam forming filter to mic1, a global EQ filter applied to all microphones may be used.
Target functions for the design are attenuation values uk at angles θk=[0:15:180]°, which can be taken from the initial solution uk(f)=|U(f, θk)| (see above). Since this response is frequency dependent, a number of constant target functions are used for different frequency intervals. For example, below a transition frequency ftr=1000 Hz, a first target function uk(f=2000 Hz) can be used for the approximation in the interval 100 Hz . . . 1000 Hz; a second target function uk(f=4000 Hz) is then used for the remaining interval 1000 Hz . . . 20 kHz. This method results in a progressively narrower beam at higher frequencies.
The initial solution for C1 . . . C3 may be set to the previously-obtained beam forming filters Hm, as shown in
In addition to the allowed amplitude difference $\delta$ from one frequency iteration point $i$ to the next point $i+1$:
$$|C_r(i)| \cdot (1-\delta) < |C_r(i+1)| < |C_r(i)| \cdot (1+\delta), \tag{17}$$
a phase boundary $\delta_P$ is applied:
$$\arg(C_r(i)) \cdot (1-\delta_P) < \arg(C_r(i+1)) < \arg(C_r(i)) \cdot (1+\delta_P). \tag{18}$$
In summary, the following bounds are applied:
The overall white noise gain may be calculated as:
$$\mathrm{WNG} = 20 \log_{10}\left\{ |\mathrm{EQFilt}| \cdot \sqrt{1 + 2\,|C_1|^2 + 2\,|C_2|^2 + |C_3|^2} \right\}. \tag{19}$$
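Equation (19) can be evaluated at one frequency point with the following sketch. The function name `white_noise_gain_db` is hypothetical; the arguments are the complex values of the global EQ filter and of the beam forming filters C1 to C3 at that frequency.

```python
import math


def white_noise_gain_db(eq_filt, c1, c2, c3):
    """White noise gain in dB per equation (19) at one frequency point.

    C1 and C2 are each applied to a microphone pair (hence the factor 2),
    C3 to a single microphone, and the first microphone is held at one.
    """
    noise_power = (1.0
                   + 2.0 * abs(c1) ** 2
                   + 2.0 * abs(c2) ** 2
                   + abs(c3) ** 2)
    return 20.0 * math.log10(abs(eq_filt) * math.sqrt(noise_power))


reference_only = white_noise_gain_db(1.0, 0.0, 0.0, 0.0)   # only mic1 active
unit_filters = white_noise_gain_db(1.0, 1.0, 1.0, 1.0)     # all six at unity
```

With every filter at unity the six uncorrelated noise contributions add to about 7.8 dB, which is the baseline against which larger filter magnitudes at low frequencies raise the noise amplification.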
At operation 3104, the loudspeaker 100 extracts a center channel from the input signal. In an example, the upmixer 104 is configured to generate a center channel (C) out of the two-channel stereo sources (i.e., (L) and (R) of the audio input 102), resulting in upmixed signals 106 left minus center (L−C), center (C), and right minus center (R−C). Further aspects of the operation of the upmixer 104 are described in detail with respect to
At operation 3106, the loudspeaker 100 generates a center channel beam for output by the loudspeaker 100. In an example, as discussed at least with respect to
At operation 3108, the loudspeaker 100 generates stereo channel beams for output by the loudspeaker 100. In an example, as discussed at least with respect to
At operation 3110, the loudspeaker 100 calibrates the microphone array 112. In an example, the loudspeaker 100 calibrates the array of microphones 112 by convolution of the electrical signals from each of the microphones using a minimum phase correction filter and a target microphone that is one of the microphone elements of the array 112. In another example, the loudspeaker 100 performs an in-situ calibration that includes estimating a frequency response of a reference microphone of the microphone array 112 using the audio playback of the array of speakers 110 as a reference signal, and equalizing the microphones of the array 112 according to the estimated frequency response.
At operation 3112, the loudspeaker 100 receives microphone signals 114 from the microphone array 112. In an example, the processor of the loudspeaker 100 may be configured to receive the raw microphone signals 114 from the microphone array 112.
At operation 3114, the loudspeaker 100 performs echo cancellation on the received microphone signals 114. In an example, the loudspeaker 100 utilizes a single adaptive acoustic echo canceller (AEC) 126 filter pair keyed to the stereo input for the array of microphone elements. It may be possible to use the single AEC as opposed to M AECs due to the short distance between the microphone elements of the array 112, as well as due to the calibration of the microphone array 112. Further aspects of the operation of the AEC are described above with respect to
At operation 3116, the loudspeaker 100 performs speech recognition on the microphone signals 114 that are echo cancelled. Accordingly, the loudspeaker 100 may be able to respond to voice commands. After operation 3116, the process 3100 ends.
The processor 3202 may be any technically feasible form of processing device configured to process data and/or execute program code. The processor 3202 could include, for example, and without limitation, a system-on-chip (SoC), a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), and so forth. Processor 3202 includes one or more processing cores. In operation, processor 3202 is the master processor of computing device 3201, controlling and coordinating operations of other system components.
I/O devices 3204 may include input devices, output devices, and devices capable of both receiving input and providing output. For example, and without limitation, I/O devices 3204 could include wired and/or wireless communication devices that send data to and/or receive data from the speaker(s) 3220, the microphone(s) 3230, remote databases, other audio devices, other computing devices, etc.
Memory 3210 may include a memory module or a collection of memory modules. The audio processing application 3212 within memory 3210 is executed by the processor 3202 to implement the overall functionality of the computing device 3201 and, thus, to coordinate the operation of the audio system 3200 as a whole. For example, and without limitation, data acquired via one or more microphones 3230 may be processed by the audio processing application 3212 to generate sound parameters and/or audio signals that are transmitted to one or more speakers 3220. The processing performed by the audio processing application 3212 may include, for example, and without limitation, filtering, statistical analysis, heuristic processing, acoustic processing, and/or other types of data processing and analysis.
The speaker(s) 3220 are configured to generate sound based on one or more audio signals received from the computing system 3200 and/or an audio device (e.g., a power amplifier) associated with the computing system 3200. The microphone(s) 3230 are configured to acquire acoustic data from the surrounding environment and transmit signals associated with the acoustic data to the computing device 3201. The acoustic data acquired by the microphone(s) 3230 could then be processed by the computing device 3201 to determine and/or filter the audio signals being reproduced by the speaker(s) 3220. In various embodiments, the microphone(s) 3230 may include any type of transducer capable of acquiring acoustic data including, for example and without limitation, a differential microphone, a piezoelectric microphone, an optical microphone, etc.
Generally, computing device 3201 is configured to coordinate the overall operation of the audio system 3200. In other embodiments, the computing device 3201 may be coupled to, but separate from, other components of the audio system 3200. In such embodiments, the audio system 3200 may include a separate processor that receives data acquired from the surrounding environment and transmits data to the computing device 3201, which may be included in a separate device, such as a personal computer, an audio-video receiver, a power amplifier, a smartphone, a portable media player, a wearable device, etc. However, the embodiments disclosed herein contemplate any technically feasible system configured to implement the functionality of the audio system 3200.
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.
Aspects of the present embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such processors may be, without limitation, general purpose processors, special-purpose processors, application-specific processors, or field-programmable.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.
Number | Name | Date | Kind |
---|---|---|---|
7991170 | Horbach | Aug 2011 | B2 |
9294860 | Carlson | Mar 2016 | B1 |
9749747 | Kriegel | Aug 2017 | B1 |
10109292 | Choisel | Oct 2018 | B1 |
20060233389 | Mao et al. | Oct 2006 | A1 |
20110051950 | Burnett | Mar 2011 | A1 |
20130208895 | Horbach et al. | Aug 2013 | A1 |
20150030164 | Ranieri et al. | Jan 2015 | A1 |
20170006399 | Maziewski | Jan 2017 | A1 |
20170236547 | Baggio | Aug 2017 | A1 |
20170366897 | Azarewicz et al. | Dec 2017 | A1 |
20180098172 | Family | Apr 2018 | A1 |
20180226065 | Marquez | Aug 2018 | A1 |
Number | Date | Country |
---|---|---|
2545359 | Jun 2017 | GB |
2018045133 | Mar 2018 | WO |
Number | Date | Country |
---|---|---|
20190373390 A1 | Dec 2019 | US |