Various signal processing techniques have been developed for estimating the location of a sound source by using multiple microphones. Such techniques typically assume that the microphones are located in free space with a relatively simple geometric arrangement, such as a linear array or a circular array, which makes it relatively easy to analyze detected sound waves. However, in some situations, microphones may not be arranged in a linear or circular array. For example, microphones may be randomly positioned at various locations across a device of an arbitrary shape in a given environment instead of being positioned in a linear or circular array. Sound waves may be diffracted and scattered across the device before they are detected by the microphones. Scattering effects, reverberations, and other linear and nonlinear effects across an arbitrarily shaped device may complicate the analysis involved in estimating the location of a sound source.
In multi-microphone devices, the geometry and shape of the device are important. If the shape of the device changes, for example to move the placement of the microphones, the operation of the device, particularly its accuracy, may be greatly affected. To address changes in the device shape, the device must be re-recorded in rooms of multiple sizes and shapes using the new design. As such, all previous recordings made for the device using the previous shape may have to be discarded, which wastes resources.
According to an embodiment of the disclosed subject matter, a method is disclosed for auralizing a multi-microphone device. Path information is determined for one or more sound paths using dimensions and room reflection coefficients of a simulated room for one of a plurality of microphones included in a multi-microphone device. An array-related transfer function (ARTF) for the one of the plurality of microphones is retrieved. The auralized impulse response for the one of the plurality of microphones is generated based at least on the retrieved ARTF and the determined path information.
In an aspect of the embodiment, generating the auralized impulse response comprises extracting from the retrieved ARTFs, an ARTF corresponding to each of the one or more sound paths, determining an auralized path to the one of the plurality of microphones for each of the sound paths, and combining the auralized paths for the one of the plurality of microphones to generate the auralized impulse response of the one of the plurality of microphones.
In an aspect of the embodiment, determining the path information for the one or more sound paths comprises determining an n-th shortest sound path to the one of the plurality of microphones, wherein n is a counter that is used to determine the number of sound paths that have been determined, computing the path information for the determined n-th shortest sound path, and incrementing the counter by one if n is less than a threshold number of determined sound paths.
In an aspect of the embodiment, determining the auralized path to the one of the plurality of microphones for each of the sound paths comprises convolving each ARTF corresponding to the one or more sound paths with a room impulse response for the respective one or more sound paths for the one of the plurality of microphones, wherein the room impulse response is calculated based on the path information of the respective one or more sound paths.
In an aspect of the embodiment, the path information includes a path-distance, signal attenuation, and array-direction of arrival (DOA).
In an aspect of the embodiment, the method comprises retrieving a microphone transfer function for the one of the plurality of microphones, and convolving the microphone transfer function with the determined auralized path for the one of the plurality of microphones.
In an aspect of the embodiment, the method comprises retrieving a near-microphone sound from a sound database including a plurality of near-microphone recorded speeches and sounds, and convolving the near-microphone sound with the determined auralized path for the one of the plurality of microphones to generate the auralized impulse response for the one of the plurality of microphones.
In an aspect of the embodiment, the method comprises generating an auralized impulse response for each of the plurality of microphones included in the multi-microphone device.
In an aspect of the embodiment, the method comprises modifying the microphone transfer function.
In an aspect of the embodiment, the method comprises modifying the dimensions and the room reflection coefficients of the simulated room, and generating the auralized impulse response for each of the plurality of microphones included in the multi-microphone device based on the modified dimensions and room reflection coefficients of the simulated room.
According to an embodiment of the disclosed subject matter, a system for auralizing a multi-microphone device comprises a room simulator, including a processor, the room simulator configured to determine path information for one or more sound paths using dimensions and room reflection coefficients of a simulated room for one of a plurality of microphones included in the multi-microphone device, an array-related transfer function (ARTF) database including ARTFs for the one of the plurality of microphones, and an auralizer, including a processor. The auralizer is configured to retrieve the ARTFs for the one of the plurality of microphones, and generate an auralized impulse response for the one of the plurality of microphones based at least on the retrieved ARTFs and the determined path information.
Additional features, advantages, and embodiments of the disclosed subject matter may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary and the following detailed description are illustrative and are intended to provide further explanation without limiting the scope of the claims.
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated in and constitute a part of this specification. The drawings also illustrate embodiments of the disclosed subject matter and together with the detailed description serve to explain the principles of embodiments of the disclosed subject matter. No attempt is made to show structural details in more detail than may be necessary for a fundamental understanding of the disclosed subject matter and various ways in which it may be practiced.
According to embodiments of this disclosure, methods and apparatus are provided for auralizing a multi-microphone device. In the following description, multiple microphones may be collectively referred to as an “array” of microphones. An array of microphones may include microphones placed in various locations on an arbitrarily shaped device in an indoor environment such as a smart-home environment, or in another type of enclosed environment. Sound waves may experience scattering effects, diffractions, reverberations, or other linear or nonlinear effects before they are detected by the microphones. According to embodiments of the disclosure, a sound detection system includes a neural network that is trained to estimate the location of a sound source in a three-dimensional space in a given environment based on sound signals detected by multiple microphones without being dependent on conventional schemes for determining the source of a sound, where these conventional schemes may be limited to relatively simple geometric arrangements of microphones, for example, linear or circular arrays with no obstructions or objects that may absorb, reflect, or distort sound propagation. An auralizing system is implemented to generate multi-channel “auralized” sound signals based at least on impulse responses of the microphone array in an anechoic chamber and in a simulated room environment as well as other inputs.
As used herein, “auralization” refers to a process of rendering audio data by digital means to achieve a virtual three-dimensional sound space. Training a neural network with auralized multi-channel signals allows the neural network to capture the scattering effects of the multi-microphone array, other linear or non-linear effects, reverberation times in a room environment, as well as manufacturing variations between different microphones in the multi-microphone array. After being trained with data derived from the auralized multi-channel signals, a neural network may compute the complex coefficients, which may be used to estimate the direction or location of an actual sound source in a three-dimensional space with respect to a multi-microphone device. In some implementations, in addition to detecting the direction or location of the sound source, the neural network may also be trained and used as a speech detector or a sound classifier to detect whether the received sound signal is or contains speech based on comparisons with a speech database, such as the TIMIT database.
In some implementations, sound signals from stationary or moving sound sources may be auralized, by an auralizing system, to generate auralized multi-channel sound signals. In some embodiments, the auralizer may obtain impulse responses of the multi-microphone array in a multi-microphone device, i.e., ARTFs or device related transfer functions, across a dense grid of three-dimensional coordinates, such as spherical coordinates, Cartesian coordinates, or cylindrical coordinates, and combine the ARTFs with responses from a room simulator and transfer functions indicative of microphone variations to generate auralized multi-channel sound signals, and signal labels related thereto, for example.
A signal label may include spatial information indicative of an estimated location of the sound source. For example, a label may include azimuth, elevation and distance in spherical coordinates if the sound source is stationary. Other types of three-dimensional coordinates such as Cartesian coordinates or cylindrical coordinates may also be used. If the sound source is moving, then a set of labels each corresponding to a given time frame may be provided. A neural network, for example, may be trained by receiving, processing, and learning from multiple sound features and their associated labels or sets of labels for stationary or moving sound sources to allow the sound detection system to estimate the locations of actual stationary or moving sound sources in a room environment.
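The label format described above can be illustrated with a minimal sketch. It assumes the source and device positions are known in Cartesian coordinates and converts them to azimuth, elevation, and distance. The function names `spherical_label` and `labels_for_track`, and the choice of degrees for the angles, are hypothetical and are not taken from the disclosure.

```python
import math

def spherical_label(source_xyz, device_xyz):
    """Return (azimuth_deg, elevation_deg, distance) of a source relative to the device."""
    dx, dy, dz = (s - d for s, d in zip(source_xyz, device_xyz))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.degrees(math.atan2(dy, dx))
    # Elevation is measured from the horizontal plane through the device.
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation, distance

def labels_for_track(track, device_xyz):
    """For a moving source, produce one label per time frame (one position per frame)."""
    return [spherical_label(position, device_xyz) for position in track]
```

For a stationary source a single call to `spherical_label` suffices; for a moving source, `labels_for_track` yields the set of per-frame labels.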
The ARTFs may be obtained across a dense grid of three-dimensional coordinates, which may be Cartesian coordinates, cylindrical coordinates, or spherical coordinates, in a three-dimensional space. The ARTF generator 202 obtains the ARTFs that have been measured in an anechoic chamber across a dense grid of distance, azimuth, and elevation. For a given distance, direction, and microphone number, the ARTF generator 202 generates the estimated ARTF by interpolating across the measured ARTFs;
$$A_{pm}(z) = \mathrm{ARTF\_Interpolator}(\theta_{pm}, d_{pm}).$$
The generated ARTFs are stored in a database (not shown) in the ARTF generator 202 for retrieval by the auralizer 210.
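The interpolation step can be sketched as follows. The disclosure does not specify the interpolation scheme, so this example assumes ARTFs measured as impulse responses on a uniform azimuth grid at a single distance and elevation, and uses plain linear interpolation between the two neighboring measurements. The name `interpolate_artf` and the dictionary layout are hypothetical.

```python
import math

def interpolate_artf(measured, azimuth_deg):
    """Linearly interpolate between the two nearest measured ARTFs.

    measured: dict mapping azimuth in degrees (uniform grid over [0, 360))
              to an impulse response (lists of floats of equal length).
    """
    angles = sorted(measured)
    step = 360.0 / len(angles)            # assumes a uniform grid
    pos = ((azimuth_deg - angles[0]) % 360.0) / step
    lo = int(math.floor(pos)) % len(angles)
    hi = (lo + 1) % len(angles)           # wrap around past the last grid point
    frac = pos - math.floor(pos)
    a, b = measured[angles[lo]], measured[angles[hi]]
    return [(1.0 - frac) * x + frac * y for x, y in zip(a, b)]
```

Sample-wise linear interpolation of impulse responses is only a crude stand-in; a practical interpolator would likely also cover distance and elevation and might operate in the frequency domain.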
In some implementations, it is expected that individual microphones in a multi-microphone array may have different response characteristics. Even if the microphones in the multi-microphone array are of the same make and model, there may be slight differences in their response characteristics due to manufacturing variations, for example. A microphone transfer function generator (e.g., a microphone simulator) 204 may be implemented to generate microphone transfer functions, which take into account the response characteristics of individual microphones in the multi-microphone array. The microphone simulator 204 uses the gain and phase variations obtained from published datasheets, or from random sampling of microphones, to generate a random transfer function of a typical microphone; i.e.,
$$M_m(z) = \mathrm{Microphone\_simulator}(m).$$
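A minimal sketch of such a microphone simulator is given below. It assumes the transfer function is represented as a complex gain per frequency bin and that gain and phase deviations are drawn uniformly within tolerances. The tolerance values, the per-bin independence, and the name `simulate_microphone` are illustrative assumptions, not values from any datasheet or from the disclosure.

```python
import cmath
import math
import random

def simulate_microphone(m, num_bins=8, gain_tol_db=1.0, phase_tol_deg=5.0, seed=0):
    """Draw a random frequency response for microphone index m.

    gain_tol_db and phase_tol_deg are hypothetical tolerances.
    Returns num_bins complex gains, one per frequency bin.
    """
    rng = random.Random(seed * 1000003 + m)   # reproducible, distinct per microphone
    response = []
    for _ in range(num_bins):
        gain_db = rng.uniform(-gain_tol_db, gain_tol_db)
        phase = math.radians(rng.uniform(-phase_tol_deg, phase_tol_deg))
        response.append(cmath.rect(10.0 ** (gain_db / 20.0), phase))
    return response
```

Drawing each bin independently yields a rough response; a smoother model (for example, a small random tilt across frequency) may be closer to real microphone-to-microphone variation.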
A near-microphone sound/speech generator 206 may be implemented to generate sounds or speech to be transmitted to the auralizer 210. In some implementations, the near-microphone sound/speech generator 206 may generate reference sound signals for the auralizer 210. The near-microphone sound/speech may be a “clean” single-channel sound drawn from a speech database, such as the TIMIT database, which contains phonemically and lexically transcribed speech of American English speakers of different genders and dialects. The generated near-microphone sound may be stored in a sound database (not shown) in the generator 206 for retrieval by the auralizer 210.
As shown in
The room simulator uses the simulated room dimensions and the reflection coefficients of the walls and ceiling, and provides path information for the various sound paths (direct and reflective paths) to each microphone in the array, including the direction of arrival with respect to the microphone and the length of the total path, represented by:
$$[R_{pm}(z), \theta_{pm}, d_{pm}] = \mathrm{Room\_Simulator}(\mathrm{dimension}, \mathrm{reflection\_coefficients}, p, m);$$
where Rpm(z), θpm, and dpm are the transfer function, direction of arrival, and distance of the p-th shortest path from the speaker to the m-th microphone, respectively. The dimensions and reflection coefficients of the simulated room may be varied to simulate any room configuration in which the multi-microphone device may be used. The sound paths for each configuration are determined to generate the auralized multi-channel signal, which may be used to train a neural network, etc.
The path counter may be incremented by 1 (310). If the attenuation of the previous n paths is less than a threshold, the room simulator has generated the path information of the simulated room for each microphone included in the device; otherwise, the n-th shortest path is determined (304).
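The path enumeration loop can be sketched as follows. The disclosure does not name a particular room model, so this example assumes a rectangular room, a single reflection coefficient `beta` shared by all surfaces, 1/distance spreading loss, and the classical image-source construction to list candidate paths. The function name `room_paths` and the parameters `max_paths`, `min_gain`, and `max_n` are hypothetical.

```python
import itertools
import math

def room_paths(dims, beta, src, mic, max_paths=50, min_gain=1e-3, max_n=2):
    """Return (distance, gain, azimuth_rad, elevation_rad) for the shortest paths.

    Assumes src and mic are distinct points inside a box with corner at the origin.
    """
    per_axis = []
    for length, s in zip(dims, src):
        axis = []
        for n in range(-max_n, max_n + 1):
            for q in (0, 1):
                # Image coordinate and number of wall reflections along this axis.
                axis.append(((1 - 2 * q) * s + 2 * n * length, abs(n - q) + abs(n)))
        per_axis.append(axis)
    candidates = []
    for (x, rx), (y, ry), (z, rz) in itertools.product(*per_axis):
        dx, dy, dz = x - mic[0], y - mic[1], z - mic[2]
        d = math.sqrt(dx * dx + dy * dy + dz * dz)
        gain = beta ** (rx + ry + rz) / d
        doa = (math.atan2(dy, dx), math.atan2(dz, math.hypot(dx, dy)))
        candidates.append((d, gain, doa[0], doa[1]))
    candidates.sort(key=lambda c: c[0])       # shortest path first
    paths, n = [], 0                          # n is the path counter
    while n < min(max_paths, len(candidates)):
        path = candidates[n]                  # n-th shortest path
        if path[1] < min_gain:                # attenuation below threshold: stop
            break
        paths.append(path)
        n += 1
    return paths
```

Because `max_n` bounds the image order, only the shorter paths are guaranteed to be in true distance order; a larger `max_n` extends that range at the cost of more candidates.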
The auralizer 210, including a processor, generates auralized multi-channel signals 212 and signal labels 214 corresponding to the auralized multi-channel signals 212 based on the inputs from the ARTF generator 202, the microphone transfer function generator 204, the near-microphone sound/speech generator 206, and the room simulator 208. The auralized path from a speaker to each microphone is obtained by combining the transfer function of the path from the room simulator 208 with that of the corresponding ARTF for each microphone, represented by
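$$H_{pm}(z) = R_{pm}(z)\,A_{pm}(z)$$
where Rpm(z) is the transfer function of the p-th path from the room simulator 208 and Apm(z) is the corresponding ARTF.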
If x(n) is the signal from the speaker, the auralized signal ym(n) at the m-th microphone is represented by
$$y_m(n) = h_m * x(n);$$
where hm is the impulse response of the transfer function Hm(z). The auralized transfer function Hm(z) may be modified to simulate only the initial reverberation, while the late reverberations can be simulated by a decaying random process, where the decay rate is dependent on the room reverberation characteristics, i.e.,
$$y_m(n) = h_m * x(n) + \sigma(n)\,\nu(n)$$
where σ(n) is the decaying function and ν(n) is a white noise process with unit variance.
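The signal model with a synthetic late-reverberation tail can be sketched as follows. The exponential form of σ(n), its parameterization by a reverberation time `rt60_s`, and the `tail_level` scale are assumptions made for illustration; the text only states that σ(n) decays at a rate that depends on the room.

```python
import math
import random

def convolve(h, x):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

def auralize_with_tail(h, x, rt60_s, fs, tail_level=0.01, seed=0):
    """Compute y(n) = (h * x)(n) + sigma(n) * nu(n)."""
    rng = random.Random(seed)
    y = convolve(h, x)
    # sigma(n): envelope that falls by 60 dB (a factor of 1000) after rt60_s seconds.
    decay = math.log(1000.0) / (rt60_s * fs)
    return [yn + tail_level * math.exp(-decay * n) * rng.gauss(0.0, 1.0)
            for n, yn in enumerate(y)]
```

Here `h` would hold only the early part of the auralized impulse response, and the noise term stands in for the late reverberation.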
As shown in
The auralizer may compute the auralized path for each microphone by convolving the path with the corresponding ARTF (412) and combine all of the auralized paths to a microphone to obtain the auralized impulse response for the respective microphone (414). The auralized impulse response may then be convolved with the m-th microphone transfer function (416), yielding the auralized impulse responses for each of the microphones (418).
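Steps 412 through 416 can be sketched for a single microphone as follows. The example assumes each room path is modeled as a pure delay with attenuation, that the ARTF for each path has already been selected for its direction of arrival, and that all responses are time-domain impulse responses. The function names and the speed-of-sound default are illustrative assumptions.

```python
def convolve(a, b):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def auralized_impulse_response(paths, artfs, mic_ir, fs, c=343.0):
    """paths: list of (distance_m, gain); artfs: one impulse response per path."""
    auralized_paths = []
    for (distance, gain), artf in zip(paths, artfs):
        delay = int(round(distance / c * fs))
        room_ir = [0.0] * delay + [gain]                 # path modeled as delay and gain
        auralized_paths.append(convolve(room_ir, artf))  # step 412
    total = [0.0] * max(len(p) for p in auralized_paths)
    for p in auralized_paths:                            # step 414: sum all paths
        for i, v in enumerate(p):
            total[i] += v
    return convolve(total, mic_ir)                       # step 416: microphone response
```

Calling this once per microphone, with that microphone's paths, ARTFs, and transfer function, would yield the set of auralized impulse responses for the device.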
As disclosed, in some embodiments the auralizer generates an auralized impulse response for each microphone for the simulated room dimensions and reflection coefficient, microphone transfer function, and position of the microphone in the simulated room. In some embodiments the auralizer determines an auralized impulse response for a plurality of different scenarios, where the simulated room configuration, the microphone transfer function, and/or the simulated room dimensions and reflection coefficients may change.
If the microphone transfer function is to be changed, the respective microphone transfer function is retrieved from the microphone simulator (406).
If the position of the speaker or microphone changes, the room simulator generates the path information for each path (408).
If the configuration of a new room is read, the desired room dimensions and reflection coefficients are obtained (404).
As disclosed herein, some embodiments may use the auralized multi-channel signals generated by the auralizing system to train a neural network, a sound classifier, and the like.
In some implementations, the auralizing system may generate auralized multi-channel signals from not only a stationary sound source but also a moving sound source. For example, a moving sound source may be a person who is talking and walking at the same time, or an animal that is barking and running at the same time. For a moving sound source, the ARTFs and the room impulse responses may be obtained across a dense grid of three-dimensional coordinates over time, and each ARTF and each room impulse response at a given point in space may vary as a function of time. In some implementations, the ARTFs and the room impulse responses may be regarded as having a fourth dimension (time) in addition to the three dimensions of space.
The distance and direction of a moving sound source with respect to the m-th microphone can be expressed in parametric form d(t) and θ(t), respectively, where t is the time instant. Consequently, the auralized impulse response from the speaker to a microphone at time t is a function of the distance and direction, e.g., Hm(z, d(t), θ(t)), or more concisely Hm(z, t).
Let $h_{m,t_0}(n), h_{m,t_1}(n), \ldots, h_{m,t_T}(n)$ be the known impulse responses of the auralized transfer functions Hm(z, t0), Hm(z, t1), . . . , Hm(z, tT), respectively. Then the impulse response at any time t, where 0<t<T, can be estimated by interpolating across the known impulse responses; i.e.,
$$h_{m,t}(n) = \mathrm{Impulse\_response\_interpolator}\big(h_{m,t_0}(n), h_{m,t_1}(n), \ldots, h_{m,t_T}(n), t\big).$$
Consequently, a moving sound source can be implemented as a time-varying impulse response where the variations are computed using the interpolator. If x(n) is the signal from the moving source, the auralized signal at the m-th microphone may be represented by:
$$y_m(n) = x(n) * h_{m,t}(n);$$
where hm,t(n) is a time-varying filter.
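The time-varying filter can be sketched as follows. The example assumes linear interpolation between the two known impulse responses that bracket the current time, and re-interpolates the filter at every output sample. The names `interpolate_ir` and `auralize_moving` are hypothetical, and the disclosure does not specify the interpolation scheme.

```python
import bisect

def interpolate_ir(times, irs, t):
    """Linearly interpolate between known impulse responses at sorted times."""
    if t <= times[0]:
        return list(irs[0])
    if t >= times[-1]:
        return list(irs[-1])
    k = bisect.bisect_right(times, t)         # times[k - 1] <= t < times[k]
    w = (t - times[k - 1]) / (times[k] - times[k - 1])
    return [(1.0 - w) * a + w * b for a, b in zip(irs[k - 1], irs[k])]

def auralize_moving(x, times, irs, fs):
    """y(n) = sum_k h_t(k) x(n - k), with h_t taken at t = n / fs."""
    y = []
    for n in range(len(x)):
        h = interpolate_ir(times, irs, n / fs)
        y.append(sum(hk * x[n - k] for k, hk in enumerate(h) if n - k >= 0))
    return y
```

Re-interpolating per sample is simple but costly; updating the filter once per frame would be a common compromise.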
In some embodiments, the output from each of the transfer functions Hm(z, t0), Hm(z, t1), . . . , Hm(z, tT) is computed, and an appropriately selected weighted combination of the outputs that varies over time is used to auralize a moving sound source. If x(n) is the input to the transfer functions Hm(z, t0), Hm(z, t1), . . . , Hm(z, tT), and $y_{t_0}(n), y_{t_1}(n), \ldots, y_{t_T}(n)$ are the corresponding outputs, the auralized signal, ym(n), at the m-th microphone can be computed by utilizing time-varying weights; i.e.,
$$y_m(n) = w_0(t)\,y_{t_0}(n) + w_1(t)\,y_{t_1}(n) + \cdots + w_T(t)\,y_{t_T}(n)$$
where w0(t)+w1(t)+ . . . +wT(t)=1. By appropriately varying the weights w0(t), w1(t), . . . wT(t), a moving source can be simulated. A block diagram of an implementation is shown in
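The weighted-combination approach can be sketched as follows. The example assumes triangular (linear crossfade) weights between neighboring known time instants, which sum to one by construction, and impulse responses of equal length. The weight shape and the names `triangular_weights` and `auralize_by_mixing` are illustrative assumptions.

```python
def convolve(h, x):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

def triangular_weights(t, times):
    """Weights w_k(t) that crossfade linearly between neighboring times."""
    w = [0.0] * len(times)
    if t <= times[0]:
        w[0] = 1.0
        return w
    if t >= times[-1]:
        w[-1] = 1.0
        return w
    for k in range(len(times) - 1):
        if times[k] <= t <= times[k + 1]:
            frac = (t - times[k]) / (times[k + 1] - times[k])
            w[k], w[k + 1] = 1.0 - frac, frac
            break
    return w

def auralize_by_mixing(x, times, irs, fs):
    """Filter x with every fixed response, then mix the outputs with w_k(t)."""
    outputs = [convolve(h, x) for h in irs]   # assumes equal-length responses
    y = []
    for n in range(len(outputs[0])):
        w = triangular_weights(n / fs, times)
        y.append(sum(wk * out[n] for wk, out in zip(w, outputs)))
    return y
```

Compared with re-interpolating a single filter, this variant runs T+1 fixed filters in parallel and moves the time variation into the mixing weights.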
Embodiments of the presently disclosed subject matter may be implemented in and used with a variety of component and network architectures. For example, the system for auralizing multi-channel signal for a multi-microphone device as shown in
The bus 21 allows data communication between the central processor 24 and one or more memory components, which may include RAM, ROM, and other memory, as previously noted. Typically RAM is the main memory into which an operating system and application programs are loaded. A ROM or flash memory component can contain, among other code, the Basic Input-Output system (BIOS) which controls basic hardware operation such as the interaction with peripheral components. Applications resident with the computer 20 are generally stored on and accessed via a computer readable medium, such as a hard disk drive (e.g., fixed storage 23), an optical drive, floppy disk, or other storage medium.
The fixed storage 23 may be integral with the computer 20 or may be separate and accessed through other interfaces. The network interface 29 may provide a direct connection to a remote server via a wired or wireless connection. The network interface 29 may provide such connection using any suitable technique and protocol as will be readily understood by one of skill in the art, including digital cellular telephone, Wi-Fi, Bluetooth®, near-field, and the like. For example, the network interface 29 may allow the computer to communicate with other computers via one or more local, wide-area, or other communication networks, as described in further detail below.
Many other devices or components (not shown) may be connected in a similar manner (e.g., document scanners, digital cameras and so on). Conversely, all of the components shown in
More generally, various embodiments of the presently disclosed subject matter may include or be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments also may be embodied in the form of a computer program product having computer program code containing instructions embodied in non-transitory or tangible media, such as floppy diskettes, CD-ROMs, hard drives, USB (universal serial bus) drives, or any other machine readable storage medium, such that when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. Embodiments also may be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, such that when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
In some configurations, a set of computer-readable instructions stored on a computer-readable storage medium may be implemented by a general-purpose processor, which may transform the general-purpose processor or a device containing the general-purpose processor into a special-purpose device configured to implement or carry out the instructions. Embodiments may be implemented using hardware that may include a processor, such as a general purpose microprocessor or an Application Specific Integrated Circuit (ASIC) that embodies all or part of the techniques according to embodiments of the disclosed subject matter in hardware or firmware. The processor may be coupled to memory, such as RAM, ROM, flash memory, a hard disk or any other device capable of storing electronic information. The memory may store instructions adapted to be executed by the processor to perform the techniques according to embodiments of the disclosed subject matter.
In some embodiments, the multi-microphone device 100 as shown in
In general, a “sensor” as disclosed herein may include multiple sensors or sub-sensors, such as where a position sensor includes both a global positioning sensor (GPS) as well as a wireless network sensor, which provides data that can be correlated with known wireless networks to obtain location information. Multiple sensors may be arranged in a single physical housing, such as where a single device includes movement, temperature, magnetic, or other sensors. Such a housing also may be referred to as a sensor or a sensor device. For clarity, sensors are described with respect to the particular functions they perform or the particular physical hardware used, when such specification is necessary for understanding of the embodiments disclosed herein.
A sensor may include hardware in addition to the specific physical sensor that obtains information about the environment.
Sensors as disclosed herein may operate within a communication network, such as a conventional wireless network, or a sensor-specific network through which sensors may communicate with one another or with dedicated other devices. In some configurations one or more sensors may provide information to one or more other sensors, to a central controller, or to any other device capable of communicating on a network with the one or more sensors. A central controller may be general- or special-purpose. For example, one type of central controller is a home automation network that collects and analyzes data from one or more sensors within the home. Another example of a central controller is a special-purpose controller that is dedicated to a subset of functions, such as a security controller that collects and analyzes sensor data primarily or exclusively as it relates to various security considerations for a location. A central controller may be located locally with respect to the sensors with which it communicates and from which it obtains sensor data, such as in the case where it is positioned within a home that includes a home automation or sensor network. Alternatively or in addition, a central controller as disclosed herein may be remote from the sensors, such as where the central controller is implemented as a cloud-based system that communicates with multiple sensors, which may be located at multiple locations and may be local or remote with respect to one another.
Moreover, the smart-home environment may make inferences about which individuals live in the home and are therefore users and which electronic devices are associated with those individuals. As such, the smart-home environment may “learn” who is a user (e.g., an authorized user) and permit the electronic devices associated with those individuals to control the network-connected smart devices of the smart-home environment, in some embodiments including sensors used by or within the smart-home environment. Various types of notices and other information may be provided to users via messages sent to one or more user electronic devices. For example, the messages can be sent via email, short message service (SMS), multimedia messaging service (MMS), unstructured supplementary service data (USSD), as well as any other type of messaging services or communication protocols.
A smart-home environment may include communication with devices outside of the smart-home environment but within a proximate geographical range of the home. For example, the smart-home environment may communicate information through the communication network or directly to a central server or cloud-computing system regarding detected movement or presence of people, animals, and any other objects, and receive back commands for controlling the lighting accordingly.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit embodiments of the disclosed subject matter to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to explain the principles of embodiments of the disclosed subject matter and their practical applications, to thereby enable others skilled in the art to utilize those embodiments as well as various embodiments with various modifications as may be suited to the particular use contemplated.
Relation | Number | Date | Country
---|---|---|---
Parent | 15996070 | Jun 2018 | US
Child | 16555118 | | US
Parent | 15170924 | Jun 2016 | US
Child | 15996070 | | US