With the advancement of technology, the use and popularity of electronic devices have increased considerably. Electronic devices are commonly used to capture and process audio data.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
Electronic devices may be used to capture input audio and process input audio data. The input audio data may be used for voice commands and/or sent to a remote device as part of a communication session. In addition, the electronic devices may be used to process output audio data and generate output audio. The output audio may correspond to the communication session or may be associated with media content, such as audio corresponding to music or movies played in a home theater. Multiple devices may be grouped together in order to generate output audio using a combination of the multiple devices.
To optimize audio quality generated by multiple devices for a listening position of a user, devices, systems and methods are disclosed that perform user localization to determine the listening position and use the listening position to generate map data representing a device map. In some examples, the map data may include a listening position and/or television associated with the home theater group, such that the map data is centered on the listening position with the television along a vertical axis. The map data may be used to generate renderer coefficient values for each of the devices, enabling each individual device to generate playback audio that takes into account the location of the device and characteristics of the device (e.g., frequency response, etc.).
To improve user localization in an environment, the system may determine a location of the user and/or a user orientation using a combination of (i) timing information indicating temporal differences between multiple devices and (ii) angle information determined by individual microphone arrays of the multiple devices. For example, each of the multiple devices may generate audio data capturing an audible sound generated by a sound source (e.g., user, television, and/or the like). Using multi-channel audio data generated by an individual microphone array, the system may compare when the audible sound is captured by individual microphones to determine the angle information. The angle information may indicate a direction of the sound source relative to the microphone array and may be determined using Angle of Arrival (AoA) processing, although the disclosure is not limited thereto. Using audio data generated by each of the multiple devices, the system may compare when the audible sound is captured by individual devices to determine the timing information. The timing information may indicate a direction of the sound source relative to the individual device and may be determined using Time Difference of Arrival (TDoA) processing, although the disclosure is not limited thereto.
Using the timing information and/or the angle information, the system may generate a spatial likelihood function that represents the environment using a grid comprising a plurality of grid cells. For example, the spatial likelihood function may associate each grid cell with a spatial likelihood value indicating a likelihood that the grid cell corresponds to a location of the sound source. By combining a first spatial likelihood function generated using the timing information with a second spatial likelihood function generated using the angle information, the system may generate a total spatial likelihood function and may associate a maximum value in the total spatial likelihood function with the location of the user or sound source.
The device 110 may be an electronic device configured to capture and/or receive audio data. For example, the device 110 may include a microphone array configured to generate input audio data, although the disclosure is not limited thereto and the device 110 may include multiple microphones without departing from the disclosure. As is known and used herein, “capturing” an audio signal and/or generating audio data includes a microphone transducing audio waves (e.g., sound waves) of captured sound to an electrical signal and a codec digitizing the signal to generate the microphone audio data. In addition to capturing the input audio data, the device 110 may be configured to receive output audio data and generate output audio using one or more loudspeakers of the device 110. For example, the device 110 may generate output audio corresponding to media content, such as music, a movie, and/or the like.
As illustrated in
In some examples, the fourth device 110d may receive a home theater configuration. For example, the user may use a smartphone or other device and may input the home theater configuration using a user interface. However, the disclosure is not limited thereto, and the system 100 may receive the home theater configuration without departing from the disclosure. In response to the home theater configuration, the fourth device 110d may generate calibration data indicating a sequence for generating playback audio, may send the calibration data to each device in the home theater group, and may cause the devices to perform the calibration sequence. For example, the calibration data may indicate that the first device 110a may generate a first audible sound during a first time range, the second device 110b may generate a second audible sound during a second time range, the third device 110c may generate a third audible sound during a third time range, and the fourth device 110d may generate a fourth audible sound during a fourth time range. In some examples there are gaps between the audible sounds, such that the calibration data may include values of zero (e.g., padded with zeroes between audible sounds), but the disclosure is not limited thereto and the calibration data may not include gaps without departing from the disclosure.
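To make the scheduling concrete, the following Python sketch builds one device's playback signal so that it is active only during its own time range and padded with zeroes elsewhere; the sample rate, calibration chirp, and gap length are illustrative assumptions rather than values specified by the disclosure.

```python
import numpy as np

SAMPLE_RATE = 48000  # assumed sample rate for this sketch

def build_calibration_signal(device_index, num_devices, chirp, gap_s=0.5):
    """Build one device's playback signal for the calibration sequence.

    Each device plays the calibration chirp only during its own time range and
    outputs silence (zero padding) everywhere else, so the audible sounds from
    the devices do not overlap.
    """
    gap = np.zeros(int(gap_s * SAMPLE_RATE))
    slot = np.concatenate([chirp, gap])            # one time slot per device
    signal = np.zeros(num_devices * len(slot))
    start = device_index * len(slot)
    signal[start:start + len(chirp)] = chirp       # active only in own slot
    return signal
```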
During the calibration sequence, a single device 110 may generate an audible sound and the remaining devices may capture the audible sound in order to determine a relative direction and/or distance. For example, when the first device 110a generates the first audible sound, the second device 110b may capture the first audible sound by generating first audio data including a first representation of the first audible sound. Thus, the second device 110b may perform localization (e.g., sound source localization (SSL) processing and/or the like) using the first audio data and determine a first position of the first device 110a relative to the second device 110b. Similarly, the third device 110c may generate second audio data including a second representation of the first audible sound. Thus, the third device 110c may perform localization using the second audio data and may determine a second position of the first device 110a relative to the third device 110c. Each of the devices 110 may perform these steps to generate audio data and/or determine a relative position of the first device 110a relative to the other devices 110, as described in greater detail below with regard to
The fourth device 110d may receive measurement data from the devices 110 in the home theater group. For example, the fourth device 110d may receive first measurement data from the second device 110b, second measurement data from the third device 110c, and third measurement data from the first device 110a, although the disclosure is not limited thereto. In some examples, the measurement data may include angle information (e.g., angle of arrival) representing a relative direction from one device to another, along with timing information that the system 100 may use to determine distance values representing relative distances between the devices.
The fourth device 110d may determine relative positions of the devices 110 using the distance values. For example, the fourth device 110d may determine an optimal arrangement of the devices 110 in the home theater group, such as by using multi-dimensional scaling, determining a least-squares solution, and/or the like. While the relative positions remain fixed based on the distance values between the devices 110, the location of the relative positions may vary. Thus, the fourth device 110d may perform additional processing to determine exact locations of the devices 110.
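As one hedged illustration of this step, classical multidimensional scaling recovers coordinates from pairwise distances only up to rotation, reflection, and translation, which matches the observation that the relative arrangement is fixed while its placement may vary; the sketch below is not asserted to be the exact procedure used by the fourth device 110d.

```python
import numpy as np

def positions_from_distances(D):
    """Estimate 2-D relative device positions from an N x N distance matrix D.

    Classical multidimensional scaling: double-center the squared distances and
    take the top two eigenvectors of the resulting Gram matrix.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:2]          # two largest eigenvalues
    coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return coords                                  # N x 2 relative positions
```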
In some examples, the fourth device 110d may determine device orientation data. For example, the fourth device 110d may use the relative positions of the devices 110 and the angle information included in the measurement data to determine an orientation of each device. To illustrate an example, the fourth device 110d may identify a first angle value represented in the measurement data, which indicates a direction of the third device 110c relative to an orientation of the second device 110b (e.g., relative angle of arrival). The fourth device 110d may then use the relative positions to determine a second angle value that corresponds to the actual direction of the third device 110c relative to the second device 110b in the global coordinate system (e.g., absolute angle of arrival). Using the first angle value and the second angle value, the fourth device 110d may determine the orientation of the second device 110b, which indicates a rotation of the second device 110b relative to the global coordinate system. For example, the combination of the orientation of the second device 110b and the first angle value (e.g., relative angle of arrival) is equal to the second angle value (e.g., absolute angle of arrival). Thus, once the fourth device 110d determines the device orientation data, the fourth device 110d may convert each of the relative angle of arrivals included in the measurement data to absolute angle of arrivals that correspond to the actual directions between the devices 110 in the global coordinate system.
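Because the device orientation plus the relative angle of arrival equals the absolute angle of arrival, the orientation can be recovered as the difference between the two angles. The sketch below averages that difference over several observed devices on the unit circle, which is an added assumption intended to reduce measurement noise.

```python
import numpy as np

def device_orientation(relative_aoa, absolute_aoa):
    """Estimate a device's orientation (rotation relative to the global frame).

    relative_aoa and absolute_aoa are arrays of angles (radians) toward the
    other devices, measured relative to the device and in the global coordinate
    system respectively; orientation + relative AoA = absolute AoA.
    """
    diff = np.asarray(absolute_aoa) - np.asarray(relative_aoa)
    # Average on the unit circle so the wrap-around at +/-pi does not bias it.
    return np.arctan2(np.sin(diff).mean(), np.cos(diff).mean())
```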
As described in greater detail below, the system 100 may determine a location of the user and/or a user orientation using a combination of (i) timing information indicating temporal differences between the devices 110a-110d and (ii) angle information determined by individual microphone arrays of the devices 110a-110d. For example, each of the devices 110a-110d may generate audio data capturing an audible sound generated by a sound source (e.g., user, television, and/or the like). Using multi-channel audio data generated by an individual microphone array, the system 100 may compare when the audible sound is captured by individual microphones to determine the angle information. The angle information may indicate a direction of the sound source relative to the microphone array and may be determined using Angle of Arrival (AoA) processing, although the disclosure is not limited thereto. Using audio data generated by each of the devices 110a-110d, the system 100 may compare when the audible sound is captured by individual devices to determine the timing information. The timing information may indicate a direction of the sound source relative to the individual device and may be determined using Time Difference of Arrival (TDoA) processing, although the disclosure is not limited thereto.
Using the timing information and/or the angle information, the system 100 may generate a spatial likelihood function that represents the environment using a grid comprising a plurality of grid cells. For example, the spatial likelihood function may associate each grid cell with a spatial likelihood value indicating a likelihood that the grid cell corresponds to a location of the sound source. By combining a first spatial likelihood function generated using the timing information with a second spatial likelihood function generated using the angle information, the system 100 may generate a total spatial likelihood function and may associate a maximum value in the total spatial likelihood function with the location of the user or sound source.
As illustrated in
The fourth device 110d may receive (134) Angle of Arrival (AoA) data. In some examples, each of the devices 110a-110c may generate a portion of the AoA data and may send their respective portions of the AoA data to the fourth device 110d. However, the disclosure is not limited thereto, and in other examples each of the devices 110a-110c may send multi-channel audio data captured by a corresponding microphone array to the fourth device 110d and the fourth device 110d may generate the AoA data using the multi-channel audio data without departing from the disclosure.
The fourth device 110d may receive (136) cross-correlation data associated with pairwise combinations of the devices 110a-110d. For example, the system 100 may generate first cross-correlation data using first audio data generated by the first device 110a and second audio data generated by the second device 110b, second cross-correlation data using the first audio data and third audio data generated by the third device 110c, and third cross-correlation data using the first audio data and fourth audio data generated by the fourth device 110d. Similarly, the system 100 may generate fourth cross-correlation data using the second audio data and the third audio data, fifth cross-correlation data using the second audio data and the fourth audio data, and sixth cross-correlation data using the third audio data and the fourth audio data. Thus, the cross-correlation data corresponds to pairwise combinations of the devices 110a-110d.
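A minimal sketch of generating such pairwise cross-correlation data is shown below, assuming the per-device audio is available as NumPy arrays; plain cross-correlation is used for simplicity, and a weighted variant (e.g., GCC-PHAT) could be substituted without changing the structure.

```python
import numpy as np
from itertools import combinations
from scipy.signal import correlate

def pairwise_cross_correlation(device_audio):
    """Compute cross-correlation data for every pairwise combination of devices.

    device_audio maps a device identifier to a 1-D array of audio samples; the
    result maps each (deviceA, deviceB) pair to its full cross-correlation.
    """
    xcorr = {}
    for a, b in combinations(sorted(device_audio), 2):
        xcorr[(a, b)] = correlate(device_audio[a], device_audio[b], mode="full")
    return xcorr
```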
The fourth device 110d may determine (138) an AoA spatial likelihood function using the AoA data. For example, the fourth device 110d may use the angle information associated with each device 110a-110d to determine individual spatial likelihood functions and may determine the AoA spatial likelihood function by summing or otherwise combining the individual spatial likelihood functions, although the disclosure is not limited thereto.
As described in greater detail below, a spatial likelihood function represents an environment (e.g., search space) using a grid that comprises a plurality of grid cells or grid points (e.g., plurality of segments) having a uniform size. Thus, the system 100 may divide the search space into the plurality of grid cells and determine a spatial likelihood value for each grid cell. For example, the system 100 may determine a first spatial likelihood value associated with a first grid cell, and the first spatial likelihood value may indicate a first likelihood that the first grid cell corresponds to the user (e.g., the user location and/or listening position is located within the first grid cell). Similarly, the system 100 may determine a second spatial likelihood value associated with a second grid cell, and the second spatial likelihood value may indicate a second likelihood that the second grid cell corresponds to the user (e.g., the user location and/or listening position is located within the second grid cell). Thus, the spatial likelihood function indicates the relative likelihood that the user is located within each grid cell, enabling the fourth device 110d to determine a user location by identifying a maximum likelihood value represented in the total spatial likelihood function.
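One way to realize a per-device spatial likelihood function over such a grid is sketched below; the von Mises-style angular weighting (concentration kappa) is an illustrative assumption, not a model mandated by the disclosure. Summing the per-device grids then yields the AoA spatial likelihood function of step 138.

```python
import numpy as np

def aoa_spatial_likelihood(grid_xy, device_xy, orientation, relative_aoa,
                           kappa=4.0):
    """Spatial likelihood over grid cells from one device's AoA estimate.

    grid_xy is an (M, 2) array of grid-cell centers, device_xy the device
    location, orientation its rotation in the global frame, and relative_aoa
    the estimated direction of the sound source relative to the device.
    """
    expected = orientation + relative_aoa                # absolute AoA
    bearing = np.arctan2(grid_xy[:, 1] - device_xy[1],
                         grid_xy[:, 0] - device_xy[0])   # angle to each cell
    return np.exp(kappa * np.cos(bearing - expected))    # one value per cell
```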
The audio data may represent user speech or other utterances generated by the user, as captured by each of the devices 110a-110d. Thus, each cross-correlation data includes a peak representing the user speech, which corresponds to a Time Difference of Arrival (TDoA) between the two devices associated with the cross-correlation data. The fourth device 110d may determine (140) a TDoA spatial likelihood function using the cross-correlation data. For example, the fourth device 110d may use the cross-correlation data to determine pairwise spatial likelihood functions and may determine the TDoA spatial likelihood function by summing or otherwise combining the pairwise spatial likelihood functions, although the disclosure is not limited thereto.
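A hedged sketch of one such pairwise spatial likelihood function follows: for every grid cell, the time difference of arrival expected from geometry is computed and used to look up the cross-correlation value at the corresponding lag, so cells consistent with the cross-correlation peak receive large values. The lag indexing and the absence of interpolation are simplifications of this sketch.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second

def tdoa_spatial_likelihood(grid_xy, pos_a, pos_b, xcorr_ab, sample_rate):
    """Spatial likelihood over grid cells from one device pair's cross-correlation.

    grid_xy is an (M, 2) array of grid-cell centers, pos_a and pos_b the two
    device locations, and xcorr_ab their full cross-correlation (aligned audio).
    """
    d_a = np.linalg.norm(grid_xy - pos_a, axis=1)
    d_b = np.linalg.norm(grid_xy - pos_b, axis=1)
    tdoa = (d_a - d_b) / SPEED_OF_SOUND                  # expected TDoA per cell
    zero_lag = (len(xcorr_ab) - 1) // 2                  # index of zero lag
    lags = np.clip(zero_lag + np.round(tdoa * sample_rate).astype(int),
                   0, len(xcorr_ab) - 1)
    return xcorr_ab[lags]                                # one value per cell
```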
After determining the AoA spatial likelihood function and the TDoA spatial likelihood function, the fourth device 110d may determine (142) a final spatial likelihood function. For example, the fourth device 110d may determine the final spatial likelihood function by combining the AoA spatial likelihood function and the TDoA spatial likelihood function. The fourth device 110d may combine these spatial likelihood functions using a weighted sum operation, a log-likelihood operation, and/or the like without departing from the disclosure.
The fourth device 110d may determine (144) a user location using the final spatial likelihood function. For example, the fourth device 110d may determine the user location by identifying a maximum likelihood value represented in the final spatial likelihood function, although the disclosure is not limited thereto.
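A minimal sketch of steps 142-144 is shown below, combining the two spatial likelihood functions with a weighted sum (the weights are illustrative assumptions; a log-likelihood combination would also fit the description) and selecting the grid cell with the maximum value.

```python
import numpy as np

def localize_user(aoa_slf, tdoa_slf, grid_xy, w_aoa=0.5, w_tdoa=0.5):
    """Combine AoA and TDoA spatial likelihood functions and pick a location.

    aoa_slf and tdoa_slf are per-grid-cell likelihood arrays and grid_xy the
    matching (M, 2) array of grid-cell centers.
    """
    final_slf = w_aoa * aoa_slf + w_tdoa * tdoa_slf      # weighted sum
    return grid_xy[np.argmax(final_slf)], final_slf      # location, final SLF
```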
The fourth device 110d may determine (146) a user orientation, which indicates a look direction from the listening position to the television. In some examples, the fourth device 110d may repeat steps 134-142 for audio data representing an audible sound generated by the television. For example, the fourth device 110d may use the audio data from each of the devices 110a-110d to determine a second AoA spatial likelihood function, a second TDoA spatial likelihood function, and a second final spatial likelihood function corresponding to the television.
In some examples, the fourth device 110d may determine a maximum likelihood value represented in the second final spatial likelihood function, may associate the maximum likelihood value with a location of the television, and may determine the user orientation based on a look direction between the listening position and the estimated location of the television. However, the disclosure is not limited thereto, and in other examples the fourth device 110d may determine the user orientation without determining the location of the television. For example, the second final spatial likelihood function may include a plurality of likelihood values that are similar to the maximum likelihood value, indicating that the location of the television could correspond to a plurality of grid cells. Thus, the system 100 may determine a rough area associated with the television, but not the actual location of the television. As the user orientation indicates a direction of the television relative to the listening position, the fourth device 110d may determine the user orientation based on the plurality of grid cells without departing from the disclosure. However, the disclosure is not limited thereto, and in some examples the fourth device 110d may determine the user orientation without determining the second final spatial likelihood function without departing from the disclosure.
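The sketch below illustrates this idea: rather than committing to a single television location, the highest-valued grid cells of the television's spatial likelihood function are averaged into a rough area and the user orientation is taken as the bearing from the listening position toward that area. The top_fraction threshold is an assumption of the sketch.

```python
import numpy as np

def user_orientation(user_xy, tv_slf, grid_xy, top_fraction=0.05):
    """Estimate the look direction from the listening position toward the TV.

    tv_slf is the television's per-grid-cell spatial likelihood function and
    grid_xy the matching (M, 2) array of grid-cell centers.
    """
    k = max(1, int(top_fraction * len(tv_slf)))
    best = np.argsort(tv_slf)[-k:]                       # most likely TV cells
    tv_area = grid_xy[best].mean(axis=0)                 # rough television area
    return np.arctan2(tv_area[1] - user_xy[1], tv_area[0] - user_xy[0])
```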
Determining the user location and/or user orientation enables the system 100 to provide context for the device map, such as centering the device map on a listening position associated with the user and/or orienting the device map based on a look direction from the listening position to the television. This context is beneficial as it enables the system 100 to render output audio properly for the home theater group, with a sound stage of the output audio aligned with the television (e.g., directional sounds generated in the appropriate direction) and volume balanced between the devices (e.g., a volume of the output audio generated by a particular device is determined based on a distance from the device to the listening position).
The fourth device 110d may generate (148) map data. For example, the fourth device 110d may generate map data indicating locations of each of the devices 110 included in the home theater group. In some examples, the fourth device 110d may use the user location (e.g., listening position) and user orientation to determine a center point and an orientation of the device map. For example, the fourth device 110d may generate the map data with the center point corresponding to the listening position, such that coordinate values of each of the locations in the map data indicate a position relative to the listening position. Additionally or alternatively, the fourth device 110d may generate the map data with the television along a vertical axis from the listening position, such that a look direction from the listening position to the television extends vertically along the vertical axis.
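As a hedged illustration of step 148, device locations can be translated so the listening position becomes the origin and then rotated so the look direction toward the television lies along the positive vertical axis; the particular rotation convention is an assumption of the sketch.

```python
import numpy as np

def build_map_data(device_xy, user_xy, look_direction):
    """Express device locations relative to the listening position.

    device_xy is an (N, 2) array of device locations, user_xy the listening
    position, and look_direction the bearing (radians) from the listening
    position to the television in the same coordinate system.
    """
    shifted = np.asarray(device_xy) - np.asarray(user_xy)    # center on listener
    theta = np.pi / 2 - look_direction                       # rotate look dir to +y
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return shifted @ rot.T                                   # N x 2 map coordinates
```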
After generating the map data, the fourth device 110d may send (150) the map data to a rendering component to generate rendering coefficient values. For example, the rendering component may process the map data and determine rendering coefficient values for each of the devices 110a-110d included in the home theater group.
As used herein, audio signals or audio data (e.g., microphone audio data, or the like) may correspond to a specific range of frequency bands. For example, the audio data may correspond to a human hearing range (e.g., 20 Hz-20 kHz), although the disclosure is not limited thereto.
As used herein, a frequency band (e.g., frequency bin) corresponds to a frequency range having a starting frequency and an ending frequency. Thus, the total frequency range may be divided into a fixed number (e.g., 256, 512, etc.) of frequency ranges, with each frequency range referred to as a frequency band and corresponding to a uniform size. However, the disclosure is not limited thereto and the size of the frequency band may vary without departing from the disclosure.
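For example, dividing the human hearing range into 256 uniformly sized frequency bands could be expressed as follows; the count and the linear spacing are illustrative choices, as the band size may vary per the paragraph above.

```python
import numpy as np

NUM_BANDS = 256
edges = np.linspace(20.0, 20000.0, NUM_BANDS + 1)    # band boundaries in Hz
bands = list(zip(edges[:-1], edges[1:]))              # (start, end) of each band
```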
The device 110 may include multiple microphones configured to capture sound and pass the resulting audio signal created by the sound to a downstream component. Each individual piece of audio data captured by a microphone may be in the time domain. To isolate audio from a particular direction, the device may compare the audio data (or audio signals related to the audio data, such as audio signals in a sub-band domain) to determine a time difference of detection of a particular segment of audio data. If the audio data for a first microphone includes the segment of audio data earlier in time than the audio data for a second microphone, then the device may determine that the source of the audio that resulted in the segment of audio data may be located closer to the first microphone than to the second microphone (resulting in the audio being detected by the first microphone before being detected by the second microphone).
Using such direction isolation techniques, a device 110 may isolate directionality of audio sources. A particular direction may be associated with azimuth angles divided into bins (e.g., 0-45 degrees, 46-90 degrees, and so forth). To isolate audio from a particular direction, the device 110 may apply a variety of audio filters to the output of the microphones where certain audio is boosted while other audio is dampened, to create isolated audio corresponding to a particular direction, which may be referred to as a beam. While in some examples the number of beams may correspond to the number of microphones, the disclosure is not limited thereto and the number of beams may be independent of the number of microphones. For example, a two-microphone array may be processed to obtain more than two beams, thus using filters and beamforming techniques to isolate audio from more than two directions. Thus, the number of microphones may be more than, less than, or the same as the number of beams. The beamformer unit of the device may have an adaptive beamformer (ABF) unit/fixed beamformer (FBF) unit processing pipeline for each beam, although the disclosure is not limited thereto.
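As a generic sketch of how audio can be isolated toward one such direction (this is not asserted to be the device's actual beamformer pipeline), the per-microphone delays of a simple delay-and-sum beam steered toward an azimuth might be computed as follows.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second

def steering_delays(mic_xy, azimuth):
    """Per-microphone delays (seconds) steering a delay-and-sum beam.

    mic_xy is an (M, 2) array of microphone positions relative to the array
    center and azimuth the look direction in radians; applying these delays
    before summing the microphone signals boosts audio from that direction.
    """
    direction = np.array([np.cos(azimuth), np.sin(azimuth)])
    # Microphones farther along the look direction hear the source earlier,
    # so they are delayed more to align the signals before summing.
    return mic_xy @ direction / SPEED_OF_SOUND
```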
Despite the flexible home theater 200 including multiple different types of devices 110 in an asymmetrical configuration relative to the listening position 210 of the user, the system 100 may generate playback audio optimized for the listening position 210. For example, the system 100 may generate map data indicating the locations of the devices 110, the type of devices 110, and/or other context (e.g., number of loudspeakers, frequency response of the drivers, etc.), and may send the map data to a rendering component. The rendering component may generate individual renderer coefficient values for each of the devices 110, enabling each individual device 110 to generate playback audio that takes into account the location of the device 110 and characteristics of the device 110 (e.g., frequency response, etc.).
To illustrate a first example, the second device 110b may act as a center channel in the flexible home theater 200 despite being slightly off-center below the television. For example, second renderer coefficient values associated with the second device 110b may adjust the playback audio generated by the second device 110b to shift the sound stage to the left from the perspective of the listening position 210 (e.g., centered under the television). To illustrate a second example, the fourth device 110d may act as a right channel and the first device 110a may act as a left channel in the flexible home theater 200, despite being different distances from the listening position 210. For example, fourth renderer coefficient values associated with the fourth device 110d and first renderer coefficient values associated with the first device 110a may adjust the playback audio generated by the fourth device 110d and the first device 110a such that the two channels are balanced from the perspective of the listening position 210.
The first device 110a may generate the first measurement data by generating first audio data capturing one or more audible sounds and performing sound source localization processing to determine direction(s) associated with the audible sound(s) represented in the first audio data. For example, if the second device 110b is generating first playback audio during a first time range, the first device 110a may capture a representation of the first playback audio and perform sound source localization processing to determine that the second device 110b is in a first direction relative to the first device 110a, although the disclosure is not limited thereto. Similarly, the second device 110b may generate the second measurement data by generating second audio data capturing one or more audible sounds and performing sound source localization processing to determine direction(s) associated with the audible sound(s) represented in the second audio data. For example, if the third device 110c is generating second playback audio during a second time range, the second device 110b may capture a representation of the second playback audio and perform sound source localization processing to determine that the third device 110c is in a second direction relative to the second device 110b, although the disclosure is not limited thereto.
In some examples, the measurement data may include information associated with each of the other devices 110 in the flexible home theater. To illustrate an example, the measurement data may include angle information and timing information generated by the devices 110a-110d. For example, the angle information (e.g., angle of arrival value, variance associated with the angle of arrival, and/or the like) may indicate a relative direction from a first device to a second device, while the timing information may enable the system 100 to estimate a propagation delay and/or calculate distance values (e.g., range information), such as a distance from the first device to the second device. However, the disclosure is not limited thereto, and the measurement data may include additional information without departing from the disclosure.
As illustrated in
A device orientation component 320 may receive the input data 302 and the device location data 315 and may determine device orientation data 325 indicating device orientations. For example, the system 100 may use the relative positions of the devices 110 and the angle information included in the measurement data to determine an orientation of each device. To illustrate an example, the system 100 may identify a first angle value represented in the measurement data, which indicates a direction of the third device 110c relative to an orientation of the second device 110b (e.g., relative angle of arrival). The system 100 may then use the relative positions to determine a second angle value that corresponds to the actual direction of the third device 110c relative to the second device 110b in the global coordinate system (e.g., absolute angle of arrival). Using the first angle value and the second angle value, the system 100 may determine the orientation of the second device 110b, which indicates a rotation of the second device 110b relative to the global coordinate system. For example, the combination of the orientation of the second device 110b and the first angle value (e.g., relative angle of arrival) is equal to the second angle value (e.g., absolute angle of arrival). Thus, once the system 100 determines the device orientation data 325, the system 100 may convert each of the relative angle of arrivals included in the measurement data to absolute angle of arrivals that correspond to the actual directions between the devices 110 in the global coordinate system.
The system 100 may perform user localization 300 using the device location data 315, the device orientation data 325, and cross-correlation data 335, as will be described in greater detail below with regard to
As illustrated in
As will be described in greater detail below with regard to
A user and TV localization component 340 may use the device location data 315 and the device orientation data 325 to determine a location and orientation for each of the devices 110 that generated audio data during the calibration sequence (e.g., devices 110a-110d). For example, because generating the audio data is what enables the system 100 to determine distance information, only the devices 110 that generated audio data, and are thus associated with distance information, are included in the device location data 315. Using the location and orientation information for each of the devices 110a-110d, the system 100 may perform device localization to determine a location of any additional devices 110 included in the flexible home theater system. For example, the system 100 may perform device localization to determine a location and/or direction of a fifth device 110e (e.g., television), as will be described in greater detail below. However, the disclosure is not limited thereto, and in some examples the flexible home theater may include more than one device 110 that did not generate audio data during the calibration sequence.
To perform device localization for the fifth device 110e, the system 100 may instruct the fifth device 110e to generate an audible sound and may capture representations of the audible sound using the devices 110a-110d. For example, the fifth device 110e may be included in the calibration sequence, despite not generating audio data, such that the multi-channel audio data 304 generated by the devices 110a-110d may include representations of the audible sound output by the fifth device 110e during a fifth time range. To perform device localization for the fifth device 110e, the cross-correlation generator component 330 may generate a first portion of the cross-correlation data 335 corresponding to the representations of the audible sound and the user and TV localization component 340 may generate a first spatial likelihood function using the first portion of the cross-correlation data, as will be described in greater detail below with regard to
Similarly, the user and TV localization component 340 may also perform user localization to determine a location of the user (e.g., listening position 210). For example, the system 100 may instruct the user to speak from the listening position 210 and the multi-channel audio data 304 may include representations of the speech during a sixth time range. To perform user localization, the cross-correlation generator component 330 may generate a second portion of the cross-correlation data 335 corresponding to the representations of the speech and the user and TV localization component 340 may generate a second spatial likelihood function using the second portion of the cross-correlation data, as will be described in greater detail below with regard to
Using the second spatial likelihood function, the user and TV localization component 340 may determine the location of the user and generate user location data 345 indicating the user location. Using the first spatial likelihood function, the user and TV localization component 340 may determine a direction of the fifth device 110e relative to the location of the user and may generate user orientation data 350 indicating the direction (e.g., user orientation). Thus, even though the user and TV localization component 340 may be unable to estimate a location of the fifth device 110e, the user and TV localization component 340 may still determine the user orientation based on a relative direction of the fifth device 110e.
While
As used herein, the angle information may be represented as a relative value (e.g., relative AoA value), which indicates angle information relative to a device orientation, and/or an absolute value (e.g., absolute AoA value), which indicates angle information using a fixed frame of reference such as a global coordinate system. To illustrate an example of a relative value, the first device 110a may generate relative AoA data (e.g., relative AoA value) indicating that a second device 110b is in a first direction relative to a fixed point associated with the first device 110a. In some examples, the fixed point may correspond to a front of the first device 110a, such that the first direction varies depending on which direction the first device 110a is facing. As used herein, the direction that the first device 110a is facing may be referred to as an orientation of the device 110, which may be represented as a device orientation indicating this direction relative to the global coordinate system. For example, the device orientation may indicate a rotation of the first device 110a relative to the global coordinate system and may vary based on how the first device 110a is positioned.
While the relative AoA data may enable the first device 110a to determine a relative position of the second device 110b, other devices 110 may be unable to determine the relative position of the second device 110b without knowing the device orientation associated with the relative AoA data. If the system 100 knows the device orientation, the system 100 may use the device orientation and the relative AoA data to determine absolute AoA data (e.g., absolute AoA value), which indicates that the second device 110b is in a second direction relative to a location of the first device 110a within the grid. As the absolute AoA data indicates the second direction relative to the global coordinate system, other devices 110 may use the absolute AoA data without regard to a current device orientation of the first device 110a. Conversely, if the system 100 knows the relative AoA data and the absolute AoA data, the system 100 may determine the device orientation associated with the first device 110a. For example, the system 100 may determine the device orientation based on a difference between the absolute AoA value and the relative AoA value without departing from the disclosure.
The disclosure is not limited thereto, however, and the system 100 may generate the relative AoA data 375 using each of the microphone arrays, such that the relative AoA data 375 includes angle information generated for each individual device 110 associated with a microphone array. Thus, the relative AoA data 375 may also include a second plurality of relative AoA values that indicate a direction of each of the devices 110 included in the flexible home theater relative to the second device 110b. For example, the relative AoA data 375 may include a third relative AoA value indicating a third direction of the first device 110a relative to a second device orientation of the second device 110b, a fourth relative AoA value indicating a fourth direction of the third device 110c relative to the second device orientation, and so on. The system 100 may determine the second plurality of relative AoA values using second multi-channel audio data generated by a second microphone array associated with the second device 110b.
The relative pairwise AoA estimation component 370 may output the relative AoA data 375 to the device localization component 310, the device orientation component 320, and the absolute pairwise AoA estimation component 380. As described above, the device orientation component 320 may use the relative AoA data 375 and the device location data 315 to generate the device orientation data 325. For example, the device orientation component 320 may use the device location data 315 to determine a first absolute AoA value indicating an actual direction of the second device 110b relative to the first device 110a in the global coordinate system (e.g., absolute angle of arrival), and may determine the first device orientation based on the first relative AoA value and the first absolute AoA value. For example, the combination of the orientation of the first device 110a and the first relative AoA value (e.g., relative angle of arrival) may be equal to the first absolute angle value (e.g., absolute angle of arrival).
The device orientation component 320 may output the device orientation data 325 to the user and TV localization component 340 and the absolute pairwise AoA estimation component 380. Thus, the absolute pairwise AoA estimation component 380 may receive the relative AoA data 375 and the device orientation data 325 and may generate absolute AoA data 385. For example, the absolute pairwise AoA estimation component 380 may convert each of the relative angle of arrival values included in the relative AoA data 375 to absolute angle of arrival values that correspond to the actual directions between the devices 110 in the global coordinate system.
The absolute pairwise AoA estimation component 380 may output the absolute AoA data 385 to the user and TV localization component 340. As illustrated in
As described above,
The system 100 may determine a perspective with which to generate the device map. For example, the system 100 may determine the listening position 210 of the user and center the device map on the listening position 210, such that locations of the devices 110 within the device map are relative to the listening position 210 (e.g., listening position 210 is at an origin). In some examples, such as when a television is associated with the home theater group, the system 100 may determine a location of the television and generate the device map with the television along a vertical axis. Thus, the device map may represent locations of the devices 110 relative to a look direction from the listening position 210 to the television, although the location of the television may not be included in the device map without departing from the disclosure.
In some examples, the system 100 may prompt the user to speak from the listening position 210, such as by saying a wakeword or particular utterance, and the devices 110 may detect the wakeword or other speech and generate the user localization measurement data indicating a direction of the speech relative to each device. As the system 100 previously determined the device orientation data indicating an orientation for each device 110, in some examples the system 100 may identify the orientation of a selected device and determine the direction to the user based on the user localization measurement data generated by the selected device 110. Thus, the system 100 may perform triangulation using two or more devices 110 in the home theater group to determine a location associated with the speech.
In some examples, the system 100 may instruct the television to generate two audible sounds at a specific time, such as a first audible sound using a left channel and a second audible sound using a right channel of the television. Each of the devices 110 in the flexible home theater group may detect these audible sounds and determine angle information associated with the television. For example, a selected device may generate first angle information associated with the first audible sound (e.g., left channel) and generate second angle information associated with the second audible sound (e.g., right channel). Knowing the device orientation data for the selected device, the system 100 may determine the direction of the television relative to the selected device based on the first angle information, the second angle information, and the device orientation of the selected device. Repeating this process for multiple devices in the flexible home theater group, in some examples the system 100 may estimate the location of the television (e.g., by performing triangulation or the like), although the disclosure is not limited thereto.
In some examples, the system 100 may track the left channel and the right channel separately to determine two different locations, such that the system 100 determines the location of the television by averaging the two locations. For example, the system 100 may use two sets of angle information for each device to determine a first location associated with the left channel and a second location associated with the right channel, then determine the location of the television as being between the first location and the second location. However, the disclosure is not limited thereto, and in other examples the system 100 may separately identify the left channel and the right channel but then combine this information to determine a single location associated with the television without departing from the disclosure. For example, the system 100 may determine a mean value (e.g., average) of the first angle information and the second angle information and use this mean value to determine the direction of the television relative to the selected device without departing from the disclosure.
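One way to realize the angle-averaging variant is sketched below; a circular mean is used (an added assumption) so the combined direction behaves correctly when the two channel estimates straddle the +/-180 degree wrap-around.

```python
import numpy as np

def tv_direction_from_channels(angle_left, angle_right):
    """Combine left- and right-channel angle estimates into one TV direction.

    angle_left and angle_right are the directions (radians) of the two audible
    sounds relative to a selected device, already corrected for its orientation.
    """
    angles = np.array([angle_left, angle_right])
    return np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
```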
In some examples, the system 100 may include the television in the calibration data such that the measurement data already includes the angle information associated with the television. However, the disclosure is not limited thereto, and in other examples the system 100 may perform television localization as a discrete step in which the television generates the audible sounds separately from the other devices in the home theater group without departing from the disclosure.
The device map data may include location(s) associated with each of the devices 110, a location of a listening position 210, a direction of a television relative to the listening position 210 and/or a location of the television, and/or the like without departing from the disclosure. In some examples, the device map data may include additional information, such as device descriptors or other information corresponding to the devices 110 included in the device map.
Determining the listening position 210 and/or the location of the television enables the system 100 to provide context for the device map, such as centering the device map on the listening position 210 and/or orienting the device map based on a look direction from the listening position 210 to the television. This context is beneficial as it enables the system 100 to render output audio properly for the home theater group, with a sound stage of the output audio aligned with the television (e.g., directional sounds generated in the appropriate direction) and volume balanced between the devices (e.g., a volume of the output audio generated by a particular device is determined based on a distance from the device to the listening position).
Thus, the device map may represent the listening position 210 at a first location in the device map (e.g., at an origin, which is the intersection between the horizontal axis and the vertical axis in a two-dimensional plane, although the disclosure is not limited thereto) and represent each of the devices 110 at a corresponding location in the device map, with the device map oriented relative to the location of the television such that the location of the television is along a vertical axis from the listening position 210, although the disclosure is not limited thereto.
In addition, the system 100 may process the device map data, the listening position data, and/or device description data to generate flexible renderer coefficient values. For example, the system 100 may generate first renderer coefficient data (e.g., first renderer coefficient values) for the first device 110a, second renderer coefficient data (e.g., second renderer coefficient values) for the second device 110b, third renderer coefficient data (e.g., third renderer coefficient values) for the third device 110c, and/or fourth renderer coefficient data (e.g., fourth renderer coefficient values) for the fourth device 110d, although the disclosure is not limited thereto. The renderer coefficient values enable the system 100 to render output audio properly for the home theater group, with a sound stage of the output audio aligned with the television.
Using the device location data 315 and the device orientation data 325, the user and TV localization component 340 may perform user localization to determine a location of the user (e.g., listening position 210) and a user orientation (e.g., direction from the listening position 210 to the fifth device 110e). In some examples, the user and TV localization component 340 may determine a fifth location associated with the television (e.g., fifth device 110e), but the disclosure is not limited thereto and the user and TV localization component 340 only needs to determine a direction of the television relative to the listening position 210.
While the device location map 410 only illustrates four devices 110a-110d, the disclosure is not limited thereto and the device location map 410 may include any number of devices without departing from the disclosure. Additionally or alternatively, for ease of illustration the device location map 410 represents the devices 110a-110d in a particular orientation that is based on the listening position 210 being at an origin and the television being oriented along a vertical axis. However, the disclosure is not limited thereto and the device locations represented in the device location map 410 may vary without departing from the disclosure. For example, prior to the user and TV localization component 340 performing user localization, the device location map 410 may represent the devices 110a-110d using relative positions that may be rotated or flipped relative to a final device map.
The fourth device 110d may broadcast (512) the schedule to each of the secondary devices (e.g., 110a-110c and 110e) and may start (514) the calibration sequence. For example, the fourth device 110d may send the calibration data to the first device 110a, to the second device 110b, to the third device 110c, to the fifth device 110e, and/or to any additional secondary devices included in the flexible home theater group. Each of the devices 110a-110e may start the calibration sequence based on the calibration data received from the fourth device 110d. For example, during the first time range the first device 110a may generate the first audible sound while some of the devices 110a-110d generate audio data including representations of the first audible sound. Similarly, during the second time range the second device 110b may generate the second audible sound while the devices 110a-110d generate audio data including representations of the second audible sound. In some examples, some of the devices 110 (e.g., fifth device 110e) may not include a microphone and therefore may not generate audio data during the calibration steps illustrated in
The fourth device 110d may receive (516) calibration measurement data from devices 110a-110c. For example, the devices 110a-110c may process the audio data and generate the calibration measurement data by comparing a delay between when an audible sound was scheduled to be generated and when the audible sound was captured by the respective device 110. To illustrate an example, the first device 110a may perform sound source localization to determine an angle of arrival (AoA) associated with the second device 110b, although the disclosure is not limited thereto. Additionally or alternatively, the first device 110a may determine timing information associated with the second device 110b, which may be used to determine a distance between the first device 110a and the second device 110b, although the disclosure is not limited thereto. While not illustrated in
The fourth device 110d may trigger (518) user localization and may receive (520) user localization measurement data from each of the devices 110a-110c. For example, the fourth device 110d may send instructions to the devices 110a-110c to perform user localization and the instructions may cause the devices 110a-110c to begin the user localization process. During the user localization process, the devices 110a-110d may be configured to capture audio in order to detect a wakeword or other audible sound generated by the user and generate the user localization measurement data corresponding to the user. For example, the system 100 may instruct the user to speak the wakeword from the user's desired listening position 210 and the user localization measurement data may indicate a relative direction and/or distance from each of the devices 110 to the listening position 210. While not illustrated in
After receiving the calibration measurement data and the user localization measurement data, the fourth device 110d may generate (522) device map data representing a device map for the flexible home theater group. For example, the fourth device 110d may process the calibration measurement data in order to generate a final estimate of device locations, interpolating between the calibration measurement data generated by individual devices 110a-110d. Additionally or alternatively, the fourth device 110d may process the user localization measurement data to generate a final estimate of the listening position 210, interpolating between the user localization measurement data generated by individual devices 110a-110d.
If the flexible home theater group does not include a display such as a television, the fourth device 110d may generate the device map based on the listening position 210, but an orientation of the device map may vary. For example, the fourth device 110d may set the listening position 210 as a center point and may generate the device map extending in all directions from the listening position 210. However, if the flexible home theater group includes a television, the fourth device 110d may set the listening position 210 as a center point and may select the orientation of the device map based on a location of the television. For example, the fourth device 110d may determine the location of the television and may generate the device map with the location of the television extending along a vertical axis, although the disclosure is not limited thereto.
To determine the location of the television, in some examples the fourth device 110d may generate calibration data instructing the television to generate a first audible noise using a left channel during a first time range and generate a second audible noise using a right channel during a second time range. Thus, each of the devices 110a-110d may generate calibration measurement data including separate calibration measurements for the left channel and the right channel, such that a first portion of the calibration measurement data corresponds to a first location associated with the left channel and a second portion of the calibration measurement data corresponds to a second location associated with the right channel. This enables the fourth device 110d to determine the location of the television based on the first location and the second location, although the disclosure is not limited thereto.
As the exact timing associated with the user speech is unknown, the calibration sequence 610 illustrates the sixth time range as a listening period instead of as a pulse signal. While the calibration sequence 610 illustrates the fifth device (Television) generating the fifth audible sound during the fifth time range, the disclosure is not limited thereto. As described above, in some examples the fifth device may generate distinct audible sounds using a left channel and a right channel without departing from the disclosure. Thus, the fifth device may generate the fifth audible sound during the fifth time range using a left channel, may generate a sixth audible sound during a sixth time range using a right channel, and user localization may be performed during a seventh time range without departing from the disclosure.
The measurement data generated by some of the devices 110 (e.g., devices 110 that include a microphone) is represented in calibration sound capture 620. For example, the calibration sound capture 620 illustrates that while the first device (DeviceA) captures the first audible sound immediately, the other devices capture the first audible sound after variable delays caused by a relative distance from the first device to the capturing device. To illustrate a first example, the first device (DeviceA) may generate first audio data that includes a first representation of the first audible sound within the first time range and at a first volume level (e.g., amplitude). However, the second device (DeviceB) may generate second audio data that includes a second representation of the first audible sound after a first delay and at a second volume level that is lower than the first volume level. Similarly, the third device (DeviceC) may generate third audio data that includes a third representation of the first audible sound after a second delay and at a third volume level that is lower than the first volume level, and the fourth device (DeviceD) may generate fourth audio data that includes a fourth representation of the first audible sound after a third delay and at a fourth volume level that is lower than the first volume level.
Similarly, the second audio data may include a first representation of the second audible sound within the second time range and at a first volume level. However, the first audio data may include a second representation of the second audible sound after a first delay and at a second volume level that is lower than the first volume level, the third audio data may include a third representation of the second audible sound after a second delay and at a third volume level that is lower than the first volume level, and the fourth audio data may include a fourth representation of the second audible sound after a third delay and at a fourth volume level that is lower than the first volume level.
As illustrated in
The fourth audio data may include a first representation of the fourth audible sound within the fourth time range at a first volume level. However, the first audio data may include a second representation of the fourth audible sound after a first delay and at a second volume level that is lower than the first volume level, the second audio data may include a third representation of the fourth audible sound after a second delay and at a third volume level that is lower than the first volume level, and the third audio data may include a fourth representation of the fourth audible sound after a third delay and at a fourth volume level that is lower than the first volume level. Based on the different delays and/or amplitudes, the system 100 may determine a relative position of each of the devices within the environment.
As illustrated in
In some examples, the system 100 may estimate a time difference of arrival (TDoA) value between two devices by determining a time difference between when each device captures a particular audible sound. For example, the system 100 may determine a first timestamp associated with the audible sound being captured by the first device 110a (e.g., represented in first audio data generated by the first device 110a) and a second timestamp associated with the audible sound being captured by the second device 110b (e.g., represented in second audio data generated by the second device 110b). By determining a difference between the first timestamp and the second timestamp, the system 100 may determine a first TDoA value associated with the first device 110a and the second device 110b.
However, performing any time-based localization, such as TDoA processing, requires that the audio data from multiple devices 110 be synchronized with each other. In some examples, this synchronization can occur when the devices 110 themselves are synchronized to each other. For example, a first clock signal associated with the first device 110a may be synchronized with a second clock signal associated with the second device 110b, the first device 110a may begin generating first audio data at the same time that the second device 110b begins generating second audio data, and/or the like. In other examples, this synchronization can occur based on sounds represented in the audio data itself. For example, the first audio data may be synchronized with the second audio data based on an audible sound represented in both the first audio data and the second audio data. However, synchronizing the first audio data to the second audio data based on a single audible sound removes time differences caused by the different positions of the first device 110a and the second device 110b relative to the sound source generating the audible sound.
While the devices 110 included in the flexible home theater may not be synchronized with each other, the calibration sequence 610 enables the system 100 to synchronize the audio data between multiple devices 110. For example, as both the first device 110a and the second device 110b capture a first audible sound generated by the first device 110a and a second audible sound generated by the second device 110b, the system 100 may determine a reference point by which to synchronize the first audio data to the second audio data without removing the time differences. In some examples, the system 100 may align the first audio data and the second audio data based on a midpoint between the first audible sound and the second audible sound, which enables the system 100 to measure a time-of-flight between the first device 110a and the second device 110b.
To align the signals, the system 100 may determine midpoints between a pair of audible sounds captured by a pair of devices 110 and then align the midpoints. For example, the system 100 may determine a first midpoint in the first audio data between a first representation of the first audible sound and a first representation of the second audible sound (e.g., 0.5(tAA+tBA)), may determine a second midpoint in the second audio data between a second representation of the first audible sound and a second representation of the second audible sound (e.g., 0.5(tAB+tBB)), and may align the first midpoint with the second midpoint. As illustrated in
In the aligned signals 720, the captured audible sounds from each pair of devices form a symmetric trapezoid, with a first propagation delay from the first device 110a to the second device 110b being equal to a second propagation delay from the second device 110b to the first device 110a. The system 100 may determine delays to apply to each audio stream in order to align the audio data and ensure that the midpoints of the upper and lower bases of the trapezoid coincide. For example, for N devices the system 100 may determine N(N−1)/2 discrete pairs (e.g., unique equations), along with N−1 independent variables (e.g., with one device arbitrarily chosen as a zero delay reference to align the audio streams). If ΔA denotes the delay applied to the first audio data and ΔB denotes the delay applied to the second audio data, the equations are of the form:
2(ΔA−ΔB)=tAB+tBB−tBA−tAA [1]
which leads to a system of equations (e.g., for an example including only three devices):
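2(ΔA−ΔB)=tAB+tBB−tBA−tAA
2(ΔA−ΔC)=tAC+tCC−tCA−tAA
2(ΔB−ΔC)=tBC+tCC−tCB−tBB
ΔA+ΔB+ΔC=0 [2]
Here each pairwise equation follows the form of Equation [1], with tXY denoting the time at which the audible sound emitted by device X is captured by device Y.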
In the example illustrated above, the fourth equation ensures that the average delay is equal to zero, although choosing a single reference device or using the average is arbitrary.
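To make the alignment step concrete, the following Python sketch (illustrative only; the function name, variable names, and example capture times are hypothetical) solves the over-determined system above in a least-squares sense, where t[i, j] denotes the time at which the calibration sound emitted by device i appears in the recording of device j:

```python
import numpy as np

def estimate_alignment_delays(t):
    """Estimate per-device alignment delays from calibration capture times.

    t[i, j] is the (hypothetical) time at which the calibration sound emitted
    by device i appears in the recording of device j.  Returns one delay per
    device such that, after alignment, each pair's midpoints coincide
    (Equations [1]-[2]) and the average delay is zero.
    """
    t = np.asarray(t, dtype=float)
    n = t.shape[0]
    rows, rhs = [], []
    # One equation per unordered pair: 2*(d_i - d_j) = t_ij + t_jj - t_ji - t_ii
    for i in range(n):
        for j in range(i + 1, n):
            row = np.zeros(n)
            row[i], row[j] = 2.0, -2.0
            rows.append(row)
            rhs.append(t[i, j] + t[j, j] - t[j, i] - t[i, i])
    # Extra constraint: the delays sum to zero (zero average delay).
    rows.append(np.ones(n))
    rhs.append(0.0)
    delays, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return delays

# Hypothetical capture times (seconds) for three devices A, B, C.
t = np.array([[0.00, 0.40, 0.55],
              [1.42, 1.02, 1.30],
              [2.15, 2.28, 2.00]])
print(estimate_alignment_delays(t))
```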
In some examples, the system 100 may estimate a time difference of arrival (TDoA) value between two devices by detecting when a known source signal (e.g., known excitation signal) is captured by each device and determining a time difference. For example, the system 100 may determine a first timestamp associated with the source signal being captured by the first device 110a (e.g., represented in first audio data generated by the first device 110a) and a second timestamp associated with the source signal being captured by the second device 110b (e.g., represented in second audio data generated by the second device 110b). By determining a difference between the first timestamp and the second timestamp, the system 100 may determine a first TDoA value associated with the first device 110a and the second device 110b. However, this technique is only accurate if the source signal is known ahead of time.
If the source signal is not known ahead of time, the system 100 may determine the TDoA value by taking a cross-correlation between the first audio data associated with the first device 110a and the second audio data associated with the second device 110b. For example, as the received signal (e.g., user speech) is highly correlated between the two devices 110a/110b, the system 100 may treat a second representation of the user speech associated with the second audio data as a delayed version of a first representation of the user speech associated with the first audio data. Thus, the system 100 may generate cross-correlation data using the first audio data and the second audio data and determine the first TDoA value by detecting a peak in the cross-correlation data.
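As a rough illustration of this cross-correlation approach, the following Python sketch (illustrative only; the signals, sample rate, and injected 25-sample delay are hypothetical) estimates a TDoA value by locating the largest peak of the full cross-correlation between two aligned recordings:

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, sample_rate):
    """Estimate the delay of sig_b relative to sig_a (in seconds) from the
    largest peak of the full cross-correlation of the two aligned recordings."""
    xcorr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(xcorr) - (len(sig_a) - 1)  # lag in samples
    return lag / sample_rate

# Hypothetical example: sig_b is sig_a delayed by 25 samples plus noise.
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(4000)
sig_b = np.concatenate([np.zeros(25), sig_a])[:4000] + 0.05 * rng.standard_normal(4000)
print(estimate_tdoa(sig_a, sig_b, sample_rate=16000))  # approximately 25 / 16000 s
```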
Similarly, the system 100 may generate fourth pairwise cross-correlation data by taking a fourth cross-correlation between the second audio data and the third audio data (e.g., “B-C”), fifth pairwise cross-correlation data by taking a fifth cross-correlation between the second audio data and the fourth audio data (e.g., “B-D”), and sixth pairwise cross-correlation data by taking a sixth cross-correlation between the third audio data and the fourth audio data (e.g., “C-D”).
Using the pairwise cross-correlation data 800, in some examples the system 100 may determine estimated TDoA values for each pairwise combination. For example, the system 100 may determine a largest peak represented in the cross-correlation data for each pairwise combination without departing from the disclosure. However, this ignores potentially useful information represented in the pairwise cross-correlation data 800 and the disclosure is not limited thereto.
To benefit from the additional information represented in the pairwise cross-correlation data 800, such as secondary peaks that may represent a direct path time delay, the system 100 may estimate the TDoA value using steered response power (SRP), such as a simple delay-and-sum beamformer. For any candidate position x, the system 100 may calculate the distance to each microphone and apply a compensating delay to each signal. Thus, if the source is actually located at the candidate position x, the delayed versions of the emission will add together coherently, such that the overall power will be greater than for incorrect locations where the sum is not coherent. Specifically, if the system 100 defines the received signal from device i to be si, the steered response power is:
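SRP(x)=Σt|Σi si(t+τi(x))|² [3]
in which τi(x) denotes the propagation delay from the candidate position x to device i, so that each received signal is advanced by its own delay before summing. (Equation [3] is shown here in one standard delay-and-sum form consistent with the description above; the precise expression may vary without departing from the disclosure.)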
While the computed delays can be fractional, si is a discrete signal, so the system 100 may apply interpolation (e.g., nearest-neighbor interpolation) without departing from the disclosure. For every evaluation point, the steered response power calculated using Equation [3] needs to integrate across both time and devices, increasing processing consumption. However, the SRP function may be mapped monotonically to a function of the cross-correlations:
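SRP′(x)=Σi<j(si⋆sj)(τij(x)) [4]
in which τij(x)=τj(x)−τi(x) denotes the pairwise TDoA between devices i and j for the candidate position x. (Equation [4] is shown here in one form consistent with the description, omitting constant per-signal energy terms; the precise expression may vary without departing from the disclosure.)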
The cross-correlations si⋆sj can be pre-computed instead of repeated for each evaluation, which offers a substantial savings in the processing consumption required per grid evaluation. In some examples, the system 100 may replace the raw cross-correlation data with a Generalized Cross-Correlation with Phase Transform (GCC-PHAT) to create a more temporally compact and spectrally flat cross-correlation. For example, the system 100 may apply a Steered Response Power with Phase Transform (SRP-PHAT) without departing from the disclosure, which is particularly beneficial for delay estimation in reverberant environments.
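A minimal sketch of one way to compute a GCC-PHAT cross-correlation is shown below; the function name and arguments are illustrative, and the approach assumes two time-aligned recordings whose cross-spectrum is whitened to unit magnitude before transforming back to the lag domain:

```python
import numpy as np

def gcc_phat(sig_a, sig_b, n_fft=None):
    """Generalized cross-correlation with phase transform (GCC-PHAT).

    Whitens the cross-spectrum to unit magnitude so the resulting correlation
    is temporally compact and spectrally flat.  Returns the correlation (zero
    lag at the center) and the corresponding lags, in samples, of sig_b
    relative to sig_a.
    """
    n = len(sig_a) + len(sig_b) - 1 if n_fft is None else n_fft
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = B * np.conj(A)
    cross /= np.abs(cross) + 1e-12       # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    cc = np.roll(cc, n // 2)             # move zero lag to the center
    lags = np.arange(n) - n // 2
    return cc, lags

# Usage: the lag at the largest peak is the estimated TDoA in samples.
# cc, lags = gcc_phat(first_audio, second_audio); tdoa = lags[np.argmax(cc)]
```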
As used herein, a spatial likelihood function represents an environment (e.g., search space) using a grid that comprises a plurality of grid cells or grid points (e.g., plurality of segments) having a uniform size. Thus, the system 100 may divide the search space into the plurality of grid cells and determine a spatial likelihood value for each grid cell. For example, the system 100 may determine a first spatial likelihood value associated with a first grid cell, and the first spatial likelihood value may indicate a first likelihood that the first grid cell corresponds to the sound source (e.g., the sound source is located within the first grid cell). Similarly, the system 100 may determine a second spatial likelihood value associated with a second grid cell, and the second spatial likelihood value may indicate a second likelihood that the second grid cell corresponds to the sound source (e.g., the sound source is located within the second grid cell). Thus, the spatial likelihood function indicates the relative likelihood that the sound source is located within each grid cell, enabling the system 100 to determine a maximum likelihood value and associate the maximum likelihood value with the location of the sound source.
In the examples illustrated in
As shown in the combined SLF data 920, a maximum value of the SRP corresponds to an intersection of several hyperbolic shapes having highest likelihood values represented in the combined SLF data 920. As illustrated in
In some examples, the system 100 may evaluate SRP′(x) using a grid search, although the disclosure is not limited thereto and the system 100 may use more complicated schemes (e.g., particle filters) without departing from the disclosure. While the system 100 may generate the spatial likelihood functions using Equation [4], the system 100 may need to make assumptions and/or estimate initial values in order to evaluate SRP′(x) properly. For example, the system 100 may assume that the user is inside a bounding box formed by the devices 110a-110d, or inside that bounding box plus some buffer region(s) outside of it, without departing from the disclosure. However, the disclosure is not limited thereto, and in other examples the system 100 may incorporate Angle of Arrival (AoA) data to generate an initial guess of the user location without departing from the disclosure.
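For illustration only, the following Python sketch evaluates the cross-correlation form of the steered response power (Equation [4] style) over a grid of candidate positions; the function name, the assumed dictionary of pre-computed pairwise cross-correlations, and the nearest-neighbor rounding of fractional delays are choices made for this sketch rather than requirements of the disclosure:

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def srp_grid_search(cross_corrs, device_positions, grid_points, sample_rate):
    """Evaluate SRP'(x) (Equation [4] style) over candidate positions.

    cross_corrs[(i, j)] is the pre-computed cross-correlation of device j's
    signal against device i's signal with zero lag at the center;
    device_positions is an (N, 2) array and grid_points an (M, 2) array.
    Returns the best candidate position and the per-candidate scores.
    """
    device_positions = np.asarray(device_positions, dtype=float)
    grid_points = np.asarray(grid_points, dtype=float)
    # Propagation delay from every candidate position to every device.
    dists = np.linalg.norm(grid_points[:, None, :] - device_positions[None, :, :], axis=2)
    delays = dists / SPEED_OF_SOUND
    scores = np.zeros(len(grid_points))
    for i, j in itertools.combinations(range(len(device_positions)), 2):
        cc = cross_corrs[(i, j)]
        center = len(cc) // 2
        tdoa = delays[:, j] - delays[:, i]                        # expected pairwise TDoA
        idx = np.round(tdoa * sample_rate).astype(int) + center   # nearest-neighbor lag
        idx = np.clip(idx, 0, len(cc) - 1)
        scores += cc[idx]
    return grid_points[np.argmax(scores)], scores
```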
In order to accurately measure the time differences associated with the TDoA information, the devices 110a-110e must be time-aligned or synchronized. For example, the devices 110a-110e may leverage the calibration sequence described above with regard to
The combined SLF data 920 illustrated in
As it is common for cross-correlation peaks to be narrower than the range of TDoA values spanned by a single grid cell, a low resolution grid creates the possibility that some cross-correlation peaks may be skipped entirely.
Instead of evaluating the likelihood at a single point, the system 100 may integrate over the likelihood of the entire grid cell surrounding that point. Given the function that maps grid cells to pairwise TDoA values, the system 100 may compute partial derivatives with respect to the likelihood evaluation point. This determines the range of the cross-correlation function that maps to the grid cell, enabling the system 100 to combine all samples within the range without missing any peaks.
The system 100 may compute the TDoA range within each grid cell by finding a distance along the gradient until it hits the boundary of the grid cell. For example, the system 100 may use a maximum component of the gradient by itself to determine how quickly the gradient will hit the boundary. In some examples, the system 100 may determine the TDoA τij between devices i and j, as a function of x, a two-dimensional (2D) or three-dimensional (3D) position. The TDoA is a scalar-valued function, and the system 100 may calculate its gradient with respect to x. For example, the system 100 may compute a vector Δx, which is the vector in the same direction as the gradient and which stops when it reaches the boundary of the grid cell.
Let α=∥∇τij∥∞ (where the L∞-norm is the magnitude of the largest component), and let δ denote the width of a grid cell. The component of Δx corresponding to the largest gradient component reaches the boundary of the grid cell first, so that component of Δx will have length δ/2, and their ratio is (δ/2)/α. Because Δx is aligned with ∇τij, the ratio of all of their components is the same. This enables the system 100 to compute the whole Δx vector:
Δx=(δ/(2α))∇τij [5]
To compute the range of TDoA spanned by the grid cell, the system 100 may take the inner product of the gradient and Δx, giving a final expression of:
∇τij·Δx=δ∥∇τij∥²/(2∥∇τij∥∞) [6]
As the gradient magnitude goes to zero, the numerator will shrink faster than the denominator, so for very small gradients the system 100 may replace the overall value with zero to avoid dividing by zero.
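The following Python sketch (illustrative only, and premised on the reconstructed Equations [5]-[6] above) computes the TDoA window associated with a single grid cell for one device pair, using a closed-form gradient of the pairwise TDoA:

```python
import numpy as np

def tdoa_window(candidate, pos_i, pos_j, grid_spacing, speed_of_sound=343.0):
    """Range of TDoA values spanned by the grid cell around `candidate` for the
    device pair (i, j), following the form of Equation [6]:
    delta * ||grad||^2 / (2 * ||grad||_inf).  A vanishing gradient maps to a
    zero-width window to avoid dividing by zero."""
    candidate = np.asarray(candidate, dtype=float)
    pos_i = np.asarray(pos_i, dtype=float)
    pos_j = np.asarray(pos_j, dtype=float)
    d_i = np.linalg.norm(candidate - pos_i)
    d_j = np.linalg.norm(candidate - pos_j)
    # Gradient of tau_ij(x) = (||x - p_j|| - ||x - p_i||) / c with respect to x.
    grad = ((candidate - pos_j) / d_j - (candidate - pos_i) / d_i) / speed_of_sound
    largest = np.max(np.abs(grad))
    if largest < 1e-12:
        return 0.0
    return grid_spacing * float(np.dot(grad, grad)) / (2.0 * largest)
```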
The system 100 may combine the cross-correlation window for each evaluation using different window integration functions (e.g., combination function) without departing from the disclosure. For example, the spatial likelihood function examples 1100 illustrated in
As described above, there is a fundamental tradeoff between the resolution of the grid and the number of evaluations and corresponding processing complexity. For example, a high resolution grid may identify the sound source location with greater accuracy, but at the cost of higher processing consumption. In contrast, a low resolution grid may reduce the processing consumption, but at the cost of decreased accuracy. To balance accuracy and complexity, the system 100 may use a multiresolution approach, as sketched below.
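As one illustrative, non-limiting Python sketch of such a coarse-to-fine search, the helper below re-centers and shrinks the search window around the best candidate at each level; the zoom factor, level count, and grid density are arbitrary choices for this sketch:

```python
import numpy as np

def multiresolution_search(score_fn, bounds, levels=3, points_per_axis=8, zoom=0.25):
    """Coarse-to-fine grid search over a 2D search space.

    score_fn(x) returns a spatial likelihood value for a candidate position x;
    bounds is ((x_min, x_max), (y_min, y_max)).  Each level re-centers a finer
    grid on the best cell found at the previous level.
    """
    (x_min, x_max), (y_min, y_max) = bounds
    best = None
    for _ in range(levels):
        xs = np.linspace(x_min, x_max, points_per_axis)
        ys = np.linspace(y_min, y_max, points_per_axis)
        grid = np.array([[x, y] for x in xs for y in ys])
        scores = np.array([score_fn(p) for p in grid])
        best = grid[np.argmax(scores)]
        # Shrink the search window around the current best estimate.
        half_x = (x_max - x_min) * zoom
        half_y = (y_max - y_min) * zoom
        x_min, x_max = best[0] - half_x, best[0] + half_x
        y_min, y_max = best[1] - half_y, best[1] + half_y
    return best
```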
To conceptually illustrate a simple example,
For each iteration of the multiresolution grid search example 1200, the system 100 selects the next search space based on the highest likelihood value in the previous spatial likelihood function. For example, the system 100 may identify a first area of the first SLF 1210 that includes the highest of the first likelihood values and may select the second search space to include the first area. Similarly, the system 100 may identify a second area of the second SLF 1220 that includes the highest of the second likelihood values and may select the third search space to include the second area. Thus,
In some examples, the system 100 may assume that the device locations represented by the device location data 315 are accurate and that a precise location of each device 110 is known to the system 100. However, small errors in the estimated device locations may translate into errors represented in the resulting localization. In certain situations, the relationship can be highly nonlinear, such that small errors in sensor location cause large localization errors. However, this is largely due to situations where the devices 110 are in close proximity (e.g., very close to each other), and this is less of an issue when the device spacing is large relative to the device localization error.
While the examples described above refer to using cross-correlation data to generate spatial likelihood functions that correspond to Time Difference of Arrival (TDoA) values, the disclosure is not limited thereto. In some examples, the system 100 may generate spatial likelihood functions using Angle of Arrival (AoA) data without departing from the disclosure. Additionally or alternatively, the system 100 may generate a spatial likelihood function using a combination of the TDoA spatial likelihood functions and the AoA spatial likelihood functions without departing from the disclosure.
As described above, the system 100 may determine relative AoA data, which indicates angle information relative to a device orientation, and/or absolute AoA data, which indicates angle information using a fixed frame of reference such as a global coordinate system. To illustrate an example of relative AoA data, a first device 110a may generate relative AoA data (e.g., relative AoA value) indicating that a second device 110b is in a first direction relative to a device orientation of the first device 110a. While the relative AoA data may enable the first device 110a to determine a relative position of the second device 110b, the relative AoA data varies based on the device orientation, which may not be known by other devices 110. Once the system 100 determines the device orientation of the first device 110a, the system 100 may use the device orientation and the relative AoA data to generate absolute AoA data, which indicates that the second device 110b is in a second direction relative to a location of the first device 110a within the grid. Thus, the system 100 may generate absolute AoA data that indicates angle values between each pair of devices using the global coordinate system.
To calculate the spatial likelihood function, the system 100 may determine an estimated AoA value for each candidate position (e.g., grid cell, segment, etc.) and compare the estimated AoA value to the measured AoA value. To conceptually illustrate some examples, the AoA SLF 1510 includes two candidate positions along with their corresponding estimated AoA values. For example, a first candidate position (e.g., [3, 8]) may be associated with a first estimated AoA value (e.g., xθ1), while a second candidate position (e.g., [5, 8]) is associated with a second estimated AoA value (e.g., xθ2). To determine a first spatial likelihood value associated with the first candidate position, the system 100 may determine a first difference between the first estimated AoA value (e.g., xθ1) and the measured AoA value (e.g., μθ). Similarly, the system 100 may determine a second spatial likelihood value associated with the second candidate position by determining a second difference between the second estimated AoA value (e.g., xθ2) and the measured AoA value (e.g., μθ). Thus, the system 100 may determine individual spatial likelihood values for each candidate position represented in the AoA SLF 1510.
While
While the example described above refers to the system 100 generating a plurality of individual SLFs and then using these SLFs to generate the combined AoA spatial likelihood function, the disclosure is not limited thereto. In some examples, the system 100 may calculate the combined AoA spatial likelihood function directly without departing from the disclosure. For example, the system 100 may determine the combined AoA spatial likelihood function using:
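SLFAoA(x)=Σn=1…N (1/σθ(n))·exp(−wrap(f(x,n)−μθ(n))²/(2σθ²(n))) [7]
(Equation [7] is shown here in one example form consistent with the definitions that follow; the precise expression may vary without departing from the disclosure.)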
In Equation [7], SLFAoA denotes the spatial likelihood function, x is a candidate position in the grid space, f(x, n) computes the AoA value from the point x to device n (e.g., xθ(n)), μθ(n) denotes the measured AoA value to device n, σθ²(n) denotes the AoA variance (e.g., estimated from the 95th percentile), wrap( ) denotes a wrap function that maintains the angle values between −π and +π, σθ(n) denotes the standard deviation, and N is the total number of devices. As illustrated in Equation [7], the standard deviation σθ(n) is used to inversely scale the values to provide a normalization constant that keeps the total area under the Gaussian constant. By inversely scaling using the standard deviation σθ(n), Equation [7] reduces a weighting associated with lower confidence (e.g., higher variance) information.
Evaluating the SLFAoA for a grid of candidate points provides a proxy likelihood function with which the system 100 may determine a location of the sound source. For example, the system 100 may select a candidate point using a maximum likelihood, such as:
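x̂=argmaxx SLFAoA(x)
where x̂ denotes the selected candidate position. (This argmax selection is one example consistent with the description above.)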
However, the disclosure is not limited thereto, and the system 100 may determine the location of the sound source using other techniques without departing from the disclosure.
As illustrated in
In some examples, the system 100 may repeat these steps to determine a location and/or relative direction of the television. As illustrated in
While
Using the user location and the user orientation, the system 100 may generate (148) map data and may send (150) the map data to a rendering component, as described in greater detail above with regard to
After generating the individual spatial likelihood function associated with the first device, the system 100 may determine (1652) whether there is an additional device that captured the audible sound and may loop to step 1650 to select the additional device. For example, the system 100 may repeat steps 1520-1532 to generate an individual spatial likelihood function for the additional device.
Once the system 100 determines that there are no additional devices, the system 100 may determine (1654) weight values for the measured AoA values. For example, the system 100 may determine a first weight value corresponding to a first measured AoA value generated by the first device 110a, a second weight value corresponding to a second measured AoA value generated by the second device 110b, and so on for each measured AoA value.
To illustrate an example, as part of generating the first measured AoA value, the system 100 may also determine a first variance associated with the first measured AoA value. The term variance refers to a statistical measurement of the spread between numbers in a data set, with a large variance indicating that the numbers are far from the mean and from each other. Thus, the system 100 may use the first variance as a proxy for a confidence score that indicates a likelihood that the first measured AoA value is accurate. For example, a large variance may correspond to a low confidence score (e.g., low likelihood that the first measured AoA value is accurate), while a small variance may correspond to a high confidence score (e.g., high likelihood that the first measured AoA value is accurate). Based on the first variance, the system 100 may determine the first weight value, which is associated with the first measured AoA value and a corresponding first individual AoA SLF.
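One simple, illustrative way to derive such weight values is inverse-variance weighting, sketched below in Python; the disclosure does not mandate this particular mapping, and the example variances are hypothetical:

```python
import numpy as np

def aoa_weights(variances, eps=1e-6):
    """Map per-device AoA variances to normalized weights: a large variance
    (low confidence) yields a small weight."""
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / (variances + eps)
    return w / w.sum()

# Hypothetical variances: the third device is much less confident.
print(aoa_weights([0.02, 0.02, 0.30]))
```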
As illustrated in
In some examples, the system 100 may repeat these steps to determine a location and/or relative direction of the television. As illustrated in
While
Using the user location and the user orientation, the system 100 may generate (148) map data and may send (150) the map data to a rendering component, as described in greater detail above with regard to
The system 100 may then determine (1762) whether there is an additional candidate position, and if so, may loop to step 1754 and repeat steps 1754-1760 for the additional candidate position. Once the system 100 determines that there are no additional candidate positions, the system 100 may determine (1764) a pairwise TDoA spatial likelihood function using the spatial likelihood values determined in step 1760.
After generating the pairwise TDoA spatial likelihood function associated with the first pair of devices, the system 100 may determine (1766) whether there is an additional pair of devices that captured the audible sound and may loop to step 1750 to select the additional pair. For example, the system 100 may repeat steps 1750-1764 to generate a pairwise TDoA spatial likelihood function for the additional pair of devices. Once the system 100 determines that there are no additional pairs, the system 100 may determine (1768) a combined TDoA spatial likelihood function.
In some examples, the system 100 may generate a final spatial likelihood function using a combination of the AoA processing and the TDoA processing described above. For example, the system 100 may perform the AoA processing and the TDoA processing in parallel, resulting in the final spatial likelihood function being more accurate than a spatial likelihood function generated using either AoA processing or TDoA processing alone.
As illustrated in
In addition, the system 100 may receive (1816) cross-correlation data associated with the first audible sound, may determine (1818) pairwise TDoA spatial likelihood functions for each pair of devices, and may determine (1820) a combined TDoA spatial likelihood function using the pairwise TDoA SLFs. For example, the system 100 may determine the combined TDoA spatial likelihood function as described in greater detail above with regard to
After performing both AoA processing and TDoA processing, the system 100 may determine (142) a final spatial likelihood function using the combined AoA spatial likelihood function and the combined TDoA spatial likelihood function. The system 100 may combine the two estimated spatial likelihood functions using a variety of techniques without departing from the disclosure. For example, the system 100 may combine the two estimated spatial likelihood functions using a log-likelihood operation, as shown below:
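SLFfinal(x)=log(SLFAoA(x))+log(SLFTDoA(x)) [8]
(Equation [8] is shown here in one example log-likelihood form consistent with the description; the precise combination may vary without departing from the disclosure.)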
In Equation [8], SLFfinal(x) denotes the final spatial likelihood function, SLFAoA(x) denotes the combined AoA spatial likelihood function, and SLFTDoA(x) denotes the combined TDoA spatial likelihood function.
The system 100 may determine (144) the user location based on the final spatial likelihood function. For example, the system 100 may determine the user location based on a maximum spatial likelihood value represented in the final spatial likelihood function:
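x̂=argmaxx SLFfinal(x)
where x̂ denotes the estimated user location. (This maximum selection is one example form consistent with the description above.)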
In some examples, the system 100 may determine the user location without explicitly determining the final spatial likelihood function. For example, the system 100 may determine the maximum spatial likelihood value directly using the combined AoA spatial likelihood function and the combined TDoA spatial likelihood function, as shown below:
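x̂=argmaxx[log(SLFAoA(x))+log(SLFTDoA(x))]
(This direct form is one example consistent with the description above and with Equation [8].)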
After determining the user location, the system 100 may determine (146) a user orientation based on a direction of the television relative to the user location. Thus, even if the system 100 is unable to determine a precise location of the television, the system 100 may still determine the user orientation, as the direction of the television relative to the user location remains constant. In some examples, the system 100 may determine the user orientation by repeating steps 1812-1820 to generate a second spatial likelihood function associated with the television, and using the second spatial likelihood function to determine a location of the television and/or a direction of the television relative to the user. However, the disclosure is not limited thereto and the system 100 may determine the user orientation using other techniques without departing from the disclosure. For example, the system 100 may determine the user orientation without generating another spatial likelihood function without departing from the disclosure.
Using the user location and the user orientation, the system 100 may generate (148) map data and may send (150) the map data to a rendering component, as described in greater detail above with regard to
While not illustrated in
While not illustrated in
After determining the confidence score value, the system 100 may determine (1920) whether the confidence score value exceeds a threshold value. If the confidence score value does not exceed the threshold value, the system 100 may loop to step 1910 and repeat steps 1910-1918 to generate a new spatial likelihood value. If the confidence score value exceeds the threshold value, the system 100 may perform (1922) additional steps using the spatial likelihood function, such as determining the user location and/or user orientation, as described in greater detail above. While
Multiple systems (120/125) may be included in the system 100 of the present disclosure, such as one or more remote systems 120 for performing ASR processing, one or more remote systems 120 for performing NLU processing, one or more skill components 125, etc. In operation, each of these systems may include computer-readable and computer-executable instructions that reside on the respective device (120/125), as will be discussed further below.
Each of these devices (110/120/125) may include one or more controllers/processors (2104/2204), which may each include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory (2106/2206) for storing data and instructions of the respective device. The memories (2106/2206) may individually include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive memory (MRAM), and/or other types of memory. Each device (110/120/125) may also include a data storage component (2108/2208) for storing data and controller/processor-executable instructions. Each data storage component (2108/2208) may individually include one or more non-volatile storage types such as magnetic storage, optical storage, solid-state storage, etc. Each device (110/120/125) may also be connected to removable or external non-volatile memory and/or storage (such as a removable memory card, memory key drive, networked storage, etc.) through respective input/output device interfaces (2102/2202).
Computer instructions for operating each device (110/120/125) and its various components may be executed by the respective device's controller(s)/processor(s) (2104/2204), using the memory (2106/2206) as temporary “working” storage at runtime. A device's computer instructions may be stored in a non-transitory manner in non-volatile memory (2106/2206), storage (2108/2208), or an external device(s). Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.
Each device (110/120/125) includes input/output device interfaces (2102/2202). A variety of components may be connected through the input/output device interfaces (2102/2202), as will be discussed further below. Additionally, each device (110/120/125) may include an address/data bus (2124/2224) for conveying data among components of the respective device. Each component within a device (110/120/125) may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus (2124/2224).
Referring to
Via antenna(s) 2114, the input/output device interfaces 2102 may connect to one or more networks 199 via a wireless local area network (WLAN) (such as Wi-Fi) radio, Bluetooth, and/or wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, 4G network, 5G network, etc. A wired connection such as Ethernet may also be supported. Through the network(s) 199, the system may be distributed across a networked environment. The I/O device interface (2102/2202) may also include communication components that allow data to be exchanged between devices such as different physical servers in a collection of servers or other components.
The components of the device 110, the remote system 120, and/or a skill component 125 may include their own dedicated processors, memory, and/or storage. Alternatively, one or more of the components of the device 110, the remote system 120, and/or a skill component 125 may utilize the I/O interfaces (2102/2202), processor(s) (2104/2204), memory (2106/2206), and/or storage (2108/2208) of the device(s) 110, system 120, or the skill component 125, respectively. Thus, the ASR component may have its own I/O interface(s), processor(s), memory, and/or storage; the NLU component 260 may have its own I/O interface(s), processor(s), memory, and/or storage; and so forth for the various components discussed herein.
As noted above, multiple devices may be employed in a single system. In such a multi-device system, each of the devices may include different components for performing different aspects of the system's processing. The multiple devices may include overlapping components. The components of the device 110, the remote system 120, and a skill component 125, as described herein, are illustrative, and may be located as a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.
As illustrated in
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, speech processing systems, and distributed computing environments.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and speech processing should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art, that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk, and/or other media. In addition, components of the system may be implemented in firmware or hardware, such as an acoustic front end (AFE), which comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor (DSP)).
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.
As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.