This disclosure generally relates to acoustic devices that include microphone arrays for capturing acoustic signals.
An array of microphones can be used for capturing acoustic signals along a particular direction.
In general, in one aspect, this document features a method of reproducing audio related to a teleconference between a second location and a remote first location. The method includes receiving data representing audio captured by a microphone array disposed at the remote first location, wherein the data includes directional information representing the direction of a sound source relative to the remote microphone array. The method also includes obtaining, based on the directional information, information representative of one or more head-related transfer functions (HRTFs) corresponding to the direction of the sound source relative to the remote microphone array, and generating, using one or more processing devices, an output signal for an acoustic transducer located at the second location. The output signal is generated by processing the received data using the information representative of the one or more HRTFs, and is configured to cause the acoustic transducer to generate an audible acoustic signal.
In another aspect, this document features a system that includes an audio reproduction engine having one or more processing devices. The audio reproduction engine is configured to receive data representing audio captured by a microphone array disposed at a remote location, wherein the data includes directional information representing the direction of a sound source relative to the remote microphone array. The audio reproduction engine is also configured to obtain, based on the directional information, information representative of one or more head-related transfer functions (HRTFs) corresponding to the direction of the sound source relative to the remote microphone array, and generate an output signal for an acoustic transducer by processing the received data using the information representative of the one or more HRTFs. The output signal is configured to cause the acoustic transducer to generate an audible acoustic signal.
In another aspect, this document features one or more machine-readable storage devices having encoded thereon computer readable instructions for causing one or more processing devices to perform various operations. The operations include receiving data representing audio captured by a microphone array disposed at a remote first location, wherein the data includes directional information representing the direction of a sound source relative to the remote microphone array. The operations also include obtaining, based on the directional information, information representative of one or more head-related transfer functions (HRTFs) corresponding to the direction of the sound source relative to the remote microphone array, and generating, using one or more processing devices, an output signal for an acoustic transducer located at a second location. The output signal is generated by processing the received data using the information representative of the one or more HRTFs, and is configured to cause the acoustic transducer to generate an audible acoustic signal.
Implementations of the above aspects may include one or more of the following features. The directional information can include one or more of an azimuth angle, an elevation angle, and a distance of the sound source from the remote microphone array. Individual microphones of the microphone array can be disposed on a substantially cylindrical or spherical surface. The information representative of the one or more HRTFs can be obtained by accessing a database of pre-computed HRTFs stored on a non-transitory computer-readable storage device. Obtaining the information representative of the one or more HRTFs can include determining, based on the directional information, that a corresponding HRTF is unavailable in the database of pre-computed HRTFs, and computing the corresponding HRTF based on interpolating one or more HRTFs available in the database of pre-computed HRTFs. One or more directional beam-patterns can be employed to capture the audio by the microphone array. When multiple directional beam patterns are used to capture the audio, generating the output signal for the acoustic transducer can include multiplying the multiple directional beam patterns with corresponding weights to generate weighted beam-patterns, and generating the output signal by processing the weighted beam-patterns using the information representative of the one or more HRTFs. The output signal for the acoustic transducer can represent a convolution of at least a portion of the received information with corresponding impulse responses of the one or more HRTFs. The acoustic transducer can be disposed in one of: an in-ear earphone, an over-the-ear earphone, or an around-the-ear earphone. Obtaining information representative of the one or more HRTFs can include receiving information representing an orientation of the head of a user, and selecting the one or more HRTFs based on the information representing the orientation of the head of the user.
Various implementations described herein may provide one or more of the following advantages. By processing received audio data based on directional information included within it, the generated audio can be configured such that a user perceives it as coming from a particular direction. When used in teleconference or video conference applications, this may improve user experience by providing a realistic impression of sound coming from a source at a virtual location that mimics the location of the original sound source with respect to the audio capture device. In addition, directional sensitivity patterns (or beams) generated via beamforming processes may be weighted to emphasize and/or deemphasize sounds from particular directions. This in turn may allow for improving focus on one or more speakers during a teleconference. The orientation of the head of a user at the destination location may be determined, for example using head-tracking, and the received information can be processed adaptively to move the location of a virtual sound source in accordance with the head movements.
Two or more of the features described in this disclosure, including those described in this summary section, may be combined to form implementations not specifically described herein.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
This document describes technology for processing audio data transmitted from an origin location to a destination location. The audio data at the origin location can be captured using a microphone array or other directional audio capture equipment, and can therefore include directional information representing a relative location of a sound source with respect to the audio capture equipment. The audio data received at the destination location can be processed based on the directional information in a way such that a user exposed to the resultant acoustic signals perceives the signals to be coming from a virtual location that mimics the relative location of the original sound source with respect to the audio capture equipment at the origin location. In some cases, this can result in a superior teleconference experience that allows a participant to identify the direction of a sound source based on binaurally played audio. For example, if a participant at the destination location knows the relative locations of multiple users participating in the teleconference at the origin location, the participant may readily distinguish between the users based on the virtual direction from which the binaurally played audio appears to be coming. This in turn may reduce the need for speakers to identify themselves during the teleconference and result in an improved and more natural teleconference experience.
In some cases, when multiple participants are taking part in a teleconference, it may be challenging to discern who is speaking at a given time. For example, in the example of
In some implementations, the technology described herein can be used to address the above-described ambiguity by processing the audio signals at the destination location prior to reproduction such that the audio appears to come from the direction of the speaker relative to the audio capture device used at the remote location. For example, if the device 125 is used as an audio capture device at the first location 105, and the participant 110d is speaking, the corresponding audio that is reproduced at the second location 115 for a listener (e.g., participant 120c) can be processed such that the reproduced audio appears to come from a direction that mimics the direction of the speaker with respect to the audio capture device at the first location 105. In this particular example where participant 110d is speaking at the first location 105, the processed audio reproduction for participant 120c at the second location 115 can cause the participant 120c to perceive the audio as coming from the direction 160d, which mimics or represents the direction 155d of the participant 110d relative to the audio capture device 125. Therefore, when participant 110a, 110b, 110c, or 110d speaks at the first location 105, the audio is reproduced for the participant 120c as coming from the direction 160a, 160b, 160c, or 160d, respectively. Because the directions 160a-160d mimic the directions 155a-155d, respectively, the participant 120c may be able to then readily discern from the reproduced audio which of the participants 110a-110d is speaking at a given instant. In some cases, this may reduce ambiguity associated with remote speakers, and in turn improve the teleconference experience by increasing naturalness of conversations taking place over a teleconference.
In some implementations, the audio capture device 205 can include other directional audio capture devices. For example, the audio capture device 205 can include multiple directional microphones such as shotgun microphones. In some implementations, the audio capture device 205 can include a device that includes multiple microphones separated by passive directional acoustic elements disposed between the microphones. In some implementations, the passive directional acoustic elements include a pipe or tubular structure having an elongated opening along at least a portion of the length of the pipe, and an acoustically resistive material covering at least a portion of the elongated opening. The acoustically resistive material can include, for example, wire mesh, sintered plastic, or fabric, such that acoustic signals enter the pipe through the acoustically resistive material and propagate along the pipe to one or more microphones. The wire mesh, sintered plastic or fabric includes multiple small openings or holes, through which acoustic signals enter the pipe. The passive directional acoustic elements each therefore act as an array of closely spaced sensors or microphones. Various types and forms of passive directional acoustic elements may be used in the audio capture device 205. Examples of such passive directional acoustic elements are illustrated and described in U.S. Pat. No. 8,351,630, U.S. Pat. No. 8,358,798, and U.S. Pat. No. 8,447,055, the contents of which are incorporated herein by reference. Examples of microphone arrays with passive directional acoustic elements are described in co-pending U.S. application Ser. No. 15/406,045, titled “Capturing Wide-Band Audio Using Microphone Arrays and Passive Directional Acoustic Elements,” the entire content of which is also incorporated herein by reference.
Data generated from the signals captured by the audio capture device 205 may be processed to generate a sensitivity pattern that emphasizes the signals along a “beam” in a particular direction and suppresses signals from one or more other directions. Examples of such beams or sensitivity patterns 207a-207c (207, in general) are depicted in
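By way of a non-limiting illustration, the following fragment sketches how such a beam may be formed using a basic delay-and-sum approach; the Python code, the array geometry, and names such as delay_and_sum are assumptions made solely for illustration and do not represent a required implementation of the audio capture device 205.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second (approximate, at room temperature)

def delay_and_sum(mic_signals, mic_positions, azimuth_rad, fs):
    """Steer a simple delay-and-sum beam toward the azimuth `azimuth_rad`.

    mic_signals:   (num_mics, num_samples) array of time-domain signals.
    mic_positions: (num_mics, 2) array of microphone x/y positions in meters.
    fs:            sampling rate in Hz.
    Returns a single-channel signal emphasizing the steered direction and
    attenuating (through incoherent averaging) signals from other directions.
    """
    look = np.array([np.cos(azimuth_rad), np.sin(azimuth_rad)])  # look direction
    # Per-microphone alignment delays in samples (the sign convention depends on
    # the chosen geometry; treated here as illustrative).
    delays = (mic_positions @ look) / SPEED_OF_SOUND * fs
    num_mics, n = mic_signals.shape
    freqs = np.fft.rfftfreq(n)  # cycles per sample
    out = np.zeros(n)
    for m in range(num_mics):
        spectrum = np.fft.rfft(mic_signals[m])
        spectrum *= np.exp(2j * np.pi * freqs * delays[m])  # fractional-sample shift
        out += np.fft.irfft(spectrum, n)
    return out / num_mics
```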
The audio processing engine 215 can be located at various locations. In some implementations, the audio processing engine 215 may be disposed in a device located at the first location 105. In some such cases, the audio processing engine 215 may be disposed as a part of the audio capture device 205. In some implementations, the audio processing engine 215 may be located on a device at a location that is remote with respect to the location 105. For example, the audio processing engine 215 can be located on a remote server, or on a distributed computing system such as a cloud-based system.
In some implementations, the audio processing engine 215 can be configured to process the data generated from the signals captured by the audio capture device 205 and generate audio data that includes directional information representing the direction of a corresponding sound source relative to the audio capture device 205. In some implementations, the audio processing engine 215 can be configured to generate the audio data in substantially real-time (e.g., within a few milliseconds) such that the audio data is usable for real-time or near-real-time applications such as a teleconference. The allowable or acceptable time delay for the real-time processing in a particular application may be governed, for example, by an amount of lag or processing delay that may be tolerated without significantly degrading a corresponding user experience associated with the particular application. The audio data generated by the audio processing engine 215 can then be transmitted, for example, over the network 150 to a destination location (e.g., the second location 115) of the teleconference environment. In some implementations, the audio data may be stored or recorded at a storage location (e.g., on a non-transitory computer-readable storage device) for future reproduction.
The audio data received at the second location 115 can be processed by a reproduction engine 220 for eventual rendering using one or more acoustic transducers. The reproduction engine 220 can include one or more processing devices that can be configured to process the received data in a way such that acoustic signals generated by the one or more acoustic transducers based on the processed data appear to come from a particular direction. In some implementations, the reproduction engine 220 can be configured to obtain, based on directional information included in the received data, one or more transfer functions that can be used for processing the received data to generate an output signal, which, upon being rendered by one or more acoustic transducers, causes a user to perceive the rendered sound as coming from a particular direction. The one or more transfer functions that may be used for the purpose are referred to as head-related transfer functions (HRTFs), which, in some implementations, may be obtained from a database of pre-computed HRTFs stored at a storage location 225 (e.g., a non-transitory computer-readable storage device) accessible by the reproduction engine 220. The storage location 225 may be physically connected to the reproduction engine 220, or located at a remote location such as on a remote server or cloud drive.
The monaural cues can represent modifications of the original source sound (e.g., by the environment) prior to entering the corresponding ear canal for processing by the auditory system. In some cases, such modifications may encode information representing one or more parameters of the environment, and may be captured via an impulse response representing a path between a location of the source and the ear. The one or more parameters that may be encoded in such an impulse response can include, for example, a location of the source, an acoustic signature of the environment, etc. Such an impulse response can be referred to as a head-related impulse response (HRIR), and a frequency domain representation (e.g., Fourier transform) of an HRIR can be referred to as the corresponding head-related transfer function (HRTF). A particular HRIR is associated with a particular point in space around a listener, and therefore, convolution of an arbitrary source sound with the particular HRIR can be used to generate a sound which would have been heard by the listener had it originated at the particular point in space. Therefore, if an HRIR (or HRTF) corresponding to a path between a particular point in space and the user's ear is available, an acoustic signal can be processed by the reproduction engine 220 using the HRIR (or HRTF) to cause the user to perceive the signal as coming from the particular point in space.
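For example, assuming the left-ear and right-ear HRIRs for the desired point in space are available as discrete-time impulse responses, the convolution described above may be sketched as follows (illustrative Python only; the function and variable names are hypothetical):

```python
import numpy as np

def spatialize(mono_signal, hrir_left, hrir_right):
    """Render a mono source so that it is perceived as originating at the
    point in space associated with this pair of HRIRs (one per ear)."""
    left = np.convolve(mono_signal, hrir_left)
    right = np.convolve(mono_signal, hrir_right)
    return np.stack([left, right])  # (2, num_samples) binaural signal
```

In a real-time application such as a teleconference, the convolution would typically be performed block-wise (for example, using an overlap-add scheme) so that output samples can be produced with low latency.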
The above concept can be used by the reproduction engine 220 to localize received audio data to virtual sources at particular locations in space. For example, referring to
The above example assumes the HRTFs or HRIRs to be specific to one particular dimension (azimuth angle) only. However, if HRTFs or HRIRs corresponding to various elevations, distances, and/or azimuths are available, the reproduction engine can be configured to process received audio data to localize a virtual source at various points in space as governed by the granularity of the available HRTFs or HRIRs. In some implementations, an HRTF or HRIR corresponding to the directional information included in the received data may not be available in the database of pre-computed HRTFs or HRIRs. In such cases, the reproduction engine 220 can be configured to compute the required HRTF or HRIR from available pre-computed HRTFs or HRIRs using an interpolation process. In some implementations, if an HRTF or HRIR corresponding exactly to the directional information included in the received data is not available, an approximate HRTF or HRIR (based, for example, on a nearest neighbor criterion) may be used.
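One simple way to realize such an interpolation is sketched below, under the assumption of a database keyed by azimuth angle only; the helper name hrir_for_azimuth and the dictionary layout are illustrative assumptions, and practical systems may interpolate more carefully (for example, by treating interaural time delay and magnitude response separately).

```python
import numpy as np

def hrir_for_azimuth(hrir_db, azimuth_deg):
    """Return a (left, right) HRIR pair for `azimuth_deg`, interpolating
    linearly between the two nearest measured azimuths when the exact
    angle is not available in the database.

    hrir_db: dict mapping measured azimuth in degrees -> (hrir_left, hrir_right);
             all HRIRs are assumed to have the same length.
    """
    if azimuth_deg in hrir_db:
        return hrir_db[azimuth_deg]
    measured = sorted(hrir_db.keys())
    # Neighbors that bracket the requested angle, wrapping around at 360 degrees.
    lower = max((a for a in measured if a <= azimuth_deg), default=measured[-1])
    upper = min((a for a in measured if a >= azimuth_deg), default=measured[0])
    span = (upper - lower) % 360 or 360
    w = ((azimuth_deg - lower) % 360) / span
    left = (1 - w) * np.asarray(hrir_db[lower][0]) + w * np.asarray(hrir_db[upper][0])
    right = (1 - w) * np.asarray(hrir_db[lower][1]) + w * np.asarray(hrir_db[upper][1])
    return left, right
```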
In some implementations, the one or more HRTFs can be obtained based on the orientation of the head of the user. For example, if the user moves his/her head, a new or updated HRTF or HRIR may be needed to maintain the location of a virtual sound source with respect to the user. In some implementations, a head tracking process can be employed to track the head of the user, and the information can be provided to the reproduction engine 220 for the reproduction engine to adaptively obtain or compute a new HRTF or HRIR. The head-tracking process may be implemented, for example, by processing data from accelerometers and/or gyroscopes disposed within the user's headphones or earphones, by processing images or videos captured using a camera, or by using other available head-tracking devices and technologies.
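A minimal sketch of how the tracked head orientation may be combined with the directional information, reusing the hypothetical azimuth-only helper from the previous sketch, is shown below; compensating the source azimuth by the head yaw keeps the virtual source fixed relative to the room as the head turns.

```python
def effective_azimuth(source_azimuth_deg, head_yaw_deg):
    """Azimuth of the virtual source relative to the listener's ears.

    When the listener turns their head by `head_yaw_deg`, the HRIR is
    re-selected for the compensated angle so that the virtual source
    appears to stay put rather than rotating with the head.
    """
    return (source_azimuth_deg - head_yaw_deg) % 360

# Illustrative usage with the earlier sketch:
# hrir_l, hrir_r = hrir_for_azimuth(hrir_db, effective_azimuth(az, yaw))
```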
In some implementations, the received data can include information corresponding to multiple sensitivity patterns or beams 207a-207c. In some such cases, the reproduction engine 220 can be configured to weight the contribution of the different beams 207 prior to processing the data with the corresponding HRTFs or HRIRs. For example, if a participant 110 is speaking while another sound source (e.g., the acoustic transducer 210, or another participant) is also active, the reproduction engine 220 can be configured to weight the beam 207c higher than other beams (e.g., the beam 207a capturing the signals from the acoustic transducer 210) prior to processing using HRTFs or HRIRs. In some cases, this can suppress interfering sources and/or noise and provide a further improved teleconference experience.
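By way of illustration, such weighting may be applied to the per-beam signals before the HRTF/HRIR processing described above; the array shapes and names in the following sketch are assumptions made for illustration only.

```python
import numpy as np

def combine_beams(beam_signals, weights):
    """Weight the signal captured along each beam prior to HRTF processing,
    e.g., to emphasize the beam aimed at the active talker and de-emphasize
    a beam aimed at a loudspeaker or other interfering source.

    beam_signals: (num_beams, num_samples) array, one row per beam (e.g., 207a-207c).
    weights:      sequence of num_beams non-negative gains.
    Returns the weighted beam signals with the same shape as the input.
    """
    weights = np.asarray(weights, dtype=float)
    return weights[:, None] * beam_signals
```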
The acoustic transducers used for binaurally playing back acoustic signals generated based on the outputs of the reproduction engine 220 can be disposed in various devices. In some implementations, the acoustic transducers can be disposed in a set of headphones 230 as shown in
The example shown in
Operations of the process include receiving data representing audio captured by a microphone array disposed at a remote location, the data including directional information representing the direction of a sound source relative to the remote microphone array (402). In some implementations, the microphone array can be disposed in an audio capture device such as the device 205 mentioned above with reference to
Operations of the process 400 also include obtaining, based on the directional information, information representative of one or more HRTFs corresponding to the direction of the sound source relative to the remote microphone array (404). The information representative of one or more HRTFs can include information on corresponding HRIRs, as described above with reference to
Operations of the process 400 further include generating an output signal for an acoustic transducer by processing the received data using the information representative of the one or more HRTFs, the output signal configured to cause the acoustic transducer to generate an audible acoustic signal (406). This can include generating separate output signals for the left and right channels of a stereo system. For example, the separate output signals can be used for driving acoustic transducers disposed in one of: an in-ear earphone or headphone, an over-the-ear earphone or headphone, or an around-the-ear earphone or headphone. In some implementations, multiple directional beam patterns are used to capture the audio, and generating the output signal for the acoustic transducer includes multiplying the multiple directional beam patterns with corresponding weights to generate weighted beam-patterns, and generating the output signal by processing the weighted beam-patterns using the information representative of the one or more HRTFs. The output signal for the acoustic transducer can represent a convolution of at least a portion of the received information with corresponding impulse responses of the one or more HRTFs.
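Under the same illustrative assumptions as the earlier sketches (a hypothetical hrir_for_azimuth helper, equal-length HRIRs, and azimuth-only directional information), the generation step (406) may be sketched as weighting each beam signal, convolving it with the HRIR pair for its direction, and summing the results into left and right output channels:

```python
import numpy as np

def render_binaural(beam_signals, beam_azimuths_deg, weights, hrir_db):
    """Sketch of step 406: weighted beams -> per-beam HRIR convolution ->
    summed left/right output signals for the acoustic transducers."""
    hrir_len = max(len(left) for left, _ in hrir_db.values())
    length = beam_signals.shape[1] + hrir_len - 1
    out_left = np.zeros(length)
    out_right = np.zeros(length)
    for signal, azimuth, w in zip(beam_signals, beam_azimuths_deg, weights):
        hrir_l, hrir_r = hrir_for_azimuth(hrir_db, azimuth)  # from the earlier sketch
        conv_l = np.convolve(w * signal, hrir_l)
        conv_r = np.convolve(w * signal, hrir_r)
        out_left[: len(conv_l)] += conv_l
        out_right[: len(conv_r)] += conv_r
    return np.stack([out_left, out_right])  # drives the left/right transducers
```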
The functionality described herein, or portions thereof, and its various modifications (hereinafter “the functions”) can be implemented, at least in part, via a computer program product, e.g., a computer program tangibly embodied in an information carrier, such as one or more non-transitory machine-readable media or storage devices, for execution by, or to control the operation of, one or more data processing apparatus, e.g., a programmable processor, a computer, multiple computers, and/or programmable logic components.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a network.
Actions associated with implementing all or part of the functions can be performed by one or more programmable processors executing one or more computer programs to perform the functions described herein. All or part of the functions can be implemented as special purpose logic circuitry, e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit). In some implementations, at least a portion of the functions may also be executed on a floating point or fixed point digital signal processor (DSP) such as the Super Harvard Architecture Single-Chip Computer (SHARC) developed by Analog Devices Inc.
Processing devices suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Components of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
Other embodiments and applications not specifically described herein are also within the scope of the following claims.
Elements of different implementations described herein may be combined to form other embodiments not specifically set forth above. Elements may be left out of the structures described herein without adversely affecting their operation. Furthermore, various separate elements may be combined into one or more individual elements to perform the functions described herein.