This disclosure relates in general to microphone arrangement of a wearable head device.
Symmetrical microphone configurations can offer several advantages in detecting voice onset events. Because a symmetrical microphone configuration may place two or more microphones equidistant from a sound source (e.g., a user's mouth), audio signals received from each microphone may be easily added and/or subtracted from each other for signal processing.
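As a minimal sketch of why equidistant placement simplifies processing, the following assumes two already-sampled, time-aligned microphone signals. The function name and sample values are illustrative assumptions and are not part of the disclosure.

```python
def sum_and_difference(left, right):
    """Return the element-wise sum and difference of two equal-length signals."""
    if len(left) != len(right):
        raise ValueError("signals must have the same number of samples")
    total = [a + b for a, b in zip(left, right)]
    diff = [a - b for a, b in zip(left, right)]
    return total, diff


# Hypothetical samples: a source equidistant from both microphones
# arrives identically at each one.
voice = [0.0, 1.0, 0.0, -1.0]
total, diff = sum_and_difference(voice, voice)
# The sum reinforces the common signal; the difference cancels it.
```

In this idealized case the difference signal is zero for any source on the plane of symmetry, which is also why such a source (the user or a person directly in front of the user) may be hard to tell apart.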
However, it may be more difficult for symmetric microphone configurations to distinguish a user's voice from other audio signals. For example, a person standing directly in front of a user may not be distinguishable from the user with a symmetrical microphone configuration on a wearable head device. A symmetrical microphone configuration may result in both microphones receiving speech signals at the same time, regardless of whether the user or the person directly in front of the user is speaking. This may allow the person directly in front of the user to “hijack” a MR system by issuing voice commands that the MR system may not be able to identify as originating from someone other than the user.
Furthermore, due to the symmetric configuration, it may be more difficult to capture sound information along an axis of symmetry (e.g., symmetric microphones are at a same level on the axis of symmetry, the symmetric microphones are co-planar). This difficulty would in turn make user voice isolation, acoustic cancellation, audio scene analysis, fixed-orientation environment capture, and lobe steering more challenging, because sound information along all axes of an environment may be required. A solution to improve accuracy is to include additional microphones along the axis of symmetry to capture more information along the axis. However, adding microphones would result in increased weight and power consumption, which may not be desirable for a battery-powered device worn by a user, such as a wearable head device.
Examples of the disclosure describe systems and methods related to microphone arrangement of a wearable head device.
In some embodiments, a wearable head device comprises: a first plurality of microphones, wherein the first plurality of microphones are co-planar; a second microphone, wherein the second microphone is not co-planar with the plurality of microphones; and one or more processors configured to perform: capturing, with the microphones, a sound of an environment; forming a beamforming pattern, wherein: the beamforming pattern comprises a location of the sound of the environment, and the beamforming pattern comprises a component that is not co-planar with the plurality of microphones; applying the beamforming pattern on a signal of the captured sound to generate a beamformed signal; and processing the beamformed signal.
In some embodiments, a number of the first plurality of microphones is three.
In some embodiments, the beamforming pattern comprises a radial component, an azimuthal angle component, and a non-zero polar angle component.
In some embodiments, the beamforming pattern comprises at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes.
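For illustration only, several of the named shapes can be written as first-order patterns of the form alpha + (1 - alpha) * cos(theta), where theta is the angle away from the look direction. The alpha values below are commonly cited textbook values and are assumptions, not values from the disclosure; the shotgun shape is not a first-order pattern and is omitted.

```python
import math

# Assumed, commonly cited first-order parameters (not from the disclosure).
PATTERN_ALPHA = {
    "dipole": 0.0,
    "hypercardioid": 0.25,
    "supercardioid": 0.37,
    "cardioid": 0.5,
}


def first_order_gain(pattern, theta_rad):
    """Gain of a first-order pattern at angle theta from the look direction."""
    alpha = PATTERN_ALPHA[pattern]
    return alpha + (1.0 - alpha) * math.cos(theta_rad)


# A cardioid has unit gain on axis and a null directly behind;
# a dipole has its null at 90 degrees.
front = first_order_gain("cardioid", 0.0)
rear = first_order_gain("cardioid", math.pi)
side = first_order_gain("dipole", math.pi / 2)
```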
In some embodiments, processing the beamformed signal comprises at least one of: reducing a noise level in the signal, performing post conditioning on the signal, detecting a voice activity in the signal, generating a speaker signal for acoustic cancellation, analyzing an audio scene associated with the captured sound, and compensating for a movement of the wearable head device.
In some embodiments, the one or more processors are configured to further perform preconditioning the signal of the captured sound.
In some embodiments, one of the first plurality of microphones and the second microphone are located on a front of the wearable head device.
In some embodiments, the beamforming pattern does not include a location of a second sound on a plane co-planar with the first plurality of microphones.
In some embodiments, a microphone of the first plurality of microphones is located proximal to an ear location.
In some embodiments, the one or more processors are configured to further perform: generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones; generating a second microphone signal based on the sound captured by the second microphone; calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound not co-planar with the plurality of microphones.
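The role of the non-co-planar microphone can be illustrated with a hedged sketch that computes far-field arrival delays for a hypothetical four-microphone layout. The microphone positions, the speed-of-sound constant, and the convention that the polar angle is measured as elevation from the microphone plane are all assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

# Hypothetical positions in metres: three microphones in the z = 0 plane
# and a fourth microphone raised out of that plane.
MICS = [
    (0.07, 0.0, 0.0),
    (-0.07, 0.0, 0.0),
    (0.0, 0.08, 0.0),
    (0.0, 0.08, 0.03),
]


def unit_direction(azimuth_rad, elevation_rad):
    """Unit vector from the array origin toward a far-field source."""
    ce = math.cos(elevation_rad)
    return (ce * math.cos(azimuth_rad),
            ce * math.sin(azimuth_rad),
            math.sin(elevation_rad))


def steering_delays(mic_positions, azimuth_rad, elevation_rad):
    """Arrival delay (seconds) at each microphone relative to the origin.

    A microphone displaced toward the source hears the wavefront earlier,
    so its delay is negative.
    """
    u = unit_direction(azimuth_rad, elevation_rad)
    return [-sum(p * c for p, c in zip(pos, u)) / SPEED_OF_SOUND
            for pos in mic_positions]


# Sources mirrored above and below the plane of the first three microphones.
up = steering_delays(MICS, 0.3, 0.5)
down = steering_delays(MICS, 0.3, -0.5)
```

Under these assumptions the three co-planar microphones see identical delays for the mirrored sources, while the raised microphone sees different delays, which is what allows a pattern with a non-zero out-of-plane component to be formed.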
In some embodiments, a method of operating a wearable head device comprising: a first plurality of microphones, wherein the first plurality of microphones are co-planar; and a second microphone, wherein the second microphone is not co-planar with the plurality of microphones, the method comprising: capturing, with the microphones, a sound of an environment; forming a beamforming pattern, wherein: the beamforming pattern comprises a location of the sound of the environment, and the beamforming pattern comprises a component that is not co-planar with the plurality of microphones; applying the beamforming pattern on a signal of the captured sound to generate a beamformed signal; and processing the beamformed signal.
In some embodiments, a number of the first plurality of microphones is three.
In some embodiments, the beamforming pattern comprises a radial component, an azimuthal angle component, and a non-zero polar angle component.
In some embodiments, the beamforming pattern comprises at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes.
In some embodiments, processing the beamformed signal comprises at least one of: reducing a noise level in the signal, performing post conditioning on the signal, detecting a voice activity in the signal, generating a speaker signal for acoustic cancellation, analyzing an audio scene associated with the captured sound, and compensating for a movement of the wearable head device.
In some embodiments, the method further comprises preconditioning the signal of the captured sound.
In some embodiments, one of the first plurality of microphones and the second microphone are located on a front of the wearable head device.
In some embodiments, the beamforming pattern does not include a location of a second sound on a plane co-planar with the first plurality of microphones.
In some embodiments, a microphone of the first plurality of microphones is located proximal to an ear location.
In some embodiments, the method further comprises: generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones; generating a second microphone signal based on the sound captured by the second microphone; calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound not co-planar with the plurality of microphones.
In some embodiments, a non-transitory computer-readable medium storing one or more instructions, which, when executed by one or more processors of an electronic device comprising: a first plurality of microphones, wherein the first plurality of microphones are co-planar; and a second microphone, wherein the second microphone is not co-planar with the plurality of microphones, cause the device to perform a method comprising: capturing, with the microphones, a sound of an environment; forming a beamforming pattern, wherein: the beamforming pattern comprises a location of the sound of the environment, and the beamforming pattern comprises a component that is not co-planar with the plurality of microphones; applying the beamforming pattern on a signal of the captured sound to generate a beamformed signal; and processing the beamformed signal.
In some embodiments, a number of the first plurality of microphones is three.
In some embodiments, the beamforming pattern comprises a radial component, an azimuthal angle component, and a non-zero polar angle component.
In some embodiments, the beamforming pattern comprises at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes.
In some embodiments, processing the beamformed signal comprises at least one of: reducing a noise level in the signal, performing post conditioning on the signal, detecting a voice activity in the signal, generating a speaker signal for acoustic cancellation, analyzing an audio scene associated with the captured sound, and compensating for a movement of the wearable head device.
In some embodiments, the method further comprises preconditioning the signal of the captured sound.
In some embodiments, one of the first plurality of microphones and the second microphone are located on a front of the wearable head device.
In some embodiments, the beamforming pattern does not include a location of a second sound on a plane co-planar with the first plurality of microphones.
In some embodiments, a microphone of the first plurality of microphones is located proximal to an ear location.
In some embodiments, the method further comprises: generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones; generating a second microphone signal based on the sound captured by the second microphone; calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound not co-planar with the plurality of microphones.
In the following description of examples, reference is made to the accompanying drawings which form a part hereof, and in which it is shown by way of illustration specific examples that can be practiced. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the disclosed examples.
Like all people, a user of a MR system exists in a real environment—that is, a three-dimensional portion of the “real world,” and all of its contents, that are perceptible by the user. For example, a user perceives a real environment using one's ordinary human senses—sight, sound, touch, taste, smell—and interacts with the real environment by moving one's own body in the real environment. Locations in a real environment can be described as coordinates in a coordinate space; for example, a coordinate can comprise latitude, longitude, and elevation with respect to sea level; distances in three orthogonal dimensions from a reference point; or other suitable values. Likewise, a vector can describe a quantity having a direction and a magnitude in the coordinate space.
A computing device can maintain, for example in a memory associated with the device, a representation of a virtual environment. As used herein, a virtual environment is a computational representation of a three-dimensional space. A virtual environment can include representations of any object, action, signal, parameter, coordinate, vector, or other characteristic associated with that space. In some examples, circuitry (e.g., a processor) of a computing device can maintain and update a state of a virtual environment; that is, a processor can determine at a first time t0, based on data associated with the virtual environment and/or input provided by a user, a state of the virtual environment at a second time t1. For instance, if an object in the virtual environment is located at a first coordinate at time t0, and has certain programmed physical parameters (e.g., mass, coefficient of friction); and an input received from a user indicates that a force should be applied to the object in a direction vector; the processor can apply laws of kinematics to determine a location of the object at time t1. The processor can use any suitable information known about the virtual environment, and/or any suitable input, to determine a state of the virtual environment at a time t1.
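The state update from time t0 to time t1 can be sketched as a single integration step. The function below is a hypothetical illustration using semi-implicit Euler integration; it ignores friction and any other programmed parameters, and its name and arguments are assumptions rather than part of the described system.

```python
def step_object(position, velocity, force, mass, dt):
    """Advance an object's state from t0 to t1 = t0 + dt.

    Uses Newton's second law (a = F / m) and semi-implicit Euler
    integration; friction is ignored in this simplified sketch.
    """
    acceleration = tuple(f / mass for f in force)
    new_velocity = tuple(v + a * dt for v, a in zip(velocity, acceleration))
    new_position = tuple(p + v * dt for p, v in zip(position, new_velocity))
    return new_position, new_velocity


# Hypothetical input: a 1 kg object at rest, pushed along x with 2 N
# for half a second.
pos_t1, vel_t1 = step_object(
    position=(0.0, 0.0, 0.0),
    velocity=(0.0, 0.0, 0.0),
    force=(2.0, 0.0, 0.0),
    mass=1.0,
    dt=0.5,
)
```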
In maintaining and updating a state of a virtual environment, the processor can execute any suitable software, including software relating to the creation and deletion of virtual objects in the virtual environment; software (e.g., scripts) for defining behavior of virtual objects or characters in the virtual environment; software for defining the behavior of signals (e.g., audio signals) in the virtual environment; software for creating and updating parameters associated with the virtual environment; software for generating audio signals in the virtual environment; software for handling input and output; software for implementing network operations; software for applying asset data (e.g., animation data to move a virtual object over time); or many other possibilities.
Output devices, such as a display or a speaker, can present any or all aspects of a virtual environment to a user. For example, a virtual environment may include virtual objects (which may include representations of inanimate objects; people; animals; lights; etc.) that may be presented to a user. A processor can determine a view of the virtual environment (for example, corresponding to a “camera” with an origin coordinate, a view axis, and a frustum); and render, to a display, a viewable scene of the virtual environment corresponding to that view. Any suitable rendering technology may be used for this purpose. In some examples, the viewable scene may include some virtual objects in the virtual environment, and exclude certain other virtual objects. Similarly, a virtual environment may include audio aspects that may be presented to a user as one or more audio signals. For instance, a virtual object in the virtual environment may generate a sound originating from a location coordinate of the object (e.g., a virtual character may speak or cause a sound effect); or the virtual environment may be associated with musical cues or ambient sounds that may or may not be associated with a particular location. A processor can determine an audio signal corresponding to a “listener” coordinate—for instance, an audio signal corresponding to a composite of sounds in the virtual environment, and mixed and processed to simulate an audio signal that would be heard by a listener at the listener coordinate (e.g., using the methods and systems described herein)—and present the audio signal to a user via one or more speakers.
Because a virtual environment exists as a computational structure, a user may not directly perceive a virtual environment using one's ordinary senses. Instead, a user can perceive a virtual environment indirectly, as presented to the user, for example by a display, speakers, haptic output devices, etc. Similarly, a user may not directly touch, manipulate, or otherwise interact with a virtual environment; but can provide input data, via input devices or sensors, to a processor that can use the device or sensor data to update the virtual environment. For example, a camera sensor can provide optical data indicating that a user is trying to move an object in a virtual environment, and a processor can use that data to cause the object to respond accordingly in the virtual environment.
A MR system can present to the user, for example using a transmissive display and/or one or more speakers (which may, for example, be incorporated into a wearable head device), a MR environment (“MRE”) that combines aspects of a real environment and a virtual environment. In some embodiments, the one or more speakers may be external to the wearable head device. As used herein, a MRE is a simultaneous representation of a real environment and a corresponding virtual environment. In some examples, the corresponding real and virtual environments share a single coordinate space; in some examples, a real coordinate space and a corresponding virtual coordinate space are related to each other by a transformation matrix (or other suitable representation). Accordingly, a single coordinate (along with, in some examples, a transformation matrix) can define a first location in the real environment, and also a second, corresponding, location in the virtual environment; and vice versa.
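As a hedged sketch of relating a real coordinate space to a virtual coordinate space, the example below builds a 4x4 homogeneous transformation matrix and its inverse. The choice of a rigid transform (rotation about the z-axis plus translation) and all function names are assumptions made for illustration; the description permits any suitable representation.

```python
import math


def make_transform(yaw_rad, translation):
    """4x4 homogeneous matrix: rotate about z by yaw, then translate."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def apply_transform(matrix, point):
    """Map a 3D point through a 4x4 homogeneous matrix."""
    h = (point[0], point[1], point[2], 1.0)
    out = [sum(m * v for m, v in zip(row, h)) for row in matrix]
    return tuple(out[:3])


def invert_rigid(matrix):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    r = [row[:3] for row in matrix[:3]]
    t = [row[3] for row in matrix[:3]]
    rt = [[r[j][i] for j in range(3)] for i in range(3)]
    ti = [-sum(rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [ti[0]], rt[1] + [ti[1]], rt[2] + [ti[2]],
            [0.0, 0.0, 0.0, 1.0]]


# Hypothetical mapping from the real space to the virtual space.
real_to_virtual = make_transform(math.pi / 2, (1.0, 2.0, 0.0))
virtual_point = apply_transform(real_to_virtual, (1.0, 0.0, 0.0))
# "Vice versa": the inverse maps the virtual location back to the real one.
real_point = apply_transform(invert_rigid(real_to_virtual), virtual_point)
```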
In a MRE, a virtual object (e.g., in a virtual environment associated with the MRE) can correspond to a real object (e.g., in a real environment associated with the MRE). For instance, if the real environment of a MRE comprises a real lamp post (a real object) at a location coordinate, the virtual environment of the MRE may comprise a virtual lamp post (a virtual object) at a corresponding location coordinate. As used herein, the real object in combination with its corresponding virtual object together constitute a “mixed reality object.” It is not necessary for a virtual object to perfectly match or align with a corresponding real object. In some examples, a virtual object can be a simplified version of a corresponding real object. For instance, if a real environment includes a real lamp post, a corresponding virtual object may comprise a cylinder of roughly the same height and radius as the real lamp post (reflecting that lamp posts may be roughly cylindrical in shape). Simplifying virtual objects in this manner can allow computational efficiencies, and can simplify calculations to be performed on such virtual objects. Further, in some examples of a MRE, not all real objects in a real environment may be associated with a corresponding virtual object. Likewise, in some examples of a MRE, not all virtual objects in a virtual environment may be associated with a corresponding real object. That is, some virtual objects may exist solely in a virtual environment of a MRE, without any real-world counterpart.
In some examples, virtual objects may have characteristics that differ, sometimes drastically, from those of corresponding real objects. For instance, while a real environment in a MRE may comprise a green, two-armed cactus (a prickly inanimate object), a corresponding virtual object in the MRE may have the characteristics of a green, two-armed virtual character with human facial features and a surly demeanor. In this example, the virtual object resembles its corresponding real object in certain characteristics (color, number of arms); but differs from the real object in other characteristics (facial features, personality). In this way, virtual objects have the potential to represent real objects in a creative, abstract, exaggerated, or fanciful manner; or to impart behaviors (e.g., human personalities) to otherwise inanimate real objects. In some examples, virtual objects may be purely fanciful creations with no real-world counterpart (e.g., a virtual monster in a virtual environment, perhaps at a location corresponding to an empty space in a real environment).
In some examples, virtual objects may have characteristics that resemble corresponding real objects. For instance, a virtual character may be presented in a virtual or mixed reality environment as a life-like figure to provide a user an immersive mixed reality experience. With virtual characters having life-like characteristics, the user may feel like he or she is interacting with a real person. In such instances, it is desirable for actions such as muscle movements and gaze of the virtual character to appear natural. For example, movements of the virtual character should be similar to those of its corresponding real object (e.g., a virtual human should walk or move its arm like a real human). As another example, the gestures and positioning of the virtual human should appear natural, and the virtual human can initiate interactions with the user (e.g., the virtual human can lead a collaborative experience with the user). Presentation of virtual characters or objects having life-like audio responses is described in more detail herein.
Compared to VR systems, which present the user with a virtual environment while obscuring the real environment, a mixed reality system presenting a MRE affords the advantage that the real environment remains perceptible while the virtual environment is presented. Accordingly, the user of the mixed reality system is able to use visual and audio cues associated with the real environment to experience and interact with the corresponding virtual environment. As an example, while a user of VR systems may struggle to perceive or interact with a virtual object displayed in a virtual environment (because, as noted herein, a user may not directly perceive or interact with a virtual environment), a user of an MR system may find it more intuitive and natural to interact with a virtual object by seeing, hearing, and touching a corresponding real object in his or her own real environment. This level of interactivity may heighten a user's feelings of immersion, connection, and engagement with a virtual environment. Similarly, by simultaneously presenting a real environment and a virtual environment, mixed reality systems may reduce negative psychological feelings (e.g., cognitive dissonance) and negative physical feelings (e.g., motion sickness) associated with VR systems. Mixed reality systems further offer many possibilities for applications that may augment or alter our experiences of the real world.
Persistent coordinate data may be coordinate data that persists relative to a physical environment. Persistent coordinate data may be used by MR systems (e.g., MR system 112, 200) to place persistent virtual content, which may not be tied to movement of a display on which the virtual object is being displayed. For example, a two-dimensional screen may display virtual objects relative to a position on the screen. As the two-dimensional screen moves, the virtual content may move with the screen. In some embodiments, persistent virtual content may be displayed in a corner of a room. A MR user may look at the corner, see the virtual content, look away from the corner (where the virtual content may no longer be visible because the virtual content may have moved from within the user's field of view to a location outside the user's field of view due to motion of the user's head), and look back to see the virtual content in the corner (similar to how a real object may behave).
In some embodiments, persistent coordinate data (e.g., a persistent coordinate system and/or a persistent coordinate frame) can include an origin point and three axes. For example, a persistent coordinate system may be assigned to a center of a room by a MR system. In some embodiments, a user may move around the room, out of the room, re-enter the room, etc., and the persistent coordinate system may remain at the center of the room (e.g., because it persists relative to the physical environment). In some embodiments, a virtual object may be displayed using a transform to persistent coordinate data, which may enable displaying persistent virtual content. In some embodiments, a MR system may use simultaneous localization and mapping to generate persistent coordinate data (e.g., the MR system may assign a persistent coordinate system to a point in space). In some embodiments, a MR system may map an environment by generating persistent coordinate data at regular intervals (e.g., a MR system may assign persistent coordinate systems in a grid where persistent coordinate systems may be at least within five feet of another persistent coordinate system).
In some embodiments, persistent coordinate data may be generated by a MR system and transmitted to a remote server. In some embodiments, a remote server may be configured to receive persistent coordinate data. In some embodiments, a remote server may be configured to synchronize persistent coordinate data from multiple observation instances. For example, multiple MR systems may map the same room with persistent coordinate data and transmit that data to a remote server. In some embodiments, the remote server may use this observation data to generate canonical persistent coordinate data, which may be based on the one or more observations. In some embodiments, canonical persistent coordinate data may be more accurate and/or reliable than a single observation of persistent coordinate data. In some embodiments, canonical persistent coordinate data may be transmitted to one or more MR systems. For example, a MR system may use image recognition and/or location data to recognize that it is located in a room that has corresponding canonical persistent coordinate data (e.g., because other MR systems have previously mapped the room). In some embodiments, the MR system may receive canonical persistent coordinate data corresponding to its location from a remote server.
With respect to
In the example shown, mixed reality objects comprise corresponding pairs of real objects and virtual objects (e.g., 122A/122B, 124A/124B, 126A/126B) that occupy corresponding locations in coordinate space 108. In some examples, both the real objects and the virtual objects may be simultaneously visible to user 110. This may be desirable in, for example, instances where the virtual object presents information designed to augment a view of the corresponding real object (such as in a museum application where a virtual object presents the missing pieces of an ancient damaged sculpture). In some examples, the virtual objects (122B, 124B, and/or 126B) may be displayed (e.g., via active pixelated occlusion using a pixelated occlusion shutter) so as to occlude the corresponding real objects (122A, 124A, and/or 126A). This may be desirable in, for example, instances where the virtual object acts as a visual replacement for the corresponding real object (such as in an interactive storytelling application where an inanimate real object becomes a “living” character).
In some examples, real objects (e.g., 122A, 124A, 126A) may be associated with virtual content or helper data that may not necessarily constitute virtual objects. Virtual content or helper data can facilitate processing or handling of virtual objects in the mixed reality environment. For example, such virtual content could include two-dimensional representations of corresponding real objects; custom asset types associated with corresponding real objects; or statistical data associated with corresponding real objects. This information can enable or facilitate calculations involving a real object without incurring unnecessary computational overhead.
In some examples, the presentation described herein may also incorporate audio aspects. For instance, in MRE 150, virtual character 132 could be associated with one or more audio signals, such as a footstep sound effect that is generated as the character walks around MRE 150. As described herein, a processor of mixed reality system 112 can compute an audio signal corresponding to a mixed and processed composite of all such sounds in MRE 150, and present the audio signal to user 110 via one or more speakers included in mixed reality system 112 and/or one or more external speakers.
Example mixed reality system 112 can include a wearable head device (e.g., a wearable augmented reality or mixed reality head device) comprising a display (which may comprise left and right transmissive displays, which may be near-eye displays, and associated components for coupling light from the displays to the user's eyes); left and right speakers (e.g., positioned adjacent to the user's left and right ears, respectively); an inertial measurement unit (IMU) (e.g., mounted to a temple arm of the head device); an orthogonal coil electromagnetic receiver (e.g., mounted to the left temple piece); left and right cameras (e.g., depth (time-of-flight) cameras) oriented away from the user; and left and right eye cameras oriented toward the user (e.g., for detecting the user's eye movements). However, a mixed reality system 112 can incorporate any suitable display technology, and any suitable sensors (e.g., optical, infrared, acoustic, LIDAR, EOG, GPS, magnetic). In addition, mixed reality system 112 may incorporate networking features (e.g., Wi-Fi capability, mobile network (e.g., 4G, 5G) capability) to communicate with other devices and systems, including neural networks (e.g., in the cloud) for data processing and training data associated with presentation of elements (e.g., virtual character 132) in the MRE 150 and other mixed reality systems. Mixed reality system 112 may further include a battery (which may be mounted in an auxiliary unit, such as a belt pack designed to be worn around a user's waist), a processor, and a memory. The wearable head device of mixed reality system 112 may include tracking components, such as an IMU or other suitable sensors, configured to output a set of coordinates of the wearable head device relative to the user's environment. In some examples, tracking components may provide input to a processor performing a Simultaneous Localization and Mapping (SLAM) and/or visual odometry algorithm. 
In some examples, mixed reality system 112 may also include a handheld controller 300, and/or an auxiliary unit 320, which may be a wearable beltpack, as described herein.
In some embodiments, an animation rig is used to present the virtual character 132 in the MRE 150. Although the animation rig is described with respect to virtual character 132, it is understood that the animation rig may be associated with other characters (e.g., a human character, an animal character, an abstract character) in the MRE 150.
In some examples involving augmented reality or mixed reality applications, it may be desirable to transform coordinates from a local coordinate space (e.g., a coordinate space fixed relative to headgear device 500A) to an inertial coordinate space, or to an environmental coordinate space. For instance, such transformations may be necessary for a display of headgear device 500A to present a virtual object at an expected position and orientation relative to the real environment (e.g., a virtual person sitting in a real chair, facing forward, regardless of the position and orientation of headgear device 500A), rather than at a fixed position and orientation on the display (e.g., at the same position in the display of headgear device 500A). This can maintain an illusion that the virtual object exists in the real environment (and does not, for example, appear positioned unnaturally in the real environment as the headgear device 500A shifts and rotates). In some examples, a compensatory transformation between coordinate spaces can be determined by processing imagery from the depth cameras 544 (e.g., using a Simultaneous Localization and Mapping (SLAM) and/or visual odometry procedure) in order to determine the transformation of the headgear device 500A relative to an inertial or environmental coordinate system. In the example shown in
In some examples, the depth cameras 544 can supply 3D imagery to a hand gesture tracker 511, which may be implemented in a processor of headgear device 500A. The hand gesture tracker 511 can identify a user's hand gestures, for example by matching 3D imagery received from the depth cameras 544 to stored patterns representing hand gestures. Other suitable techniques of identifying a user's hand gestures will be apparent.
In some examples, one or more processors 516 may be configured to receive data from headgear subsystem 504B, the IMU 509, the SLAM/visual odometry block 506, depth cameras 544, microphones 550, and/or the hand gesture tracker 511. The processor 516 can also send control signals to and receive control signals from the 6DOF totem system 504A. The processor 516 may be coupled to the 6DOF totem system 504A wirelessly, such as in examples where the handheld controller 500B is untethered. Processor 516 may further communicate with additional components, such as an audio-visual content memory 518, a Graphical Processing Unit (GPU) 520, and/or a Digital Signal Processor (DSP) audio spatializer 522. The DSP audio spatializer 522 may be coupled to a Head Related Transfer Function (HRTF) memory 525. The GPU 520 can include a left channel output coupled to the left source of imagewise modulated light 524 and a right channel output coupled to the right source of imagewise modulated light 526. GPU 520 can output stereoscopic image data to the sources of imagewise modulated light 524, 526. The DSP audio spatializer 522 can output audio to a left speaker 512 and/or a right speaker 514. The DSP audio spatializer 522 can receive input from processor 516 indicating a direction vector from a user to a virtual sound source (which may be moved by the user, e.g., via the handheld controller 500B). Based on the direction vector, the DSP audio spatializer 522 can determine a corresponding HRTF (e.g., by accessing a HRTF, or by interpolating multiple HRTFs). The DSP audio spatializer 522 can then apply the determined HRTF to an audio signal, such as an audio signal corresponding to a virtual sound generated by a virtual object.
This can enhance the believability and realism of the virtual sound, by incorporating the relative position and orientation of the user relative to the virtual sound in the mixed reality environment—that is, by presenting a virtual sound that matches a user's expectations of what that virtual sound would sound like if it were a real sound in a real environment.
In some examples, such as shown in
While
In some embodiments, the microphones 602 and 604 are offset about a Z-axis (e.g., z-axis 114Z). For example, the microphone 602 is at a first Z value, and the microphone 604 is at a second Z value. In some embodiments, the microphones 606 and 608 are offset about an X-axis (e.g., x-axis 114X). For example, the microphone 606 is at a first X value, and the microphone 608 is at a second X value. In some embodiments, the microphones 606 and 608 are proximal to the user's ears (e.g., 3-6 cm from the user's ears). By locating the microphones 606 and 608 proximal to the user's ears, ambient noise around the user's ears may be more accurately captured, and a speaker output signal (e.g., configured for acoustic cancellation) may more accurately cancel the ambient noise.
It is understood that the illustrated microphone locations in
The microphone configuration of MR system 600 advantageously allows sound information to be captured along an axis of asymmetry (e.g., an axis of offset between a pair of microphones, Z-axis, X-axis) (e.g., by taking advantage of amplitude and phase differences captured by the different microphones, as a consequence of the asymmetrical configuration), without adding microphones that would result in increased weight and power consumption. That is, the microphone configuration introduces geometrical diversity (e.g., offset along a Z-axis, offset along an X-axis) along three dimensions (e.g., x-axis 114X, y-axis 114Y, z-axis 114Z) to enable discrimination of audio objects (e.g., audio objects (e.g., non-user voice, noise) in a user's vicinity) along the three dimensions. For example, the microphones capture a sound. A first microphone (e.g., a microphone of a plurality of co-planar microphones) generates a first microphone signal based on the captured sound, and the second microphone (e.g., a non-co-planar microphone) generates a second microphone signal based on the captured sound. Based on the amplitude and/or phase difference between the two microphone signals, a non-co-planar component may be derived by the wearable head device.
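As a non-limiting illustration of how an amplitude and/or phase difference between two microphone signals might be computed, the following Python sketch evaluates both signals at a single frequency and converts the phase lag into a path-length difference. The function names, the single-tone analysis, and the assumed speed of sound are illustrative assumptions and are not taken from the disclosure.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: speed of sound in air near room temperature


def tone_phasor(signal, freq_hz, sample_rate_hz):
    """Complex amplitude of `signal` at `freq_hz`, computed as a single DFT bin.

    Most accurate when the window spans a whole number of cycles of `freq_hz`.
    """
    acc = 0j
    for k, x in enumerate(signal):
        acc += x * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate_hz)
    return 2.0 * acc / len(signal)


def amplitude_and_phase_difference(sig_a, sig_b, freq_hz, sample_rate_hz):
    """Return (amplitude ratio b/a, phase of b relative to a in radians)."""
    ratio = tone_phasor(sig_b, freq_hz, sample_rate_hz) / tone_phasor(
        sig_a, freq_hz, sample_rate_hz
    )
    return abs(ratio), cmath.phase(ratio)


def path_difference_m(phase_diff_rad, freq_hz):
    """Extra path length to microphone b implied by its phase lag.

    Unambiguous only while the true phase difference stays within (-pi, pi).
    """
    delay_s = -phase_diff_rad / (2.0 * math.pi * freq_hz)
    return delay_s * SPEED_OF_SOUND_M_S
```

In such a sketch, a positive path difference indicates that the sound reached the second microphone later than the first, which is one kind of information from which a non-co-planar component could be derived.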
The microphone configuration of MR system 600 additionally allows the weight and power consumption of the system to be minimized, which may be desirable for a battery-powered device worn by a user, such as a wearable head device.
Because this configuration allows the system to capture sound information along an axis of asymmetry, user voice isolation, acoustic cancellation, audio scene analysis, fixed-orientation environment capture, and lobe steering are facilitated: sound information along all axes of an environment (e.g., an augmented reality (AR), MR, or extended reality (XR) environment) may be obtained without the cost of additional microphones.
In some embodiments, asymmetrical microphone configurations may be used because an asymmetrical configuration may be better suited to distinguishing a user's voice from other audio signals. The MR system 600 (which may correspond to MR system 112, wearable head device 200, or system 501) can be configured to receive voice input from a user. In some embodiments, a first microphone may be placed at location 602, and a second microphone may be placed at location 604. In some embodiments, MR system 600 can include a wearable head device, and a user's mouth may be positioned at location 610. Sound originating from the user's mouth at location 610 may take longer to reach microphone location 602 than microphone location 604 because the travel distance between location 610 and location 602 is larger than the travel distance between location 610 and location 604.
In some embodiments, an asymmetrical microphone configuration (e.g., the microphone configuration shown in
Although asymmetrical microphone configurations may provide additional information about a sound source (e.g., an approximate height of the sound source), a sound delay may complicate subsequent calculations. In some embodiments, adding and/or subtracting audio signals that are offset (e.g., in time) from each other may decrease a signal-to-noise ratio (“SNR”), rather than increasing the SNR (which may happen when the audio signals are not offset from each other). It can therefore be desirable to process audio signals (e.g., using a disclosed microphone signal preconditioning block) received from an asymmetrical microphone configuration such that a beamforming analysis (e.g., noise cancellation, 4-channel beamforming (as disclosed herein)) may still be performed to determine voice activity. In some embodiments, a voice onset event can be determined based on a beamforming analysis and/or single channel analysis. A notification may be transmitted to a processor (e.g., a DSP or x86 processor) in response to determining that a voice onset event has occurred. The notification may include information such as a timestamp of the voice onset event and/or a request that the processor begin speech recognition.
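As a non-limiting illustration of why offset signals may be aligned before they are added, the following Python sketch estimates an integer-sample delay between two microphone signals by cross-correlation and then sums the aligned signals. The function names and the restriction to whole-sample, non-negative lags are simplifying assumptions; the disclosure does not specify how the delay adjustment is performed.

```python
def estimate_delay_samples(reference, delayed, max_lag):
    """Lag in samples (0..max_lag) at which `delayed` best matches `reference`.

    Both inputs are equal-length lists; the score is a plain cross-correlation.
    """
    window = len(reference) - max_lag
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[i] * delayed[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_and_sum(reference, delayed, lag):
    """Advance `delayed` by `lag` samples, then add it to `reference`."""
    usable = len(reference) - lag
    return [reference[i] + delayed[i + lag] for i in range(usable)]
```

When the two signals carry the same speech and uncorrelated noise, summing after alignment reinforces the speech coherently, whereas summing without alignment can partially cancel it.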
In some embodiments, because the microphone arrangement of MR system 600 provides more information along all axes of the environment (e.g., improved Z-axis capture without additional microphones), the disclosed microphone arrangements also advantageously allow improved user voice isolation, acoustic cancellation, audio scene analysis, fixed-orientation environment capture, and lobe steering, compared to a symmetric microphone arrangement. For example, voices (e.g., a non-user voice) and noises around the user (e.g., left, right, front, back, or above the user) may be more accurately rejected. As another example, because the disclosed microphone arrangements allow a sound field (e.g., a sound field at a user's ear) to be better controlled, acoustic cancellation (e.g., acoustic echo cancellation using a disclosed acoustic echo cancellation block) may be improved for ambient noise suppression and audio object occlusion.
As yet another example, the disclosed microphone arrangements may improve an audio scene analysis by allowing real-time, low-latency detection (e.g., acoustic detection) of scene elements that may not be detectable (e.g., visible) by cameras. The disclosed microphone arrangements may be used for acoustic detection in conjunction with or in lieu of other scene detection methods (e.g., simultaneous localization and mapping, visual inertial odometry) and/or other scene detection sensors (e.g., camera, gyroscope, inertial measurement unit, LiDAR sensor, or other suitable sensor). As yet another example, the disclosed microphone arrangements allow the system to record a sound field more independently from a user's movements (e.g., head rotation), for example by allowing head movement along all axes of the environment to be detected acoustically, or by providing a sound field that may be more easily adjusted (e.g., the sound field has more information along different axes of the environment) to compensate for these movements. More examples of these features and advantages are described herein.
As yet another example, the disclosed microphone arrangements allow a beamformer lobe to be resolved along an angle (e.g., an angle about a Z-axis, steerable beamforming along angles in ISO-80000-2:2019 spherical coordinates) with fewer microphones. For example, the disclosed four-microphone arrangements advantageously allow a beamformer lobe to be steered along three axes and/or polar coordinates of an environment, compared to an arrangement of six microphones (two per axis). As examples, the beamformed patterns include at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes. The disclosed microphone arrangements also allow a sound field (e.g., Ambisonics) to form along the axes of an environment with fewer microphones.
In some embodiments,
As illustrated, the sound from the user at location 710 is at an angle θ (e.g., a polar angle) relative to the positive Z-axis. The microphone arrangement of MR system 700 advantageously allows a beamforming pattern to more accurately capture the sound from the user. For example, the beamforming patterns generated from the microphone arrangement may more accurately reject non-user sounds or noises in front of the user (e.g., from a non-user sound or noise source on the X-Y plane). For example, a beamforming pattern comprising a main directional lobe 712 (for clarity, side and rear lobes are not shown) may be formed to more accurately capture the sound from the user. In some embodiments, the main directional lobe 712 is configured to include the location 710 (e.g., to capture the intended sound source). For example, the pattern is formed such that a focus of the main directional lobe 712 is located at location 710. The main directional lobe 712 may have a length of r (e.g., a radial component).
As illustrated by this example, the microphone arrangement advantageously allows polar angle steering (e.g., rotating by an angle θ and lengthening by r) with a minimum number of microphones. Polar angle steering may not be possible (e.g., the beamforming patterns are fixed at θ=90 degrees) using a four-microphone symmetrical configuration (e.g., the four microphones are co-planar).
In some embodiments,
As illustrated, the sound at location 810 is at an angle θ (e.g., a polar angle) relative to the positive Z-axis and at an angle φ (e.g., an azimuthal angle) relative to the positive X-axis. The microphone arrangement of MR system 800 advantageously allows a beamforming pattern to more accurately capture the sound. For example, the beamforming patterns generated from the microphone arrangement may more accurately reject unintended captures (e.g., from a non-user sound or noise source around the location 810). For example, a beamforming pattern comprising a main directional lobe 812 (for clarity, side and rear lobes are not shown) may be formed to more accurately capture the sound at location 810. In some embodiments, the main directional lobe 812 is configured to include the location 810 (e.g., to capture the intended sound source). For example, the location 810 is located at an edge of the main directional lobe 812. The main directional lobe 812 may have a length of r.
As illustrated by this example, the microphone arrangement advantageously allows polar angle steering (e.g., rotating by angles θ and φ and lengthening by r) with a minimum number of microphones. Polar angle steering may not be possible (e.g., the beamforming patterns are fixed at θ=90 degrees, and may not reach the location 810 at (r, φ, θ)) using a four-microphone symmetrical configuration (e.g., the four microphones are co-planar).
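As a non-limiting illustration of one way to see why a non-co-planar microphone may help with polar angle steering, the following Python sketch converts a location (r, φ, θ) to Cartesian coordinates (polar angle θ from the positive Z-axis, azimuthal angle φ from the positive X-axis) and compares inter-microphone arrival delays for a source and for its mirror image through the X-Y plane. The microphone coordinates, function names, and speed of sound are hypothetical values chosen for illustration and are not taken from the disclosure.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: speed of sound in air near room temperature

# Hypothetical microphone coordinates in meters (not from the disclosure).
COPLANAR_MICS = [(0.08, 0.0, 0.0), (-0.08, 0.0, 0.0), (0.0, 0.07, 0.0), (0.0, -0.07, 0.0)]
# Same first three microphones; the fourth is raised 3 cm out of the X-Y plane.
ASYMMETRIC_MICS = COPLANAR_MICS[:3] + [(0.0, -0.07, 0.03)]


def spherical_to_cartesian(r, phi, theta):
    """Convert (r, azimuth phi, polar angle theta) in radians to (x, y, z)."""
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )


def relative_delays_s(mic_positions, source_xyz):
    """Arrival time at each microphone relative to the first microphone."""
    times = [math.dist(mic, source_xyz) / SPEED_OF_SOUND_M_S for mic in mic_positions]
    return [t - times[0] for t in times]


def mirror_ambiguity_s(mic_positions, r, phi, theta):
    """Largest delay change when the source is mirrored through the X-Y plane."""
    above = relative_delays_s(mic_positions, spherical_to_cartesian(r, phi, theta))
    below = relative_delays_s(
        mic_positions, spherical_to_cartesian(r, phi, math.pi - theta)
    )
    return max(abs(a - b) for a, b in zip(above, below))
```

Under these assumptions, the co-planar layout yields the same delays for θ and for 180 degrees minus θ, so delay information alone cannot separate the two locations, while the layout with one raised microphone yields different delays for the two locations.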
In some embodiments,
As illustrated, the sound from the user at location 910 is at an angle θ relative to the positive Z-axis. The microphone arrangement of MR system 900 advantageously allows a beamforming pattern to more accurately capture the sound from the user. For example, the beamforming patterns generated from the microphone arrangement may more accurately reject non-user sounds or noises in front of the user (e.g., from a non-user sound or noise source on the X-Y plane). As described with respect to
In some embodiments, the cone 914 represents a pickup cone that has a focus along the edges of the cone, but a null centered on the x-axis. Thus, as illustrated, the cone 914 rejects the distractor voice pickup (e.g., located at location 912).
In some embodiments, some processes described with respect to diagram 1000 are performed with a first processor (e.g., a processor that consumes less power than the second processor, a first processor of a disclosed MR system), and some processes described with respect to diagram 1000 are performed with a second processor (e.g., a processor that has more processing power than the first processor, a second processor of a disclosed MR system). For example, processes performed with respect to the acoustic echo cancellation (AEC) blocks may be performed with the first processor, and the remaining processes may be performed with the second processor. As another example, processes performed with respect to the acoustic echo cancellation (AEC) blocks and beamforming block may be performed with the first processor, and the remaining processes may be performed with the second processor.
In some embodiments, the MR system includes AEC blocks 1002A-1002D. In some embodiments, as illustrated, the AEC blocks 1002A-1002D are stereo AEC blocks. In some embodiments, the AEC blocks are configured to receive microphone signals. For example, each of AEC blocks 1002A-1002D is configured to receive a microphone signal (e.g., microphone signal 1008A-1008D) of the MR system. Ambient noise around the user's ears may be captured (e.g., corresponding to the microphone signals 1008A-1008D), and the AEC blocks 1002A-1002D may generate a signal for a speaker to output an acoustic cancellation signal for acoustic cancellation (e.g., an audio signal that destructively interferes or cancels a level of ambient noise at the user's ears).
Each microphone signal may correspond to a microphone of the MR system. For example, microphone signal 1008A may correspond to microphone 608, microphone signal 1008B may correspond to microphone 604, microphone signal 1008C may correspond to microphone 602, and microphone signal 1008D may correspond to microphone 606.
In some embodiments, the AEC blocks are also configured to receive speaker reference signals. For example, the AEC blocks 1002A-1002D are configured to receive speaker reference signals 1010A and 1010B. The speaker reference signals may represent a magnitude and/or frequency response of a speaker of the MR system, and the speaker reference signals may be used for acoustic echo cancellation. Each of the speaker reference signals may correspond to a speaker of the MR system. For example, speaker reference signal 1010A may correspond to speaker 220A, and speaker reference signal 1010B may correspond to speaker 220B. As discussed earlier, the microphone arrangement of the MR system advantageously allows more effective acoustic echo cancellation without additional microphones.
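As a non-limiting illustration of how a speaker reference signal might be used for acoustic echo cancellation, the following Python sketch applies a normalized least-mean-squares (NLMS) adaptive filter to one microphone signal and one speaker reference. NLMS is a common textbook choice and is an assumption here; the disclosure does not state which algorithm the AEC blocks use. The sketch also assumes the reference is the time-domain signal driving the speaker and handles a single reference rather than a stereo pair; the function name and parameters are hypothetical.

```python
def nlms_echo_cancel(mic, speaker_ref, num_taps=8, step=0.5, eps=1e-8):
    """Remove the speaker's contribution from `mic` with an NLMS adaptive filter.

    Returns (residual, weights): the echo-reduced signal and the final
    estimate of the speaker-to-microphone impulse response.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # recent reference samples, newest first
    residual = []
    for m, s in zip(mic, speaker_ref):
        history = [s] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        err = m - echo_estimate  # what remains after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + step * err * h / norm for w, h in zip(weights, history)]
        residual.append(err)
    return residual, weights
```

In a four-microphone system such as the one described, one such filter could be run per microphone signal (and per speaker reference), which would be consistent with having one AEC block per microphone.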
In some embodiments, outputs of the AEC blocks 1002A-1002D are transmitted to a beamforming block 1004. In some embodiments, the beamforming block 1004 is configured to receive the processed microphone signals (e.g., microphone signals after acoustic echo cancellation) for beamforming. For example, as illustrated, the beamforming block 1004 receives steering parameters 1012. The steering parameters may include angle φ and angle θ. The angle φ and angle θ may correspond to the angle φ and angle θ described with respect to
In some embodiments, the beamformed mic signal from the beamforming block 1004 is transmitted to a noise reduction block 1006. The noise reduction block 1006 may reduce any other noises that were not reduced or eliminated during the acoustic echo cancellation (e.g., by AEC blocks 1002A-1002D) or beamforming (e.g., by beamforming block 1004). In some embodiments, the noise reduction block 1006 is configured to output a signal for outputting an acoustic cancellation signal at a speaker. In some embodiments, the noise reduction block 1006 is configured to output a mono mic signal 1014 for further processing (e.g., stored, translated into a system command, processed to become an AR, MR, or XR environment recording). In some embodiments, the noise reduction block 1006 is configured to reject steady state noise such as fans, machines, or electronic self-noise (e.g., MEMS microphones). In some embodiments, the noise reduction block 1006 is configured to adaptively reject a part of a signal determined to not be human speech.
In some embodiments, some processes described with respect to diagram 1100 are performed with a first processor (e.g., a processor that consumes less power than the second processor, a first processor of a disclosed MR system), and some processes described with respect to diagram 1100 are performed with a second processor (e.g., a processor that has more processing power than the first processor, a second processor of a disclosed MR system). For example, processes performed with respect to the microphone signal preconditioning block may be performed with the first processor, and the remaining processes may be performed with the second processor. As another example, processes performed with respect to the microphone signal preconditioning block and beamforming block may be performed with the first processor, and the remaining processes may be performed with the second processor.
In some embodiments, the MR system includes microphone signal preconditioning block 1102. In some embodiments, the microphone signal preconditioning block 1102 comprises more than one block (e.g., one block per microphone signal). In some embodiments, the microphone signal preconditioning block 1102 is configured to process a microphone signal, adjust for a delay caused by the asymmetric microphone configuration, determine input power, smooth the microphone signal, calculate SNR, determine/remove speaker contribution to a captured sound field, and/or determine sounds of interest from the microphone signals. In some embodiments, the microphone signal preconditioning block includes calibration filters configured to compensate for acoustic variations due to manufacturing variability (e.g., of the microphone, of the system).
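As a non-limiting illustration of some of the preconditioning operations listed above, the following Python sketch computes a per-frame input power, smooths a sequence of values with a one-pole filter, and expresses an SNR in decibels. The function names, the smoothing coefficient, and the power floor are hypothetical choices made for illustration; the disclosure does not specify how these quantities are computed.

```python
import math


def frame_power(frame):
    """Mean-square power of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def smooth(values, alpha=0.9):
    """One-pole exponential smoothing; larger `alpha` means heavier smoothing."""
    out = []
    state = values[0]
    for v in values:
        state = alpha * state + (1.0 - alpha) * v
        out.append(state)
    return out


def snr_db(signal_power, noise_power, floor=1e-12):
    """Signal-to-noise ratio in dB, with a floor to avoid division by zero."""
    return 10.0 * math.log10(max(signal_power, floor) / max(noise_power, floor))
```

For example, a noise power could be estimated from smoothed frame powers during intervals without speech and then compared against the current frame power to obtain a running SNR estimate.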
In some embodiments, the microphone signal preconditioning block 1102 is configured to receive microphone signals. For example, the microphone signal preconditioning block 1102 is configured to receive microphone signals (e.g., microphone signals 1108A-1108D) of the MR system. Each microphone signal may correspond to a microphone of the MR system. For example, microphone signal 1108A may correspond to microphone 608, microphone signal 1108B may correspond to microphone 604, microphone signal 1108C may correspond to microphone 602, and microphone signal 1108D may correspond to microphone 606.
In some embodiments, the microphone signal preconditioning block 1102 is also configured to receive speaker reference signals. For example, the microphone signal preconditioning block 1102 is configured to receive speaker reference signals 1110A and 1110B. The speaker reference signals may represent a magnitude and/or frequency response of a speaker of the MR system, and the speaker reference signals may be used for determining a contribution of the speakers to a recorded sound field (e.g., to determine a speaker's contribution to a captured sound field and remove the contribution). Each of the speaker reference signals may correspond to a speaker of the MR system. For example, speaker reference signal 1110A may correspond to speaker 220A, and speaker reference signal 1110B may correspond to speaker 220B.
In some embodiments, outputs of the microphone signal preconditioning block 1102 are transmitted to a beamforming block 1104. In some embodiments, the beamforming block 1104 is configured to receive the processed microphone signals (e.g., microphone signals after preconditioning) for beamforming. For example, as illustrated, the beamforming block 1104 receives steering parameters 1112. The steering parameters may include angle φ and angle θ. The angle φ and angle θ may correspond to the angle φ and angle θ described with respect to
In some embodiments, the beamformed mic signal from the beamforming block 1104 is transmitted to block 1106. In some embodiments, the block 1106 is a post conditioning block. In some embodiments, the post conditioning block is configured to apply gain with soft clipping, apply tone EQ, function as an exciter or a de-esser, apply compression, perform automatic level control, perform other dynamics processing, perform noise reduction, and/or perform functions of a microphone channel strip. For example, the post conditioning block is configured to output a post conditioned stream. As another example, the post conditioning block is a voice stream post conditioning block configured to output a user voice stream (e.g., stored, processed to become an AR, MR, or XR environment recording).
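As a non-limiting illustration of applying gain with soft clipping, the following Python sketch scales a signal by a gain expressed in decibels and limits the result with a hyperbolic tangent curve. The use of tanh, the function name, and the parameters are assumptions made for illustration; other soft-clipping curves could be used.

```python
import math


def apply_gain_soft_clip(samples, gain_db, ceiling=1.0):
    """Apply `gain_db` of gain, then compress peaks smoothly toward `ceiling`.

    Small signals pass almost linearly; large signals saturate at +/- ceiling
    instead of being hard-clipped.
    """
    gain = 10.0 ** (gain_db / 20.0)
    return [ceiling * math.tanh(gain * x / ceiling) for x in samples]
```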
In some embodiments, the block 1106 is a voice activity detection block. In some embodiments, the voice activity detection block is configured to detect speech associated with a system command (e.g., wake up system, perform a command of the system). In some embodiments, the voice activity detection block outputs a voice activity flag 1116 corresponding to a detected voice activity (e.g., from the microphone signals). In some embodiments, the block 1106 is both a post conditioning block and a voice activity detection block, as illustrated. As discussed earlier, the microphone arrangement of the MR system advantageously allows more accurate user voice isolation (e.g., for more accurately capturing a user voice stream, for more accurately detecting voice activity) without additional microphones.
In some embodiments, some processes described with respect to diagram 1200 are performed with a first processor (e.g., a processor that consumes less power than the second processor, a first processor of a disclosed MR system), and some processes described with respect to diagram 1200 are performed with a second processor (e.g., a processor that has more processing power than the first processor, a second processor of a disclosed MR system). For example, processes performed with respect to the microphone signal preconditioning block may be performed with the first processor, and the remaining processes may be performed with the second processor. As another example, processes performed with respect to the microphone signal preconditioning block and beamforming block may be performed with the first processor, and the remaining processes may be performed with the second processor.
In some embodiments, the MR system includes microphone signal preconditioning block 1202. In some embodiments, the microphone signal preconditioning block 1202 comprises more than one block (e.g., one block per microphone signal). In some embodiments, the microphone signal preconditioning block 1202 is configured to process a microphone signal, adjust for a delay caused by the asymmetric microphone configuration, determine input power, smooth the microphone signal, calculate SNR, determine/remove speaker contribution to a captured sound field, and/or determine sounds of interest from the microphone signals.
In some embodiments, the microphone signal preconditioning block 1202 is configured to receive microphone signals. For example, the microphone signal preconditioning block 1202 is configured to receive microphone signals (e.g., microphone signals 1208A-1208D) of the MR system. Each microphone signal may correspond to a microphone of the MR system. For example, microphone signal 1208A may correspond to microphone 608, microphone signal 1208B may correspond to microphone 604, microphone signal 1208C may correspond to microphone 602, and microphone signal 1208D may correspond to microphone 606.
In some embodiments, the microphone signal preconditioning block 1202 is also configured to receive speaker reference signals. For example, the microphone signal preconditioning block 1202 is configured to receive speaker reference signals 1210A and 1210B. The speaker reference signals may represent a magnitude and/or frequency response of a speaker of the MR system, and the speaker reference signals may be used for determining a contribution of the speakers to a recorded sound field (e.g., to determine a speaker's contribution to a captured sound field and remove the contribution). Each of the speaker reference signals may correspond to a speaker of the MR system. For example, speaker reference signal 1210A may correspond to speaker 220A, and speaker reference signal 1210B may correspond to speaker 220B.
In some embodiments, outputs of the microphone signal preconditioning block 1202 are transmitted to a beamforming block 1204. In some embodiments, the beamforming block 1204 is configured to receive the processed microphone signals (e.g., microphone signals after preconditioning) for beamforming. For example, as illustrated, the beamforming block 1204 receives steering parameters 1212. The steering parameters may include angle φn and angle θn. The angle φ and angle θ may correspond to the angle φ and angle θ described with respect to
In some embodiments, the beamformed mic signals from the beamforming block 1204 are transmitted to block 1206. For example, N beamformed signals 1214A to 1214N are outputted from the beamforming block 1204. In some embodiments, more than one of the N beamformed signals are outputted at a same time. In some embodiments, one of the N beamformed signals is outputted at a time.
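As a non-limiting illustration of producing N beamformed outputs from N pairs of steering parameters, the following Python sketch implements a narrowband, far-field delay-and-sum beamformer operating on per-microphone complex amplitudes at one frequency. Delay-and-sum, the far-field (plane wave) model, the function names, and the speed of sound are assumptions made for illustration; the disclosure does not state which beamforming algorithm block 1204 uses.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: speed of sound in air near room temperature


def unit_vector(phi, theta):
    """Unit vector for azimuth `phi` and polar angle `theta` (radians)."""
    return (
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    )


def steering_phasors(mic_positions, phi, theta, freq_hz):
    """Expected per-microphone phasors for a plane wave arriving from (phi, theta)."""
    u = unit_vector(phi, theta)
    omega = 2.0 * math.pi * freq_hz
    return [
        cmath.exp(1j * omega * sum(p * c for p, c in zip(mic, u)) / SPEED_OF_SOUND_M_S)
        for mic in mic_positions
    ]


def beamform(mic_phasors, mic_positions, steering_params, freq_hz):
    """Return one beam output per (phi_n, theta_n) pair in `steering_params`."""
    outputs = []
    for phi, theta in steering_params:
        expected = steering_phasors(mic_positions, phi, theta, freq_hz)
        # Undo the expected phase at each microphone, then average.
        total = sum(x * e.conjugate() for x, e in zip(mic_phasors, expected))
        outputs.append(total / len(mic_phasors))
    return outputs
```

In this sketch, a beam steered at the true arrival direction has unit magnitude for a unit-amplitude plane wave, while beams steered elsewhere have smaller magnitude; the N outputs may be computed together or one at a time.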
In some embodiments, the block 1206 is a post conditioning block. In some embodiments, the post conditioning block is configured to apply gain with soft clipping, apply tone EQ, function as an exciter or a de-esser, apply compression, perform automatic level control, perform other dynamics processing, perform noise reduction, and/or perform functions of a microphone channel strip. For example, the post conditioning block is configured to output a post conditioned stream. As another example, the post conditioning block is a voice stream post conditioning block configured to output a user voice stream (e.g., stored, processed to become an AR, MR, or XR environment recording). As a specific example, the post conditioning block receives a beamformed signal 1214N and outputs a user voice stream 1216N. The post conditioning block may be configured to receive N beamformed signals 1214A to 1214N and output N user voice streams 1216A to 1216N. In some embodiments, more than one of the N user voice streams are outputted at a same time. In some embodiments, one of the N user voice streams is outputted at a time.
In some embodiments, the block 1206 is a voice activity detection block. In some embodiments, the voice activity detection block is configured to detect speech associated with a system command (e.g., wake up system, perform a command of the system). In some embodiments, the voice activity detection block outputs a voice activity flag corresponding to a detected voice activity (e.g., from the microphone signals). As a specific example, the voice activity detection block receives a beamformed signal 1214N and outputs a voice activity flag 1216N. The voice activity detection block may be configured to receive N beamformed signals 1214A to 1214N and output N voice activity flags 1216A to 1216N. In some embodiments, more than one of the N voice activity flags are outputted at a same time. In some embodiments, one of the N voice activity flags is outputted at a time.
In some embodiments, the block 1206 is both a post conditioning block and a voice activity detection block, as illustrated. As a specific example, the combined post conditioning and voice activity detection block receives a beamformed signal 1214N and outputs a user voice stream 1216N or a voice activity flag 1216N, depending on a desired type of output. The combined post conditioning and voice activity detection block may be configured to receive N beamformed signals 1214A to 1214N and output N user voice streams and voice activity flags 1216A to 1216N, each output signal depending on a desired type of output. In some embodiments, more than one of the N output signals are outputted at a same time. In some embodiments, one of the N output signals is outputted at a time.
As discussed earlier, the microphone arrangement of the MR system advantageously allows more accurate user voice isolation (e.g., for more accurately capturing a user voice stream, for more accurately detecting voice activity) without additional microphones.
In some embodiments, the method 1300 includes capturing a sound with microphones (step 1302). In some embodiments, the method 1300 includes capturing the sound with four microphones in the disclosed asymmetric configuration (e.g., three of the microphones are co-planar and the fourth microphone is not co-planar; without additional microphones), as described with respect to
In some embodiments, the method 1300 includes forming a beamforming pattern (step 1304). In some embodiments, the beamforming pattern comprises a location of the captured sound (e.g., from step 1302). In some embodiments, the beamforming pattern comprises a component that is not co-planar with a plane formed by three of the four microphones. For example, as described with respect to
In some embodiments, the method 1300 includes generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones and generating a second microphone signal based on the sound captured by the second microphone. In some embodiments, the method 1300 includes calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound not co-planar with the plurality of microphones. For example, as described with respect to
In some embodiments, the method 1300 includes applying the beamforming pattern (step 1306). For example, as described with respect to
In some embodiments, prior to applying the beamforming pattern, acoustic cancellation processing (e.g., using AEC blocks 1002A-1002D) is performed on the captured microphone signals (e.g., from step 1302), as described with respect to
In some embodiments, the method 1300 includes processing a signal (step 1308). For example, a signal (e.g., a beamformed signal) is generated by applying a beamforming pattern (e.g., from step 1306, based on the disclosed asymmetric configuration (e.g., three of the microphones are co-planar and the fourth microphone is not co-planar; without additional microphones)) to the captured microphone signal (e.g., from step 1302), as described with respect to
In some embodiments, a wearable head device (e.g., a wearable head device described herein, AR/MR/XR system described herein) includes: a processor; a memory; and a program stored in the memory, configured to be executed by the processor, and including instructions for performing the methods described with respect to
In some embodiments, a non-transitory computer readable storage medium stores one or more programs, and the one or more programs includes instructions. When the instructions are executed by an electronic device (e.g., an electronic device or system described herein) with one or more processors and memory, the instructions cause the electronic device to perform the methods described with respect to
Although examples of the disclosure are described with respect to a wearable head device or an AR/MR/XR system, it is understood that the disclosed sound field recording and playback methods may also be performed using other devices or systems. For example, the disclosed methods may be performed using a mobile device for compensating for effects of movement during recording or playback. As another example, the disclosed methods may be performed using a mobile device for recording a sound field including extracting sound objects and combining the sound objects and a residual.
Although examples of the disclosure are described with respect to headpose compensation, it is understood that the disclosed sound field recording and playback methods may also be performed generally for compensation of any movement. For example, the disclosed methods may be performed using a mobile device for compensating for effects of movement during recording or playback.
With respect to the systems and methods described herein, elements of the systems and methods can be implemented by one or more computer processors (e.g., CPUs or DSPs) as appropriate. The disclosure is not limited to any particular configuration of computer hardware, including computer processors, used to implement these elements. In some cases, multiple computer systems can be employed to implement the systems and methods described herein. For example, a first computer processor (e.g., a processor of a wearable device coupled to one or more microphones) can be utilized to receive input microphone signals, and perform initial processing of those signals (e.g., signal conditioning and/or segmentation). A second (and perhaps more computationally powerful) processor can then be utilized to perform more computationally intensive processing, such as determining probability values associated with speech segments of those signals. Another computer device, such as a cloud server, can host an audio processing engine, to which input signals are ultimately provided. Other suitable configurations will be apparent and are within the scope of the disclosure.
According to some embodiments, a wearable head device comprises: a first plurality of microphones, wherein the first plurality of microphones are co-planar; a second microphone, wherein the second microphone is not co-planar with the first plurality of microphones; and one or more processors configured to perform: capturing, with the microphones, a sound of an environment; forming a beamforming pattern, wherein: the beamforming pattern comprises a location of the sound of the environment, and the beamforming pattern comprises a component that is not co-planar with the first plurality of microphones; applying the beamforming pattern to a signal of the captured sound to generate a beamformed signal; and processing the beamformed signal.
According to some embodiments, a number of the first plurality of microphones is three.
According to some embodiments, the beamforming pattern comprises a radial component, an azimuthal angle component, and a non-zero polar angle component.
According to some embodiments, the beamforming pattern comprises at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes.
According to some embodiments, processing the beamformed signal comprises at least one of: reducing a noise level in the signal, performing post conditioning on the signal, detecting a voice activity in the signal, generating a speaker signal for acoustic cancellation, analyzing an audio scene associated with the captured sound, and compensating for a movement of the wearable head device.
According to some embodiments, the one or more processors are further configured to precondition the signal of the captured sound.
According to some embodiments, one of the first plurality of microphones and the second microphone are located on a front of the wearable head device.
According to some embodiments, the beamforming pattern does not include a location of a second sound on a plane co-planar with the first plurality of microphones.
According to some embodiments, a microphone of the first plurality of microphones is located proximal to an ear location.
According to some embodiments, the one or more processors are further configured to perform: generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones; generating a second microphone signal based on the sound captured by the second microphone; calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound that is not co-planar with the first plurality of microphones.
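As a minimal sketch of the phase-difference derivation described above, the following Python example estimates how far a sound source lies out of the microphone plane. It assumes a far-field (plane-wave) source, a single known frequency, and a second microphone positioned directly above one of the co-planar microphones at a known spacing. The function names, the 2 cm spacing, and the 343 m/s speed of sound are illustrative assumptions and are not taken from the disclosure.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air


def phase_from_elevation(elevation_rad, freq_hz, spacing_m, c=SPEED_OF_SOUND_M_S):
    """Phase lead (radians) of the out-of-plane microphone for a plane wave
    arriving at the given elevation above the microphone plane."""
    return 2.0 * math.pi * freq_hz * spacing_m * math.sin(elevation_rad) / c


def elevation_from_phase(phase_diff_rad, freq_hz, spacing_m, c=SPEED_OF_SOUND_M_S):
    """Invert the relation above to recover the elevation angle (radians).

    The sine is clamped to [-1, 1] so measurement noise cannot push the
    argument outside the domain of asin.
    """
    if freq_hz <= 0.0 or spacing_m <= 0.0:
        raise ValueError("frequency and spacing must be positive")
    sin_elev = c * phase_diff_rad / (2.0 * math.pi * freq_hz * spacing_m)
    sin_elev = max(-1.0, min(1.0, sin_elev))
    return math.asin(sin_elev)


if __name__ == "__main__":
    # Hypothetical values: 1 kHz tone, microphones 2 cm apart, source 30 degrees up.
    phi = phase_from_elevation(math.radians(30.0), 1000.0, 0.02)
    print(math.degrees(elevation_from_phase(phi, 1000.0, 0.02)))
```

Under these assumptions the phase difference is unambiguous only while the spacing stays below half a wavelength at the analysis frequency; a practical system would likely combine several frequency bins and could also use the magnitude difference.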
According to some embodiments, a method is provided for operating a wearable head device comprising: a first plurality of microphones, wherein the first plurality of microphones are co-planar; and a second microphone, wherein the second microphone is not co-planar with the first plurality of microphones, the method comprising: capturing, with the microphones, a sound of an environment; forming a beamforming pattern, wherein: the beamforming pattern comprises a location of the sound of the environment, and the beamforming pattern comprises a component that is not co-planar with the first plurality of microphones; applying the beamforming pattern to a signal of the captured sound to generate a beamformed signal; and processing the beamformed signal.
According to some embodiments, a number of the first plurality of microphones is three.
According to some embodiments, the beamforming pattern comprises a radial component, an azimuthal angle component, and a non-zero polar angle component.
According to some embodiments, the beamforming pattern comprises at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes.
According to some embodiments, processing the beamformed signal comprises at least one of: reducing a noise level in the signal, performing post conditioning on the signal, detecting a voice activity in the signal, generating a speaker signal for acoustic cancellation, analyzing an audio scene associated with the captured sound, and compensating for a movement of the wearable head device.
According to some embodiments, the method further comprises preconditioning the signal of the captured sound.
According to some embodiments, one of the first plurality of microphones and the second microphone are located on a front of the wearable head device.
According to some embodiments, the beamforming pattern does not include a location of a second sound on a plane co-planar with the first plurality of microphones.
According to some embodiments, a microphone of the first plurality of microphones is located proximal to an ear location.
According to some embodiments, the method further comprises: generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones; generating a second microphone signal based on the sound captured by the second microphone; calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound that is not co-planar with the first plurality of microphones.
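The effect of a beamforming pattern with an out-of-plane component can be illustrated with a simple delay-and-sum model. The sketch below is not the disclosed implementation: it assumes far-field plane waves, a single frequency, and hypothetical microphone coordinates (three microphones in the plane z = 0 and a fourth raised 3 cm above it). The angle called `elevation_rad` is measured from the microphone plane, which is a convention chosen here for illustration. The example compares the response to a sound arriving from the steered direction with the response to a sound arriving from its mirror image below the plane.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air

# Hypothetical microphone positions in meters (not from the disclosure):
# the first three are co-planar (z = 0); the fourth is out of plane.
MICS = [
    (0.07, 0.00, 0.00),
    (-0.07, 0.00, 0.00),
    (0.00, 0.09, 0.00),
    (0.00, 0.09, 0.03),
]


def unit_vector(azimuth_rad, elevation_rad):
    """Unit vector for a direction; elevation is measured from the z = 0 plane."""
    ce = math.cos(elevation_rad)
    return (ce * math.cos(azimuth_rad), ce * math.sin(azimuth_rad), math.sin(elevation_rad))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def array_gain(mics, steer_dir, arrival_dir, freq_hz, c=SPEED_OF_SOUND_M_S):
    """Normalized delay-and-sum magnitude response (0 to 1) to a plane wave from
    arrival_dir when the array delays are matched to steer_dir."""
    total = 0j
    for p in mics:
        # Residual delay left after compensating for the steered direction.
        delay_mismatch_s = (dot(p, arrival_dir) - dot(p, steer_dir)) / c
        total += cmath.exp(2j * math.pi * freq_hz * delay_mismatch_s)
    return abs(total) / len(mics)


if __name__ == "__main__":
    steer = unit_vector(math.radians(90.0), math.radians(30.0))
    mirror = unit_vector(math.radians(90.0), math.radians(-30.0))
    print("co-planar mics only:", array_gain(MICS[:3], steer, mirror, 2000.0))
    print("with fourth mic:    ", array_gain(MICS, steer, mirror, 2000.0))
```

With only the co-planar microphones, the mirrored direction produces exactly the same delays as the steered direction, so the array cannot attenuate it. Adding the out-of-plane microphone introduces a delay mismatch for the mirrored direction and reduces its gain, which is the kind of out-of-plane discrimination the asymmetric configuration is intended to provide.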
According to some embodiments, a non-transitory computer-readable medium stores one or more instructions that, when executed by one or more processors of an electronic device comprising: a first plurality of microphones, wherein the first plurality of microphones are co-planar; and a second microphone, wherein the second microphone is not co-planar with the first plurality of microphones, cause the electronic device to perform a method comprising: capturing, with the microphones, a sound of an environment; forming a beamforming pattern, wherein: the beamforming pattern comprises a location of the sound of the environment, and the beamforming pattern comprises a component that is not co-planar with the first plurality of microphones; applying the beamforming pattern to a signal of the captured sound to generate a beamformed signal; and processing the beamformed signal.
According to some embodiments, a number of the first plurality of microphones is three.
According to some embodiments, the beamforming pattern comprises a radial component, an azimuthal angle component, and a non-zero polar angle component.
According to some embodiments, the beamforming pattern comprises at least one of cardioid, hypercardioid, supercardioid, dipole, bipolar, and shotgun shapes.
According to some embodiments, processing the beamformed signal comprises at least one of: reducing a noise level in the signal, performing post conditioning on the signal, detecting a voice activity in the signal, generating a speaker signal for acoustic cancellation, analyzing an audio scene associated with the captured sound, and compensating for a movement of the wearable head device.
According to some embodiments, the method further comprises preconditioning the signal of the captured sound.
According to some embodiments, one of the first plurality of microphones and the second microphone are located on a front of the wearable head device.
According to some embodiments, the beamforming pattern does not include a location of a second sound on a plane co-planar with the first plurality of microphones.
According to some embodiments, a microphone of the first plurality of microphones is located proximal to an ear location.
According to some embodiments, the method further comprises: generating a first microphone signal based on the sound captured by a microphone of the first plurality of microphones; generating a second microphone signal based on the sound captured by the second microphone; calculating a magnitude difference, a phase difference, or both between the first and second microphone signals; and based on the magnitude difference, the phase difference, or both, deriving a coordinate of the sound that is not co-planar with the first plurality of microphones.
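Several of the beamforming shapes named in the embodiments above (cardioid, supercardioid, hypercardioid, and dipole) are commonly modeled as first-order patterns of the form g(θ) = a + (1 − a)·cos θ, where θ is the angle off the steering axis. The Python sketch below tabulates the textbook coefficients for these shapes and computes where each pattern has its null. The names `FIRST_ORDER_PATTERNS`, `pattern_gain`, and `null_angle_deg` are illustrative, and the bipolar and shotgun shapes are not covered by this first-order model.

```python
import math

# Textbook first-order coefficients a in g(theta) = a + (1 - a) * cos(theta).
FIRST_ORDER_PATTERNS = {
    "cardioid": 0.5,
    "supercardioid": (math.sqrt(3.0) - 1.0) / 2.0,  # approximately 0.366
    "hypercardioid": 0.25,
    "dipole": 0.0,
}


def pattern_gain(name, theta_rad):
    """Signed gain of the named first-order pattern at theta radians off-axis."""
    a = FIRST_ORDER_PATTERNS[name]
    return a + (1.0 - a) * math.cos(theta_rad)


def null_angle_deg(name):
    """Off-axis angle in degrees at which the pattern gain is zero.

    Returns None when a > 0.5, because such a pattern has no null.
    """
    a = FIRST_ORDER_PATTERNS[name]
    if a > 0.5:
        return None
    return math.degrees(math.acos(-a / (1.0 - a)))


if __name__ == "__main__":
    for name in FIRST_ORDER_PATTERNS:
        print(name, round(null_angle_deg(name), 1))
```

Choosing the coefficient therefore trades rear rejection against side rejection: the cardioid places its null directly behind the steering axis, while the dipole places it at the sides.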
Although the disclosed examples have been fully described with reference to the accompanying drawings, it is to be noted that various changes and modifications will become apparent to those skilled in the art. For example, elements of one or more implementations may be combined, deleted, modified, or supplemented to form further implementations. Such changes and modifications are to be understood as being included within the scope of the disclosed examples as defined by the appended claims.
This application claims priority to U.S. Provisional Application No. 63/255,882, filed on Oct. 14, 2021, the contents of which are incorporated by reference herein in their entirety.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2022/078073 | 10/13/2022 | WO | |

| Number | Date | Country |
|---|---|---|
| 63255882 | Oct 2021 | US |