An aspect of the disclosure here relates to digital audio signal processing techniques that reduce the effort, for a person wearing headphones, of having a conversation with another person in a noisy ambient sound environment. Other aspects are also described and claimed.
Having a conversation with someone who is nearby but in a noisy environment, such as in a restaurant, bar, airplane, or a bus, takes effort as it is difficult to hear and understand the other person. A solution that may reduce this effort is to wear headphones that passively isolate the wearer from the noisy environment but also actively reproduce the other person's voice through the headphone's speaker. This is referred to as a transparency mode of operation. In one type of transparency mode, selective reproduction of the ambient sound environment takes place, by applying beamforming signal processing to the output of a microphone array in the headphones. This focuses sound pickup in the direction of arrival of the voice of a talker (de-emphasizing or suppressing the pickup of ambient sound in other directions.) Such headphones may also have an acoustic noise cancellation (ANC) mode of operation in which a quiet listening experience is created for the wearer by electronically cancelling any ambient sounds that would otherwise still be heard by the wearer (due to having leaked past the passive sound isolation of the headphones).
The selective reproduction of a headphone wearer's ambient sound, by digital signal processing of the signals produced by the microphones of the headphone as part of a transparency mode of operation, may be designed to make it easier for the wearer to hear and understand another person (for example in the same room) with whom they are in a conversation. There is however some risk that such signal processing will not be able to achieve that goal, which leads to an unpleasant listening experience for the headphone wearer when the transparency mode is activated. This may be due to the transparency mode being activated at the wrong time, or deactivated at the wrong time, thereby reproducing undesirable ambient sounds.
An aspect of the disclosure here is a signal processing technique referred to as a conversation detector or conversation detect process. The conversation detector is a digital signal processing technique that operates upon one or more external microphone signals of the headphone, and perhaps one or more other sensor signals such as those produced by an audio accelerometer or bone conduction sensor, to decide when to activate or trigger a transparency mode of operation; ideally, the mode should be active only during an actual conversation between a wearer of the headphone and another talker in the same ambient environment. The talker (referred to here as the “other talker”) is a person who is nearby, for instance within two meters of the headphone wearer. The other talker may be standing next to the wearer, sitting across a table, or sitting side by side with the wearer, for instance in a dining establishment, in the same train car, or in the same bus. In one aspect, the transparency mode activates a conversation-focused transparency signal processing path (C-F transparency) in which one or more of the microphone signals of the headphone are processed to produce a conversation-focused transparency audio signal which is input to a speaker of the headphone. The conversation detector may declare that the conversation has ended more accurately than relying solely on the absence of own voice activity. To declare the conversation ended, the conversation detector may implement an own voice activity detector (OVAD) and a target voice activity detector (TVAD), whose inputs are one or more of the microphone signals and, when available, one or more other sensor signals. The OVAD and the TVAD detect own-voice activity (the wearer is talking) and far-field target voice activity (the other talker is speaking), respectively.
The conversation detector monitors a duration in which the OVAD and the TVAD are both or simultaneously indicating no activity and may declare the end of the conversation in response to the duration being longer than an idle threshold.
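As an illustrative, non-limiting sketch, the idle-duration monitoring described above may be expressed in code as follows. The class name, the per-frame decision cadence, and the threshold value are assumptions for the sketch, not values from the disclosure.

```python
# Illustrative sketch: declare end-of-conversation when neither the
# own-voice detector (OVAD) nor the target-voice detector (TVAD) has
# reported activity for longer than an idle threshold. Frame-based
# bookkeeping (one decision per audio frame) is an assumption here.

class ConversationEndPointer:
    def __init__(self, idle_threshold_frames=400):
        # 400 frames of, e.g., 10 ms each would be a 4 s idle threshold
        self.idle_threshold_frames = idle_threshold_frames
        self.idle_frames = 0

    def step(self, ovad_active, tvad_active):
        """Process one frame of OVAD/TVAD outputs; return True when the
        conversation should be declared ended."""
        if ovad_active or tvad_active:
            self.idle_frames = 0       # any voice activity resets the timer
        else:
            self.idle_frames += 1
        # "longer than" the idle threshold, per the text above
        return self.idle_frames > self.idle_threshold_frames
```

A usage note: because any single frame of own-voice or target-voice activity resets the timer, brief pauses within the conversation do not trigger deactivation.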
The conversation detector thus helps not only reduce power consumption, which is particularly relevant in wireless headphones, but also reduce the instances of distortion that might be introduced by the conversation-focused transparency signal processing path. It can advantageously prevent the mode from being activated in unsuitable situations.
In one aspect, the conversation-focused transparency audio signal is different than a normal transparency audio signal that is also routed to drive the speaker, where the latter may or may not have been active prior to a conversation-focused mode being activated. In another aspect, an ANC path may have been active before activation of the conversation-focused mode, producing an anti-noise signal that was being routed to the headphone speaker. This anti-noise signal may have accompanied the normal transparency audio signal, or it may have been active by itself (without the normal transparency audio signal).
A filter block produces the conversation-focused transparency audio signal by enhancing or isolating the speech of the other talker. It may be performed in many ways, e.g., by processing two or more external microphone signals (from two or more external microphones, respectively) of the headset using sound pickup beamforming to perform spatially selective sound pick up in a primary lobe having an angular spread of less than 180 degrees in front of the wearer. It may be performed using knowledge based statistical or deterministic algorithms, or it may be performed using data driven techniques such as machine learning (ML) model processing, or any combination of the above.
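As one non-limiting illustration of the beamforming option mentioned above, a two-microphone delay-and-sum beamformer steers sound pickup by delaying one microphone signal so that a target talker's wavefront lines up across the array before averaging. The microphone spacing, sample rate, and function names below are assumptions for the sketch only.

```python
# Illustrative two-microphone delay-and-sum beamformer. Steering the
# primary pickup lobe toward a direction of arrival is done by time-
# aligning the rear microphone to the front one, then averaging.
import math

SAMPLE_RATE_HZ = 16000     # hypothetical values for the sketch
MIC_SPACING_M = 0.02
SPEED_OF_SOUND_M_S = 343.0

def steering_delay_samples(angle_deg):
    """Integer sample delay that aligns a plane wave arriving from
    angle_deg (0 = broadside, 90 = endfire) across the two mics."""
    tau = MIC_SPACING_M * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    return round(tau * SAMPLE_RATE_HZ)

def delay_and_sum(front, rear, angle_deg):
    """Steer pickup toward angle_deg by delaying the rear mic signal,
    then averaging the two aligned signals sample by sample."""
    d = steering_delay_samples(angle_deg)
    delayed = ([0.0] * d + rear[: len(rear) - d]) if d > 0 else rear
    return [(a + b) / 2.0 for a, b in zip(front, delayed)]
```

In a real headset, fractional-sample delays and per-frequency weighting would replace the integer delay used here; the sketch only conveys the spatially selective pickup principle.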
In one aspect, when the conversation detector declares an end to the conversation, then at that point the transparency mode is deactivated. That means, for example, deactivating the conversation-focused transparency audio signal. In one aspect, the transparency mode is deactivated by also activating an anti-noise signal (or by raising selected frequency-dependent gains of, or raising the scalar gain of, the anti-noise signal.) In other aspects, entering and exiting the transparency mode during media playback (e.g., music playback, movie soundtrack playback) changes how the media playback signal is rendered.
The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the disclosure includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the Claims section. Such combinations may have advantages not specifically recited in the above summary.
Several aspects of the disclosure here are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
Several aspects of the disclosure with reference to the appended drawings are now explained. Whenever the shapes, relative positions and other aspects of the parts described are not explicitly defined, the scope of the invention is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
The headphone 10 is part of an audio system that has a digital audio processor 15, two or more external microphones 12, 13, at least one internal microphone (not shown in the figure), and a headphone speaker 14, all of which may be integrated within the housing of the headphone 10. The internal microphone may be one that is arranged and configured to receive the sound reproduced by the speaker 14 and is sometimes referred to as an error microphone. The external microphone 12 is arranged and configured to receive ambient sound directly and is sometimes referred to as a reference microphone. The external microphone 13 is arranged and configured to be more responsive than the other external and internal microphones in picking up the sound of the wearer's voice and is sometimes referred to as a voice microphone.
In a transparency mode of operation, such a system actively reproduces the voice of the other talker 29 that has been picked up by the external microphones 12, 13, through the headphone speaker 14, while the processor 15 is performing signal processing for suppressing background noise or undesirable sounds. The transparency mode may be implemented separately in each of the primary headphone and the secondary headphone, using much the same methodology described below. The primary headphone may be in wireless data communication with the secondary headphone, for purposes of sharing control data that is transmitted over-the-air from a different instance of a conversation detector (described below) that is operating in the secondary headphone 10b. Also, one or both of the headphones may be in wireless communication with a companion device (e.g., a smartphone, a tablet computer) of the wearer, for purposes of, for example, receiving from the companion device a user content audio signal (e.g., a downlink call signal, a media playback signal such as a mono media playback signal or a multi-channel media playback signal), sending an external or internal microphone signal to the companion device as an uplink call signal, or receiving control data from the companion device that configures the transparency mode (at least as to those portions that are performed in the headphone 10, noting that in some cases some of the operations of the transparency mode may be performed in the companion device).
Referring now to
The processor 15 has a filter block 11 through which one or more of the external microphone signals 12, 13 are processed before driving the speaker 14. The filter block is to process one or more of the microphone signals in the headphone, using various digital signal processing paths to produce one or more of the following audio signals. An anti-noise signal may be produced by an acoustic noise cancellation (ANC) subsystem, e.g., having a digital filter that is adaptively updated based on a feedforward ANC, feedback ANC or hybrid ANC arrangement to produce the anti-noise signal. A conversation-focused transparency audio signal may be produced by a C-F transparency signal processing path which may contain a digital filter (a transparency digital filter) that is configured to filter one or more of the microphone signals, where the digital filter may be a time-varying filter that is updated or adapted in real-time or on a per audio frame basis based on the processor detecting far-field speech (or a far-field target voice) in the microphone signals. More generally, the filter block 11 may implement any suitable digital processing of the two or more external microphone signals (from two or more external microphones, respectively) of the headset to produce the conversation-focused transparency audio signal, e.g., sound pickup beamforming, knowledge based statistical or deterministic algorithms, or data driven techniques such as machine learning (ML) model processing of one or more of the external microphone signals.
The processor 15 implements a conversation detector which controls automatically (without requiring explicit input from the wearer) when to activate and when to deactivate the transparency mode of operation, as well as when to enter other modes of operation described below (such as an own voice or sidetone mode.) It should declare a conversation when the wearer of the headphone is conversing with or is about to converse with another talker who is in an ambient environment of the wearer. It should end the conversation when the wearer has stopped conversing with the other talker. Still referring to
The conversation detector may declare the conversation has ended based on i) comparing a detected gap after a word in the own voice, to an own voice threshold and ii) comparing a detected gap after a word in the target voice (that of the other talker), to a target voice threshold. These gaps may be detected based on speech vs. non-speech activity outputs of an own voice activity detector (OVAD) and a target voice activity detector (TVAD) described further below in connection with
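The two-gap check above admits a simple illustration. The function and threshold values below are hypothetical, chosen only to show the structure of the decision (both talkers must have been silent past their respective thresholds).

```python
# Illustrative end-of-conversation test using two independent silence
# gaps: one measured after the wearer's last word (own voice) and one
# after the other talker's last word (target voice). Threshold values
# are placeholders, not tuning values from the disclosure.

def conversation_ended(own_gap_s, target_gap_s,
                       own_threshold_s=3.0, target_threshold_s=5.0):
    """Declare the conversation ended only when both the own-voice gap
    and the target-voice gap exceed their respective thresholds."""
    return (own_gap_s > own_threshold_s
            and target_gap_s > target_threshold_s)
```

Using separate thresholds per talker allows, for instance, a longer grace period for the other talker, whose pauses are harder to distinguish from the end of their turn.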
Still referring to
As an alternate to the conversation-focused transparency and own voice modes, there may be a normal transparency mode of operation in which the filter block 11 is configured with a normal transparency digital signal processing path that produces a normal transparency audio signal. The normal transparency signal processing path may be configured to pass ambient sounds that have been picked up by the external microphone signals (and are reproduced through the speaker 14) without trying to enhance any far-field speech or without trying to suppress ambient sound pick up in a particular direction around the wearer. The normal transparency signal processing path may be considered to process the microphone signals to achieve an omnidirectional ambient sound pick up around the headphone, e.g., at least within an azimuthal plane through the headphone.
In some cases, the ANC subsystem may be active simultaneously with one of the transparency paths but primarily in a different audible frequency band, while both are feeding the speaker 14 simultaneously. Alternatively, the ANC may be active while the transparency path is entirely inactive, i.e., inactive across the entire audio spectrum, to produce a quiet listening experience. The conversation detector may decide when and how to configure these ANC and transparency signal processing paths (or transition between the various modes of operation).
When the transparency mode is activated, the processor configures the filter block 11 to activate the conversation-focused transparency audio signal and routes the conversation-focused transparency audio signal to the speaker 14 of the headphone. When the conversation ends (the wearer and the other talker have stopped talking to each other), the conversation detector should declare that the conversation has ended in response to which the conversation-focused transparency audio signal is deactivated. The conversation detector may do so based on processing one or more of the microphone signals and the other sensor signals as described in more detail below.
The task of when and how to declare the conversation has ended is addressed first. In one aspect, the conversation detector performs machine learning (ML) model based monitoring of the wearer's voice (own voice) and that of another talker, using the microphone and other sensor signals as input, to declare the end of the conversation. This is depicted in
In accordance with an adaptive tuning aspect, the idle threshold (for when to declare the conversation as ended) may be varied, as follows. The conversation end pointing process of
Another aspect of the disclosure here relates to the processor buffering one or more of the microphone signals as “past ambient audio” and routing the past ambient audio to the speaker 14 while the conversation detector is processing the microphone signal to declare the conversation or declare the conversation has ended. For example, the last one second of ambient audio just prior to the conversation being declared may always be buffered and routed through the speaker 14, so that the wearer can hear the ambient sounds just prior to any change in the mode of operation (e.g., when transitioning from transparency to ANC mode).
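The buffering of past ambient audio can be sketched as a rolling buffer that always holds the most recent second of microphone samples. The class name and parameter defaults are illustrative assumptions; only the one-second example figure comes from the text above.

```python
# Illustrative rolling buffer of "past ambient audio": the most recent
# second of ambient-microphone samples is always available, so it can
# be routed to the speaker while the conversation detector is still
# deciding whether to change modes.
from collections import deque

class PastAmbientBuffer:
    def __init__(self, seconds=1.0, sample_rate_hz=16000):
        # deque with maxlen silently discards the oldest samples
        self.buf = deque(maxlen=int(seconds * sample_rate_hz))

    def push(self, samples):
        """Append newly captured microphone samples."""
        self.buf.extend(samples)

    def snapshot(self):
        """Return the buffered past ambient audio, oldest first."""
        return list(self.buf)
```

Because the buffer is bounded, memory cost is fixed regardless of how long the headphone runs, which matters in a wireless headphone's constrained memory budget.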
In one aspect, the processor is configured to buffer and process one or more of the microphone signals for detecting far-field speech. So long as no far-field speech is detected, the transparency mode remains deactivated, and then is activated in response to far-field speech being detected.
Turning now to the task of how to declare the conversation (has started), one approach is to simply react to the OVAD output indicating (speech) activity. In one aspect, the processor also implements a false trigger detector that prevents the conversation from being declared (despite the OVAD output indicating speech activity), by processing one or more of the microphone signals and the other sensor signals to detect a false trigger sound. The false trigger sound could represent chewing, sneeze, cough, yawn, or burp by the wearer of the headphone. These are examples of nonverbal vocalizations that do not resemble speech. The false trigger sound could alternatively represent loud breath, loud sigh, face scratch, walking, or running by a wearer of the headphone. In yet another example, the false trigger sound represents the wearer of the headphone talking to themselves or singing a song to which they are listening. In still another example, the false trigger sound represents sound from a source that is in a far-field of the external microphones but that is not speech. If such a false trigger sound is detected, then the conversation is not declared. The false trigger detector may be implemented as depicted in
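The gating role of the false trigger detector can be illustrated as follows. The label vocabulary and function name are assumptions for the sketch; the categories mirror the examples listed above.

```python
# Illustrative gate: the conversation is declared only when the OVAD
# reports speech activity AND the false trigger detector has not
# classified the sound as one of the false-trigger categories named
# above. The string labels are placeholders for classifier outputs.

FALSE_TRIGGERS = {
    "chewing", "sneeze", "cough", "yawn", "burp",          # nonverbal
    "loud breath", "loud sigh", "face scratch",
    "walking", "running",
    "self talk", "singing",                                # verbal but
    "far-field non-speech",                                # not a conversation
}

def declare_conversation(ovad_active, classified_sound):
    """Return True only for own-voice activity that was not flagged
    as a false trigger (classified_sound may be None if no false
    trigger candidate was detected)."""
    return ovad_active and classified_sound not in FALSE_TRIGGERS
```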
Referring now to
It should be noted that
In one aspect, when the false trigger detector prevents the transparency mode from being activated and the second flag indicates the detected sound is a verbal vocalization, the conversation detector configures the filter block 11 (see
In one aspect, a machine learning model, ML model, is configured to detect a false trigger sound as that of the wearer singing or humming to a song that is simultaneously playing through the headphones. This is considered a more challenging problem than detecting the wearer is coughing, sneezing, or throat clearing. In one instance, the ML model is configured with several inputs all of which are available or active simultaneously in the headphone, such as one or both of the output signals from the external microphone 12 and the external microphone 13, the bone conduction sensor signal, and the user content audio signal (being in particular a media playback signal) that is simultaneously driving the headphone speaker 14 to play back the song. When the ML model detects a time interval that exhibits sufficient correlation between these inputs, it marks that time interval of the output signals of the external microphones 12, 13, and the bone conduction sensor signal. This (singing or humming) time interval may be detected in the false trigger detector, for example in the first stage processing, in the second stage processing, or in the third stage processing as they are depicted in
A short latency is desirable for the false trigger detector, to prevent the wearer from noticing the ensuing transition to the conversation-focused transparency audio signal (in instances where the false trigger sound is not detected and hence the conversation is declared.) To enable the short latency, the ML model that detects the wearer singing or humming could be configured to have a longer time interval or historical context from which to make its decision, in other words, a longer buffer for storing the inputs to the ML model. The ML model could also be configured to look ahead in its input being the user content audio signal, to anticipate the melody of the song; this capability is available in instances where the user content audio signal (of the song) is from a music file that has been downloaded in its entirety and is stored locally in the companion device, or where a longer look ahead or download buffer is possible (in the headphone or in the companion device) during streaming of the song.
Returning momentarily to
Turning now to
Each speaker ID model can determine in real time, based on, for example, the activity vs. inactivity output of the TVAD—see
Back to
In another aspect, the conversation detector may declare the conversation based on an output of a VAD (e.g., the OVAD) indicating speech activity. The VAD receives as input the one or more microphone signals and the bone conduction signal and provides as output a sequence of instances of speech vs. non-speech, for a sequence of instances of a window, respectively, wherein the window is longer than any single syllable duration. The window may have a duration of at least any two consecutive syllables. The window may be longer than one hundred milliseconds and shorter than three hundred milliseconds. In another aspect, the conversation detector declares the conversation based on a heuristic-model based speech detector or an automatic speech recognition machine learning model, both configured to receive as input the one or more microphone signals and other sensor signals and provide an output that differentiates spoken voice syllables from other sounds.
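A minimal sketch of the windowed VAD output described above follows. Frame-level speech flags are assumed as input, and majority voting over the window is one possible aggregation rule (the disclosure does not prescribe one); the 10 ms frame size and 200 ms window are placeholder values within the stated 100-300 ms range.

```python
# Illustrative windowed VAD output: collapse short frame-level speech
# flags into one speech/non-speech decision per window, where the
# window (here 200 ms) is longer than any single syllable.

FRAME_MS = 10       # hypothetical frame period
WINDOW_MS = 200     # within the 100-300 ms range given in the text

def window_decisions(frame_flags):
    """Majority-vote the 10 ms frame flags inside each 200 ms window,
    emitting one boolean per complete window."""
    n = WINDOW_MS // FRAME_MS   # frames per window
    out = []
    for i in range(0, len(frame_flags) - n + 1, n):
        window = frame_flags[i:i + n]
        out.append(sum(window) > n // 2)
    return out
```

Aggregating over a multi-syllable window suppresses the flicker that a per-frame VAD would otherwise exhibit between syllables.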
Having described several ways in which the conversation detector can declare and end the conversation, another aspect of the system of
The recommended aperture A represents an angular spread or sector that extends in front of and outward from the headset (worn by the wearer 27), between a beginning direction and an ending direction. Spatially selective sound pickup (e.g., beamforming), speech enhancement, or speech isolation/speech separation is performed, based on or within the recommended aperture A, by appropriately configuring the filter block 11 to process at least two of the external microphone signals (that are produced by at least two external microphones in the headset.) As an example, a beamforming algorithm may be performed that suppresses sound pickup in directions that are outside of the recommended aperture. As another example, processing the microphone signals comprises using an ML model to perform speech enhancement or speech separation. In these cases, this results in the voice of the other talker 29 being isolated or enhanced in the conversation-focused transparency audio signal, which may also encompass suppressing incoming sound from outside of the recommended aperture A.
Next, the processor expands the recommended aperture in response to the wearer of the headset looking away from the first direction in a different, second direction. A yaw angle sensor in the headset may be used to sense such changes of direction. The situation is depicted in
Next, the wearer immediately turns to their right, towards the other talker 29, as seen in
In one aspect, the recommended aperture shrinks according to a decay parameter, which may be the rate at which the recommended aperture shrinks from a previous instance of the recommended aperture, e.g., a time constant. Thus, while expanding the aperture may occur immediately (in response to the wearer looking in a different direction), its shrinking occurs gradually while the wearer continues to look in the same direction.
The recommended aperture may be a sequence of instances over time where each instance is generated based on a yaw angle history, a previous instance of the recommended aperture, and the decay parameter. The yaw angle history comprises several instances over time of sensed yaw angle of the headset and may be stored in memory within the headset.
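The asymmetric expand/shrink behavior above (immediate expansion, gradual decay) can be sketched as a first-order recursion. The aperture limits and decay factor below are hypothetical tuning values, and the yaw-history input is reduced here to a single "yaw changed" flag for clarity.

```python
# Illustrative recommended-aperture update: expand immediately to the
# widest aperture when the wearer looks in a new direction; otherwise
# decay exponentially (per the decay parameter / time constant) back
# toward the narrow conversation-focused lobe. Values are placeholders.

MIN_APERTURE_DEG = 60.0    # hypothetical narrow conversation lobe
MAX_APERTURE_DEG = 180.0   # hypothetical fully expanded aperture
DECAY = 0.9                # per-update shrink factor

def next_aperture(prev_aperture_deg, yaw_changed):
    """One step of the aperture sequence: each instance depends on the
    previous instance, the (simplified) yaw history, and the decay."""
    if yaw_changed:
        return MAX_APERTURE_DEG
    # geometric decay of the excess aperture above the minimum
    return MIN_APERTURE_DEG + DECAY * (prev_aperture_deg - MIN_APERTURE_DEG)
```

With DECAY near 1 the aperture lingers wide after a glance away, matching the intent that shrinking is gradual while the wearer keeps looking in one direction.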
In another aspect, the recommended aperture can be expanded to one of only a handful of predetermined apertures. Thus, rather than allow the recommended aperture to be expanded to every value, only a limited number of predetermined, different apertures (and their associated beamforming algorithms) are permitted, which may be beneficial when constrained by computing or memory resources in the headset.
In another aspect, expanding the recommended aperture comprises using an ML model to analyze the yaw angle history of the headset to determine not only when to expand the recommended aperture but also by how much.
In yet another aspect, expanding the recommended aperture is in response to the processor detecting a head tilt by the wearer, which may suggest that the other talker 29 is sitting or standing to the side of the wearer (rather than directly in front as depicted in
Turning now to
Next, the conversation-focused transparency audio signal is adjusted based on the OVAD indicating activity and based on the TVAD indicating activity (a far-field target voice which may be that of the other talker 29 is active.) The transparency mode is sustained in this manner, in response to the own voice and the far-field target voice being detected. Here, a boundary is defined around a current location of the wearer 27 as illustrated in
In one aspect, the operations described above in connection with
In another aspect, the transparency mode of operation may be activated during media playback (through the primary headphone 10a—see
In another aspect, there is a second processor in the secondary headphone 10b, that comprises a second filter block similar to the filter block 11, which is able to process one or more of microphone signals in the secondary headphone 10b, for producing a second conversation-focused transparency audio signal (the latter would be routed to a speaker of the secondary headphone 10b.) During playback of a multi-channel media playback signal, one or the other of these two processors is configured to, in response to the conversation detector declaring the conversation, downmix the multi-channel media playback signal into a mono media playback signal. The processor then renders the mono media playback signal by spatializing it out of the wearer's head (during the media playback.) In other words, when the conversation detector declares a conversation, the multi-channel media playback becomes spatialized into a single virtual sound source that is outside of the wearer's head, instead of simply pausing the playback.
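The downmix step described above can be sketched simply. The equal-weight averaging below is one common downmix choice; the disclosure leaves the coefficients open, and the function name is illustrative.

```python
# Illustrative downmix of a multi-channel media playback frame into a
# mono frame, as triggered when the conversation detector declares a
# conversation. Equal channel weights are an assumption of the sketch.

def downmix_to_mono(channels):
    """channels: list of per-channel sample lists of equal length.
    Returns one mono sample list, averaging across channels."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]
```

The resulting mono signal would then be spatialized as a single out-of-head virtual source, leaving the reproduced conversation acoustically separable from the media.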
In yet another media playback aspect, the processor is configured to pause or duck the media playback and then resume the media playback, in response to the conversation being declared and then ended, respectively.
It is well understood that the use of personally identifiable information should follow privacy policies and practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. In particular, personally identifiable information data should be managed and handled so as to minimize risks of unintentional or unauthorized access or use, and the nature of authorized use should be clearly indicated to users.
While certain aspects have been described and shown in the accompanying drawings, it is to be understood those are merely illustrative of and not restrictive on the broad invention, and that the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art. The description is thus to be regarded as illustrative instead of limiting.
To aid the Patent Office and any readers of any patent issued on this application in interpreting the claims appended hereto, applicant wishes to note that it does not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112 (f) unless the words “means for” or “step for” are explicitly used in the particular claim.
The following statements are based on the disclosure here.
16. A digital audio processor for use with a headphone, the digital audio processor comprising:
17. The processor of statement 16 wherein the filter block is configured so that the transparency audio signal is a conversation-focused transparency audio signal.
18. The processor of statement 16 wherein the filter block is configured to produce an anti-noise signal that is being routed to the speaker while the transparency audio signal is inactive, the anti-noise signal being deactivated, or its selected frequency dependent gains being reduced, or its scalar gain being reduced whenever the transparency audio signal is activated.
19. The processor of statement 16 wherein whenever the transparency audio signal is deactivated, the selected frequency-dependent gains of the anti-noise signal are increased, the anti-noise signal is activated, or the scalar gain of the anti-noise signal is increased.
20. The processor of statement 16 wherein deactivating the conversation-focused transparency audio signal activates a normal transparency audio signal that is routed to drive the speaker.
21. The processor of any one of statements 16-20 wherein the second portion, in its entirety, is later in time than the first portion.
22. The processor of any one of statements 16-21 wherein the third portion, in its entirety, is earlier in time than the first portion.
23. A digital audio processor for use with a first headphone, the digital audio processor comprising:
24. The processor of statement 23 wherein the first false trigger sound represents chewing, sneeze, cough, yawn, or burp by a wearer of the headphone.
25. The processor of statement 23 wherein the first false trigger sound represents loud breath, loud sigh, face scratch, walking, or running by a wearer of the headphone.
26. The processor of statement 23 wherein the first false trigger sound represents a wearer of the headphone singing or humming to a song to which they are listening and is being played back through a speaker of the first headphone, and the false trigger comprises a machine learning model (an ML model) configured to detect the first false trigger sound, as the wearer is singing or humming to the song, based on the following inputs to the ML model being simultaneously active in the first headphone: i) the one or more of the plurality of microphone signals, ii) the bone conduction sensor signal of the first headphone, and iii) a user content audio signal that is driving a speaker of the first headphone to play back the song.
27. The processor of statement 23 wherein the first false trigger sound represents sound from a source that is in a far-field of a plurality of external microphones of the first headphone.
28. The processor of any one of statements 23-27 configured to receive a message about a second false trigger sound from another processor in a second headphone of a headset, and i) set an agreement flag if the first false trigger sound is consistent with the second false trigger sound, or ii) set a disagreement flag if the first false trigger sound is inconsistent with the second false trigger sound.
29. The processor of any one of statements 23-27 wherein the false trigger detector is configured to:
30. The processor of statement 29 wherein when the false trigger detector prevents the transparency mode from being activated and the second flag indicates the detected sound is a verbal vocalization, the filter block comprises an own voice digital filter that produces an own voice or sidetone audio signal which is routed to the speaker.
31. The processor of any one of statements 23-30 wherein the transparency audio signal is a conversation-focused transparency audio signal.
This nonprovisional patent application claims the benefit of the earlier filing dates of U.S. provisional application Nos. 63/499,174 and 63/499,180 both filed 28 Apr. 2023.
Number | Date | Country
---|---|---
63/499,174 | Apr 2023 | US
63/499,180 | Apr 2023 | US